### **Overview & Background**

To train our model for recommending movies, we will be implementing a few recommendation systems (content based, popularity based and collaborative filtering) and eventually building an ensemble of these models to come up with our final hybrid recommendation system.

We will be using the full MovieLens dataset, which consists of 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users.

We have decided to implement a variety of recommendation systems so that our hybrid system can draw on each of them and become better at recommending movies.

In [1]:
# importing all necessary libraries
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate


import warnings; warnings.simplefilter('ignore')

In [2]:
#Connection to Fauna DB
import os
from dotenv import load_dotenv

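# Read the Fauna secret from the environment instead of hard-coding it in the notebook
# (assumes the .env file defines a FAUNA_KEY entry)
load_dotenv()
FAUNA_KEY = os.getenv("FAUNA_KEY")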

from faunadb import query as q
from faunadb.client import FaunaClient
from faunadb.errors import NotFound

# Fauna Client Config
client = FaunaClient(secret=FAUNA_KEY)

### **Reading & Loading DataSets**

For this project, we will be making use of two datasets, namely: <br>

(1) **The Full Dataset**: Consists of 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. <br>
(2) **Small Dataset**: Comprises 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users.

### **First Model: Popularity Based Filtering**
**Concept**: *This model offers generalized recommendations to every user based on movie popularity and genre. The key idea behind this model is that movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience.*

**Implementation**: *We need to sort the movies by rating and popularity and then display the top movies of our list. As an added step, we can pass in a genre argument to get the top movies of a specific genre.*

In [3]:
# This is a pandas built in function to read csv files and output the first 5 rows using "df.head()"
df = pd.read_csv("~/Desktop/archive/movies_metadata.csv")
df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [4]:
df['genres'] = df['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

We will be using the Weighted Rating formula to construct a chart. <br>
**Weighted Rating** = (v/(v+m)*R) + (m/(v+m)*C)

v = number of votes for the movie <br>
m = minimum votes required to be listed in the chart <br>
R = average rating of the movie <br>
C = mean vote across the whole dataset <br>

For m, we will use the 85th percentile of vote counts as our cutoff. In other words, for a movie to feature in the charts, it must have at least as many votes as 85% of the movies in the dataset.
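
As a rough illustration of how the formula behaves, the sketch below applies it to two made-up movies. The values of `C_demo`, `m_demo`, and the movie statistics are hypothetical, chosen only to show how a high average backed by few votes is pulled toward the overall mean.

```python
def weighted_rating_example(v, R, m, C):
    """Blend a movie's own average R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values for illustration only (not taken from the dataset)
C_demo = 5.0   # mean vote across all movies
m_demo = 100   # minimum votes required to be listed

# A 9.0 average from only 20 votes versus an 8.0 average from 5000 votes
few_votes = weighted_rating_example(v=20, R=9.0, m=m_demo, C=C_demo)
many_votes = weighted_rating_example(v=5000, R=8.0, m=m_demo, C=C_demo)

print(round(few_votes, 3))   # 5.667
print(round(many_votes, 3))  # 7.941
```

The movie with the higher raw average ends up ranked lower, because most of its score comes from the global mean rather than from its own small number of votes.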

In [5]:
# We first check how many movies have missing values for vote_count & vote_average
# the output is 6 (6 movies)

df['vote_count'].isnull().sum()
df['vote_average'].isnull().sum()

vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
votes_average = df[df['vote_average'].notnull()]['vote_average'].astype('int')

C = votes_average.mean()
C

5.244896612406511

In [6]:
# we will set m as the 85th percentile
m = vote_counts.quantile(0.85)
m

82.0

In [7]:
# We create a new column called year and split the string to only record the year of release 
df['year'] = pd.to_datetime(df['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan)

In [8]:
# We create a new dataframe called qualified, containing the movies whose vote count is at or above the 85th percentile
# Calling 'qualified.shape' outputs (6832, 6) which means there are 6832 movies in this dataframe
qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & 
               (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(6832, 6)

In [9]:
# We use this function to calculate the weighted average of each movie
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [10]:
# We create a new column called 'weighted_rat' which holds the weighted rating
# We sort by it in descending order; the top 250 rows become our dataset of the most 'popular movies'
qualified['weighted_rat'] = qualified.apply(weighted_rating, axis=1)
qualified = qualified.sort_values('weighted_rat', ascending=False).head(250)

In [11]:
# This code outputs the top 15 movies of all time, based on the weighted rating approach
qualified.head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,weighted_rat
10309,Dilwale Dulhania Le Jayenge,1995,661,9,34.457,"[Comedy, Drama, Romance]",8.585574
15480,Inception,2010,14075,8,29.1081,"[Action, Thriller, Science Fiction, Mystery, A...",7.984042
12481,The Dark Knight,2008,12269,8,123.167,"[Drama, Action, Crime, Thriller]",7.981708
22879,Interstellar,2014,11187,8,32.2135,"[Adventure, Drama, Science Fiction]",7.979952
2843,Fight Club,1999,9678,8,63.8696,[Drama],7.976853
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.0707,"[Adventure, Fantasy, Action]",7.974825
292,Pulp Fiction,1994,8670,8,140.95,"[Thriller, Crime]",7.974187
314,The Shawshank Redemption,1994,8358,8,51.6454,"[Drama, Crime]",7.973232
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.3244,"[Adventure, Fantasy, Action]",7.972807
351,Forrest Gump,1994,8147,8,48.3072,"[Comedy, Drama, Romance]",7.972546


In [12]:
s = df.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = df.drop('genres', axis=1).join(s)

### **Second Model: Content Based Filtering**
**Concept**: *This model personalizes recommendations by computing similarity between movies based on certain metrics and suggests movies that are most similar to a particular movie that a user liked*

**Implementation**: *This Recommender system will be based on:* <br>
                    *(1) Movie overviews and Taglines* <br>
                    *(2) Movie Cast, Crew, Keywords and Genre*
                    
*Due to limited computational power, we will be using a trimmed version of the movie metadata.* <br>
*This version contains 9,099 movies, roughly one fifth of the original dataset of 45,000 movies.*

In [13]:
links_small = pd.read_csv("~/Desktop/archive/links_small.csv")
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')
df = df.drop([19730, 29503, 35587])
df['id'] = df['id'].astype('int')
df2 = df[df['id'].isin(links_small)]
df2.shape

(9099, 25)

*(1) We will first build a recommender system using movie descriptions and taglines. Since we have no quantitative metric for this model, we will judge its performance qualitatively.*

In [14]:
# Replace null values with empty strings so that both columns are plain text
df2['tagline'] = df2['tagline'].fillna('')
# Join overview and tagline with a space so the last and first words do not fuse together
df2['description'] = df2['overview'].fillna('') + ' ' + df2['tagline']

In [15]:
# TfidfVectorizer is a scikit-learn class that transforms text into a numeric TF-IDF representation
# which can be used to fit machine learning algorithms
# we will use it to analyze our description column
# min_df=1 keeps every term (recent scikit-learn releases reject an integer min_df of 0)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(df2['description'])

In [16]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
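
`linear_kernel` computes plain dot products. Because `TfidfVectorizer` L2-normalizes each row by default, those dot products equal the cosine similarities while skipping the normalization step. A minimal sketch on a toy corpus (the three sentences are made up for illustration) that checks this equivalence:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

# Hypothetical mini corpus: documents 0 and 2 share vocabulary, document 1 does not
toy_docs = [
    "a cowboy doll feels threatened by a new spaceman toy",
    "two children discover a magical board game",
    "a spaceman toy and a cowboy doll become friends",
]

# Rows of the TF-IDF matrix have unit L2 norm by default
toy_tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy_docs)

dot_products = linear_kernel(toy_tfidf, toy_tfidf)
cosines = cosine_similarity(toy_tfidf, toy_tfidf)

print(np.allclose(dot_products, cosines))  # True
```

This equivalence holds only for normalized vectors. For raw counts, such as the output of `CountVectorizer`, `cosine_similarity` is the appropriate call.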

*Next, we write a function that returns the 30 movies most similar to a given title, ranked by cosine similarity score.*


In [17]:
# Reset to a positional index so that it lines up with the rows of cosine_sim,
# then map each movie title to its row position
df2 = df2.reset_index()
indices = pd.Series(df2.index, index = df2['title'])

In [18]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return df2['title'].iloc[movie_indices]

*To improve on this, we also need to consider other factors such as cast, crew, director and genre, which influence the rating and popularity of a movie.* <br>
*(2) In the next part, we build a more sophisticated model that takes genre, keywords, cast and crew into consideration.*

In [19]:
credits = pd.read_csv("~/Desktop/archive/credits.csv")
keywords = pd.read_csv("~/Desktop/archive/keywords.csv")
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')

# we would now merge our current dataset with the crew and keyword datasets
df = df.merge(credits, on='id')
df = df.merge(keywords, on='id')
df['id'] = df['id'].astype('int')

df2 = df[df['id'].isin(links_small)]
df2.shape

(9219, 28)

*We now have our cast, crew, genres and keywords, all in one dataframe.* <br>
*Crew: from the crew, we only pick the director as our feature, as the other members are not significant contributors.* <br>
*Cast: We will be selecting the top 3 actors that appear in the credits list.*

In [20]:
df2['cast'] = df2['cast'].apply(literal_eval)
df2['crew'] = df2['crew'].apply(literal_eval)
df2['cast_size'] = df2['cast'].apply(lambda x: len(x))
df2['crew_size'] = df2['crew'].apply(lambda x: len(x))

In [21]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [22]:
df2['director'] = df2['crew'].apply(get_director)

In [23]:
df2['cast'] = df2['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
# Keep at most the first three cast members
df2['cast'] = df2['cast'].apply(lambda x: x[:3])
df2['keywords'] = df2['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

*We will be creating a metadata dump for every movie which consists of genres, directors, main actors & keywords. We will then use **Count Vectorizer** to create our count matrix as we did with the Description Recommender. We will calculate the cosine similarity and return movies that are most similar.*

*Before doing so, we will be doing some **data cleaning**.*

In [24]:
# Convert names to lower case and remove the space between first and last name,
# so that two people who share only a first name are not treated as the same token
df2['cast'] = df2['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
df2['director'] = df2['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))

In [25]:
# this code allows us to mention director 3 times to give it more weight relative to the entire cast
df2['director'] = df2['director'].apply(lambda x: [x,x, x])
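
To see why repeating the director token gives it more weight, the sketch below compares two made-up metadata strings with and without the repetition. The helper `pairwise_similarity` and the example strings are illustrative assumptions, not part of the notebook's pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pairwise_similarity(doc_a, doc_b):
    """Cosine similarity between the token-count vectors of two strings."""
    counts = CountVectorizer().fit_transform([doc_a, doc_b])
    return cosine_similarity(counts[0], counts[1])[0, 0]

# Hypothetical metadata strings: same director and genre, different lead actor
plain_a = "christophernolan christianbale action"
plain_b = "christophernolan leonardodicaprio action"

# The same strings with the director token repeated three times
weighted_a = "christophernolan christophernolan christophernolan christianbale action"
weighted_b = "christophernolan christophernolan christophernolan leonardodicaprio action"

unweighted = pairwise_similarity(plain_a, plain_b)
weighted = pairwise_similarity(weighted_a, weighted_b)

print(round(unweighted, 3), round(weighted, 3))  # 0.667 0.909
```

With the director counted three times, the shared director dominates both vectors, so two movies by the same director score as more similar even when their casts differ.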

**For our keywords**, we first calculate the frequency count of every keyword that appears in the dataset

In [26]:
s = df.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'

In [27]:
s = s.value_counts()
s[:5]

[]                                             14889
[{'id': 187056, 'name': 'woman director'}]      1329
[{'id': 10183, 'name': 'independent film'}]      509
[{'id': 9716, 'name': 'stand-up comedy'}]        235
[{'id': 4344, 'name': 'musical'}]                170
Name: keyword, dtype: int64

In [28]:
s = s[s > 1]

We will now convert every word to its stem so that words such as Cats and Cat are considered the same. This is an essential technique in Natural Language Processing for better analysis of text.

In [29]:
stemmer = SnowballStemmer('english')
stemmer.stem('cats')

'cat'

In [30]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words

In [31]:
df2['keywords'] = df2['keywords'].apply(filter_keywords)
df2['keywords'] = df2['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
# this last line of code converts all keywords into lower case and removes any spaces
df2['keywords'] = df2['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [32]:
# the following code joins all columns into a meta-column (from which we can derive qualitative insights)
df2['text_analysis'] = df2['keywords'] + df2['cast'] + df2['director'] + df2['genres']
df2['text_analysis'] = df2['text_analysis'].apply(lambda x: ' '.join(x))

In [33]:
count = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')
count_matrix = count.fit_transform(df2['text_analysis'])

In [34]:
# this is different from the earlier output as now our cosine similarity scores have changed
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [35]:
df2 = df2.reset_index()
indices = pd.Series(df2.index, index=df2['title'])

In [36]:
# Time to do an experiment to test the results
get_recommendations('The Dark Knight').head(10)

8031      The Dark Knight Rises
6218              Batman Begins
6623               The Prestige
2085                  Following
4145                   Insomnia
3381                    Memento
8613               Interstellar
7648                  Inception
5943                   Thursday
8927    Kidnapping Mr. Heineken
Name: title, dtype: object

*This time the system recognizes other Christopher Nolan movies (because of the extra weight given to the director) and places them at the top of the recommendations.* <br>

*However, it does not consider popularity and ratings when recommending movies. To take these into account, we take the top 25 movies by similarity score and calculate their weighted ratings, just as we did in our first model.*

In [37]:
def improved_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = df2.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    # Note: weighted_rating reads the module-level m and C, not the local values computed above
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified

In [38]:
# Time to do the same experiment with the same movie
improved_recommendations('The Dark Knight')

Unnamed: 0,title,vote_count,vote_average,year,wr
7648,Inception,14075,8,2010,7.984042
8613,Interstellar,11187,8,2014,7.979952
6623,The Prestige,4510,8,2006,7.950802
3381,Memento,4168,8,2000,7.946843
8031,The Dark Knight Rises,9263,7,2012,6.984599
6218,Batman Begins,7511,7,2005,6.981046
1532,The French Connection,435,7,1971,6.721628
2986,Gone in Sixty Seconds,1511,6,2000,5.961131
4145,Insomnia,1181,6,2002,5.950975
149,Hackers,406,6,1995,5.873118


*This concludes our content based recommender, which uses both qualitative and quantitative signals to suggest movies that are similar to a given movie not only in story or content but also in crew, cast and other significant factors.*

### **Third Model: Collaborative Filtering**
**Concept**: *In this section we use a technique known as 'Collaborative Filtering' to make recommendations to movie watchers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used or experienced but I have not.*

**Implementation**: *Collaborative Filtering will not be implemented from scratch but rather we will be using the surprise library that uses extremely powerful algorithms like Singular Value Decomposition (SVD) to minimise RMSE (Root Mean Square Error) and give quality recommendations*
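
The Surprise `SVD` algorithm predicts a rating as a global mean plus a user bias, an item bias, and the dot product of latent user and item factor vectors, with the parameters fitted by stochastic gradient descent. The sketch below is a simplified NumPy version of that idea on a made-up ratings list. The function names, hyperparameters, and data are illustrative assumptions, not the library's implementation.

```python
import numpy as np

def fit_mf(ratings, n_users, n_items, n_factors=2, n_epochs=200,
           lr=0.01, reg=0.02, seed=0):
    """Fit biases and latent factors by stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])        # global mean rating
    bu, bi = np.zeros(n_users), np.zeros(n_items)   # user and item biases
    P = rng.normal(0, 0.1, (n_users, n_factors))    # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))    # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu_old = P[u].copy()  # use the pre-update user factors for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu_old - reg * Q[i])
    return mu, bu, bi, P, Q

def predict_mf(model, u, i):
    """Estimated rating of item i by user u."""
    mu, bu, bi, P, Q = model
    return mu + bu[u] + bi[i] + P[u] @ Q[i]

def rmse_mf(model, ratings):
    """Root mean square error of the model on a list of (user, item, rating)."""
    errors = [r - predict_mf(model, u, i) for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))

# Hypothetical (user, item, rating) triples with 0-based ids
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
               (2, 1, 2.0), (2, 2, 1.5), (3, 0, 4.5), (3, 2, 2.0)]

untrained = fit_mf(toy_ratings, n_users=4, n_items=3, n_epochs=0)
trained = fit_mf(toy_ratings, n_users=4, n_items=3)

# Training error only; a real evaluation needs held-out data, as in cross_validate
print(rmse_mf(untrained, toy_ratings), rmse_mf(trained, toy_ratings))
```

Note that the model never looks at movie content: it learns only from which user gave which rating to which ID.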

In [39]:
reader = Reader()

In [40]:
ratings = pd.read_csv("~/Desktop/archive/ratings_small.csv")
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [41]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
algo = SVD()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8947  0.8944  0.8952  0.9017  0.8894  0.8951  0.0039  
MAE (testset)     0.6905  0.6921  0.6882  0.6917  0.6852  0.6895  0.0026  
Fit time          4.46    4.23    4.17    4.18    4.18    4.24    0.11    
Test time         0.13    0.11    0.11    0.11    0.11    0.11    0.01    


{'test_rmse': array([0.89469036, 0.89439665, 0.89518142, 0.90170538, 0.88939866]),
 'test_mae': array([0.69051127, 0.69214415, 0.68818731, 0.6916613 , 0.68523179]),
 'fit_time': (4.45569109916687,
  4.232640027999878,
  4.169701814651489,
  4.17745304107666,
  4.175548076629639),
 'test_time': (0.1256239414215088,
  0.11264491081237793,
  0.10949492454528809,
  0.11156010627746582,
  0.11044478416442871)}

*We get a mean Root Mean Square Error of 0.8951, which is more than good enough for our case. Let us now train on our dataset and arrive at predictions.*

In [42]:
trainset = data.build_full_trainset()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f93d9521160>

In [43]:
# Here we pick the first user and see what rating it has given!
ratings[ratings['userId'] == 1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


In [44]:
algo.predict(1, 302, 3)

Prediction(uid=1, iid=302, r_ui=3, est=2.752026469648851, details={'was_impossible': False})

*For the movie with ID 302, we get an estimated rating of 2.752. One striking feature of this recommender system is that it doesn't care what the movie is (or what it contains). It works purely on the basis of an assigned movie ID and tries to predict ratings based on how other users have rated the movie.*

### **Final Model: Hybrid Version**

*In this section, I will try to build a simple hybrid recommender that brings together techniques we have implemented in the content based and collaborative filter based engines. This is how it will work:* <br>

Input:         <br>
Output: 

Questions: <br>
**If the user exists / has been registered** <br>
(1) Did you like the previous recommendation? <br>
(2) What mood best describes your current state of mind? <br>
**If the user does not exist / has not been registered** <br>
(1) What mood best describes your current state of mind? <br>
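
The core of combining the content based and collaborative engines can be sketched as a two-stage ranking: shortlist the titles most similar to a seed movie, then order that shortlist by the rating predicted for the user. The function `hybrid_rank` and the toy dictionaries below are hypothetical stand-ins for the similarity matrix and the trained model.

```python
def hybrid_rank(seed, similarity, predicted_rating, k=3):
    """Shortlist the k titles most similar to seed, then sort them by predicted rating."""
    neighbours = similarity[seed]
    # Stage 1 (content based): keep the k most similar titles
    shortlist = sorted(neighbours, key=neighbours.get, reverse=True)[:k]
    # Stage 2 (collaborative): order the shortlist by the user's predicted rating
    return sorted(shortlist, key=predicted_rating.get, reverse=True)

# Hypothetical similarity scores to the seed movie and predicted ratings for one user
toy_similarity = {"Seed": {"B": 0.9, "C": 0.8, "D": 0.7, "E": 0.1}}
toy_predicted = {"B": 2.5, "C": 4.5, "D": 3.0, "E": 5.0}

ranking = hybrid_rank("Seed", toy_similarity, toy_predicted)
print(ranking)  # ['C', 'D', 'B']
```

Title "E" has the highest predicted rating but is dropped in the first stage because it is not similar to the seed, which is the trade-off this design makes.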

In [45]:
def convert_int(x):
    # Return x as an int, or NaN when it is missing or not numeric
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan

In [46]:
id_map = pd.read_csv("~/Desktop/archive/links_small.csv")[['movieId', 'tmdbId']]
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId', 'id']
id_map = id_map.merge(df2[['title', 'id', 'genres']], on='id').set_index('title')
id_map.head()

Unnamed: 0_level_0,movieId,id,genres
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Toy Story,1,862.0,"[Animation, Comedy, Family]"
Jumanji,2,8844.0,"[Adventure, Fantasy, Family]"
Grumpier Old Men,3,15602.0,"[Romance, Comedy]"
Waiting to Exhale,4,31357.0,"[Comedy, Drama, Romance]"
Father of the Bride Part II,5,11862.0,[Comedy]


In [47]:
indices_map = id_map.set_index('id')
indices_map.head()

Unnamed: 0_level_0,movieId,genres
id,Unnamed: 1_level_1,Unnamed: 2_level_1
862.0,1,"[Animation, Comedy, Family]"
8844.0,2,"[Adventure, Fantasy, Family]"
15602.0,3,"[Romance, Comedy]"
31357.0,4,"[Comedy, Drama, Romance]"
11862.0,5,[Comedy]


In [48]:
moods_dict_1 = {"Lonely": "Family", "Depressed": "Comedy", "Cheerful": "Animation",
                "Excited": "Science Fiction", "Stressed": "Romance"}
moods_dict_2 = {"Lonely": "Thriller", "Depressed": "Animation", "Cheerful": "Thriller",
                "Excited": "Romance", "Stressed": "Family"}

In [49]:
list_of_romance = ["Dilwale Dulhania Le Jayenge", "Paperman", "Sing Street", "The Handmaiden",
                   "The Way He Looks", "In a Heartbeat", "Titanic", "Silver Linings Playbook", "La La Land",
                   "Maleficent", "Her", "The Great Gatsby", "The Fault in Our Stars", "Eternal Sunshine of the Spotless Mind"]
                   
list_of_family = ["Spirited Away", "Paperman", "Piper", "Wolf Children", "Feast", "Song of the Sea", "Harry Potter and the Philosopher's Stone",
                 "Up", "Inside Out", "Despicable Me", "WALL·E", "Finding Nemo", "Big Hero 6", "Monsters, Inc.",
                  "Harry Potter and the Deathly Hallows: Part 2"]

list_of_comedy = ["Dilwale Dulhania Le Jayenge", "The Intouchables", "The Grand Budapest Hotel",
                 "The Apartment", "Feast", "Deadpool", "Up", "The Wolf of Wall Street", 
                 "Inside Out", "The Hangover", "Big Hero 6", "Monsters, Inc.", "Kingsman: The Secret Service",
                 "Zootopia"]

list_of_animation = ["Spirited Away", "Howl's Moving Castle", "Princess Mononoke",
                    "Paperman", "Piper", "Wolf Children", "Song of the Sea", "Feast", 
                    "Presto", "The Tale of the Princess Kaguya", "Up", "Inside Out", "Despicable Me", "WALL·E"]

list_of_thriller = ["Inception", "The Dark Knight", "Se7en", "The Imitation Game", "The Prestige",
                    "Memento", "The Usual Suspects","Room", "Psycho", "Oldboy", "The Handmaiden", "The Invisible Guest",
                   "Mad Max: Fury Road", "The Dark Knight Rises", "Titanic"]

list_of_science_fiction = ["Inception", "Interstellar", "Avatar", "The Avengers", "Guardians of the Galaxy", 
                           "Mad Max: Fury Road", "The Matrix", "Iron Man", "Star Wars: The Force Awakens", 
                           "Captain America: Civil War", "The Martian", "Avengers: Age of Ultron", "The Hunger Games: Catching Fire",
                          "Logan", "X-Men: Days of Future Past"]

In [50]:
# Each mood maps to the seed titles of its two associated genres
MOOD_SEEDS = {
    "Lonely": list_of_family + list_of_thriller,
    "Depressed": list_of_comedy + list_of_animation,
    "Cheerful": list_of_animation + list_of_thriller,
    "Excited": list_of_science_fiction + list_of_romance,
    "Stressed": list_of_romance + list_of_family,
}

def recommend_movie(mood):
    # Unknown moods yield no recommendation
    if mood not in MOOD_SEEDS:
        return None
    # Pick a seed title at random from either genre list for this mood
    # (the earlier `a or b` expression always evaluated to the first list's title)
    title = random.choice(MOOD_SEEDS[mood])
    try:
        return process(title)
    except Exception:
        # Fall back to the seed title itself when it cannot be processed,
        # for example when the title is missing from the lookup tables
        return title
        
def recommend_movie_backend(movie):
    title = movie
    return process(title)


def process(title):        
    userId = random.randrange(0, 100)
    idx = indices[title]
    tmdbId = id_map.loc[title]['id']
    movie_id = id_map.loc[title]['movieId']
        
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
        
    movies = df2.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']]
    movies.drop(movies[movies.year < '1994'].index, inplace=True)
    movies['est'] = movies['id'].apply(lambda x: algo.predict(userId, indices_map.loc[x]['movieId']).est)
    movies = movies.sort_values('est', ascending=False)
    #print(movies.head(1)['title'])
    return movies.head(1)['title']

In [51]:
def random_recommendation():
    movies = qualified['title']
    # randrange excludes the upper bound, so every valid row position can be drawn
    selector = random.randrange(len(movies))
    return movies.iloc[selector]

In [52]:
def genre_recommendation(genre):
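    # randrange excludes the upper bound, so the row position is always valid
    selector = random.randrange(len(qualified))
    # Note: this genre-filtered frame is not used below; the title is drawn from the overall chart
    df = gen_md[gen_md['genre'] == genre]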
#     vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
#     vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
#     C = vote_averages.mean()
#     m = vote_counts.quantile(percentile)

#     qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
#     qualified['vote_count'] = qualified['vote_count'].astype('int')
#     qualified['vote_average'] = qualified['vote_average'].astype('int')
    
#     qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
#     qualified = qualified.sort_values('wr', ascending=False).head(250)
    
    return qualified['title'].iloc[selector]

In [53]:
genre_recommendation('Thriller')

'Back to the Future Part II'

In [54]:
print(qualified['genres'])

10309                             [Comedy, Drama, Romance]
15480    [Action, Thriller, Science Fiction, Mystery, A...
12481                     [Drama, Action, Crime, Thriller]
22879                  [Adventure, Drama, Science Fiction]
2843                                               [Drama]
                               ...                        
581      [Animation, Family, Comedy, Adventure, Fantasy...
15798                                              [Drama]
24445                             [Crime, Drama, Thriller]
19590                                    [Drama, Thriller]
41457                       [Adventure, Animation, Family]
Name: genres, Length: 250, dtype: object


In [55]:
# (Q1) Please select a mood that best describes your current state from the list below?
# Excited, Lonely, Depressed, Cheerful, Stressed
recommend_movie('Depressed')

'Kingsman: The Secret Service'

In [56]:
random_recommendation()

'The Godfather'