# Movies Recommender System

![](http://labs.criteo.com/wp-content/uploads/2017/08/CustomersWhoBought3.jpg)

I will implement a few recommendation algorithms (popularity based and content based) and use them to build a final recommendation system.

We have two MovieLens datasets:
* **The Full Dataset:** Consists of 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.
* **The Small Dataset:** Comprises 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users.

First, I will build a Simple Recommender using movies from the *Full Dataset*.
Then I will implement Content Based Recommenders, which use the *Small Dataset* because the computing power I have is very limited.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
import warnings; warnings.simplefilter('ignore')

## Simple Recommender

The Simple Recommender offers generalized recommendations to every user based on movie popularity and (sometimes) genre. The basic idea behind this recommender is that movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience. This model does not give personalized recommendations based on the user. 

In [None]:
df = pd.read_csv('movies_metadata.csv')
df.head()

In [None]:
df['genres'] = df['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

I will use IMDb's *weighted rating* formula to construct my chart. Mathematically, it is represented as follows:

Weighted Rating (WR) = $\left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$


where,
* *v* is the number of votes for the movie
* *m* is the minimum votes required to be listed in the chart
* *R* is the average rating of the movie
* *C* is the mean vote across the whole report
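
To get a feel for the formula, here is a minimal sketch that evaluates it for two hypothetical movies with the same average rating but very different vote counts. The values of `m` and `C` are the ones reported further down in this notebook (with `C` rounded); the vote counts and ratings are made up for illustration.

```python
def weighted_rating(v, R, m, C):
    """IMDb-style weighted rating: pulls R toward the global mean C when v is small."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m, C = 434, 5.24  # threshold and mean vote reported in this notebook (C rounded)

# Two hypothetical movies, both rated 8.0 on average.
few_votes = weighted_rating(v=500, R=8.0, m=m, C=C)
many_votes = weighted_rating(v=20000, R=8.0, m=m, C=C)

print(round(few_votes, 2), round(many_votes, 2))
```

The movie with few votes ends up well below 8.0 because its score is pulled toward `C`, while the movie with many votes stays close to its own average.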

The next step is to determine an appropriate value for *m*, the minimum votes required to be listed in the chart. We will use the 95th percentile as the cutoff: for a movie to feature in the chart, it must have more votes than at least 95% of the movies in the list.
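
As a small illustration of the percentile cutoff, the sketch below applies `quantile(0.95)` to a toy series of vote counts. The numbers are invented; only the `quantile` call mirrors what the notebook does on the real data.

```python
import pandas as pd

# Hypothetical vote counts for 20 movies: most have few votes, a handful have many.
toy_votes = pd.Series([1, 2, 2, 3, 3, 4, 5, 5, 6, 8,
                       10, 12, 15, 20, 30, 50, 80, 150, 400, 2000])

m_toy = toy_votes.quantile(0.95)            # vote count at the 95th percentile
qualified = toy_votes[toy_votes >= m_toy]   # movies that clear the threshold

print(m_toy, len(qualified))
```

With such a skewed distribution, only the most voted movies clear the threshold, which is exactly the behaviour we want for a "top movies" chart.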

In [None]:
vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C

In [None]:
m = vote_counts.quantile(0.95)
m

In [None]:
# Note: `x != np.nan` is always True, so use pd.notnull to detect unparseable dates (NaT).
df['year'] = pd.to_datetime(df['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)

In [None]:
qualified_movies = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified_movies['vote_count'] = qualified_movies['vote_count'].astype('int')
qualified_movies['vote_average'] = qualified_movies['vote_average'].astype('int')
qualified_movies.shape

The minimum number of votes required to be listed in the chart is 434.0.

The mean vote across the whole report is 5.244896612406511.

In [None]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [None]:
qualified_movies['weighted_rating'] = qualified_movies.apply(weighted_rating, axis=1)

In [None]:
qualified_movies = qualified_movies.sort_values('weighted_rating', ascending=False)

# Top Movies

In [None]:
qualified_movies.head(15)

# Top Movies Based on Genres

In [None]:
new_df = df.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
new_df.name = 'genre'
gen_df = df.drop('genres', axis=1).join(new_df)

In [None]:
def genrebasedrec(genre, percentile=0.95):
    df = gen_df[gen_df['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['weighted_rating'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('weighted_rating', ascending=False).head(250)
    
    return qualified

# Top Romance Movies

In [None]:
genrebasedrec('Romance').head(15)

# Top Action Movies

In [None]:
genrebasedrec('Action').head(15)

# Content Based Recommender

Why do we need a content based recommender? Why was the simple recommender system not enough?

The Simple Recommender System gives the same recommendations to everyone, regardless of personal taste. A person who loves romantic movies (and hates action) would probably not like most of the movies in our Top 15 chart. Even the charts by genre would not give that person the best recommendations, since they are still the same for every user.

To personalise our recommendations, we will use **Content Based Filtering** and then try to improve it further to get better recommendations.

We will use the small dataset provided.

In [None]:
new_df = pd.read_csv('links_small.csv')
new_df = new_df[new_df['tmdbId'].notnull()]['tmdbId'].astype('int')

In [None]:
# Drop the rows with badly formatted data (check the notebook for how and why I got these indices).
df = df.drop([19730, 29503, 35587])

In [None]:
# With the badly formatted rows removed, the 'id' column can be cast to int.
df['id'] = df['id'].astype('int')

In [None]:
small_data = df[df['id'].isin(new_df)]
small_data.shape

### Movie Description Based Recommender

Let us first try to build a recommender using movie descriptions and taglines.

In [None]:
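small_data['tagline'] = small_data['tagline'].fillna('')
# Fill missing overviews first so that a NaN overview does not discard the tagline,
# and join the two texts with a space so their boundary words are not fused together.
small_data['description'] = small_data['overview'].fillna('') + ' ' + small_data['tagline']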

In [None]:
# min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(small_data['description'])

In [None]:
tfidf_matrix.shape

#### Cosine Similarity

We will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, it is defined as follows:

$cosine(x,y) = \frac{x \cdot y^\intercal}{\|x\| \, \|y\|}$

Since the TF-IDF Vectorizer L2-normalizes each row by default, the dot product of two rows directly gives their cosine similarity score. Therefore, we will use sklearn's **linear_kernel** instead of **cosine_similarity**, since it skips the normalization step and is faster.
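
The claim above can be checked on a toy corpus. This is only a sketch: the three descriptions are invented, and the vectorizer is used with its default settings rather than the ones in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy descriptions, invented for illustration.
docs = [
    'a masked hero protects the city',
    'a hero returns to save the city',
    'two friends open a bakery',
]

# Rows of the TF-IDF matrix are L2-normalized (norm='l2' is the default).
X = TfidfVectorizer().fit_transform(docs)

dot_sim = linear_kernel(X, X)       # plain dot products
cos_sim = cosine_similarity(X, X)   # normalizes the rows again, then takes dot products

# Both matrices should agree up to floating point error.
print(np.allclose(dot_sim, cos_sim))
```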

In [None]:
similarity = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
similarity

In [None]:
small_data = small_data.reset_index()
titles = small_data['title']
indices = pd.Series(small_data.index, index=small_data['title'])

In [None]:
def get_recommendations(title):
    idx = indices[title]
    similarity_scores = list(enumerate(similarity[idx]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similarity_scores = similarity_scores[1:31]
    movie_indices = [i[0] for i in similarity_scores]
    return titles.iloc[movie_indices]

In [None]:
get_recommendations('The Family').head(10)

In [None]:
get_recommendations('Batman Forever').head(10)

We see that for **Batman Forever**, this recommendation system is able to identify it as a Batman film and subsequently recommend other Batman films as its top recommendations. Unfortunately, that is all this system can do at the moment. This is not of much use to most people, as it does not take into consideration very important features such as cast, crew, director and genre, which determine the rating and the popularity of a movie.

Someone who liked **The Dark Knight** probably likes it more because of Nolan and would hate **Batman Forever** and every other substandard movie in the Batman Franchise.

### Metadata Based Recommender

We are going to use much more suggestive metadata than **Overview** and **Tagline**. The metadata based recommender will take **genre**, **keywords**, **cast** and **crew** into consideration.

To build metadata based content recommender, we will need to merge our current dataset with the crew and the keyword datasets. Let us prepare this data as our first step.

In [None]:
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

In [None]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
df['id'] = df['id'].astype('int')

In [None]:
df.shape

In [None]:
df = df.merge(credits, on='id')
df = df.merge(keywords, on='id')

In [None]:
small_data1 = df[df['id'].isin(new_df)]
small_data1.shape

We now have our cast, crew, genres and keywords, all in one dataframe.

1. **Crew:** From the crew, we will only pick the director as our feature since the others don't contribute that much to the *feel* of the movie.
2. **Cast:** Choosing Cast is a little more tricky. Lesser known actors and minor roles do not really affect people's opinion of a movie. Therefore, we must only select the major characters and their respective actors. Arbitrarily we will choose the top 3 actors that appear in the credits list. 
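
The two selection rules above can be sketched on a single hand-made record. The strings below only imitate the stringified-list format of the credits file; the names and fields are hypothetical.

```python
from ast import literal_eval

# Hypothetical raw values, in the same "list of dicts stored as a string" format.
raw_cast = ("[{'name': 'Actor A', 'order': 0}, {'name': 'Actor B', 'order': 1}, "
            "{'name': 'Actor C', 'order': 2}, {'name': 'Actor D', 'order': 3}]")
raw_crew = "[{'job': 'Producer', 'name': 'Person P'}, {'job': 'Director', 'name': 'Person D'}]"

cast = literal_eval(raw_cast)
crew = literal_eval(raw_crew)

# Keep only the first three credited actors.
top_cast = [member['name'] for member in cast][:3]

# Pick the first crew member whose job is 'Director' (None if there is none).
director = next((member['name'] for member in crew if member['job'] == 'Director'), None)

print(top_cast, director)
```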

In [None]:
small_data1['cast'] = small_data1['cast'].apply(literal_eval)
small_data1['crew'] = small_data1['crew'].apply(literal_eval)
small_data1['keywords'] = small_data1['keywords'].apply(literal_eval)
small_data1['cast_size'] = small_data1['cast'].apply(lambda x: len(x))
small_data1['crew_size'] = small_data1['crew'].apply(lambda x: len(x))

In [None]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [None]:
small_data1['director'] = small_data1['crew'].apply(get_director)

In [None]:
small_data1['cast'] = small_data1['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
small_data1['cast'] = small_data1['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)

In [None]:
small_data1['keywords'] = small_data1['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

We will create a metadata dump (the *combination* column) for every movie, consisting of its **genres, director, main actors and keywords.** We then use a **Count Vectorizer** to build a count matrix, much as we built the TF-IDF matrix in the Description Recommender. The remaining steps are the same as before: we calculate the cosine similarities and return the movies that are most similar.

1. **Strip spaces and convert to lowercase** in all features. This way, the engine will not confuse **Sam Wilson** with **Sam Jones**, because each full name becomes a single token.
2. Mention the director twice to give it more weight relative to the entire cast.
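
A minimal sketch of these two steps, using the two names from the example above plus a made-up director. The helper `normalize` is a hypothetical name introduced only for this illustration.

```python
def normalize(name):
    """Remove spaces and lowercase, so a full name becomes one token."""
    return name.replace(" ", "").lower()

cast = ['Sam Wilson', 'Sam Jones']
director = 'Jane Doe'  # hypothetical director

# Without normalization the two actors share the token 'sam'.
shared_raw = set(cast[0].lower().split()) & set(cast[1].lower().split())

# With normalization each name is a distinct token; the director is listed twice.
tokens = [normalize(n) for n in cast] + [normalize(director)] * 2
combination = ' '.join(tokens)

print(shared_raw)
print(combination)
```

Because a count based vectorizer counts repeated tokens, listing the director twice doubles that token's contribution to the movie's vector.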

In [None]:
small_data1['cast'] = small_data1['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [None]:
small_data1['director'] = small_data1['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
small_data1['director'] = small_data1['director'].apply(lambda x: [x,x])

#### Keywords

We will do a small amount of pre-processing of our keywords before putting them to any use. As a first step, we will calculate the frequency counts of every keyword that appears in the dataset.

In [None]:
s = small_data1.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'

In [None]:
s = s.value_counts()
s

Keywords occur in frequencies ranging from 1 to 2170. We do not have any use for keywords that occur only once. Therefore, these can be safely removed. 

Finally, we will convert every word to its stem so that words such as *Dogs* and *Dog* are considered the same.
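
The frequency filter can be sketched on toy data as follows (the keyword lists are invented). One detail worth noting: `value_counts` returns a Series indexed by keyword, so a membership test with `in` checks the index labels, that is the keywords, and not the counts.

```python
import pandas as pd

# Toy keyword lists for three hypothetical movies.
toy_keywords = [['dog', 'friendship'], ['dog', 'heist'], ['friendship', 'space']]

# Flatten the lists and count how often each keyword occurs.
counts = pd.Series([k for kws in toy_keywords for k in kws]).value_counts()

# Keep only keywords that occur at least twice.
frequent = counts[counts >= 2]

# `in` on a Series tests the index labels (the keywords), not the values.
filtered = [[k for k in kws if k in frequent] for kws in toy_keywords]

print(filtered)
```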

In [None]:
s = s[s >= 2]

In [None]:
stemmer = SnowballStemmer('english')
stemmer.stem('dogs')

In [None]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words

In [None]:
small_data1['keywords'] = small_data1['keywords'].apply(filter_keywords)
small_data1['keywords'] = small_data1['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
small_data1['keywords'] = small_data1['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [None]:
small_data1['combination'] = small_data1['keywords'] + small_data1['cast'] + small_data1['director'] + small_data1['genres']
small_data1['combination'] = small_data1['combination'].apply(lambda x: ' '.join(x))

In [None]:
import os
import requests
from tmdbv3api import TMDb, Movie

tmdb = TMDb()
# Read the TMDB API key from the environment instead of hard-coding it in the notebook.
# The variable name TMDB_API_KEY is an assumption; use whatever name you export the key under.
tmdb.api_key = os.environ['TMDB_API_KEY']
tmdb_movie = Movie()

def get_poster(x):
    # Look up the movie on TMDB and return its poster URL, or a default image if none is available.
    response = requests.get('https://api.themoviedb.org/3/movie/{}?api_key={}'.format(x, tmdb.api_key))
    if response.status_code == 200:
        data_json = response.json()
        if data_json.get('poster_path'):
            return 'https://image.tmdb.org/t/p/w500' + data_json['poster_path']
    return 'static/default.jpg'

In [None]:
# Keep only the columns that are needed from here on.
small_data1.columns
small_data1 = small_data1[['id', 'original_title', 'release_date', 'title', 'vote_average', 'combination']]
small_data1

In [None]:
# min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0.
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
count_matrix = count.fit_transform(small_data1['combination'])

In [None]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [None]:
small_data1 = small_data1.reset_index()
titles = small_data1['title']
indices = pd.Series(small_data1.index, index=small_data1['title'])

**We will apply the get_poster function to small_data1 in small chunks. This helps avoid the Connection Reset error we would get if we called get_poster on the whole dataset in one go.**

Why apply get_poster to the whole dataset rather than only to the recommended movies?
We do this to save time when serving recommendations: with the poster URLs fetched in advance, a recommendation needs no API calls, whereas fetching posters for each set of recommended movies on demand would take a lot of time.
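
One possible way to automate the chunking is sketched below. It is not what this notebook does (the cells that follow simply loop over fixed index ranges); `fetch_in_chunks`, its parameters and the fake fetch function are all hypothetical names introduced for illustration.

```python
import time

def fetch_in_chunks(ids, fetch, chunk_size=500, pause=1.0, retries=3):
    """Call fetch(id) for every id, pausing between chunks and retrying on connection errors."""
    results = {}
    for start in range(0, len(ids), chunk_size):
        for movie_id in ids[start:start + chunk_size]:
            for _ in range(retries):
                try:
                    results[movie_id] = fetch(movie_id)
                    break
                except OSError:
                    # Covers built-in connection errors; requests' exceptions also derive from OSError.
                    time.sleep(pause)
            else:
                # All retries failed: fall back to the default poster.
                results[movie_id] = 'static/default.jpg'
        time.sleep(pause)  # give the server a break between chunks
    return results

# Usage with a fake fetch function that fails once, to show the retry path.
state = {'calls': 0}

def fake_fetch(movie_id):
    state['calls'] += 1
    if state['calls'] == 1:
        raise ConnectionResetError('connection reset')
    return 'poster-{}'.format(movie_id)

posters = fetch_in_chunks([1, 2, 3], fake_fetch, chunk_size=2, pause=0, retries=2)
print(posters)
```

In the notebook, `get_poster` would take the place of `fake_fetch`, and the resulting dictionary could be mapped onto the `id` column.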

In [None]:
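# Create an empty (object) column and fill it chunk by chunk.
# .loc is used instead of chained indexing, which does not reliably write back to the DataFrame.
small_data1['poster'] = None
for i in range(2000):
    small_data1.loc[i, 'poster'] = get_poster(small_data1.loc[i, 'id'])

In [None]:
for i in range(2000, 4000):
    small_data1.loc[i, 'poster'] = get_poster(small_data1.loc[i, 'id'])

In [None]:
for i in range(4000, 7000):
    small_data1.loc[i, 'poster'] = get_poster(small_data1.loc[i, 'id'])

In [None]:
for i in range(7000, 9219):
    small_data1.loc[i, 'poster'] = get_poster(small_data1.loc[i, 'id'])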

In [None]:
def get_recommendations1(title):
    idx = indices[title]
    # Several movies can share a title, in which case the lookup returns a Series; use the first match.
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the top entry (normally the movie itself) and keep the next 10.
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]

    # Collect the display fields for the recommended movies.
    rows = small_data1.iloc[movie_indices]
    return_df = pd.DataFrame({
        'Title': rows['title'],
        'Year': rows['release_date'],
        'Ratings': rows['vote_average'],
        'ID': rows['id'],
        'org_title': rows['original_title'],
        'poster': rows['poster'],
    })
    # Show the best rated recommendations first.
    return return_df.sort_values(by=['Ratings'], ascending=False)

In [None]:
get_recommendations1("The Dark Knight")

In [None]:
# The pickle module implements binary protocols for serializing and de-serializing a Python object structure.
import pickle
filename = 'movie_list.pkl'
# Use a context manager so the file is closed (and flushed) after writing.
with open(filename, 'wb') as f:
    pickle.dump(small_data1, f)

# We find this recommender system quite good and will be using it to recommend movies on our web application.