## Part 3

This notebook builds a content-based recommender system.

In [1]:
#Importing necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pd.set_option('display.max_rows', None, 'display.max_columns', None)

In [2]:
movies=pd.read_csv('movies_2.csv')
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48158 entries, 0 to 48157
Data columns (total 12 columns):
Unnamed: 0     48158 non-null int64
movie          48158 non-null object
year           48158 non-null int64
genre          48158 non-null object
duration       48158 non-null int64
certificate    48158 non-null object
directors      48158 non-null object
stars          48158 non-null object
rating         48158 non-null float64
metascore      48158 non-null float64
vote           48158 non-null int64
gross          48158 non-null int64
dtypes: float64(2), int64(5), object(5)
memory usage: 4.4+ MB


In [3]:
movies=movies.drop(['Unnamed: 0','year','duration','certificate','rating','metascore','vote','gross'], axis=1)
movies['stars']=movies['stars'].str.replace('Unknown','')
movies['genre']=movies['genre'].str.replace('Unknown','')
movies['directors']=movies['directors'].str.replace('Unknown','')
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48158 entries, 0 to 48157
Data columns (total 4 columns):
movie        48158 non-null object
genre        48158 non-null object
directors    48158 non-null object
stars        48158 non-null object
dtypes: object(4)
memory usage: 1.5+ MB


To get content-based recommendations, we first need to collect all the words from stars, directors and genre, which is done next. Each movie lists many stars, and not all of them are a deciding factor in whether a movie is good, so I kept only the first three stars per movie. I also gave more weight to directors in the recommendation by repeating each director's name twice in the bag.

In [4]:
movies['stars']=movies['stars'].str.split(',')
movies['n_stars']=movies['stars'].apply(len)
movies.head(2)

Unnamed: 0,movie,genre,directors,stars,n_stars
0,Hassan - The image of our common pain,"Drama, Family",Emaan,"[Leena Alam, Haroon Azizi, Hakim Diljo, Emaan]",4
1,Halt: The Motion Picture,Drama,Jezar Riches,"[Harley Wallen, Andrew Dawe-Collins, Dennis ...",4


In [5]:
# keep at most the first three stars (slicing is safe for shorter lists)
movies['stars1'] = movies['stars'].apply(lambda x: x[:3])
movies.head(2)

Unnamed: 0,movie,genre,directors,stars,n_stars,stars1
0,Hassan - The image of our common pain,"Drama, Family",Emaan,"[Leena Alam, Haroon Azizi, Hakim Diljo, Emaan]",4,"[Leena Alam, Haroon Azizi, Hakim Diljo]"
1,Halt: The Motion Picture,Drama,Jezar Riches,"[Harley Wallen, Andrew Dawe-Collins, Dennis ...",4,"[Harley Wallen, Andrew Dawe-Collins, Dennis ..."


The next function joins the first and last names into a single lowercase token. This prevents the later algorithm from wrongly matching on the shared 'James' in James Cameron and James Wan.

In [6]:
def short(x):
    # remove spaces and lowercase each entry so that every name becomes one token
    lst=[]
    for i in x:
        y=i.replace(' ','')
        z=y.lower()
        lst.append(z)
    return lst
movies['stars2']=movies['stars1'].apply(short)
movies.head()

Unnamed: 0,movie,genre,directors,stars,n_stars,stars1,stars2
0,Hassan - The image of our common pain,"Drama, Family",Emaan,"[Leena Alam, Haroon Azizi, Hakim Diljo, Emaan]",4,"[Leena Alam, Haroon Azizi, Hakim Diljo]","[leenaalam, haroonazizi, hakimdiljo]"
1,Halt: The Motion Picture,Drama,Jezar Riches,"[Harley Wallen, Andrew Dawe-Collins, Dennis ...",4,"[Harley Wallen, Andrew Dawe-Collins, Dennis ...","[harleywallen, andrewdawe-collins, dennisdoyle..."
2,"Onyx, Kings of the Grail",History,Roberto Girault,"[Jim Caviezel, Maria de Medeiros, Anthony Ho...",4,"[Jim Caviezel, Maria de Medeiros, Anthony Ho...","[jimcaviezel, mariademedeiros, anthonyhowell]"
3,Mr. Presto,Comedy,Joey Kneiser,"[Shane Spresser, Eric Giles, Jon Latham, Sp...",4,"[Shane Spresser, Eric Giles, Jon Latham]","[shanespresser, ericgiles, jonlatham]"
4,Green on Green,"Adventure, Comedy",Tom Knoblauch,"[Rachel Dinan, Leah Cardenas, David Remus, ...",4,"[Rachel Dinan, Leah Cardenas, David Remus]","[racheldinan, leahcardenas, davidremus]"
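The effect of joining names can be seen on the James Cameron / James Wan case mentioned above. This is a minimal sketch: `merge_names` is a hypothetical compact equivalent of `short`, and the two-director list is illustrative data, not taken from the dataset.

```python
def merge_names(names):
    """Hypothetical compact equivalent of short(): strip spaces, lowercase."""
    return [name.replace(' ', '').lower() for name in names]

# Illustrative input, not rows from movies_2.csv
directors = ['James Cameron', 'James Wan']

# Without merging, splitting on whitespace yields a shared 'james' token
split_tokens = [set(d.lower().split()) for d in directors]
shared_before = split_tokens[0] & split_tokens[1]

# After merging, each name is a single token and nothing overlaps
merged = merge_names(directors)
shared_after = {merged[0]} & {merged[1]}

print(shared_before)  # {'james'}
print(shared_after)   # set()
```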


In [7]:
# repeat the director so it is counted twice in the bag (extra weight)
movies['directors2']=movies['directors'].apply(lambda x: [x,x])
movies.head(2)

Unnamed: 0,movie,genre,directors,stars,n_stars,stars1,stars2,directors2
0,Hassan - The image of our common pain,"Drama, Family",Emaan,"[Leena Alam, Haroon Azizi, Hakim Diljo, Emaan]",4,"[Leena Alam, Haroon Azizi, Hakim Diljo]","[leenaalam, haroonazizi, hakimdiljo]","[Emaan, Emaan]"
1,Halt: The Motion Picture,Drama,Jezar Riches,"[Harley Wallen, Andrew Dawe-Collins, Dennis ...",4,"[Harley Wallen, Andrew Dawe-Collins, Dennis ...","[harleywallen, andrewdawe-collins, dennisdoyle...","[Jezar Riches, Jezar Riches]"


In [8]:
movies['directors3']=movies['directors2'].apply(short)
movies.head(2)

Unnamed: 0,movie,genre,directors,stars,n_stars,stars1,stars2,directors2,directors3
0,Hassan - The image of our common pain,"Drama, Family",Emaan,"[Leena Alam, Haroon Azizi, Hakim Diljo, Emaan]",4,"[Leena Alam, Haroon Azizi, Hakim Diljo]","[leenaalam, haroonazizi, hakimdiljo]","[Emaan, Emaan]","[emaan, emaan]"
1,Halt: The Motion Picture,Drama,Jezar Riches,"[Harley Wallen, Andrew Dawe-Collins, Dennis ...",4,"[Harley Wallen, Andrew Dawe-Collins, Dennis ...","[harleywallen, andrewdawe-collins, dennisdoyle...","[Jezar Riches, Jezar Riches]","[jezarriches, jezarriches]"


In [9]:
movies['genre']=movies['genre'].str.split(',')
movies['genre3']=movies['genre'].apply(short)
movies=movies.drop(['genre','directors','stars','n_stars','stars1','directors2'], axis=1)
movies.head(2)

Unnamed: 0,movie,stars2,directors3,genre3
0,Hassan - The image of our common pain,"[leenaalam, haroonazizi, hakimdiljo]","[emaan, emaan]","[drama, family]"
1,Halt: The Motion Picture,"[harleywallen, andrewdawe-collins, dennisdoyle...","[jezarriches, jezarriches]",[drama]


This step combines the stars, directors and genre of each movie into a single string. The number of tokens in each bag is also stored in bag_len for later use.

In [10]:
movies['bag'] = (movies['stars2'] + movies['directors3'] + movies['genre3']).apply(lambda x: [s for s in x if s])
movies['bag_len']=movies['bag'].apply(len)
movies['bag'] = movies['bag'].apply(lambda x: ' '.join(x))
movies.head(2)

Unnamed: 0,movie,stars2,directors3,genre3,bag,bag_len
0,Hassan - The image of our common pain,"[leenaalam, haroonazizi, hakimdiljo]","[emaan, emaan]","[drama, family]",leenaalam haroonazizi hakimdiljo emaan emaan d...,7
1,Halt: The Motion Picture,"[harleywallen, andrewdawe-collins, dennisdoyle...","[jezarriches, jezarriches]",[drama],harleywallen andrewdawe-collins dennisdoylejr....,6


We use the Bag-of-Words model here. We first split each document into tokens, then assign a weight to each token based on how often it occurs in the document or across the whole collection of documents, and finally build a matrix in which each row is a document and each column is a token. This operation gives us an item vector: each movie is a vector of weighted values, one per attribute (for example a director's or a star's name). The vectorizer we use for this purpose is CountVectorizer. It produces a matrix whose rows are movies and whose columns are words, with each weight equal to the number of times the word occurs in the document.

In [11]:
count = CountVectorizer(analyzer='word',stop_words='english')
movies_matrix = count.fit_transform(movies['bag'])
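To make the resulting matrix concrete, here is a small sketch on three hypothetical bags built the same way as the `bag` column (the names and the `toy_*` variables are illustrative assumptions). It also shows one side effect worth knowing: with its default token pattern, CountVectorizer splits on hyphens, so a hyphenated name ends up as two tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy bags: stars, director repeated twice, genre
toy_bags = [
    "leonardodicaprio josephgordon-levitt christophernolan christophernolan action",
    "matthewmcconaughey annehathaway christophernolan christophernolan adventure",
    "ellarcoltrane patriciaarquette richardlinklater richardlinklater drama",
]

toy_count = CountVectorizer(analyzer='word', stop_words='english')
toy_matrix = toy_count.fit_transform(toy_bags)

# Maps each token to its column index in the matrix
vocab = toy_count.vocabulary_

# Dense view: one row per movie, one column per token
dense = toy_matrix.toarray()

# The repeated director is counted twice in the first two movies
print(dense[:, vocab['christophernolan']])  # [2 2 0]

# The hyphenated name was split into two separate tokens
print('josephgordon' in vocab, 'levitt' in vocab)  # True True
```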

Next, we calculate the similarity between items. The previous step gave us item profiles, which are vectors in a high-dimensional space, and a good distance measure between them is the angle between each pair of item vectors. We can estimate this angle with the cosine formula. Mathematically,
$$\text{cosine similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
that is, the dot product of the two vectors divided by the product of their magnitudes. Cosine similarity therefore measures how similar two movies are: as theta decreases, cosine similarity increases.
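The cosine formula can be checked by hand on two small vectors. The sketch below uses two hypothetical count vectors over a shared five-token vocabulary and compares the manual computation with `cosine_similarity` from scikit-learn.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical count vectors for two movies (illustrative values)
a = np.array([1, 1, 2, 1, 0])
b = np.array([0, 1, 2, 1, 1])

# Dot product divided by the product of the magnitudes
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# cosine_similarity expects 2-D inputs, one row per vector
library = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(round(manual, 4))   # 0.8571, i.e. 6/7
print(round(library, 4))  # 0.8571
```

Here the dot product is 6 and both magnitudes are the square root of 7, so the similarity is 6/7.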

For the next steps we need a map that takes a movie's title and gives its index in the dataframe, which is built below. We also define a function that returns a list of similar movies for a movie of our choice. The function first computes the cosine similarity between the selected movie and all movies, then returns the 10 movies most similar to it.

In [12]:
# map that gives the movie index in dataframe
indices = pd.Series(movies.index, index=movies['movie'])

# function that provides a list of similar movies
def recommendations(title):
    
    # get index of the movie in the dataframe
    idx = indices[title]
    
    # get movie matrix for the selected movie
    chosen_movie_matrix = movies_matrix[idx]
    
    #get cosine similarity of all the movies with the selected movie
    cos_sim = cosine_similarity(chosen_movie_matrix, movies_matrix)
    
    # get a list of tuples (movie index, similarity score) for all the movies
    # against the provided movie. Each score is penalised by 1/bag_len, so movies with smaller bags lose more.
    scores = [(i, sim - 1/(movies.iloc[i]['bag_len']+.0001)) for i, sim in enumerate(cos_sim[0])]

    # sort all the movies by similarity score in descending order
    scores.sort(key=lambda x: x[1], reverse=True)

    # get the indices of the top 10 similar movies, skipping the first entry
    # (normally the selected movie itself)
    movie_indices = [i[0] for i in scores[1:11]]

    # get the names of the most similar movies
    return movies.iloc[movie_indices]['movie']
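As a point of comparison, the same scoring can be written with array operations instead of a Python loop. This is only a sketch under stated assumptions, not the function used in this notebook: `toy`, `toy_matrix` and `recommend_sketch` are hypothetical names, the four bags are made-up data, and the selected movie is excluded explicitly rather than by skipping the first sorted entry.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data with the same 'movie', 'bag' and 'bag_len' columns
toy = pd.DataFrame({
    'movie': ['A', 'B', 'C', 'D'],
    'bag': [
        'star1 star2 dir1 dir1 drama',
        'star3 star4 dir1 dir1 drama',
        'star5 star6 dir2 dir2 comedy',
        'star1 star7 dir3 dir3 drama',
    ],
})
toy['bag_len'] = toy['bag'].str.split().apply(len)
toy_matrix = CountVectorizer().fit_transform(toy['bag'])

def recommend_sketch(title, top_n=2):
    # position of the selected movie (first match if titles repeat)
    idx = toy.index[toy['movie'] == title][0]

    # similarity of the selected movie to every movie
    sims = cosine_similarity(toy_matrix[idx], toy_matrix)[0]

    # same penalty as above, applied to the whole array at once
    scores = sims - 1 / (toy['bag_len'].to_numpy() + 0.0001)

    # exclude the selected movie explicitly
    scores[idx] = -np.inf

    # indices of the highest scores, best first
    order = np.argsort(-scores)[:top_n]
    return toy.iloc[order]['movie'].tolist()

print(recommend_sketch('A'))  # ['B', 'D']
```

Movie B shares the double-weighted director and the genre with A, while D shares one star and the genre, so B ranks first.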

In [13]:
recommendations('Inception')

22440                      Interstellar
32080             The Dark Knight Rises
6513                            Dunkirk
4024                     The Star Kings
24037       Star Wars: The New Republic
28336                           Don Jon
599                    Ready Player One
695                           Bumblebee
1314     Jurassic World: Fallen Kingdom
1373                            Rampage
Name: movie, dtype: object

In the case of Inception, the first 3 movies have the same director as the selected movie and some overlap with its genre. The rest of the recommended movies match on genre. As this is a content-based recommender system, we do not filter movies by rating; instead we have preferred filtering based on the director(s), stars and genre.

In [14]:
recommendations('Boyhood')

27559            Before Midnight
7082            Last Flag Flying
12838     Everybody Wants Some!!
37285                     Bernie
954                        Blaze
1065                   Stockholm
6952              First Reformed
8637                 Blood Money
12466                     Maudie
13604    In a Valley of Violence
Name: movie, dtype: object

For 'Boyhood', the first 3 recommendations have the same director, and the first 2 belong to the same genre, 'drama'.