
A quick check to make sure we are using the Google Colab free GPU instead of our local runtime:


In [2]:
import tensorflow as tf 
tf.test.gpu_device_name()

'/device:GPU:0'

We see the output of '/device:GPU:0', which tells us that the Colab GPU is being used. Good!


Next, we import the necessary Python 3 libraries:



In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score
print("all libraries successfully imported")

all libraries successfully imported


Let us read in a movie dataset and see what types of column values are present:

In [2]:
movie_df = pd.read_csv('https://raw.githubusercontent.com/codeheroku/Introduction-to-Machine-Learning/master/Building%20a%20Movie%20Recommendation%20Engine/movie_dataset.csv')
print(movie_df.head())
print("")
print("Features of the dataset are:\n")
print(movie_df.columns)

   index     budget                                    genres  \
0      0  237000000  Action Adventure Fantasy Science Fiction   
1      1  300000000                  Adventure Fantasy Action   
2      2  245000000                    Action Adventure Crime   
3      3  250000000               Action Crime Drama Thriller   
4      4  260000000          Action Adventure Science Fiction   

                                       homepage      id  \
0                   http://www.avatarmovie.com/   19995   
1  http://disney.go.com/disneypictures/pirates/     285   
2   http://www.sonypictures.com/movies/spectre/  206647   
3            http://www.thedarkknightrises.com/   49026   
4          http://movies.disney.com/john-carter   49529   

                                            keywords original_language  \
0  culture clash future space war space colony so...                en   
1  ocean drug abuse exotic island east india trad...                en   
2         spy based on novel sec

The core of this system is computing the cosine similarity between a movie that the user is known to like and every other movie in the dataset.
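
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. Below is a minimal pure-Python sketch of that formula. The `cosine` helper and the three-term count vectors are made up for illustration and are not part of the notebook.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical word counts over the vocabulary ["batman", "crime", "space"]
movie_a = [2, 1, 0]
movie_b = [1, 1, 0]
movie_c = [0, 0, 3]

print(round(cosine(movie_a, movie_b), 2))  # 0.95
print(round(cosine(movie_a, movie_c), 2))  # 0.0
```

Vectors that share most of their terms score close to 1, while vectors with no terms in common score 0.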

To do that, we combine several text features of each movie into a single string and compute the similarity between those combined strings. For simplicity, I will manually choose the features that I find most important: *keywords, cast, genres, and director*. 

Below is a helper function for combining my selected features within a single row.

In [3]:
features = ['keywords','cast','genres','director']
def combine_features(row):
    # Join the selected feature values of one row into a single space-separated string
    return " ".join(row[feature] for feature in features)

Let's start pre-processing the data. The first step is to clean up missing feature values: we will simply replace them with empty strings. 

The one caveat is that if our dataset is sparse (e.g., too many NaN or empty feature values), the affected movies have fewer terms to match on, so their cosine similarity scores will be skewed.
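
To get a feel for how missing values shift the scores, here is a small standard-library sketch. The feature strings and the `cosine_from_text` helper are hypothetical, and the whitespace tokenization is a simplification of what the notebook does later.

```python
import math
from collections import Counter

def cosine_from_text(text_a, text_b):
    """Cosine similarity between whitespace-tokenized word-count vectors."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: an all-empty row matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical combined-feature strings: keywords + cast + genres + director
complete = "hero city crime alice bob action thriller carol"
same_film_missing_cast = "hero city crime action thriller carol"
all_fields_missing = ""

print(round(cosine_from_text(complete, complete), 2))                # 1.0
print(round(cosine_from_text(complete, same_film_missing_cast), 2))  # 0.87
print(cosine_from_text(complete, all_fields_missing))                # 0.0
```

The same film with its cast field blanked out no longer scores 1.0 against itself, and a row with every field missing scores 0 against everything.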

Next, we need to create a new dataframe with our selected features, along with the index and title of every movie in the original dataframe. We then add a new column called 'combined_features', which holds the combined string of all the features we selected previously. This column is populated by calling our combine_features function on each row. 

In [4]:
# Fill NaNs in the selected features with empty strings so they can be concatenated
for feature in features:
    movie_df[feature] = movie_df[feature].fillna('')

# Keep only the title, the index, and the selected feature columns
new_df = movie_df[["title", "index"] + features].copy()

# Apply combine_features() to each row and store the result in "combined_features"
new_df["combined_features"] = new_df.apply(combine_features, axis=1)
print(new_df.head())

                                      title  index  \
0                                    Avatar      0   
1  Pirates of the Caribbean: At World's End      1   
2                                   Spectre      2   
3                     The Dark Knight Rises      3   
4                               John Carter      4   

                                            keywords  \
0  culture clash future space war space colony so...   
1  ocean drug abuse exotic island east india trad...   
2         spy based on novel secret agent sequel mi6   
3  dc comics crime fighter terrorist secret ident...   
4  based on novel mars medallion space travel pri...   

                                                cast  \
0  Sam Worthington Zoe Saldana Sigourney Weaver S...   
1  Johnny Depp Orlando Bloom Keira Knightley Stel...   
2  Daniel Craig Christoph Waltz L\u00e9a Seydoux ...   
3  Christian Bale Michael Caine Gary Oldman Anne ...   
4  Taylor Kitsch Lynn Collins Samantha Morton Wil...   

 

Now, we need to vectorize the 'combined_features' column of our dataframe. I will use CountVectorizer to convert each string into a vector of word counts. I will then pass the resulting count_matrix to the cosine_similarity function from sklearn.metrics.pairwise, which returns a square matrix of similarity scores between every pair of movies.

In [5]:
# Create a new CountVectorizer() object
cv = CountVectorizer()

# Feed the combined strings (movie contents) to the CountVectorizer() object
count_matrix = cv.fit_transform(new_df["combined_features"])

# Pairwise cosine similarity between all rows of the count matrix
cosine_sim = cosine_similarity(count_matrix)
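
To make the vectorization step more concrete, here is a rough standard-library sketch of how text becomes count vectors. It only approximates CountVectorizer's default behavior (lowercasing and keeping tokens of two or more word characters) and is not scikit-learn's implementation. The `tokenize` helper and the two example strings are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    # Approximates CountVectorizer's defaults: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Hypothetical combined-feature strings (not rows from the dataset)
docs = [
    "dc comics crime Christian Bale Action Crime Christopher Nolan",
    "magic rivalry Christian Bale Drama Mystery Christopher Nolan",
]

# The vocabulary is the set of all tokens, sorted here for readability
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

# Each document becomes one row of counts, one column per vocabulary term
count_matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    count_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Note that multi-word names are split into separate tokens, so "Christian Bale" is matched on "christian" and "bale" independently. Two movies featuring different people who share a first name will therefore pick up a small amount of similarity.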

A couple of helper functions to retrieve index and title values from the dataframe:

In [6]:
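# Both helpers assume the "index" column matches each movie's row position in cosine_sim
def get_title_from_index(index):
    # Use the "index" column explicitly; new_df.index would refer to the DataFrame's row labels
    return new_df[new_df["index"] == index]["title"].values[0]

def get_index_from_title(title):
    return new_df[new_df["title"] == title]["index"].values[0]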
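
One thing to be aware of: these lookups raise an IndexError when the title is not in the dataset, because `.values[0]` is taken on an empty selection. A more forgiving lookup could be sketched with a plain dictionary, as below. The `titles` list and the `find_index` helper are hypothetical. In the notebook, the titles would come from `new_df["title"]`.

```python
# Hypothetical titles in row order; in the notebook these would come from new_df["title"]
titles = ["Avatar", "Spectre", "The Dark Knight Rises", "John Carter"]

# Map each normalized title to its row position once, instead of filtering per lookup
title_to_index = {title.strip().lower(): position for position, title in enumerate(titles)}

def find_index(title):
    """Return the row position for a title, or None when it is not in the dataset."""
    return title_to_index.get(title.strip().lower())

print(find_index("the dark knight rises"))  # 2
print(find_index("Not A Real Movie"))       # None
```

If the same title appears more than once, the last occurrence wins in this sketch, so a real version would need a rule for duplicates.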

We then find cosine similarity scores between a given movie and every other movie:

In [11]:
movie_user_likes = "The Dark Knight Rises"
movie_index = get_index_from_title(movie_user_likes)

# Take the similarity row for the chosen movie and pair each score with its movie index
similar_movies = list(enumerate(cosine_sim[movie_index]))
# Sort by score, highest first; [1:] skips the first entry, which is expected to be the movie itself
sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)[1:]

# print(sorted_similar_movies)
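
Dropping the first sorted entry with `[1:]` assumes the chosen movie always sorts first. If another movie has identical combined features (and therefore also scores 1.0), that assumption may not hold. A sketch of an alternative that excludes the chosen movie by position, using made-up scores rather than real dataset values:

```python
# Hypothetical similarity row for the movie at position 2 (not real dataset values)
scores = [1.0, 0.31, 1.0, 0.73, 0.0]
movie_index = 2

# Exclude the query movie by position, then sort the rest from most to least similar
ranked = sorted(
    ((i, s) for i, s in enumerate(scores) if i != movie_index),
    key=lambda pair: pair[1],
    reverse=True,
)

print(ranked[:3])  # [(0, 1.0), (3, 0.73), (1, 0.31)]
```

Here movie 0 ties with the query movie at 1.0. Filtering by position keeps movie 0 in the results and leaves the query movie out, whichever of the two sorts first.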

Finally, let's display the 15 movies most similar to the one our hypothetical user likes, along with their cosine similarity scores:

In [12]:
print("Top 15 similar movies to '"+movie_user_likes+"' using Cosine Similarity are:\n")
# Print the 15 highest-scoring movies
for element in sorted_similar_movies[:15]:
    print(get_title_from_index(element[0]), " -- ", round(element[1], 2))

Top 15 similar movies to 'The Dark Knight Rises' using Cosine Similarity are:

Batman Begins  --  0.73
The Dark Knight  --  0.69
Amidst the Devil's Wings  --  0.45
The Killer Inside Me  --  0.39
The Prestige  --  0.38
Batman Returns  --  0.36
Batman  --  0.35
Batman & Robin  --  0.34
Kick-Ass  --  0.33
RockNRolla  --  0.33
Kick-Ass 2  --  0.33
Harry Brown  --  0.31
In Too Deep  --  0.29
Defendor  --  0.29
Point Blank  --  0.29


**References**

1. Code Heroku. "Building a Movie Recommendation Engine in Python Using Scikit-learn". 2019. Accessed from https://medium.com/code-heroku/building-a-movie-recommendation-engine-in-python-using-scikit-learn-c7489d7cb145
2. Wikipedia. "Cosine Similarity". 2020. Accessed from https://en.wikipedia.org/wiki/Cosine_similarity

