<font color='black'>
<h1>
<span style="font-family:verdana; font-size:1.4em;">
Basic Movie Recommendation System using Cosine Similarity
</span>
</h1>
</font>

<font color='steelblue'>
<h2>
<span style="font-family:verdana; font-size:1.6em;">
Content Based Recommendation System for movies<br>
</span>
</h2>
</font>

<font color='darkslategrey'>
<span style="font-family:verdana; font-size:1.4em;">
    <b>Using content-based filtering, build a recommendation system that suggests movies similar to one a user already likes.
    <br><br>
    The notebook covers the following:<br></b>
<ol>
        <li>Simple example of Cosine Similarity</li>
        <li>Read a movie data set</li>
        <li>Use certain features about the movie</li>
        <li>Build a cosine similarity matrix</li>
        <li>Use this matrix to recommend movies</li>
</ol>
</span>
</font>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# embed the plots in notebook
%matplotlib inline
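# grids in the plots; matplotlib 3.6+ renamed this style to 'seaborn-v0_8-whitegrid'
gridStyle = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(gridStyle)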


pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

<font color='steelblue'>
    <h2>Finding Similarity</h2><br>
</font>
<font color='darkslategray'>
<span style="font-family:verdana; font-size:1.2em;">
How do we find movies that are similar to a given movie? Once we can measure similarity, we can recommend the most similar movies to the user.
    <ul>
        <li>We need not only to find similar movies but also to quantify how similar they are</li><br><i><b>
        <li>Let's take a simple text example</li>
        <ul>
            <li>Text 1: Apple Orange Apple</li>
            <li>Text 2: Orange Orange Apple</li>
        </ul></b></i>
    </ul>
</span>
</font>

In [None]:
# define list containing two strings
text = ["Apple Orange Apple", "Orange Orange Apple"]

In [None]:
countVec = CountVectorizer()
countMatrix = countVec.fit_transform(text)

`countMatrix` is a sparse matrix. To view it in human-readable form, we call its `toarray()` method. Before printing `countMatrix`, let us first print the feature list (the word list) that our `CountVectorizer()` object learned from the text.

In [None]:
# get word list (get_feature_names() was removed in scikit-learn 1.2)
countVec.get_feature_names_out()

In [None]:
countMatrix.toarray()

<font color='steelblue'>
    <h2>Cosine Similarity</h2><br>
</font>
<font color='darkslategrey'>
<span style="font-family:verdana; font-size:1.4em;">
    <ul>
    <li>This indicates that the word <b>‘apple’ occurs 2 times in Text 1 and 1 time in Text 2</b>. Similarly, the word <b>‘orange’ occurs 1 time in Text 1 and 2 times in Text 2.</b></li><br>

<li>Next, we compute the cosine (or “cos”) similarity between these vectors to measure how similar they are to each other. We can calculate this using the <code>cosine_similarity()</code> function from the <code>sklearn.metrics.pairwise</code> module.</li><br>
        <li>A similarity value closer to 0 indicates that the two texts (or, later, two movies) are dissimilar, whereas a value closer to 1 indicates that they are similar.</li>
    </ul>
    </span>
    </font>

In [None]:
# Calculate cosine similarity for the matrix
similarity = cosine_similarity(countMatrix)
similarity

<font color='darkslategray'>
<span style="font-family:verdana; font-size:1.0em;">
<b>
We can interpret this output like this:

1. Each row of the similarity matrix corresponds to one sentence of our input. So, row 0 = Text 1 and row 1 = Text 2.
2. The same applies to the columns. To make this clearer, the output above can be read as the following table:
    </b>
    </span>
    </font>
<code>
          Text 1   Text 2  
Text 1: [[1.0      0.8]  
Text 2:  [0.8      1.0]]  
</code>
<br>
<font color='darkslategrey'>
<span style="font-family:verdana; font-size:1.0em;">
<b>
This says that Text 1 is 100% similar to itself (position [0,0]) and 80% similar to Text 2 (position [0,1]). The output is always a symmetric matrix, because if Text 1 is 80% similar to Text 2, then Text 2 is also 80% similar to Text 1.
    </b>
    </span>
    </font>
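
The 0.8 figure can be checked by hand. The following is a minimal sketch, assuming the count vectors `[2, 1]` and `[1, 2]` from the example above (columns ordered as apple, orange); the helper name `cosine` is hypothetical and not part of scikit-learn.

```python
import numpy as np

# Count vectors from the example above; columns are ['apple', 'orange']
text1 = np.array([2, 1])   # "Apple Orange Apple"
text2 = np.array([1, 2])   # "Orange Orange Apple"

def cosine(a, b):
    """Cosine of the angle between vectors a and b (hypothetical helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# dot product = 2*1 + 1*2 = 4; each vector has length sqrt(5); 4 / 5 = 0.8
sim_12 = cosine(text1, text2)
sim_21 = cosine(text2, text1)   # same value: the measure is symmetric
sim_11 = cosine(text1, text1)   # a vector compared with itself gives 1

print(round(sim_12, 4), round(sim_21, 4), round(sim_11, 4))
```

Because the dot product and the product of the lengths do not depend on the order of the two vectors, swapping the arguments gives the same score, which is why the matrix is symmetric.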

<font color='steelblue'>
<span style="font-family:verdana; font-size:1.6em;">
    <h2>Building Movie Recommendation</h2>
</span>
</font>

In [None]:
df = pd.read_csv("movie_dataset.csv")

In [None]:
df.head(2)

## Use keywords, cast, genres and director as our features

In [None]:
features = ['keywords','cast','genres','director']

In [None]:
# function to combine all these features into a single string, i.e.
# row['keywords'] + " " + row['cast'] + " " + row['genres'] + " " + row['director']
def combineFeatures(row):
    # join with spaces so the last word of one feature does not fuse
    # with the first word of the next
    return " ".join(str(row[feature]) for feature in features)
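
Note that `str(row[feature])` turns a missing value into the literal word `nan`, which would then be counted as a token. The sketch below shows one possible way to avoid that by filling missing values with empty strings first. The two-row frame `toy` and the name `feature_cols` are hypothetical stand-ins for the real dataset, and it is an assumption that the CSV contains missing values in these columns.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame standing in for movie_dataset.csv
toy = pd.DataFrame({
    "keywords": ["space war", np.nan],
    "cast": ["Actor One", "Actor Two"],
    "genres": ["Action", "Drama"],
    "director": ["Director A", np.nan],
})
feature_cols = ["keywords", "cast", "genres", "director"]

# Replace missing values with empty strings, join each row with spaces,
# then trim the leading/trailing spaces left behind by empty fields
combined = (
    toy[feature_cols]
    .fillna("")
    .apply(lambda row: " ".join(row), axis=1)
    .str.strip()
)

print(combined.tolist())
```

Whether to drop missing values this way or to leave them in is a judgment call; with a small number of gaps the effect on the similarity scores is likely minor.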

In [None]:
# create a feature column, and initialize it with empty string
df['feature'] = ""

In [None]:
df.head(1)

In [None]:
df.title.nunique()

In [None]:
df.title.unique()

In [None]:
# see first row with only feature information
df[['keywords', 'cast', 'genres', 'director']].head(1)

In [None]:
# set the values for each row as the combined features
df['feature'] = df.apply(combineFeatures, axis = 1)

In [None]:
# show the feature column of the first row
df.iloc[0].feature

## Create Count Vectorizer for the feature column

In [None]:
countVec = CountVectorizer()
countMatrix = countVec.fit_transform(df['feature'])

In [None]:
# first 10 words in the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
countVec.get_feature_names_out()[:10]

## Create Cosine Similarity

In [None]:
similarity = cosine_similarity(countMatrix)

In [None]:
similarity.shape

In [None]:
df.title[:10]

In [None]:
# first 10 similarity scores for the first movie (row 0 of the matrix)
similarity[0][:10]

## Utility Functions

In [None]:
# get title from movie index
def getTitleFromIndex(index):
    return df[df.index == index]['title'].values[0]

# get movie index from movie title
def getIndexFromTitle(title):
    return df[df.title == title]['index'].values[0]

# print the top n recommendations for a user who likes a certain movie
def getTopRecommendations(numRecommendations, movieLikedByUser):
    mIndex = getIndexFromTitle(movieLikedByUser)
    similarMovies = list(enumerate(similarity[mIndex]))
    # sort the (index, score) pairs by similarity score in descending
    # order, then discard the first element, which is normally the
    # movie that was passed into the function (similarity 1.0 with itself)
    sortedMovies = sorted(similarMovies, key = lambda x: x[1], reverse = True)[1:]

    # print the titles of the first numRecommendations movies
    for movie in sortedMovies[:numRecommendations]:
        print(getTitleFromIndex(movie[0]))
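
The function above drops the first sorted element on the assumption that the queried movie always ranks first. That may not hold if another movie has an identical feature string and therefore also scores 1.0. As a hedged alternative, the sketch below excludes the queried movie by its index and returns the result as a list instead of printing it. The 4x4 matrix `sim`, the list `titles`, and the function `top_similar` are hypothetical and only stand in for the real data.

```python
import numpy as np

# Hypothetical 4x4 similarity matrix and titles, standing in for the real data
titles = ["A", "B", "C", "D"]
sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

def top_similar(sim_matrix, item_index, n):
    """Return indices of the n most similar items, excluding item_index itself."""
    order = np.argsort(sim_matrix[item_index])[::-1]   # highest score first
    order = order[order != item_index]                 # drop the query item by index
    return order[:n].tolist()

recommended = [titles[i] for i in top_similar(sim, 0, 2)]
print(recommended)
```

Returning a list rather than printing makes the result easier to reuse, for example to display scores alongside titles.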

In [None]:
getTopRecommendations(5, "Iron Man")

In [None]:
getTopRecommendations(10, "Toy Story")

In [None]:
names = df.title.unique()

In [None]:
for name in names:
    print(name)

In [None]:
getTopRecommendations(10, "Life of Pi")