# Movie Recommender

## Intro

In today's on-demand world, one of the most beneficial uses of a recommender system is suggesting which movie or TV show a user would be interested in watching. Netflix uses viewing history to recommend what people should watch next, but before a user builds up a substantial history, one would guess that they have seen a preview of a show, or at least read its plot, to make sure they'll enjoy it. Kaggle has a list of around 35k movies and their plots that can be found here: https://www.kaggle.com/jrobischon/wikipedia-movie-plots. In this project I will take a sample of the movies in this CSV by filtering to movies released from 2008 onward and use their plots to make recommendations based on the plots of the movies a user likes.

## Data 
The Kaggle data has movies dating back to 1901, with other fields such as the movie's Wikipedia page and country of origin. For the purpose of this project we'll filter the data to only include movies from 2008 onward and keep only the movie's title, release year, director, genre, and plot. This gives us enough to see what kind of recommendations we get for certain inputs.


## Import Movie File

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

df = pd.read_csv('https://raw.githubusercontent.com/dquarshie89/Data-620/master/movie_plots.csv')

#Preview movie data frame
df.head(5)

Unnamed: 0,Title,Release Year,Director,Genre,Plot
0,"10,000 BC",2008,Roland Emmerich,adventure,"At about 10,000 BC, a tribe of hunter-gatherer..."
1,21,2008,Robert Luketic,drama,MIT maths major Ben Campbell (Jim Sturgess) is...
2,27 Dresses,2008,Anne Fletcher,romantic comedy,Jane Nichols (Katherine Heigl) has been a brid...
3,88 Minutes,2008,Jon Avnet,thriller,Forensic psychiatrist Dr. Jack Gramm (Al Pacin...
4,The Accidental Husband,2008,Griffin Dunne,romance,Patrick Sullivan (Jeffrey Dean Morgan) is look...


## TF-IDF
Term Frequency-Inverse Document Frequency, or TF-IDF, is one of the most popular algorithms in text processing. The idea is that the algorithm takes a set of documents, in our case movie plots, and scores how important each word is to each document. It does this by counting how frequently a word is used in a document (term frequency) and scaling that count down when the word appears in many of the documents (inverse document frequency), so a word that is common in one plot but rare across the others scores highest. The TF-IDF scores for each document form a vector, and stacking those vectors gives a matrix of scores with one row per document and one column per word. For our movie plots we can use scikit-learn's TfidfVectorizer class, which produces the TF-IDF matrix for us.

In [3]:
#Remove common English stop words such as "the" and "and"
tfidf = TfidfVectorizer(stop_words='english')
#Make the TF-IDF matrix using the words in the movies' plots
tfidf_matrix = tfidf.fit_transform(df['Plot'])
#Output the shape of tfidf_matrix
tfidf_matrix.shape


(4030, 49109)

The shape of our matrix tells us that there are 4,030 movies in our data frame and 49,109 unique words in their plots once the stop words are removed.
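
To make the weighting concrete, here is a minimal sketch on three made-up one-line plots (the `toy_plots` list and the `toy_` names are illustrative assumptions, not part of the Kaggle data). It uses the same `TfidfVectorizer(stop_words='english')` setup and shows that, within one plot, a word found in only that plot gets a higher weight than a word shared with another plot.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up "plots" (hypothetical, not taken from the movie data)
toy_plots = [
    "a spider bites a student",
    "a student joins a wizard school",
    "a wizard fights a dragon",
]

toy_tfidf = TfidfVectorizer(stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_plots)

# One row per plot, one column per distinct non-stop-word term
print(toy_matrix.shape)

# Weights of three terms in the first plot
vocab = toy_tfidf.vocabulary_
row0 = toy_matrix[0].toarray().ravel()
for term in ("spider", "bites", "student"):
    print(term, round(row0[vocab[term]], 3))
```

"spider" and "bites" appear only in the first plot, so they receive the same weight, and that weight is larger than the one for "student", which also appears in the second plot.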

## Cosine Similarity
Now that we have the TF-IDF matrix we can use it to get similarity scores so that we can compare the movies. For this project we'll go with the cosine similarity score since it is fast to calculate. The cosine similarity of two TF-IDF vectors is their dot product divided by the product of their lengths, as in the formula below. Because TfidfVectorizer scales every vector to unit length by default, the denominator is 1 and the plain dot product already is the cosine similarity, so we can use sklearn's linear_kernel() to get the results easily.

\begin{equation*}
\mathrm{cosine}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
\end{equation*}


In [6]:
#Cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim

array([[1.00000000e+00, 6.62447541e-03, 4.48519788e-03, ...,
        5.48106689e-03, 3.83218953e-03, 0.00000000e+00],
       [6.62447541e-03, 1.00000000e+00, 1.38559553e-02, ...,
        0.00000000e+00, 3.25548290e-03, 1.45588830e-03],
       [4.48519788e-03, 1.38559553e-02, 1.00000000e+00, ...,
        5.53065872e-04, 1.18131753e-02, 0.00000000e+00],
       ...,
       [5.48106689e-03, 0.00000000e+00, 5.53065872e-04, ...,
        1.00000000e+00, 1.50412249e-03, 0.00000000e+00],
       [3.83218953e-03, 3.25548290e-03, 1.18131753e-02, ...,
        1.50412249e-03, 1.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 1.45588830e-03, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 1.00000000e+00]])
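
Using `linear_kernel()` as a cosine similarity relies on the rows of the TF-IDF matrix having unit length, which is the `TfidfVectorizer` default (`norm='l2'`). The sketch below checks this on a made-up three-plot corpus (the `toy_` names are illustrative assumptions) by comparing `linear_kernel` against scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up plots (hypothetical, not taken from the movie data)
toy_plots = [
    "a spider bites a student",
    "a student joins a wizard school",
    "a wizard fights a dragon",
]
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_plots)

# Length of each row vector; all 1.0 because of the default norm='l2'
row_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1)).ravel())
print(np.allclose(row_norms, 1.0))

# Plain dot products versus the full cosine formula
dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = cosine_similarity(toy_matrix, toy_matrix)
print(np.allclose(dot_products, cosines))
```

If the vectorizer were created with `norm=None`, the two matrices would generally differ and `cosine_similarity` would be the one to use.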

## Recommender Function  
With our vectors made and everything scored, we can build the function that recommends movies based on the plot of a movie the user liked. The function returns the top 10 movies most similar to the one the user selects. To get started we'll map each movie title to its row index so that a chosen title can be related back to the movie data frame. Then we'll get the list of pairwise cosine similarity scores between the chosen movie and every other movie and sort it from highest to lowest. The first entry should be the chosen movie itself, since its similarity with itself is 1, so we'll skip it and take the next 10 as the most similar movies.

In [7]:
#Map each movie title to its row index, keeping the first row if a title is repeated
indices = pd.Series(df.index, index=df['Title'])
indices = indices[~indices.index.duplicated(keep='first')]

def get_recommendations(title, cosine_sim=cosine_sim):
    #Get the index of the movie chosen
    idx = indices[title]
    #Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    #Sort the movies based on the similarity scores from greatest to least
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    #Get the scores of the 10 most similar movies, ignoring the 1st one (the movie itself)
    sim_scores = sim_scores[1:11]
    #Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    #Return the top 10 most similar movies
    return df['Title'].iloc[movie_indices]
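
The same lookup, score, sort, and skip steps can be tried on a tiny made-up data frame. This is a sketch under stated assumptions: `toy_df`, `toy_sim`, `toy_indices`, and `top_similar` are hypothetical names, the plots are invented, and it swaps in `np.argsort` for the `sorted(enumerate(...))` approach while excluding the chosen movie by its index rather than by position.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up titles and plots (hypothetical, not taken from the movie data)
toy_df = pd.DataFrame({
    'Title': ['Spider Student', 'Wizard School', 'Dragon Fight', 'Spider City'],
    'Plot': [
        'a spider bites a student in the city',
        'a student joins a wizard school',
        'a wizard fights a dragon',
        'a giant spider attacks the city',
    ],
})

toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_df['Plot'])
toy_sim = linear_kernel(toy_matrix, toy_matrix)
toy_indices = pd.Series(toy_df.index, index=toy_df['Title'])

def top_similar(title, k=2):
    # Row index of the chosen movie
    idx = toy_indices[title]
    scores = toy_sim[idx]
    # argsort is ascending, so negate the scores to rank from most to least similar
    order = np.argsort(-scores, kind='stable')
    # Drop the chosen movie itself, then keep the first k
    order = [i for i in order if i != idx][:k]
    return toy_df['Title'].iloc[order].tolist()

print(top_similar('Spider Student'))
```

This prints `['Spider City', 'Wizard School']`: the first shares two words with the chosen plot and the second shares one. Excluding the chosen movie by index is a design choice that still works if another movie happens to have an identical plot and ties for first place.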

## Test the Recommender Function
Our function is good to go. Let's test it out using the movie Baby Mama. We'll extract that plot from our data frame and see what movies the function suggests when we give it that movie.

In [15]:
#Baby Mama Plot
df.loc[df['Title'] == 'Baby Mama', 'Plot'].item()

"Kate Holbrook (Tina Fey) is a successful single businesswoman who has always put her career before her personal life. Now in her late thirties, she finally decides to have her own child, but her plans are dampened when she discovers she has a minuscule chance of becoming pregnant because her uterus is T-shaped. Also denied the chance to adopt, Kate hires an immature, obnoxious, South Philly woman named Angie Ostrowski (Amy Poehler) to become her surrogate mother.\r\nWhen Angie becomes pregnant, Kate prepares for motherhood in her own typically driven fashion—until her surrogate shows up at her door with no place to live. Their conflicting personalities put them at odds as Kate learns first-hand about balancing motherhood and career and also dates the owner of a local blended-juice cafe, Rob Ackerman (Greg Kinnear).\r\nUnknown to Kate, the in-vitro fertilization procedure Angie had did not succeed and she is feigning the pregnancy, hoping to ultimately run off with her payment. Eventua

In [16]:
#Baby Mama Recommendations
get_recommendations('Baby Mama')

993                Smashed
322     My Sister's Keeper
530         Preacher's Kid
117     Over Her Dead Body
2016        Beautiful Kate
1495        You're Not You
337                 Orphan
57        Four Christmases
3618      Strawberry Cliff
1114      Drinking Buddies
Name: Title, dtype: object

Reading the plot, we can see that Baby Mama is a movie about a career-driven woman, Kate, who wants to use a surrogate to have a child. The surrogate turns out to be an immature woman, Angie, who plans to run off with Kate's money and not give her a child. The two learn from and change each other for the better and both end up having kids. Wow...what a movie.

Our recommender system's first suggestion for Baby Mama is a movie called Smashed. Let's take a look at its plot.

In [17]:
#Smashed Plot
df.loc[df['Title'] == 'Smashed', 'Plot'].item()



We see that Smashed is a movie focused on a young woman who wants to change her life and stop drinking. The plot mentions a lot of themes that Baby Mama also referenced: babies, pregnancy, faking pregnancies, and relationships. We can see why the recommender would suggest this.

Let's pick another movie and see what results we get.

## Spider-Man Recommendations

In [18]:
#Spider-Man Results
get_recommendations('The Amazing Spider-Man 2')

788     Amazing Spider-Man, TheThe Amazing Spider-Man
1065                                  Big Ass Spider!
2524                                       Dark Blood
2278                                      Harry Brown
659      Harry Potter and the Deathly Hallows: Part 2
2386    Harry Potter and the Deathly Hallows: Part II
2790                         Spooks: The Greater Good
464      Harry Potter and the Deathly Hallows: Part 1
2319     Harry Potter and the Deathly Hallows: Part I
2647                             The Harry Hill Movie
Name: Title, dtype: object

Looking at the recommendations for The Amazing Spider-Man 2, we are pleased to see that the system does a great job in recommending the first Spider-Man movie. It also does well by picking other fantasy movies like Harry Potter. But when we look at our movie dataset we see that Spider-Man: Homecoming is there. The user would most likely be interested in that movie as well, but our system missed it.

## Recommendations Based on Title
Since our plot-based recommender missed the latest Spider-Man movie in the dataset when the user picked The Amazing Spider-Man 2, let's see if we can alter the function to pick based on movie titles.

In [20]:
#Rec Based on Title
#Use a separate vectorizer so the one fitted on the plots is not overwritten
title_tfidf = TfidfVectorizer(stop_words='english')
title_tfidf_matrix = title_tfidf.fit_transform(df['Title'])

# Compute the cosine similarity matrix
title_cosine_sim = linear_kernel(title_tfidf_matrix, title_tfidf_matrix)

def get_recommendations_title(title, cosine_sim=title_cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    title_sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    title_sim_scores = sorted(title_sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies, skipping the movie itself
    title_sim_scores = title_sim_scores[1:11]

    # Get the movie indices
    title_movie_indices = [i[0] for i in title_sim_scores]

    # Return the top 10 most similar movies
    return df['Title'].iloc[title_movie_indices]

get_recommendations_title('The Amazing Spider-Man 2')


788         Amazing Spider-Man, TheThe Amazing Spider-Man
1892                               Spider-Man: Homecoming
355                                         A Serious Man
2245                                        The Other Man
2788                                               Man Up
2802                              The Man from U.N.C.L.E.
1065                                      Big Ass Spider!
3781                        The Amazing Praybeyt Benjamin
281                                       I Love You, Man
787     Amazing Adventures of the Living Corpse, TheTh...
Name: Title, dtype: object

The new function above surfaces both the first Amazing Spider-Man movie and Spider-Man: Homecoming. Based on titles, the user will see the movies with Spider in their names, along with unrelated ones such as A Serious Man that merely share the word Man. But the user will also miss out on movies like Harry Potter, since none of the plot is taken into consideration.
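
One possible way to get some of both behaviors, offered only as a sketch and not something this project implements, is to blend the plot and title similarity matrices with a weight. Everything below is an assumption for illustration: the titles and plots are made up, `similarity` is a hypothetical helper, and the weight of 0.5 is arbitrary.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up titles and plots (hypothetical, not taken from the movie data)
titles = ['Spider Hero 2', 'Spider Hero', 'Spider Hero Returns', 'Wizard School']
plots = [
    'a student with spider powers fights a villain',
    'a spider bites a student and he gains powers',
    'a teenager balances school and crime fighting',
    'a student learns magic and fights a dark villain',
]

def similarity(texts):
    # TF-IDF rows have unit length by default, so the dot product is the cosine
    matrix = TfidfVectorizer(stop_words='english').fit_transform(texts)
    return linear_kernel(matrix, matrix)

plot_sim = similarity(plots)
title_sim = similarity(titles)

# Arbitrary weight (assumption): 0.5 counts plot and title equally
weight = 0.5
blended = weight * plot_sim + (1 - weight) * title_sim

# Scores of every other movie against the first one, best first
for j in np.argsort(-blended[0], kind='stable'):
    if j != 0:
        print(titles[j], round(plot_sim[0, j], 3), round(title_sim[0, j], 3), round(blended[0, j], 3))
```

In this toy data the plot score alone gives "Spider Hero Returns" a zero and the title score alone gives "Wizard School" a zero, while the blended score is positive for both. How well this would work on the real movie data, and what weight to use, would need to be tested.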

## Conclusion
Using TF-IDF made getting scores for the words in the plots straightforward. We were able to use those scores to see which plots used those words the most and make assumptions that those plots would be similar. In the case of Baby Mama and Smashed I think the user would have been pleased with the choice they got. Smashed is a comedy centered around people making better life choices while Baby Mama is a comedy about people learning from each other to have better lives. In the case of The Amazing Spider-Man 2, I'm pretty sure that a user who chose to watch it would probably choose to watch the first one, but I also think they would've liked to know that Spider-Man: Homecoming was available as well.

The recommender system could be improved by taking in more than just the frequency of words in the plots or the titles. Maybe adding the sentiment of the movie, reviews, or directors and actors would make the recommendations more spot on. However, I'm pleased that the system did as well as it did with just 4,000 movies to choose from.