## Recommender: Overview of Movie Synopsis

This notebook builds a recommender from the 'overview' column, a concise description of each movie's synopsis. It uses a TfidfVectorizer which, in contrast to CountVectorizer, down-weights words that appear in many overviews, so rarer and more distinctive words carry more weight.
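To make the contrast with raw counts concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It is not the notebook's code: `tfidf_weights` and the three toy synopses are hypothetical, tokenization is a plain whitespace split, and the smoothed IDF formula is assumed to match scikit-learn's documented default (`smooth_idf=True`, L2 normalisation).

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({t: w / norm for t, w in raw.items()})
    return weights

# Hypothetical toy corpus, not taken from the dataset
docs = ["toys live in a room", "a thief leads a crew", "a boy finds a game"]
w = tfidf_weights(docs)
# "toys" and "a" each occur once in the first synopsis, so raw counts would
# treat them equally; TF-IDF ranks "toys" higher because only one document has it.
print(w[0]["toys"] > w[0]["a"])
```

The word "a" appears in every toy synopsis, so its IDF is the minimum value of 1, while "toys" appears in only one and gets a larger factor.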

In [1]:
###############
### IMPORTS ###
###############

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import pickle

In [2]:
df_all = pd.read_csv('data/dataframe_merged_small.csv')

In [3]:
print('Shape of dataframe: ', df_all.shape)
print('Columns of dataframe: ', df_all.columns)

Shape of dataframe:  (10876, 29)
Columns of dataframe:  Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'cast', 'crew', 'keywords', 'director',
       'weighted_rating'],
      dtype='object')


In [4]:
df = pd.read_csv('data/dataframe_merged_small.csv', usecols=['id', 'title', 'overview', 'keywords'])

In [5]:
print('Shape of dataframe: ', df.shape)
print('Columns of dataframe: ', df.columns)

Shape of dataframe:  (10876, 4)
Columns of dataframe:  Index(['id', 'overview', 'title', 'keywords'], dtype='object')


In [6]:
df

Unnamed: 0,id,overview,title,keywords
0,862,"Led by Woody, Andy's toys live happily in his ...",Toy Story,"['jealousy', 'toy', 'boy', 'friendship', 'frie..."
1,8844,When siblings Judy and Peter discover an encha...,Jumanji,"['board game', 'disappearance', ""based on chil..."
2,15602,A family wedding reignites the ancient feud be...,Grumpier Old Men,"['fishing', 'best friend', 'duringcreditssting..."
3,949,"Obsessive master thief, Neil McCauley leads a ...",Heat,"['robbery', 'detective', 'bank', 'obsession', ..."
4,11860,An ugly duckling having undergone a remarkable...,Sabrina,"['paris', 'brother brother relationship', 'cha..."
...,...,...,...,...
10871,19307,Sid and Bernie keep having their amorous inten...,Carry On Camping,"['holiday', 'nudist camp', 'camping', 'tent', ..."
10872,18098,Scheherezade puts herself in danger to save Su...,Arabian Nights,[]
10873,52103,Little pocket thief Wu never got away from the...,Pickpocket,['independent film']
10874,455661,A closeted boy runs the risk of being outed by...,In a Heartbeat,"['love', 'teenager', 'lgbt', 'short']"


In [7]:
# Join [overview] and [keywords] into a single text column
# Both columns describe the plot, so combining them gives the vectorizer more text per movie
df['overview_and_keywords'] = df['overview'] + df['keywords'].astype(str)
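The concatenation above appends the keyword list in its stringified form (brackets and quotes included) with no separating space, and in pandas a missing overview makes the whole result NaN for that row. A hedged alternative is sketched below; `join_overview_keywords` is a hypothetical helper, not part of the notebook, and it assumes the keywords are stored as Python-style list literals such as `"['toy', 'boy']"`.

```python
import ast

def join_overview_keywords(overview, keywords):
    """Combine a synopsis with a stringified keyword list into one text."""
    # Treat a missing overview (None or float NaN) as empty text
    if not isinstance(overview, str):
        overview = ""
    # Parse "['toy', 'boy']" into a real list; fall back to no keywords
    try:
        parsed = ast.literal_eval(keywords) if isinstance(keywords, str) else []
    except (ValueError, SyntaxError):
        parsed = []
    if not isinstance(parsed, list):
        parsed = []
    # Join everything with single spaces and trim the ends
    return " ".join([overview] + [str(k) for k in parsed]).strip()

print(join_overview_keywords("A story.", "['toy', 'boy']"))
print(join_overview_keywords(float("nan"), "['love', 'teenager']"))
```

Under these assumptions it could be applied row by row, for example with `df.apply(lambda r: join_overview_keywords(r['overview'], r['keywords']), axis=1)`, so that rows without an overview still keep their keywords.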

In [8]:
df['overview_and_keywords'].loc[0]

"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.['jealousy', 'toy', 'boy', 'friendship', 'friends']"

In [9]:
# Count the NaN values in [overview]
# NOTE: a NaN overview also makes [overview_and_keywords] NaN for that row (NaN + str is NaN),
#       so the NaN values must be replaced before the text is fed to the vectorizer
df['overview'].isnull().sum()

52

In [10]:
# Replace every NaN in the dataframe with an empty string
df = df.fillna('')

In [11]:
df['overview'].isnull().sum()

0

In [12]:
vectorizer = TfidfVectorizer(stop_words='english')
doc_word = vectorizer.fit_transform(df['overview_and_keywords'])

In [13]:
# Compute the cosine similarity matrix from doc_word
cosine_sim = cosine_similarity(doc_word, doc_word)
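For reference, the quantity computed for each pair of rows is the cosine of the angle between their TF-IDF vectors. The sketch below is a plain-Python illustration with a hypothetical `cosine` function, not scikit-learn's implementation; returning 0.0 for an all-zero vector (an empty document) is a convention chosen here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # An all-zero vector has no direction; report no similarity
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine([1, 2], [2, 4]))   # same direction
print(cosine([1, 0], [0, 1]))   # no shared terms
```

Because TfidfVectorizer L2-normalises each row by default, the cosine of two rows reduces to their dot product, which is why every non-empty document has a similarity of 1 with itself on the diagonal.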

In [14]:
# Look at the cosine_sim matrix
pd.DataFrame(cosine_sim)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,10866,10867,10868,10869,10870,10871,10872,10873,10874,10875
0,1.000000,0.013389,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.010031,0.046185,0.000000
1,0.013389,1.000000,0.040562,0.052569,0.000000,0.000000,0.009913,0.0,0.005114,0.000000,...,0.0,0.000000,0.000000,0.000000,0.003447,0.000000,0.000000,0.047648,0.000000,0.012675
2,0.000000,0.040562,1.000000,0.000000,0.000000,0.000000,0.010711,0.0,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.016470,0.000000,0.016014,0.050754,0.000000,0.014073
3,0.000000,0.052569,0.000000,1.000000,0.000000,0.000000,0.000000,0.0,0.018957,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.028662,0.032332,0.000000
4,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.009021,0.016196,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10871,0.000000,0.000000,0.000000,0.000000,0.016196,0.000000,0.000000,0.0,0.000000,0.040168,...,0.0,0.031422,0.000000,0.000000,0.007882,1.000000,0.000000,0.026808,0.000000,0.000000
10872,0.000000,0.000000,0.016014,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000
10873,0.010031,0.047648,0.050754,0.028662,0.000000,0.000000,0.008306,0.0,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.005253,0.026808,0.000000,1.000000,0.013248,0.004446
10874,0.046185,0.000000,0.000000,0.032332,0.000000,0.000000,0.009476,0.0,0.000000,0.000000,...,0.0,0.000000,0.034507,0.062217,0.000000,0.000000,0.000000,0.013248,1.000000,0.013403


In [19]:
# Save cosine_sim array to use in hybrid recommendation system
np.save('similarity_matrix/cos_overview_small.npy', cosine_sim)
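The saved array is dense, so its size grows with the square of the number of movies. A back-of-the-envelope estimate follows; `dense_matrix_bytes` is a hypothetical helper, and 8 bytes per entry assumes the matrix holds float64 values.

```python
def dense_matrix_bytes(n_rows, bytes_per_value=8):
    """Approximate memory for a dense n_rows x n_rows similarity matrix."""
    return n_rows * n_rows * bytes_per_value

size = dense_matrix_bytes(10876)  # 10876 movies, as in the dataframe shape
print(f"{size} bytes = {size / 1024**3:.2f} GiB")
```

For the 10,876 movies here that comes to roughly 0.88 GiB, before the small header of the `.npy` file. Storing float32 values would halve it, at the cost of some precision.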

In [20]:
# Build a reverse mapping from movie title to its row index in the dataframe
indices = pd.Series(df.index, index=df['title'])
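One caveat with a title-keyed lookup: if two films share a title, a pandas Series indexed by title returns every matching row instead of a single index. Whether this dataset contains such duplicates is not checked in the notebook. The sketch below shows one possible policy, keeping the first occurrence, using a plain dict; `build_title_index` and the sample titles are hypothetical.

```python
def build_title_index(titles):
    """Map each title to the position of its first occurrence."""
    index = {}
    for position, title in enumerate(titles):
        # setdefault keeps the earlier position when a title repeats
        index.setdefault(title, position)
    return index

lookup = build_title_index(["Heat", "Sabrina", "Heat"])
print(lookup["Heat"])
```

A comparable effect in pandas would be to drop duplicate titles before building the mapping. Which copy to keep is a design choice that depends on the data.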

In [21]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]
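The function above sorts every score and drops the first entry on the assumption that the queried movie ranks first. A hedged variant of the selection step is sketched below: it excludes the query by index and uses `heapq.nlargest` to avoid a full sort. `top_k_similar` is a hypothetical name, and `scores` stands for a single row of the similarity matrix.

```python
import heapq

def top_k_similar(scores, query_idx, k=10):
    """Return the indices of the k highest scores, excluding query_idx."""
    # Pair each score with its position and skip the query itself
    candidates = ((i, s) for i, s in enumerate(scores) if i != query_idx)
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[1])
    return [i for i, _ in best]

row = [1.0, 0.2, 0.9, 0.0, 0.5]  # made-up similarity scores for movie 0
print(top_k_similar(row, query_idx=0, k=2))
```

Excluding by index still works when the queried movie does not have the top score, for example when its text is empty and its self-similarity is 0.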

In [22]:
get_recommendations('The Dark Knight Rises')

5365                                       The Dark Knight
274                                                 Batman
667                                         Batman Returns
6209                            Batman: Under the Red Hood
2707                                                 Q & A
4411                                         Batman Begins
4080                    Batman Beyond: Return of the Joker
7450     Batman Unmasked: The Psychology of the Dark Kn...
10350    LEGO DC Comics Super Heroes: Justice League - ...
6752                                      Batman: Year One
Name: title, dtype: object

In [23]:
get_recommendations('Toy Story')

6179               Toy Story 3
1458               Toy Story 2
4470    The 40 Year Old Virgin
8125                 Small Fry
485      Rebel Without a Cause
5007              Factory Girl
2616             Class of 1984
7273             A Simple Life
1486           Man on the Moon
1528      White Men Can't Jump
Name: title, dtype: object