# Movie Recommendations with Document Similarity

Recommender systems are among the most popular and widely adopted applications of machine learning. They recommend entities to users, and these entities can be anything: products, movies, services and so on.

Popular examples of recommendations include:
- Amazon suggesting products on its website
- Amazon Prime, Netflix, Hotstar recommending movies/shows
- YouTube recommending videos to watch

Typically recommender systems can be implemented in three ways:

- Simple Rule-based Recommenders: Typically based on global metrics and thresholds such as movie popularity or global ratings.
- Content-based Recommenders: These recommend entities similar to a specific entity of interest, using content metadata such as movie descriptions, genre, cast, director and so on.
- Collaborative filtering Recommenders: These need no metadata; instead, they predict recommendations and ratings from the past ratings that different users gave to specific items.

We will build a movie recommendation system here: based on data/metadata pertaining to different movies, we try to recommend similar movies of interest!

![](https://github.com/duybluemind1988/Data-science/blob/master/NLP/Text_analytic_Apress/Ch07%20-%20Text%20Similarity%20and%20Clustering/netflix_rec.png?raw=1)

Since our focus is not really recommendation engines but NLP, we will leverage the text-based metadata for each movie to recommend movies similar to specific movies of interest. This falls under content-based recommenders.

# Load Dataset

In [1]:
import pandas as pd

df = pd.read_csv('https://github.com/duybluemind1988/Data-science/blob/master/NLP/Text_analytic_Apress/Ch07%20-%20Text%20Similarity%20and%20Clustering/tmdb_5000_movies.csv.gz?raw=true', compression='gzip')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [2]:
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [3]:
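df = df[['title', 'tagline', 'overview', 'genres', 'popularity']].copy()  # copy to avoid chained-assignment warnings
df['tagline'] = df['tagline'].fillna('')  # missing taglines become empty strings
df['description'] = df['tagline'].map(str) + ' ' + df['overview']
df = df.dropna()  # drops the rows with a missing overview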
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   title        4800 non-null   object 
 1   tagline      4800 non-null   object 
 2   overview     4800 non-null   object 
 3   genres       4800 non-null   object 
 4   popularity   4800 non-null   float64
 5   description  4800 non-null   object 
dtypes: float64(1), object(5)
memory usage: 262.5+ KB


In [4]:
df.head()

Unnamed: 0,title,tagline,overview,genres,popularity,description
0,Avatar,Enter the World of Pandora.,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",150.437577,Enter the World of Pandora. In the 22nd centur...
1,Pirates of the Caribbean: At World's End,"At the end of the world, the adventure begins.","Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",139.082615,"At the end of the world, the adventure begins...."
2,Spectre,A Plan No One Escapes,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",107.376788,A Plan No One Escapes A cryptic message from B...
3,The Dark Knight Rises,The Legend Ends,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",112.31295,The Legend Ends Following the death of Distric...
4,John Carter,"Lost in our world, found in another.","John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",43.926995,"Lost in our world, found in another. John Cart..."


# Build a Movie Recommender System

Here you will build your own movie recommender system. We will use the following pipeline:
- Text pre-processing
- Feature Engineering
- Document Similarity Computation
- Find top similar movies
- Build a movie recommendation function


## Document Similarity

Recommendations are about understanding the underlying features which make us favour one choice over another. Similarity between items (in this case movies) is one way to understand why we choose one movie over another. There are different ways to calculate similarity between two items. One of the most widely used measures is __cosine similarity__, which we have already used in the previous unit.

### Cosine Similarity

Cosine Similarity is used to calculate a numeric score to denote the similarity between two text documents. Mathematically, it is defined as follows:

$$ cosine(x,y) = \frac{x \cdot y^\intercal}{\|x\| \, \|y\|} $$
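As a minimal sketch of the formula, the NumPy snippet below computes the score for a few made-up vectors. The helper name `cosine`, the toy vocabulary and the zero-vector convention are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity between two 1-D vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    # Convention assumed here: similarity with an all-zero vector is 0.
    return float(x @ y / denom) if denom else 0.0

# Hypothetical term-count vectors over the vocabulary ["space", "alien", "pirate"]
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]   # same direction as doc_a, twice as long
doc_c = [0, 0, 3]   # shares no terms with doc_a

sim_ab = cosine(doc_a, doc_b)
sim_ac = cosine(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Because the score depends only on the angle between the vectors, `doc_a` and `doc_b` come out as maximally similar even though their lengths differ, while documents with no terms in common score zero.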

## Text pre-processing

We will do some basic text pre-processing on our movie descriptions before we build our features.

In [7]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
import re
import numpy as np

stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
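    # remove special characters, then lower case and strip surrounding whitespace
    # (flags must be passed by keyword: the fourth positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)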
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(list(df['description']))
len(norm_corpus)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


4800
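To see what the normalization does to a single description, here is a simplified sketch. It assumes a tiny hand-picked stopword set and whitespace splitting as stand-ins for NLTK's English stopword list and `nltk.word_tokenize`, so that it runs without downloading any NLTK data; the names `TOY_STOP_WORDS` and `normalize_toy` are hypothetical.

```python
import re

# Tiny hand-picked stopword set (an assumption for this sketch);
# the notebook uses NLTK's full English list.
TOY_STOP_WORDS = {"the", "of", "a", "is", "to", "in"}

def normalize_toy(doc):
    # drop everything except ASCII letters, digits and whitespace;
    # flags go in by keyword, since re.sub's fourth positional argument is count
    doc = re.sub(r"[^a-zA-Z0-9\s]", "", doc, flags=re.A)
    # lower-case, then split on whitespace (stand-in for nltk.word_tokenize)
    tokens = doc.lower().split()
    # keep only the tokens that are not stopwords
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

sample = "Enter the World of Pandora."
normalized = normalize_toy(sample)
print(normalized)
```

The tagline loses its punctuation, casing and stopwords, which matches the start of the first normalized document in the corpus.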

In [9]:
norm_corpus

array(['enter world pandora 22nd century paraplegic marine dispatched moon pandora unique mission becomes torn following orders protecting alien civilization',
       'end world adventure begins captain barbossa long believed dead come back life headed edge earth turner elizabeth swann nothing quite seems',
       'plan one escapes cryptic message bonds past sends trail uncover sinister organization battles political forces keep secret service alive bond peels back layers deceit reveal terrible truth behind spectre',
       ...,
       'signed sealed delivered introduces dedicated quartet civil servants dead letter office us postal system transform elite team lostmail detectives determination deliver seemingly undeliverable takes post office unpredictable world letters packages past save lives solve crimes reunite old loves change futures arriving late always miraculously time',
       'new yorker shanghai ambitious new york attorney sam sent shanghai assignment immediately stumbles le

## Extract TF-IDF Features

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
tfidf_matrix = tf.fit_transform(norm_corpus)
tfidf_matrix.shape

(4800, 20667)

In [11]:
tfidf_matrix

<4800x20667 sparse matrix of type '<class 'numpy.float64'>'
	with 164727 stored elements in Compressed Sparse Row format>
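The effect of `ngram_range=(1, 2)` and `min_df=2` is easier to see on a toy corpus. The sketch below uses three made-up documents (an assumption for illustration): unigrams and bigrams are extracted, and only the terms that occur in at least two documents are kept.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, already normalized
toy_corpus = [
    "space pirate captain",
    "space pirate treasure",
    "haunted house",
]

# same settings as the notebook: unigrams + bigrams, terms must appear in >= 2 documents
toy_tf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
toy_matrix = toy_tf.fit_transform(toy_corpus)

vocab = sorted(toy_tf.vocabulary_)
print(vocab)
print(toy_matrix.shape)
```

Only `pirate`, `space` and the bigram `space pirate` survive the `min_df` filter, so the third document ends up as an all-zero row. The same pruning is why the full corpus yields 20667 features rather than every unigram and bigram it contains.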

## Compute Pairwise Document Similarity

In [12]:
from sklearn.metrics.pairwise import cosine_similarity

doc_sim = cosine_similarity(tfidf_matrix)
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,4760,4761,4762,4763,4764,4765,4766,4767,4768,4769,4770,4771,4772,4773,4774,4775,4776,4777,4778,4779,4780,4781,4782,4783,4784,4785,4786,4787,4788,4789,4790,4791,4792,4793,4794,4795,4796,4797,4798,4799
0,1.0,0.010701,0.0,0.01903,0.028687,0.024901,0.0,0.026516,0.0,0.00742,0.0,0.013823,0.0,0.0,0.009263,0.0,0.008344,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009415,0.0,0.035881,0.047218,0.008047,0.026601,0.0,0.062495,0.00668,0.0,0.0,0.0,0.063331,0.0,0.0,0.008586,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.049388,0.0,0.0,0.005816,0.0,0.0,0.0,0.0,0.012472,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009702,0.0,0.023336,0.033549,0.0,0.0,0.0,0.006892,0.0,0.0
1,0.010701,1.0,0.011891,0.0,0.041623,0.0,0.014564,0.027122,0.034688,0.007614,0.0,0.0,0.052134,0.009248,0.029468,0.004819,0.035518,0.032426,0.033918,0.0,0.018308,0.0,0.011241,0.020996,0.009661,0.042502,0.035897,0.024378,0.008258,0.006492,0.0,0.0128,0.006855,0.0,0.0,0.021532,0.016604,0.043655,0.016932,0.008811,...,0.012986,0.0,0.0,0.0,0.013762,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00643,0.010886,0.023603,0.030106,0.006809,0.0,0.009489,0.017109,0.012248,0.017389,0.0,0.0,0.0,0.0,0.0,0.0,0.007478,0.02251,0.009956,0.0,0.004818,0.0,0.0,0.012593,0.0,0.022391,0.013724,0.0
2,0.0,0.011891,1.0,0.0,0.0,0.0,0.0,0.022242,0.015854,0.004891,0.0,0.083542,0.012713,0.011898,0.0,0.003837,0.009272,0.032232,0.033011,0.0,0.0,0.0,0.014462,0.014841,0.0,0.0,0.0,0.0,0.0,0.090528,0.0,0.02369,0.019635,0.0,0.0,0.08201,0.013058,0.0,0.022129,0.0,...,0.014107,0.015918,0.021758,0.0,0.052869,0.0,0.015808,0.0,0.0,0.004846,0.0,0.0,0.0,0.0,0.010297,0.038254,0.010843,0.008621,0.0,0.008626,0.022677,0.007735,0.0,0.0,0.0,0.023521,0.0,0.0,0.005954,0.0,0.042617,0.0,0.0,0.0,0.016519,0.0,0.0,0.011682,0.0,0.004
3,0.01903,0.0,0.0,1.0,0.008793,0.0,0.015976,0.023172,0.027452,0.07361,0.0,0.0,0.0,0.0,0.0,0.005932,0.015444,0.0,0.004329,0.010927,0.005457,0.026736,0.0,0.0,0.015205,0.029242,0.015571,0.0,0.006201,0.021761,0.015126,0.025419,0.020006,0.0,0.0,0.0,0.025365,0.0,0.0243,0.009492,...,0.011328,0.015874,0.03006,0.0,0.025849,0.0,0.006226,0.010545,0.0,0.016889,0.0,0.0,0.0,0.0,0.004482,0.0,0.0,0.0,0.036303,0.0,0.009426,0.009747,0.0,0.022818,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009667,0.0,0.0,0.0,0.0,0.028354,0.021785,0.027735
4,0.028687,0.041623,0.0,0.008793,1.0,0.0,0.022912,0.028676,0.0,0.023538,0.016695,0.0,0.027047,0.013766,0.01413,0.0,0.042218,0.026063,0.011792,0.016861,0.014546,0.0,0.0,0.0,0.028304,0.0,0.064792,0.046028,0.026343,0.019018,0.0,0.019027,0.026188,0.036808,0.0,0.0,0.037018,0.0,0.0,0.025814,...,0.009146,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032717,0.0,0.0,0.0,0.0,0.008873,0.0,0.0,0.0,0.0,0.009513,0.013763,0.007763,0.0,0.0,0.0,0.0,0.0,0.0,0.014875,0.018762,0.0148,0.0,0.0,0.0,0.0,0.01076,0.0,0.010514,0.0,0.0
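One design note, offered as a sketch rather than as the notebook's method: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the cosine similarity between rows reduces to a plain dot product. On a made-up corpus, scikit-learn's `linear_kernel` should therefore return the same matrix as `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical corpus used only for this comparison
toy_corpus = ["space pirate captain", "space pirate treasure", "haunted house ghost"]
toy_matrix = TfidfVectorizer().fit_transform(toy_corpus)  # rows have unit L2 norm

sim_cos = cosine_similarity(toy_matrix)  # normalizes rows, then takes dot products
sim_dot = linear_kernel(toy_matrix)      # dot products only

same = np.allclose(sim_cos, sim_dot)
print(same)
print(np.round(sim_cos, 3))
```

The diagonal is 1.0 because every document is identical to itself, and documents with no shared terms score 0.0, which is the same pattern visible in `doc_sim_df`.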


## Get List of Movie Titles

In [13]:
movies_list = df['title'].values
movies_list, movies_list.shape

(array(['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
        ..., 'Signed, Sealed, Delivered', 'Shanghai Calling',
        'My Date with Drew'], dtype=object), (4800,))

## Find Top Similar Movies for a Sample Movie

Let's take __Minions__, the most popular movie in the dataset, and find the most similar movies which can be recommended.

#### Find movie ID

In [14]:
np.where(movies_list == 'Minions')

(array([546]),)

In [17]:
movie_idx = np.where(movies_list == 'Minions')[0][0]
movie_idx

546

#### Get movie similarities

In [15]:
doc_sim_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,4760,4761,4762,4763,4764,4765,4766,4767,4768,4769,4770,4771,4772,4773,4774,4775,4776,4777,4778,4779,4780,4781,4782,4783,4784,4785,4786,4787,4788,4789,4790,4791,4792,4793,4794,4795,4796,4797,4798,4799
0,1.000000,0.010701,0.000000,0.019030,0.028687,0.024901,0.000000,0.026516,0.000000,0.007420,0.000000,0.013823,0.000000,0.000000,0.009263,0.000000,0.008344,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.009415,0.000000,0.035881,0.047218,0.008047,0.026601,0.000000,0.062495,0.006680,0.000000,0.0,0.000000,0.063331,0.000000,0.000000,0.008586,...,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.049388,0.000000,0.000000,0.005816,0.000000,0.000000,0.000000,0.000000,0.012472,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.009702,0.0,0.023336,0.033549,0.000000,0.000000,0.0,0.006892,0.000000,0.000000
1,0.010701,1.000000,0.011891,0.000000,0.041623,0.000000,0.014564,0.027122,0.034688,0.007614,0.000000,0.000000,0.052134,0.009248,0.029468,0.004819,0.035518,0.032426,0.033918,0.000000,0.018308,0.000000,0.011241,0.020996,0.009661,0.042502,0.035897,0.024378,0.008258,0.006492,0.000000,0.012800,0.006855,0.000000,0.0,0.021532,0.016604,0.043655,0.016932,0.008811,...,0.012986,0.000000,0.000000,0.0,0.013762,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.006430,0.010886,0.023603,0.030106,0.006809,0.000000,0.009489,0.017109,0.012248,0.017389,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.007478,0.022510,0.009956,0.0,0.004818,0.000000,0.000000,0.012593,0.0,0.022391,0.013724,0.000000
2,0.000000,0.011891,1.000000,0.000000,0.000000,0.000000,0.000000,0.022242,0.015854,0.004891,0.000000,0.083542,0.012713,0.011898,0.000000,0.003837,0.009272,0.032232,0.033011,0.000000,0.000000,0.000000,0.014462,0.014841,0.000000,0.000000,0.000000,0.000000,0.000000,0.090528,0.000000,0.023690,0.019635,0.000000,0.0,0.082010,0.013058,0.000000,0.022129,0.000000,...,0.014107,0.015918,0.021758,0.0,0.052869,0.0,0.015808,0.000000,0.0,0.004846,0.000000,0.000000,0.000000,0.000000,0.010297,0.038254,0.010843,0.008621,0.000000,0.008626,0.022677,0.007735,0.0,0.000000,0.000000,0.023521,0.000000,0.0,0.005954,0.000000,0.042617,0.0,0.000000,0.000000,0.016519,0.000000,0.0,0.011682,0.000000,0.004000
3,0.019030,0.000000,0.000000,1.000000,0.008793,0.000000,0.015976,0.023172,0.027452,0.073610,0.000000,0.000000,0.000000,0.000000,0.000000,0.005932,0.015444,0.000000,0.004329,0.010927,0.005457,0.026736,0.000000,0.000000,0.015205,0.029242,0.015571,0.000000,0.006201,0.021761,0.015126,0.025419,0.020006,0.000000,0.0,0.000000,0.025365,0.000000,0.024300,0.009492,...,0.011328,0.015874,0.030060,0.0,0.025849,0.0,0.006226,0.010545,0.0,0.016889,0.000000,0.000000,0.000000,0.000000,0.004482,0.000000,0.000000,0.000000,0.036303,0.000000,0.009426,0.009747,0.0,0.022818,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.009667,0.000000,0.000000,0.000000,0.0,0.028354,0.021785,0.027735
4,0.028687,0.041623,0.000000,0.008793,1.000000,0.000000,0.022912,0.028676,0.000000,0.023538,0.016695,0.000000,0.027047,0.013766,0.014130,0.000000,0.042218,0.026063,0.011792,0.016861,0.014546,0.000000,0.000000,0.000000,0.028304,0.000000,0.064792,0.046028,0.026343,0.019018,0.000000,0.019027,0.026188,0.036808,0.0,0.000000,0.037018,0.000000,0.000000,0.025814,...,0.009146,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.032717,0.000000,0.000000,0.000000,0.000000,0.008873,0.000000,0.000000,0.000000,0.000000,0.009513,0.013763,0.007763,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.014875,0.018762,0.014800,0.0,0.000000,0.000000,0.000000,0.010760,0.0,0.010514,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4795,0.000000,0.012593,0.000000,0.000000,0.010760,0.000000,0.024914,0.017145,0.020977,0.010038,0.000000,0.000000,0.000000,0.016141,0.019743,0.000000,0.000000,0.013427,0.000000,0.016700,0.007436,0.006002,0.006310,0.000000,0.000000,0.006788,0.000000,0.000000,0.000000,0.008724,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.007232,0.000000,...,0.005546,0.020151,0.012338,0.0,0.000000,0.0,0.110187,0.014655,0.0,0.021020,0.016082,0.000000,0.011000,0.009353,0.013610,0.018403,0.000000,0.000000,0.025088,0.010277,0.019054,0.000000,0.0,0.000000,0.014755,0.000000,0.000000,0.0,0.012220,0.031123,0.000000,0.0,0.013552,0.000000,0.003977,1.000000,0.0,0.000000,0.004570,0.000000
4796,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.073474,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,1.0,0.000000,0.000000,0.000000
4797,0.006892,0.022391,0.011682,0.028354,0.010514,0.016354,0.000000,0.000000,0.023586,0.013142,0.012007,0.023835,0.000000,0.013592,0.016427,0.000000,0.015721,0.016371,0.020660,0.000000,0.000000,0.000000,0.000000,0.000000,0.006222,0.000000,0.005640,0.000000,0.005318,0.011204,0.010072,0.008244,0.028267,0.000000,0.0,0.000000,0.010415,0.009583,0.042869,0.005675,...,0.065486,0.012696,0.000000,0.0,0.020066,0.0,0.000000,0.000000,0.0,0.006285,0.009440,0.000000,0.006640,0.000000,0.012624,0.000000,0.008010,0.015298,0.000000,0.000000,0.024961,0.020398,0.0,0.006862,0.041406,0.000000,0.020548,0.0,0.000000,0.000000,0.006412,0.0,0.012920,0.016919,0.000000,0.000000,0.0,1.000000,0.007148,0.000000
4798,0.000000,0.013724,0.000000,0.021785,0.000000,0.000000,0.016650,0.007292,0.015781,0.008183,0.011078,0.000000,0.000000,0.000000,0.024650,0.018150,0.000000,0.004235,0.005827,0.018861,0.000000,0.000000,0.000000,0.000000,0.077971,0.008034,0.000000,0.000000,0.000000,0.021918,0.000000,0.000000,0.000000,0.000000,0.0,0.043089,0.012998,0.008934,0.013556,0.083580,...,0.014991,0.000000,0.000000,0.0,0.006774,0.0,0.000000,0.046190,0.0,0.012656,0.000000,0.000000,0.010497,0.009197,0.008101,0.000000,0.000000,0.017678,0.000000,0.000000,0.022308,0.007524,0.0,0.023684,0.007132,0.000000,0.000000,0.0,0.024790,0.000000,0.000000,0.0,0.000000,0.011651,0.000000,0.004570,0.0,0.007148,1.000000,0.006692


In [18]:
movie_similarities = doc_sim_df.iloc[movie_idx].values
movie_similarities

array([0.0104544 , 0.01072835, 0.        , ..., 0.00690954, 0.        ,
       0.        ])

In [19]:
movie_similarities.shape

(4800,)

#### Get top 5 similar movie IDs

In [20]:
similar_movie_idxs = np.argsort(-movie_similarities)[1:6]
similar_movie_idxs

array([506, 614, 241, 813, 154])
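The `[1:6]` slice relies on the query movie sorting into first place with its self-similarity of 1.0. A slightly more defensive variant, sketched here with a hypothetical helper `top_k_similar` and a made-up similarity row, masks the query index explicitly so that ties (for example, two movies with identical descriptions) cannot push the query into the results.

```python
import numpy as np

def top_k_similar(similarities, query_idx, k=5):
    """Indices of the k highest scores, never including query_idx itself."""
    scores = np.asarray(similarities, dtype=float).copy()
    scores[query_idx] = -np.inf  # mask the query so it cannot be returned
    # a stable sort keeps the original order among tied scores
    return np.argsort(-scores, kind="stable")[:k]

# Hypothetical similarity row for item 2; item 4 has an identical description
toy_row = np.array([0.10, 0.40, 1.00, 0.05, 1.00, 0.30])
top3 = top_k_similar(toy_row, query_idx=2, k=3)
print(top3)
```

When the query movie is the unique top score, this should agree with the slice-based approach.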

#### Get top 5 similar movies

In [21]:
similar_movies = movies_list[similar_movie_idxs]
similar_movies

array(['Despicable Me 2', 'Despicable Me',
       'Teenage Mutant Ninja Turtles: Out of the Shadows', 'Superman',
       'Rise of the Guardians'], dtype=object)

### Build a movie recommender function to recommend top 5 similar movies for any movie 

The movie title, movie title list and document similarity matrix dataframe will be given as inputs to the function.

In [23]:
def movie_recommender(movie_title, movies=movies_list, doc_sims=doc_sim_df):
    # find movie id
    movie_idx = np.where(movies == movie_title)[0][0]
    # get movie similarities
    movie_similarities = doc_sims.iloc[movie_idx].values
    # get top 5 similar movie IDs
    similar_movie_idxs = np.argsort(-movie_similarities)[1:6]
    # get top 5 movies
    similar_movies = movies[similar_movie_idxs]
    # return the top 5 movies
    return similar_movies
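As a possible extension, the sketch below returns the similarity scores alongside the titles and raises a clear error for an unknown title, instead of the `IndexError` that `[0][0]` would produce. The function name `recommend_with_scores`, the parameter `k` and the toy inputs are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def recommend_with_scores(movie_title, movies, doc_sims, k=5):
    """Return a DataFrame of the k most similar titles and their scores."""
    matches = np.where(movies == movie_title)[0]
    if len(matches) == 0:
        raise KeyError(f"Title not found: {movie_title!r}")
    movie_idx = matches[0]
    # copy the row so that masking does not modify the similarity matrix
    sims = doc_sims.iloc[movie_idx].to_numpy(dtype=float).copy()
    sims[movie_idx] = -np.inf  # never recommend the query movie itself
    top = np.argsort(-sims, kind="stable")[:k]
    return pd.DataFrame({"title": movies[top], "similarity": sims[top]})

# Hypothetical titles and a hand-written symmetric similarity matrix
toy_titles = np.array(["A", "B", "C", "D"], dtype=object)
toy_sims = pd.DataFrame([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])
recs = recommend_with_scores("A", toy_titles, toy_sims, k=2)
print(recs)
```

Seeing the scores makes it easier to judge when a recommendation is only weakly related to the query.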

## Sort Dataset by Popular Movies

In [24]:
pop_movies = df.sort_values(by='popularity', ascending=False)
pop_movies.head()

Unnamed: 0,title,tagline,overview,genres,popularity,description
546,Minions,"Before Gru, they had a history of bad bosses","Minions Stuart, Kevin and Bob are recruited by...","[{""id"": 10751, ""name"": ""Family""}, {""id"": 16, ""...",875.581305,"Before Gru, they had a history of bad bosses M..."
95,Interstellar,Mankind was born on Earth. It was never meant ...,Interstellar chronicles the adventures of a gr...,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 18, ""...",724.247784,Mankind was born on Earth. It was never meant ...
788,Deadpool,Witness the beginning of a happy ending,Deadpool tells the origin story of former Spec...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",514.569956,Witness the beginning of a happy ending Deadpo...
94,Guardians of the Galaxy,All heroes start somewhere.,"Light years from Earth, 26 years after being a...","[{""id"": 28, ""name"": ""Action""}, {""id"": 878, ""na...",481.098624,All heroes start somewhere. Light years from E...
127,Mad Max: Fury Road,What a Lovely Day.,An apocalyptic story set in the furthest reach...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",434.278564,What a Lovely Day. An apocalyptic story set in...


In [25]:
popular_movies = ['Minions', 'Interstellar', 'Deadpool', 'Jurassic World', 'Pirates of the Caribbean: The Curse of the Black Pearl',
              'Dawn of the Planet of the Apes', 'The Hunger Games: Mockingjay - Part 1', 'Terminator Genisys', 
              'Captain America: Civil War', 'The Dark Knight', 'The Martian', 'Batman v Superman: Dawn of Justice', 
              'Pulp Fiction', 'The Godfather', 'The Shawshank Redemption', 'The Lord of the Rings: The Fellowship of the Ring',  
              'Harry Potter and the Chamber of Secrets', 'Star Wars', 'The Hobbit: The Battle of the Five Armies',
              'Iron Man']

## Get Popular Movie Recommendations

In [26]:
for movie in popular_movies:
    print('Movie:', movie)
    print('Top 5 recommended Movies:', movie_recommender(movie_title=movie))
    print()

Movie: Minions
Top 5 recommended Movies: ['Despicable Me 2' 'Despicable Me'
 'Teenage Mutant Ninja Turtles: Out of the Shadows' 'Superman'
 'Rise of the Guardians']

Movie: Interstellar
Top 5 recommended Movies: ['Gattaca' 'Space Pirate Captain Harlock' 'Space Cowboys'
 'Starship Troopers' 'Final Destination 2']

Movie: Deadpool
Top 5 recommended Movies: ['Silent Trigger' 'Underworld: Evolution' 'Bronson' 'Shaft' 'Don Jon']

Movie: Jurassic World
Top 5 recommended Movies: ['Jurassic Park' 'The Lost World: Jurassic Park' 'The Nut Job'
 "National Lampoon's Vacation" 'Vacation']

Movie: Pirates of the Caribbean: The Curse of the Black Pearl
Top 5 recommended Movies: ["Pirates of the Caribbean: Dead Man's Chest" 'The Pirate'
 'Pirates of the Caribbean: On Stranger Tides'
 'The Pirates! In an Adventure with Scientists!' 'Joyful Noise']

Movie: Dawn of the Planet of the Apes
Top 5 recommended Movies: ['Battle for the Planet of the Apes' 'Groove' 'The Other End of the Line'
 'Chicago Overcoat

# BM25 Similarity

In [27]:
"""
Data:
-----
.. data:: PARAM_K1 - Free smoothing parameter for BM25.
.. data:: PARAM_B - Free smoothing parameter for BM25.
.. data:: EPSILON - Constant used for negative idf of document in corpus.
"""

import math
from six import iteritems
from six.moves import xrange

PARAM_K1 = 2.5
PARAM_B = 0.85
EPSILON = 0.2

class BM25(object):
    """Implementation of Best Matching 25 ranking function.
    Attributes
    ----------
    corpus_size : int
        Size of corpus (number of documents).
    avgdl : float
        Average length of document in `corpus`.
    corpus : list of list of str
        Corpus of documents.
    f : list of dicts of int
        Term frequencies for each document in `corpus`. Words used as keys and frequencies as values.
    df : dict
        Document frequencies for the whole `corpus`. Words used as keys and the number of documents containing each word as values.
    idf : dict
        Inverse document frequencies for the whole `corpus`. Words used as keys and idf scores as values.
    doc_len : list of int
        List of document lengths.
    """

    def __init__(self, corpus):
        """
        Parameters
        ----------
        corpus : list of list of str
            Given corpus.
        """
        self.corpus_size = len(corpus)
        self.avgdl = sum(float(len(x)) for x in corpus) / self.corpus_size
        self.corpus = corpus
        self.f = []
        self.df = {}
        self.idf = {}
        self.doc_len = []
        self.initialize()

    def initialize(self):
        """Calculates frequencies of terms in documents and in corpus. Also computes inverse document frequencies."""
        for document in self.corpus:
            frequencies = {}
            self.doc_len.append(len(document))
            for word in document:
                if word not in frequencies:
                    frequencies[word] = 0
                frequencies[word] += 1
            self.f.append(frequencies)

            for word, freq in iteritems(frequencies):
                if word not in self.df:
                    self.df[word] = 0
                self.df[word] += 1

        for word, freq in iteritems(self.df):
            self.idf[word] = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)

    def get_score(self, document, index, average_idf):
        """Computes BM25 score of given `document` in relation to item of corpus selected by `index`.
        Parameters
        ----------
        document : list of str
            Document to be scored.
        index : int
            Index of document in corpus selected to score with `document`.
        average_idf : float
            Average idf in corpus.
        Returns
        -------
        float
            BM25 score.
        """
        score = 0
        for word in document:
            if word not in self.f[index]:
                continue
            idf = self.idf[word] if self.idf[word] >= 0 else EPSILON * average_idf
            score += (idf * self.f[index][word] * (PARAM_K1 + 1)
                      / (self.f[index][word] + PARAM_K1 * (1 - PARAM_B + PARAM_B * self.doc_len[index] / self.avgdl)))
        return score

    def get_scores(self, document, average_idf):
        """Computes and returns BM25 scores of given `document` in relation to
        every item in corpus.
        Parameters
        ----------
        document : list of str
            Document to be scored.
        average_idf : float
            Average idf in corpus.
        Returns
        -------
        list of float
            BM25 scores.
        """
        scores = []
        for index in xrange(self.corpus_size):
            score = self.get_score(document, index, average_idf)
            scores.append(score)
        return scores


def get_bm25_weights(corpus):
    """Returns BM25 scores (weights) of documents in corpus.
    Each document has to be weighted with every document in given corpus.
    Parameters
    ----------
    corpus : list of list of str
        Corpus of documents.
    Returns
    -------
    list of list of float
        BM25 scores.
    Examples
    --------
    >>> from gensim.summarization.bm25 import get_bm25_weights
    >>> corpus = [
    ...     ["black", "cat", "white", "cat"],
    ...     ["cat", "outer", "space"],
    ...     ["wag", "dog"]
    ... ]
    >>> result = get_bm25_weights(corpus)
    """
    bm25 = BM25(corpus)
    average_idf = sum(float(val) for val in bm25.idf.values()) / len(bm25.idf)

    weights = []
    for doc in corpus:
        scores = bm25.get_scores(doc, average_idf)
        weights.append(scores)

    return weights
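Read off the code above, the score computed by `get_score` for a query document $Q$ against corpus document $D_i$ can be written as follows. The symbols are notation introduced here for illustration: $f(w, D_i)$ is the count of word $w$ in $D_i$, $|D_i|$ is the document length, $N$ is the corpus size, $n(w)$ is the number of documents containing $w$, and $k_1$ and $b$ correspond to `PARAM_K1 = 2.5` and `PARAM_B = 0.85`.

$$ score(Q, D_i) = \sum_{w \in Q,\ f(w, D_i) > 0} idf(w) \cdot \frac{f(w, D_i) \, (k_1 + 1)}{f(w, D_i) + k_1 \left(1 - b + b \, \frac{|D_i|}{avgdl}\right)} $$

$$ idf(w) = \log(N - n(w) + 0.5) - \log(n(w) + 0.5) $$

The sum runs over every token of $Q$, so a word repeated in the query contributes once per occurrence. When $idf(w)$ is negative (a word present in more than half of the documents), the code substitutes `EPSILON` times the average idf. Unlike cosine similarity, this score is neither bounded by 1 nor symmetric, since the length normalization applies only to $D_i$.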

In [28]:
norm_corpus_tokens = np.array([nltk.word_tokenize(doc) for doc in norm_corpus], dtype=object)  # dtype=object: token lists have different lengths
norm_corpus_tokens[:3]

array([list(['enter', 'world', 'pandora', '22nd', 'century', 'paraplegic', 'marine', 'dispatched', 'moon', 'pandora', 'unique', 'mission', 'becomes', 'torn', 'following', 'orders', 'protecting', 'alien', 'civilization']),
       list(['end', 'world', 'adventure', 'begins', 'captain', 'barbossa', 'long', 'believed', 'dead', 'come', 'back', 'life', 'headed', 'edge', 'earth', 'turner', 'elizabeth', 'swann', 'nothing', 'quite', 'seems']),
       list(['plan', 'one', 'escapes', 'cryptic', 'message', 'bonds', 'past', 'sends', 'trail', 'uncover', 'sinister', 'organization', 'battles', 'political', 'forces', 'keep', 'secret', 'service', 'alive', 'bond', 'peels', 'back', 'layers', 'deceit', 'reveal', 'terrible', 'truth', 'behind', 'spectre'])],
      dtype=object)

In [29]:
%%time
wts = get_bm25_weights(norm_corpus_tokens)

CPU times: user 1min 24s, sys: 376 ms, total: 1min 24s
Wall time: 1min 24s


In [30]:
bm25_wts_df = pd.DataFrame(wts)
bm25_wts_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,4760,4761,4762,4763,4764,4765,4766,4767,4768,4769,4770,4771,4772,4773,4774,4775,4776,4777,4778,4779,4780,4781,4782,4783,4784,4785,4786,4787,4788,4789,4790,4791,4792,4793,4794,4795,4796,4797,4798,4799
0,149.060647,2.529227,0.000000,3.692476,5.765205,4.715867,0.000000,4.505193,0.000000,1.750501,0.000000,3.116295,0.000000,0.000000,2.312638,0.000000,2.049391,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,2.217683,0.000000,6.666185,7.011092,2.529227,4.729113,0.000000,10.205337,1.408206,0.000000,0.0,0.000000,9.692345,0.000000,0.000000,2.173071,...,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,8.774843,0.000000,0.000000,1.322035,0.000000,0.000000,0.000000,0.000000,2.518730,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.000000,2.589865,0.0,3.310184,5.061290,0.000000,0.000000,0.000000,1.619299,0.000000,0.000000
1,2.653483,119.903490,2.720199,0.000000,7.297372,0.000000,2.496650,5.774763,5.870872,1.750501,0.000000,0.000000,9.888388,2.131686,6.387875,1.294931,7.924214,6.743266,6.371557,0.000000,3.383689,0.000000,2.953141,4.014206,2.217683,6.574114,6.569765,4.486128,2.529227,1.426806,0.000000,2.790577,1.408206,0.000000,0.0,5.542147,3.187983,9.474867,3.734578,2.173071,...,2.546256,0.000000,0.000000,0.0,3.082403,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,1.546107,1.918173,4.808794,5.446829,1.602627,0.000000,2.265038,3.597849,2.142276,3.307771,0.0,0.000000,0.000000,0.000000,0.00000,0.0,1.918173,3.866167,2.589865,0.0,1.011185,0.000000,0.000000,2.725888,0.000000,4.532292,2.924023,0.000000
2,0.000000,3.229716,153.756470,0.000000,0.000000,0.000000,0.000000,4.538740,4.378262,1.399834,0.000000,18.181881,3.563449,3.466622,0.000000,1.275790,2.616985,7.800675,6.477053,0.000000,0.000000,0.000000,4.802502,3.665749,0.000000,0.000000,0.000000,0.000000,0.000000,16.481614,0.000000,5.212177,3.470809,0.000000,0.0,17.808090,3.131733,0.000000,5.504570,0.000000,...,2.595865,4.278096,4.779134,0.0,11.673475,0.0,4.255612,0.000000,0.0,1.377511,0.000000,0.000000,0.000000,0.000000,2.745383,7.888864,2.450630,2.353135,0.000000,2.199671,4.468374,1.736372,0.0,0.000000,0.000000,4.822006,0.00000,0.0,1.889820,0.000000,9.088009,0.0,0.000000,0.000000,3.253547,0.000000,0.000000,2.803116,0.000000,1.377511
3,6.141419,0.000000,0.000000,214.277248,3.182421,0.000000,5.433209,6.839524,7.445837,18.496688,0.000000,0.000000,0.000000,0.000000,0.000000,2.332368,4.785231,0.000000,1.679735,3.216480,2.105650,6.646793,0.000000,0.000000,5.704174,5.170018,4.654951,0.000000,3.252882,6.972240,5.463190,6.915813,4.411040,0.000000,0.0,0.000000,7.834622,0.000000,8.411508,3.630557,...,2.855221,5.451895,6.308170,0.0,8.643454,0.0,2.499538,3.166719,0.0,5.907301,0.000000,0.000000,0.000000,0.000000,1.700292,0.000000,0.000000,0.000000,9.043414,0.000000,2.646838,2.815873,0.0,6.951130,0.000000,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.0,2.718450,0.000000,0.000000,0.000000,0.000000,8.037913,6.958000,7.156812
4,9.186831,10.791034,0.000000,2.665414,184.778486,0.000000,5.168103,7.278204,0.000000,6.643382,4.392354,0.000000,6.324939,3.455652,4.625277,0.000000,10.043073,7.121135,2.746088,3.519460,4.196293,0.000000,0.000000,0.000000,8.140458,0.000000,16.321723,10.410255,9.782488,5.237382,0.000000,5.581155,6.034426,9.894461,0.0,0.000000,8.262842,0.000000,0.000000,7.976699,...,2.082442,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,8.631082,0.000000,0.000000,0.000000,0.000000,2.644071,0.000000,0.000000,0.000000,0.000000,2.518730,2.793180,1.993356,0.0,0.000000,0.000000,0.000000,0.00000,0.0,4.242312,4.163380,5.179731,0.0,0.000000,0.000000,0.000000,2.996697,0.000000,3.238599,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4795,0.000000,4.129992,0.000000,0.000000,3.419436,0.000000,6.213082,5.906744,6.268116,3.142380,0.000000,0.000000,0.000000,4.314800,5.554264,0.000000,0.000000,5.080826,0.000000,3.953797,2.428782,2.305792,2.763315,0.000000,0.000000,1.751488,0.000000,0.000000,0.000000,2.594333,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,2.494587,0.000000,...,1.700823,7.564726,3.989733,0.0,0.000000,0.0,24.394992,4.373797,0.0,7.125419,4.549498,0.000000,3.525417,2.638578,4.194182,5.830620,0.000000,0.000000,8.499108,3.047800,5.065836,0.000000,0.0,0.000000,5.109791,0.000000,0.00000,0.0,4.242312,8.378132,0.000000,0.0,3.159860,0.000000,1.479444,235.914788,0.000000,0.000000,1.863847,0.000000
4796,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,14.585184,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,189.258241,0.000000,0.000000,0.000000
4797,2.653483,7.079109,3.687551,7.919203,2.979990,4.697215,0.000000,0.000000,8.340881,4.393390,3.892841,5.672843,0.000000,3.753564,5.809466,0.000000,5.361809,4.645907,5.495571,0.000000,0.000000,0.000000,0.000000,0.000000,2.217683,0.000000,2.011234,0.000000,2.529227,3.580983,3.005403,2.790577,5.891497,0.000000,0.0,0.000000,3.131733,3.653249,13.287434,2.173071,...,8.698384,4.278096,0.000000,0.0,5.230711,0.0,0.000000,0.000000,0.0,2.138845,3.042660,0.000000,2.365138,0.000000,3.785615,0.000000,2.708975,4.610279,0.000000,0.000000,6.589766,4.042180,0.0,1.813832,4.814481,0.000000,5.48348,0.0,0.000000,0.000000,2.589865,0.0,3.182323,4.283073,0.000000,0.000000,0.000000,201.190509,2.524138,0.000000
4798,0.000000,4.430187,0.000000,7.213866,0.000000,0.000000,5.133360,3.056832,5.301060,3.155903,3.866299,0.000000,0.000000,0.000000,8.227114,6.862149,0.000000,1.863847,1.974816,4.303434,0.000000,0.000000,0.000000,0.000000,19.310826,1.985172,0.000000,0.000000,0.000000,6.260225,0.000000,0.000000,0.000000,0.000000,0.0,13.152109,3.791803,3.653249,4.502952,19.342089,...,4.060760,0.000000,0.000000,0.0,2.670736,0.0,0.000000,10.127580,0.0,3.657240,0.000000,0.000000,3.456049,2.638578,2.446782,0.000000,0.000000,5.338933,0.000000,0.000000,6.399716,2.260347,0.0,4.505952,2.524138,0.000000,0.00000,0.0,8.661532,0.000000,0.000000,0.0,0.000000,3.525417,0.000000,1.863847,0.000000,2.448463,221.359423,3.105578
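
Each row of this matrix holds the BM25 scores of all movies for one query movie, and the largest value sits on the diagonal (the movie scored against itself). To make the ranking step concrete, here is a small sketch that picks the top-k most similar items from one row while skipping the query itself. The helper `top_k_indices` and the toy row are hypothetical; this is not the `movie_recommender` function defined earlier, only an illustration of the same idea.

```python
def top_k_indices(row, idx, k=5):
    """Return the positions of the k highest scores in `row`, excluding position `idx`."""
    # Sort column positions by score, highest first.
    ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    # Drop the query item itself, then keep the first k.
    return [j for j in ranked if j != idx][:k]


toy_row = [9.0, 2.0, 0.0, 4.0]   # made-up scores for item 0 against items 0..3
top_two = top_k_indices(toy_row, idx=0, k=2)
print(top_two)  # [3, 1]
```

With the DataFrame above, a row can be passed in as `bm25_wts_df.iloc[idx].tolist()`.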


In [31]:
for movie in popular_movies:
    print('Movie:', movie)
    print('Top 5 recommended Movies:', movie_recommender(movie_title=movie, doc_sims=bm25_wts_df))
    print()

Movie: Minions
Top 5 recommended Movies: ['Despicable Me 2' 'Despicable Me'
 'Teenage Mutant Ninja Turtles: Out of the Shadows' 'Intolerance'
 'Superman']

Movie: Interstellar
Top 5 recommended Movies: ['Space Pirate Captain Harlock' 'Prometheus' 'Starship Troopers' 'Gattaca'
 'Space Cowboys']

Movie: Deadpool
Top 5 recommended Movies: ['Underworld: Evolution' 'Shaft' 'Rabbit Hole' 'X-Men Origins: Wolverine'
 'Victor Frankenstein']

Movie: Jurassic World
Top 5 recommended Movies: ['Jurassic Park' 'The Lost World: Jurassic Park'
 "National Lampoon's Vacation" 'The Nut Job' 'Meet the Deedles']

Movie: Pirates of the Caribbean: The Curse of the Black Pearl
Top 5 recommended Movies: ['Pirates of the Caribbean: On Stranger Tides'
 "Pirates of the Caribbean: Dead Man's Chest" 'The Pirate'
 'Space Pirate Captain Harlock' 'Joyful Noise']

Movie: Dawn of the Planet of the Apes
Top 5 recommended Movies: ['Battle for the Planet of the Apes' 'Groove'
 'Conquest of the Planet of the Apes' 'The Othe