# TMDb Movie Data Analysis and Building a Movie Recommendation System
## Part 3: Recommender System Through Content-Based Filtering

### In this section, we build a recommendation system from the dataset we previously cleaned and explored. We first rank movies with a weighted rating that combines vote average and vote count, then recommend the Top 10 movies whose overviews are most similar to a title you choose.

In [120]:
import pandas as pd
import numpy as np

## Loading the Dataset

In [121]:
df = pd.read_csv('movies_final.csv')
df

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,...,vote_average,popularity,runtime,overview,spoken_languages,poster_path,cast,cast_size,crew_size,director
0,862,Toy Story,,1995-10-30,Animation|Comedy|Family,Toy Story Collection,en,30.0,373.554033,Pixar Animation Studios,...,7.7,21.946943,81.0,"Led by Woody, Andy's toys live happily in his ...",English,<img src='http://image.tmdb.org/t/p/w185//uXDf...,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,13,106,John Lasseter
1,8844,Jumanji,Roll the dice and unleash the excitement!,1995-12-15,Adventure|Fantasy|Family,,en,65.0,262.797249,TriStar Pictures|Teitler Film|Interscope Commu...,...,6.9,17.015539,104.0,When siblings Judy and Peter discover an encha...,English|Français,<img src='http://image.tmdb.org/t/p/w185//vgpX...,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,26,16,Joe Johnston
2,15602,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,1995-12-22,Romance|Comedy,Grumpy Old Men Collection,en,,,Warner Bros.|Lancaster Gate,...,6.5,11.712900,101.0,A family wedding reignites the ancient feud be...,English,<img src='http://image.tmdb.org/t/p/w185//1FSX...,Walter Matthau|Jack Lemmon|Ann-Margret|Sophia ...,7,4,Howard Deutch
3,31357,Waiting to Exhale,Friends are the people who let you be yourself...,1995-12-22,Comedy|Drama|Romance,,en,16.0,81.452156,Twentieth Century Fox Film Corporation,...,6.1,3.859495,127.0,"Cheated on, mistreated and stepped on, the wom...",English,<img src='http://image.tmdb.org/t/p/w185//4wjG...,Whitney Houston|Angela Bassett|Loretta Devine|...,10,10,Forest Whitaker
4,11862,Father of the Bride Part II,Just When His World Is Back To Normal... He's ...,1995-02-10,Comedy,Father of the Bride Collection,en,,76.578911,Sandollar Productions|Touchstone Pictures,...,5.7,8.387519,106.0,Just when George Banks has recovered from his ...,English,<img src='http://image.tmdb.org/t/p/w185//lf9R...,Steve Martin|Diane Keaton|Martin Short|Kimberl...,12,7,Charles Shyer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44686,439050,Subdue,Rising and falling between a man and woman,,Drama|Family,,fa,,,,...,4.0,0.072051,90.0,Rising and falling between a man and woman.,فارسی,<img src='http://image.tmdb.org/t/p/w185//pfC8...,Leila Hatami|Kourosh Tahami|Elham Korda,3,9,Hamid Nematollah
44687,111109,Century of Birthing,,2011-11-17,Drama,,tl,,,Sine Olivia,...,9.0,0.178241,360.0,An artist struggles to finish his work while a...,,<img src='http://image.tmdb.org/t/p/w185//xZkm...,Angel Aquino|Perry Dizon|Hazel Orencio|Joel To...,11,6,Lav Diaz
44688,67758,Betrayal,A deadly game of wits.,2003-08-01,Action|Drama|Thriller,,en,,,American World Pictures,...,3.8,0.903007,90.0,"When one of her hits goes wrong, a professiona...",English,<img src='http://image.tmdb.org/t/p/w185//eGga...,Erika Eleniak|Adam Baldwin|Julie du Page|James...,15,5,Mark L. Lester
44689,227506,Satan Triumphant,,1917-10-21,,,en,,,Yermoliev,...,,0.003503,87.0,"In a small town live two brothers, one a minis...",,<img src='http://image.tmdb.org/t/p/w185//aorB...,Iwan Mosschuchin|Nathalie Lissenko|Pavel Pavlo...,5,2,Yakov Protazanov


In [122]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44691 entries, 0 to 44690
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     44691 non-null  int64  
 1   title                  44691 non-null  object 
 2   tagline                20284 non-null  object 
 3   release_date           44657 non-null  object 
 4   genres                 42586 non-null  object 
 5   belongs_to_collection  4463 non-null   object 
 6   original_language      44681 non-null  object 
 7   budget_musd            8854 non-null   float64
 8   revenue_musd           7385 non-null   float64
 9   production_companies   33356 non-null  object 
 10  production_countries   38835 non-null  object 
 11  vote_count             44691 non-null  float64
 12  vote_average           42077 non-null  float64
 13  popularity             44691 non-null  float64
 14  runtime                43179 non-null  float64
 15  ov

### To keep the ranking reliable, we use the weighted rating formula used by IMDb. The raw vote average alone is misleading: a movie can have a vote average of 10 from a single review. The weighted rating also takes the number of votes into account.
### \begin{equation}
\text{Weighted Rating } (\mathbf{WR}) = \left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)
\end{equation}, where
* ### v is the number of votes for the movie;
* ### m is the minimum votes required to be listed in the chart;
* ### R is the average rating of the movie; and
* ### C is the mean vote across the whole report
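To see how the formula pulls sparsely rated movies toward the global mean, here is a minimal worked sketch. The values of `m` and `C` are rounded stand-ins chosen for illustration, and the two movies are hypothetical; none of these numbers come from the dataset.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own average R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative values only (assumptions, not the dataset's exact figures).
m, C = 164, 6.0

# A movie with one perfect review versus a widely rated movie.
one_vote = weighted_rating(v=1, R=10.0, m=m, C=C)
many_votes = weighted_rating(v=10_000, R=8.5, m=m, C=C)

print(round(one_vote, 3))
print(round(many_votes, 3))
```

With a single vote, the score stays close to `C`, while the heavily voted movie keeps a score close to its own average `R`.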

In [123]:
C = df['vote_average'].mean()
C

6.003341492976852

In [124]:
m = df['vote_count'].quantile(0.9)
m

164.0

## Creating a DataFrame that meets the vote count criteria.

In [125]:
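# Keep only movies at or above the vote-count threshold; copy the filtered rows so later column assignments do not touch df
movies = df.loc[df['vote_count'] >= m].copy()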
movies.shape

(4480, 22)
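The 0.9 quantile is why roughly one movie in ten survives the filter. The sketch below reimplements a quantile with linear interpolation between order statistics, which is the default rule in pandas. The function `quantile_linear` and the vote counts are hypothetical and only illustrate the cutoff.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q          # fractional position in the sorted list
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac

# Hypothetical vote counts: most movies have few votes, a handful have many.
vote_counts = [1, 2, 3, 5, 8, 13, 40, 90, 300, 5000]

m = quantile_linear(vote_counts, 0.9)
kept = [v for v in vote_counts if v >= m]

print(m, kept)
```

Because vote counts are heavily skewed, the threshold sits far above the typical movie, so only the most-voted titles remain.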

In [126]:
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDb weighted rating formula
    return (v / (v + m) * R) + (m / (v + m) * C)

In [127]:
# Define a new feature 'score' and calculate its value with 'weighted_rating()'
movies['score'] = movies.apply(weighted_rating, axis=1)

# Sort movies based on score calculated above
movies.sort_values(by='score', ascending=False, inplace=True)

# Preview the top 10 movies (no output appears because this is not the last expression in the cell)
movies[['title', 'vote_count', 'vote_average', 'score']].head(10)

# Resetting the index
movies.reset_index(drop=True, inplace=True)

In [128]:
movies

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,...,popularity,runtime,overview,spoken_languages,poster_path,cast,cast_size,crew_size,director,score
0,19404,Dilwale Dulhania Le Jayenge,Come... Fall In Love,1995-10-20,Comedy|Drama|Romance,,hi,13.2,100.000000,Yash Raj Films,...,34.457024,190.0,"Raj is a rich, carefree, happy-go-lucky second...",हिन्दी,<img src='http://image.tmdb.org/t/p/w185//2CAL...,Shah Rukh Khan|Kajol|Amrish Puri|Anupam Kher|S...,27,30,Aditya Chopra,8.484422
1,278,The Shawshank Redemption,Fear can hold you prisoner. Hope can set you f...,1994-09-23,Drama|Crime,,en,25.0,28.341469,Castle Rock Entertainment|Warner Bros.,...,51.645403,142.0,Framed in the 1940s for the double murder of h...,English,<img src='http://image.tmdb.org/t/p/w185//5KCV...,Tim Robbins|Morgan Freeman|Bob Gunton|Clancy B...,42,90,Frank Darabont,8.451954
2,238,The Godfather,An offer you can't refuse.,1972-03-14,Drama|Crime,The Godfather Collection,en,6.0,245.066411,Paramount Pictures|Alfran Productions,...,41.109264,175.0,"Spanning the years 1945 to 1955, a chronicle o...",English|Italiano|Latin,<img src='http://image.tmdb.org/t/p/w185//iVZ3...,Marlon Brando|Al Pacino|James Caan|Richard S. ...,58,42,Francis Ford Coppola,8.433831
3,155,The Dark Knight,Why So Serious?,2008-07-16,Drama|Action|Crime|Thriller,The Dark Knight Collection,en,185.0,1004.558444,DC Comics|Legendary Pictures|Warner Bros.|DC E...,...,123.167259,152.0,Batman raises the stakes in his war on crime. ...,English|普通话,<img src='http://image.tmdb.org/t/p/w185//qJ2t...,Christian Bale|Michael Caine|Heath Ledger|Aaro...,134,81,Christopher Nolan,8.269705
4,550,Fight Club,Mischief. Mayhem. Soap.,1999-10-15,Drama,,en,63.0,100.853753,Twentieth Century Fox Film Corporation|Regency...,...,63.869599,139.0,A ticking-time-bomb insomniac and a slippery s...,English,<img src='http://image.tmdb.org/t/p/w185//bptf...,Edward Norton|Brad Pitt|Meat Loaf|Jared Leto|H...,77,107,David Fincher,8.261730
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4475,205321,Sharknado,Enough said!,2013-07-11,TV Movie|Horror,Sharknado Collection,en,1.0,,The Asylum|Syfy|Southward Films,...,4.928195,86.0,"A freak hurricane hits Los Angeles, causing ma...",English,<img src='http://image.tmdb.org/t/p/w185//fCpJ...,Ian Ziering|Tara Reid|Cassandra Scerbo|John He...,14,16,Anthony C. Ferrante,4.357636
4476,13805,Disaster Movie,Your favorite movies are going to be destroyed.,2008-08-29,Action|Comedy,,en,25.0,14.109284,Grosvenor Park Media Ltd.|LionsGate|3 in the Box,...,11.634132,87.0,"In DISASTER MOVIE, the filmmaking team behind ...",English,<img src='http://image.tmdb.org/t/p/w185//3J8X...,Matt Lanter|Vanessa Lachey|Nicole Ari Parker|C...,15,13,Jason Friedberg,4.250116
4477,5491,Battlefield Earth,Take Back The Planet,2000-05-10,Action|Science Fiction|War,,en,44.0,21.400000,Franchise Pictures|Warner Bros.|JTP Films|Morg...,...,5.276926,118.0,"In the year 3000, man is no match for the Psyc...",English|Français,<img src='http://image.tmdb.org/t/p/w185//wXCR...,John Travolta|Barry Pepper|Forest Whitaker|Kim...,16,12,Roger Christian,4.164416
4478,9760,Epic Movie,We know it's big. We measured.,2007-01-25,Action|Adventure|Comedy,,en,20.0,86.865564,Twentieth Century Fox Film Corporation|Regency...,...,5.549609,86.0,"When Edward, Peter, Lucy and Susan each follow...",English,<img src='http://image.tmdb.org/t/p/w185//l0lG...,Kal Penn|Adam Campbell|Jennifer Coolidge|Jayma...,18,15,Jason Friedberg,4.123189


In [129]:
movies['overview'].head(10)

0    Raj is a rich, carefree, happy-go-lucky second...
1    Framed in the 1940s for the double murder of h...
2    Spanning the years 1945 to 1955, a chronicle o...
3    Batman raises the stakes in his war on crime. ...
4    A ticking-time-bomb insomniac and a slippery s...
5    A burger-loving hit man, his philosophical par...
6    The true story of how businessman Oskar Schind...
7    Under the direction of a ruthless instructor, ...
8    A ten year old girl who wanders away from her ...
9    A touching story of an Italian book seller of ...
Name: overview, dtype: object

### Using each movie's overview (its plot description), we compute the similarity between movies.

In [130]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF Vectorizer object. Remove all English stop words such as 'the', 'a', etc.
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
movies['overview'] = movies['overview'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(movies['overview'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape

(4480, 19536)
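Each row of the TF-IDF matrix is one overview and each column is one vocabulary term, which is why the shape is (movies, terms). The following plain-Python sketch shows the idea on three made-up overviews. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, and it skips the tokenization rules and stop-word removal that `TfidfVectorizer` applies, so the numbers will not match scikit-learn exactly.

```python
import math
from collections import Counter

# Hypothetical overviews, for illustration only.
docs = [
    "a toy cowboy and a toy astronaut",
    "a cowboy rides into town",
    "an astronaut is stranded in space",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)
# Document frequency: in how many overviews each term appears.
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}

def tfidf_vector(tokens):
    """Term count times smoothed IDF, then scaled to unit length."""
    counts = Counter(tokens)
    raw = [counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

matrix = [tfidf_vector(doc) for doc in tokenized]
print(len(matrix), len(vocab))  # rows = overviews, columns = vocabulary terms
```

In the first overview, "toy" and "a" both occur twice, but "toy" receives the larger weight because it appears in fewer overviews.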

In [131]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

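# Compute the cosine similarity matrix. TfidfVectorizer L2-normalizes each row by default, so the plain dot product from linear_kernel equals the cosine similarity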
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
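A dot product can stand in for cosine similarity only when the vectors already have unit length. The sketch below checks this on two hand-picked vectors in plain Python; the helper names and the vectors are illustrative and not part of the notebook.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 0.0, 4.0]
b = [1.0, 2.0, 2.0]

# After normalization, the dot product alone gives the cosine similarity.
ua, ub = l2_normalize(a), l2_normalize(b)
print(dot(ua, ub), cosine(a, b))
```

Skipping the division by the norms is what makes the dot-product route cheaper on a matrix with thousands of rows.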

In [132]:
pd.DataFrame(cosine_sim)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4470,4471,4472,4473,4474,4475,4476,4477,4478,4479
0,1.000000,0.009507,0.012428,0.000000,0.000000,0.000000,0.000000,0.018622,0.000000,0.027099,...,0.000000,0.000000,0.021003,0.000000,0.026402,0.010176,0.000000,0.010498,0.005564,0.000000
1,0.009507,1.000000,0.005216,0.000000,0.005947,0.000000,0.000000,0.021365,0.000000,0.005956,...,0.015150,0.005369,0.000000,0.000000,0.000000,0.010087,0.019574,0.000000,0.000000,0.000000
2,0.012428,0.005216,1.000000,0.030580,0.000000,0.037235,0.000000,0.000000,0.000000,0.072731,...,0.005820,0.030888,0.000000,0.013839,0.000000,0.000000,0.000000,0.000000,0.014148,0.015444
3,0.000000,0.000000,0.030580,1.000000,0.000000,0.017212,0.015430,0.000000,0.000000,0.011601,...,0.000000,0.088339,0.040266,0.000000,0.016947,0.015222,0.000000,0.000000,0.000000,0.019726
4,0.000000,0.005947,0.000000,0.000000,1.000000,0.010720,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.006154,0.000000,0.000000,0.000000,0.000000,0.018072,0.000000,0.000000,0.012663
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4475,0.010176,0.010087,0.000000,0.015222,0.000000,0.008637,0.000000,0.000000,0.000000,0.000000,...,0.049468,0.010895,0.000000,0.000000,0.013331,1.000000,0.025511,0.006899,0.000000,0.021895
4476,0.000000,0.019574,0.000000,0.000000,0.018072,0.019671,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.025511,1.000000,0.000000,0.010484,0.023236
4477,0.010498,0.000000,0.000000,0.000000,0.000000,0.007569,0.054882,0.035703,0.012434,0.000000,...,0.009102,0.000000,0.024012,0.000000,0.011682,0.006899,0.000000,1.000000,0.000000,0.021353
4478,0.005564,0.000000,0.014148,0.000000,0.000000,0.000000,0.056569,0.000000,0.089618,0.014850,...,0.033817,0.000000,0.010672,0.000000,0.000000,0.000000,0.010484,0.000000,1.000000,0.000000


In [133]:
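# Construct a reverse map from movie title to row position
# Note: drop_duplicates() compares the values (row positions), which are already unique, so any repeated titles are kept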
indices = pd.Series(movies.index, index=movies.title).drop_duplicates()
indices

title
Dilwale Dulhania Le Jayenge       0
The Shawshank Redemption          1
The Godfather                     2
The Dark Knight                   3
Fight Club                        4
                               ... 
Sharknado                      4475
Disaster Movie                 4476
Battlefield Earth              4477
Epic Movie                     4478
Dragonball Evolution           4479
Length: 4480, dtype: int64

In [134]:
indices['The Dark Knight Rises']

153
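A lookup by title behaves predictably only when each title maps to a single row position. The sketch below is a plain-Python analogue of the reverse map, not the notebook's pandas code, and the titles are made up. It keeps the first position for a repeated title, which is one possible policy among several.

```python
# Hypothetical titles; "Heat" appears twice.
titles = ["Heat", "Alien", "Heat", "Brazil"]

# Map each title to the first row position where it occurs.
title_to_pos = {}
for pos, title in enumerate(titles):
    title_to_pos.setdefault(title, pos)  # setdefault keeps the earlier position

print(title_to_pos["Heat"])
```

Because rows are sorted by score in the notebook, keeping the first position would favour the higher-scored movie among those sharing a title.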

In [176]:
from IPython.display import HTML
# Function that takes in movie title as input and outputs most similar movies
def recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]
    
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Keep the 10 most similar movies, skipping the first entry (normally the movie itself)
    sim_scores = sim_scores[1:11]
    
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    
    # Print a banner, then return the top 10 most similar movies as an HTML table
    print(f"{'='*36}\nBelow are the Top 10 Recommendations\n{'='*36}")
    #return movies['title'].iloc[movie_indices]
    return HTML(movies[['poster_path', 'title']].iloc[movie_indices].set_index(np.arange(1,11)).rename(columns={'poster_path': '', 'title': 'Top 10'}).to_html(escape=False))
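Slicing `[1:11]` assumes the query movie sorts first. That can fail when another movie ties with it, or when the query has an empty overview and therefore a self-similarity of 0. A hedged alternative is to exclude the query by its index before ranking. The function `top_k_similar` and the similarity row below are hypothetical and only sketch the selection step.

```python
def top_k_similar(sim_row, query_idx, k=10):
    """Return the indices of the k highest scores, never including the query itself."""
    candidates = [(i, s) for i, s in enumerate(sim_row) if i != query_idx]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in candidates[:k]]

# Hypothetical similarity row: index 3 is the query, and index 1 ties with it at 1.0.
sim_row = [0.10, 1.00, 0.45, 1.00, 0.30]

result = top_k_similar(sim_row, query_idx=3, k=3)
print(result)
```

With the positional slice, the tied movie at index 1 would be dropped and the query at index 3 would appear among its own recommendations; filtering by index avoids both.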

In [177]:
recommendations('Blade Runner')

Below are the Top 10 Recommendations


Unnamed: 0,Unnamed: 1,Top 10
1,,Desperado
2,,Furious 7
3,,Morgan
4,,Knight of Cups
5,,Blade
6,,The Hitcher
7,,The Girl with All the Gifts
8,,Unbroken
9,,Julieta
10,,Volcano
