# TMDb Movie Data Analysis and Building a Movie Recommendation System
## Part 3: Recommender System Through Content-Based Filtering

### In this section, we will vectorize all relevant columns and create a cosine similarity matrix to use for our movie recommendation system.

The underlying TMDb dataset has the following features:

* **id:** The ID of the movie (clear/unique identifier).
* **title:** The Official Title of the movie.
* **tagline:** The tagline of the movie.
* **release_date:** Theatrical Release Date of the movie.
* **genres:** Genres associated with the movie.
* **belongs_to_collection:** Gives information on the movie series/franchise the particular film belongs to.
* **original_language:** The language in which the movie was originally shot.
* **budget_musd:** The budget of the movie in million dollars.
* **revenue_musd:** The total revenue of the movie in million dollars.
* **production_companies:** Production companies involved with the making of the movie.
* **production_countries:** Countries where the movie was shot or produced.
* **vote_count:** The number of votes by users, as counted by TMDB.
* **vote_average:** The average rating of the movie.
* **popularity:** The Popularity Score assigned by TMDB.
* **runtime:** The runtime of the movie in minutes.
* **overview:** A brief blurb of the movie.
* **spoken_languages:** Spoken languages in the film.
* **poster_path:** The URL of the poster image.
* **cast:** (Main) Actors appearing in the movie.
* **cast_size:** Number of actors appearing in the movie.
* **director:** Director of the movie.
* **crew_size:** Size of the film crew (incl. director, excl. actors).

## Loading the main libraries

In [1]:
import pandas as pd
import numpy as np
from ast import literal_eval
import scipy.sparse
pd.options.display.max_columns = 30
#pd.set_option('precision', 2)

## Loading the Dataset

In [2]:
df = pd.read_csv('movies_complete.csv')
df.head()

Unnamed: 0,belongs_to_collection,budget_musd,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue_musd,runtime,spoken_languages,tagline,title,vote_average,vote_count,year,html,cast_names,crew_names,director,profit_musd,return_musd,runtime_hours,Franchise
0,Blondie Collection,,Comedy,3924,en,Blondie and Dagwood are about to celebrate the...,2.445,https://image.tmdb.org/t/p/w500/o6UMTE2LzQdlKV...,Columbia Pictures,United States of America,1938-11-30,,70.0,English,,Blondie,7.1,5,1938.0,<img src='https://image.tmdb.org/t/p/w500/o6UM...,Penny Singleton|Arthur Lake|Larry Simms|Daisy|...,Frank R. Strayer|Richard Flournoy,Frank R. Strayer,,,1.0,Franchise
1,,,Adventure,6124,de,Der Mann ohne Namen is a German adventure movi...,0.6,https://image.tmdb.org/t/p/w500/6xUbUCvndklbGV...,,Germany,1921-01-01,,420.0,,,"Peter Voss, Thief of Millions",,0,1921.0,<img src='https://image.tmdb.org/t/p/w500/6xUb...,Harry Liedtke|Georg Alexander|Mady Christians|...,Robert Liebmann|Frederik Fuglsang|Georg Jacoby...,Georg Jacoby,,,7.0,Stand-alone
2,,,Drama|Romance,8773,fr,Love at Twenty unites five directors from five...,4.985,https://image.tmdb.org/t/p/w500/aup2QCYCsyEeQf...,Ulysse Productions|Unitec Films|Cinesecolo|Toh...,Germany|France|Italy|Japan|Poland,1962-06-22,,110.0,Deutsch|Français|Italiano|日本語|Polski,The Intimate Secrets of Young Lovers,Love at Twenty,6.8,36,1962.0,<img src='https://image.tmdb.org/t/p/w500/aup2...,Jean-Pierre Léaud|Marie-France Pisier|Patrick ...,François Truffaut|François Truffaut|Gérard Bra...,François Truffaut,,,1.0,Stand-alone
3,New World Disorder,,,25449,en,Gee Atherton ripping the Worlds course the day...,1.337,https://image.tmdb.org/t/p/w500/okQY6jVmRU19CU...,,,2008-12-08,,69.0,English,,New World Disorder 9: Never Enough,4.5,2,2008.0,<img src='https://image.tmdb.org/t/p/w500/okQY...,Darren Berrecloth|Cameron McCaul|Paul Basagoit...,Derek Westerlund,Derek Westerlund,,,1.0,Franchise
4,,,Family,31975,en,"Elmo is making a very, very super special surp...",0.6,https://image.tmdb.org/t/p/w500/qKWcCmvGr4g0dg...,,,2010-01-05,,46.0,,,Sesame Street: Elmo Loves You!,,0,2010.0,<img src='https://image.tmdb.org/t/p/w500/qKWc...,,,,,,0.0,Stand-alone


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 578038 entries, 0 to 578037
Data columns (total 27 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   belongs_to_collection  15891 non-null   object 
 1   budget_musd            24889 non-null   float64
 2   genres                 425057 non-null  object 
 3   id                     578038 non-null  int64  
 4   original_language      578038 non-null  object 
 5   overview               496799 non-null  object 
 6   popularity             578038 non-null  float64
 7   poster_path            420706 non-null  object 
 8   production_companies   273625 non-null  object 
 9   production_countries   377206 non-null  object 
 10  release_date           556155 non-null  object 
 11  revenue_musd           13603 non-null   float64
 12  runtime                474092 non-null  float64
 13  spoken_languages       370286 non-null  object 
 14  tagline                90465 non-nul

## Dropping movies below a vote count threshold

In [4]:
min_votes = df.vote_count.quantile(0.98)
print(min_votes)

145.0
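
To make the threshold concrete: `quantile(0.98)` returns the vote count below which 98% of the movies fall, so filtering with `>=` keeps roughly the top 2% most-voted titles. A minimal sketch on made-up vote counts (the `toy_votes` series is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical vote counts 1..100, standing in for df.vote_count
toy_votes = pd.Series(range(1, 101))

# 98th percentile (pandas interpolates linearly between neighbours by default)
threshold = toy_votes.quantile(0.98)

# Keep only entries at or above the threshold, as the next cell does for movies
kept = toy_votes[toy_votes >= threshold]
print(threshold, len(kept))
```

On the real data the same rule keeps 11,574 of the 578,038 movies.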


In [5]:
# Keep only well-voted movies and the columns needed for the recommender
cols = ['id', 'title', 'year', 'vote_average', 'runtime', 'genres', 'cast_names', 'director', 'production_companies', 'overview', 'poster_path', 'html']
movies = df.loc[df.vote_count >= min_votes, cols].reset_index(drop=True)
print(f"Shape: {movies.shape}")
movies.head()

Shape: (11574, 12)


Unnamed: 0,id,title,year,vote_average,runtime,genres,cast_names,director,production_companies,overview,poster_path,html
0,2,Ariel,1988.0,6.8,73.0,Drama|Crime|Comedy,Turo Pajala|Susanna Haavisto|Matti Pellonpää|E...,Aki Kaurismäki,Villealfa Filmproductions,Taisto Kasurinen is a Finnish coal miner whose...,https://image.tmdb.org/t/p/w500/ojDg0PGvs6R9xY...,<img src='https://image.tmdb.org/t/p/w500/ojDg...
1,3,Shadows in Paradise,1986.0,7.2,74.0,Drama|Comedy,Matti Pellonpää|Kati Outinen|Sakari Kuosmanen|...,Aki Kaurismäki,Villealfa Filmproductions,"An episode in the life of Nikander, a garbage ...",https://image.tmdb.org/t/p/w500/nj01hspawPof0m...,<img src='https://image.tmdb.org/t/p/w500/nj01...
2,5,Four Rooms,1995.0,5.7,98.0,Crime|Comedy,Tim Roth|Jennifer Beals|Antonio Banderas|Valer...,Allison Anders,Miramax|A Band Apart,It's Ted the Bellhop's first night on the job....,https://image.tmdb.org/t/p/w500/75aHn1NOYXh4M7...,<img src='https://image.tmdb.org/t/p/w500/75aH...
3,6,Judgment Night,1993.0,6.5,110.0,Action|Thriller|Crime,Emilio Estevez|Cuba Gooding Jr.|Denis Leary|St...,Stephen Hopkins,Universal Pictures|Largo Entertainment|JVC,"While racing to a boxing match, Frank, Mike, J...",https://image.tmdb.org/t/p/w500/rYFAvSPlQUCeba...,<img src='https://image.tmdb.org/t/p/w500/rYFA...
4,11,Star Wars,1977.0,8.2,121.0,Adventure|Action|Science Fiction,Mark Hamill|Harrison Ford|Carrie Fisher|Peter ...,George Lucas,Lucasfilm Ltd.|20th Century Fox,Princess Leia is captured and held hostage by ...,https://image.tmdb.org/t/p/w500/6FfCtAuVAW8XJj...,<img src='https://image.tmdb.org/t/p/w500/6FfC...


In [6]:
movies.isnull().sum()

id                        0
title                     0
year                      0
vote_average              0
runtime                  40
genres                    6
cast_names               21
director                  3
production_companies    159
overview                 35
poster_path               0
html                      0
dtype: int64

## Removing rows with null values

In [7]:
movies = movies.dropna().reset_index(drop=True)
print(f"Shape: {movies.shape}")

Shape: (11349, 12)


## Creating a "tags" column by concatenating relevant columns

In [8]:
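# Join the descriptive columns into a single "|"-separated string per movie
movies['tags'] = movies.genres + '|' + movies.cast_names + '|' + movies.director + '|' + movies.production_companies
# Remove spaces so that multi-word names (e.g. "Tim Roth") become single tokens
movies['tags'] = movies.tags.str.replace(' ', '', regex=False)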

In [9]:
movies.tags

0        Drama|Crime|Comedy|TuroPajala|SusannaHaavisto|...
1        Drama|Comedy|MattiPellonpää|KatiOutinen|Sakari...
2        Crime|Comedy|TimRoth|JenniferBeals|AntonioBand...
3        Action|Thriller|Crime|EmilioEstevez|CubaGoodin...
4        Adventure|Action|ScienceFiction|MarkHamill|Har...
                               ...                        
11344    Animation|Comedy|Family|TomHiddleston|DanCaste...
11345    Action|ScienceFiction|Thriller|CourtneyLoggins...
11346    Thriller|DavidKross|HannoKoffler|MariaEhrich|R...
11347    Documentary|MichaelSchumacher|MickSchumacher|C...
11348    Thriller|Drama|Horror|KateSiegel|JasonO'Mara|D...
Name: tags, Length: 11349, dtype: object
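
It helps to see what a vectorizer will make of these strings. With scikit-learn's default settings, text is lowercased and split on any non-word character, so `|` acts as a separator, and removing the spaces beforehand keeps each name together as one token. Punctuation inside a name still splits it (`JasonO'Mara` becomes two tokens). A small sketch using `build_analyzer()` with `ngram_range=(1, 3)`; the sample strings are shortened for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The analyzer is the preprocessing + tokenizing + n-gram step of the vectorizer
analyzer = TfidfVectorizer(ngram_range=(1, 3)).build_analyzer()

# Yields the unigrams first, then the bigrams, then the trigram
tokens = analyzer("Crime|Comedy|TimRoth")
print(tokens)

# An apostrophe is a non-word character, so it splits the name in two
split_name = analyzer("JasonO'Mara")
print(split_name)
```

Because bigrams and trigrams are built from adjacent tokens, the order of the tags matters: `crime comedy` and `comedy crime` are different features.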

## Vectorizing the "tags" column

In [15]:
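from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# Alternative kept for reference: raw term counts instead of TF-IDF weights
#cv = CountVectorizer(ngram_range=(1,3))
# TF-IDF over word 1- to 3-grams of the tags
cv = TfidfVectorizer(ngram_range=(1,3))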
cv_matrix = cv.fit_transform(movies.tags)
print(f"cv_matrix.shape: {cv_matrix.shape}")

cv_matrix.shape: (11349, 956519)
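
The 956,519 columns are the distinct 1-, 2- and 3-grams found across all tags. Compared with raw counts, TF-IDF down-weights tokens that occur in many movies (a genre such as `drama`) relative to rare ones (a particular actor or director). A minimal sketch on three made-up tag strings, using unigrams only to keep the output readable:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical tags: "drama" appears in every document, "timroth" in only one
toy_tags = ["Drama|TimRoth", "Drama|Comedy", "Drama|Crime"]

vec = TfidfVectorizer()
vec.fit(toy_tags)

# idf_ holds one inverse-document-frequency weight per vocabulary entry
idf = {term: vec.idf_[col] for term, col in vec.vocabulary_.items()}
print(idf)
```

The rarer the token, the larger its weight, so two movies sharing an actor count as more similar than two movies that only share a common genre.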


## Creating a cosine similarity matrix

In [16]:
from sklearn.metrics.pairwise import cosine_similarity
# dense_output=False keeps the result as a SciPy sparse matrix
cs_matrix = cosine_similarity(cv_matrix, dense_output=False)
print(f"cs_matrix.shape: {cs_matrix.shape}")

cs_matrix.shape: (11349, 11349)
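
Cosine similarity between two rows is their dot product divided by the product of their lengths. `TfidfVectorizer` L2-normalizes each row by default, so for a TF-IDF matrix the cosine similarity reduces to a plain sparse matrix product. A small sketch on made-up tags showing the equivalence:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical tag strings in the same format as movies.tags
toy_tags = ["Crime|Comedy|TimRoth", "Drama|Crime|Comedy", "Documentary|MickSchumacher"]
X = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(toy_tags)

# Sparse input with dense_output=False gives a sparse result
cs = cosine_similarity(X, dense_output=False)

# Rows of X already have unit length, so X @ X.T gives the same numbers
dot = X @ X.T

print(np.round(cs.toarray(), 3))
```

For scale, a dense float64 array of shape (11349, 11349) holds about 128.8 million values, roughly 1.03 GB.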


## Creating a function that returns the most similar movies based on the title of a movie

In [17]:
from IPython.display import HTML
# Function that takes a movie title as input and returns the most similar movies
def recommendations(title, cosine_sim=cs_matrix):
    # Get the index of the first movie that matches the title exactly
    matches = movies.index[movies.title == title]
    if len(matches) == 0:
        raise ValueError(f"No movie titled {title!r} in the dataset")
    idx = matches[0]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx].toarray()[0]))
    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0 (normally the movie itself) and keep the next 10
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies as an HTML table with posters
    results = movies[['html', 'title']].iloc[movie_indices].set_index(np.arange(1,11)).rename(columns={'html': '', 'title': 'Top 10'})
    return HTML(results.to_html(escape=False))
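
The slice `[1:11]` relies on the queried movie sorting first. A variant that excludes the query index explicitly and uses `np.argsort` instead of sorting Python tuples might look like the following sketch. `top_k_similar` is a hypothetical helper, not part of the notebook, and it expects a dense 1-D row of scores such as `cosine_sim[idx].toarray()[0]`:

```python
import numpy as np

def top_k_similar(sim_row, idx, k=10):
    """Return the indices of the k highest scores in sim_row, excluding idx."""
    scores = np.asarray(sim_row, dtype=float).copy()
    scores[idx] = -np.inf            # never recommend the query movie itself
    k = min(k, scores.size - 1)      # guard against very small catalogues
    order = np.argsort(-scores, kind="stable")  # descending; ties keep original order
    return order[:k].tolist()

# Made-up similarity row for a catalogue of five movies; the query is movie 1
toy_row = [0.2, 1.0, 0.9, 0.0, 0.5]
print(top_k_similar(toy_row, idx=1, k=3))
```

The returned positions can be passed to `movies.iloc[...]` in the same way as `movie_indices`.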

In [18]:
recommendations('Star Wars')

Unnamed: 0,Unnamed: 1,Top 10
1,,The Empire Strikes Back
2,,Return of the Jedi
3,,The Star Wars Holiday Special
4,,Star Wars: Episode I - The Phantom Menace
5,,Star Wars: Episode III - Revenge of the Sith
6,,Raiders of the Lost Ark
7,,Star Wars: Episode II - Attack of the Clones
8,,Rollerball
9,,Willow
10,,The Elephant Man


In [19]:
recommendations('Toy Story')

Unnamed: 0,Unnamed: 1,Top 10
1,,Toy Story 2
2,,A Bug's Life
3,,Hawaiian Vacation
4,,Toy Story 3
5,,Tin Toy
6,,Small Fry
7,,Cars
8,,The Iron Giant
9,,Toy Story 4
10,,Buzz Lightyear of Star Command: The Adventure Begins


In [20]:
recommendations('Cobra')

Unnamed: 0,Unnamed: 1,Top 10
1,,Over the Top
2,,Zorba the Greek
3,,King Solomon's Mines
4,,American Ninja
5,,Death Wish 4: The Crackdown
6,,Death Wish II
7,,The Texas Chainsaw Massacre 2
8,,Leviathan
9,,American Ninja 2: The Confrontation
10,,Bullet to the Head


In [21]:
recommendations('The Godfather')

Unnamed: 0,Unnamed: 1,Top 10
1,,The Godfather: Part II
2,,The Godfather: Part III
3,,The Conversation
4,,Hearts of Darkness: A Filmmaker's Apocalypse
5,,The Freshman
6,,Cruising
7,,The Rainmaker
8,,Apocalypse Now
9,,Dog Day Afternoon
10,,The French Connection


In [22]:
recommendations('2001: A Space Odyssey')

Unnamed: 0,Unnamed: 1,Top 10
1,,Fear and Desire
2,,2010
3,,The Man in the Moon
4,,The Time Machine
5,,Into the Blue
6,,Untamed Heart
7,,Lolita
8,,Zabriskie Point
9,,The Haunting
10,,Poltergeist III


## Saving the "movies" DataFrame to a csv file

In [22]:
movies.to_csv('movies_streamlit.csv', index=False)

## Saving the cosine similarity matrix to an npz file

In [25]:
scipy.sparse.save_npz('cs_matrix.npz', cs_matrix)
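
When the app starts, both files can be read back with `pd.read_csv` and `scipy.sparse.load_npz`. A round-trip sketch on a tiny made-up matrix, written to a temporary directory so that it does not touch the real `cs_matrix.npz`:

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Hypothetical 2x2 similarity matrix standing in for cs_matrix
toy = scipy.sparse.csr_matrix(np.array([[1.0, 0.0], [0.5, 1.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_matrix.npz")
    scipy.sparse.save_npz(path, toy)        # same call as in the notebook
    restored = scipy.sparse.load_npz(path)  # what the app would call at startup

print(restored.toarray())
```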

### We have everything we need to create our app and deploy the movie recommendation system.