# Movie Recommendation System #

Data is increasingly used to build more efficient systems, and recommendation systems are a prime example. A recommendation system is a type of information filtering system: it improves the quality of search results and surfaces items that are more relevant to the search query or related to the user's search history. Recommender systems are used in a variety of areas; commonly recognized examples include playlist generators for video and music services (such as YouTube and Spotify), product recommenders for online stores (such as Amazon), and content recommenders for social media platforms and the open web. There are also popular recommender systems for specific topics such as restaurants and online dating. These systems vary in scope and complexity. Some use a single type of input, such as music, while others combine multiple inputs within and across platforms, such as news, books and search queries.

Content-Based Filtering: These systems suggest similar items based on a particular item. They use item metadata (for movies: genre, director, description, actors, etc.) to make recommendations. The fundamental idea is that if a person liked a particular item, he or she will also like an item that is similar to it.

I'll be building a baseline Movie Recommendation System using the "TMDB 5000 Movie Dataset" from Kaggle. I'll attempt to answer the following questions:
1. Based on user votes, what are the highest rated movies in the entire dataset? 
2. What are some recommended movies based on the selection of a particular movie?

First, let's load and analyze the movie data we're working with:

In [2]:
import pandas as pd  
import numpy as np    
import matplotlib.pyplot as plt 

# Read the two CSV files into pandas DataFrames.
# low_memory=False makes pandas parse each file in one pass instead of in chunks,
# which avoids mixed-dtype warnings caused by chunk-wise type inference.
# 'tmdb_5000_credits.csv' contains movie credits data
# 'tmdb_5000_movies.csv' contains movie details data
df1 = pd.read_csv('tmdb_5000_credits.csv', low_memory=False)  # Credits data
df2 = pd.read_csv('tmdb_5000_movies.csv', low_memory=False)   # Movie details data

The first dataset contains the following features:

- movie_id - A unique identifier for each movie.
- title - The title of the movie.
- cast - The names of the lead and supporting actors.
- crew - The names of the directors, editors, composers, writers, etc.

The second dataset has the following features:

- budget - The budget with which the movie was made.
- genres - The genres of the movie, such as action, comedy, thriller, etc.
- homepage - A link to the homepage (website) of the movie.
- id - This is the movie_id as in the first dataset.
- keywords - The keywords or tags related to the movie.
- original_language - The language in which the movie was made.
- original_title - The title of the movie before translation or adaptation.
- overview - A brief description of the movie.
- popularity - A numeric quantity specifying the movie popularity.
- production_companies - The production house of the movie.
- production_countries - The country in which it was produced.
- release_date - The date on which it was released.
- revenue - The worldwide revenue generated by the movie.
- runtime - The running time of the movie in minutes.
- spoken_languages - The languages spoken in the movie.
- status - "Released" or "Rumored".
- tagline - The movie's tagline.
- title - The title of the movie.
- vote_average - The average rating the movie received.
- vote_count - The number of votes received.

Let's merge the two data frames:

In [3]:
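# Rename the credits columns so that the key column matches 'id' in df2
df1.columns = ['id', 'title', 'cast', 'crew']

# Inner join on 'id'. Both frames have a 'title' column,
# so pandas suffixes them as 'title_x' and 'title_y'.
merged_df = pd.merge(df1, df2, on='id', how='inner')

merged_df.head(1)

# Drop the 'title_x' column so that only one title column remains
merged_df = merged_df.drop('title_x', axis=1)

merged_df.head(1)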

Unnamed: 0,id,cast,crew,budget,genres,homepage,keywords,original_language,original_title,overview,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title_y,vote_average,vote_count
0,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
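
The `title_x` and `title_y` names come from pandas itself: when both frames share a non-key column, `merge` appends the default suffixes `_x` and `_y`. A minimal sketch on two made-up frames (the rows and the names `credits` and `movies` are illustrative, not taken from the dataset) shows this behaviour, along with one possible way to avoid the duplicate column in the first place:

```python
import pandas as pd

# Two tiny, made-up frames that both carry a 'title' column
credits = pd.DataFrame({'id': [1, 2], 'title': ['A', 'B'], 'cast': ['x', 'y']})
movies = pd.DataFrame({'id': [1, 2], 'title': ['A', 'B'], 'budget': [10, 20]})

# Default behaviour: overlapping non-key columns get '_x' and '_y' suffixes
default_merge = pd.merge(credits, movies, on='id', how='inner')
print(list(default_merge.columns))

# Alternative: drop the duplicate column before merging so no suffix is needed
clean_merge = pd.merge(credits.drop(columns='title'), movies, on='id', how='inner')
print(list(clean_merge.columns))
```

Dropping the column before the merge would make the later rename of `title_y` unnecessary, at the cost of assuming the two title columns really are identical.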


Let's now rename some of the columns to better organize the data set:

In [4]:
# Rename 'title_y' column to 'title' and 'homepage' column to 'website'
merged_df = merged_df.rename(columns={'title_y': 'title', 'homepage': 'website'})
merged_df.head(1)

Unnamed: 0,id,cast,crew,budget,genres,website,keywords,original_language,original_title,overview,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


A plain average rating is unreliable for movies with only a handful of votes, so we rank with IMDb's weighted rating (wr) formula, which can be simplified as follows:

(v/(v+m) * R) + (m/(m+v) * C)

where,

- v is the number of votes for the movie;
- m is the minimum votes required to be listed in the chart;
- R is the average rating of the movie; and
- C is the mean vote across the whole report.

We already have v (vote_count) and R (vote_average) as columns in the dataset. C can be calculated as:

In [5]:
C = df2['vote_average'].mean()
C

6.092171559442016

The mean rating for all movies is roughly 6.1 on a scale of 10.

The next step is to determine an appropriate value for m, the minimum votes required to be listed in the chart. We will use the 90th percentile as our cutoff. In other words, for a movie to be featured in the charts, it must have more votes than at least 90% of the movies in the list.

In [6]:
m = df2['vote_count'].quantile(0.9)
m

1838.4000000000015

90% of the movies in the dataset have a vote count equal to or lower than 1838.
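
The fractional cutoff is a result of interpolation: by default, pandas' `quantile` interpolates linearly between the two nearest data points. A small sketch on made-up vote counts (not from the dataset) illustrates this:

```python
import pandas as pd

# Made-up vote counts: the integers 1 through 10
votes = pd.Series(range(1, 11))

# The 0.9 quantile sits at position 0.9 * (n - 1) = 8.1 in the sorted data,
# so pandas interpolates between the 9th and 10th values (9 and 10)
cutoff = votes.quantile(0.9)
print(cutoff)

# Share of values at or below the cutoff
share_below = (votes <= cutoff).mean()
print(share_below)
```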

Let's filter out the movies that qualify for the chart:

In [7]:
# Work on a copy of df2 so that adding columns later does not modify the original DataFrame
q_movies = df2.copy().loc[df2['vote_count'] >= m]
q_movies.shape

(481, 20)

481 movies qualify to be in this list. To calculate our metric for each qualified movie, we will define a function, weighted_rating(), and a new feature, score, whose value we'll calculate by applying this function to our DataFrame of qualified movies:

In [8]:
def weighted_rating(x, m=m, C=C):
    # 'v' represents the vote count for the movie
    v = x['vote_count']
    
    # 'R' represents the average vote for the movie
    R = x['vote_average']
    
    # 'C' is the mean vote across the whole dataset
    # 'm' is the cutoff value for the vote count (calculated quantile)
    
    # Calculation based on the IMDB formula
    # Weighted Rating Formula: (v/(v+m) * R) + (m/(m+v) * C)
    # The formula rates movies based on their average vote and the number of votes it has received
    return (v/(v+m) * R) + (m/(m+v) * C)

In [9]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)
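
Because the formula only involves column-wise arithmetic, the same score could also be computed without `apply`. The following is a sketch on a made-up three-row frame; `demo`, `m_demo` and `C_demo` are placeholders chosen for illustration, not the values computed in this notebook:

```python
import pandas as pd

# Made-up data standing in for q_movies
demo = pd.DataFrame({
    'vote_count': [2000, 5000, 10000],
    'vote_average': [7.0, 8.0, 6.5],
})
m_demo = 1800.0  # placeholder cutoff
C_demo = 6.0     # placeholder mean rating

def weighted_rating(x, m=m_demo, C=C_demo):
    v = x['vote_count']
    R = x['vote_average']
    return (v / (v + m) * R) + (m / (m + v) * C)

# Row-wise version, as in the notebook
row_wise = demo.apply(weighted_rating, axis=1)

# Vectorized version: the same arithmetic applied to whole columns at once
v = demo['vote_count']
R = demo['vote_average']
vectorized = (v / (v + m_demo) * R) + (m_demo / (m_demo + v) * C_demo)

# Largest absolute difference between the two approaches
print((row_wise - vectorized).abs().max())
```

On 481 rows the difference in speed is negligible, so the row-wise version is a reasonable choice for readability.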


Finally, let's sort the DataFrame based on the score feature and output the title, vote count, vote average and weighted rating (score) of the top 15 movies:

In [10]:
#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

#Print the top 15 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(15)

                                              title  vote_count  vote_average     score
1881                       The Shawshank Redemption        8205           8.5  8.059258
662                                      Fight Club        9413           8.3  7.939256
65                                  The Dark Knight       12002           8.2  7.920020
3232                                   Pulp Fiction        8428           8.3  7.904645
96                                        Inception       13752           8.1  7.863239
3337                                  The Godfather        5893           8.4  7.851236
95                                     Interstellar       10867           8.1  7.809479
809                                    Forrest Gump        7927           8.2  7.803188
329   The Lord of the Rings: The Return of the King        8064           8.1  7.727243
1990                        The Empire Strikes Back        5879           8.2  7.697884
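
As a sanity check, the top row of the table can be reproduced by hand from the values of C and m printed earlier. This is plain arithmetic on numbers already shown in this notebook, and the result should agree with the score listed for The Shawshank Redemption:

```python
# Values printed earlier in the notebook
C = 6.092171559442016   # mean vote_average
m = 1838.4000000000015  # 90th-percentile vote_count

# The Shawshank Redemption, from the table above
v = 8205   # vote_count
R = 8.5    # vote_average

# The rating is pulled from R toward C in proportion to m / (m + v)
score = (v / (v + m) * R) + (m / (m + v) * C)
print(round(score, 6))
```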


We have made our first recommender!

# Plot Description Based Recommender #
We will compute a pairwise similarity score for all movies based on their plot descriptions and recommend movies based on that similarity score. The plot description is given in the overview feature of our dataset. Let's take a look at the data:

In [11]:
df2['overview'].head(5)

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
Name: overview, dtype: object

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic that reflects how important a word is to a document in a collection or corpus: a word scores highly when it occurs often in one document but rarely across the rest. This technique is widely used in information retrieval and text mining.

The code below performs the following steps:

Importing Libraries: It starts by importing the necessary modules from scikit-learn.

Initializing TF-IDF Vectorizer: The TfidfVectorizer from scikit-learn is initialized. It is used to convert a collection of raw documents (in this case, the 'overview' column in the DataFrame) into a matrix of TF-IDF features.

Specifying Stop Words: English stop words ('the', 'a', etc.) are removed from the documents to focus on more meaningful words.

Handling Missing Values: NaN values in the 'overview' column are replaced with empty strings to prevent errors during processing.

Creating TF-IDF Matrix: The fit_transform() method computes the word counts (term frequency) and the inverse document frequency for each word in the 'overview' column. This results in a TF-IDF matrix (tfidf_matrix) representing each document's features in the corpus.

Outputting Matrix Shape: Finally, the shape of the TF-IDF matrix is printed to show the number of documents and the number of unique words (features) in the corpus.

This process converts textual data into a numerical matrix where each row represents a document, each column represents a unique word, and the values are TF-IDF scores representing the importance of each word in its document relative to the collection.

In [12]:
#Import TfIdfVectorizer from scikit-learn
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
df2['overview'] = df2['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df2['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(4803, 20978)

We see that over 20,000 distinct words were used to describe the 4,803 movies in our dataset.
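
To make the numbers inside such a matrix less abstract, here is a from-scratch sketch on three made-up sentences. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. Tokenization is simplified to whitespace splitting and no stop words are removed, so it is not expected to reproduce TfidfVectorizer's output on real text:

```python
import math
from collections import Counter

# Made-up corpus for illustration
docs = [
    "space marine explores alien planet",
    "pirate captain sails stormy sea",
    "marine biologist studies sea life",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n = len(docs)

# Document frequency: number of documents containing each word
df = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed inverse document frequency (assumed formula, see lead-in)
idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Return one L2-normalized TF-IDF row over the shared vocabulary."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
print(len(matrix), len(vocab))  # documents x vocabulary size

# 'marine' appears in two documents and 'alien' in one,
# so 'alien' receives the larger weight in the first row
row0 = dict(zip(vocab, matrix[0]))
print(row0['alien'] > row0['marine'])
```

Each row corresponds to one document and each column to one vocabulary word, which mirrors the (documents, words) shape reported above.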

With this matrix in hand, we can now compute a similarity score. There are several options, such as the Euclidean, Pearson and cosine similarity scores. Different scores work well in different scenarios, and it is often a good idea to experiment with different metrics.

We will use cosine similarity to calculate a numeric quantity that denotes the similarity between two movies. We use the cosine similarity score since it is independent of magnitude and is relatively easy and fast to calculate.

linear_kernel calculates the dot product between the TF-IDF vectors of each pair of samples. Because TfidfVectorizer L2-normalizes each row by default, every vector has unit length, so this dot product is exactly the cosine similarity, computed without a separate normalization step. Applied to all pairs in tfidf_matrix, it generates a symmetric similarity matrix.

This matrix contains similarity scores between each pair of samples. For instance, cosine_sim[i][j] holds the similarity score between the ith and jth samples in tfidf_matrix.
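
The relationship between the dot product and cosine similarity can be checked numerically. Cosine similarity is the dot product divided by the product of the two vector lengths, so for unit-length vectors the two coincide. A sketch with made-up vectors:

```python
import numpy as np

# Two made-up, non-normalized vectors
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

# Cosine similarity from its definition
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After L2 normalization, the plain dot product gives the same number
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = a_unit @ b_unit
print(np.isclose(cosine, dot_of_units))

# Scaling a vector changes the dot product but not the cosine similarity
scaled_cosine = (10 * a) @ b / (np.linalg.norm(10 * a) * np.linalg.norm(b))
print(np.isclose(cosine, scaled_cosine))
```

The second check is what "independent of magnitude" means in practice: a long overview and a short one can still score as similar if they use the same words in similar proportions.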

In [13]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

We are going to define a function that takes in a movie title as an input and outputs a list of the 15 most similar movies. First, we need a reverse mapping of movie titles and DataFrame indices. In other words, we need a mechanism to identify the index of a movie in our metadata DataFrame, given its title.

In [14]:
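# Construct a reverse map from movie titles to DataFrame indices.
# Keep the first row for any repeated title so that each lookup returns a single index.
indices = pd.Series(df2.index, index=df2['title'])
indices = indices[~indices.index.duplicated(keep='first')]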
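
One detail worth checking on a toy example is how repeated titles behave in such a map. `Series.drop_duplicates()` compares values rather than index labels, and the values here are row numbers, which are already unique. The sketch below uses a made-up frame `toy` (not the dataset) to contrast that with deduplicating on the index labels:

```python
import pandas as pd

# Made-up frame with one repeated title
toy = pd.DataFrame({'title': ['Alpha', 'Beta', 'Alpha']})

# drop_duplicates() looks at the values (row numbers 0, 1, 2),
# so nothing is removed and 'Alpha' still maps to two rows
by_value = pd.Series(toy.index, index=toy['title']).drop_duplicates()
print(len(by_value))

# Deduplicating on the index labels keeps one row per title
full = pd.Series(toy.index, index=toy['title'])
by_title = full[~full.index.duplicated(keep='first')]
print(by_title['Alpha'])
```

With a single index per title, a lookup returns a scalar that can be used directly to select one row of the similarity matrix.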

We are now in a good position to define our recommendation function. These are the steps we'll follow:

- Get the index of the movie given its title.
- Get the list of cosine similarity scores for that particular movie with all movies.
- Convert it into a list of tuples where the first element is its position and the second is the similarity score.
- Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.
- Get the top 15 elements of this list. Ignore the first element as it refers to self (the movie most similar to a particular movie is the movie itself).
- Return the titles corresponding to the indices of the top elements.

In [15]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 15 most similar movies
    sim_scores = sim_scores[1:16]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 15 most similar movies
    return df2['title'].iloc[movie_indices]
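
For larger catalogs, the enumerate-and-sort step could be replaced with NumPy's `argsort`. The sketch below uses a made-up 4x4 similarity matrix and a hypothetical helper, `top_k_similar`, which removes the query movie by index rather than assuming it sorts first:

```python
import numpy as np

# Made-up symmetric similarity matrix for four movies
sim = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.9],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.9, 0.4, 1.0],
])

def top_k_similar(idx, sim_matrix, k=2):
    """Return the indices of the k most similar rows, excluding idx itself."""
    order = np.argsort(sim_matrix[idx])[::-1]  # highest score first
    order = order[order != idx]                # remove the query movie
    return order[:k].tolist()

print(top_k_similar(0, sim))
print(top_k_similar(3, sim))
```

Removing the query by index also covers the case where another movie has an identical overview and ties with it at a similarity of 1.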

In [16]:
get_recommendations('Law Abiding Citizen')

2507                Slow Burn
2388               I Am Wrath
1181                      JFK
2193     Secret in Their Eyes
3360    House of 1000 Corpses
3088               The Iceman
2138                  Copycat
1767             Suspect Zero
4575        Napoleon Dynamite
1202             Legal Eagles
3       The Dark Knight Rises
65            The Dark Knight
1567              Warm Bodies
2275        The Frozen Ground
1049            Patriot Games
Name: title, dtype: object

In [17]:
get_recommendations('Old School')

40                                          Cars 2
2312                                     Neighbors
2268    Three Kingdoms: Resurrection of the Dragon
1435                                     Wimbledon
2990                              Jeepers Creepers
2964                         The Last Days on Mars
3883                                  Animal House
163                                       Watchmen
2334                                       Birdman
2861                                 Sorority Boys
34                             Monsters University
1385                  Neighbors 2: Sorority Rising
771                                      Dragonfly
753                                   The Sentinel
3797                                  Futuro Beach
Name: title, dtype: object

In [18]:
get_recommendations('Wall Street: Money Never Sleeps')

3117                   The Good Guy
2610                 A Mighty Heart
298         The Wolf of Wall Street
2110     Madea's Witness Protection
2448                    Wall Street
2987               The Mighty Ducks
57                           WALL·E
3727                     Easy Money
4142                Supercapitalist
394                     Tower Heist
16                     The Avengers
3837                  She's the One
737       Jack Ryan: Shadow Recruit
3500                    Lucky Break
1098    Love in the Time of Cholera
Name: title, dtype: object

That is how to make a Plot Description Based Recommender. I can input a movie such as 'Old School' and the program will output 15 similar movies. 

In [89]:
merged_df.to_csv('output.csv', index=False)  # Save DataFrame to 'output.csv'