## ***Reel Insights: A Comprehensive Analysis of Movie Recommendation Systems***

**CAS Applied Data Science Final Project**

**By: Avisek Regmi, Avenue de la Foretaille 27b, Chambesy, 1292 CH**

![](https://camo.githubusercontent.com/68abad8a66113eb3c56dd584fa9b0b1fe4aab28200b3dfc61d3b00d40dba440c/68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f6a616c616a7468616e616b692f4d6f7669655f7265636f6d6d656e646174696f6e5f656e67696e652f6d61737465722f696d672f325f332e6a7067)

## **What is a Recommendation System?**

**A recommendation system, a subset of machine learning, leverages data to anticipate and streamline user preferences amidst an ever-expanding array of choices.**

Operating on Big Data, these AI algorithms analyze factors like past purchases and search history to suggest additional products tailored to individual users. By understanding user behavior and product characteristics, recommender systems facilitate personalized recommendations, driving engagement and enhancing user experience across various domains.

![alt text ](https://www.nvidia.com/content/dam/en-zz/Solutions/glossary/data-science/recommendation-system/img-1.png)

<img src='http://labs.criteo.com/wp-content/uploads/2017/08/CustomersWhoBought3.jpg' width=500>

Continuing my project on Movie Data Analysis and Recommendation Systems, I first explored cinema narratives and conducted a detailed analysis of movie metadata from TMDB. I also developed basic models to predict movie revenue and success, identifying key influencing features.

In this phase, I will implement various recommendation algorithms, including content-based, popularity-based, and collaborative filtering, aiming to create a robust recommendation system. I'll be working with two MovieLens datasets:

* **Full Dataset:** 26 million ratings, 750,000 tag applications, 45,000 movies, and 270,000 users, including 12 million tag relevance scores.
* **Small Dataset:** 100,000 ratings, 1,300 tag applications, 9,000 movies, and 700 users.

**The Full Dataset will be used to build a simple recommender, while the Small Dataset will support personalized recommendations, suited to my limited computing resources. My immediate goal is to develop the simple recommender system.**

In [None]:
! pip install scikit-surprise


Collecting scikit-surprise
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.4/154.4 kB 688.7 kB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp310-cp310-linux_x86_64.whl size=2357256 sha256=a014c39a6fba482d16eb188965d9481b44f3ed2aab38838bc2e47f271a21cbbb
  Stored in directory: /root/.cache/pip/wheels/4b/3f/df/6acbf0a40397d9bf3ff97f582cc22fb9ce66adde75bc71fd54
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.4


In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate, train_test_split



import warnings; warnings.simplefilter('ignore')

## **Simple Recommender**

The **Simple Recommender** provides general movie suggestions based on popularity and genre. It operates on the principle that widely popular and highly rated movies will appeal to most viewers. This model doesn’t customize recommendations for individual users.

To implement it, we simply sort movies by ratings and popularity, then display the top picks. By adding a genre filter, we can also highlight the highest-rated movies within a specific genre.



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
md = pd.read_csv('/content/drive/MyDrive/CAS DS Final Project - Movie Recommendation System - Avisek Regmi/movies_metadata.csv')
md.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [None]:
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

I use the TMDB Ratings to come up with our **Top Movies Chart.** I will use IMDB's *weighted rating* formula to construct my chart. Mathematically, it is represented as follows:

Weighted Rating (WR) = $\left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$

where,
* *v* is the number of votes for the movie,
* *m* is the minimum votes required to be listed,
* *R* is the average rating of the movie,
* *C* is the mean vote across the entire dataset.

The next step is to set *m*, the minimum votes threshold. We'll use the 95th percentile as the cutoff, meaning a movie must have at least as many votes as 95% of all movies to be included.

I will compile the overall Top 250 Chart and create functions to generate charts for specific genres.
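To make the formula concrete, here is a small sketch that evaluates it for a single movie. The function name `weighted_rating_for` is an illustrative helper and not part of the notebook's pipeline. The inputs are the vote count and integer rating that this notebook reports for Inception, together with the `m` and `C` values it computes in the cells that follow.

```python
def weighted_rating_for(v, R, m, C):
    """IMDB-style weighted rating: pull R toward the global mean C when v is small."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 434.0               # 95th-percentile vote count reported in this notebook
C = 5.244896612406511   # mean vote average reported in this notebook

# A heavily voted movie keeps almost all of its own rating.
inception = weighted_rating_for(v=14075, R=8, m=m, C=C)

# A movie sitting exactly at the threshold lands halfway between R and C.
barely_qualified = weighted_rating_for(v=434, R=8, m=m, C=C)

print(round(inception, 4))         # 7.9176
print(round(barely_qualified, 4))  # 6.6224
```

The second call shows why the threshold matters: with only `m` votes, an 8-rated movie is scored as if half of its evidence came from the dataset-wide mean.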

**Let's get started!**

In [None]:
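vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
# Note: astype('int') truncates each rating (7.7 becomes 7), so C below is the mean of truncated ratings.
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')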
C = vote_averages.mean()
C

5.244896612406511

In [None]:
m = vote_counts.quantile(0.95)
m

434.0

In [None]:
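# pd.notnull is needed here: x != np.nan is always True, so missing dates would otherwise become the string 'NaT'.
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)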

In [None]:
qualified = md[(md['vote_count'] >= m) & (md['vote_count'].notnull()) & (md['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(2274, 6)



To qualify for the chart, a movie must have at least **434 votes** on TMDB. The mean rating across TMDB movies, computed here from ratings truncated to integers, is about **5.24** out of 10. A total of **2274** movies meet the vote threshold and are eligible for our chart.

In [None]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [None]:
qualified['wr'] = qualified.apply(weighted_rating, axis=1)

In [None]:
qualified = qualified.sort_values('wr', ascending=False).head(250)

### **Top Movies**

In [None]:
qualified.head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,wr
15480,Inception,2010,14075,8,29.108149,"[Action, Thriller, Science Fiction, Mystery, A...",7.917588
12481,The Dark Knight,2008,12269,8,123.167259,"[Drama, Action, Crime, Thriller]",7.905871
22879,Interstellar,2014,11187,8,32.213481,"[Adventure, Drama, Science Fiction]",7.897107
2843,Fight Club,1999,9678,8,63.869599,[Drama],7.881753
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.070725,"[Adventure, Fantasy, Action]",7.871787
292,Pulp Fiction,1994,8670,8,140.950236,"[Thriller, Crime]",7.86866
314,The Shawshank Redemption,1994,8358,8,51.645403,"[Drama, Crime]",7.864
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.324358,"[Adventure, Fantasy, Action]",7.861927
351,Forrest Gump,1994,8147,8,48.307194,"[Comedy, Drama, Romance]",7.860656
5814,The Lord of the Rings: The Two Towers,2002,7641,8,29.423537,"[Adventure, Fantasy, Action]",7.851924


Three Christopher Nolan films, **Inception**, **The Dark Knight**, and **Interstellar**, dominate the top of our chart, highlighting a strong preference among TMDB users for certain genres and directors.

Next, we’ll create a function to build genre-specific charts, relaxing our criteria to the **85th** percentile instead of the 95th.

In [None]:
s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = md.drop('genres', axis=1).join(s)
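The cell above expands each movie into one row per genre. As a sketch of an alternative, assuming pandas 0.25 or later, `DataFrame.explode` produces the same one-row-per-genre shape without a row-wise `apply`. The toy frame below is illustrative ("Untagged Film" is a made-up title), and `gen_toy` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for the metadata frame: one list of genre names per movie.
toy = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", "Untagged Film"],
    "genres": [["Animation", "Comedy"], ["Adventure"], []],
})

gen_toy = (
    toy.explode("genres")                    # one row per (movie, genre) pair
       .rename(columns={"genres": "genre"})
       .dropna(subset=["genre"])             # empty lists explode to NaN; drop them
)

print(gen_toy)
```

The `dropna` step mirrors the behavior of `stack()`, which also leaves out movies that have no genres.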

In [None]:
def build_chart(genre, percentile=0.85):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)

    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')

    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)

    return qualified

Let's see my method in action by displaying the Top 15 Romance Movies. Despite its popularity, the romance genre was nearly absent from the top of our overall chart: among the entries shown above, only **Forrest Gump** carries the Romance tag.

### **Top Romance Movies**

In [None]:
build_chart('Romance').head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr
10309,Dilwale Dulhania Le Jayenge,1995,661,9,34.457024,8.565285
351,Forrest Gump,1994,8147,8,48.307194,7.971357
876,Vertigo,1958,1162,8,18.20822,7.811667
40251,Your Name.,2016,1030,8,34.461252,7.789489
883,Some Like It Hot,1959,835,8,11.845107,7.745154
1132,Cinema Paradiso,1988,834,8,14.177005,7.744878
19901,Paperman,2012,734,8,7.198633,7.713951
37863,Sing Street,2016,669,8,10.672862,7.689483
882,The Apartment,1960,498,8,11.994281,7.599317
38718,The Handmaiden,2016,453,8,16.727405,7.566166


According to my metrics, the top romance movie is Bollywood's **Dilwale Dulhania Le Jayenge.**


## **Content Filtering**

**Content filtering** recommends items based on their attributes or features, aiming to match user preferences. This method relies on the similarity between user and item characteristics, such as age or genre. By modeling the probability of a new interaction, content filtering suggests items akin to those previously engaged with.

For instance, if a user enjoys romantic comedies like "You’ve Got Mail" and "Sleepless in Seattle," a content filtering recommender might suggest another film with similar genres or cast, such as "Joe Versus the Volcano."

![alt text ](https://www.nvidia.com/content/dam/en-zz/Solutions/glossary/data-science/recommendation-system/img-3.png)

## **Content Based Recommender**

The recommender discussed earlier faces significant drawbacks, offering uniform recommendations regardless of individual preferences. This limits its effectiveness, especially for users with specific tastes.

For example, fans of Indian actor Mr. Shahrukh Khan might not find his movies prioritized. To address this, I will develop a personalized engine using **content-based filtering.**

This approach analyzes movie attributes to suggest similar ones, so recommendations reflect the kind of movie a user already likes.

**I will build two Content Based Recommenders based on:**
* **Movie Overviews and Taglines**
* **Movie Cast, Crew, Keywords and Genre**

As mentioned in the introduction, I will use a subset of the available movies because of my limited computing power.

In [None]:
links_small = pd.read_csv('/content/drive/MyDrive/CAS DS Final Project - Movie Recommendation System - Avisek Regmi/links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

In [None]:
md = md.drop([19730, 29503, 35587])

In [None]:
# See the Exploratory Data Analysis notebook for how and why the indices dropped in the previous cell were identified.
md['id'] = md['id'].astype('int')

In [None]:
smd = md[md['id'].isin(links_small)]
smd.shape

(9099, 25)

In our smaller movie metadata dataset, we have **9099** films, roughly one-fifth of the original collection of about 45,000 movies.

### **Movie Description Based Recommender**

I will begin by constructing a recommender using movie descriptions and taglines.

As I lack a quantitative metric to evaluate our model's performance, assessment will rely on qualitative analysis.

In [None]:
smd['tagline'] = smd['tagline'].fillna('')
smd['description'] = smd['overview'] + smd['tagline']
smd['description'] = smd['description'].fillna('')

In [None]:
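# min_df=1 keeps every term, like the original min_df=0; recent scikit-learn releases reject an integer min_df below 1.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')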
tfidf_matrix = tf.fit_transform(smd['description'])

In [None]:
tfidf_matrix.shape

(9099, 268124)

#### **Cosine Similarity**

**Cosine similarity is a metric used to measure the similarity between two vectors in a multi-dimensional space. It calculates the cosine of the angle between these vectors, hence the name. Cosine similarity is widely used in various fields, including natural language processing, information retrieval, and recommendation systems.**

![alt text ](https://miro.medium.com/v2/resize:fit:720/format:webp/1*3Zrzb7RH25mezldIrEfqjA.png)

I will be using the **Cosine Similarity** to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, it is defined as follows:

$cosine(x,y) = \frac{x \cdot y^\intercal}{\|x\| \, \|y\|}$

Because `TfidfVectorizer` L2-normalizes each row by default, every vector has unit length, so the dot product of two rows is already their cosine similarity. Therefore, I will use sklearn's **linear_kernel** instead of **cosine_similarity**, since it skips the normalization step and is faster.
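A quick way to check this on toy data: `TfidfVectorizer` uses `norm='l2'` by default, so each row has unit length and the plain dot product matches the cosine similarity. The three sentences below are illustrative and not drawn from the movie dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = [
    "A space hero fights villains.",
    "A hero saves the world.",
    "Villains plan to conquer the world.",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Every TF-IDF row should have Euclidean length 1 under the default norm='l2'.
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

dot = linear_kernel(tfidf, tfidf)      # plain dot products
cos = cosine_similarity(tfidf, tfidf)  # normalizes first, then takes dot products

print(row_norms)
print(np.allclose(dot, cos))
```

If the vectorizer were created with `norm=None`, the two matrices would no longer agree and `cosine_similarity` would be the correct choice.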

In [None]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
cosine_sim[0]

array([1.        , 0.00680476, 0.        , ..., 0.        , 0.00344913,
       0.        ])

With the pairwise cosine similarity matrix in hand, my next task is to write a function that returns the 30 most similar movies based on their cosine similarity scores.

In [None]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

In [None]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

**We are all set. Let us now try and get the top recommendations for a few movies and see how good the recommendations are.**

In [None]:
get_recommendations('The Godfather').head(10)

973      The Godfather: Part II
8387                 The Family
3509                       Made
4196         Johnny Dangerously
29               Shanghai Triad
5667                       Fury
2412             American Movie
1582    The Godfather: Part III
4221                    8 Women
2159              Summer of Sam
Name: title, dtype: object

In [None]:
get_recommendations('The Dark Knight').head(10)

7931                      The Dark Knight Rises
132                              Batman Forever
1113                             Batman Returns
8227    Batman: The Dark Knight Returns, Part 2
7565                 Batman: Under the Red Hood
524                                      Batman
7901                           Batman: Year One
2579               Batman: Mask of the Phantasm
2696                                        JFK
8165    Batman: The Dark Knight Returns, Part 1
Name: title, dtype: object


Our system successfully recognizes **The Dark Knight** as a Batman film, suggesting other Batman movies as top picks. However, it falls short by not considering crucial factors like **cast**, **crew**, and **genre**, which greatly influence a movie's appeal.

For instance, a fan of **The Dark Knight** likely appreciates Nolan's direction and may not enjoy **Batman Forever**.

**To address this, I will enhance our recommender with comprehensive metadata, including genre, keywords, and cast/crew details, for a more refined selection process.**



### **Metadata Based Recommender**

A metadata-based recommender for movies is a recommendation system that leverages descriptive information about movies to generate personalized recommendations for users. This type of recommender analyzes various metadata attributes associated with movies, including:

* **Genre:** Classifies movies into categories such as action, comedy, drama, thriller, etc.
* **Keywords:** Identifies important keywords or tags associated with the movie's plot, themes, or content.
* **Cast and Crew:** Considers actors, directors, writers, and other personnel involved in the production of the movie.
* **Release Year:** Takes into account the year the movie was released, which may influence its relevance to users.
* **Rating and Reviews:** Incorporates user ratings, critic reviews, and popularity metrics to assess the movie's quality and appeal.

By analyzing these metadata attributes, a metadata-based movie recommender can understand the characteristics of movies and users' preferences more comprehensively. It then recommends movies that share similar attributes or characteristics to those that users have expressed interest in or enjoyed previously.

**For example, if a user enjoys action movies directed by Christopher Nolan and starring Christian Bale, a metadata-based recommender might suggest other action movies directed by Nolan or featuring Bale in prominent roles.**

**Overall, metadata-based movie recommenders enhance the accuracy and relevance of movie recommendations by considering rich contextual information beyond just user-item interactions.**

![alt text ](https://d3i71xaburhd42.cloudfront.net/b762965835df99bffc4b776e3fbec5a13392e1fe/3-Figure1-1.png)

**To build our standard metadata based content recommender, we will need to merge our current dataset with the crew and the keyword datasets. Let us prepare this data as our first step.**

In [None]:
credits = pd.read_csv('/content/drive/MyDrive/CAS DS Final Project - Movie Recommendation System - Avisek Regmi/credits.csv')
keywords = pd.read_csv('/content/drive/MyDrive/CAS DS Final Project - Movie Recommendation System - Avisek Regmi/keywords.csv')

In [None]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
md['id'] = md['id'].astype('int')

In [None]:
md.shape

(45463, 25)

In [None]:
md = md.merge(credits, on='id')
md = md.merge(keywords, on='id')

In [None]:
smd = md[md['id'].isin(links_small)]
smd.shape

(9219, 28)

**Our consolidated dataframe now holds the essential components: cast, crew, genres, and keywords. Let's refine it using the following intuitions:**

1. **Crew:** I will keep only the director as a feature, since the other crew members contribute far less to the *feel* of the movie.
2. **Cast:** Choosing the cast is trickier. Lesser-known actors and minor roles have little effect on people's opinion of a movie, so I will keep only the major characters and their actors. Somewhat arbitrarily, I will take the first three actors that appear in the credits list.

In [None]:
smd['cast'] = smd['cast'].apply(literal_eval)
smd['crew'] = smd['crew'].apply(literal_eval)
smd['keywords'] = smd['keywords'].apply(literal_eval)
smd['cast_size'] = smd['cast'].apply(lambda x: len(x))
smd['crew_size'] = smd['crew'].apply(lambda x: len(x))

In [None]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [None]:
smd['director'] = smd['crew'].apply(get_director)

In [None]:
smd['cast'] = smd['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
smd['cast'] = smd['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)

In [None]:
smd['keywords'] = smd['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

I am taking an unconventional approach to building the recommender.

I will compile each movie's metadata (**genres**, **director**, **main actors**, and **keywords**) into a single string, a metadata "soup".

Using a **Count Vectorizer**, I will generate a **count matrix.** Then, like before, I will calculate cosine similarities to find the most relevant movies.

**Steps:**

**1.** I will begin by standardizing the data: stripping spaces and converting to lowercase. Each name then becomes a single token, so the engine does not confuse Johnny Depp with Johnny Galecki because of the shared first name.

**2.** To underscore the director's significance, I will **mention the director's name twice,** raising its weight relative to each cast member.
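The two steps above can be sketched in plain Python. The helpers `normalize_name` and `build_soup` and the `director_weight` parameter are hypothetical names introduced for illustration, and the sample keywords and cast are illustrative inputs rather than values read from the dataset.

```python
def normalize_name(name):
    """Lowercase a name and remove spaces so it becomes one token."""
    return name.replace(" ", "").lower()

def build_soup(keywords, cast, director, genres, director_weight=2):
    """Join all metadata tokens into one space-separated string."""
    tokens = (
        list(keywords)
        + [normalize_name(actor) for actor in cast]
        + [normalize_name(director)] * director_weight  # repeat to add weight
        + list(genres)
    )
    return " ".join(tokens)

soup = build_soup(
    keywords=["dccomics", "vigilante"],  # illustrative keywords
    cast=["Christian Bale", "Heath Ledger", "Michael Caine"],
    director="Christopher Nolan",
    genres=["Drama", "Action", "Crime", "Thriller"],
)

print(soup)
print(normalize_name("Johnny Depp"), normalize_name("Johnny Galecki"))
```

Because a count-based vectorizer counts token occurrences, repeating the director's token doubles its contribution to the dot product between two movies that share a director.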


## **Count Vectorizer**

In a movie recommendation system, **CountVectorizer** is used to convert textual data related to movies into a numerical format that machine learning algorithms can process.

This textual data typically includes movie features such as **genres, plot summaries, cast, crew, keywords, and other metadata**.

**How CountVectorizer is Applied in Movie Recommendation Systems**

**Feature Extraction:** **Extract relevant textual information from the movie dataset.**

For example:

**Genres:** Action, Adventure, Sci-Fi

**Keywords:** hero, villain, space

**Plot summary:** A space hero fights villains to save the universe.

**Tokenization**: **Use CountVectorizer to tokenize the text, splitting it into individual words or tokens.**

**Vocabulary Building:** **Create a vocabulary of all unique tokens found in the textual data across all movies.**

**Count Matrix Creation:** **Construct a count matrix where each row represents a movie and each column represents the frequency of a token from the vocabulary in that movie's textual data.**


**Example:** **Consider a small dataset of three movies with their plot summaries:**

* Movie 1: "A space hero fights villains."
* Movie 2: "A hero saves the world."
* Movie 3: "Villains plan to conquer the world."

Using CountVectorizer, the following steps would be performed:

**Tokenization:**

Movie 1: ["a", "space", "hero", "fights", "villains"]

Movie 2: ["a", "hero", "saves", "the", "world"]

Movie 3: ["villains", "plan", "to", "conquer", "the", "world"]

**Vocabulary Building:** ["a", "space", "hero", "fights", "villains", "saves", "the", "world", "plan", "to", "conquer"]

**Count Matrix Creation:**

![alt text ](https://miro.medium.com/v2/resize:fit:720/format:webp/1*omYTxakC8y4dz-fvpM1Lug.jpeg)





In [None]:
smd['cast'] = smd['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [None]:
smd['director'] = smd['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
smd['director'] = smd['director'].apply(lambda x: [x,x])

#### **Keywords**


Before we put our keywords to use, we'll start with a bit of pre-processing. First up, we'll tally up the frequency counts of each keyword in the dataset.

In [None]:
s = smd.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'

In [None]:
s = s.value_counts()
s[:5]

keyword
independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
Name: count, dtype: int64

Keywords vary in frequency from 1 to 610. Single occurrences aren't useful, so we'll discard those.

Then, we'll standardize words by converting them to their stems, treating variations like **Dogs** and **Dog** as identical.

In [None]:
s = s[s > 1]

In [None]:
stemmer = SnowballStemmer('english')
stemmer.stem('dogs')

'dog'

In [None]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words

In [None]:
smd['keywords'] = smd['keywords'].apply(filter_keywords)
smd['keywords'] = smd['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
smd['keywords'] = smd['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [None]:
smd['soup'] = smd['keywords'] + smd['cast'] + smd['director'] + smd['genres']
smd['soup'] = smd['soup'].apply(lambda x: ' '.join(x))

In [None]:
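# min_df=1 keeps every term, like the original min_df=0; recent scikit-learn releases reject an integer min_df below 1.
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')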
count_matrix = count.fit_transform(smd['soup'])

In [None]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [None]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

Let's revisit the get_recommendations function we've used before. With the updated cosine similarity scores, we anticipate different (and hopefully improved) results. We'll test it again with **The Dark Knight** and see what recommendations we get now.

In [None]:
get_recommendations('The Dark Knight').head(10)

8031                 The Dark Knight Rises
6218                         Batman Begins
7659            Batman: Under the Red Hood
6623                          The Prestige
1134                        Batman Returns
8927               Kidnapping Mr. Heineken
5943                              Thursday
1260                        Batman & Robin
2085                             Following
9024    Batman v Superman: Dawn of Justice
Name: title, dtype: object


I am really pleased with the results this time. The recommendations prominently feature other Christopher Nolan films, likely due to the emphasis on the director. I enjoyed watching **The Dark Knight** and found **Batman Begins**, **The Prestige,** and **The Dark Knight Rises** among the top suggestions.

We have room to experiment further with this engine by adjusting feature weights **(like directors, actors, genres)**, setting limits on keyword usage, considering genre frequency, displaying movies only in the same language, and more.





In [None]:
get_recommendations('Pulp Fiction').head(10)

1381         Jackie Brown
8905    The Hateful Eight
5200    Kill Bill: Vol. 2
4595                Basic
4764             S.W.A.T.
898        Reservoir Dogs
6939              Cleaner
4903    Kill Bill: Vol. 1
231         Kiss of Death
4306       The 51st State
Name: title, dtype: object

#### **Popularity and Ratings**

One thing we notice about our recommendation system is that it recommends movies regardless of ratings and popularity. It is true that **Batman & Robin** shares many characters with **The Dark Knight**, but it was a terrible movie that shouldn't be recommended to anyone.

**Therefore, we will add a mechanism to remove bad movies and return movies which are popular and have had a good critical response.**

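I will take the top 25 movies by similarity score and compute the 60th percentile of their vote counts.

A movie must have at least that many votes to stay in the list. I then calculate the weighted rating of each remaining movie using IMDB's formula, as in the Simple Recommender section. Because `weighted_rating` reads the global `m` and `C` defined in that section, the 60th-percentile value acts only as the qualification cutoff here.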
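A minimal sketch of the filter-then-rank step, assuming a variant in which `m` and `C` are passed to the formula explicitly instead of being read from global variables. The helper name `weighted_rating_with` is hypothetical, and the four rows reuse vote counts and integer ratings from the results table in this section. Because it uses local values of `m` and `C`, its scores will differ from the ones the notebook reports.

```python
import pandas as pd

def weighted_rating_with(row, m, C):
    """IMDB-style weighted rating with the threshold m and mean C given explicitly."""
    v, R = row["vote_count"], row["vote_average"]
    return v / (v + m) * R + m / (v + m) * C

candidates = pd.DataFrame({
    "title": ["The Prestige", "Batman Begins", "Following", "Batman & Robin"],
    "vote_count": [4510, 7511, 363, 1447],
    "vote_average": [8, 7, 7, 4],
})

m = candidates["vote_count"].quantile(0.60)   # local vote threshold
C = candidates["vote_average"].mean()         # local mean rating

# Keep only movies with enough votes, then rank them by weighted rating.
qualified = candidates[candidates["vote_count"] >= m].copy()
qualified["wr"] = qualified.apply(weighted_rating_with, axis=1, m=m, C=C)
qualified = qualified.sort_values("wr", ascending=False)

print(qualified[["title", "wr"]])
```

In this toy run, the two low-vote titles fall below the local threshold and are filtered out before ranking.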

In [None]:
def improved_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]

    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified

In [None]:
improved_recommendations('The Dark Knight')

Unnamed: 0,title,vote_count,vote_average,year,wr
6623,The Prestige,4510,8,2006,7.758148
8031,The Dark Knight Rises,9263,7,2012,6.921448
6218,Batman Begins,7511,7,2005,6.904127
7659,Batman: Under the Red Hood,459,7,2010,6.147016
2085,Following,363,7,1998,6.044272
1134,Batman Returns,1706,6,1992,5.846862
7561,Harry Brown,351,6,2009,5.582529
8026,Bullet to the Head,490,5,2013,5.115027
9024,Batman v Superman: Dawn of Justice,7189,5,2016,5.013943
1260,Batman & Robin,1447,4,1997,4.287233


In [None]:
improved_recommendations('Pulp Fiction')

Unnamed: 0,title,vote_count,vote_average,year,wr
898,Reservoir Dogs,3821,8,1992,7.718986
7280,Inglourious Basterds,6598,7,2009,6.891679
4903,Kill Bill: Vol. 1,5091,7,2003,6.862133
8905,The Hateful Eight,4405,7,2015,6.842588
5200,Kill Bill: Vol. 2,4061,7,2004,6.830542
1381,Jackie Brown,1580,7,1997,6.62179
8110,The Raid,1076,7,2011,6.495553
6788,Death Proof,1359,6,2007,5.817225
8558,Oldboy,632,5,2013,5.099705
4764,S.W.A.T.,780,5,2003,5.08755


Despite its poor quality, **Batman & Robin** remains on our recommendation list, probably because its TMDB rating of 4 is only slightly below average. This is frustrating, considering that far superior films like **The Dark Knight Rises** are rated only 7. Unfortunately, there's little we can do to address this discrepancy. Consequently, I will conclude our **Content-Based Recommender** section for now and revisit it when we develop a hybrid engine.


## **Collaborative Filtering**


**Collaborative filtering** algorithms recommend items based on the preferences of many users. By analyzing past interactions between users and items, these algorithms predict future behavior. They create models from users' previous activities, like purchases or ratings, and compare them with the actions of others. The core idea is that users with similar past behaviors are likely to have similar future preferences. For instance, if you and another user have shown similar movie tastes, the system might recommend a movie to you that the other user enjoyed.
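
To make the "similar users" idea concrete, here is a minimal user-based sketch in NumPy. It is not the method used later in this notebook (that relies on Surprise's SVD); the toy ratings matrix and the helper names `cosine` and `predict` are assumptions made up for illustration.

```python
import numpy as np

# Toy ratings matrix (rows = users, columns = movies); 0 means "not rated".
# All values are made up for illustration.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],   # target user: has not rated movie 2
    [5.0, 5.0, 4.0, 1.0],   # similar taste to the target user
    [1.0, 2.0, 1.0, 5.0],   # roughly opposite taste
])

def cosine(u, v):
    """Cosine similarity computed over the movies both users have rated."""
    both = (u > 0) & (v > 0)
    if not both.any():
        return 0.0
    return float(u[both] @ v[both] / (np.linalg.norm(u[both]) * np.linalg.norm(v[both])))

def predict(R, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    sims, vals = [], []
    for other in range(R.shape[0]):
        if other != user and R[other, item] > 0:
            sims.append(cosine(R[user], R[other]))
            vals.append(R[other, item])
    sims, vals = np.array(sims), np.array(vals)
    if len(sims) == 0 or sims.sum() == 0:
        return float(R[R > 0].mean())  # fall back to the global mean rating
    return float(sims @ vals / sims.sum())

estimate = predict(R, user=0, item=2)
print(round(estimate, 2))
```

The estimate is pulled toward the rating given by the more similar user, which is exactly the intuition described above. Real systems usually mean-center ratings first so that users with opposite tastes get a low or negative similarity.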


![alt text ](https://www.nvidia.com/content/dam/en-zz/Solutions/glossary/data-science/recommendation-system/img-2.png)



Our content-based engine has significant limitations. It can only suggest movies similar to a given movie, lacking the ability to capture diverse tastes and provide cross-genre recommendations. Moreover, it doesn't personalize recommendations, offering the same suggestions for a movie regardless of the user's individual preferences.

To address these issues, I will implement **Collaborative Filtering**, which makes recommendations by leveraging the preferences of similar users. This technique predicts how much a user will like a product based on the opinions of users with similar tastes.

Instead of building Collaborative Filtering from scratch, I will use the **Surprise library**, which provides ready-made algorithms such as **Singular Value Decomposition (SVD)**. These models are trained to minimize prediction error, which I measure here with RMSE (Root Mean Square Error).





## **Surprise Library**

The **Surprise library** is a Python scikit for building and analyzing recommender systems. It is particularly useful for collaborative filtering, which is a key technique in movie recommendation systems.

**Surprise simplifies the process of implementing and experimenting with various recommendation algorithms**, **allowing developers to focus on optimizing and improving their models rather than dealing with the complexities of algorithm implementation from scratch.**

## **Singular Value Decomposition (SVD)**

In a movie recommendation system, **Singular Value Decomposition (SVD)** is used to factorize the user-item interaction matrix into three matrices that reveal latent factors representing both users and items.

**This technique helps in identifying patterns and correlations within the data, allowing the system to make accurate predictions about user preferences for unseen movies.**
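
As a small illustration of the three-matrix factorization, the sketch below applies NumPy's classical SVD to a tiny, fully observed toy matrix (all values are made up) and keeps only the two strongest latent factors. Note that this is the textbook decomposition: as far as I understand the Surprise documentation, its `SVD` class is a related matrix-factorization model that learns user and item factors from the observed ratings only, rather than decomposing a dense matrix like this.

```python
import numpy as np

# Small, fully observed toy matrix (rows = users, columns = movies); values are made up.
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

# Classical SVD: R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep only the k strongest latent factors.
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# With all factors the reconstruction is exact up to floating-point error;
# with k factors we get a compressed approximation.
full_error = np.abs(R - U @ np.diag(s) @ Vt).max()
rank_k_rmse = np.sqrt(np.mean((R - R_k) ** 2))
print(f"max error with all factors: {full_error:.2e}")
print(f"RMSE of rank-{k} approximation: {rank_k_rmse:.3f}")
```

The rows of `U` can be read as user factors and the rows of `Vt` as movie-side factors, with `s` weighting how much each latent factor contributes.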

In [None]:
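# Surprise classes and helpers used in the cells below
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate, train_test_split

reader = Reader()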

In [None]:
ratings = pd.read_csv('/content/drive/MyDrive/CAS DS Final Project - Movie Recommendation System - Avisek Regmi/ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [None]:
# Define a Reader object
reader = Reader()

# Load ratings data from CSV
ratings = pd.read_csv('/content/drive/MyDrive/CAS DS Final Project - Movie Recommendation System - Avisek Regmi/ratings_small.csv')

# Load data into Surprise's Dataset format
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)


In [None]:
# Create SVD algorithm object
svd = SVD()

# Perform cross-validation
results = cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)

# Print the results
for fold_idx, (rmse, mae) in enumerate(zip(results['test_rmse'], results['test_mae']), start=1):
    print(f"------------\nFold {fold_idx}\nRMSE: {rmse:.4f}\nMAE:  {mae:.4f}")

print("------------")
print(f"Mean RMSE: {np.mean(results['test_rmse']):.4f}")
print(f"Mean MAE : {np.mean(results['test_mae']):.4f}")
print("------------")
print("------------")
print(results)


------------
Fold 1
RMSE: 0.8890
MAE:  0.6827
------------
Fold 2
RMSE: 0.8947
MAE:  0.6894
------------
Fold 3
RMSE: 0.8968
MAE:  0.6919
------------
Fold 4
RMSE: 0.8957
MAE:  0.6915
------------
Fold 5
RMSE: 0.9045
MAE:  0.6964
------------
Mean RMSE: 0.8962
Mean MAE : 0.6904
------------
------------
{'test_rmse': array([0.88899599, 0.89470887, 0.89684854, 0.89566372, 0.90453313]), 'test_mae': array([0.68271915, 0.68939319, 0.69185507, 0.69147059, 0.69642919]), 'fit_time': (1.5712404251098633, 1.79483962059021, 1.6885294914245605, 2.2576448917388916, 2.5366971492767334), 'test_time': (0.10976982116699219, 0.3483736515045166, 0.13660836219787598, 0.21720290184020996, 0.40570902824401855)}


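We achieved a mean **Root Mean Square Error (RMSE)** of **0.8962** and a mean **MAE** of **0.6904** across the five folds. In other words, predicted ratings are off by about 0.7 stars on average, which is good enough for our needs.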
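
For reference, the two metrics reported by `cross_validate` can be computed by hand. The sketch below uses made-up ratings and hypothetical helper names (`rmse`, `mae`); it only illustrates the definitions and is not how Surprise computes them internally.

```python
import numpy as np

def rmse(actual, predicted):
    """Root Mean Square Error: penalizes large misses more heavily."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mae(actual, predicted):
    """Mean Absolute Error: the average size of a miss, in stars."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

# Made-up true and predicted ratings for four (user, movie) pairs.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print(rmse(actual, predicted))  # 0.75
print(mae(actual, predicted))   # 0.625
```

RMSE is never smaller than MAE on the same data, which matches the cross-validation output above.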

**Now, let's fit the model on the training split (80% of the small dataset) and generate predictions.**

In [None]:
# Create SVD algorithm object
svd = SVD()

# Load ratings data from CSV
ratings = pd.read_csv('/content/drive/MyDrive/CAS DS Final Project - Movie Recommendation System - Avisek Regmi/ratings_small.csv')

# Load data into Surprise's Dataset format
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Fit the model to the training data
svd.fit(trainset)


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f5a90ad8400>

**Let's take a look at the ratings given by user 1.**

In [None]:
ratings[ratings['userId'] == 1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


In [None]:
svd.predict(1, 302, 3)

Prediction(uid=1, iid=302, r_ui=3, est=2.74311792877312, details={'was_impossible': False})


For the movie with ID 302, the estimated rating for user 1 is **2.743**. The third argument (3) is only a reference value for the true rating and does not influence the estimate. A notable aspect of this recommender system is that it doesn't consider the movie's content; it predicts ratings solely based on the assigned movie ID and the ratings given by other users.
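
To show how such an estimate can arise without any content features, here is a minimal sketch of the prediction rule that the Surprise documentation describes for its `SVD` model, as I understand it: the global mean plus a user bias, an item bias, and the dot product of the learned user and item factor vectors, clipped to the rating scale. The function name `predict_rating` and every number below are assumptions for illustration; they are not the parameters learned by the model in this notebook.

```python
import numpy as np

def predict_rating(mu, b_u, b_i, p_u, q_i, low=1.0, high=5.0):
    """Biased matrix-factorization estimate, clipped to the rating scale.

    mu  : global mean rating
    b_u : user bias (how generous this user is)
    b_i : item bias (how well liked this movie is)
    p_u : user latent factors
    q_i : item latent factors
    """
    est = mu + b_u + b_i + float(np.dot(q_i, p_u))
    return float(np.clip(est, low, high))

# Hypothetical parameters for one (user, movie) pair.
mu = 3.5
b_u = -0.4
b_i = 0.1
p_u = np.array([0.2, -0.1])
q_i = np.array([0.5, 0.3])

est = predict_rating(mu, b_u, b_i, p_u, q_i)
print(round(est, 2))  # 3.27
```

Every term is indexed only by the user ID or the movie ID, which is why the model needs no genres, cast, or plot information.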

## **Hybrid Recommender**

![](https://miro.medium.com/v2/resize:fit:720/format:webp/0*7rVI7mk6e8ntjFmV.png)


In this section, I aim to create a basic hybrid recommender, merging techniques from both content-based and collaborative filtering engines.

Here's how it functions:

**Input:** User ID and Movie Title

**Output:** Similar movies ranked by expected ratings from that specific user.



In [None]:
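def convert_int(x):
    # Return x as an int, or NaN when the value is missing or not numeric
    try:
        return int(x)
    except (ValueError, TypeError, OverflowError):
        return np.nan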

In [None]:
# Map each MovieLens movieId to its TMDB id, then attach titles from the content-based frame (smd)
id_map = pd.read_csv('/content/drive/MyDrive/CAS DS Final Project - Movie Recommendation System - Avisek Regmi/links_small.csv')[['movieId', 'tmdbId']]
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId', 'id']
id_map = id_map.merge(smd[['title', 'id']], on='id').set_index('title')

In [None]:
indices_map = id_map.set_index('id')

In [None]:
def hybrid(userId, title):
    # Row position of the input movie in the content-based similarity matrix
    idx = indices[title]

    # Take the 25 most similar movies, skipping the top hit (normally the movie itself)
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]

    # Copy so that adding the 'est' column does not write to a view of smd
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']].copy()
    # Re-rank the candidates by the SVD-predicted rating for this user
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)

In [None]:
hybrid(1, 'Avatar')

Unnamed: 0,title,vote_count,vote_average,year,id,est
974,Aliens,3282.0,7.7,1986,679,3.484824
1011,The Terminator,4208.0,7.4,1984,218,3.385863
522,Terminator 2: Judgment Day,4274.0,7.7,1991,280,3.374011
8401,Star Trek Into Darkness,4479.0,7.4,2013,54138,3.173193
2014,Fantastic Planet,140.0,7.6,1973,16306,3.111188
7705,Alice in Wonderland,8.0,5.4,1933,25694,3.05682
1621,Darby O'Gill and the Little People,35.0,6.7,1959,18887,3.006838
8658,X-Men: Days of Future Past,6155.0,7.5,2014,127585,2.975509
2834,Predator,2129.0,7.3,1987,106,2.954153
922,The Abyss,822.0,7.1,1989,2756,2.928059


In [None]:
hybrid(500, 'Avatar')

Unnamed: 0,title,vote_count,vote_average,year,id,est
522,Terminator 2: Judgment Day,4274.0,7.7,1991,280,3.346064
974,Aliens,3282.0,7.7,1986,679,3.231076
2014,Fantastic Planet,140.0,7.6,1973,16306,3.22895
1621,Darby O'Gill and the Little People,35.0,6.7,1959,18887,3.225533
7705,Alice in Wonderland,8.0,5.4,1933,25694,3.102655
1668,Return from Witch Mountain,38.0,5.6,1978,14822,3.082991
6084,Beastmaster 2: Through the Portal of Time,17.0,4.6,1991,27549,3.019307
8658,X-Men: Days of Future Past,6155.0,7.5,2014,127585,3.015859
8401,Star Trek Into Darkness,4479.0,7.4,2013,54138,2.998766
7265,Dragonball Evolution,475.0,2.9,2009,14164,2.933833


My **hybrid recommender system** returns different rankings for users 1 and 500, even though both queries use the same movie, *Avatar*.

The candidates come from content similarity, while their order reflects each user's predicted ratings, so every user gets a personalized list.

## **Conclusion**


In this notebook, I have crafted four distinct recommendation systems:

1. **Simple Recommender:** Utilizing TMDB Vote Count and Vote Averages, we generated Top Movies Charts overall and by genre, employing the IMDB Weighted Rating System for sorting.

2. **Content-Based Recommender:** I developed two engines: one analyzing movie overviews and taglines, the other leveraging metadata such as cast, crew, genre, and keywords. I also added a filter favoring movies with higher vote counts and ratings.

3. **Collaborative Filtering:** Leveraging the Surprise library, I used Singular Value Decomposition (SVD) to build a collaborative filter. With a mean cross-validated RMSE of about 0.90, the engine provides estimated ratings for any user and movie pair.

4. **Hybrid Engine:** Merging content and collaborative filtering concepts, we designed an engine suggesting movies to users based on internally calculated estimated ratings.



