# Movies Recommender System

![](http://labs.criteo.com/wp-content/uploads/2017/08/CustomersWhoBought3.jpg)

In this notebook, I will implement a few recommendation algorithms and try to build an ensemble of these models to arrive at a final recommendation system. We have two MovieLens datasets to work with.

* The Full Dataset comprises 26 million ratings and 750,000 tag applications across 45,000 movies, contributed by 270,000 users. It also contains tag genome data with 12 million relevance scores across 1,100 tags.

* The Small Dataset contains 100,000 ratings and 1,300 tag applications across 9,000 movies, provided by 700 users.

We'll create a Simple Recommender using data from the Full Dataset, while all personalized recommender systems will utilize the smaller dataset due to limited computing power.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

import warnings; warnings.simplefilter('ignore')

## 1. Simple Recommender

The Simple Recommender provides broad recommendations to users, leveraging movie popularity and occasionally genre preferences. Unlike personalized recommendation systems, this approach offers suggestions based on the overall popularity and critical acclaim of movies. The core principle is that highly rated and widely acclaimed movies are more likely to appeal to a general audience.

Implementing this model is straightforward. By sorting movies based on ratings and popularity, we can easily identify the top-rated and most popular films. Additionally, users can specify a genre to receive recommendations tailored to their genre preferences. This simple yet effective approach serves as a foundation for more advanced recommendation systems.

In [2]:
md = pd.read_csv('movies_metadata.csv')
md.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [3]:
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
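
The `genres` column stores each movie's genres as the string form of a list of dictionaries, so the cell above parses that string and keeps only the genre names. A minimal sketch of the same step on a single made-up value (the helper name `parse_genres` is hypothetical and not part of the notebook):

```python
from ast import literal_eval

def parse_genres(raw):
    """Return the genre names encoded in a stringified list of dicts."""
    # Missing values (anything that is not a string) are treated as "no genres".
    parsed = literal_eval(raw) if isinstance(raw, str) else []
    # Anything that does not parse to a list also yields no genres.
    return [g['name'] for g in parsed] if isinstance(parsed, list) else []

sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(parse_genres(sample))  # ['Animation', 'Comedy']
print(parse_genres('[]'))    # []
```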

We utilize **TMDB Ratings** to compile our Top Movies Chart, employing **IMDB's weighted rating formula** for this purpose. The mathematical representation of this formula is:

Weighted Rating (WR) = $\left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$

where,
* *v* is the number of votes for the movie
* *m* is the minimum votes required to be listed in the chart
* *R* is the average rating of the movie
* *C* is the mean vote across the whole report

The next step involves determining an appropriate value for m, which represents the minimum votes required for a movie to be included in the chart. We opt to use the **90th** percentile as our cutoff. This means that for a movie to be featured in the charts, it must have more votes than at least 90% of the movies in the list.
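
To see what a percentile cutoff does, here is a small sketch on made-up vote counts (the series `toy_counts` is illustrative and unrelated to the real data). It assumes pandas' default linear interpolation for `quantile`:

```python
import pandas as pd

# Made-up vote counts: 1, 2, ..., 100 (one hypothetical movie each).
toy_counts = pd.Series(range(1, 101))

# 90th percentile of the vote counts, used as the cutoff.
toy_m = toy_counts.quantile(0.90)

# Share of toy movies whose vote count reaches the cutoff.
share_kept = (toy_counts >= toy_m).mean()
print(toy_m, share_kept)
```

With this toy series, about one movie in ten clears the cutoff, which is the behaviour we want from the chart threshold.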

Subsequently, we construct the overall Top **500** Chart and devise a function to generate charts for specific genres.

In [4]:
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C

5.244896612406511

In [5]:
m = vote_counts.quantile(0.90)
m

160.0

In [6]:
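# Note: `x != np.nan` is always True (NaN never compares equal to anything), so
# missing or unparseable dates end up as the string 'NaT' instead of NaN, and
# 'NaT' compares as greater than '1990' in the string filters used later.
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan)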

In [7]:
qualified = md[(md['vote_count'] >= m) & 
               (md['vote_count'].notnull()) & 
               (md['vote_average'].notnull()) & 
               (md['year'] >= '1990')][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(3888, 6)

Hence, to be eligible for the chart, a movie must have at least **160** votes on TMDB and must have been released in 1990 or later. The average rating for a movie on TMDB is **5.244** on a scale of 10 (computed here from ratings truncated to integers by `astype('int')`). A total of **3888** movies qualify for our chart.
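
The effect of the weighted rating is easiest to see on two hypothetical movies. The sketch below plugs made-up vote counts and averages into the formula, using the values of *m* and *C* reported above as defaults (the function name `weighted_rating_example` is an illustration, not the notebook's own function):

```python
def weighted_rating_example(v, R, m=160, C=5.2449):
    """IMDB-style weighted rating: pulls R toward C when v is small relative to m."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Two hypothetical movies (made-up numbers):
niche = weighted_rating_example(v=200, R=9)      # few votes, very high average
popular = weighted_rating_example(v=10000, R=8)  # many votes, slightly lower average
print(round(niche, 2), round(popular, 2))        # 7.33 7.96
```

Although the niche movie has the higher raw average, its small vote count pulls it strongly toward the global mean, so the widely voted movie ranks above it.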

In [8]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [9]:
qualified['wr'] = qualified.apply(weighted_rating, axis=1)

In [10]:
qualified = qualified.sort_values('wr', ascending=False).head(500)

### Top Movies

In [11]:
qualified['rank'] = qualified['wr'].rank(ascending=False, method='first')
qualified.head(20)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,wr,rank
10309,Dilwale Dulhania Le Jayenge,1995,661,9,34.457024,"[Comedy, Drama, Romance]",8.268189,1.0
15480,Inception,2010,14075,8,29.108149,"[Action, Thriller, Science Fiction, Mystery, A...",7.969033,2.0
12481,The Dark Knight,2008,12269,8,123.167259,"[Drama, Action, Crime, Thriller]",7.964533,3.0
22879,Interstellar,2014,11187,8,32.213481,"[Adventure, Drama, Science Fiction]",7.961151,4.0
2843,Fight Club,1999,9678,8,63.869599,[Drama],7.955192,5.0
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.070725,"[Adventure, Fantasy, Action]",7.951302,6.0
292,Pulp Fiction,1994,8670,8,140.950236,"[Thriller, Crime]",7.950077,7.0
314,The Shawshank Redemption,1994,8358,8,51.645403,"[Drama, Crime]",7.948249,8.0
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.324358,"[Adventure, Fantasy, Action]",7.947434,9.0
351,Forrest Gump,1994,8147,8,48.307194,"[Comedy, Drama, Romance]",7.946934,10.0


Observing the top of our chart, we notice that three films by Christopher Nolan - **Inception**, **The Dark Knight**, and **Interstellar** - hold prominent positions. This indicates a notable inclination of TMDB users towards specific genres and directors.

Now, we will proceed to create our function for generating charts based on specific genres. To achieve this, we will adjust our default criteria to the **80th** percentile instead of the previous 90th.

In [12]:
s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = md.drop('genres', axis=1).join(s)
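
The cell above reshapes the data so that each (movie, genre) pair gets its own row. On pandas 0.25 or later, `DataFrame.explode` expresses the same idea more directly. A sketch on a made-up frame (the frame `toy` and its titles are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': [['Drama', 'Romance'], ['Horror'], []],
})

# One row per (movie, genre); a movie with no genres keeps a single row with NaN.
toy_gen = toy.explode('genres').rename(columns={'genres': 'genre'})
print(toy_gen)
```

Like the original left join, this keeps the index of the source movie on every exploded row, so a movie's rows can still be traced back to `md`.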

In [13]:
def build_chart(genre, percentile=0.80):
    df = gen_md[(gen_md['genre'] == genre) & (gen_md['year'] >= '1990')]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(500)
    
    qualified['rank'] = qualified.reset_index().index + 1
    
    return qualified.head(20)

Let's put our method into action by showcasing the **Top 20 Romance Movies**. Despite being one of the most popular movie genres, romance barely made an appearance in our Generic Top Chart.

### Top 20 Romance Movies

In [14]:
build_chart('Romance')

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr,rank
10309,Dilwale Dulhania Le Jayenge,1995,661,9,34.457024,8.566091,1
351,Forrest Gump,1994,8147,8,48.307194,7.971437,2
40251,Your Name.,2016,1030,8,34.461252,7.790099,3
19901,Paperman,2012,734,8,7.198633,7.714789,4
37863,Sing Street,2016,669,8,10.672862,7.690396,5
38718,The Handmaiden,2016,453,8,16.727405,7.567464,6
24886,The Way He Looks,2014,262,8,5.711274,7.33343,7
45437,In a Heartbeat,2017,146,8,20.82178,7.007176,8
1639,Titanic,1997,7770,7,26.88907,6.981644,9
19731,Silver Linings Playbook,2012,4840,7,14.488111,6.970736,10


According to our criteria, Bollywood's **Dilwale Dulhania Le Jayenge** claims the top spot among romance movies, while **Theory of Everything** only managed to secure a place in the top 20.

### Top 20 Horror Movies

In [15]:
build_chart('Horror')

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr,rank
41492,Split,2016,4461,7,28.920839,6.943894,1
14236,Zombieland,2009,3655,7,11.063029,6.931873,2
21276,The Conjuring,2013,3169,7,14.90169,6.921766,3
42169,Get Out,2017,2978,7,36.894806,6.916923,4
8147,Shaun of the Dead,2004,2479,7,14.902948,6.900892,5
8230,Saw,2004,2255,7,23.508433,6.891493,6
39097,The Conjuring 2,2016,2018,7,14.767317,6.879391,7
6353,28 Days Later,2002,1816,7,17.656951,6.866722,8
12277,Sweeney Todd: The Demon Barber of Fleet Street,2007,1745,7,10.038401,6.861613,9
4591,The Others,2001,1708,7,11.046007,6.858792,10


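In the horror genre, **Split** takes the first position, followed by **Zombieland** and **The Conjuring**. Because the chart only covers films released from 1990 onward, older classics such as **The Shining** (1980) cannot appear in it.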

### Top 20 Fantasy Movies

In [16]:
build_chart('Fantasy')

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr,rank
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.070725,7.876352,1
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.324358,7.866833,2
5814,The Lord of the Rings: The Two Towers,2002,7641,8,29.423537,7.857175,3
3030,The Green Mile,1999,4166,8,19.96678,7.749069,4
5481,Spirited Away,2001,3968,8,41.048867,7.737759,5
9698,Howl's Moving Castle,2004,2049,8,16.136048,7.534347,6
2884,Princess Mononoke,1997,2041,8,17.166725,7.532836,7
14551,Avatar,2009,12114,7,185.070892,6.942019,8
19971,The Hobbit: An Unexpected Journey,2012,8427,7,23.253089,6.917869,9
26555,Star Wars: The Force Awakens,2015,7993,7,31.626013,6.913634,10


In the fantasy genre, **The Lord of the Rings: The Fellowship of the Ring**, **The Lord of the Rings: The Return of the King**, and **The Lord of the Rings: The Two Towers** claim the top three spots, showcasing the enduring popularity of J.R.R. Tolkien's epic saga. These movies are celebrated for their captivating storytelling, rich world-building, and memorable characters. Meanwhile, **Harry Potter and the Deathly Hallows: Part 2** secures the twentieth position, representing the beloved wizarding world created by J.K. Rowling and captivating audiences with its magical adventures.

## Content Based Filtering

The recommender system developed in the preceding section exhibits notable limitations. Notably, it offers the same recommendations to all users, disregarding individual preferences. This means that someone who adores romantic films but dislikes action would likely not find many suitable recommendations in our Top 20 Chart. Even when exploring genre-specific charts, users may still not receive the most tailored suggestions.

For example, consider a user who enjoys films like "Dilwale Dulhania Le Jayenge," "My Name is Khan," and "Kabhi Khushi Kabhi Gham." It's evident that this individual has a preference for movies featuring actor Shahrukh Khan and director Karan Johar. However, even when accessing the romance chart, these beloved films may not appear as top recommendations.

To enhance the personalization of our recommendations, we will construct an engine that assesses the similarity between movies based on specific metrics and recommends titles closely resembling a particular movie that a user has enjoyed. This approach, known as **Content-Based Filtering**, relies on movie metadata (or content) to generate recommendations.

We've developed three distinct Content-Based Recommenders:
* **Movie Description-Based Recommender (Utilizing Movie Overviews and Taglines)**
* **Metadata-Based Recommender (Leveraging Movie Cast, Crew, Keywords, and Genre)**
* **Enhanced Recommender Incorporating Popularity and Ratings**

Additionally, it's important to note that we're working with a subset of all available movies due to computational constraints.

In [17]:
links_small = pd.read_csv('links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

In [18]:
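# Drop three rows whose `id` cannot be cast to int in the next cell (presumably malformed records).
md = md.drop([19730, 29503, 35587])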

In [19]:
md['id'] = md['id'].astype('int')

In [20]:
smd = md[md['id'].isin(links_small)]
smd.shape

(9099, 25)

Our small movies metadata dataset contains **9099** movies, roughly one-fifth the size of the original dataset of 45,000 movies.

## 2. Movie Description Based Recommender

Our initial attempt involves constructing a recommender system using movie descriptions and taglines. Given the absence of a quantitative metric for evaluating our machine's performance, we will need to assess its effectiveness qualitatively.

In [21]:
smd['tagline'] = smd['tagline'].fillna('')
smd['description'] = smd['overview'] + smd['tagline']
smd['description'] = smd['description'].fillna('')
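
Two details of the cell above are easy to miss. The overview and tagline are joined without a separator, so the last word of the overview fuses with the first word of the tagline. Also, a missing overview makes the whole sum missing, which then becomes an empty description even if a tagline exists. The sketch below shows both effects on made-up rows, along with one possible variant. Adopting the variant would change the TF-IDF vocabulary, so the shapes and recommendations reported in this notebook would not be reproduced exactly.

```python
import pandas as pd

toy = pd.DataFrame({
    'overview': ['A toy story.', None],
    'tagline': ['Adventure awaits', 'Only a tagline'],
})

# Original approach: no separator, and a missing overview wipes out the tagline.
original = (toy['overview'] + toy['tagline']).fillna('')

# Variant: fill both columns first and join them with a space.
variant = (toy['overview'].fillna('') + ' ' + toy['tagline'].fillna('')).str.strip()

print(original.tolist())  # ['A toy story.Adventure awaits', '']
print(variant.tolist())   # ['A toy story. Adventure awaits', 'Only a tagline']
```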

In [22]:
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(smd['description'])

In [23]:
tfidf_matrix.shape

(9099, 268124)

#### Cosine Similarity

We'll employ Cosine Similarity to quantify the similarity between two movies. Mathematically, it is defined as follows:

$cosine(x,y) = \frac{x \cdot y^\intercal}{\|x\| \cdot \|y\|}$

Because `TfidfVectorizer` L2-normalizes each row by default, every description vector has unit length, so the dot product of two rows is already their cosine similarity. Hence, I'll use **sklearn's linear_kernel** instead of cosine_similarity, since it skips the redundant normalization and is faster.
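
The equivalence between the dot product and cosine similarity holds whenever the rows have unit length, which is the case for `TfidfVectorizer` output under its default `norm='l2'` setting. A small NumPy check on made-up vectors (the array `X` is illustrative, not the notebook's data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 6))  # made-up feature rows

# L2-normalise each row so that every row has length 1.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Plain dot products between the normalised rows.
dot = X_unit @ X_unit.T

# Cosine similarity computed from its definition on the raw rows.
norms = np.linalg.norm(X, axis=1)
cos = (X @ X.T) / np.outer(norms, norms)

print(np.allclose(dot, cos))  # True
```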

In [24]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [25]:
cosine_sim[0]

array([1.        , 0.00680476, 0.        , ..., 0.        , 0.00344913,
       0.        ])

We now have a pairwise cosine similarity matrix for all movies in our dataset. The next step is to write a function that returns the **30** most similar movies based on the cosine similarity score.

In [26]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

In [27]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]
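
The ranking logic of `get_recommendations` can be traced on a toy similarity matrix. The sketch below uses made-up titles and scores, and all `toy_*` names are hypothetical. Unlike the function above, which drops whatever lands in first place after sorting, it removes the queried movie explicitly by its index. That is a small design choice that stays correct even if another movie ties with a similarity of 1.

```python
import numpy as np
import pandas as pd

toy_titles = pd.Series(['A', 'B', 'C', 'D'])
toy_indices = pd.Series(toy_titles.index, index=toy_titles)
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

def toy_recommendations(title, n=2):
    idx = toy_indices[title]
    # Sort all movies by similarity to `title`, highest first.
    order = np.argsort(-toy_sim[idx], kind='stable')
    # Drop the queried movie itself, then keep the n best matches.
    order = [i for i in order if i != idx][:n]
    return toy_titles.iloc[order]

print(toy_recommendations('A').tolist())  # ['C', 'B']
```

Note that a lookup such as `indices[title]` returns a Series rather than a single position when a title occurs more than once, so duplicated titles would need extra handling.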

We're all set. Now, let's proceed to obtain the top 10 recommendations for a selection of movies and evaluate the quality of these recommendations.

In [28]:
get_recommendations('Avatar').head(10)

2059                                    The Matrix
4506                              Tears of the Sun
4695    Lara Croft Tomb Raider: The Cradle of Life
2910               Pandora and the Flying Dutchman
538                          Hellraiser: Bloodline
7460                                    Green Zone
7587                                  The American
3015                                 House Party 2
2561                                     Supernova
975                                A Grand Day Out
Name: title, dtype: object

In [29]:
get_recommendations('The Godfather').head(10)

973      The Godfather: Part II
8387                 The Family
3509                       Made
4196         Johnny Dangerously
29               Shanghai Triad
5667                       Fury
2412             American Movie
1582    The Godfather: Part III
4221                    8 Women
2159              Summer of Sam
Name: title, dtype: object

In [30]:
get_recommendations('The Dark Knight').head(10)

7931                      The Dark Knight Rises
132                              Batman Forever
1113                             Batman Returns
8227    Batman: The Dark Knight Returns, Part 2
7565                 Batman: Under the Red Hood
524                                      Batman
7901                           Batman: Year One
2579               Batman: Mask of the Phantasm
2696                                        JFK
8165    Batman: The Dark Knight Returns, Part 1
Name: title, dtype: object

We observe that for **The Dark Knight**, our system correctly identified it as a Batman film and recommended other Batman films as top recommendations. However, this limited approach overlooks crucial features such as cast, crew, director, and genre, which significantly influence a movie's rating and popularity. For instance, someone who enjoyed **The Dark Knight** probably did so in part because of Christopher Nolan's direction, and may well dislike **Batman Forever** and other weaker films in the Batman franchise.

To address this limitation, we will enhance our recommender system by incorporating more comprehensive metadata than just movie **overview** and **tagline**. The next iteration of our recommender system will utilize **genre**, **keywords**, **cast**, and **crew** information to provide more nuanced and accurate recommendations.

## 3. Metadata-Based Recommender

To construct our standard metadata-based content recommender, we'll merge our current dataset with the crew and keyword datasets. Let's begin by preparing this data as our initial step.

In [31]:
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

In [32]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
md['id'] = md['id'].astype('int')

In [33]:
md.shape

(45463, 25)

In [34]:
md = md.merge(credits, on='id')
md = md.merge(keywords, on='id')

In [35]:
smd = md[md['id'].isin(links_small)]
smd.shape

(9219, 28)
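
The row count rises from 9099 to 9219 even though the same `links_small` filter is applied. One plausible explanation (an assumption, not verified in this notebook) is that some `id` values occur more than once in `credits.csv` or `keywords.csv`, in which case `merge` emits one output row per matching pair. A sketch of that effect on made-up frames, with one way of avoiding it:

```python
import pandas as pd

movies = pd.DataFrame({'id': [1, 2], 'title': ['A', 'B']})
# Hypothetical lookup table in which id 1 appears twice.
extra = pd.DataFrame({'id': [1, 1, 2], 'keywords': ['x', 'x', 'y']})

# Duplicate keys on the right side multiply the matching rows.
inflated = movies.merge(extra, on='id')

# Keeping one row per id before merging preserves the original row count.
deduped = movies.merge(extra.drop_duplicates(subset='id'), on='id')

print(len(inflated), len(deduped))  # 3 2
```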

Now that we've consolidated our cast, crew, genres, and keywords into one dataframe, let's further refine it using the following considerations:

1. **Crew:** We'll focus solely on the director from the crew data, as other roles don't significantly impact the movie's overall perception.
2. **Cast:** Selecting the cast requires some careful consideration. Minor roles and lesser-known actors typically have minimal influence on viewers' opinions. Hence, we'll only include the major characters and their corresponding actors. To do this, we'll arbitrarily select the top 3 actors listed in the credits for each movie.

In [36]:
smd['cast'] = smd['cast'].apply(literal_eval)
smd['crew'] = smd['crew'].apply(literal_eval)
smd['keywords'] = smd['keywords'].apply(literal_eval)
smd['cast_size'] = smd['cast'].apply(lambda x: len(x))
smd['crew_size'] = smd['crew'].apply(lambda x: len(x))

In [37]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [38]:
smd['director'] = smd['crew'].apply(get_director)

In [39]:
smd['cast'] = smd['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
smd['cast'] = smd['cast'].apply(lambda x: x[:3])  # keep at most the first three billed actors

In [40]:
smd['keywords'] = smd['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

Our approach to building the recommender is somewhat unconventional but effective. We created a metadata dump for each movie, including **genres, director, main actors, and keywords**. Then, we used a **Count Vectorizer** to generate a count matrix, similar to what we did for the Description Recommender. The subsequent steps involve calculating cosine similarities to identify the most similar movies.

Here's how we prepare the genres and credits data:

1. We **stripped spaces and converted all features to lowercase** so that each name becomes a single token. Otherwise the vectorizer would split names into separate words, and **Johnny Depp** and **Johnny Galecki** would appear to share the token "johnny".
2. To give the director more weight relative to the entire cast, we **mentioned the director's name three times** in the metadata dump.
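
Both preprocessing choices can be illustrated with a tiny bag-of-words cosine similarity written in plain Python. This is a simplified sketch: it counts single tokens only, whereas the notebook's `CountVectorizer` also uses bigrams, and the actor, genre, and director tokens are made up.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bags of tokens."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = sqrt(sum(v * v for v in ca.values()))
    norm_b = sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)

# With spaces kept, two different actors share the token "johnny".
with_spaces = cosine('johnny depp'.split(), 'johnny galecki'.split())
# With spaces stripped, each name is a single, distinct token.
without_spaces = cosine(['johnnydepp'], ['johnnygalecki'])
print(with_spaces, without_spaces)

def soup_pair(director_weight):
    """Two hypothetical movies that share only their director."""
    a = ['actor1', 'actor2', 'drama'] + ['directorx'] * director_weight
    b = ['actor3', 'actor4', 'comedy'] + ['directorx'] * director_weight
    return a, b

sim_once = cosine(*soup_pair(1))
sim_thrice = cosine(*soup_pair(3))
print(sim_once, sim_thrice)
```

In this toy setting, repeating the director three times raises the similarity of two movies that share nothing else from roughly 0.25 to roughly 0.75.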

In [41]:
smd['cast'] = smd['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [42]:
# Note: astype('str') turns a missing director (NaN) into the literal token 'nan'.
smd['director'] = smd['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
smd['director'] = smd['director'].apply(lambda x: [x,x, x])

#### Keywords

We performed a small amount of pre-processing of our keywords before putting them to any use. As a first step, we calculated the frequency counts of every keyword that appeared in the dataset.

In [43]:
s = smd.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'

In [44]:
s = s.value_counts()
s[:5]

independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
Name: keyword, dtype: int64

Keywords occurred in frequencies ranging from **1 to 610**. We did not have any use for keywords that occurred only once. Therefore, these could be safely removed. Finally, we converted every word to its stem so that words such as **Dogs** and **Dog** were considered the same.

In [45]:
s = s[s > 1]

In [46]:
stemmer = SnowballStemmer('english')
stemmer.stem('dogs')

'dog'

In [47]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words

In [48]:
smd['keywords'] = smd['keywords'].apply(filter_keywords)
smd['keywords'] = smd['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
smd['keywords'] = smd['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [49]:
smd['soup'] = smd['keywords'] + smd['cast'] + smd['director'] + smd['genres']
smd['soup'] = smd['soup'].apply(lambda x: ' '.join(x))

In [50]:
count = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')
count_matrix = count.fit_transform(smd['soup'])

In [51]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [52]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

We will reuse the `get_recommendations` function that we wrote earlier. Since our cosine similarity scores have changed, we expect it to give different (and probably better) results. Let's check **The Dark Knight** again and see what recommendations we get this time around.

In [53]:
get_recommendations('The Dark Knight').head(10)

8031         The Dark Knight Rises
6218                 Batman Begins
6623                  The Prestige
2085                     Following
7648                     Inception
4145                      Insomnia
3381                       Memento
8613                  Interstellar
7659    Batman: Under the Red Hood
1134                Batman Returns
Name: title, dtype: object

We are much more satisfied with the results this time around. The recommender appears to have picked up other Christopher Nolan movies such as **Batman Begins**, **The Prestige** and **The Dark Knight Rises** (because of the extra weight given to the director) and placed them at the top.

We can, of course, experiment on this engine by trying out different weights for our features (directors, actors, genres), limiting the number of keywords that can be used in the soup, weighing genres based on their frequency, only showing movies with the same languages, etc.

We can also get recommendations for another movie, **Avatar** and compare it with the previous recommender system.

In [54]:
get_recommendations('Avatar').head(10)

974                             Aliens
522         Terminator 2: Judgment Day
1011                    The Terminator
922                          The Abyss
4347    Piranha Part Two: The Spawning
344                          True Lies
1376                           Titanic
8401           Star Trek Into Darkness
3216                Dungeons & Dragons
8724                 Jupiter Ascending
Name: title, dtype: object

## 4. Enhanced Recommender Incorporating Popularity and Ratings

One observation about our recommendation system was that it suggested movies without considering their ratings and popularity. While some movies may have had similar characteristics, recommending a poorly rated movie would not have been beneficial.

To address this, we added a mechanism that removes poorly rated movies and returns popular movies with a good critical response. We take the **top 25 movies** by similarity score and compute the vote count of the **60th percentile** movie among them; candidates below this threshold are dropped. The remaining movies are scored with IMDB's weighted rating formula, as in the Simple Recommender section. Note that `weighted_rating` reads the global *m* and *C* defined there, so the local threshold only determines which candidates are kept.

In [55]:
def improved_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified

In [56]:
improved_recommendations('The Dark Knight')

Unnamed: 0,title,vote_count,vote_average,year,wr
7648,Inception,14075,8,2010,7.969033
8613,Interstellar,11187,8,2014,7.961151
6623,The Prestige,4510,8,2006,7.905607
3381,Memento,4168,8,2000,7.898148
8031,The Dark Knight Rises,9263,7,2012,6.970199
6218,Batman Begins,7511,7,2005,6.963392
1134,Batman Returns,1706,6,1992,5.935254
132,Batman Forever,1529,5,1995,5.023199
9024,Batman v Superman: Dawn of Justice,7189,5,2016,5.005332
1260,Batman & Robin,1447,4,1997,4.123947


Let us also get the recommendations for **Avatar**.

In [59]:
improved_recommendations('Avatar')

Unnamed: 0,title,vote_count,vote_average,year,wr
1376,Titanic,7770,7,1997,6.964588
8658,X-Men: Days of Future Past,6155,7,2014,6.955532
8401,Star Trek Into Darkness,4479,7,2013,6.939466
522,Terminator 2: Judgment Day,4274,7,1991,6.936667
1011,The Terminator,4208,7,1984,6.93571
974,Aliens,3282,7,1986,6.918415
922,The Abyss,822,7,1989,6.714036
8419,Man of Steel,6462,6,2013,5.981755
344,True Lies,1138,6,1994,5.906921
8724,Jupiter Ascending,2816,5,2015,5.013166
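
Because `weighted_rating` reads the module-level `m` and `C`, the `wr` values above match those of the overall chart (for example, 7.969033 for **Inception** in both tables). If the intent is to score candidates against their own neighbourhood, the two values can be passed in explicitly. The following is a sketch of that variant on made-up data. The names `weighted_rating_with` and `rank_candidates` are hypothetical, and the variant would produce different numbers from the tables above.

```python
import pandas as pd

def weighted_rating_with(row, m, C):
    """IMDB-style weighted rating with m and C passed in explicitly."""
    v, R = row['vote_count'], row['vote_average']
    return (v / (v + m)) * R + (m / (v + m)) * C

def rank_candidates(movies, percentile=0.60, top_n=10):
    """Filter candidates by a local vote-count cutoff, then rank by local weighted rating."""
    C = movies['vote_average'].mean()
    m = movies['vote_count'].quantile(percentile)
    qualified = movies[movies['vote_count'] >= m].copy()
    # Extra keyword arguments to apply() are forwarded to the scoring function.
    qualified['wr'] = qualified.apply(weighted_rating_with, axis=1, m=m, C=C)
    return qualified.sort_values('wr', ascending=False).head(top_n)

# Made-up candidate list standing in for the 25 most similar movies.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'vote_count': [50, 400, 1200, 3000, 9000],
    'vote_average': [9.0, 6.0, 7.5, 5.0, 8.0],
})
ranked = rank_candidates(toy)
print(ranked)
```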


## Conclusion


In this notebook, we have built four different recommendation engines based on various ideas and algorithms. They are as follows:

* **Simple Recommender:** This system utilized overall TMDB Vote Count and Vote Averages to construct Top Movies Charts, both in general and for specific genres. The IMDB Weighted Rating System was employed to calculate ratings, based on which the sorting was ultimately performed.

* **Movie Description-Based Recommender:** This recommender leveraged movie descriptions and taglines to identify similarities between movies. Using TF-IDF Vectorizer and Cosine Similarity, it recommended movies that were most similar to a particular movie based on its textual content.

* **Metadata-Based Recommender:** This system incorporated a more comprehensive set of metadata including genres, directors, main actors, and keywords. By utilizing a Count Vectorizer to create a count matrix, it identified similarities between movies based on this metadata and recommended similar movies.

* **Enhanced Recommender Incorporating Popularity and Ratings:** This recommender further enhanced the metadata-based approach by considering the popularity and ratings of movies. It excluded poorly rated movies and focused on popular ones with positive critical responses. Using IMDB's formula, it calculated weighted ratings for each movie to provide more refined recommendations.