# Popularity-based recommendation engine

Simple recommenders are basic systems that recommend the top items based on a certain metric or score. The basic idea behind this system is that movies that are more popular will have a higher probability of being liked by the average audience.

In this lesson, we will build a simplified clone of IMDb Top 250 Movies using metadata collected from IMDb.

## 1. Import libraries and read the data

To begin, we import the pandas library and read the IMDb data:

In [1]:
# import Pandas
import pandas as pd

# load Movies Metadata
movies_df = pd.read_csv('resources/movies_metadata.csv')


## 2. Explore the data

It's a good idea to always explore your data a bit, so you know what you're working with. Let's print the first three rows and have a look at the data.

In [2]:
# print the first three rows
movies_df.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


Now print the vote_average and vote_count of the first 10 rows.

In [3]:
# print the vote_average and vote_count of the first 10 rows
movies_df[['vote_average', 'vote_count']].head(10)

| | vote_average | vote_count |
|---|---|---|
| 0 | 7.7 | 5415.0 |
| 1 | 6.9 | 2413.0 |
| 2 | 6.5 | 92.0 |
| 3 | 6.1 | 34.0 |
| 4 | 5.7 | 173.0 |
| 5 | 7.7 | 1886.0 |
| 6 | 6.2 | 141.0 |
| 7 | 5.4 | 45.0 |
| 8 | 5.5 | 174.0 |
| 9 | 6.6 | 1194.0 |


One of the most basic metrics to build our *Top 250* is the rating (vote_average from above). However, using this metric has a few caveats. For example, it does not take into consideration the popularity of a movie. Therefore, a movie with a rating of 9 from 10 voters will be considered 'better' than a movie with a rating of 8.9 from 10,000 voters.

So it is necessary to come up with a weighted rating that takes into account __the average rating and the number of votes__ a movie has garnered. We will use IMDb's weighted rating formula (since we are trying to build a clone of IMDb's Top 250):

$$\textrm{Weighted Rating (WR)} = \left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$$

where,

- v is the number of votes for the movie (`vote_count`)
- m is the minimum number of votes required to be listed in the chart (to be computed)
- R is the average rating of the movie (`vote_average`)
- C is the mean vote across all movies in the dataset (to be computed)
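
To see how the formula handles the caveat described above, here is a minimal sketch that scores the two hypothetical movies from that example. The helper `weighted_rating_example` and the values `m_example` and `C_example` are assumptions for illustration only; the lesson computes the real m and C from the data.

```python
def weighted_rating_example(v, R, m, C):
    """Weighted rating for a single movie with v votes and average rating R."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Assumed values for illustration; the real m and C come from the dataset
m_example = 160.0
C_example = 5.618

# A rating of 9 from 10 voters versus a rating of 8.9 from 10,000 voters
niche = weighted_rating_example(v=10, R=9.0, m=m_example, C=C_example)
popular = weighted_rating_example(v=10_000, R=8.9, m=m_example, C=C_example)

print(round(niche, 2), round(popular, 2))
```

With these assumed values, the movie with only 10 votes is pulled strongly towards the overall mean C, while the movie with 10,000 votes keeps a score close to its own average rating.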

As a first step, let's calculate the value of C, the mean rating across all movies:

In [4]:
# calculate C
C = movies_df['vote_average'].mean()
print(C)

5.618207215133889


Now we need to determine an appropriate value for m, the minimum number of votes required to be listed in the chart. There is no single right value for m, so we will use the 90th percentile of the vote counts as the cutoff. In other words, for a movie to feature in the chart, it must have at least as many votes as 90% of the movies in the dataset, which places it in the top 10% by `vote_count`.

Let's calculate m, the number of votes at the 90th percentile. The pandas library makes this trivial with the `.quantile()` method:

In [7]:
# calculate the minimum number of votes required to be in the chart, m
m = movies_df['vote_count'].quantile(0.90)
print(m)

160.0


If we had chosen the 75th percentile, we would have considered the top 25% of the movies in terms of the number of votes garnered. As the percentile decreases, the number of movies considered increases. You can check the number of votes for the 75th percentile yourself.
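
The effect of the percentile on the number of qualifying movies can be sketched on a small, made-up series of vote counts. The series `toy_votes` is hypothetical and only stands in for the real `vote_count` column:

```python
import pandas as pd

# Hypothetical vote counts 1..100, not the real dataset
toy_votes = pd.Series(range(1, 101))

qualified_counts = {}
for q in (0.90, 0.75):
    # value below which a fraction q of the vote counts fall
    cutoff = toy_votes.quantile(q)
    # number of entries that meet or exceed the cutoff
    qualified_counts[q] = int((toy_votes >= cutoff).sum())
    print(f"percentile={q}: cutoff={cutoff}, qualifying={qualified_counts[q]}")
```

On this toy series, lowering the percentile from 0.90 to 0.75 raises the number of qualifying entries from 10 to 25.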

## 3. Filter the data

Next, we can filter the movies that qualify for the chart, based on their vote counts. We use the `.copy()` method to ensure that the new `q_movies_df` DataFrame is independent of the original `movies_df` DataFrame. In other words, any changes made to `q_movies_df` do not affect `movies_df`.

The output below shows that 4555 movies qualify for this list.

In [8]:
# select all qualified movies into a new, independent DataFrame
q_movies_df = movies_df.loc[movies_df['vote_count'] >= m].copy()
q_movies_df.shape

(4555, 24)
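
The independence that `.copy()` provides can be checked on a small example. The DataFrame `toy_df` and the value `toy_threshold` below are made up for illustration:

```python
import pandas as pd

# Hypothetical data standing in for movies_df
toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'vote_count': [5, 200, 1000],
})
toy_threshold = 160  # assumed cutoff for this sketch

# Filter first, then copy, so the result owns its own data
toy_qualified = toy_df.loc[toy_df['vote_count'] >= toy_threshold].copy()

# Adding a column to the copy leaves the original untouched
toy_qualified['score'] = 0.0
print(list(toy_df.columns))
print(list(toy_qualified.columns))
```

Only `toy_qualified` gains the `score` column; `toy_df` still has its two original columns.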

Now we need to calculate the *Weighted Rating* for each qualified movie. To do this, we define a function, `weighted_rating()`, and a new feature, `score`, whose value we calculate by applying this function to the DataFrame of qualified movies:

In [10]:
# function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # calculation based on the IMDb formula
    return (v/(v+m) * R) + (m/(m+v) * C)

# define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies_df['score'] = q_movies_df.apply(weighted_rating, axis=1)
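
As an alternative sketch (not the approach used in this lesson), the same formula can be written with column arithmetic instead of a row-wise `apply`, which is usually faster on large DataFrames. The names `toy`, `m_toy` and `C_toy` are assumptions for illustration; the three rows reuse the vote counts and averages of the first qualified movies shown earlier, together with the values of m and C printed above.

```python
import pandas as pd

# Three rows taken from the outputs above: Toy Story, Jumanji, Father of the Bride Part II
toy = pd.DataFrame({
    'vote_count': [5415.0, 2413.0, 173.0],
    'vote_average': [7.7, 6.9, 5.7],
})
m_toy = 160.0              # 90th percentile of vote_count, as printed above
C_toy = 5.618207215133889  # mean vote_average, as printed above

v = toy['vote_count']
R = toy['vote_average']

# Same IMDb formula, evaluated on whole columns at once
toy['score'] = (v / (v + m_toy)) * R + (m_toy / (v + m_toy)) * C_toy
print(toy)
```

The resulting scores should agree with the `score` column produced by `weighted_rating()` for the same movies.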

Let's have a look at the newly created column.

In [11]:
q_movies_df.head(5)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,score
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,7.640253
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,6.820293
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,5.6607
5,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,187436818.0,170.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0,7.537201
8,False,,35000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,9091,tt0114576,en,Sudden Death,International action superstar Jean Claude Van...,...,64350171.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Terror goes into overtime.,Sudden Death,False,5.5,174.0,5.556626


## 4. Top 15

Finally, let's sort the DataFrame based on the score feature and output the title, vote count, vote average and score (= weighted rating) of the top 15 movies.

In [12]:
# sort movies based on score calculated above
q_movies_df = q_movies_df.sort_values('score', ascending=False)

# print the top 15 movies
q_movies_df[['title', 'vote_count', 'vote_average', 'score']].head(15)

| | title | vote_count | vote_average | score |
|---|---|---|---|---|
| 314 | The Shawshank Redemption | 8358.0 | 8.5 | 8.445869 |
| 834 | The Godfather | 6024.0 | 8.5 | 8.425439 |
| 10309 | Dilwale Dulhania Le Jayenge | 661.0 | 9.1 | 8.421453 |
| 12481 | The Dark Knight | 12269.0 | 8.3 | 8.265477 |
| 2843 | Fight Club | 9678.0 | 8.3 | 8.256385 |
| 292 | Pulp Fiction | 8670.0 | 8.3 | 8.251406 |
| 522 | Schindler's List | 4436.0 | 8.3 | 8.206639 |
| 23673 | Whiplash | 4376.0 | 8.3 | 8.205404 |
| 5481 | Spirited Away | 3968.0 | 8.3 | 8.196055 |
| 2211 | Life Is Beautiful | 3643.0 | 8.3 | 8.187171 |
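
If only the top rows are needed, pandas also offers `DataFrame.nlargest()`, which selects the rows with the largest values in a column without sorting the whole DataFrame explicitly. The sketch below compares both approaches on a made-up DataFrame, `toy_scores`, with distinct scores:

```python
import pandas as pd

# Hypothetical scores, not taken from the dataset
toy_scores = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'score': [7.1, 8.4, 6.5, 8.0],
})

# Approach used in the lesson: sort, then take the head
top_sorted = toy_scores.sort_values('score', ascending=False).head(2)

# Alternative: let pandas pick the two largest scores directly
top_nlargest = toy_scores.nlargest(2, 'score')

print(top_nlargest)
```

Both approaches return the same two rows here; with tied scores the order of the tied rows may differ between them.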


The chart has a lot of movies in common with the IMDb Top 250 chart: for example, the top two movies, "The Shawshank Redemption" and "The Godfather", are the same as IMDb's. Check the other movies yourself. Pretty impressive, no?

<img src="./resources/imdb.png" style="height: 500px"/>