# Popularity-Based Filtering

## Load data

In [97]:
import pandas as pd

movies_df = pd.read_csv('./data/movies.csv')
credits_df = pd.read_csv('./data/credits.csv')
ratings_df = pd.read_csv('./data/ratings.csv')

## Test loaded data

In [98]:
movies_df.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [99]:
credits_df.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [100]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


## Calculate a weighted rating

The IMDb formula `WR = (v / (v+m)) * R + (m / (v+m)) * C` gives the weighted rating (WR). It is a weighted average of two components: the movie's own average rating (R) and the mean rating across movies (C). The weights depend on the movie's number of votes (v) and a predetermined vote cutoff (m): the more votes a movie has relative to m, the closer WR is to R.

- v: number of votes for a movie
- m: minimum number of votes required
- R: average rating of the movie
- C: average rating across all movies in the dataset
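
To make the behaviour of the formula concrete, here is a minimal sketch in plain Python. The function name `imdb_weighted_rating` and all input numbers are illustrative assumptions, not values taken from the dataset.

```python
def imdb_weighted_rating(R, v, m, C):
    """Shrink a movie's average rating R toward the mean rating C.

    The fewer votes v a movie has relative to the cutoff m,
    the more weight the mean rating C receives.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# Hypothetical inputs: a cutoff of 1000 votes and a mean rating of 6.0.
no_votes = imdb_weighted_rating(R=8.0, v=0, m=1000, C=6.0)         # all weight on C
at_cutoff = imdb_weighted_rating(R=8.0, v=1000, m=1000, C=6.0)     # equal weight on R and C
many_votes = imdb_weighted_rating(R=8.0, v=99000, m=1000, C=6.0)   # almost all weight on R

print(no_votes, at_cutoff, many_votes)
```

With no votes the result equals C, at exactly m votes it is the midpoint of R and C, and with many votes it approaches R.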


In [101]:
# Calculate the minimum number of votes required (m) using the 90th percentile of the vote count.
m = movies_df['vote_count'].quantile(0.9)

# display m value
m

1838.4000000000015
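
The cutoff is fractional because `quantile` interpolates linearly between neighbouring values by default. A small sketch on toy data (the Series `toy_votes` is an assumption for illustration, not part of `movies.csv`) shows the effect:

```python
import pandas as pd

# Toy vote counts 1..10 (illustrative only).
toy_votes = pd.Series(range(1, 11))

# With the default linear interpolation, the 0.9 quantile sits at
# position 0.9 * (10 - 1) = 8.1, i.e. 10% of the way from 9 to 10.
toy_m = toy_votes.quantile(0.9)

# Only values strictly above the cutoff survive a `>` filter.
kept = toy_votes[toy_votes > toy_m]

print(toy_m, kept.tolist())
```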

### Filter data on threshold

Filtering `movies_df` on `m` keeps only the movies whose vote count exceeds the 90th percentile, that is, roughly the top 10% of the dataset by number of votes. This excludes movies whose average rating rests on very few votes, so the recommendations draw on popular movies with a substantial number of votes.

In [102]:
# Select the rows of 'movies_df' where 'vote_count' is greater than the threshold 'm'
# Copy the selection so that columns added later do not modify the original DataFrame
movies_df_filtered = movies_df.loc[movies_df['vote_count'] > m].copy()

# display this df copy
movies_df_filtered.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
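
The copy matters because a new column is added to the filtered DataFrame later on. The following sketch, using a toy DataFrame with assumed column values, checks that writing to a copied selection leaves the source DataFrame untouched:

```python
import pandas as pd

# Toy data (illustrative only).
toy = pd.DataFrame({"title": ["A", "B", "C"], "vote_count": [10, 500, 2000]})

# Filter first, then copy the (smaller) result.
subset = toy.loc[toy["vote_count"] > 100].copy()

# Adding a column to the copy does not touch the original frame.
subset["flag"] = True

print(list(toy.columns))     # original columns only
print(list(subset.columns))  # includes the new 'flag' column
```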


In [103]:
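# Calculate the mean vote rating (C) from the 'vote_average' column of the filtered DataFrame 'movies_df_filtered'
# Note: this is the mean over the filtered subset only; movies_df['vote_average'].mean() would give the mean across the whole dataset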
C = movies_df_filtered['vote_average'].mean()

# display C value
C

6.962993762993763

In [104]:
def weighted_rating(row, m=m, C=C):
    """
    Calculate the weighted rating for a movie based on its vote count and average rating.

    Parameters:
    - row: A pandas Series representing a row of the DataFrame containing movie data.
    - m: The minimum number of votes required for a movie to be considered. Default is provided from the global variable 'm'.
    - C: The mean vote rating that each movie's rating is pulled toward. Default is provided from the global variable 'C'.

    Returns:
    - The weighted rating (wr) calculated based on the IMDB formula:
      wr = (v / (v+m)) * R + (m / (v+m)) * C
    """

    # Extract vote count and average rating from the row
    v = row['vote_count']
    R = row['vote_average']

    # Calculate the weighted rating using the provided IMDB formula
    wr = (v / (v+m)) * R + (m / (v+m)) * C

    return wr

In [105]:
# Create a new column 'weighted_rating' in the DataFrame 'movies_df_filtered'
# by applying the function 'weighted_rating' to each row of that DataFrame
# 'axis=1' passes each row (rather than each column) to the function
movies_df_filtered['weighted_rating'] = movies_df_filtered.apply(weighted_rating, axis=1)

# display this df with the added column
movies_df_filtered.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,weighted_rating
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,7.168053
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,6.918271
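
As an alternative to the row-wise `apply`, the same formula can be written as arithmetic on whole columns, which pandas evaluates without a Python-level loop. The sketch below uses a toy DataFrame and assumed values `m_toy` and `C_toy`, and checks that both approaches agree:

```python
import pandas as pd

# Toy data and parameters (illustrative only).
toy = pd.DataFrame({
    "vote_count": [100, 1000, 9000],
    "vote_average": [9.0, 8.0, 7.0],
})
m_toy, C_toy = 1000, 6.0

# Vectorized: operate on entire columns at once.
v = toy["vote_count"]
R = toy["vote_average"]
vectorized = (v / (v + m_toy)) * R + (m_toy / (v + m_toy)) * C_toy


def wr(row):
    """Row-wise version of the same formula, for comparison."""
    v_row, R_row = row["vote_count"], row["vote_average"]
    return (v_row / (v_row + m_toy)) * R_row + (m_toy / (v_row + m_toy)) * C_toy


row_wise = toy.apply(wr, axis=1)

# Largest absolute difference between the two results.
max_diff = (vectorized - row_wise).abs().max()
print(vectorized.round(4).tolist(), max_diff)
```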


### Sort data in descending order by 'weighted_rating'

In [106]:
# Sort the DataFrame 'movies_df_filtered' based on the values in the 'weighted_rating' column in descending order
# The parameter 'ascending=False' specifies that the sorting should be in descending order
sorted_movies_rated = movies_df_filtered.sort_values('weighted_rating', ascending=False)

# Select the top rows (head) of the sorted DataFrame to retrieve the highest-rated movies
sorted_movies_rated.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,weighted_rating
1881,25000000,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 80, ""name...",,278,"[{""id"": 378, ""name"": ""prison""}, {""id"": 417, ""n...",en,The Shawshank Redemption,Framed in the 1940s for the double murder of h...,136.747729,"[{""name"": ""Castle Rock Entertainment"", ""id"": 97}]",...,1994-09-23,28341469,142.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Fear can hold you prisoner. Hope can set you f...,The Shawshank Redemption,8.5,8205,8.218658
662,63000000,"[{""id"": 18, ""name"": ""Drama""}]",http://www.foxmovies.com/movies/fight-club,550,"[{""id"": 825, ""name"": ""support group""}, {""id"": ...",en,Fight Club,A ticking-time-bomb insomniac and a slippery s...,146.757391,"[{""name"": ""Regency Enterprises"", ""id"": 508}, {...",...,1999-10-15,100853753,139.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Mischief. Mayhem. Soap.,Fight Club,8.3,9413,8.081543
3232,8000000,"[{""id"": 53, ""name"": ""Thriller""}, {""id"": 80, ""n...",,680,"[{""id"": 396, ""name"": ""transporter""}, {""id"": 14...",en,Pulp Fiction,"A burger-loving hit man, his philosophical par...",121.463076,"[{""name"": ""Miramax Films"", ""id"": 14}, {""name"":...",...,1994-10-08,213928762,154.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Just because you are a character doesn't mean ...,Pulp Fiction,8.3,8428,8.060583
3337,6000000,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 80, ""name...",http://www.thegodfather.com/,238,"[{""id"": 131, ""name"": ""italy""}, {""id"": 699, ""na...",en,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...",143.659698,"[{""name"": ""Paramount Pictures"", ""id"": 4}, {""na...",...,1972-03-14,245066411,175.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,An offer you can't refuse.,The Godfather,8.4,5893,8.058304
65,185000000,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 28, ""name...",http://thedarkknight.warnerbros.com/dvdsite/,155,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight,Batman raises the stakes in his war on crime. ...,187.322927,"[{""name"": ""DC Comics"", ""id"": 429}, {""name"": ""L...",...,2008-07-16,1004558444,152.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Why So Serious?,The Dark Knight,8.2,12002,8.03569
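
When only the top few rows are needed, `DataFrame.nlargest` is a possible shortcut for sorting and then taking the head. A minimal sketch on a toy DataFrame with assumed titles and ratings:

```python
import pandas as pd

# Toy data (illustrative only).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "weighted_rating": [7.1, 8.2, 6.5, 7.9],
})

# Sort everything, then keep the first two rows.
top_sorted = toy.sort_values("weighted_rating", ascending=False).head(2)

# Ask directly for the two rows with the largest 'weighted_rating'.
top_nlargest = toy.nlargest(2, "weighted_rating")

print(top_nlargest["title"].tolist())
```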


### Clean data

In [107]:
# Select the 'title' and 'weighted_rating' columns from the sorted DataFrame 'sorted_movies_rated'
cleaned_movies_data = sorted_movies_rated.loc[:, ['title', 'weighted_rating']]

# Select the top rows (head) of the cleaned DataFrame to retrieve the highest-rated movies
cleaned_movies_data.head()

Unnamed: 0,title,weighted_rating
1881,The Shawshank Redemption,8.218658
662,Fight Club,8.081543
3232,Pulp Fiction,8.060583
3337,The Godfather,8.058304
65,The Dark Knight,8.03569


In [108]:
# Convert the cleaned DataFrame 'cleaned_movies_data' to a list of dictionaries
# The parameter 'orient="records"' produces one dictionary per row, keyed by column name
movies_dict = cleaned_movies_data.to_dict(orient='records')

# display the resulting list of dicts
movies_dict

[{'title': 'The Shawshank Redemption', 'weighted_rating': 8.218657798543095},
 {'title': 'Fight Club', 'weighted_rating': 8.081542539940607},
 {'title': 'Pulp Fiction', 'weighted_rating': 8.060582846361697},
 {'title': 'The Godfather', 'weighted_rating': 8.058303506982918},
 {'title': 'The Dark Knight', 'weighted_rating': 8.035690278741058},
 {'title': 'Forrest Gump', 'weighted_rating': 7.967125538522511},
 {'title': 'Inception', 'weighted_rating': 7.965925680796371},
 {'title': 'Interstellar', 'weighted_rating': 7.935481585301346},
 {'title': 'The Empire Strikes Back', 'weighted_rating': 7.905326629938545},
 {'title': "Schindler's List", 'weighted_rating': 7.901460539917588},
 {'title': 'Whiplash', 'weighted_rating': 7.896554351961089},
 {'title': 'The Lord of the Rings: The Return of the King',
  'weighted_rating': 7.88891255997412},
 {'title': 'Spirited Away', 'weighted_rating': 7.86713999258378},
 {'title': 'Star Wars', 'weighted_rating': 7.852992972902218},
 {'title': 'The Godfath