# <p style="font-family:newtimeroman;color:#FFD700;font-weight: bold;font-size:150%;"> IMDB Top 250 Movie Ranking Measurement Systems</p>

# Weighted Average Ratings

* We will build a Recommender System for the IMDb Top 250.
* IMDb publishes weighted vote averages rather than raw averages.
* In IMDb's words, all votes received from users are accepted and considered, but not all votes have the same impact (or 'weight') on the final rating.
* When unusual voting activity is detected, IMDb may apply an alternate weighting calculation to preserve the reliability of its system.
* To keep the rating mechanism effective, IMDb does not disclose the exact method used to generate the rating.
* See also the complete FAQ for IMDb ratings.

<img src="https://logos-world.net/wp-content/uploads/2022/04/IMDb-Logo-700x394.png" >

In [14]:
import pandas as pd
import math
import scipy.stats as st
from sklearn.preprocessing import MinMaxScaler

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 500)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.float_format', lambda x : '%.5f' %x)

df = pd.read_csv('/kaggle/input/imdb-250/movies_metadata.csv',
                 low_memory=False) # to hide dtype warnings
df.info()
df = df[['title', 'vote_average', 'vote_count']]
df.head()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

(45466, 3)

# Sorting by vote average

In [15]:
df.sort_values('vote_average', ascending=False).head(10)
df['vote_count'].describe([0.10, 0.25, 0.50, 0.70, 0.80, 0.90, 0.95, 0.99]).T

count   45460.00000
mean      109.89734
std       491.31037
min         0.00000
10%         1.00000
25%         3.00000
50%        10.00000
70%        25.00000
80%        50.00000
90%       160.00000
95%       434.00000
99%      2183.82000
max     14075.00000
Name: vote_count, dtype: float64

# Keep only movies with more than 400 votes: the mean vote count is about 110, so averages computed well above the mean are more reliable.

In [16]:
df[df['vote_count'] > 400].sort_values('vote_average', ascending=False).head(10)

# Rescale vote_count to the range [1, 10] so it can be combined with vote_average
df['vote_count_score'] = MinMaxScaler(feature_range=(1, 10)).fit_transform(df[['vote_count']])

df['average_count_score'] = df['vote_count_score'] * df['vote_average']
df.sort_values('average_count_score', ascending=False).head(10)

index,title,vote_average,vote_count,vote_count_score,average_count_score
15480,Inception,8.1,14075.0,10.0,81.0
12481,The Dark Knight,8.3,12269.0,8.84519,73.41505
22879,Interstellar,8.1,11187.0,8.15332,66.0419
17818,The Avengers,7.4,12000.0,8.67318,64.18153
14551,Avatar,7.2,12114.0,8.74607,62.97174
26564,Deadpool,7.4,11444.0,8.31766,61.55065
2843,Fight Club,8.3,9678.0,7.18842,59.66388
20051,Django Unchained,7.8,10297.0,7.58423,59.15697
23753,Guardians of the Galaxy,7.9,10014.0,7.40327,58.48582
292,Pulp Fiction,8.3,8670.0,6.54387,54.31414
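
To make the scaling step concrete, here is a minimal sketch of the linear mapping that a min-max scaler with `feature_range=(1, 10)` performs. The helper name `min_max_scale` is hypothetical (it is not part of scikit-learn), and the minimum of 0 and maximum of 14075 are taken from the `describe()` output above.

```python
def min_max_scale(value, data_min, data_max, new_min=1.0, new_max=10.0):
    """Linearly map value from [data_min, data_max] to [new_min, new_max]."""
    return new_min + (value - data_min) * (new_max - new_min) / (data_max - data_min)

# vote_count ranges from 0 to 14075 in this dataset
dark_knight_score = min_max_scale(12269, data_min=0, data_max=14075)
print(round(dark_knight_score, 5))  # 8.84519

# average_count_score is the scaled vote count times vote_average (8.3 here)
dark_knight_product = dark_knight_score * 8.3
print(dark_knight_product)
```

Because the score is a plain product, a movie with a very high vote count can outrank one with a better average, which is why Inception (8.1) sits above The Dark Knight (8.3) in this table.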


# IMDb Weighted Rating (this method was used until 2015)

* weighted_rating = (v / (v + M) * r) + (M / (v + M) * C)
* r = vote average
* v = vote count
* M = minimum votes required to be listed in the top 250
* C = the mean vote across the whole report (currently 7.0)

* Example with r = 8, v = 3000, M = 500, C = 7:
* Part 1: (3000 / (3000 + 500)) * 8 = 6.86
* Part 2: (500 / (3000 + 500)) * 7 = 1.00
* Part 1 + Part 2 = 7.86
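
The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `shrunk_rating` and the list of vote counts are made up for the example, while the values r = 8, M = 500 and C = 7 come from the worked example.

```python
def shrunk_rating(r, v, m, c):
    """Blend a movie's own average r with the global mean c, weighted by vote count v."""
    return (v / (v + m)) * r + (m / (v + m)) * c

# Worked example from the text: r = 8, v = 3000, M = 500, C = 7
example = shrunk_rating(r=8, v=3000, m=500, c=7)
print(round(example, 2))  # 7.86

# The same average (8) with different vote counts: few votes stay close to C,
# many votes move the result towards r
for votes in (0, 50, 500, 5000, 500000):
    print(votes, round(shrunk_rating(8, votes, 500, 7), 3))
```

With zero votes the result equals C, and when v equals M the result is exactly halfway between r and C.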

In [17]:
M = 2500                       # minimum votes required to be listed (our choice here)
C = df['vote_average'].mean()  # mean vote across the whole dataset

def weighted_rating(r, v, M, C):
    # Blend the movie's own average r with the global mean C, weighted by vote count v
    return (v / (v + M) * r) + (M / (v + M) * C)

df['weighted_rating'] = weighted_rating(df['vote_average'], df['vote_count'], M, C)
df.sort_values('weighted_rating', ascending=False).head(10)

index,title,vote_average,vote_count,vote_count_score,average_count_score,weighted_rating
12481,The Dark Knight,8.3,12269.0,8.84519,73.41505,7.84604
314,The Shawshank Redemption,8.5,8358.0,6.34437,53.92714,7.83648
2843,Fight Club,8.3,9678.0,7.18842,59.66388,7.74946
15480,Inception,8.1,14075.0,10.0,81.0,7.72567
292,Pulp Fiction,8.3,8670.0,6.54387,54.31414,7.69978
834,The Godfather,8.5,6024.0,4.85194,41.24146,7.6548
22879,Interstellar,8.1,11187.0,8.15332,66.0419,7.64669
351,Forrest Gump,8.2,8147.0,6.20945,50.91748,7.59377
7000,The Lord of the Rings: The Return of the King,8.1,8226.0,6.25996,50.70571,7.52155
4863,The Lord of the Rings: The Fellowship of the Ring,8.0,8892.0,6.68583,53.48661,7.47731


# Bayesian Average Rating Score (BAR Score)

In [18]:
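def bayesian_average_rating(n, confidence=0.95):
    # n: vote counts per star level, ordered from 1 star up to K stars
    if sum(n) == 0:
        return 0
    K = len(n)
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    N = sum(n)
    first_part = 0.0   # expected rating under the smoothed vote distribution
    second_part = 0.0  # expected squared rating
    for k, n_k in enumerate(n):
        first_part += (k + 1) * (n_k + 1) / (N + K)
        second_part += (k + 1) * (k + 1) * (n_k + 1) / (N + K)
    # Lower bound of the interval around the expected rating
    score = first_part - z * math.sqrt((second_part - first_part * first_part) / (N + K + 1))
    return score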
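
To see what the function computes, the sketch below restates the same calculation with only the standard library, naming the intermediate quantities (smoothed probabilities, mean, variance). The name `bar_score_sketch` is hypothetical, `statistics.NormalDist` stands in for `scipy.stats.norm`, and the star counts are those of The Shawshank Redemption in the ratings data used later in this notebook, reordered from one star to ten.

```python
import math
from statistics import NormalDist

def bar_score_sketch(star_counts, confidence=0.95):
    """star_counts[0] is the number of 1-star votes, star_counts[-1] the top-star votes."""
    K = len(star_counts)
    N = sum(star_counts)
    if N == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # One pseudo-vote is added to every star level before normalising
    probs = [(n_k + 1) / (N + K) for n_k in star_counts]
    mean = sum(star * p for star, p in enumerate(probs, start=1))
    second_moment = sum(star * star * p for star, p in enumerate(probs, start=1))
    variance = second_moment - mean * mean
    # Lower bound: the mean minus z standard errors
    return mean - z * math.sqrt(variance / (N + K + 1))

# Votes for one..ten stars
shawshank = [34733, 4355, 4704, 6561, 13515, 26184, 87368, 273091, 600284, 1295382]
shawshank_score = bar_score_sketch(shawshank)
print(round(shawshank_score, 3))  # about 9.145

# A single 10-star vote gets a much lower score because the interval is wide
single_vote_score = bar_score_sketch([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
print(single_vote_score)
```

The penalty term shrinks as the number of votes grows, so heavily voted movies score close to their plain average while sparsely voted ones are pulled down.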

# We analyze how many votes each star level received. The related dataset is in the ratings CSV.

In [19]:
df = pd.read_csv('/kaggle/input/imdb-ratings/imdb_ratings.csv')
df.info()
df.columns
# Drop the first column ('Unnamed: 0'), a leftover index from the CSV export
df = df.iloc[0:, 1:]
# Pass the counts ordered from one star to ten stars, as the function expects
df['bar_score'] = df.apply(lambda x: bayesian_average_rating(x[['one', 'two', 'three', 'four',
                                                                'five', 'six', 'seven', 'eight',
                                                                'nine', 'ten']]), axis=1)
df.sort_values('bar_score', ascending=False).head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  250 non-null    int64  
 1   id          250 non-null    int64  
 2   movieName   250 non-null    object 
 3   rating      250 non-null    float64
 4   ten         250 non-null    int64  
 5   nine        250 non-null    int64  
 6   eight       250 non-null    int64  
 7   seven       250 non-null    int64  
 8   six         250 non-null    int64  
 9   five        250 non-null    int64  
 10  four        250 non-null    int64  
 11  three       250 non-null    int64  
 12  two         250 non-null    int64  
 13  one         250 non-null    int64  
dtypes: float64(1), int64(12), object(1)
memory usage: 27.5+ KB


index,id,movieName,rating,ten,nine,eight,seven,six,five,four,three,two,one,bar_score
0,111161,1. The Shawshank Redemption (1994),9.2,1295382,600284,273091,87368,26184,13515,6561,4704,4355,34733,9.14539
1,68646,2. The Godfather (1972),9.1,837932,402527,199440,78541,30016,16603,8419,6268,5879,37128,8.94002
3,468569,4. The Dark Knight (2008),9.0,1034863,649123,354610,137748,49483,23237,11429,8082,7173,30345,8.89596
2,71562,3. The Godfather: Part II (1974),9.0,486356,324905,175507,70847,26349,12657,6210,4347,3892,20469,8.8125
4,50083,5. 12 Angry Men (1957),8.9,246765,225437,133998,48341,15773,6278,2866,1723,1478,8318,8.76793
6,167260,7. The Lord of the Rings: The Return of ...,8.9,703093,433087,270113,117411,44760,21818,10873,7987,6554,28990,8.75204
5,108052,6. Schindler's List (1993),8.9,453906,383584,220586,82367,27219,12922,6234,4572,4289,19328,8.74361
11,109830,12. Forrest Gump (1994),8.8,622104,553654,373644,151284,51140,22720,11692,7647,5941,12110,8.69915
12,1375666,13. Inception (2010),8.7,724798,627987,408686,174229,60668,26910,13436,8703,6932,17621,8.69315
10,137523,11. Fight Club (1999),8.8,637087,572654,371752,152295,53059,24755,12648,8606,6948,17435,8.67448
