# **Scoring and sorting the IMDB Top250 List**

# Business problem

### One of the most important problems in e-commerce is correctly calculating the ratings that products receive after a sale. Solving it means greater customer satisfaction for the e-commerce site, more visibility for the sellers' products, and a seamless shopping experience for the buyers. A second problem is ordering the reviews of a product correctly. Misleading reviews directly affect a product's sales, so they cause both financial loss and loss of customers. When these two basic problems are solved, the e-commerce site and the sellers increase their sales, and customers complete the purchasing journey without any problems.

### Purpose: updating the Top 250 list, which many people use as a reference

# Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as st
import math
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)
pd.set_option('display.max_rows', 20)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.set_option('display.expand_frame_repr', False)

# Reading the dataset

In [2]:
df = pd.read_csv('/kaggle/input/movie-metadatacsv/movies_metadata.csv', low_memory=False)
df

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.00,81.00,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.70,5415.00
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.00,104.00,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.90,2413.00
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.00,101.00,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.50,92.00
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.00,127.00,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.10,34.00
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.00,106.00,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.70,173.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",,0.00,90.00,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.00,1.00
45462,False,,0,"[{'id': 18, 'name': 'Drama'}]",,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,"[{'name': 'Sine Olivia', 'id': 19653}]","[{'iso_3166_1': 'PH', 'name': 'Philippines'}]",2011-11-17,0.00,360.00,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.00,3.00
45463,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,"[{'name': 'American World Pictures', 'id': 6165}]","[{'iso_3166_1': 'US', 'name': 'United States o...",2003-08-01,0.00,90.00,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.80,6.00
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,"[{'name': 'Yermoliev', 'id': 88753}]","[{'iso_3166_1': 'RU', 'name': 'Russia'}]",1917-10-21,0.00,87.00,[],Released,,Satan Triumphant,False,0.00,0.00


# Selecting the related variables

In [3]:
df = df[['title', 'vote_average', 'vote_count']].copy()   # copy so that new columns can be added safely later
df.head()

|  | title | vote_average | vote_count |
|---|---|---|---|
| 0 | Toy Story | 7.7 | 5415.0 |
| 1 | Jumanji | 6.9 | 2413.0 |
| 2 | Grumpier Old Men | 6.5 | 92.0 |
| 3 | Waiting to Exhale | 6.1 | 34.0 |
| 4 | Father of the Bride Part II | 5.7 | 173.0 |


### sorting by the variable vote_average

In [4]:
df.sort_values('vote_average', ascending=False).head(20)

|  | title | vote_average | vote_count |
|---|---|---|---|
| 21642 | Ice Age Columbus: Who Were the First Americans? | 10.0 | 1.0 |
| 15710 | If God Is Willing and da Creek Don't Rise | 10.0 | 1.0 |
| 22396 | Meat the Truth | 10.0 | 1.0 |
| 22395 | Marvin Hamlisch: What He Did For Love | 10.0 | 1.0 |
| 35343 | Elaine Stritch: At Liberty | 10.0 | 1.0 |
| 186 | Reckless | 10.0 | 1.0 |
| 45047 | The Human Surge | 10.0 | 1.0 |
| 22377 | The Guide | 10.0 | 1.0 |
| 22346 | هیچ کجا هیچ کس | 10.0 | 1.0 |
| 1634 | Other Voices Other Rooms | 10.0 | 1.0 |


### filtering the variable 'vote_count' by a minimum number of votes
### the basic motivation is to recommend a fixed number (250) of movies

In [5]:
df['vote_count'].describe([0.1, 0.25, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99]).T

count   45460.00
mean      109.90
std       491.31
min         0.00
10%         1.00
25%         3.00
50%        10.00
70%        25.00
80%        50.00
90%       160.00
95%       434.00
99%      2183.82
max     14075.00
Name: vote_count, dtype: float64

### filtering according to vote_count > 400

In [6]:
df[df['vote_count'] > 400].sort_values('vote_average', ascending=False)

|  | title | vote_average | vote_count |
|---|---|---|---|
| 10309 | Dilwale Dulhania Le Jayenge | 9.10 | 661.00 |
| 40251 | Your Name. | 8.50 | 1030.00 |
| 834 | The Godfather | 8.50 | 6024.00 |
| 314 | The Shawshank Redemption | 8.50 | 8358.00 |
| 1152 | One Flew Over the Cuckoo's Nest | 8.30 | 3001.00 |
| ... | ... | ... | ... |
| 1486 | Speed 2: Cruise Control | 4.10 | 439.00 |
| 28207 | The Boy Next Door | 4.10 | 1034.00 |
| 18101 | Jack and Jill | 4.00 | 619.00 |
| 21238 | Sharknado | 3.80 | 484.00 |


### scaling the values of the variable vote_count to the range 1-10 (min-max scaling)

In [7]:
df['vote_count_score'] = MinMaxScaler(feature_range=(1, 10)).fit_transform(df[['vote_count']])
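
The scaler maps the smallest vote count to 1 and the largest to 10, and places every other value linearly in between. A minimal sketch of that formula in plain Python, assuming the minimum and maximum vote counts are 0 and 14075 as reported by `describe()` above (the helper name `min_max_scale` is hypothetical and not part of scikit-learn):

```python
def min_max_scale(x, x_min, x_max, low=1.0, high=10.0):
    """Linearly map x from [x_min, x_max] to [low, high]."""
    return low + (x - x_min) * (high - low) / (x_max - x_min)

# Vote counts taken from the dataset summary: min = 0, max = 14075
inception = min_max_scale(14075, 0, 14075)
dark_knight = min_max_scale(12269, 0, 14075)
print(round(inception, 2), round(dark_knight, 2))   # prints: 10.0 8.85
```

These two values should agree with the `vote_count_score` column that the scaler produces for the same films.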

### multiply the scaled vote count by the vote average and create a new variable named 'average_count_score'

In [8]:
df['average_count_score'] = df['vote_count_score'] * df['vote_average']
df.sort_values('average_count_score', ascending=False).head(20)

|  | title | vote_average | vote_count | vote_count_score | average_count_score |
|---|---|---|---|---|---|
| 15480 | Inception | 8.1 | 14075.0 | 10.0 | 81.0 |
| 12481 | The Dark Knight | 8.3 | 12269.0 | 8.85 | 73.42 |
| 22879 | Interstellar | 8.1 | 11187.0 | 8.15 | 66.04 |
| 17818 | The Avengers | 7.4 | 12000.0 | 8.67 | 64.18 |
| 14551 | Avatar | 7.2 | 12114.0 | 8.75 | 62.97 |
| 26564 | Deadpool | 7.4 | 11444.0 | 8.32 | 61.55 |
| 2843 | Fight Club | 8.3 | 9678.0 | 7.19 | 59.66 |
| 20051 | Django Unchained | 7.8 | 10297.0 | 7.58 | 59.16 |
| 23753 | Guardians of the Galaxy | 7.9 | 10014.0 | 7.4 | 58.49 |
| 292 | Pulp Fiction | 8.3 | 8670.0 | 6.54 | 54.31 |
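
Multiplying the two columns lets the vote count dominate the ranking: a heavily voted film with a moderate average can outrank a higher-rated film with fewer votes. A short illustration using three rows copied from the output tables in this notebook (the tuple layout and variable names are only for this example):

```python
# (title, vote_average, vote_count_score) copied from the notebook's output tables
movies = [
    ("The Godfather", 8.5, 4.85),
    ("The Avengers", 7.4, 8.67),
    ("Avatar", 7.2, 8.75),
]

# Rank by the product of the two columns, highest first
ranked = sorted(movies, key=lambda m: m[1] * m[2], reverse=True)
for title, avg, count_score in ranked:
    print(f"{title:15s} average={avg}  product={avg * count_score:.2f}")
```

The Godfather has the highest average of the three but ends up last, which is one motivation for trying the weighted rating in the next section.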


# IMDB weighted rating

### weighted rating = (v/(v+M) * r) + (M/(v+M) * c)
#### r = vote average
#### v = vote count
#### M = minimum votes required to be listed in the top250
#### c = mean vote across the whole report (7.0 in the published formula; the code below uses the mean of vote_average in this dataset instead)
#### until 2015, this formula was used to calculate the imdb ratings

### defining a function for the above formula

In [9]:
def weighted_rating(r, v, M, c):
    return (v/(v+M) * r) + (M/(v+M) * c)

M = 2500
c = df['vote_average'].mean()
weighted_rating(7.4, 11444, M, c)   # for the film 'deadpool'

7.080544896574546

In [10]:
weighted_rating(8.1, 14075.00, M, c)   # for the film 'inception'

7.725672279809078

In [11]:
weighted_rating(8.5, 8358.00, M, c)   # for the film 'The Shawshank Redemption'

7.83648167598411
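
The formula is a weighted mean of the movie's own average r and the global mean c, where the weight on r is v/(v+M). A small sketch of how that weight grows with the vote count, assuming the same M = 2500 as in the cells above (the helper name `own_rating_weight` is hypothetical):

```python
def own_rating_weight(v, M):
    """Fraction of the weighted rating contributed by the movie's own average r."""
    return v / (v + M)

M = 2500  # same threshold as used in this notebook
for v in (0, 250, 2500, 25000):
    w = own_rating_weight(v, M)
    print(f"v={v:>5}  weight on r={w:.3f}  weight on c={1 - w:.3f}")
```

With v = M the two means count equally. Movies with far fewer votes than M are pulled almost entirely toward c, so a film with a single 10.0 vote no longer reaches the top of the list.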

### applying the defined function to the whole data

In [12]:
df['weighted_rating'] = weighted_rating(df['vote_average'], df['vote_count'], M, c)
df.sort_values('weighted_rating', ascending=False).head(20)

|  | title | vote_average | vote_count | vote_count_score | average_count_score | weighted_rating |
|---|---|---|---|---|---|---|
| 12481 | The Dark Knight | 8.3 | 12269.0 | 8.85 | 73.42 | 7.85 |
| 314 | The Shawshank Redemption | 8.5 | 8358.0 | 6.34 | 53.93 | 7.84 |
| 2843 | Fight Club | 8.3 | 9678.0 | 7.19 | 59.66 | 7.75 |
| 15480 | Inception | 8.1 | 14075.0 | 10.0 | 81.0 | 7.73 |
| 292 | Pulp Fiction | 8.3 | 8670.0 | 6.54 | 54.31 | 7.7 |
| 834 | The Godfather | 8.5 | 6024.0 | 4.85 | 41.24 | 7.65 |
| 22879 | Interstellar | 8.1 | 11187.0 | 8.15 | 66.04 | 7.65 |
| 351 | Forrest Gump | 8.2 | 8147.0 | 6.21 | 50.92 | 7.59 |
| 7000 | The Lord of the Rings: The Return of the King | 8.1 | 8226.0 | 6.26 | 50.71 | 7.52 |
| 4863 | The Lord of the Rings: The Fellowship of the Ring | 8.0 | 8892.0 | 6.69 | 53.49 | 7.48 |


# Bayesian average rating score

### defining a Bayesian Average Rating function

In [13]:
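def bayesian_average_rating(n, confidence=0.95):
    """Lower bound of the confidence interval for the mean star rating.

    n[k] is the number of votes for (k + 1) stars, ordered from 1 star upward.
    """
    N = sum(n)                                   # total number of votes
    if N == 0:
        return 0
    K = len(n)                                   # number of star levels
    z = st.norm.ppf(1 - (1 - confidence) / 2)    # two-sided normal quantile
    first_part = 0.0                             # expected rating
    second_part = 0.0                            # expected squared rating
    # iterate over the values directly so that lists and pandas Series behave the same
    for k, n_k in enumerate(n):
        first_part += (k + 1) * (n_k + 1) / (N + K)
        second_part += (k + 1) * (k + 1) * (n_k + 1) / (N + K)
    score = first_part - z * math.sqrt((second_part - first_part * first_part) / (N + K + 1))
    return score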

### calculating the IMDB score for the film 'The Shawshank Redemption'

In [14]:
bayesian_average_rating([34733, 4355, 4704, 6561, 13515, 26183, 87368, 273082, 600260, 1295351]) 
# the numbers are the distribution of stars given

9.14538444560111

### calculating the IMDB score for the film 'The Godfather'

In [15]:
bayesian_average_rating([37128, 5879, 6268, 8419, 16603, 30016, 78538, 199430, 402518, 837905])

8.940007324860396
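
To see what this score adds over a plain average, it helps to compare a film with one vote against a film with many votes. The sketch below re-implements the same formula with only the standard library (`statistics.NormalDist` stands in for `scipy.stats.norm`). The names `bar_lower_bound` and `plain_mean` are hypothetical, and the vote counts in the comparison are made up for illustration:

```python
import math
from statistics import NormalDist

def bar_lower_bound(n, confidence=0.95):
    """Same formula as bayesian_average_rating, using only the standard library."""
    N, K = sum(n), len(n)
    if N == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    mean = sum((k + 1) * (n_k + 1) / (N + K) for k, n_k in enumerate(n))
    mean_sq = sum((k + 1) ** 2 * (n_k + 1) / (N + K) for k, n_k in enumerate(n))
    return mean - z * math.sqrt((mean_sq - mean ** 2) / (N + K + 1))

def plain_mean(n):
    """Ordinary average star rating."""
    return sum((k + 1) * n_k for k, n_k in enumerate(n)) / sum(n)

one_vote = [0] * 9 + [1]          # a single 10-star vote (made-up example)
many_votes = [0] * 9 + [10000]    # ten thousand 10-star votes (made-up example)

for label, n in (("1 vote", one_vote), ("10000 votes", many_votes)):
    print(label, round(plain_mean(n), 2), round(bar_lower_bound(n), 2))
```

Both films have a plain mean of 10, but the score of the single-vote film is pulled far below that of the heavily voted one, because each star level receives one pseudo-vote and the interval is wide when N is small.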

### applying the bayesian_average_rating function to the new dataset to calculate the IMDB scores of the movies

In [16]:
new_df = pd.read_csv('/kaggle/input/imdb-ratings/imdb_ratings.csv')
new_df = new_df.iloc[:, 1:]   # drop the first column
new_df.head()

|  | id | movieName | rating | ten | nine | eight | seven | six | five | four | three | two | one |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 111161 | 1. The Shawshank Redemption (1994) | 9.2 | 1295382 | 600284 | 273091 | 87368 | 26184 | 13515 | 6561 | 4704 | 4355 | 34733 |
| 1 | 68646 | 2. The Godfather (1972) | 9.1 | 837932 | 402527 | 199440 | 78541 | 30016 | 16603 | 8419 | 6268 | 5879 | 37128 |
| 2 | 71562 | 3. The Godfather: Part II (1974) | 9.0 | 486356 | 324905 | 175507 | 70847 | 26349 | 12657 | 6210 | 4347 | 3892 | 20469 |
| 3 | 468569 | 4. The Dark Knight (2008) | 9.0 | 1034863 | 649123 | 354610 | 137748 | 49483 | 23237 | 11429 | 8082 | 7173 | 30345 |
| 4 | 50083 | 5. 12 Angry Men (1957) | 8.9 | 246765 | 225437 | 133998 | 48341 | 15773 | 6278 | 2866 | 1723 | 1478 | 8318 |


In [17]:
# the function expects counts ordered from 1 star to 10 stars
star_cols = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']
new_df['bar_score'] = new_df[star_cols].apply(bayesian_average_rating, axis=1)
new_df.sort_values('bar_score', ascending=False).head(20)

|  | id | movieName | rating | ten | nine | eight | seven | six | five | four | three | two | one | bar_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 111161 | 1. The Shawshank Redemption (1994) | 9.2 | 1295382 | 600284 | 273091 | 87368 | 26184 | 13515 | 6561 | 4704 | 4355 | 34733 | 9.15 |
| 1 | 68646 | 2. The Godfather (1972) | 9.1 | 837932 | 402527 | 199440 | 78541 | 30016 | 16603 | 8419 | 6268 | 5879 | 37128 | 8.94 |
| 3 | 468569 | 4. The Dark Knight (2008) | 9.0 | 1034863 | 649123 | 354610 | 137748 | 49483 | 23237 | 11429 | 8082 | 7173 | 30345 | 8.9 |
| 2 | 71562 | 3. The Godfather: Part II (1974) | 9.0 | 486356 | 324905 | 175507 | 70847 | 26349 | 12657 | 6210 | 4347 | 3892 | 20469 | 8.81 |
| 4 | 50083 | 5. 12 Angry Men (1957) | 8.9 | 246765 | 225437 | 133998 | 48341 | 15773 | 6278 | 2866 | 1723 | 1478 | 8318 | 8.77 |
| 6 | 167260 | 7. The Lord of the Rings: The Return of ... | 8.9 | 703093 | 433087 | 270113 | 117411 | 44760 | 21818 | 10873 | 7987 | 6554 | 28990 | 8.75 |
| 5 | 108052 | 6. Schindler's List (1993) | 8.9 | 453906 | 383584 | 220586 | 82367 | 27219 | 12922 | 6234 | 4572 | 4289 | 19328 | 8.74 |
| 11 | 109830 | 12. Forrest Gump (1994) | 8.8 | 622104 | 553654 | 373644 | 151284 | 51140 | 22720 | 11692 | 7647 | 5941 | 12110 | 8.7 |
| 12 | 1375666 | 13. Inception (2010) | 8.7 | 724798 | 627987 | 408686 | 174229 | 60668 | 26910 | 13436 | 8703 | 6932 | 17621 | 8.69 |
| 10 | 137523 | 11. Fight Club (1999) | 8.8 | 637087 | 572654 | 371752 | 152295 | 53059 | 24755 | 12648 | 8606 | 6948 | 17435 | 8.67 |


### As a result, the Bayesian average rating scores for the IMDB top250 list were calculated successfully, and the resulting ranking is very similar to the actual IMDB top250 list.

# Thank you for checking my notebook!