## Notebook Overview

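This notebook loads the IMDb title.basics and title.ratings datasets, keeps only movie and tvMovie titles, merges in their ratings, and computes an IMDb-style weighted rating for each title.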

## Imports & Configurations

In [1]:
import pandas as pd

In [10]:
TITLE_RATINGS_PATH = r".\data\title.ratings.tsv"
TITLE_BASICS_PATH = r".\data\title.basics.tsv"

AVG_RATING = 3
RATING_COUNT = 50000

## Data Analysis & Feature Engineering

In [None]:
title_basics_df = pd.read_csv(TITLE_BASICS_PATH, sep="\t", low_memory=False, na_values="\\N", on_bad_lines='skip')
title_ratings_df = pd.read_csv(TITLE_RATINGS_PATH, sep="\t", low_memory=False, na_values="\\N", on_bad_lines='skip')
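
The two read_csv calls above load every column of both files. As a minimal sketch, assuming only a few columns are needed later, `usecols` and `dtype` can reduce memory use. The in-memory TSV below is made-up sample data standing in for title.ratings.tsv, and `load_ratings` is a hypothetical helper name, not part of this notebook.

```python
import io
import pandas as pd

# Made-up two-row sample in the same layout as title.ratings.tsv
# (assumption: tab-separated, "\N" marks missing values, as in the cells above).
SAMPLE_TSV = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.7\t2100\n"
    "tt0000002\t\\N\t\\N\n"
)

def load_ratings(source):
    """Hypothetical helper: read only the needed columns with explicit dtypes."""
    return pd.read_csv(
        source,
        sep="\t",
        na_values="\\N",
        usecols=["tconst", "averageRating", "numVotes"],
        # "Int64" is the nullable integer dtype, so missing vote counts stay NA
        dtype={"averageRating": "float64", "numVotes": "Int64"},
    )

sample_df = load_ratings(io.StringIO(SAMPLE_TSV))
print(sample_df.dtypes)
```

The same helper could be pointed at TITLE_RATINGS_PATH instead of the sample buffer; whether the savings matter depends on how many columns the later cells actually use.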

In [13]:
df_list = [title_basics_df, title_ratings_df]

for df in df_list:
    print("First 5 rows:\n")
    print(df.head())
    print("Shape:\n")
    print(df.shape)
    print("Information:\n")
    df.info()  # info() prints directly and returns None, so it is not wrapped in print()
    print("Description:\n")
    print(df.describe())

First 5 rows:

      tconst titleType            primaryTitle           originalTitle  \
0  tt0000001     short              Carmencita              Carmencita   
1  tt0000002     short  Le clown et ses chiens  Le clown et ses chiens   
2  tt0000003     short            Poor Pierrot          Pauvre Pierrot   
3  tt0000004     short             Un bon bock             Un bon bock   
4  tt0000005     short        Blacksmith Scene        Blacksmith Scene   

   isAdult  startYear  endYear runtimeMinutes                    genres  
0        0     1894.0      NaN              1         Documentary,Short  
1        0     1892.0      NaN              5           Animation,Short  
2        0     1892.0      NaN              5  Animation,Comedy,Romance  
3        0     1892.0      NaN             12           Animation,Short  
4        0     1893.0      NaN              1                     Short  
Shape:

(11957318, 9)
Information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11957318 en

In [19]:
title_basics_df["titleType"].value_counts()

titleType
tvEpisode       9207148
short           1087439
movie            727974
video            317000
tvSeries         288708
tvMovie          152585
tvMiniSeries      65476
tvSpecial         54813
videoGame         45396
tvShort           10778
tvPilot               1
Name: count, dtype: int64

In this notebook, I focus only on movies, so I filter the dataset down to the movie and tvMovie title types.

In [22]:
movie_ratings_df = title_basics_df[title_basics_df["titleType"].isin(["movie", "tvMovie"])]
movie_ratings_df.shape  # Expected rows: movie + tvMovie = 727974 + 152585 = 880559

(880559, 9)

In [28]:
movies_with_ratings = movie_ratings_df.merge(
    title_ratings_df[['tconst', 'averageRating', 'numVotes']],
    on='tconst',
    how='left'
)
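
Because this is a left merge, movies without a row in the ratings file are kept and get NaN in the rating columns. A minimal sketch of how that can be checked, using made-up toy frames rather than the real data; `indicator=True` is a standard `DataFrame.merge` option that adds a `_merge` column.

```python
import pandas as pd

# Toy stand-ins for the filtered basics frame and the ratings frame (made-up values).
basics = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"], "primaryTitle": ["A", "B", "C"]})
ratings = pd.DataFrame(
    {"tconst": ["tt1", "tt3"], "averageRating": [7.1, 5.4], "numVotes": [900, 40]}
)

merged = basics.merge(ratings, on="tconst", how="left", indicator=True)

# "left_only" marks titles with no rating row; "both" marks matched titles.
match_counts = merged["_merge"].value_counts()
print(match_counts)

# Keep only rated titles if unrated ones are not needed downstream.
rated_only = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(rated_only)
```

Switching to `how="inner"` would drop the unrated titles at merge time instead; the left merge keeps them visible so the gap can be measured first.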

In [29]:
movies_with_ratings.head()

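|   | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | averageRating | numVotes |
|---|--------|-----------|--------------|---------------|---------|-----------|---------|----------------|--------|---------------|----------|
| 0 | tt0000009 | movie | Miss Jerry | Miss Jerry | 0 | 1894.0 | NaN | 45 | Romance | 5.3 | 228.0 |
| 1 | tt0000147 | movie | The Corbett-Fitzsimmons Fight | The Corbett-Fitzsimmons Fight | 0 | 1897.0 | NaN | 100 | Documentary,News,Sport | 5.3 | 572.0 |
| 2 | tt0000502 | movie | Bohemios | Bohemios | 0 | 1905.0 | NaN | 100 | NaN | 3.7 | 23.0 |
| 3 | tt0000574 | movie | The Story of the Kelly Gang | The Story of the Kelly Gang | 0 | 1906.0 | NaN | 70 | Action,Adventure,Biography | 6.0 | 1028.0 |
| 4 | tt0000591 | movie | The Prodigal Son | L'enfant prodigue | 0 | 1907.0 | NaN | 90 | Drama | 5.3 | 34.0 |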


In [30]:
movies_with_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 880559 entries, 0 to 880558
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          880559 non-null  object 
 1   titleType       880559 non-null  object 
 2   primaryTitle    880556 non-null  object 
 3   originalTitle   880556 non-null  object 
 4   isAdult         880559 non-null  int64  
 5   startYear       768689 non-null  float64
 6   endYear         0 non-null       float64
 7   runtimeMinutes  564485 non-null  object 
 8   genres          789505 non-null  object 
 9   averageRating   392873 non-null  float64
 10  numVotes        392873 non-null  float64
dtypes: float64(4), int64(1), object(6)
memory usage: 73.9+ MB
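
The info() output above shows runtimeMinutes stored as object and fewer than half of the rows carrying a rating. A minimal sketch of one way to clean this up, on made-up toy rows; the stray non-numeric runtime value is an assumption about why the column was parsed as object, not something verified against the real file.

```python
import pandas as pd

toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "runtimeMinutes": ["45", "Drama", None],  # assumed: a stray non-numeric value
    "averageRating": [5.3, None, 6.0],
    "numVotes": [228.0, None, 1028.0],
})

# Coerce runtimes to numbers; anything unparseable becomes NaN.
toy["runtimeMinutes"] = pd.to_numeric(toy["runtimeMinutes"], errors="coerce")

# Drop rows without a rating, then store vote counts as integers.
rated = toy.dropna(subset=["averageRating", "numVotes"]).copy()
rated["numVotes"] = rated["numVotes"].astype("int64")
print(rated)
```

Dropping unrated rows before computing a weighted rating also keeps NaN values out of the result, though whether to drop or keep them depends on what the later analysis needs.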


In [26]:
def calculate_weighted_rating(df, min_votes=RATING_COUNT):
    """
    Calculate the IMDb weighted rating (WR) for each title.
    
    Args:
        df (pd.DataFrame): Movies with ratings; must include at least the columns
            ['tconst', 'averageRating', 'numVotes'].
        min_votes (int): Minimum-votes threshold (m).
        
    Returns:
        pd.DataFrame: A copy of df with a 'weighted_rating' column added.
    """
    df = df.copy()
    
    # Global average rating (C)
    C = df['averageRating'].mean()
    m = min_votes
    
    # Compute the weighted rating
    df['weighted_rating'] = (
        (df['numVotes'] / (df['numVotes'] + m)) * df['averageRating'] +
        (m / (df['numVotes'] + m)) * C
    )
    
    return df
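
To make the formula concrete, here is a small sketch on made-up numbers. With v = numVotes, R = averageRating, m = min_votes and C = the mean rating, the function above computes WR = v/(v+m) * R + m/(v+m) * C. The threshold m = 100 and the three toy titles are chosen only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "averageRating": [9.5, 8.0, 5.0],
    "numVotes": [10, 10_000, 500],
})

m = 100                          # toy vote threshold (illustrative only)
C = toy["averageRating"].mean()  # mean rating across the toy frame: 7.5
v, R = toy["numVotes"], toy["averageRating"]

# Same expression as in calculate_weighted_rating above
toy["weighted_rating"] = (v / (v + m)) * R + (m / (v + m)) * C

ranked = toy.sort_values("weighted_rating", ascending=False)
print(ranked)
```

The title rated 9.5 from only 10 votes is pulled down to about 7.68, so it ranks below the title rated 8.0 from 10,000 votes, which stays at roughly 8.00. Titles with few votes shrink toward C, while heavily voted titles keep close to their own average.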

In [27]:
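# Pass the merged frame: movie_ratings_df has no 'averageRating' column,
# so calling the function on it raises KeyError: 'averageRating'.
movies_with_wr = calculate_weighted_rating(movies_with_ratings)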