# IMDb Dataset Cleaning

In this notebook, we enrich the CMU Movie Summary dataset with IMDb data, specifically the IMDb average rating and number of votes. Rather than examining the entire IMDb dataset in detail, our goal is to supplement the CMU dataset with key performance indicators from IMDb that will help us better assess movie success. By cleaning and aligning the relevant IMDb features, we aim to create a more comprehensive dataset for the analysis in Milestone 3.

### Loading the Datasets

In [51]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
import sys
import pickle

In [52]:
data_folder = '../data/'
pickle_folder = data_folder + 'pickle/'
imdb_folder = data_folder + 'IMDB/'

In [44]:
title_basics = pd.read_csv(imdb_folder + 'title.basics.tsv', sep='\t', low_memory=False)
title_ratings = pd.read_csv(imdb_folder + 'title.ratings.tsv', sep='\t', low_memory=False)

In [53]:
with open(pickle_folder + 'movies_clean.p', 'rb') as f:
    movies = pickle.load(f)

display(title_basics.sample(5))
display(title_ratings.sample(5))

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
9216141,tt5724742,tvEpisode,Episode #3.10,Episode #3.10,0,2001,\N,\N,Sport
9866402,tt7186502,tvEpisode,Episode #1.4,Episode #1.4,0,1992,\N,50,Documentary
1685921,tt11295786,video,Midnight Prowl 11,Midnight Prowl 11,1,2007,\N,148,Adult
8658795,tt4462024,tvEpisode,Die Eröffnungsfeier,Die Eröffnungsfeier,0,2015,\N,\N,Romance
9548989,tt6477664,tvEpisode,Dark Valol,Dark Valol,0,2017,\N,3,"Adventure,Animation,Comedy"


Unnamed: 0,tconst,averageRating,numVotes
791532,tt1642852,7.0,9
497030,tt10574224,7.7,57
1218740,tt4767246,3.3,10
1257728,tt5375084,7.0,10
1164478,tt3899186,8.8,26


## 1. Cleaning IMDb Features

- First, we noticed that the IMDb dataset includes several types of media besides films, such as TV episodes and videos. For our analysis, we keep only the entries whose titleType is 'movie'. We also retain only the relevant features: primaryTitle and startYear from title.basics (which will let us merge with the CMU dataset later), and averageRating and numVotes from title.ratings, which provide an additional metric of film success.

In [54]:
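# Keep only films and the columns needed for the merge; .copy() avoids modifying a view of title_basics.
title_basics_movies = title_basics.loc[title_basics['titleType'] == 'movie', ['tconst', 'primaryTitle', 'startYear']].copy()
# IMDb marks missing values with '\N': coerce them to NaN and drop those rows before casting to int.
title_basics_movies['startYear'] = pd.to_numeric(title_basics_movies['startYear'], errors='coerce')
title_basics_movies = title_basics_movies.dropna(subset=['primaryTitle', 'startYear'])
title_basics_movies['startYear'] = title_basics_movies['startYear'].astype(int)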
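
As a minimal sketch of the filtering step described above, the toy frame below stands in for title.basics (its rows are invented for illustration). It shows how `pd.to_numeric(..., errors='coerce')` turns IMDb's `\N` placeholder into NaN, which makes rows with an unknown year easy to count or drop.

```python
import pandas as pd

# Toy stand-in for title.basics; the rows are invented for illustration.
toy = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "titleType": ["movie", "tvEpisode", "movie"],
    "primaryTitle": ["A", "B", "C"],
    "startYear": ["1999", "2001", "\\N"],  # '\N' is IMDb's missing-value marker
})

# Keep only films and the columns needed later.
toy_movies = toy.loc[
    toy["titleType"] == "movie", ["tconst", "primaryTitle", "startYear"]
].copy()

# Non-numeric years (here '\N') become NaN instead of raising an error.
toy_movies["startYear"] = pd.to_numeric(toy_movies["startYear"], errors="coerce")

n_missing_year = int(toy_movies["startYear"].isna().sum())
print(n_missing_year)
```

Counting the NaN values before dropping or filling them gives a quick sense of how many films cannot be matched by year.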

- Then, we merge title_basics_movies and title_ratings to obtain a single IMDb dataframe with the desired features.

In [55]:
imdb_data = title_basics_movies.merge(title_ratings, on='tconst', how='left')
imdb_data.sample(5)

Unnamed: 0,tconst,primaryTitle,startYear,averageRating,numVotes
213807,tt0498644,Bandeira Branca de Oxalá,1968,,
351311,tt1579881,Srisailam,2009,6.2,8.0
626614,tt6539320,Firewall the Financial Crisis of 2007-2013,2012,,
436865,tt2402115,Fugue State,2011,,
278222,tt12348920,Quarantine Cat Film Fest,2020,6.8,13.0
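
The sample above contains films with no rating because the merge uses `how='left'`. The sketch below, on invented toy data, illustrates this behaviour and uses the `indicator` option of `DataFrame.merge` to count unmatched rows. It also suggests why numVotes is displayed as a float: the column has to hold NaN for unrated films.

```python
import pandas as pd

# Invented toy tables standing in for title_basics_movies and title_ratings.
basics = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "primaryTitle": ["A", "B", "C"],
})
ratings = pd.DataFrame({
    "tconst": ["tt001"],
    "averageRating": [7.1],
    "numVotes": [120],
})

# A left merge keeps every film; indicator=True adds a '_merge' column
# telling whether a row was found in both tables or only in the left one.
joined = basics.merge(ratings, on="tconst", how="left", indicator=True)

unrated = int((joined["_merge"] == "left_only").sum())
print(unrated)
print(joined["numVotes"].dtype)
```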


## 2. Combining CMU Movie Summary and IMDb dataset 

- Now we can merge the CMU Movie Summary and IMDb datasets. We start by mapping movies between the two datasets. As there is no unique ID common to both, we use the combination of ['Movie_name', 'Year'] from CMU and ['primaryTitle', 'startYear'] from IMDb to identify which movies should be matched.

In [56]:
movies['Movie_box_office_revenue'] = pd.to_numeric(movies['Movie_box_office_revenue'], errors='coerce')

merged_data = movies.merge(
    imdb_data[['primaryTitle', 'startYear', 'averageRating', 'numVotes']],
    left_on=['Movie_name', 'Year'],
    right_on=['primaryTitle', 'startYear'],
    how='left'
).drop(columns=['primaryTitle', 'startYear'])

merged_data.sample(5)

Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_languages_(Freebase_ID:name_tuples),Movie_countries_(Freebase_ID:name_tuples),Year,Year_Interval,Genres_0,Genres_1,Genres_2,averageRating,numVotes
29711,31517122,Lapland Odyssey,2010-09-10,4762640.0,"{""/m/06mp7"": ""Swedish Language"", ""/m/01gp_d"": ...","{""/m/02vzc"": ""Finland"", ""/m/03rj0"": ""Iceland"",...",2010,1995-2015,"""Romance Film""","""Drama""","""Comedy film""",6.8,6590.0
35129,9006010,Zameer: The Awakening of a Soul,1997-05-16,,"{""/m/03k50"": ""Hindi Language""}","{""/m/03rk0"": ""India""}",1997,1995-2015,"""Drama""",,,4.1,46.0
62491,32871914,Three in One,1957,,"{""/m/02h40lc"": ""English Language""}","{""/m/0chghy"": ""Australia""}",1957,1955-1975,"""Drama""",,,,
44897,31997075,"Who's Counting? Marilyn Waring On Sex, Lies An...",1995-10-22,,"{""/m/064_8sq"": ""French Language"", ""/m/02h40lc""...","{""/m/0d060g"": ""Canada""}",1995,1975-1995,"""Indie""","""Documentary""",,,
53674,5151199,Riens du tout,1992,,"{""/m/064_8sq"": ""French Language""}","{""/m/0f8l9c"": ""France""}",1992,1975-1995,"""Comedy""",,,,
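
Because the match relies on a (title, year) pair rather than a unique ID, two IMDb entries that share the same title and year would duplicate the corresponding CMU row in a left merge. The following is a hedged sketch of how this could be checked. The toy data is invented, and whether such duplicates actually occur in the real files is not verified here.

```python
import pandas as pd

# Invented toy data: two IMDb entries share the same (title, year) key.
cmu = pd.DataFrame({"Movie_name": ["Film A", "Film B"], "Year": [1996, 1979]})
imdb = pd.DataFrame({
    "primaryTitle": ["Film A", "Film A", "Film B"],
    "startYear": [1996, 1996, 1979],
    "averageRating": [7.0, 6.0, 8.0],
})

# Flag every IMDb row whose (title, year) key appears more than once.
dup_keys = imdb.duplicated(subset=["primaryTitle", "startYear"], keep=False)
n_dup_rows = int(dup_keys.sum())

merged = cmu.merge(
    imdb,
    left_on=["Movie_name", "Year"],
    right_on=["primaryTitle", "startYear"],
    how="left",
)

# More rows after the merge than before means some films were duplicated.
print(n_dup_rows, len(cmu), len(merged))
```

Comparing `len(movies)` with `len(merged_data)` in the notebook would be a quick way to see whether the real merge is affected.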


- As with the box office revenue, the average rating is not available for all movies: slightly under half of the ratings are missing (almost 33,000 movies have no rating). Although this is not ideal, coverage is still far better than for the box office revenue, which has a valid value for only ~8,000 movies.

In [57]:
films_with_ratings = merged_data['averageRating'].notna().sum()
print(f"Nb of films with IMDb ratings: {films_with_ratings}")
films_without_ratings = merged_data['averageRating'].isna().sum()
print(f"Nb of films without IMDb ratings: {films_without_ratings}")

movies_clean = merged_data.dropna(subset=['averageRating','numVotes'])
movies_clean.sample(5)

Nb of films with IMDb ratings: 39553
Nb of films without IMDb ratings: 32744


Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_languages_(Freebase_ID:name_tuples),Movie_countries_(Freebase_ID:name_tuples),Year,Year_Interval,Genres_0,Genres_1,Genres_2,averageRating,numVotes
71439,1291727,Booty Call,1997-02-26,20050376.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}",1997,1995-2015,"""Sex comedy""","""Comedy""",,5.5,9657.0
46672,17377535,Kuxa Kanema: The Birth of Cinema,2003,,{},"{""/m/05r4w"": ""Portugal""}",2003,1995-2015,"""Documentary""",,,6.4,43.0
56514,79848,Bachelor Mother,1939-06-30,1975000.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}",1939,1935-1955,"""Comedy""","""Black-and-white""",,7.5,4692.0
10098,22994083,Chief Crazy Horse,1955,1750000.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}",1955,1935-1955,"""Action/Adventure""","""Indian Western""","""Western""",6.1,703.0
54787,7832768,The Last Temptation of Christ,1988-08-12,8373585.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...",1988,1975-1995,"""Drama""","""Hagiography""",,7.5,64115.0
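
The share of missing ratings can be checked directly from the two counts printed above. This is a small arithmetic sketch using only those printed numbers.

```python
# Counts copied from the cell output above.
with_ratings = 39553
without_ratings = 32744

total = with_ratings + without_ratings
share_missing = without_ratings / total

print(f"{total} rows, {share_missing:.1%} without IMDb ratings")
```

On the dataframe itself, `merged_data['averageRating'].isna().mean()` should give the same proportion.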


- Finally, we update all previously cleaned and preprocessed dataframes with the new features averageRating and numVotes.

In [58]:
with open(pickle_folder + 'movies_clean.p', 'wb') as f:
    pickle.dump(movies_clean, f)

def assign_rating_numvotes(df1):
    # Column assignment aligns on index labels, so df1 must share its index with movies_clean.
    df1['averageRating'] = movies_clean['averageRating']
    df1['numVotes'] = movies_clean['numVotes']
    return df1

# Update each previously saved dataframe, closing the read handle before reopening the file for writing.
pickle_names = ['movies_clean_with_season.p', 'movies_countries_exploded.p',
                'movies_languages_exploded.p', 'movies_genres_exploded.p']
for name in pickle_names:
    with open(pickle_folder + name, 'rb') as f:
        df = pickle.load(f)
    with open(pickle_folder + name, 'wb') as f:
        pickle.dump(assign_rating_numvotes(df), f)
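
The update step assigns the new columns by index label, which only gives the intended result when each dataframe still carries the index of the row it came from. The sketch below uses invented toy frames to show both the working case and a misaligned one, and then a key-based merge as a possible alternative. It assumes the exploded dataframes keep a `Wikipedia_movie_ID` column, which is not confirmed by this notebook.

```python
import pandas as pd

# Invented toy data: ratings indexed like the cleaned movie table.
ratings = pd.DataFrame(
    {"Wikipedia_movie_ID": [10, 20], "averageRating": [6.5, 8.0]},
    index=[0, 1],
)

# An exploded frame repeats the index label of the row it came from.
exploded = pd.DataFrame(
    {"Wikipedia_movie_ID": [10, 10, 20], "country": ["FR", "BE", "US"]},
    index=[0, 0, 1],
)

# Case 1: labels still match, so index-aligned assignment works.
aligned = exploded.copy()
aligned["averageRating"] = ratings["averageRating"]

# Case 2: after reset_index the labels are 0, 1, 2 and no longer match,
# so the second row silently receives the rating of another film.
shifted = exploded.reset_index(drop=True)
shifted["averageRating"] = ratings["averageRating"]

# Alternative: merge on an explicit key, independent of the index.
by_key = exploded.merge(ratings, on="Wikipedia_movie_ID", how="left")

print(aligned["averageRating"].tolist())
print(shifted["averageRating"].tolist())
print(by_key["averageRating"].tolist())
```

Merging on an explicit key is slightly more verbose, but it does not depend on how the index was handled in earlier preprocessing steps.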