# IMDb Dataset Cleaning

In this notebook, we enrich the CMU Movie Summary dataset with two IMDb features: the average rating and the number of votes. We do not examine the whole IMDb dataset in detail. Our goal is only to supplement the CMU dataset with key performance indicators that help us assess movie success. By cleaning and aligning the relevant IMDb features, we aim to build a more complete dataset for the analysis in Milestone 3.

### Loading the Datasets

In [1]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
import sys
import pickle

In [2]:
data_folder = '../data/'
pickle_folder = data_folder + 'pickle/'
imdb_folder = data_folder + 'IMDB/'

In [3]:
title_basics = pd.read_csv(imdb_folder + 'title.basics.tsv', sep='\t', low_memory=False)
title_ratings = pd.read_csv(imdb_folder + 'title.ratings.tsv', sep='\t', low_memory=False)

In [4]:
with open(pickle_folder + 'movies_clean.p', 'rb') as f:
    movies = pickle.load(f)

display(title_basics.sample(5))
display(title_ratings.sample(5))

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
4829270,tt18452302,tvEpisode,Episode #1.3264,Episode #1.3264,0,2021,\N,\N,"Comedy,Drama,Romance"
8873473,tt4950818,tvEpisode,Episode #5.8,Episode #5.8,0,\N,\N,\N,\N
1728101,tt1137210,tvEpisode,The Man Who Made Careful Arrangements: Part 2,The Man Who Made Careful Arrangements: Part 2,0,1961,\N,\N,"Crime,Drama"
3725622,tt1506296,tvEpisode,Episode #2.105,Episode #2.105,0,1993,\N,60,"Comedy,Music,Talk-Show"
6070762,tt24249188,tvEpisode,Episode dated 14 December 2022,Episode dated 14 December 2022,0,2022,\N,\N,News


Unnamed: 0,tconst,averageRating,numVotes
49494,tt0071104,6.3,133
875095,tt19896070,7.5,10
252107,tt0457128,5.9,241
1344591,tt6802136,6.7,8
1399122,tt7903716,7.4,5


## 1. Cleaning IMDb Features

- First, we noticed that the IMDb dataset covers several types of media (TV series, TV episodes, etc.). For our analysis, we keep only the entries whose titleType is movie. We also keep only the relevant features: tconst, primaryTitle and startYear from title.basics (tconst links the two IMDb tables, while primaryTitle and startYear let us merge with the CMU dataset later), and averageRating and numVotes from title.ratings, which give us an additional metric of film success.

In [5]:
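# Keep only films and the columns needed later; .copy() avoids a SettingWithCopyWarning below.
title_basics_movies = title_basics[title_basics['titleType'] == 'movie'][['tconst', 'primaryTitle', 'startYear']].copy()
# IMDb writes missing years as \N: they are coerced to NaN, then replaced by 0.
title_basics_movies['startYear'] = pd.to_numeric(title_basics_movies['startYear'], errors='coerce').fillna(0).astype(int)
# startYear has no NaN left after fillna(0), so this only drops rows with a missing primaryTitle.
title_basics_movies = title_basics_movies.dropna(subset=['primaryTitle', 'startYear'])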
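
The filtering cell above replaces unparseable years with 0 before calling dropna, so films with an unknown year stay in the table with startYear equal to 0. The following is a minimal sketch of one alternative, not the notebook's method: it keeps missing years as NaN so that dropna actually removes them. The `toy_basics` frame and its rows are invented for illustration.

```python
import pandas as pd

# Toy stand-in for title.basics; the rows are invented for illustration.
toy_basics = pd.DataFrame({
    'tconst': ['tt001', 'tt002', 'tt003', 'tt004'],
    'titleType': ['movie', 'tvEpisode', 'movie', 'movie'],
    'primaryTitle': ['A', 'B', 'C', None],
    'startYear': ['1999', '2021', '\\N', '2005'],
})

# Keep only films and the columns needed for the merge.
toy_movies = toy_basics.loc[
    toy_basics['titleType'] == 'movie',
    ['tconst', 'primaryTitle', 'startYear'],
].copy()

# IMDb writes missing values as \N; coerce them to NaN instead of 0.
toy_movies['startYear'] = pd.to_numeric(toy_movies['startYear'], errors='coerce')

# Now dropna really removes rows with a missing title or a missing year.
toy_movies = toy_movies.dropna(subset=['primaryTitle', 'startYear'])
toy_movies['startYear'] = toy_movies['startYear'].astype(int)
```

In this toy example only the film with both a title and a parseable year survives. Whether dropping such rows is preferable to keeping a placeholder year depends on how the table is used downstream.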

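- Then, we merge title_basics_movies and title_ratings on tconst to obtain a single dataframe, imdb_data, with the desired features. The left join keeps films that have no rating, so their averageRating and numVotes are NaN.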

In [6]:
imdb_data = title_basics_movies.merge(title_ratings, on='tconst', how='left')
imdb_data.sample(5)

Unnamed: 0,tconst,primaryTitle,startYear,averageRating,numVotes
540104,tt3479538,Grandma's Ashes,0,,
147829,tt0263418,Hana-tsumi nikki,1939,6.2,20.0
493168,tt30247269,Il Viaggio Di Guaman,0,,
590279,tt5190944,Mumbai Pune Mumbai 2,2015,6.7,509.0
144605,tt0257224,Three Picture Deal,2002,,


## 2. Combining CMU Movie Summary and IMDb dataset 

- Now, we can merge the CMU Movie Summary and IMDb datasets. We start by mapping movies between the two datasets. As there is no unique ID common to both, we use the combination of the features ['Movie_name', 'Year'] from CMU and ['primaryTitle', 'startYear'] from IMDb to identify which movies should be matched.

In [7]:
movies['Movie_box_office_revenue'] = pd.to_numeric(movies['Movie_box_office_revenue'], errors='coerce')

merged_data = movies.merge(
    imdb_data[['primaryTitle', 'startYear', 'averageRating', 'numVotes']],
    left_on=['Movie_name', 'Year'],
    right_on=['primaryTitle', 'startYear'],
    how='left'
).drop(columns=['primaryTitle', 'startYear'])

merged_data.sample(5)

Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_languages_(Freebase_ID:name_tuples),Movie_countries_(Freebase_ID:name_tuples),Year,Year_Interval,Genres_0,Genres_1,Genres_2,averageRating,numVotes
13142,33000410,The Mikado,1939,,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}",1939,1935-1955,"""Romantic comedy""","""Romance Film""","""Musical""",6.4,833.0
63613,5294089,Erskineville Kings,1999-09-23,,"{""/m/02h40lc"": ""English Language""}","{""/m/0chghy"": ""Australia""}",1999,1995-2015,"""Ensemble Film""","""Drama""",,6.3,691.0
15739,23799017,Monpura,2009-02-13,,"{""/m/01c7y"": ""Bengali Language""}","{""/m/0162b"": ""Bangladesh""}",2009,1995-2015,"""Romance Film""",,,8.8,9366.0
8761,35002972,Bougafer 33,2010,,{},"{""/m/04wgh"": ""Morocco""}",2010,1995-2015,"""Documentary""",,,,
12682,10529623,Ingen djävla picknick,2002-09-21,,{},"{""/m/0d0vqn"": ""Sweden""}",2002,1995-2015,"""Short Film""",,,,
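
Because the match key is a (title, year) pair rather than a unique ID, several IMDb films can share the same key, and a left merge then repeats the corresponding CMU row once per match. The sketch below, on invented toy data, shows one way to count such ambiguous keys and to keep a single IMDb row per key before merging. Keeping the most-voted film is an assumption made for illustration, not a rule taken from this notebook.

```python
import pandas as pd

# Toy stand-ins; the rows are invented for illustration.
toy_cmu = pd.DataFrame({
    'Movie_name': ['Alpha', 'Beta'],
    'Year': [2000, 2001],
})
toy_imdb = pd.DataFrame({
    'primaryTitle': ['Alpha', 'Alpha', 'Beta'],
    'startYear': [2000, 2000, 2001],
    'averageRating': [7.0, 5.5, 6.1],
    'numVotes': [1200, 15, 300],
})

key = ['primaryTitle', 'startYear']

# Number of IMDb rows whose (title, year) key is not unique.
n_ambiguous = toy_imdb.duplicated(subset=key, keep=False).sum()

# One possible tie-break (an assumption): keep the most-voted film per key.
toy_imdb_unique = (
    toy_imdb.sort_values('numVotes', ascending=False)
            .drop_duplicates(subset=key, keep='first')
)

# validate raises a MergeError if the right-hand keys are still not unique.
toy_merged = toy_cmu.merge(
    toy_imdb_unique,
    left_on=['Movie_name', 'Year'],
    right_on=key,
    how='left',
    validate='many_to_one',
).drop(columns=key)
```

With `validate='many_to_one'`, pandas checks that each key appears at most once on the IMDb side, so the merged frame keeps exactly one row per CMU row.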


- As with the box office revenue, the average rating is not available for every movie: after the merge, about 45% of the rows (almost 33,000 movies) have no rating. This is not ideal, but it still leaves far more usable movies than the box office revenue does (~8,000 movies with a valid value).

In [8]:
films_with_ratings = merged_data['averageRating'].notna().sum()
print(f"Nb of films with IMDb ratings: {films_with_ratings}")
films_without_ratings = merged_data['averageRating'].isna().sum()
print(f"Nb of films without IMDb ratings: {films_without_ratings}")

movies_clean = merged_data.dropna(subset=['averageRating','numVotes'])
movies_clean.sample(5)

Nb of films with IMDb ratings: 39553
Nb of films without IMDb ratings: 32744


Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_languages_(Freebase_ID:name_tuples),Movie_countries_(Freebase_ID:name_tuples),Year,Year_Interval,Genres_0,Genres_1,Genres_2,averageRating,numVotes
11023,4980413,Satte Pe Satta,1982-01-22,,"{""/m/03k50"": ""Hindi Language""}","{""/m/03rk0"": ""India""}",1982,1975-1995,"""Comedy""","""World cinema""",,7.2,3697.0
21341,26695384,Praja,2001,,"{""/m/0999q"": ""Malayalam Language""}","{""/m/03rk0"": ""India""}",2001,1995-2015,"""Action""","""Drama""",,4.2,423.0
24847,2557469,Killing Me Softly,2002-05-10,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...",2002,1995-2015,"""Thriller""","""Mystery""","""Romance Film""",5.4,19667.0
48561,6616873,Shoeshine,1946,,"{""/m/02bjrlw"": ""Italian Language"", ""/m/02h40lc...","{""/m/03rjj"": ""Italy""}",1946,1935-1955,"""Drama""","""Coming of age""","""World cinema""",8.0,8287.0
54858,9118027,Mississippi Cold Case,2007-02,,"{""/m/02h40lc"": ""English Language""}","{""/m/0d060g"": ""Canada""}",2007,1995-2015,"""Documentary""",,,7.3,15.0
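
As a quick check of the proportion of missing ratings, the two counts printed above can be turned into a share. This is plain arithmetic on the printed numbers; the variable names are chosen here for illustration.

```python
# Counts printed by the cell above.
with_ratings = 39553
without_ratings = 32744

# Total number of rows in the merged dataframe.
total = with_ratings + without_ratings

# Fraction of rows that have no IMDb rating.
missing_share = without_ratings / total

print(f"Rows after merge: {total}")
print(f"Share without IMDb rating: {missing_share:.1%}")
```

On the dataframe itself, `merged_data['averageRating'].isna().mean()` would give the same fraction directly.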


- Finally, we add the new features averageRating and numVotes to all the dataframes that were previously cleaned and preprocessed, and save them again.

In [9]:
# Save the merged CMU + IMDb dataframe.
with open(pickle_folder + 'movies_clean.p', 'wb') as f:
    pickle.dump(movies_clean, f)

def assign_rating_numvotes(df1):
    # Column assignment aligns on index labels, so df1 must share its index with movies_clean.
    df1['averageRating'] = movies_clean['averageRating']
    df1['numVotes'] = movies_clean['numVotes']
    return df1

# Add the two IMDb columns to every previously saved dataframe.
for name in ['movies_clean_with_season.p',
             'movies_countries_exploded.p',
             'movies_languages_exploded.p',
             'movies_genres_exploded.p']:
    # Close the read handle before reopening the same file for writing.
    with open(pickle_folder + name, 'rb') as f:
        df = pickle.load(f)
    with open(pickle_folder + name, 'wb') as f:
        pickle.dump(assign_rating_numvotes(df), f)
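
Note that `assign_rating_numvotes` matches rows by index label, not by film. If a saved dataframe does not share its index with `movies_clean` (for example after a merge or a `reset_index`, both of which create a fresh index), ratings could be attached to the wrong films. The sketch below shows a key-based alternative on invented toy data. It assumes that `Wikipedia_movie_ID` is present in every saved dataframe and identifies exactly one row in `movies_clean`; the function name `assign_rating_numvotes_by_id` is hypothetical.

```python
import pandas as pd

# Toy stand-ins; the rows are invented for illustration.
toy_clean = pd.DataFrame({
    'Wikipedia_movie_ID': [10, 20],
    'averageRating': [7.2, 4.2],
    'numVotes': [3697.0, 423.0],
})
# An "exploded" frame: one row per (film, genre), in an unrelated order.
toy_exploded = pd.DataFrame({
    'Wikipedia_movie_ID': [20, 10, 10],
    'Genre': ['Action', 'Comedy', 'World cinema'],
})

def assign_rating_numvotes_by_id(df, ratings):
    """Attach averageRating and numVotes by film ID instead of by index label."""
    cols = ['Wikipedia_movie_ID', 'averageRating', 'numVotes']
    # Drop stale copies so the function can be re-run safely.
    df = df.drop(columns=['averageRating', 'numVotes'], errors='ignore')
    # validate raises a MergeError if an ID appears more than once in ratings.
    return df.merge(ratings[cols], on='Wikipedia_movie_ID',
                    how='left', validate='many_to_one')

toy_result = assign_rating_numvotes_by_id(toy_exploded, toy_clean)
```

Each exploded row receives the rating of its own film regardless of row order, and films absent from the ratings table get NaN.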