# IMDb Dataset Cleaning

In this notebook, we enrich the CMU Movie Summary dataset with IMDb data, specifically the IMDb average rating and number of votes. Rather than examining the entire IMDb dataset in detail, our goal is to supplement the CMU dataset with key indicators from IMDb that help us assess movie success. By cleaning and aligning the relevant IMDb features, we aim to build a more complete dataset for the analysis in Milestone 3.

## Loading the Datasets

In [42]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
import sys
import pickle

In [43]:
data_folder = '../data/'
pickle_folder = data_folder + 'pickle/'
imdb_folder = data_folder + 'IMDB/'

In [44]:
title_basics = pd.read_csv(imdb_folder + 'title.basics.tsv', sep='\t', low_memory=False)
title_ratings = pd.read_csv(imdb_folder + 'title.ratings.tsv', sep='\t', low_memory=False)

In [45]:
with open(pickle_folder + 'movies_clean.p', 'rb') as f:
    movies = pickle.load(f)

display(title_basics.sample(5))
display(title_ratings.sample(5))

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
11078411,tt9828884,tvEpisode,Taking Shape,Taking Shape,0,2014,\N,\N,"Game-Show,Reality-TV"
1660449,tt11249262,tvEpisode,Episode #1.144,Episode #1.144,0,2019,\N,\N,Horror
1187853,tt10398092,tvEpisode,Episode #1.7,Episode #1.7,0,2014,\N,\N,Talk-Show
6027344,tt2394910,tvEpisode,Sailsbury,Sailsbury,0,1981,\N,\N,Reality-TV
1376000,tt10737208,tvEpisode,Episode #1.1176,Episode #1.1176,0,2014,\N,\N,Drama


Unnamed: 0,tconst,averageRating,numVotes
1443580,tt8933888,7.5,95
1090383,tt3134452,5.2,35
716869,tt14663020,7.0,36
751921,tt15381208,7.5,36
372013,tt0718410,6.0,11
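
The `\N` entries in the sample above are IMDb's marker for missing values. As a minimal sketch (the tiny TSV below is made up and `sample_basics` is an illustrative name), the `na_values` argument of `pd.read_csv` can turn them into NaN at load time; the notebook itself instead coerces them later with `pd.to_numeric`.

```python
import io
import pandas as pd

# Made-up rows in the IMDb TSV layout; "\N" marks a missing value.
sample_tsv = (
    "tconst\ttitleType\tstartYear\truntimeMinutes\n"
    "tt_a\tmovie\t1961\t72\n"
    "tt_b\ttvEpisode\t2014\t\\N\n"
    "tt_c\tmovie\t\\N\t\\N\n"
)

# na_values tells pandas to treat the literal string \N as missing.
sample_basics = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

n_missing_years = int(sample_basics["startYear"].isna().sum())
n_missing_runtimes = int(sample_basics["runtimeMinutes"].isna().sum())
print(sample_basics.dtypes)
print("missing start years:", n_missing_years)
```

With this option the numeric columns are parsed as floats directly, at the cost of no longer seeing the raw `\N` marker in the loaded dataframe.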


## 1. Cleaning IMDb Features

- First, we noticed that the IMDb dataset includes several types of media (TV series, TV episodes, etc.). For our analysis, we keep only films. We also retain only the relevant features: primaryTitle and startYear from title.basics (which let us merge with the CMU dataset later), and averageRating and numVotes from title.ratings, which provide an additional measure of film success.

In [46]:
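# Keep feature films only, with the columns needed for the merge.
title_basics_movies = title_basics.loc[title_basics['titleType'] == 'movie', ['tconst', 'primaryTitle', 'startYear']].copy()
# Unknown years ('\N') become NaN; drop them before casting to int so they do not turn into year 0.
title_basics_movies['startYear'] = pd.to_numeric(title_basics_movies['startYear'], errors='coerce')
title_basics_movies = title_basics_movies.dropna(subset=['primaryTitle', 'startYear'])
title_basics_movies['startYear'] = title_basics_movies['startYear'].astype(int)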
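
To illustrate what the year conversion does, here is a small sketch on made-up values. The nullable `Int64` step is an optional alternative, not something the notebook uses: it keeps years as integers while still marking unknown ones as missing.

```python
import pandas as pd

# Made-up values in the style of IMDb's startYear column; "\N" means unknown.
raw_years = pd.Series(["1961", "\\N", "2014"])

# errors="coerce" replaces anything non-numeric with NaN instead of raising.
years = pd.to_numeric(raw_years, errors="coerce")
print(years.tolist())  # [1961.0, nan, 2014.0]

# Nullable integer dtype: whole-number years, with the gap shown as <NA>.
years_nullable = years.astype("Int64")
print(years_nullable)
```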

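- Then, we merge title_basics_movies and title_ratings into a single dataframe, imdb_data, with the desired features. Because this is a left merge, films without a rating are kept, and their averageRating and numVotes are NaN.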

In [47]:
imdb_data = title_basics_movies.merge(title_ratings, on='tconst', how='left')
imdb_data.sample(5)

Unnamed: 0,tconst,primaryTitle,startYear,averageRating,numVotes
191457,tt0392202,Junge Jawani,1932,,
299388,tt1339132,The New Oceania,2005,,
307349,tt13736468,Nima Yoshij,2007,,
380594,tt1853642,Praschan Requiem,2012,5.6,11.0
40206,tt0055373,The Right Approach,1961,5.3,125.0
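
The sample above shows unrated films with empty rating columns and vote counts displayed as floats (for example `11.0`). The sketch below reproduces both effects on toy data (all names and values are made up) and uses the `indicator` option of `merge` to measure how many films found a rating.

```python
import pandas as pd

# Toy stand-ins for title_basics_movies and title_ratings.
basics_toy = pd.DataFrame({
    "tconst": ["tt_a", "tt_b", "tt_c"],
    "primaryTitle": ["Film A", "Film B", "Film C"],
    "startYear": [1932, 2012, 1961],
})
ratings_toy = pd.DataFrame({
    "tconst": ["tt_b", "tt_c"],
    "averageRating": [5.6, 5.3],
    "numVotes": [11, 125],
})

# indicator=True adds a "_merge" column: "both" means a rating was found.
merged_toy = basics_toy.merge(ratings_toy, on="tconst", how="left", indicator=True)
rated_share = (merged_toy["_merge"] == "both").mean()

print(merged_toy)
print(f"share of films with a rating: {rated_share:.2f}")
# The integer vote counts become floats because NaN was introduced.
print(merged_toy["numVotes"].dtype)
```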


## 2. Combining the CMU Movie Summary and IMDb Datasets

- We can now merge the CMU Movie Summary and IMDb datasets. We start by mapping movies between the two datasets. As no unique ID is common to both, we use the combination of ['Movie_name', 'Year'] from CMU and ['primaryTitle', 'startYear'] from IMDb to identify which movies match.

In [48]:
movies['Movie_box_office_revenue'] = pd.to_numeric(movies['Movie_box_office_revenue'], errors='coerce')

merged_data = movies.merge(
    imdb_data[['primaryTitle', 'startYear', 'averageRating', 'numVotes']],
    left_on=['Movie_name', 'Year'],
    right_on=['primaryTitle', 'startYear'],
    how='left'
).drop(columns=['primaryTitle', 'startYear'])

merged_data.sample(5)

Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_languages_(Freebase_ID:name_tuples),Movie_countries_(Freebase_ID:name_tuples),Year,Year_Interval,Genres_0,Genres_1,Genres_2,averageRating,numVotes
6151,12741677,Pin...,1988,,"{""/m/02h40lc"": ""English Language""}","{""/m/0d060g"": ""Canada""}",1988,1975-1995,"""Thriller""","""Horror""","""Indie""",,
42096,9930507,Lickety-Splat,1961-06-03,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}",1961,1955-1975,"""Short Film""","""Family Film""","""Comedy""",,
14810,32895484,The Mutiny of the Bounty,1916,,"{""/m/06ppq"": ""Silent film"", ""/m/02h40lc"": ""Eng...",{},1916,1915-1935,"""Silent film""",,,7.7,35.0
45596,17593898,Honeysuckle Rose,1980-07-18,17815212.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}",1980,1975-1995,"""Romance Film""","""Drama""","""Musical""",6.3,1926.0
61695,11376450,Tarana,1951,,"{""/m/03k50"": ""Hindi Language""}","{""/m/03rk0"": ""India""}",1951,1935-1955,"""Romantic drama""","""Romance Film""","""Drama""",7.3,178.0
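
One thing to watch with a (title, year) key is that it is not guaranteed to be unique on the IMDb side: if two IMDb films share a title and a year, a left merge repeats the matching CMU row. The sketch below, on made-up data, counts such ambiguous keys and resolves them by keeping the entry with the most votes. That tie-breaking rule is an assumption made for illustration, not something the notebook does.

```python
import pandas as pd

# Toy IMDb table where "Film A" (1951) appears twice.
imdb_toy = pd.DataFrame({
    "primaryTitle": ["Film A", "Film A", "Film B"],
    "startYear": [1951, 1951, 1988],
    "averageRating": [7.3, 6.0, 6.5],
    "numVotes": [178, 12, 4000],
})
cmu_toy = pd.DataFrame({"Movie_name": ["Film A", "Film B"], "Year": [1951, 1988]})
key = ["primaryTitle", "startYear"]

# Rows whose (title, year) pair occurs more than once.
n_ambiguous = int(imdb_toy.duplicated(subset=key, keep=False).sum())

# Assumed rule: keep the most-voted entry for each (title, year) pair.
imdb_unique = (
    imdb_toy.sort_values("numVotes", ascending=False)
            .drop_duplicates(subset=key, keep="first")
)

# validate="many_to_one" makes pandas raise if the right-hand keys are not unique.
merged_toy = cmu_toy.merge(
    imdb_unique, left_on=["Movie_name", "Year"], right_on=key,
    how="left", validate="many_to_one",
).drop(columns=key)

print(n_ambiguous)
print(merged_toy)
```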


- We noticed that, like the box office revenue, the average rating is not available for every movie: about 45% of the movies (almost 33,000) have no rating. Although this is not ideal, it is still far better than the number of movies with a valid box office revenue (~8,000).
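
As a quick check of the proportion, the share of unrated movies can be computed from the two counts printed by the next cell (39553 with a rating, 32744 without):

```python
# Counts reported by the notebook for the merged dataframe.
with_ratings = 39553
without_ratings = 32744

total = with_ratings + without_ratings
share_missing = without_ratings / total
print(f"{total} films, {share_missing:.1%} without a rating")
# 72297 films, 45.3% without a rating
```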

In [49]:
films_with_ratings = merged_data['averageRating'].notna().sum()
print(f"Nb of films with IMDb ratings: {films_with_ratings}")
films_without_ratings = merged_data['averageRating'].isna().sum()
print(f"Nb of films without IMDb ratings: {films_without_ratings}")

movies_clean = merged_data.dropna(subset=['averageRating','numVotes'])
movies_clean.sample(5)

Nb of films with IMDb ratings: 39553
Nb of films without IMDb ratings: 32744


Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_languages_(Freebase_ID:name_tuples),Movie_countries_(Freebase_ID:name_tuples),Year,Year_Interval,Genres_0,Genres_1,Genres_2,averageRating,numVotes
60013,5856725,The Venice Project,1999-09-09,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...",1999,1995-2015,"""Period piece""","""Melodrama""","""Drama""",5.3,226.0
27705,36448415,Not Suitable for Children,2012-06-06,,"{""/m/02h40lc"": ""English Language""}","{""/m/0chghy"": ""Australia""}",2012,1995-2015,"""Romance Film""","""Comedy film""",,5.8,5340.0
25195,22304534,Desperation,2006-05-23,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}",2006,1995-2015,"""Horror""","""Film adaptation""",,5.5,47.0
38199,11876523,Extreme Movie,2008-12-05,54822.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}",2008,1995-2015,"""Parody""","""Sex comedy""","""Comedy""",3.7,11647.0
53605,20022496,Sons of Steel,1988,,"{""/m/02h40lc"": ""English Language""}","{""/m/0chghy"": ""Australia""}",1988,1975-1995,"""Science Fiction""","""Comedy""","""Musical""",5.1,261.0
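
Some of the rows dropped here may be unmatched only because the two title strings differ in case or spacing, since the merge compares titles exactly. The following sketch shows one possible way to check this on toy data. `normalize_title`, `cmu_key` and `imdb_key` are hypothetical names, and whether normalization recovers any matches on the real data is untested.

```python
import pandas as pd

def normalize_title(titles: pd.Series) -> pd.Series:
    """Lowercase, trim, and collapse internal whitespace (hypothetical helper)."""
    return (
        titles.str.casefold()
              .str.strip()
              .str.replace(r"\s+", " ", regex=True)
    )

# Made-up titles that differ only in case or spacing.
cmu_toy = pd.DataFrame({"Movie_name": ["  Film A", "FILM  B", "Film C"],
                        "Year": [1951, 1988, 2006]})
imdb_toy = pd.DataFrame({"primaryTitle": ["Film A", "Film B"],
                         "startYear": [1951, 1988],
                         "averageRating": [7.3, 6.5]})

cmu_toy["cmu_key"] = normalize_title(cmu_toy["Movie_name"])
imdb_toy["imdb_key"] = normalize_title(imdb_toy["primaryTitle"])

# Exact matching, as in the notebook, versus matching on the normalized key.
exact = cmu_toy.merge(imdb_toy, left_on=["Movie_name", "Year"],
                      right_on=["primaryTitle", "startYear"], how="left")
relaxed = cmu_toy.merge(imdb_toy, left_on=["cmu_key", "Year"],
                        right_on=["imdb_key", "startYear"], how="left")

n_exact = int(exact["averageRating"].notna().sum())
n_relaxed = int(relaxed["averageRating"].notna().sum())
print(n_exact, n_relaxed)
```

Looser matching also raises the risk of pairing different films, so any gain would need to be checked by hand on a sample.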


- Finally, we update all the previously cleaned and preprocessed dataframes with the new features averageRating and numVotes.

In [50]:
# Save the enriched main dataframe (this overwrites the input pickle).
with open(pickle_folder + 'movies_clean.p', 'wb') as f:
    pickle.dump(movies_clean, f)

def assign_rating_numvotes(df1):
    # Column assignment aligns on the index, so df1 is assumed to
    # share its index labels with movies_clean.
    df1['averageRating'] = movies_clean['averageRating']
    df1['numVotes'] = movies_clean['numVotes']
    return df1

# Add the two IMDb columns to each derived dataframe and write it back.
# Each file is closed after reading and before it is reopened for writing.
for name in ['movies_clean_with_season.p', 'movies_countries_exploded.p',
             'movies_languages_exploded.p', 'movies_genres_exploded.p']:
    path = pickle_folder + name
    with open(path, 'rb') as f:
        df = pickle.load(f)
    with open(path, 'wb') as f:
        pickle.dump(assign_rating_numvotes(df), f)
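
The assignment above copies the columns by index label, which only gives correct results if every dataframe shares its index with movies_clean. A possible alternative is to look values up by movie ID instead. The sketch below assumes the derived dataframes still carry a `Wikipedia_movie_ID` column, which is not confirmed here; `assign_by_movie_id` and the toy data are illustrative.

```python
import pandas as pd

# Toy stand-ins: a ratings table and an "exploded" frame with unrelated index labels.
ratings_toy = pd.DataFrame({
    "Wikipedia_movie_ID": [10, 20],
    "averageRating": [6.3, 7.3],
    "numVotes": [1926.0, 178.0],
})
exploded_toy = pd.DataFrame(
    {"Wikipedia_movie_ID": [20, 20, 10, 30],
     "Genre": ["Drama", "Romance", "Musical", "Comedy"]},
    index=[5, 6, 7, 8],
)

def assign_by_movie_id(df, ratings):
    """Copy the rating columns by movie ID rather than by index position."""
    # One row per movie, so the lookup index is unique.
    lookup = (ratings.drop_duplicates("Wikipedia_movie_ID")
                     .set_index("Wikipedia_movie_ID"))
    out = df.copy()
    for col in ["averageRating", "numVotes"]:
        # IDs absent from the lookup (here: 30) map to NaN.
        out[col] = out["Wikipedia_movie_ID"].map(lookup[col])
    return out

result = assign_by_movie_id(exploded_toy, ratings_toy)
print(result)
```

This keeps the original index of each dataframe untouched and does not depend on how earlier steps ordered or filtered the rows.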