# IMDb Dataset Cleaning

In this notebook, we enrich the CMU Movie Summary dataset with IMDb data, specifically the IMDb average rating and number of votes. Rather than exploring the entire IMDb dataset, our goal is to supplement the CMU dataset with key performance indicators that help us assess movie success. By cleaning and aligning the relevant IMDb features, we build a more comprehensive dataset for the analysis in Milestone 3.

### Loading the Datasets

In [73]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
import sys
import pickle

In [74]:
data_folder = '../data/'
pickle_folder = data_folder + 'pickle/'
imdb_folder = data_folder + 'IMDB/'

In [42]:
title_basics = pd.read_csv(imdb_folder + 'title.basics.tsv', sep='\t', low_memory=False)
title_ratings = pd.read_csv(imdb_folder + 'title.ratings.tsv', sep='\t', low_memory=False)

In [75]:
with open(pickle_folder + 'movies_clean.p', 'rb') as f:
    movies = pickle.load(f)

display(title_basics.sample(5))
display(title_ratings.sample(5))

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
3004885,tt13735470,tvEpisode,Episode #1.15,Episode #1.15,0,2020,\N,60,"News,Talk-Show"
6897194,tt28635536,tvEpisode,The Life,The Life,0,2023,\N,\N,Horror
2813707,tt1337860,tvEpisode,Dream Dinner,Dream Dinner,0,2008,\N,\N,Documentary
2677213,tt13124252,tvEpisode,Authentic Self - Lois Robbins Interview on act...,Authentic Self - Lois Robbins Interview on act...,0,2020,\N,\N,Talk-Show
1202730,tt1042497,movie,Exit Speed,Exit Speed,0,2008,\N,91,Thriller


Unnamed: 0,tconst,averageRating,numVotes
505705,tt10728262,7.8,13
993102,tt2589156,7.9,29
1174033,tt4064152,8.1,12
323265,tt0613015,7.0,28
108688,tt0152050,8.2,166


In [76]:
movies

Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_box_office_revenue,Year,Year_Interval,nb_of_Genres,Genre_Action,Genre_Action/Adventure,Genre_Adventure,Genre_Animation,...,Country_Canada,Country_France,Country_Germany,Country_Hong Kong,Country_India,Country_Italy,Country_Japan,Country_Other,Country_United Kingdom,Country_United States of America
0,330,Actrius,,1996,1970-2000,2,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1,3217,Army of Darkness,21502796.0,1992,1970-2000,12,True,True,False,False,...,False,False,False,False,False,False,False,False,False,True
2,3333,The Birth of a Nation,50000000.0,1915,1915-1930,7,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
3,3746,Blade Runner,33139618.0,1982,1970-2000,12,False,False,False,False,...,False,False,False,True,False,False,False,False,False,True
4,3837,Blazing Saddles,119500000.0,1974,1970-2000,3,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60678,36956792,The Water Horse: Legend of the Deep,103071443.0,2007,2000-2015,4,False,False,True,False,...,False,False,False,False,False,False,False,True,True,True
60679,37067980,The Lady from Peking,,1975,1970-2000,2,False,False,False,False,...,False,False,False,False,False,False,False,True,False,True
60680,37373877,Crazy Eights,,2006,2000-2015,2,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
60681,37476824,I Love New Year,,2011,2000-2015,4,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False


## 1. Cleaning IMDb Features

- First, the IMDb dataset covers many types of media (series, TV episodes, etc.), but our analysis only concerns films, so we keep only the rows whose titleType is movie. We also retain only the relevant features: tconst (the IMDb identifier used to join the two IMDb tables), primaryTitle, startYear and runtimeMinutes from title.basics (title and year will let us merge with the CMU dataset later), and averageRating and numVotes from title.ratings, which provide an additional metric of film success.

In [77]:
# Keep feature films only, with the columns needed for the merges below.
title_basics_movies = title_basics.loc[title_basics['titleType'] == 'movie', ['tconst', 'primaryTitle', 'startYear', 'runtimeMinutes']].copy()
# IMDb marks a missing year as '\N'; it becomes 0 here and never matches a CMU year.
title_basics_movies['startYear'] = pd.to_numeric(title_basics_movies['startYear'], errors='coerce').fillna(0).astype(int)
title_basics_movies = title_basics_movies.dropna(subset=['primaryTitle', 'startYear', 'runtimeMinutes'])

- Then, we merge title_basics and title_ratings to obtain a single IMDb dataframe with the desired features.

In [78]:
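# Left join: films without a rating are kept for now, with NaN in averageRating and numVotes.
imdb_data = title_basics_movies.merge(title_ratings, on='tconst', how='left')
# runtimeMinutes is still text; '\N' becomes NaN and those rows are dropped.
imdb_data["runtimeMinutes"] = pd.to_numeric(imdb_data["runtimeMinutes"], errors='coerce')
imdb_data = imdb_data.dropna(subset=["runtimeMinutes"])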
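
The IMDb TSV files write missing values as the literal string `\N`, which is why the numeric columns above go through `pd.to_numeric(..., errors='coerce')`. As a minimal sketch of an alternative, assuming a made-up three-row excerpt in the layout of title.basics.tsv, the marker can be declared at load time through the `na_values` parameter of `pd.read_csv`:

```python
import io
import pandas as pd

# Hypothetical excerpt (not real IMDb rows) in the layout of title.basics.tsv.
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\n"
    "tt0000001\tmovie\tFilm A\t2008\t91\n"
    "tt0000002\ttvEpisode\tEpisode B\t2020\t\\N\n"
    "tt0000003\tmovie\tFilm C\t\\N\t\\N\n"
)

# Treat the literal string \N as a missing value while parsing.
basics = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Keep movies only, then drop rows missing the year or the runtime.
movies_only = basics[basics["titleType"] == "movie"]
movies_only = movies_only.dropna(subset=["startYear", "runtimeMinutes"])
```

With this approach the year and runtime columns are numeric as soon as they are loaded, so a later `dropna` removes the incomplete rows directly.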

## 2. Combining CMU Movie Summary and IMDb dataset 

- Now we can merge the CMU Movie Summary and IMDb datasets. We start by mapping movies between the two datasets. As there is no unique ID common to both, we use the combination of ['Movie_name', 'Year'] from CMU and ['primaryTitle', 'startYear'] from IMDb to identify which movies should be matched. A CMU film that matches several IMDb entries is ambiguous, so we drop it entirely.

In [79]:
movies['Movie_box_office_revenue'] = pd.to_numeric(movies['Movie_box_office_revenue'], errors='coerce')

merged_data = movies.merge(
    imdb_data[['primaryTitle', 'startYear', 'averageRating', 'numVotes']],
    left_on=['Movie_name', 'Year'],
    right_on=['primaryTitle', 'startYear'],
    how='inner'
).drop(columns=['primaryTitle', 'startYear'])

# Drop CMU films matched to more than one IMDb entry (keep=False removes every copy).
merged_data = merged_data[~merged_data["Wikipedia_movie_ID"].duplicated(keep=False)]
merged_data

Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_box_office_revenue,Year,Year_Interval,nb_of_Genres,Genre_Action,Genre_Action/Adventure,Genre_Adventure,Genre_Animation,...,Country_Germany,Country_Hong Kong,Country_India,Country_Italy,Country_Japan,Country_Other,Country_United Kingdom,Country_United States of America,averageRating,numVotes
0,3217,Army of Darkness,21502796.0,1992,1970-2000,12,True,True,False,False,...,False,False,False,False,False,False,False,True,7.4,197717.0
1,3333,The Birth of a Nation,50000000.0,1915,1915-1930,7,False,False,False,False,...,False,False,False,False,False,False,False,True,6.1,26681.0
2,3746,Blade Runner,33139618.0,1982,1970-2000,12,False,False,False,False,...,False,True,False,False,False,False,False,True,8.1,835060.0
3,3837,Blazing Saddles,119500000.0,1974,1970-2000,3,False,False,False,False,...,False,False,False,False,False,False,False,True,7.7,155432.0
4,3947,Blue Velvet,8551228.0,1986,1970-2000,3,False,False,False,False,...,False,False,False,False,False,False,False,True,7.7,219742.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34785,36683360,2016: Obama's America,33449086.0,2012,2000-2015,1,False,False,False,False,...,False,False,False,False,False,False,False,True,4.8,11049.0
34786,36814246,Eraserhead,7000000.0,1977,1970-2000,10,False,False,False,False,...,False,False,False,False,False,False,False,True,7.3,130107.0
34787,36821133,Loetoeng Kasaroeng,,1926,1915-1930,1,False,False,False,False,...,False,False,False,False,False,True,False,False,7.3,12.0
34788,36929245,Before Midnight,,2013,2000-2015,2,False,False,False,False,...,False,False,False,False,False,False,False,True,7.9,175072.0
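
Because title and year do not form a true key, one CMU film can match several IMDb entries. The sketch below uses made-up toy data to show what the `duplicated(keep=False)` filter does in that case: every copy of an ambiguous film is flagged, so the film is removed rather than kept with an arbitrary rating.

```python
import pandas as pd

# Toy data (made up): two IMDb entries share the same title and year.
cmu = pd.DataFrame({
    "Wikipedia_movie_ID": [1, 2],
    "Movie_name": ["Alpha", "Beta"],
    "Year": [1990, 2001],
})
imdb = pd.DataFrame({
    "primaryTitle": ["Alpha", "Beta", "Beta"],
    "startYear": [1990, 2001, 2001],
    "averageRating": [7.0, 6.0, 3.5],
})

merged = cmu.merge(
    imdb,
    left_on=["Movie_name", "Year"],
    right_on=["primaryTitle", "startYear"],
    how="inner",
).drop(columns=["primaryTitle", "startYear"])

# "Beta" now appears twice, with conflicting ratings.
ambiguous = merged["Wikipedia_movie_ID"].duplicated(keep=False)

# keep=False flags every copy, so the ambiguous film is dropped entirely.
unique_matches = merged[~ambiguous]
```

Here `merged` has three rows and `unique_matches` keeps only "Alpha".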


- We notice that, like the box office revenue, the average rating is not available for every movie. Roughly half of the CMU movies find an equivalent in IMDb (more than 34,000 movies after merging the two datasets). After removing the movies without a valid average rating, the dataset drops to about 33,000 movies. Although this is not ideal, it is still far better than the number of movies with a valid box office revenue (about 8,000).

In [80]:
films_with_ratings = merged_data['averageRating'].notna().sum()
print(f"Nb of films with IMDb ratings: {films_with_ratings}")
films_without_ratings = merged_data['averageRating'].isna().sum()
print(f"Nb of films without IMDb ratings: {films_without_ratings}")

merged_data = merged_data.dropna(subset=['averageRating', 'numVotes'])
merged_data.sample(5)

Nb of films with IMDb ratings: 32964
Nb of films without IMDb ratings: 1162


Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_box_office_revenue,Year,Year_Interval,nb_of_Genres,Genre_Action,Genre_Action/Adventure,Genre_Adventure,Genre_Animation,...,Country_Germany,Country_Hong Kong,Country_India,Country_Italy,Country_Japan,Country_Other,Country_United Kingdom,Country_United States of America,averageRating,numVotes
4468,1782182,Frog and Wombat,,1998,1970-2000,1,False,False,False,False,...,False,False,False,False,False,False,False,True,5.2,236.0
31693,31214134,Her First Affaire,,1932,1930-1950,1,False,False,False,False,...,False,False,False,False,False,False,True,False,5.7,115.0
11703,6836206,Paradise Canyon,,1935,1930-1950,4,True,True,False,False,...,False,False,False,False,False,False,False,True,5.1,1047.0
13486,8934569,Amor a primera vista,,1956,1950-1970,2,False,False,False,False,...,False,False,False,False,False,False,False,False,6.7,19.0
33616,34609492,Bobbili Simham,,1994,1970-2000,2,True,False,False,False,...,False,False,True,False,False,False,False,False,6.7,133.0
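
The coverage figures discussed above can also be measured directly. As a hedged sketch on made-up toy data, a left merge with `indicator=True` adds a `_merge` column that separates CMU films with an IMDb match from those without one, and the share of films that also carry a rating follows from `notna()`:

```python
import pandas as pd

# Toy data (made up): four CMU films, two of which exist in IMDb.
cmu = pd.DataFrame({
    "Movie_name": ["Alpha", "Beta", "Gamma", "Delta"],
    "Year": [1990, 2001, 2005, 2010],
})
imdb = pd.DataFrame({
    "primaryTitle": ["Alpha", "Beta"],
    "startYear": [1990, 2001],
    "averageRating": [7.0, None],  # "Beta" is in IMDb but has no rating
})

check = cmu.merge(
    imdb,
    left_on=["Movie_name", "Year"],
    right_on=["primaryTitle", "startYear"],
    how="left",
    indicator=True,
)

# Share of CMU films with an IMDb match, and share with a usable rating.
matched_share = (check["_merge"] == "both").mean()
rated_share = check["averageRating"].notna().mean()
```

In this toy case half of the films are matched and a quarter have a rating.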


- Moreover, we keep only films with more than 200 votes, so that average ratings are compared between films rated by a sufficient number of voters.

In [81]:
movies_clean = merged_data[merged_data["numVotes"] > 200]
print(f"Nb of films with more than 200 votes: {len(movies_clean)}")
print(f"Nb of films with 200 votes or fewer: {len(merged_data) - len(movies_clean)}")

Nb of films with more than 200 votes: 25184
Nb of films with 200 votes or fewer: 7780
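
The 200-vote cut-off is a judgment call, so it can help to see how the number of retained films reacts to it. The following is a small sketch on made-up vote counts; the helper `films_kept` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Made-up vote counts for five films.
votes = pd.DataFrame({"numVotes": [12, 150, 200, 201, 5000]})

def films_kept(df, min_votes):
    """Count films with strictly more than min_votes votes."""
    return int((df["numVotes"] > min_votes).sum())

# Number of films retained for a few candidate thresholds.
kept_by_threshold = {t: films_kept(votes, t) for t in (50, 200, 1000)}
```

Note that the comparison is strict, so a film with exactly 200 votes is excluded at the 200 threshold.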


- Finally, we update our previously cleaned and preprocessed dataframe with the new features averageRating and numVotes.

In [82]:
with open(pickle_folder + "movies_clean.p", "wb") as f:
    pickle.dump(movies_clean, f)

- When updating the dataframe of movies whose release season is known, we notice that we lose about 8,600 movies (8,585) from the dataset.

In [84]:
with open(pickle_folder+"movies_season.p", 'rb') as f:
    movies_season = pickle.load(f)
    
movies_season = pd.merge(movies_clean,movies_season,on=["Wikipedia_movie_ID"])
display(movies_season)
print(f"Nb of films removed after update : {len(movies_clean)-len(movies_season)}")
print(f"Nb of films in the updated dataframe : {len(movies_season)}")

with open(pickle_folder+"movies_clean_with_season.p", 'wb') as f:
    pickle.dump(movies_season,f)

Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_box_office_revenue,Year,Year_Interval,nb_of_Genres,Genre_Action,Genre_Action/Adventure,Genre_Adventure,Genre_Animation,...,Country_Hong Kong,Country_India,Country_Italy,Country_Japan,Country_Other,Country_United Kingdom,Country_United States of America,averageRating,numVotes,release_season
0,3217,Army of Darkness,21502796.0,1992,1970-2000,12,True,True,False,False,...,False,False,False,False,False,False,True,7.4,197717.0,Autumn
1,3746,Blade Runner,33139618.0,1982,1970-2000,12,False,False,False,False,...,True,False,False,False,False,False,True,8.1,835060.0,Summer
2,3837,Blazing Saddles,119500000.0,1974,1970-2000,3,False,False,False,False,...,False,False,False,False,False,False,True,7.7,155432.0,Winter
3,3947,Blue Velvet,8551228.0,1986,1970-2000,3,False,False,False,False,...,False,False,False,False,False,False,True,7.7,219742.0,Summer
4,4227,Barry Lyndon,20000000.0,1975,1970-2000,7,False,False,False,False,...,False,False,False,False,False,True,True,8.1,186676.0,Winter
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16594,36598217,Secret Service of the Air,,1939,1930-1950,3,True,False,True,False,...,False,False,False,False,False,False,True,5.7,271.0,Spring
16595,36674310,Mystery of Marie Roget,,1942,1930-1950,1,False,False,False,False,...,False,False,False,False,False,False,True,5.9,371.0,Spring
16596,36683360,2016: Obama's America,33449086.0,2012,2000-2015,1,False,False,False,False,...,False,False,False,False,False,False,True,4.8,11049.0,Summer
16597,36814246,Eraserhead,7000000.0,1977,1970-2000,10,False,False,False,False,...,False,False,False,False,False,False,True,7.3,130107.0,Spring


Nb of films removed after update : 8585
Nb of films in the updated dataframe : 16599
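
To see which films disappear in this last step, one option is to list the IDs that have no season entry before merging. The sketch below uses made-up toy frames standing in for movies_clean and movies_season; it also passes `validate="one_to_one"` to the merge, an optional check not used in the notebook, which raises an error if an ID is repeated on either side.

```python
import pandas as pd

# Toy frames (made up) standing in for movies_clean and movies_season.
clean = pd.DataFrame({
    "Wikipedia_movie_ID": [1, 2, 3],
    "averageRating": [7.4, 8.1, 5.7],
})
season = pd.DataFrame({
    "Wikipedia_movie_ID": [1, 3],
    "release_season": ["Autumn", "Spring"],
})

# Films with no season entry: these vanish from the inner join.
no_season = clean[~clean["Wikipedia_movie_ID"].isin(season["Wikipedia_movie_ID"])]

# validate raises pandas.errors.MergeError if an ID is repeated on either side,
# so the join cannot silently duplicate films.
with_season = clean.merge(season, on="Wikipedia_movie_ID", validate="one_to_one")

lost = len(clean) - len(with_season)
```

When the IDs are unique on both sides, `lost` equals the number of rows in `no_season`.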
