# Movie Dataset Cleaning

First, we focus on cleaning the "movie.metadata" dataset.
The goal is to review the whole dataset in depth, understand how much data is missing for each feature relevant to our study, and have a cleaned version ready for Milestone 3.

### Loading the Dataset

In [62]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
import sys
import pickle

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root)

from importlib import reload
import src.utils.utils
reload(src.utils.utils)
from src.utils.utils import clean_columns, extract_info, exploded_format, dropempty

In [63]:
folder = '../data/CMU/'
pickle_folder = "../data/pickle/"

In [64]:
movie_original_data = pd.read_csv(folder + 'movie.metadata.tsv' ,sep='\t',names=['Wikipedia_movie_ID',
'Freebase_movie_ID',
'Movie_name',
'Movie_release_date',
'Movie_box_office_revenue',
'Movie_runtime',
'Movie_languages_(Freebase_ID:name_tuples)',
'Movie_countries_(Freebase_ID:name_tuples)',
'Movie_genres_(Freebase_ID:name_tuples)'])

In [65]:
movies = movie_original_data.copy()

In [66]:
print("The original number of movies in the movie.metadata dataset is: {}".format(movies.shape[0]))

The original number of movies in the movie.metadata dataset is: 81741


## 1. Dropping invalid values for Movie box office revenue

- Most importantly, because our main question concerns how movie characteristics affect box office performance, the first step is to drop all the rows for which the box office revenue is missing.

In [95]:
movies_with_box_office = movies.dropna(subset=['Movie_box_office_revenue'])

In [96]:
print("The number of movies with a valid box office revenue in the movie.metadata dataset:\n {}".format(movies_with_box_office.shape[0]))

The number of movies with a valid box office revenue in the movie.metadata dataset:
 8401


- Dropping all the movies without a valid box office revenue reduces the dataframe from 81741 to 8401 rows, i.e. by a factor of almost 10. We therefore need an additional criterion to evaluate the success of movies. This is the main reason we also study the IMDb dataset and its ratings.
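
To get the same overview for every column at once, the share of missing values can be computed per feature. The following is a minimal sketch on a made-up toy dataframe (the rows are invented for illustration and are not taken from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies dataframe (values are made up for illustration).
toy = pd.DataFrame({
    "Movie_name": ["A", "B", "C", "D"],
    "Movie_release_date": ["2001-08-24", None, "1988", "1936-08-22"],
    "Movie_box_office_revenue": [14010832.0, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to `movies`, the same two calls would rank the features by how incomplete they are, which helps decide which ones can serve as a success criterion.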

## 2. Cleaning Features

- Before starting the analysis of the features relevant to our project, we will drop the columns that are irrelevant to us or that we don’t intend to use, to avoid being overwhelmed by unnecessary information when printing the data frames.

In [97]:
movies_clean = movies.drop(columns=['Freebase_movie_ID','Movie_runtime'])
movies_clean.sample(5)

Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_languages_(Freebase_ID:name_tuples),Movie_countries_(Freebase_ID:name_tuples),Movie_genres_(Freebase_ID:name_tuples)
13102,5983922,Toby Tortoise Returns,1936-08-22,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02hmvc"": ""Short Film""}"
79420,213188,DuckTales the Movie: Treasure of the Lost Lamp,1990-08-03,18115724.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/0bj8m2"": ""Children's"", ""/m/01hmnh"": ""Fant..."
52721,26240693,Ikarie XB-1,1963,,"{""/m/01wgr"": ""Czech Language"", ""/m/02h40lc"": ""...","{""/m/01mk6"": ""Czechoslovakia""}","{""/m/06n90"": ""Science Fiction""}"
20257,15119049,The Man with the Punch,1920-12-18,,"{""/m/06ppq"": ""Silent film"", ""/m/02h40lc"": ""Eng...","{""/m/09c7w0"": ""United States of America""}","{""/m/02hmvc"": ""Short Film"", ""/m/06ppq"": ""Silen..."
4577,5589600,Alai,2005,,"{""/m/07c9s"": ""Tamil Language""}","{""/m/03rk0"": ""India""}","{""/m/068d7h"": ""Romantic drama"", ""/m/02l7c8"": ""..."


### 2.1. Cleaning the Dates

- First, we noticed that the release date recorded for the movie "Hunting Season" is "1010-12-02", whereas the actual release date is "2010-12-02":

In [98]:
movies_clean.loc[movies_clean['Movie_name'] == 'Hunting Season']

Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_languages_(Freebase_ID:name_tuples),Movie_countries_(Freebase_ID:name_tuples),Movie_genres_(Freebase_ID:name_tuples)
62836,29666067,Hunting Season,1010-12-02,12160978.0,"{""/m/02hwyss"": ""Turkish Language"", ""/m/02h40lc...","{""/m/01znc_"": ""Turkey""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/02n4kr"": ""My..."


Hence, we replaced this value with the correct one:

In [99]:
movies_clean.loc[62836, 'Movie_release_date'] = '2010-12-02'
print(movies_clean.loc[62836, 'Movie_release_date'])

2010-12-02
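
A typo like this one can also be found systematically rather than by eye. The sketch below flags release dates whose year falls outside a plausible window; the bounds `MIN_YEAR` and `MAX_YEAR` are assumptions chosen for illustration, and the dates are toy values:

```python
import pandas as pd

# Toy release dates in the same mixed formats as the dataset (full date, year only, missing).
toy_dates = pd.Series(["1010-12-02", "2010-12-02", "1963", None, "1936-08-22"])

# Extract the leading four-digit year; missing dates stay missing (NaN).
years = pd.to_numeric(toy_dates.str[:4], errors="coerce")

# Assumed plausibility window (hypothetical bounds, adjust as needed).
MIN_YEAR, MAX_YEAR = 1880, 2020
suspicious = toy_dates[(years < MIN_YEAR) | (years > MAX_YEAR)]
print(suspicious)
```

Comparisons with NaN evaluate to False, so rows without a release date are not flagged here and can be handled separately.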


- Drop all the movies with no release date:

In [100]:
movies_clean = movies_clean.dropna(subset=['Movie_release_date']).copy()

In [101]:
movies_clean.shape[0]

74839

- Create a 'Year' column containing only the release year of the movie (for year-by-year analysis):

In [102]:
movies_clean['Year'] = movies_clean['Movie_release_date'].str[:4].astype(int)

- For our study, in addition to analyzing the dataset year by year, we decided to conduct an analysis across five intervals of roughly 15 to 30 years each, spanning 1915 to 2015. We restrict ourselves to that period because of the lack of meaningful data before the 1910s.
- We create a 'Year_Interval' column in which each film is assigned to one of the five study intervals. Because pd.cut uses right-closed bins by default, a boundary year belongs to the earlier interval (for example, a film from 2000 is labelled '1970-2000').
- Finally, we save the new version of the movies dataset with these new columns.

In [198]:
movies_clean['Year_Interval'] = pd.cut(movies_clean['Year'], 
                                       bins=[1914, 1930, 1950, 1970, 2000, 2016], 
                                       labels=['1915-1930', '1930-1950', '1950-1970', '1970-2000', '2000-2015']
                                      ).astype(str)

movies_clean = movies_clean.query(" 2016 > Year > 1914")

display(movies_clean)

# Use a context manager so the file handle is closed after writing.
with open(pickle_folder + "movies_clean.p", "wb") as f:
    pickle.dump(movies_clean, f)

Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_languages_(Freebase_ID:name_tuples),Movie_countries_(Freebase_ID:name_tuples),Movie_genres_(Freebase_ID:name_tuples),Year,Year_Interval
0,975900,Ghosts of Mars,2001-08-24,14010832.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",2001,2000-2015
1,3196793,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp...",2000,1970-2000
2,28463795,Brun bitter,1988,,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...",1988,1970-2000
3,9363483,White Of The Eye,1987,,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",1987,1970-2000
4,261236,A Woman in Flames,1983,,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}",1983,1970-2000
...,...,...,...,...,...,...,...,...,...
81736,35228177,Mermaids: The Body Found,2011-03-19,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama""}",2011,2000-2015
81737,34980460,Knuckle,2011-01-21,,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0...",2011,2000-2015
81738,9971909,Another Nice Mess,1972-09-22,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06nbt"": ""Satire"", ""/m/01z4y"": ""Comedy""}",1972,1970-2000
81739,913762,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ...",1992,1970-2000
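
To make the boundary behaviour of this binning concrete, here is a small sketch on made-up years, using the same bin edges and labels as the cell above. It only illustrates how `pd.cut` treats the edges; it is not part of the cleaning pipeline.

```python
import pandas as pd

years = pd.Series([1914, 1915, 1930, 1931, 2000, 2001, 2015, 2016])
labels = ['1915-1930', '1930-1950', '1950-1970', '1970-2000', '2000-2015']

# Same bin edges as above; pd.cut uses right-closed intervals by default,
# i.e. (1914, 1930], (1930, 1950], ..., (2000, 2016].
intervals = pd.cut(years, bins=[1914, 1930, 1950, 1970, 2000, 2016], labels=labels)
print(pd.DataFrame({"Year": years, "Year_Interval": intervals}))
```

In this sketch 1914 gets no label because the first bin is open on the left, while 1930 and 2000 fall into the earlier of their two candidate intervals. The year 2016 would still be labelled '2000-2015' by `pd.cut`, which is why the subsequent `query` on `Year` is needed to remove it.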


- Because we will also analyze the season (winter, spring, summer, autumn) in which movies were released, we create another DataFrame in which each movie with a known release month is categorized by its release season.
- We also save this DataFrame as a second version of our clean dataset. We keep the two separate rather than merging them into one, because creating 'release_season' excludes all the movies for which the release month is unknown.

In [199]:
# project_root is already on sys.path (added when loading the dataset), so we only reload the module.
reload(src.utils.utils)

from src.utils.utils import categorize_release_season

In [200]:
md_release_season = movies_clean.copy()
md_release_season['Release_Month'] = md_release_season['Movie_release_date'].str.split('-').str[1]
md_release_season.dropna(subset = 'Release_Month', inplace = True)

In [201]:
md_release_season['release_season'] = md_release_season['Release_Month'].astype(int).apply(categorize_release_season)
md_release_season.shape[0]

41768
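
The helper `categorize_release_season` lives in `src/utils/utils.py` and is not shown in this notebook. As a rough illustration only, the sketch below uses a hypothetical stand-in, `season_from_month`, that assumes meteorological seasons of the Northern Hemisphere; this assumption is consistent with the outputs displayed in this section (for example, August maps to Summer and November to Autumn), but the real implementation may differ.

```python
import pandas as pd

def season_from_month(month: int) -> str:
    """Hypothetical stand-in for categorize_release_season.

    Assumes meteorological seasons in the Northern Hemisphere.
    """
    if month not in range(1, 13):
        raise ValueError(f"month must be in 1..12, got {month}")
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Autumn"

# Toy release dates: the third one only has a year, so it has no month.
toy_dates = pd.Series(["2001-08-24", "2000-02-16", "1988", "1989-11-08"])
months = toy_dates.str.split("-").str[1]  # NaN when only the year is known
seasons = months.dropna().astype(int).apply(season_from_month)
print(seasons)
```

The year-only entry is dropped before the mapping, which mirrors why the season dataframe is smaller than `movies_clean`.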

In [202]:
display(md_release_season)
# Use a context manager so the file handle is closed after writing.
with open(pickle_folder + "movies_season.p", "wb") as f:
    pickle.dump(md_release_season[["Wikipedia_movie_ID", "release_season"]], f)

Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_languages_(Freebase_ID:name_tuples),Movie_countries_(Freebase_ID:name_tuples),Movie_genres_(Freebase_ID:name_tuples),Year,Year_Interval,Release_Month,release_season
0,975900,Ghosts of Mars,2001-08-24,14010832.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",2001,2000-2015,08,Summer
1,3196793,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp...",2000,1970-2000,02,Winter
7,10408933,Alexander's Ragtime Band,1938-08-16,3600000.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/04t36"": ""Musical"", ""/m/01z4y"": ""Comedy"", ...",1938,1930-1950,08,Summer
12,6631279,Little city,1997-04-04,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/0hj3n0w"": ...",1997,1970-2000,04,Spring
13,171005,Henry V,1989-11-08,10161099.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/04xvh5"": ""Costume drama"", ""/m/082gq"": ""Wa...",1989,1970-2000,11,Autumn
...,...,...,...,...,...,...,...,...,...,...,...
81735,32468537,Shadow Boxing 2,2007-10-18,,"{""/m/06b_j"": ""Russian Language"", ""/m/02h40lc"":...","{""/m/06bnz"": ""Russia""}","{""/m/01z02hx"": ""Sports"", ""/m/0lsxr"": ""Crime Fi...",2007,2000-2015,10,Autumn
81736,35228177,Mermaids: The Body Found,2011-03-19,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama""}",2011,2000-2015,03,Spring
81737,34980460,Knuckle,2011-01-21,,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0...",2011,2000-2015,01,Winter
81738,9971909,Another Nice Mess,1972-09-22,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06nbt"": ""Satire"", ""/m/01z4y"": ""Comedy""}",1972,1970-2000,09,Autumn


### 2.2. Clean 'Genres' Column 

- The ‘genres’ characteristic is among the key features for answering our general question. However, in the movie.metadata dataset this feature is stored in a (Freebase_ID:name_tuples) format, which is hard to read. That is why we decided to reformat the ‘genres’ feature.
- We dropped all the movies that have no information about their genres (1756 movies).

In [203]:
column_name = 'Movie_genres_(Freebase_ID:name_tuples)' 

md_Genres = movies_clean.copy()
md_Genres["Genres"] = md_Genres[column_name].apply(extract_info)

md_Genres, n = dropempty(md_Genres,'Genres')
print(f"Number of movies dropped : {n}")

Number of movies dropped : 1756
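
The helpers `extract_info` and `dropempty` are imported from `src/utils/utils.py` and their code is not shown here. The sketch below gives one possible way to implement the same two steps, under the assumption that each raw cell is a JSON object string mapping Freebase IDs to names. The functions `parse_names` and `drop_empty_lists` are hypothetical stand-ins, not the project's actual code.

```python
import json
import pandas as pd

def parse_names(raw: str) -> list:
    """Hypothetical stand-in for extract_info: keep only the names of a
    JSON object string of Freebase ID -> name pairs."""
    return list(json.loads(raw).values())

def drop_empty_lists(df: pd.DataFrame, column: str):
    """Hypothetical stand-in for dropempty: drop rows whose list is empty
    and return the filtered dataframe together with the number of rows dropped."""
    keep = df[column].apply(len) > 0
    return df[keep].copy(), int((~keep).sum())

# Toy rows: the second movie has no genre information.
toy = pd.DataFrame({
    "Movie_name": ["Ghosts of Mars", "Unknown"],
    "raw_genres": ['{"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction"}', "{}"],
})
toy["Genres"] = toy["raw_genres"].apply(parse_names)
toy, n_dropped = drop_empty_lists(toy, "Genres")
print(toy[["Movie_name", "Genres"]], n_dropped)
```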


- We calculate the number of genres for each entry and add this count as a new column. As the describe() output shows, each movie has about 3 genres on average (mean 3.17, median 3).

In [204]:
md_Genres['nb_of_Genres'] = md_Genres["Genres"].apply(len)
md_Genres['nb_of_Genres'].describe()

count    71771.000000
mean         3.165931
std          2.112019
min          1.000000
25%          1.000000
50%          3.000000
75%          4.000000
max         17.000000
Name: nb_of_Genres, dtype: float64

- We update the previously saved dataframe with the new 'nb_of_Genres' column, dropping the raw genre columns (this overwrites movies_clean.p).

In [205]:
data_genres = md_Genres.drop(columns=["Genres","Movie_genres_(Freebase_ID:name_tuples)"])

with open(pickle_folder + 'movies_clean.p', 'wb') as f:
    pickle.dump(data_genres, f)

- We also create a dataframe containing 'genres' in an exploded format in order to facilitate the frequency and causal analysis of genres

In [206]:
exploded_format('Genres',md_Genres,pickle_folder + "movies_genres_exploded.p")

Genres
Drama                31547
Comedy               15402
Romance Film          9730
Thriller              8396
Action                8266
                     ...  
Statutory rape           1
Romantic thriller        1
Chick flick              1
Buddy Picture            1
Neorealism               1
Name: count, Length: 363, dtype: int64
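
The code of `exploded_format` is not shown in this notebook. The underlying idea of an exploded format can be sketched with `DataFrame.explode` followed by `value_counts`; the toy data below is made up, and the real helper additionally writes the result to a pickle file.

```python
import pandas as pd

toy = pd.DataFrame({
    "Wikipedia_movie_ID": [1, 2, 3],
    "Genres": [["Drama", "Comedy"], ["Drama"], ["Thriller", "Drama"]],
})

# One row per (movie, genre) pair; the other columns are repeated.
exploded = toy.explode("Genres").reset_index(drop=True)

# Frequency of each genre across all movies.
genre_counts = exploded["Genres"].value_counts()
print(genre_counts)
```

Because a movie with several genres contributes one row per genre, the counts sum to more than the number of movies.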

### 2.3. Clean 'Languages' Column

- As with the genres, this feature is stored in a (Freebase_ID:name_tuples) format, which is hard to read. That is why we decided to reformat the ‘Languages’ feature.
- We dropped all the movies that have no information about their languages (9543 movies).

In [207]:
column_name = 'Movie_languages_(Freebase_ID:name_tuples)'

md_language = data_genres.copy()
md_language['Languages'] = md_language[column_name].apply(extract_info)


md_language, n = dropempty(md_language,'Languages')
print(f"Number of movies dropped : {n}")

Number of movies dropped : 9543


- We calculate the number of languages for each entry and add this count as a new column. As the describe() output shows, most movies have a single language (mean 1.20, median 1).

In [208]:
md_language['nb_of_Languages'] = md_language["Languages"].apply(len)
md_language['nb_of_Languages'].describe()

count    62228.000000
mean         1.199412
std          0.581817
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         15.000000
Name: nb_of_Languages, dtype: float64

- We update the previously saved dataframe with the new 'nb_of_Languages' column, dropping the raw language columns (this overwrites movies_clean.p).

In [209]:
data_languages = md_language.drop(columns=["Languages","Movie_languages_(Freebase_ID:name_tuples)"])

with open(pickle_folder + 'movies_clean.p', 'wb') as f:
    pickle.dump(data_languages, f)

- We create a new dataframe containing the Languages in an exploded format in order to facilitate the frequency analysis of languages

In [210]:
exploded_format('Languages',md_language,pickle_folder + "movies_languages_exploded.p")

Languages
English Language           38744
Spanish Language            3473
French Language             3280
Hindi Language              3273
Italian Language            2390
                           ...  
Sunda Language                 1
Hazaragi Language              1
Pawnee Language                1
Gumatj Language                1
Judeo-Georgian Language        1
Name: count, Length: 191, dtype: int64

### 2.4. Clean 'Countries' Column 

- As with the genres, the Countries feature is stored in a (Freebase_ID:name_tuples) format, which is hard to read. We therefore reformat it into a more understandable structure.
- We dropped all the movies that have no information about their countries (1545 movies).

In [211]:
column_name = "Movie_countries_(Freebase_ID:name_tuples)"

md_countries = data_languages.copy()
md_countries["Countries"] = md_countries[column_name].apply(extract_info)

md_countries, n = dropempty(md_countries,'Countries')
print(f"Number of movies dropped : {n}")

Number of movies dropped : 1545


- We calculate the number of countries for each entry and add this count as a new column. As the describe() output shows, most movies have a single country (mean 1.19, median 1).

In [212]:
md_countries['nb_of_Countries'] = md_countries["Countries"].apply(len)
md_countries['nb_of_Countries'].describe()

count    60683.00000
mean         1.19091
std          0.56936
min          1.00000
25%          1.00000
50%          1.00000
75%          1.00000
max         14.00000
Name: nb_of_Countries, dtype: float64

- We update the previously saved dataframe with the new 'nb_of_Countries' column, dropping the raw country columns (this overwrites movies_clean.p).

In [213]:
data_countries = md_countries.drop(columns=["Countries","Movie_countries_(Freebase_ID:name_tuples)"])

with open(pickle_folder + 'movies_clean.p', 'wb') as f:
    pickle.dump(data_countries, f)

- We create a new dataframe containing the Countries in an exploded format in order to facilitate the frequency analysis of countries

In [214]:
exploded_format('Countries',md_countries,pickle_folder + "movies_countries_exploded.p")

Countries
United States of America    29364
India                        6680
United Kingdom               6484
France                       3665
Italy                        2718
                            ...  
Macau                           1
Malayalam Language              1
Soviet occupation zone          1
Palestinian Territories         1
Iraqi Kurdistan                 1
Name: count, Length: 145, dtype: int64

In [215]:
g = clean_columns("Genres",40,"Genre")

l = clean_columns("Languages",10,"Lang")

c = clean_columns("Countries",10,"Country")

display(g)
display(l)
display(c)

Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_box_office_revenue,Year,Year_Interval,nb_of_Genres,Genre_Action,Genre_Action/Adventure,Genre_Adventure,Genre_Animation,...,Genre_Romantic comedy,Genre_Romantic drama,Genre_Science Fiction,Genre_Short Film,Genre_Silent film,Genre_Sports,Genre_Thriller,Genre_War film,Genre_Western,Genre_World cinema
0,330,Actrius,,1996,1970-2000,2,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,3217,Army of Darkness,21502796.0,1992,1970-2000,12,True,True,False,False,...,False,False,False,False,False,False,False,False,False,False
2,3333,The Birth of a Nation,50000000.0,1915,1915-1930,7,False,False,False,False,...,False,False,False,False,True,False,False,True,False,False
3,3746,Blade Runner,33139618.0,1982,1970-2000,12,False,False,False,False,...,False,False,True,False,False,False,True,False,False,False
4,3837,Blazing Saddles,119500000.0,1974,1970-2000,3,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71766,37473592,Thoppul Kodi,,2011,2000-2015,1,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
71767,37476824,I Love New Year,,2011,2000-2015,4,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
71768,37478048,Mr. Bechara,,1996,1970-2000,1,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
71769,37492363,Cherries and Clover,,2011,2000-2015,3,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_box_office_revenue,Year,Year_Interval,nb_of_Languages,lang_english,lang_french,lang_german,lang_hindi,lang_italian,lang_japanese,lang_other,lang_silent film,lang_spanish,lang_standard mandarin,lang_tamil
0,330,Actrius,,1996,1970-2000,2,False,False,False,False,False,False,True,False,True,False,False
1,3217,Army of Darkness,21502796.0,1992,1970-2000,1,True,False,False,False,False,False,False,False,False,False,False
2,3333,The Birth of a Nation,50000000.0,1915,1915-1930,2,True,False,False,False,False,False,False,True,False,False,False
3,3746,Blade Runner,33139618.0,1982,1970-2000,5,True,False,True,False,False,True,True,False,False,False,False
4,3837,Blazing Saddles,119500000.0,1974,1970-2000,2,True,False,False,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62223,37067980,The Lady from Peking,,1975,1970-2000,1,True,False,False,False,False,False,False,False,False,False,False
62224,37373877,Crazy Eights,,2006,2000-2015,1,True,False,False,False,False,False,False,False,False,False,False
62225,37476824,I Love New Year,,2011,2000-2015,1,False,False,False,True,False,False,False,False,False,False,False
62226,37478048,Mr. Bechara,,1996,1970-2000,1,False,False,False,True,False,False,False,False,False,False,False


Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_box_office_revenue,Year,Year_Interval,nb_of_Countries,Country_Argentina,Country_Canada,Country_France,Country_Germany,Country_Hong Kong,Country_India,Country_Italy,Country_Japan,Country_Other,Country_United Kingdom,Country_United States of America
0,330,Actrius,,1996,1970-2000,1,False,False,False,False,False,False,False,False,True,False,False
1,3217,Army of Darkness,21502796.0,1992,1970-2000,1,False,False,False,False,False,False,False,False,False,False,True
2,3333,The Birth of a Nation,50000000.0,1915,1915-1930,1,False,False,False,False,False,False,False,False,False,False,True
3,3746,Blade Runner,33139618.0,1982,1970-2000,2,False,False,False,False,True,False,False,False,False,False,True
4,3837,Blazing Saddles,119500000.0,1974,1970-2000,1,False,False,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60678,36956792,The Water Horse: Legend of the Deep,103071443.0,2007,2000-2015,3,False,False,False,False,False,False,False,False,True,True,True
60679,37067980,The Lady from Peking,,1975,1970-2000,2,False,False,False,False,False,False,False,False,True,False,True
60680,37373877,Crazy Eights,,2006,2000-2015,1,False,False,False,False,False,False,False,False,False,False,True
60681,37476824,I Love New Year,,2011,2000-2015,1,False,False,False,False,False,True,False,False,False,False,False


In [216]:
merge_keys = ['Wikipedia_movie_ID', 'Movie_name', 'Movie_box_office_revenue', 'Year', 'Year_Interval']

# Inner joins keep only the movies present in all three feature tables.
data_with_features = (
    g.merge(l, how="inner", on=merge_keys)
     .merge(c, how="inner", on=merge_keys)
)

In [217]:
# Use a context manager so the file handle is closed after writing.
with open(pickle_folder + "movies_clean.p", "wb") as f:
    pickle.dump(data_with_features, f)
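
An inner merge keeps only the movies that appear in all three feature tables, so the final dataframe can be smaller than each input. The sketch below shows this on made-up tables (`g_toy`, `l_toy`, `c_toy` are hypothetical). It also passes `validate="one_to_one"`, an optional pandas check that raises an error if the merge keys are duplicated on either side; using it here is a suggestion, not something the notebook does.

```python
import pandas as pd

keys = ["Wikipedia_movie_ID", "Movie_name"]

# Toy feature tables: only movie 1 is present in all three.
g_toy = pd.DataFrame({"Wikipedia_movie_ID": [1, 2, 3], "Movie_name": ["A", "B", "C"],
                      "Genre_Drama": [True, False, True]})
l_toy = pd.DataFrame({"Wikipedia_movie_ID": [1, 3], "Movie_name": ["A", "C"],
                      "lang_english": [True, False]})
c_toy = pd.DataFrame({"Wikipedia_movie_ID": [1, 2], "Movie_name": ["A", "B"],
                      "Country_Other": [False, True]})

merged = (
    g_toy.merge(l_toy, how="inner", on=keys, validate="one_to_one")
         .merge(c_toy, how="inner", on=keys, validate="one_to_one")
)
print(merged)
```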