In [710]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
import sys
import pickle

In [711]:
folder = '../data/CMU/'
pickle_folder = "../data/pickle/"

In [712]:
movie_original_data = pd.read_csv(folder + 'movie.metadata.tsv' ,sep='\t',names=['Wikipedia_movie_ID',
'Freebase_movie_ID',
'Movie_name',
'Movie_release_date',
'Movie_box_office_revenue',
'Movie_runtime',
'Movie_languages_(Freebase_ID:name_tuples)',
'Movie_countries_(Freebase_ID:name_tuples)',
'Movie_genres_(Freebase_ID:name_tuples)'])

# Movie Dataset Cleaning

- We first focus on cleaning the "movie.metadata" dataset. 
The goal is to review the whole dataset in depth, understand how much data is missing for each feature relevant to our study, and have a cleaned version ready for Milestone 3.

In [713]:
movies = movie_original_data.copy()

In [714]:
print("The original number of movie in movie.metadata dataset is : {}".format(movies.shape[0]))

The original number of movie in movie.metadata dataset is : 81741
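
A quick way to get an overview of the missing data per feature is to compute the share of missing entries in each column. The sketch below uses a small made-up dataframe (`toy_movies` is an illustrative stand-in, not the real dataset); the same two calls could be applied to `movies`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movie metadata (values are made up for illustration).
toy_movies = pd.DataFrame({
    "Movie_name": ["A", "B", "C", "D"],
    "Movie_release_date": ["2001-08-24", None, "1988", "1983"],
    "Movie_box_office_revenue": [14010832.0, np.nan, np.nan, np.nan],
})

# Fraction of missing entries per column, largest first.
missing_share = toy_movies.isna().mean().sort_values(ascending=False)
print(missing_share)
```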


## 1. Dropping invalid values for Movie box office revenue

- Our main question concerns how a movie's characteristics affect its box office performance, so the first step is to drop every row for which the box office revenue is missing.

In [715]:
movies_with_box_office = movies.dropna(subset=['Movie_box_office_revenue'])

In [716]:
print("The number of movie with a valid value for the box office revenue in movie.metadata dataset :\n {}".format(movies_with_box_office.shape[0]))

The number of movie with a valid value for the box office revenue in movie.metadata dataset :
 8401


- Dropping all the movies without a valid box office revenue reduces the size of the dataframe by a factor of almost 10 (from 81,741 to 8,401 rows). We therefore need an additional criterion to evaluate the success of movies. This is the main reason we also study the IMDb dataset and its ratings.
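
The size of this reduction can be checked directly from the two row counts printed above. This is plain arithmetic on those counts; the variable names are illustrative.

```python
# Row counts taken from the two outputs above.
n_total = 81741
n_with_revenue = 8401

kept_share = n_with_revenue / n_total        # share of movies with a revenue value
reduction_factor = n_total / n_with_revenue  # how many times smaller the subset is

print(f"Share kept: {kept_share:.1%}, reduction factor: {reduction_factor:.2f}")
```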

##  2. Cleaning Features

- Before analyzing the features relevant to our project, we drop the columns we do not intend to use, so that the printed dataframes stay readable.

In [717]:
movies_clean = movies.drop(columns=['Freebase_movie_ID','Movie_runtime'])
movies_clean.sample(5)

Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_languages_(Freebase_ID:name_tuples),Movie_countries_(Freebase_ID:name_tuples),Movie_genres_(Freebase_ID:name_tuples)
63189,10368207,Phir Wohi Dil Laya Hoon,1963,,"{""/m/03k50"": ""Hindi Language""}","{""/m/03rk0"": ""India""}","{""/m/02l7c8"": ""Romance Film"", ""/m/04t36"": ""Mus..."
16385,24318707,Sura,2009,,"{""/m/07c9s"": ""Tamil Language""}","{""/m/03rk0"": ""India""}","{""/m/02kdv5l"": ""Action""}"
4180,5664063,Recess: Taking the Fifth Grade,2003-12-09,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0hqxf"": ""Family Film"", ""/m/01z4y"": ""Comed..."
44992,9397049,Outcast of the Islands,1952,,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/07s9rl0"": ""Drama"", ""/m/03k9fj"": ""Adventure""}"
54238,19592948,Hum Dono,1985,,"{""/m/03k50"": ""Hindi Language""}","{""/m/03rk0"": ""India""}","{""/m/02l7c8"": ""Romance Film"", ""/m/0hqxf"": ""Fam..."


###  2.1. Cleaning the Dates

- First, we noticed that the release date recorded for the movie "Hunting Season" is "1010-12-02", whereas the actual release date is "2010-12-02":

In [718]:
movies_clean.loc[movies_clean['Movie_name'] == 'Hunting Season']

Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_languages_(Freebase_ID:name_tuples),Movie_countries_(Freebase_ID:name_tuples),Movie_genres_(Freebase_ID:name_tuples)
62836,29666067,Hunting Season,1010-12-02,12160978.0,"{""/m/02hwyss"": ""Turkish Language"", ""/m/02h40lc...","{""/m/01znc_"": ""Turkey""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/02n4kr"": ""My..."


We therefore replace this value with the correct one:

In [719]:
movies_clean.loc[62836, 'Movie_release_date'] = '2010-12-02'
print(movies_clean.loc[62836, 'Movie_release_date'])

2010-12-02
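
An error like this one can also be searched for systematically rather than spotted by eye. The sketch below flags release dates whose year falls before a cutoff; the cutoff `EARLIEST_PLAUSIBLE_YEAR = 1880` and the toy series are assumptions made for illustration, not values used elsewhere in this notebook.

```python
import pandas as pd

# Toy release dates, including the erroneous "1010-12-02" entry and a missing value.
dates = pd.Series(["2001-08-24", "1010-12-02", "1988", None],
                  name="Movie_release_date")

# Extract the leading four characters as a numeric year (NaN when missing).
years = pd.to_numeric(dates.str[:4], errors="coerce")

# Assumed cutoff: anything earlier is treated as a probable data-entry error.
EARLIEST_PLAUSIBLE_YEAR = 1880
suspicious = dates[years < EARLIEST_PLAUSIBLE_YEAR]
print(suspicious)
```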


- Drop all the movies for which no release date is recorded:

In [720]:
movies_clean = movies_clean.dropna(subset=['Movie_release_date']).copy()

In [721]:
movies_clean.shape[0]

74839

- Create a 'Year' column containing only the release year of each movie (for year-by-year analysis):

In [722]:
movies_clean['Year'] = movies_clean['Movie_release_date'].str[:4]
movies_clean['Year'] = movies_clean['Year'].astype(int)

- For our study, in addition to analyzing the dataset year by year, we conduct an analysis across five intervals spanning from 1910 to 2016: four intervals of 20 years each and a final one covering 1990 to 2016. We restrict ourselves to this period because of the lack of meaningful data before the 1910s.
- We create a 'Year_Interval' column in which each film is assigned to one of the five study intervals.
- Finally, we save the new version of the movies dataset with these new columns.

In [723]:
# Bin edges and labels for the five study intervals
bins = [1910, 1930, 1950, 1970, 1990, 2016]
labels = ['1910-1930', '1930-1950', '1950-1970', '1970-1990', '1990-2016']
movies_clean['Year_Interval'] = pd.cut(movies_clean['Year'], bins=bins, labels=labels)
movies_clean['Year_Interval'] = movies_clean['Year_Interval'].astype(str)

# Keep only the movies released strictly between 1910 and 2016
movies_clean = movies_clean.query("1910 < Year < 2016")

with open(pickle_folder + "movies_clean.p", "wb") as f:
    pickle.dump(movies_clean, f)
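
The interval assignment above relies on the default behaviour of `pd.cut`, where each bin is open on the left and closed on the right. The small check below, on a handful of boundary years chosen for illustration, shows how such years are assigned: a year equal to the lowest edge gets no label, and a year equal to an inner edge goes to the earlier interval.

```python
import pandas as pd

bins = [1910, 1930, 1950, 1970, 1990, 2016]
labels = ['1910-1930', '1930-1950', '1950-1970', '1970-1990', '1990-2016']

# Years sitting on or next to the bin edges.
boundary_years = pd.Series([1910, 1911, 1930, 1931, 2015, 2016])

# Default right=True: bins are (1910, 1930], (1930, 1950], ...
assigned = pd.cut(boundary_years, bins=bins, labels=labels)
print(pd.DataFrame({"Year": boundary_years, "Year_Interval": assigned}))
```

In this notebook the years 1910 and 2016 are removed by the subsequent `query`, so the unlabeled lowest edge does not end up in the saved dataset.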

- We will also analyze the season (summer, winter, fall, spring) of each movie's release date. We therefore create another DataFrame in which each movie with a fully specified release date is categorized by the season in which it was released.
- We save this DataFrame as a second version of our clean dataset. We keep the two versions separate rather than merging them, because creating the 'release_season' column excludes all the movies for which the month of release is unknown.

In [724]:
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root)

from src.utils.utils import categorize_release_season

In [725]:
md_release_season = movies_clean.copy()
md_release_season['Movie_release_date'] = pd.to_datetime(md_release_season['Movie_release_date'], errors='coerce')
md_release_season.dropna(subset = 'Movie_release_date', inplace = True)

In [726]:
md_release_season['release_season'] = md_release_season['Movie_release_date'].apply(categorize_release_season)
md_release_season.shape[0]

39217

In [727]:
pickle.dump( md_release_season, open(pickle_folder + "movies_clean_with_season.p", "wb" ) )
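
The implementation of `categorize_release_season` lives in `src/utils/utils.py` and is not shown in this notebook. To give a rough idea of what such a function might do, here is a hypothetical stand-in, `season_from_date`, which assumes Northern Hemisphere meteorological seasons (December to February is winter, and so on). The real helper may use different boundaries or labels.

```python
import pandas as pd

def season_from_date(date):
    """Hypothetical stand-in for categorize_release_season (not the project's code).

    Maps a timestamp to a season name, assuming Northern Hemisphere
    meteorological seasons. Returns None for missing dates.
    """
    if pd.isna(date):
        return None
    month = date.month
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Fall"

# Usage on a small series of parsed dates.
parsed = pd.to_datetime(pd.Series(["2010-12-02", "2001-08-24", "1972-09-22"]))
seasons = parsed.apply(season_from_date)
print(seasons.tolist())
```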

### 2.2. Clean 'Genres' Column 

- The ‘genres’ characteristic is one of the key features for answering our general question. However, in the movie.metadata dataset this feature is stored in a (Freebase_ID:name_tuples) format, which is hard to read. We therefore reformat the ‘genres’ feature.
- We drop all the movies that have no information about their genres (1,760 movies).

In [728]:
from importlib import reload
import src.utils.utils
reload(src.utils.utils)
from src.utils.utils import extract_info

column_name = 'Movie_genres_(Freebase_ID:name_tuples)' 

md_Genres = movies_clean.copy()
md_Genres["Genres"] = md_Genres[column_name].apply(extract_info)

len_before = len(md_Genres['Genres'])
md_Genres = md_Genres[md_Genres['Genres'].apply(lambda x: len(x) > 0)]
len_after = len(md_Genres['Genres'])
print(f"Number of movies dropped : {len_before-len_after}")

Number of movies dropped : 1760
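
`extract_info` is defined in `src/utils/utils.py` and its code is not shown here. One possible way to turn a (Freebase_ID:name) string into a list of names is to parse it as JSON and keep the values, as in the hypothetical `extract_names` below. This is only a sketch: the real helper appears to behave differently, since the names in the outputs of this notebook keep their surrounding double quotes.

```python
import json

def extract_names(raw):
    """Hypothetical sketch, not the project's extract_info.

    Parses a string such as '{"/m/07s9rl0": "Drama"}' and returns the
    list of names. Returns an empty list for missing or unparsable input.
    """
    if not isinstance(raw, str):
        return []
    try:
        mapping = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(mapping, dict):
        return []
    return list(mapping.values())

print(extract_names('{"/m/07s9rl0": "Drama", "/m/03k9fj": "Adventure"}'))
print(extract_names("{}"))
```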


- We split the "Genres" column into a new DataFrame, expanding the lists into separate columns. We also compute the number of genres for each movie and add this count as a new column. As the describe() output shows, each movie has about 3 genres on average (mean 3.17, median 3).

In [729]:
genres_split = pd.DataFrame(md_Genres["Genres"].tolist(), index=md_Genres.index)

md_Genres['nb_of_Genres'] = md_Genres["Genres"].apply(lambda x:len(x))
md_Genres['nb_of_Genres'].describe()

count    72688.000000
mean         3.174018
std          2.107443
min          1.000000
25%          1.000000
50%          3.000000
75%          4.000000
max         17.000000
Name: nb_of_Genres, dtype: float64

In [730]:
# We keep only the first three genres of each row
genres_split = genres_split.iloc[:, :3]
genres_split = genres_split.add_prefix("Genres_")
genres_split.sample(5)

Unnamed: 0,Genres_0,Genres_1,Genres_2
5390,"""Martial Arts Film""","""Chinese Movies""",
80400,"""Crime Fiction""","""Buddy film""","""Chase Movie"""
75102,"""Drama""","""Musical Drama""","""Adventure"""
67684,"""Horror""","""Creature Film""","""Psychological thriller"""
59861,"""Thriller""","""World cinema""","""Musical"""
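
The expansion of lists into columns can be illustrated on a toy example (the two rows below are made up). Building a DataFrame from a list of lists pads the shorter lists with missing values, which is why movies with fewer than three genres show empty cells in the sample above.

```python
import pandas as pd

# Two toy movies: one with a single genre, one with four.
toy = pd.DataFrame(
    {"Genres": [["Drama"], ["Thriller", "Horror", "Cult", "Slasher"]]},
    index=[10, 20],
)

# One column per list position; shorter lists are padded with missing values.
split = pd.DataFrame(toy["Genres"].tolist(), index=toy.index)

# Keep the first three positions and give the columns readable names.
first_three = split.iloc[:, :3].add_prefix("Genres_")
print(first_three)
```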


- We update the previous dataframes by adding the new 'genres' columns

In [731]:
data_genres = md_Genres.join(genres_split).drop(columns=["nb_of_Genres","Genres","Movie_genres_(Freebase_ID:name_tuples)"])

with open(pickle_folder + 'movies_clean.p', 'wb') as f:
    pickle.dump(data_genres, f)

# Left join: movies without genre information keep NaN in the Genres_* columns
movie_clean_with_season = md_release_season.join(genres_split).drop(columns=["Movie_genres_(Freebase_ID:name_tuples)"])
with open(pickle_folder + 'movies_clean_with_season.p', 'wb') as f:
    pickle.dump(movie_clean_with_season, f)
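
Both joins above align rows on the dataframe index, and `DataFrame.join` performs a left join by default. The toy example below (made-up frames) shows the consequence: a row of the left frame that has no match in the genres frame is kept, with missing values in the new column.

```python
import pandas as pd

# Left frame: three movies. Right frame: genres known for only two of them.
left = pd.DataFrame({"Movie_name": ["A", "B", "C"]}, index=[0, 1, 2])
genres = pd.DataFrame({"Genres_0": ["Drama", "Comedy"]}, index=[0, 2])

# Default how="left": every row of `left` is kept, aligned on the index.
joined = left.join(genres)
print(joined)
```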

- We also create a dataframe containing 'genres' in an exploded format in order to facilitate the frequency analysis of genres

In [732]:
md_genres_exploded = md_Genres.explode('Genres')
pickle.dump( md_genres_exploded, open(pickle_folder + "movies_genres_exploded.p", "wb" ) )
md_genres_exploded.value_counts('Genres')

Genres
"Drama"                31940
"Comedy"               15620
"Romance Film"          9763
"Black-and-white"       8661
"Thriller"              8406
                       ...  
"Statutory rape"           1
"Romantic thriller"        1
"Chick flick"              1
"Buddy Picture"            1
"Neorealism"               1
Name: count, Length: 363, dtype: int64
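
As a minimal illustration of why the exploded format helps with frequency analysis, the toy example below (made-up movies) turns each list element into its own row, after which counting occurrences is a single `value_counts` call.

```python
import pandas as pd

toy = pd.DataFrame({
    "Movie_name": ["A", "B", "C"],
    "Genres": [["Drama", "Comedy"], ["Drama"], ["Thriller", "Drama"]],
})

# One row per (movie, genre) pair.
exploded = toy.explode("Genres")

# Number of movies in which each genre appears.
counts = exploded["Genres"].value_counts()
print(counts)
```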

### 2.3. Clean 'Languages' Column

- As with the genres, this feature is stored in a (Freebase_ID:name_tuples) format, which is hard to read. We therefore reformat the ‘Language’ feature.
- We drop all the movies that have no information about their languages (9,981 movies).

In [733]:
column_name = 'Movie_languages_(Freebase_ID:name_tuples)'

md_language = movies_clean.copy()
md_language['Language'] = md_language[column_name].apply(extract_info)

len_before = len(md_language['Language'])
md_language = md_language[md_language['Language'].apply(lambda x: len(x) > 0)]
len_after = len(md_language['Language'])
print(f"Number of movies dropped : {len_before-len_after}")

Number of movies dropped : 9981
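
The filter-and-count pattern used here is the same as the one used for the genres, so it could be factored into a small helper. The function `drop_rows_with_empty_lists` below is a hypothetical refactoring, not part of `src.utils.utils`, shown on a made-up dataframe.

```python
import pandas as pd

def drop_rows_with_empty_lists(df, column):
    """Hypothetical helper: keep rows whose list in `column` is non-empty.

    Returns the filtered dataframe and the number of rows dropped.
    """
    has_values = df[column].map(len) > 0
    n_dropped = int((~has_values).sum())
    return df[has_values], n_dropped

toy = pd.DataFrame({
    "Movie_name": ["A", "B", "C"],
    "Language": [["English Language"], [], ["French Language", "Arabic Language"]],
})

kept, n_dropped = drop_rows_with_empty_lists(toy, "Language")
print(f"Number of movies dropped : {n_dropped}")
```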


- We split the "Language" column into a new DataFrame, expanding the lists into separate columns. We also compute the number of languages for each movie and add this count as a new column. As the describe() output shows, each movie has about 1 language on average (mean 1.20, median 1).

In [734]:
language_split = pd.DataFrame(md_language["Language"].tolist(), index=md_language.index)
md_language['nb_of_Languages'] = md_language["Language"].apply(lambda x:len(x))
md_language['nb_of_Languages'].describe()

count    64467.000000
mean         1.200195
std          0.577103
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         15.000000
Name: nb_of_Languages, dtype: float64

- We create a new dataframe containing the Languages in an exploded format in order to facilitate the frequency analysis of languages

In [735]:
md_languages_exploded = md_language.explode('Language')
pickle.dump( md_languages_exploded, open(pickle_folder + "movies_languages_exploded.p", "wb" ) )
md_languages_exploded.value_counts('Language')

Language
"English Language"           39536
"Spanish Language"            3526
"Hindi Language"              3431
"French Language"             3299
"Silent film"                 2846
                             ...  
"Sunda Language"                 1
"Hazaragi Language"              1
"Pawnee Language"                1
"Gumatj Language"                1
"Judeo-Georgian Language"        1
Name: count, Length: 191, dtype: int64

In [736]:
md_languages_exploded

Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_languages_(Freebase_ID:name_tuples),Movie_countries_(Freebase_ID:name_tuples),Movie_genres_(Freebase_ID:name_tuples),Year,Year_Interval,Language,nb_of_Languages
0,975900,Ghosts of Mars,2001-08-24,14010832.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",2001,1990-2016,"""English Language""",1
1,3196793,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp...",2000,1990-2016,"""English Language""",1
2,28463795,Brun bitter,1988,,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...",1988,1970-1990,"""Norwegian Language""",1
3,9363483,White Of The Eye,1987,,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",1987,1970-1990,"""English Language""",1
4,261236,A Woman in Flames,1983,,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}",1983,1970-1990,"""German Language""",1
...,...,...,...,...,...,...,...,...,...,...,...
81736,35228177,Mermaids: The Body Found,2011-03-19,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama""}",2011,1990-2016,"""English Language""",1
81737,34980460,Knuckle,2011-01-21,,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0...",2011,1990-2016,"""English Language""",1
81738,9971909,Another Nice Mess,1972-09-22,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06nbt"": ""Satire"", ""/m/01z4y"": ""Comedy""}",1972,1970-1990,"""English Language""",1
81739,913762,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ...",1992,1990-2016,"""Japanese Language""",1


### 2.4. Clean 'Countries' Column 

- As with the Genres feature, the Countries feature is originally stored in a (Freebase_ID:name_tuples) format, which makes it difficult to read. We therefore reformat it into a more understandable structure.
- We drop all the movies that have no information about their countries (5,184 movies).

In [737]:
md_countries = movies_clean.copy()
md_countries["Countries"] = md_countries['Movie_countries_(Freebase_ID:name_tuples)'].apply(extract_info)

len_before = len(md_countries['Countries'])
md_countries = md_countries[md_countries['Countries'].apply(lambda x: len(x) > 0)]
len_after = len(md_countries['Countries'])
print(f"Number of movies dropped : {len_before-len_after}")

Number of movies dropped : 5184


- We split the "Countries" column into a new DataFrame, expanding the lists into separate columns. We also compute the number of countries for each movie and add this count as a new column. As the describe() output shows, each movie has about 1 country on average (mean 1.18, median 1).

In [738]:
countries_split = pd.DataFrame(md_countries["Countries"].tolist(), index=md_countries.index)
md_countries['nb_of_Countries'] = md_countries["Countries"].apply(lambda x:len(x))
md_countries['nb_of_Countries'].describe()

count    69264.000000
mean         1.180498
std          0.552584
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         14.000000
Name: nb_of_Countries, dtype: float64

- We create a new dataframe containing the Countries in an exploded format in order to facilitate the frequency analysis of countries

In [739]:
md_countries_exploded = md_countries.explode('Countries')
pickle.dump( md_countries_exploded, open(pickle_folder + "movies_countries_exploded.p", "wb" ) )
md_countries_exploded.value_counts('Countries')

Countries
"United States of America"    33051
"India"                        7764
"United Kingdom"               7406
"France"                       4086
"Italy"                        3015
                              ...  
"Iraqi Kurdistan"                 1
"Jordan"                          1
"Macau"                           1
"Palestinian Territories"         1
"Republic of China"               1
Name: count, Length: 146, dtype: int64

In [740]:
md_countries_exploded

Unnamed: 0,Wikipedia_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_languages_(Freebase_ID:name_tuples),Movie_countries_(Freebase_ID:name_tuples),Movie_genres_(Freebase_ID:name_tuples),Year,Year_Interval,Countries,nb_of_Countries
0,975900,Ghosts of Mars,2001-08-24,14010832.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",2001,1990-2016,"""United States of America""",1
1,3196793,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp...",2000,1990-2016,"""United States of America""",1
2,28463795,Brun bitter,1988,,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...",1988,1970-1990,"""Norway""",1
3,9363483,White Of The Eye,1987,,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",1987,1970-1990,"""United Kingdom""",1
4,261236,A Woman in Flames,1983,,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}",1983,1970-1990,"""Germany""",1
...,...,...,...,...,...,...,...,...,...,...,...
81737,34980460,Knuckle,2011-01-21,,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0...",2011,1990-2016,"""Ireland""",2
81737,34980460,Knuckle,2011-01-21,,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0...",2011,1990-2016,"""United Kingdom""",2
81738,9971909,Another Nice Mess,1972-09-22,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06nbt"": ""Satire"", ""/m/01z4y"": ""Comedy""}",1972,1970-1990,"""United States of America""",1
81739,913762,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ...",1992,1990-2016,"""Japan""",1
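
Note that in the exploded dataframe a movie with several countries appears once per country and keeps its original index (see the two rows for index 81737 above). The toy example below (two rows modelled on that output) shows how to recover the number of distinct movies from the index, under the assumption that the index still identifies a movie.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Movie_name": ["Knuckle", "Another Nice Mess"],
        "Countries": [["Ireland", "United Kingdom"], ["United States of America"]],
    },
    index=[81737, 81738],
)
toy["nb_of_Countries"] = toy["Countries"].map(len)

exploded = toy.explode("Countries")

n_rows = len(exploded)                # one row per (movie, country) pair
n_movies = exploded.index.nunique()   # distinct movies, via the preserved index
print(n_rows, n_movies)
```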
