In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

## Some basic exploration of the CMU data

First, we opened the CMU character metadata and looked at the missing values:

In [6]:
columns_character = ['Wikipedia movie ID', 'Freebase movie ID', 'Movie release date', 'Character name', 'Actor date of birth',
                     'Actor gender', 'Actor height (in meters)', 'Actor ethnicity (Freebase ID)', 'Actor name', 
                     'Actor age at movie release', 'Freebase character/actor map ID', 'Freebase character ID',
                     'Freebase actor ID']

df_cmu_character = pd.read_csv("MovieSummaries/character.metadata.tsv",sep='\t',names=columns_character)

print(df_cmu_character.shape)
print('Percentage of NaN in each feature : ')
print(df_cmu_character.isna().sum(axis = 0) / df_cmu_character.shape[0] * 100)

(450669, 13)
Percentage of NaN in each feature : 
Wikipedia movie ID                  0.000000
Freebase movie ID                   0.000000
Movie release date                  2.217814
Character name                     57.220488
Actor date of birth                23.552763
Actor gender                       10.120288
Actor height (in meters)           65.645740
Actor ethnicity (Freebase ID)      76.466542
Actor name                          0.272484
Actor age at movie release         35.084064
Freebase character/actor map ID     0.000000
Freebase character ID              57.218269
Freebase actor ID                   0.180842
dtype: float64
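
The same per-column report can be wrapped in a small helper so that it is reused for every table below. This is only a sketch: `nan_percentage` is a name we introduce here, and the toy frame stands in for `df_cmu_character` with made-up values.

```python
import pandas as pd


def nan_percentage(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, highest first."""
    # isna() gives a boolean frame; the mean of booleans is the fraction of True
    return (df.isna().mean() * 100).sort_values(ascending=False)


# Toy frame standing in for df_cmu_character (values are made up)
toy = pd.DataFrame({
    "Actor name": ["A", "B", None, "D"],
    "Actor gender": ["F", None, None, "M"],
    "Freebase actor ID": ["/m/1", "/m/2", "/m/3", "/m/4"],
})

report = nan_percentage(toy)
print(report)
```

Using `isna().mean()` is equivalent to dividing `isna().sum()` by the number of rows, and sorting puts the most incomplete features at the top.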


Since we were initially interested in the ethnicities of the actors, we mapped each Freebase ID to its label using the `mid2name.tsv` file found at https://github.com/xiaoling/figer/issues/6

In [31]:
df_mapID = pd.read_csv("Expanded_data/mid2name.tsv", sep='\t', names=['ID', 'label'])
df_mapID = df_mapID.drop_duplicates(subset=["ID"], keep='first')
ethnicity = df_cmu_character['Actor ethnicity (Freebase ID)']
df_ethnicity = ethnicity.to_frame()
df_ethnicity.columns = ['ID']
# Left merge on the Freebase ID: one output row per character row, in the same order
df_merge = pd.merge(df_ethnicity, df_mapID, how='left', on='ID')
df_cmu_character['Ethnicity'] = df_merge['label']

We found that it was difficult to complete the actors' ethnicities with external datasets, so we decided to pursue another idea.
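
One way to quantify how well such a mapping covers the data is to measure, after the left merge, which share of the available IDs actually received a label. The sketch below uses toy stand-ins for the character table and for `mid2name.tsv` (all values are made up); the variable names are ours.

```python
import pandas as pd

# Toy stand-ins for the character table and the mid2name mapping (made-up values)
characters = pd.DataFrame(
    {"Actor ethnicity (Freebase ID)": ["/m/a", "/m/b", None, "/m/zzz"]}
)
mapping = pd.DataFrame({
    "ID": ["/m/a", "/m/a", "/m/b"],
    "label": ["Label A", "Label A (duplicate)", "Label B"],
})

# Keep one label per ID so the left merge cannot duplicate character rows
mapping = mapping.drop_duplicates(subset=["ID"], keep="first")

merged = characters.merge(
    mapping,
    how="left",
    left_on="Actor ethnicity (Freebase ID)",
    right_on="ID",
    validate="many_to_one",  # raises if an ID still has several labels
)

has_id = merged["Actor ethnicity (Freebase ID)"].notna()
has_label = merged["label"].notna()
print(f"rows with an ethnicity ID: {has_id.mean():.0%}")
print(f"IDs resolved to a label:   {has_label[has_id].mean():.0%}")
```

The `validate="many_to_one"` argument is a cheap safeguard: the merge fails loudly instead of silently multiplying rows if the mapping still contains duplicated IDs.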

Then, we opened the CMU movie metadata and looked at the missing values:

In [8]:
columns_movie = ['Wikipedia movie ID', 'Freebase movie ID', 'Movie name', 'Movie release date', 'Movie box office revenue',
                 'Movie runtime', 'Movie languages (Freebase ID:name tuples)', 'Movie countries (Freebase ID:name tuples)',
                 'Movie genres (Freebase ID:name tuples)']
df_cmu_movie = pd.read_csv("MovieSummaries/movie.metadata.tsv",sep='\t', names=columns_movie)

print(df_cmu_movie.shape)
print('Percentage of NaN in each feature : ')
print(df_cmu_movie.isna().sum(axis = 0) / df_cmu_movie.shape[0] * 100)

print('\nSum of {} in the string columns : ')
print('Movie languages : {}'.format(sum(df_cmu_movie['Movie languages (Freebase ID:name tuples)']=='{}')))
print('Movie countries : {}'.format(sum(df_cmu_movie['Movie countries (Freebase ID:name tuples)']=='{}')))
print('Movie genres : {}'.format(sum(df_cmu_movie['Movie genres (Freebase ID:name tuples)']=='{}')))

(81741, 9)
Percentage of NaN in each feature : 
Wikipedia movie ID                            0.000000
Freebase movie ID                             0.000000
Movie name                                    0.000000
Movie release date                            8.443743
Movie box office revenue                     89.722416
Movie runtime                                25.018045
Movie languages (Freebase ID:name tuples)     0.000000
Movie countries (Freebase ID:name tuples)     0.000000
Movie genres (Freebase ID:name tuples)        0.000000
dtype: float64

Sum of {} in the string columns : 
Movie languages : 13866
Movie countries : 8154
Movie genres : 2294
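
Since an empty `{}` is a missing value in disguise, it can be convenient to turn it into a real NaN so that a single report covers both cases. A minimal sketch on a toy frame (made-up rows, same column naming as the CMU movie metadata):

```python
import pandas as pd

# Toy frame with the same column naming as the CMU movie metadata (made-up rows)
toy = pd.DataFrame({
    "Movie name": ["A", "B", "C"],
    "Movie genres (Freebase ID:name tuples)": ['{"/m/07s9rl0": "Drama"}', "{}", "{}"],
    "Movie countries (Freebase ID:name tuples)": [
        '{"/m/09c7w0": "United States of America"}',
        "{}",
        '{"/m/09c7w0": "United States of America"}',
    ],
})

tuple_columns = [c for c in toy.columns if c.endswith("(Freebase ID:name tuples)")]

cleaned = toy.copy()
# mask() replaces the cells where the condition is True with NaN
cleaned[tuple_columns] = cleaned[tuple_columns].mask(cleaned[tuple_columns] == "{}")

missing_pct = cleaned.isna().mean() * 100
print(missing_pct)
```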


Then, we looked at the CMU plot summaries data :

In [11]:
df_cmu_summaries = pd.read_csv("MovieSummaries/plot_summaries.txt",sep='\t', names=['Wikipedia movie ID', 'Plot summary'])
df_cmu_summaries.head(3)

print('Percentage of missing summaries : ')
print(100 - df_cmu_summaries.shape[0] / df_cmu_movie.shape[0] * 100)

Percentage of missing summaries : 
48.24751348772342
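
The figure above compares row counts, which assumes that every summary belongs to a movie of the metadata table. A stricter check, sketched here on toy data with made-up IDs, matches the two tables on the Wikipedia movie ID instead:

```python
import pandas as pd

# Toy stand-ins for df_cmu_movie and df_cmu_summaries (made-up IDs)
movies = pd.DataFrame({"Wikipedia movie ID": [1, 2, 3, 4]})
summaries = pd.DataFrame({
    "Wikipedia movie ID": [2, 3, 99],
    "Plot summary": ["summary of movie 2", "summary of movie 3", "summary of movie 99"],
})

# A movie has a summary only if its ID appears in the summaries table
has_summary = movies["Wikipedia movie ID"].isin(summaries["Wikipedia movie ID"])
pct_missing = 100 * (1 - has_summary.mean())

# Summaries whose ID is absent from the metadata would bias a row-count ratio
orphans = ~summaries["Wikipedia movie ID"].isin(movies["Wikipedia movie ID"])

print(f"movies without a summary: {pct_missing:.1f}%")
print(f"summaries without a movie: {orphans.sum()}")
```

In this toy case the row-count ratio would report 25% missing, while the ID-based check reports 50%, because one summary does not match any movie.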


## Additional datasets

The CMU movie metadata contains relatively few movies and no recent ones (it stops in 2012). Moreover, it has many missing values, especially for the box office revenue. We therefore decided to complete this dataset to obtain a more representative one. We use:
* Wikipedia to query box office revenues that were missing
* IMDB dataset to complete the amount of movies

You have to download the IMDB data from https://datasets.imdbws.com/. Please download these files:
* `title.akas.tsv.gz`
* `title.basics.tsv.gz`
* `title.ratings.tsv.gz`

Unzip them and place them in `IMDB_data/`.

Running `wikipedia_query.py` creates the file `Expanded_data/wikipedia_query.tsv`. The script queries Wikipedia for films that have an IMDB entry, together with their box office revenues and Freebase IDs when these are available on Wikipedia. The Freebase IDs allow us to link this data to the CMU movie metadata. You do not need to run this command, as the `wikipedia_query.tsv` file was small enough to be pushed to GitHub.

In [None]:
#!python3 wikipedia_query.py

Running `expand_data_bis.py` creates the files `Expanded_data/IMDB_wiki.tsv` and `Expanded_data/movie.expanded_metadata.tsv`. The script brings together the IMDB data and the associated Wikipedia data to create a representative movie dataset, used for large-scale analysis. It then fills in the missing values of the CMU movie metadata. So, we have:
* `Expanded_data/IMDB_wiki.tsv`, which is the merge of the IMDB and Wikipedia query data.
* `Expanded_data/movie.expanded_metadata.tsv`, in which the missing release dates and box office revenues of the CMU movie metadata were filled in.

In [59]:
!python3 expand_data_bis.py
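
To illustrate what "filling in the missing values" means here, the sketch below fills NaN cells of one table with the values of another table that shares the same Freebase ID. It is not the code of `expand_data_bis.py`; the two toy frames and their values are made up, and we assume that each Freebase ID appears only once per table.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the CMU metadata and the Wikipedia query result (made-up values)
cmu = pd.DataFrame({
    "Freebase movie ID": ["/m/x1", "/m/x2", "/m/x3"],
    "Movie release date": ["1999", None, "2005"],
    "Movie box office revenue": [np.nan, 1_000_000.0, np.nan],
})
wiki = pd.DataFrame({
    "Freebase movie ID": ["/m/x1", "/m/x2"],
    "Movie release date": ["2000", "2001"],
    "Movie box office revenue": [5_000_000.0, 9_999.0],
})

key = "Freebase movie ID"
# fillna with a DataFrame aligns on index and columns:
# only the NaN cells of cmu are replaced, existing values are kept
filled = cmu.set_index(key).fillna(wiki.set_index(key)).reset_index()
print(filled)
```

Unlike `combine_first`, this keeps exactly the rows of the first table, which matches the fact that the expanded metadata has the same number of rows as the original CMU file.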

## Preprocessing 

In [60]:
from preprocessing import *

First, we load the `Expanded_data/movie.expanded_metadata.tsv` file :

In [61]:
df_movie = pd.read_csv("Expanded_data/movie.expanded_metadata.tsv",sep='\t')
print(df_movie.shape)
df_movie.head(3)

(81741, 9)


Unnamed: 0,Wikipedia movie ID,Freebase movie ID,Movie name,Movie release date,Movie box office revenue,Movie runtime,Movie languages (Freebase ID:name tuples),Movie countries (Freebase ID:name tuples),Movie genres (Freebase ID:name tuples)
0,142780.0,/m/011_mj,The Circus,1928-01-06,,68.0,"{""/m/06ppq"": ""Silent film"", ""/m/02h40lc"": ""Eng...","{""/m/09c7w0"": ""United States of America""}","{""/m/06cvj"": ""Romantic comedy"", ""/m/06ppq"": ""S..."
1,142786.0,/m/011_p6,Thunderbolt,1929,,91.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama"", ""/m/01g6gs"": ""Black-an..."
2,142822.0,/m/011_zy,The Green Goddess,1930-02-13,,73.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02rd8h3"": ""Goat gland"", ""/m/07s9rl0"": ""Dr..."


The `Expanded_data/movie.expanded_metadata.tsv` file is now our CMU movie metadata, with missing values completed as far as possible. We noticed that the movie languages, countries and genres columns were stored as strings, which is not easy to work with. So, we decided to convert these columns to dictionaries and to create new columns containing only the list of names. This preprocessing is done with functions defined in `preprocessing.py`.
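
For readers who do not have `preprocessing.py` at hand, the conversion can be pictured as follows. The helpers `parse_tuples` and `names_only` are hypothetical names, not the functions of `preprocessing.py`, and the sketch assumes that the strings are valid JSON objects, which may not hold for every row of the real file.

```python
import json

import pandas as pd


def parse_tuples(raw):
    """Hypothetical parser: JSON-like string -> dict (empty dict if missing)."""
    if not isinstance(raw, str):
        return {}
    return json.loads(raw)


def names_only(mapping):
    """Keep only the human-readable names, dropping the Freebase IDs."""
    return list(mapping.values())


raw = pd.Series([
    '{"/m/07s9rl0": "Drama", "/m/01g6gs": "Black-and-white"}',
    "{}",
])
parsed = raw.apply(parse_tuples)
names = parsed.apply(names_only)
print(names.tolist())
```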

In [62]:
# Change the (Freebase ID:name tuples) columns from strings to dictionaries
df_movie['Movie genres (Freebase ID:name tuples)'] = df_movie['Movie genres (Freebase ID:name tuples)'].apply(lambda x: transform_into_dict(x))
df_movie['Movie countries (Freebase ID:name tuples)'] = df_movie['Movie countries (Freebase ID:name tuples)'].apply(lambda x: transform_into_dict(x))
df_movie['Movie languages (Freebase ID:name tuples)'] = df_movie['Movie languages (Freebase ID:name tuples)'].apply(lambda x: transform_into_dict_2(x))

# New column with the list of names only --> easier to do analyses 
df_movie['Movie languages names'] = df_movie['Movie languages (Freebase ID:name tuples)'].apply(lambda x: transform_to_list_names(x))
df_movie['Movie countries names'] = df_movie['Movie countries (Freebase ID:name tuples)'].apply(lambda x: transform_to_list_names(x))
df_movie['Movie genres names'] = df_movie['Movie genres (Freebase ID:name tuples)'].apply(lambda x: transform_to_list_names(x))

df_movie.head(3)

Unnamed: 0,Wikipedia movie ID,Freebase movie ID,Movie name,Movie release date,Movie box office revenue,Movie runtime,Movie languages (Freebase ID:name tuples),Movie countries (Freebase ID:name tuples),Movie genres (Freebase ID:name tuples),Movie languages names,Movie countries names,Movie genres names
0,142780.0,/m/011_mj,The Circus,1928-01-06,,68.0,"{'/m/06ppq': 'Silent film', 'm/02h40lc': 'Engl...",{'/m/09c7w0': 'United States of America'},"{'/m/06cvj': 'Romantic comedy', '/m/06ppq': 'S...","[Silent film, English Language]",[United States of America],"[Romantic comedy, Silent film, Adventure, Blac..."
1,142786.0,/m/011_p6,Thunderbolt,1929,,91.0,{'/m/02h40lc': 'English Language'},{'/m/09c7w0': 'United States of America'},"{'/m/07s9rl0': 'Drama', '/m/01g6gs': 'Black-an...",[English Language],[United States of America],"[Drama, Black-and-white]"
2,142822.0,/m/011_zy,The Green Goddess,1930-02-13,,73.0,{'/m/02h40lc': 'English Language'},{'/m/09c7w0': 'United States of America'},"{'/m/02rd8h3': 'Goat gland', '/m/07s9rl0': 'Dr...",[English Language],[United States of America],"[Goat gland, Drama, Indie, Black-and-white]"


Then, we can also merge the plot summaries into `df_movie`:

In [63]:
# Merge with df_movie on the Wikipedia movie ID (left merge keeps every movie)
df_movie = pd.merge(df_movie, df_cmu_summaries, how='left', on='Wikipedia movie ID')

Then, we can load the bigger dataset `Expanded_data/IMDB_wiki.tsv` :

In [64]:
df_movie_big = pd.read_csv("Expanded_data/IMDB_wiki.tsv",sep='\t')
print(df_movie_big.shape)
df_movie_big.head(3)

(9363536, 7)


Unnamed: 0,Freebase movie ID,Movie name,Movie release date,Movie genres names,averageRating,numVotes,Movie box office revenue
0,,Carmencita,1894,"['Documentary', 'Short']",5.7,1922.0,
1,,The Clown and His Dogs,1892,"['Animation', 'Short']",5.8,259.0,
2,,,1892,"['Animation', 'Comedy', 'Romance']",6.5,1733.0,


We noticed that missing values in this dataset are encoded either as NaN or as the string '\\N'. So, we replaced '\\N' with NaN for consistency across the dataset.

In [65]:
#print('\nSum of in the string columns : ')
#print((df_movie_big == '\\N').sum(axis = 0) / df_movie_big.shape[0] * 100)

# float('NaN') otherwise it is not recognized as the other NaN of the dataset
df_movie_big.replace('\\N', float('NaN'), inplace=True)

print('Percentage of NaN in each feature : ')
print(df_movie_big.isna().sum(axis = 0) / df_movie_big.shape[0] * 100)

Percentage of NaN in each feature : 
Freebase movie ID           98.904890
Movie name                  86.264163
Movie release date          13.450378
Movie genres names           0.036717
averageRating               86.695069
numVotes                    86.695069
Movie box office revenue    99.967576
dtype: float64
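
An alternative to replacing the marker afterwards is to declare it when reading the file, through the `na_values` parameter of `pd.read_csv`. The sketch below reads a small in-memory TSV (made-up rows) rather than the real `IMDB_wiki.tsv`:

```python
import io

import pandas as pd

# Small in-memory TSV using the same missing-value marker (made-up rows)
tsv = (
    "Movie name\tMovie release date\taverageRating\n"
    "Carmencita\t1894\t5.7\n"
    "\\N\t1892\t\\N\n"
)

# na_values tells the parser to treat the marker as NaN while reading
df = pd.read_csv(io.StringIO(tsv), sep="\t", na_values=["\\N"])
print(df)
print(df.dtypes)
```

A side effect of doing it at read time is that numeric columns are parsed as numbers directly, instead of being kept as strings because of the marker.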


We now have two datasets:
* a big dataset merged with IMDB and wikipedia data
* one smaller dataset whose missing values were completed as much as possible

So, we merge these two datasets into a single large one that contains all the information we want, even though it will contain many missing values. We merge the 'Movie name', 'Movie release date', 'Movie genres names' and 'Movie box office revenue' features using the Freebase ID.

In [70]:
df_movie_final = df_movie_big.set_index('Freebase movie ID').combine_first(
    df_movie.set_index('Freebase movie ID')[['Movie name', 'Movie release date',
                                             'Movie genres names', 'Movie box office revenue']]).reset_index()

cols = ['Freebase movie ID', 'Movie name',
       'Movie release date', 'Movie genres names',
       'averageRating', 'numVotes', 'Movie box office revenue']
df_movie_final = df_movie_final[cols]
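
The behaviour of `combine_first` is easier to see on a toy example (made-up values): cells that are NaN in the calling frame are filled from the other frame, existing values are kept, and rows that exist only in the other frame are appended.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the big IMDB/Wikipedia table (made-up values)
big = pd.DataFrame({
    "Freebase movie ID": ["/m/a", "/m/b"],
    "Movie name": [np.nan, "Kept name"],
    "averageRating": [6.5, 7.0],
}).set_index("Freebase movie ID")

# Toy stand-in for the smaller CMU table (made-up values)
small = pd.DataFrame({
    "Freebase movie ID": ["/m/a", "/m/b", "/m/c"],
    "Movie name": ["Filled name", "Ignored name", "Only in small"],
}).set_index("Freebase movie ID")

# Values of `big` win; `small` only fills the gaps and adds its own extra rows
combined = big.combine_first(small).reset_index()
print(combined)
```

Because the result uses the union of both indexes, it can have more rows than the calling frame, which is consistent with `df_movie_final` being larger than `df_movie_big`. The column order may also change, hence the explicit column selection with `cols`.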

In [73]:
print(df_movie_final.shape)
df_movie_final.head()

(9383377, 7)


Unnamed: 0,Freebase movie ID,Movie name,Movie release date,Movie genres names,averageRating,numVotes,Movie box office revenue
0,/m/0100_m55,Urban Animals,1987,"['Comedy', 'Sci-Fi']",5.2,79.0,
1,/m/0100_mnm,,1999,['Comedy'],5.8,15.0,
2,/m/0100_nzr,,1999,['Drama'],4.8,119.0,
3,/m/0100_pgp,,1988,['Comedy'],6.8,103.0,
4,/m/0100_pz9,,1985,['Comedy'],2.4,59.0,


As we are interested in the box office revenue, we look at this dataset after dropping the rows with a missing box office revenue:

In [74]:
df_movie_box = df_movie_final.dropna(subset=['Movie box office revenue'])
print(df_movie_box.shape)
df_movie_box.head(3)

(10200, 7)


Unnamed: 0,Freebase movie ID,Movie name,Movie release date,Movie genres names,averageRating,numVotes,Movie box office revenue
9,/m/0100blym,Testament of Youth,2014,"['Biography', 'Drama', 'History']",7.2,29157.0,1800000.0
33,/m/0100khzv,Leviathan,2014,"['Crime', 'Drama']",7.6,53856.0,4100000.0
231,/m/0105j_71,Spy,2015,"['Action', 'Comedy']",7.0,245896.0,235700000.0


We have 10,200 movies on which to run our analyses of box office revenues. We assume that this sample is large enough to give reliable results. However, we have to verify this hypothesis with further analyses.
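
One possible first check, given only as a sketch, is to look at how the share of movies with a known revenue varies per decade: a sample concentrated in a few decades would be less representative. The toy frame below uses made-up rows with the mixed date formats seen above, and the four-character year extraction is our assumption about those formats.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_movie_final (made-up rows, mixed date formats)
toy = pd.DataFrame({
    "Movie release date": ["1928-01-06", "1929", "2014", "2015", "2014-05-01", None],
    "Movie box office revenue": [np.nan, np.nan, 1_800_000.0, 235_700_000.0, np.nan, 4_100_000.0],
})

# The first four characters hold the year, whether the date is 'YYYY' or 'YYYY-MM-DD'
year = pd.to_numeric(toy["Movie release date"].astype(str).str[:4], errors="coerce")
decade = (year // 10 * 10).astype("Int64")

# Share of movies with a known revenue in each decade (rows without a date are ignored)
coverage = toy["Movie box office revenue"].notna().groupby(decade).mean() * 100
print(coverage)
```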