In [1]:
import pandas as pd

## Basic exploration of the CMU data original dataset
First, we opened the **CMU character metadata** and looked at the missing values :

In [2]:
columns_character = ['Wikipedia movie ID', 'Freebase movie ID', 'Movie release date', 'Character name', 'Actor date of birth',
                     'Actor gender', 'Actor height (in meters)', 'Actor ethnicity (Freebase ID)', 'Actor name',
                     'Actor age at movie release', 'Freebase character/actor map ID', 'Freebase character ID',
                     'Freebase actor ID']

df_cmu_character = pd.read_csv("data/MovieSummaries/character.metadata.tsv",sep='\t',names=columns_character)

print(df_cmu_character.shape)
print('Percentage of NaN in each feature : ')
print(df_cmu_character.isna().sum(axis = 0) / df_cmu_character.shape[0] * 100)

(450669, 13)
Percentage of NaN in each feature : 
Wikipedia movie ID                  0.000000
Freebase movie ID                   0.000000
Movie release date                  2.217814
Character name                     57.220488
Actor date of birth                23.552763
Actor gender                       10.120288
Actor height (in meters)           65.645740
Actor ethnicity (Freebase ID)      76.466542
Actor name                          0.272484
Actor age at movie release         35.084064
Freebase character/actor map ID     0.000000
Freebase character ID              57.218269
Freebase actor ID                   0.180842
dtype: float64


Since we were initially interested in the ethnicities of the actors, we mapped each Freebase ID to its label using the `mid2name.tsv` file found at https://github.com/xiaoling/figer/issues/6

In [3]:
df_mapID = pd.read_csv("data/Expanded_data/mid2name.tsv", sep='\t', names=['ID', 'label'])
df_mapID = df_mapID.drop_duplicates(subset=["ID"], keep='first')
ethnicity = df_cmu_character['Actor ethnicity (Freebase ID)']
df_ethnicity = ethnicity.to_frame()
df_ethnicity.columns = ['ID']
df_merge = pd.merge(df_ethnicity, df_mapID, how='left')
df_cmu_character['Ethnicity'] = df_merge['label']

We found that it was difficult to fill in the missing actor ethnicities from external datasets, so we decided to pursue another idea.
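One way to quantify that difficulty is to measure how many rows carry an ethnicity ID at all, and how many of those IDs resolve to a label. The sketch below does this with `Series.map` on toy data; the IDs and labels are hypothetical placeholders, not values from the real files.

```python
import pandas as pd

# Toy stand-ins for the character table and the mid2name lookup (hypothetical values).
characters = pd.DataFrame({
    "Actor ethnicity (Freebase ID)": ["/m/id_a", None, "/m/id_z", "/m/id_a"],
})
mid2name = pd.DataFrame({
    "ID": ["/m/id_a", "/m/id_a", "/m/id_b"],
    "label": ["Label A", "Label A (duplicate)", "Label B"],
})

# Keep one label per ID, then map IDs to labels without changing the row order.
id_to_label = mid2name.drop_duplicates(subset="ID", keep="first").set_index("ID")["label"]
characters["Ethnicity"] = characters["Actor ethnicity (Freebase ID)"].map(id_to_label)

has_id = characters["Actor ethnicity (Freebase ID)"].notna()
has_label = characters["Ethnicity"].notna()
print(f"Rows with an ethnicity ID: {has_id.mean():.0%}")
print(f"Rows with a resolved label: {has_label.mean():.0%}")
print(f"IDs present but unresolved: {(has_id & ~has_label).sum()}")
```

Compared with a left merge, `map` keeps the original index, so the new column can be assigned directly without relying on the row order of the merged frame.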

Then, we opened the **CMU movies metadata** and looked at the missing values :

In [4]:
columns_movie = ['Wikipedia movie ID', 'Freebase movie ID', 'Movie name', 'Movie release date', 'Movie box office revenue',
                 'Movie runtime', 'Movie languages (Freebase ID:name tuples)', 'Movie countries (Freebase ID:name tuples)',
                 'Movie genres (Freebase ID:name tuples)']
df_cmu_movie = pd.read_csv("data/MovieSummaries/movie.metadata.tsv",sep='\t', names=columns_movie)

print(df_cmu_movie.shape)
print('Percentage of NaN in each feature : ')
print(df_cmu_movie.isna().sum(axis = 0) / df_cmu_movie.shape[0] * 100)

print('\nSum of {} in the string columns : ')
print('Movie languages : {}'.format(sum(df_cmu_movie['Movie languages (Freebase ID:name tuples)']=='{}')))
print('Movie countries : {}'.format(sum(df_cmu_movie['Movie countries (Freebase ID:name tuples)']=='{}')))
print('Movie genres : {}'.format(sum(df_cmu_movie['Movie genres (Freebase ID:name tuples)']=='{}')))

(81741, 9)
Percentage of NaN in each feature : 
Wikipedia movie ID                            0.000000
Freebase movie ID                             0.000000
Movie name                                    0.000000
Movie release date                            8.443743
Movie box office revenue                     89.722416
Movie runtime                                25.018045
Movie languages (Freebase ID:name tuples)     0.000000
Movie countries (Freebase ID:name tuples)     0.000000
Movie genres (Freebase ID:name tuples)        0.000000
dtype: float64

Sum of {} in the string columns : 
Movie languages : 13866
Movie countries : 8154
Movie genres : 2294
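The three string columns report 0% NaN only because an empty value is stored as the literal string `{}`. A minimal sketch of how such cells could be turned into real missing values is shown below. It assumes each cell is a JSON-encoded dictionary of Freebase ID to name; the helper `parse_freebase_dict` and the toy rows are hypothetical.

```python
import json

import numpy as np
import pandas as pd


def parse_freebase_dict(cell):
    """Return the list of names stored in a '{"id": "name", ...}' string, or NaN if empty."""
    if pd.isna(cell):
        return np.nan
    names = list(json.loads(cell).values())
    return names if names else np.nan


# Toy rows in the assumed format (not taken from the real file).
column = "Movie languages (Freebase ID:name tuples)"
movies = pd.DataFrame({
    column: [
        '{"/m/id_1": "English Language"}',
        "{}",
        '{"/m/id_1": "English Language", "/m/id_2": "French Language"}',
    ]
})

movies["Movie languages"] = movies[column].apply(parse_freebase_dict)
print(movies["Movie languages"].isna().mean() * 100)  # percentage of truly missing values
```

After this step, the usual `isna()` summary also accounts for the empty dictionaries.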


Then, we looked at the CMU plot summaries data :

In [5]:
df_cmu_summaries = pd.read_csv("data/MovieSummaries/plot_summaries.txt",sep='\t', names=['Wikipedia movie ID', 'Plot summary'])
df_cmu_summaries.head(3)

print('Percentage of missing summaries : ')
print(100 - df_cmu_summaries.shape[0] / df_cmu_movie.shape[0] * 100)

Percentage of missing summaries : 
48.24751348772342
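The percentage above is derived from row counts, which implicitly assumes that every summary belongs to a movie present in the metadata table. A hedged alternative is to match on `Wikipedia movie ID` directly, as sketched here on toy data (the IDs are made up):

```python
import pandas as pd

# Toy tables: movie 99 has a summary but no metadata row.
movies = pd.DataFrame({"Wikipedia movie ID": [1, 2, 3, 4]})
summaries = pd.DataFrame({
    "Wikipedia movie ID": [2, 4, 99],
    "Plot summary": ["a", "b", "c"],
})

# Share of metadata rows that have no matching summary.
has_summary = movies["Wikipedia movie ID"].isin(summaries["Wikipedia movie ID"])
missing_pct = (1 - has_summary.mean()) * 100

# Summaries whose ID does not appear in the metadata table.
orphans = int((~summaries["Wikipedia movie ID"].isin(movies["Wikipedia movie ID"])).sum())

print(f"Movies without a summary: {missing_pct:.1f}%")
print(f"Summaries without a metadata row: {orphans}")
```

If the number of orphan summaries is zero, both ways of computing the percentage agree.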


## Pulling data to complete our dataset

The CMU movie metadata covers relatively few movies, and none more recent than 2012. Moreover, it has many missing values, especially for the box office revenue. So, we decided to complete this dataset to make it more representative. We use :
* **Wikipedia** to query box office revenues that were missing
* **IMDB** dataset to increase the number of movies and ensure a good representation of their variety
* **TMDB** dataset to fetch movie budget and country of origin (production)
* **Inflation** dataset to get corrected box office revenue and budget across the years. We found data on the inflation of each country from 1960 to 2021 on https://data.worldbank.org/indicator/FP.CPI.TOTL.ZG

If you want to run the notebooks, you have to download IMDB data from https://datasets.imdbws.com/. Please download these files :
* `title.akas.tsv.gz`
* `title.basics.tsv.gz`
* `title.ratings.tsv.gz`

Unzip them and place them in `data/IMDB_data/`.
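As a rough illustration of how one of these files might be loaded, the sketch below reads a tab-separated sample and treats IMDb's `\N` placeholder as missing. The in-memory sample stands in for an unzipped file, and only a subset of columns is shown; the explicit `dtype` mapping is our own choice, not something taken from the project scripts.

```python
import csv
import io

import pandas as pd

# Stand-in for an unzipped IMDb file placed in data/IMDB_data/ (toy rows).
sample = io.StringIO(
    "tconst\ttitleType\tprimaryTitle\tstartYear\tgenres\n"
    "tt0000001\tmovie\tExample One\t1999\tComedy,Drama\n"
    "tt0000002\tmovie\tExample Two\t\\N\t\\N\n"
)

basics = pd.read_csv(
    sample,
    sep="\t",
    na_values="\\N",          # IMDb's placeholder for missing values
    dtype={"tconst": str, "titleType": str, "primaryTitle": str, "genres": str},
    quoting=csv.QUOTE_NONE,   # do not treat quote characters in titles specially
)
print(basics.isna().sum())
```

Declaring `na_values` at read time avoids a separate replacement step later, and fixing the string columns' types keeps pandas from guessing them chunk by chunk.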

The other data is obtained by running the `.py` scripts described below, which query the relevant APIs directly.

### Wikipedia query
Running `wikipedia_query.py` creates a `data/Expanded_data/wikipedia_query.tsv` file. The script queries Wikipedia for films that have an IMDb entry, together with their box office revenue and Freebase ID when available. The Freebase IDs let us link this data to the CMU movie metadata. You do not need to run this script, as the `wikipedia_query.tsv` file was small enough to be pushed to GitHub.
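For readers curious about what such a query could look like, here is a minimal sketch. It is not the project's script: it assumes the Wikidata SPARQL endpoint and the `SPARQLWrapper` package, and the names `QUERY`, `fetch_bindings` and `bindings_to_frame` are hypothetical. Only the parsing step runs offline; `fetch_bindings` needs network access.

```python
import pandas as pd

# Assumed Wikidata properties: P31/Q11424 = instance of film, P345 = IMDb ID,
# P646 = Freebase ID, P2142 = box office.
QUERY = """
SELECT ?imdb ?freebase ?boxOffice WHERE {
  ?film wdt:P31 wd:Q11424 ;
        wdt:P345 ?imdb .
  OPTIONAL { ?film wdt:P646 ?freebase . }
  OPTIONAL { ?film wdt:P2142 ?boxOffice . }
}
LIMIT 100
"""


def fetch_bindings(endpoint="https://query.wikidata.org/sparql"):
    """Run QUERY and return the raw result rows (requires network and SPARQLWrapper)."""
    from SPARQLWrapper import SPARQLWrapper, JSON  # imported lazily on purpose

    client = SPARQLWrapper(endpoint)
    client.setQuery(QUERY)
    client.setReturnFormat(JSON)
    return client.query().convert()["results"]["bindings"]


def bindings_to_frame(bindings):
    """Flatten SPARQL JSON bindings into a DataFrame; absent optional fields become NaN."""
    rows = [{name: cell["value"] for name, cell in row.items()} for row in bindings]
    frame = pd.DataFrame(rows, columns=["imdb", "freebase", "boxOffice"])
    frame["boxOffice"] = pd.to_numeric(frame["boxOffice"])
    return frame


# Offline example with hand-written bindings in the standard SPARQL JSON shape.
sample_bindings = [
    {"imdb": {"value": "tt0000001"}, "boxOffice": {"value": "1800000"}},
    {"imdb": {"value": "tt0000002"}, "freebase": {"value": "/m/id_x"}},
]
frame = bindings_to_frame(sample_bindings)
print(frame)
```

The `OPTIONAL` clauses keep films that lack a Freebase ID or a box office value, which matches the "if available" wording above.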

In [2]:
# !python3 data_creation_scripts/wikipedia_query.py
# if `python3` is not found, try `python` instead



### Merge IMDB and Wikipedia data
Running `expand_data.py` creates a `data/Expanded_data/big_data.tsv` file. The script combines the IMDB data with the associated Wikipedia data and, notably, the box office values from the CMU dataset, to create a large, representative movie dataset used for large-scale analysis.

In [6]:
!python3 data_creation_scripts/expand_data.py
# if `python3` is not found, try `python` instead
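The combination step can be pictured as two left joins followed by a fallback between revenue sources. The sketch below is only an illustration on toy data: the column names `wiki_revenue` and `cmu_revenue`, and the choice to prefer the CMU value over the Wikipedia one, are assumptions rather than the behaviour of `expand_data.py`.

```python
import numpy as np
import pandas as pd

# Toy tables with hypothetical IDs and values.
imdb = pd.DataFrame({"IMDB_id": ["tt1", "tt2", "tt3"], "averageRating": [5.2, 7.2, 6.0]})
wiki = pd.DataFrame({
    "IMDB_id": ["tt2", "tt3"],
    "Freebase movie ID": ["/m/id_b", "/m/id_c"],
    "wiki_revenue": [1_800_000.0, np.nan],
})
cmu = pd.DataFrame({"Freebase movie ID": ["/m/id_c"], "cmu_revenue": [250_000.0]})

# IMDb is the base table; Wikipedia brings the Freebase ID, which links to CMU.
merged = (
    imdb.merge(wiki, on="IMDB_id", how="left")
        .merge(cmu, on="Freebase movie ID", how="left")
)

# Take the CMU revenue when present, otherwise the Wikipedia one (assumed priority).
merged["Movie box office revenue"] = merged["cmu_revenue"].combine_first(merged["wiki_revenue"])
merged = merged.drop(columns=["cmu_revenue", "wiki_revenue"])
print(merged)
```

Note that pandas matches NaN keys with each other in a merge, so the table joined on `Freebase movie ID` should not itself contain rows with a missing ID.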

### TMDB query
Running `TMDB_query.py` creates a `data/Expanded_data/TMDB_query.tsv` file. The script uses the IMDb IDs and Freebase IDs to query movie budgets and countries of origin (production). You do not need to run this script, as the `TMDB_query.tsv` file was small enough to be pushed to GitHub.

In [4]:
#!python3 data_creation_scripts/TMDB_query.py
# if `python3` is not found, try `python` instead

### Merge TMDB to big_data
Running `final_dataset_creation.py` creates `data/Expanded_data/big_data_final.tsv`. The script adds budget and revenue to our previous big data, using the index corresponding to box office revenue. We also preprocess this big dataset :
* As the movie release date is not in a homogeneous format across all movies, we decided to keep only the year as a timestamp.
* We noticed that missing values appear either as 'NaN' or as '\\N' in this dataset, so we changed the '\\N' into 'NaN' for consistency.
* The movie genres were stored as strings. We converted them into lists of strings and replaced the [\\\N] list with NaN.
* We computed the inflation coefficient, the inflation-corrected box office revenue and the inflation-corrected budget wherever the data is not missing, which yields three new columns in the final dataset. We use the US inflation, as all the prices in the datasets are in US dollars.
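The preprocessing steps listed above could be sketched as follows. This is an illustration under stated assumptions, not the content of `final_dataset_creation.py`: release dates are assumed to be year-first strings, genres comma-separated, the reference year is assumed to be 2021, and the inflation rates are made-up numbers rather than World Bank data. The helper `inflation_coeff` is hypothetical.

```python
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2021  # assumed target year for corrected prices

# Toy annual US inflation rates in percent (illustrative, not real data).
annual_inflation = pd.Series({2019: 2.0, 2020: 1.0, 2021: 5.0})


def inflation_coeff(year):
    """Factor converting prices of `year` into REFERENCE_YEAR prices."""
    if pd.isna(year):
        return np.nan
    year = int(year)
    if year < annual_inflation.index.min() - 1 or year > REFERENCE_YEAR:
        return np.nan
    # Compound the inflation of every year after `year`, up to the reference year.
    later_years = annual_inflation.loc[year + 1 : REFERENCE_YEAR]
    return float((1 + later_years / 100).prod())


raw = pd.DataFrame({
    "Movie release date": ["2019-05-17", "2020", "\\N"],
    "genres": ["Comedy,Drama", "\\N", "Western"],
    "Movie box office revenue": [1000.0, np.nan, 500.0],
})

clean = raw.replace("\\N", np.nan)                                   # unify missing markers
clean["Movie release date"] = clean["Movie release date"].str[:4].astype(float)  # keep the year
clean["Movie genres names"] = clean["genres"].str.split(",")         # string -> list of strings
clean["inflation coeff"] = clean["Movie release date"].apply(inflation_coeff)
clean["inflation corrected revenue"] = (
    clean["Movie box office revenue"] * clean["inflation coeff"]
)
print(clean)
```

A corrected value is only produced when both the revenue and the release year are known, which is why the corrected columns can have more missing values than the raw ones.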

In [7]:
!python3 data_creation_scripts/final_dataset_creation.py
# if `python3` is not found, try `python` instead



## Basic exploration of the final dataset

In [10]:
final_data = pd.read_csv('data/Expanded_data/big_data_final.tsv', sep='\t')



In [11]:
final_data.head(10)

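| | Freebase movie ID | IMDB_id | Movie box office revenue | Movie genres names | Movie name | Movie release date | averageRating | budget | numVotes | prod_country | inflation coeff | inflation corrected revenue | inflation corrected budget |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /m/0100_m55 | tt0138297 | NaN | ['Comedy', 'Sci-Fi'] | Urban Animals | 1987.0 | 5.2 | NaN | 79.0 | NaN | 2.384772 | NaN | NaN |
| 1 | /m/0100_mnm | tt0202813 | NaN | ['Comedy'] | NaN | 1999.0 | 5.8 | NaN | 15.0 | NaN | 1.626713 | NaN | NaN |
| 2 | /m/0100_nzr | tt0184302 | NaN | ['Drama'] | NaN | 1999.0 | 4.8 | NaN | 119.0 | NaN | 1.626713 | NaN | NaN |
| 3 | /m/0100_pgp | tt0094831 | NaN | ['Comedy'] | NaN | 1988.0 | 6.8 | NaN | 103.0 | NaN | 2.291337 | NaN | NaN |
| 4 | /m/0100_pz9 | tt0088884 | NaN | ['Comedy'] | NaN | 1985.0 | 2.4 | NaN | 59.0 | NaN | 2.519087 | NaN | NaN |
| 5 | /m/0100b4n_ | tt0074791 | NaN | ['Comedy', 'Romance'] | NaN | 1976.0 | 5.0 | NaN | 39.0 | NaN | 4.761513 | NaN | NaN |
| 6 | /m/0100b5r4 | tt10147624 | NaN | ['Comedy'] | NaN | 1992.0 | NaN | NaN | NaN | NaN | 1.93113 | NaN | NaN |
| 7 | /m/0100b64g | tt6568614 | NaN | ['Western'] | NaN | 1970.0 | 6.7 | NaN | 7.0 | NaN | 6.979259 | NaN | NaN |
| 8 | /m/0100bkr7 | tt1329171 | NaN | NaN | NaN | 1993.0 | NaN | NaN | NaN | NaN | 1.875764 | NaN | NaN |
| 9 | /m/0100blym | tt1441953 | 1800000.0 | ['Biography', 'Drama', 'History'] | Testament of Youth | 2014.0 | 7.2 | NaN | 29135.0 | GB | 1.144606 | 2060292.0 | NaN |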


In [12]:
print('Percentage of NaN in each feature : ')
print(final_data.isna().sum(axis = 0) / final_data.shape[0] * 100)

Percentage of NaN in each feature : 
Freebase movie ID              81.634447
IMDB_id                         3.657842
Movie box office revenue       98.428813
Movie genres names             14.742497
Movie name                     49.981838
Movie release date             13.506234
averageRating                  56.554256
budget                         99.185316
numVotes                       56.554256
prod_country                   98.797753
inflation coeff                31.411421
inflation corrected revenue    98.611975
inflation corrected budget     99.226566
dtype: float64


In [13]:
print('Count in each feature : ')
final_data.count()

Count in each feature : 


Freebase movie ID              119321
IMDB_id                        625935
Movie box office revenue        10208
Movie genres names             553918
Movie name                     324968
Movie release date             561950
averageRating                  282267
budget                           5293
numVotes                       282267
prod_country                     7811
inflation coeff                445620
inflation corrected revenue      9018
inflation corrected budget       5025
dtype: int64

We have 10,208 movies with a box office revenue to analyse. We assume that this sample is large enough to give reliable results, but this hypothesis has to be verified with further analyses.

If we restrict ourselves to movies for which the box office revenue, genres, average rating, budget, production country and inflation coefficient are all available, 4433 movies remain.

In [14]:
final_data.dropna(subset=['Movie box office revenue', 'Movie genres names', 'averageRating', 'budget', 'prod_country', 'inflation coeff']).shape

(4433, 13)

When only the average rating and the money features (box office revenue, budget, inflation coeff) are required to be non-missing, we have 5019 movies.

In [16]:
final_data.dropna(subset=['Movie box office revenue', 'averageRating', 'budget', 'inflation coeff']).shape

(5019, 13)
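To compare several such feature combinations at once, the count of complete rows can be wrapped in a small helper. The sketch below uses toy data; `count_complete` and the subset names are hypothetical, and the result is equivalent to taking the first element of `dropna(subset=...).shape`.

```python
import numpy as np
import pandas as pd


def count_complete(df, columns):
    """Number of rows in which every listed column is non-missing."""
    return int(df[columns].notna().all(axis=1).sum())


# Toy frame with a few gaps in each feature.
toy = pd.DataFrame({
    "Movie box office revenue": [1.0, np.nan, 3.0, 4.0],
    "averageRating": [5.0, 6.0, np.nan, 7.0],
    "budget": [np.nan, 1.0, 2.0, 3.0],
})

subsets = {
    "revenue only": ["Movie box office revenue"],
    "revenue + rating": ["Movie box office revenue", "averageRating"],
    "revenue + rating + budget": ["Movie box office revenue", "averageRating", "budget"],
}
counts = {name: count_complete(toy, cols) for name, cols in subsets.items()}
for name, n in counts.items():
    print(f"{name}: {n} rows")
```

Listing the counts side by side makes it easier to see how much each additional required feature shrinks the usable sample.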