# Movies: Exploratory Data Analysis

by Israel Diaz

## Data Description

The data correspond to the files downloaded from the [IMDb source](https://datasets.imdbws.com/).

**IMDb Dataset Details**

Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name. The available datasets are as follows:

**title.akas.tsv.gz** - Contains the following information for titles:

* titleId (string) - a tconst, an alphanumeric unique identifier of the title
* ordering (integer) – a number to uniquely identify rows for a given titleId
* title (string) – the localized title
* region (string) - the region for this version of the title
* language (string) - the language of the title
* types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
* attributes (array) - Additional terms to describe this alternative title, not enumerated
* isOriginalTitle (boolean) – 0: not original title; 1: original title

**title.basics.tsv.gz** - Contains the following information for titles:

* tconst (string) - alphanumeric unique identifier of the title
* titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
* primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
* originalTitle (string) - original title, in the original language
* isAdult (boolean) - 0: non-adult title; 1: adult title
* startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
* endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
* runtimeMinutes – primary runtime of the title, in minutes
* genres (string array) – includes up to three genres associated with the title

**title.ratings.tsv.gz** – Contains the IMDb rating and votes information for titles

* tconst (string) - alphanumeric unique identifier of the title
* averageRating – weighted average of all the individual user ratings
* numVotes - number of votes the title has received
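
As a minimal sketch of how the format described above could be read with pandas: the snippet builds a tiny in-memory stand-in for `title.basics.tsv.gz` (the two rows are made up for illustration) and parses it with tab separators and `\N` mapped to missing values. With the real download, the buffer would presumably be replaced by the path to the file.

```python
import gzip
import io

import pandas as pd

# Made-up rows mimicking the title.basics.tsv.gz layout (illustration only).
raw = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\tgenres\n"
    "tt0000001\tmovie\tExample One\t2000\t\\N\tDrama,Romance\n"
    "tt0000002\ttvSeries\tExample Two\t1999\t2003\t\\N\n"
)
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))

basics = pd.read_csv(
    buffer,
    sep="\t",            # tab-separated values
    compression="gzip",  # stated explicitly because the input is a buffer
    na_values="\\N",     # IMDb marks missing fields with \N
)

# genres is a comma-separated string; split it into a list per title.
basics["genres_list"] = basics["genres"].str.split(",")
print(basics[["tconst", "endYear", "genres_list"]])
```

Columns that contain `\N` are read as floats or objects with `NaN`, so year columns may need an explicit nullable integer type afterwards.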

## Loading Data

### Import Libraries

In [1]:
## General
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter('ignore')


## suppress scientific notation
pd.options.display.float_format = '{:20,.2f}'.format


### Load Data

For now, I will explore only the data from the year 2000.

In [2]:
FOLDER = 'data/'
YEAR = 2000

I'll load the year 2000 data to test the code while the rest of the data is still being downloaded from TMDB.

In [3]:
data = pd.read_json(path_or_buf=FOLDER + f'tmdb_api_results_{YEAR}.json')

In [4]:
data.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,0.0,86.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.5,22.0,
2,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,0.0,5.1,8.0,
3,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,0.0,152.0,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,0.0,4.0,1.0,
4,tt0118694,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,843.0,cn,花樣年華,...,12854953.0,99.0,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,0.0,8.11,2136.0,PG


In [5]:
print(f'Number of instances: {len(data)}')

Number of instances: 1214


## Data Cleaning

### Duplicated

In [6]:
data['imdb_id'].duplicated().sum()

0

### Missing Values

In [7]:
data.isna().sum()

imdb_id                     0
adult                       1
backdrop_path             554
belongs_to_collection    1102
budget                      1
genres                      1
homepage                    1
id                          1
original_language           1
original_title              1
overview                    1
popularity                  1
poster_path               124
production_companies        1
production_countries        1
release_date                1
revenue                     1
runtime                     1
spoken_languages            1
status                      1
tagline                     1
title                       1
video                       1
vote_average                1
vote_count                  1
certification             424
dtype: int64

I will drop the following columns because they are unnecessary or duplicate other columns in the data:
* backdrop_path
* poster_path
* overview
* homepage
* id
* tagline
* original_title
* spoken_languages



In [8]:
data.drop(columns=['backdrop_path','poster_path', 'overview', 'homepage', 'id', 'tagline', 'original_title', 'spoken_languages'], inplace=True)


In [9]:
data.drop(0, axis=0, inplace=True)
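
The label-based drop above relies on the placeholder record sitting at index 0. A hedged alternative, sketched on a made-up frame, is to drop records by content instead: here a row counts as a placeholder when every column other than `imdb_id` is missing. The toy values and the `is_placeholder` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: the first record mimics the placeholder row (imdb_id 0, all else missing).
toy = pd.DataFrame({
    "imdb_id": [0, "tt0113026", "tt0113092"],
    "budget": [np.nan, 10000000.0, 0.0],
    "title": [None, "The Fantasticks", "For the Cause"],
})

# A row is treated as a placeholder when every column except imdb_id is missing.
is_placeholder = toy.drop(columns="imdb_id").isna().all(axis=1)
cleaned = toy[~is_placeholder]
print(cleaned)
```

The original index labels are kept, so later label-based lookups with `.loc` are unaffected.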

### Column Transformations

#### genres

In [10]:
test_genre = data.loc[125, 'genres']
print('Outer: ', type(test_genre))
print('Inner: ', type(test_genre[0]))
print(test_genre)

Outer:  <class 'list'>
Inner:  <class 'dict'>
[{'id': 28, 'name': 'Action'}, {'id': 878, 'name': 'Science Fiction'}]


In [11]:
#@ Define functions (by Israel Diaz)
def transform_dtl(list_of_dictionaries):
    '''Return the list of genre names from a list of genre dictionaries.'''
    return [d['name'] for d in list_of_dictionaries]

def transform_prodc(list_of_prod_names):
    '''Return the list of names from a list of production company (or country) dictionaries.'''
    return [d['name'] for d in list_of_prod_names]

def transform_lang(list_of_lang):
    '''Return the list of English language names from a list of language dictionaries.'''
    return [d['english_name'] for d in list_of_lang]

def build_lists(data, column, function):
    '''Add a list_<column> column by applying "function" to every non-null value of "column".'''
    list_column = f'list_{column}'
    data[list_column] = data[column].map(function, na_action='ignore')
    return data

# Not used yet; stored here because I will use it further ahead.
def explode_data(data, reference, column, function):
    '''Transform the lists of "column" into dummy variables, keyed by "reference".'''
    list_column = f'list_{column}'
    # build_lists needs the transformation function, so it is passed through here
    data = build_lists(data, column, function)

    subset = data[[reference, column, list_column]].set_index(reference)
    # dummy variables (sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() replaces it)
    dummies = pd.get_dummies(subset[list_column].apply(pd.Series).stack(), prefix=column, prefix_sep='_', dummy_na=True).groupby(level=0).sum()
    # merge data
    data = data.merge(dummies, how='inner', left_on=reference, right_index=True)
    # drop the columns replaced by the dummy variables
    data = data.drop(columns=[column, list_column])
    return data
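
To illustrate what the dummy-variable step is meant to produce, here is a minimal sketch on made-up data. It is not the `explode_data` function itself: it uses `Series.explode` and a hypothetical helper name, `list_dummies`, as one possible way to turn a column of lists into indicator columns.

```python
import pandas as pd

def list_dummies(data, reference, list_column, prefix):
    """Return one indicator column per distinct list element, indexed by `reference`.

    Hypothetical helper for illustration; rows with empty lists yield all-zero indicators.
    """
    exploded = data.set_index(reference)[list_column].explode()
    dummies = pd.get_dummies(exploded, prefix=prefix, prefix_sep="_", dtype=int)
    # Collapse the one-row-per-element frame back to one row per reference value.
    return dummies.groupby(level=0).sum()

toy = pd.DataFrame({
    "imdb_id": ["tt1", "tt2", "tt3"],
    "list_genres": [["Drama", "Romance"], ["Drama"], []],
})
genre_dummies = list_dummies(toy, "imdb_id", "list_genres", prefix="genres")
print(genre_dummies)
```

The result could then be joined back with `toy.merge(genre_dummies, left_on="imdb_id", right_index=True)`.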

In [12]:
data = build_lists(data=data, column='genres', function=transform_dtl)
data.drop(columns='genres', inplace=True)
data.head()

Unnamed: 0,imdb_id,adult,belongs_to_collection,budget,original_language,popularity,production_companies,production_countries,release_date,revenue,runtime,status,title,video,vote_average,vote_count,certification,list_genres
1,tt0113026,0.0,,10000000.0,en,3.34,"[{'id': 51207, 'logo_path': None, 'name': 'Sul...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-09-22,0.0,86.0,Released,The Fantasticks,0.0,5.5,22.0,,"[Comedy, Music, Romance]"
2,tt0113092,0.0,,0.0,en,2.02,"[{'id': 7405, 'logo_path': '/rfnws0uY8rsNAsrLb...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-11-15,0.0,100.0,Released,For the Cause,0.0,5.1,8.0,,[Science Fiction]
3,tt0116391,0.0,,0.0,hi,0.6,[],"[{'iso_3166_1': 'IN', 'name': 'India'}]",2000-04-14,0.0,152.0,Released,Gang,0.0,4.0,1.0,,"[Drama, Action, Crime]"
4,tt0118694,0.0,,150000.0,cn,28.73,"[{'id': 539, 'logo_path': '/iPLtePguIzOPNtAWfT...","[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",2000-09-29,12854953.0,99.0,Released,In the Mood for Love,0.0,8.11,2136.0,PG,"[Drama, Romance]"
5,tt0118852,0.0,,0.0,en,4.52,"[{'id': 67930, 'logo_path': None, 'name': 'Cha...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-09-02,0.0,99.0,Released,Chinese Coffee,0.0,6.8,49.0,R,[Drama]


#### Production Companies

In [13]:
test_prod = data.loc[365, 'production_companies']
print('Outer: ', type(test_prod))
print('Inner: ', type(test_prod[0]))
print(test_prod)

Outer:  <class 'list'>
Inner:  <class 'dict'>
[{'id': 7069, 'logo_path': None, 'name': 'Daly-Harris Productions', 'origin_country': ''}, {'id': 7070, 'logo_path': None, 'name': 'Davis Entertainment Classics', 'origin_country': ''}, {'id': 7071, 'logo_path': None, 'name': 'Sordid Lives LLC', 'origin_country': ''}]


In [14]:
[i['name'] for i in [k for k in test_prod]]

['Daly-Harris Productions', 'Davis Entertainment Classics', 'Sordid Lives LLC']

In [15]:
data = build_lists(data=data, column='production_companies', function=transform_prodc)
data.drop(columns='production_companies', inplace=True)
data.head()

Unnamed: 0,imdb_id,adult,belongs_to_collection,budget,original_language,popularity,production_countries,release_date,revenue,runtime,status,title,video,vote_average,vote_count,certification,list_genres,list_production_companies
1,tt0113026,0.0,,10000000.0,en,3.34,"[{'iso_3166_1': 'US', 'name': 'United States o...",2000-09-22,0.0,86.0,Released,The Fantasticks,0.0,5.5,22.0,,"[Comedy, Music, Romance]","[Sullivan Street Productions, Michael Ritchie ..."
2,tt0113092,0.0,,0.0,en,2.02,"[{'iso_3166_1': 'US', 'name': 'United States o...",2000-11-15,0.0,100.0,Released,For the Cause,0.0,5.1,8.0,,[Science Fiction],"[Dimension Films, Grand Design Entertainment, ..."
3,tt0116391,0.0,,0.0,hi,0.6,"[{'iso_3166_1': 'IN', 'name': 'India'}]",2000-04-14,0.0,152.0,Released,Gang,0.0,4.0,1.0,,"[Drama, Action, Crime]",[]
4,tt0118694,0.0,,150000.0,cn,28.73,"[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",2000-09-29,12854953.0,99.0,Released,In the Mood for Love,0.0,8.11,2136.0,PG,"[Drama, Romance]","[Block 2 Pictures, Orly Films, Jet Tone Films,..."
5,tt0118852,0.0,,0.0,en,4.52,"[{'iso_3166_1': 'US', 'name': 'United States o...",2000-09-02,0.0,99.0,Released,Chinese Coffee,0.0,6.8,49.0,R,[Drama],"[Chal Productions, Shooting Gallery]"


#### Production Countries

In [16]:
test_prod = data.loc[25, 'production_countries']
print('Outer: ', type(test_prod))
print('Inner: ', type(test_prod[0]))
print(test_prod)

Outer:  <class 'list'>
Inner:  <class 'dict'>
[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]


In [17]:
[i['name'] for i in [k for k in test_prod]]

['Canada', 'United States of America']

In [18]:
data = build_lists(data=data, column='production_countries', function=transform_prodc)
data.drop(columns='production_countries', inplace=True)
data.head()

Unnamed: 0,imdb_id,adult,belongs_to_collection,budget,original_language,popularity,release_date,revenue,runtime,status,title,video,vote_average,vote_count,certification,list_genres,list_production_companies,list_production_countries
1,tt0113026,0.0,,10000000.0,en,3.34,2000-09-22,0.0,86.0,Released,The Fantasticks,0.0,5.5,22.0,,"[Comedy, Music, Romance]","[Sullivan Street Productions, Michael Ritchie ...",[United States of America]
2,tt0113092,0.0,,0.0,en,2.02,2000-11-15,0.0,100.0,Released,For the Cause,0.0,5.1,8.0,,[Science Fiction],"[Dimension Films, Grand Design Entertainment, ...",[United States of America]
3,tt0116391,0.0,,0.0,hi,0.6,2000-04-14,0.0,152.0,Released,Gang,0.0,4.0,1.0,,"[Drama, Action, Crime]",[],[India]
4,tt0118694,0.0,,150000.0,cn,28.73,2000-09-29,12854953.0,99.0,Released,In the Mood for Love,0.0,8.11,2136.0,PG,"[Drama, Romance]","[Block 2 Pictures, Orly Films, Jet Tone Films,...","[France, Hong Kong, Netherlands]"
5,tt0118852,0.0,,0.0,en,4.52,2000-09-02,0.0,99.0,Released,Chinese Coffee,0.0,6.8,49.0,R,[Drama],"[Chal Productions, Shooting Gallery]",[United States of America]


### Export Data

In [19]:
data.to_csv(f'data/mod/{YEAR}.csv.gz', compression='gzip', index=False)
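
One caveat with this export, sketched below on a made-up frame: CSV has no list type, so the `list_*` columns are written as their string representation and come back as plain strings. Assuming the file is read back with pandas, `ast.literal_eval` is one way to restore the lists; the temporary path is only for illustration.

```python
import ast
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "imdb_id": ["tt0118694", "tt0118852"],
    "list_genres": [["Drama", "Romance"], ["Drama"]],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "2000.csv.gz")
    toy.to_csv(path, compression="gzip", index=False)
    # Without a converter, list_genres would come back as a string such as "['Drama']".
    restored = pd.read_csv(path, converters={"list_genres": ast.literal_eval})

print(restored.loc[0, "list_genres"])
```

Cells that were missing at export time are written as empty strings, which `ast.literal_eval` rejects, so a real converter would need to guard against them.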

## Exploratory Data Analysis

### Return movies with budget or revenue greater than 0

In [20]:
filter = (data['budget'] > 0) | (data['revenue'] > 0)

print(f'Number of Instances: {len(data[filter])}')

Number of Instances: 293


There are 293 instances that have a budget or revenue greater than 0 in the year 2000. Let's save them.

In [21]:
data_budget = data[filter].copy()

### Movies per certification categories (G/PG/PG-13/R)

In [22]:
data_budget[['certification', 'imdb_id']].groupby(by='certification').count().sort_values(by='imdb_id',ascending=False)

| certification | imdb_id |
|---|---|
| R | 105 |
| PG-13 | 61 |
|  | 51 |
| PG | 17 |
| NR | 10 |
| G | 8 |


R-rated movies are by far the most common in the year 2000 subset: 105 titles, against 61 for PG-13.
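
Note that the counts above add up to 252, not 293. A likely explanation, offered here as an assumption rather than something verified against the data, is that `groupby` skips rows whose certification is missing, while the unlabeled row with 51 titles holds empty strings. The sketch below, on made-up values, shows how both cases could be made visible in one table.

```python
import numpy as np
import pandas as pd

# Made-up certifications: one empty string and one missing value.
toy = pd.DataFrame({
    "imdb_id": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "certification": ["R", "R", "", np.nan, "PG-13"],
})

# Treat empty strings as missing, then count every category including the missing one.
certification = toy["certification"].replace("", np.nan)
counts = certification.value_counts(dropna=False)
print(counts)
```

With `dropna=False`, the missing certifications appear as their own `NaN` row instead of silently disappearing from the total.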

### Average revenue per certification category

In [23]:
data_budget[['certification', 'revenue']].groupby(by='certification').mean().sort_values(by='revenue',ascending=False)

| certification | revenue |
|---|---|
| G | 105343104.38 |
| PG-13 | 102032591.39 |
| PG | 81166934.0 |
| R | 31639905.5 |
|  | 13370304.8 |
| NR | 10169454.0 |


### Average budget per certification category

In [24]:
data_budget[['certification', 'budget']].groupby(by='certification').mean().sort_values(by='budget',ascending=False)


| certification | budget |
|---|---|
| PG | 48141176.71 |
| PG-13 | 45668032.79 |
| G | 45000000.0 |
| R | 19025095.24 |
| NR | 9034009.4 |
|  | 5091799.29 |
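
Because the subset keeps titles with budget *or* revenue above 0, some rows still carry a 0 in one of the two columns, and those zeros pull the means down. As a hedged sketch on made-up numbers, and assuming that 0 means "not reported" rather than a true zero, the zeros can be masked before aggregating, with the median and the count shown next to the mean.

```python
import pandas as pd

# Made-up figures; 0 stands for a value that was not reported.
toy = pd.DataFrame({
    "certification": ["R", "R", "R", "PG", "PG"],
    "budget": [10_000_000, 0, 30_000_000, 50_000_000, 40_000_000],
    "revenue": [0, 5_000_000, 15_000_000, 0, 120_000_000],
})

# Mask zeros so that mean, median and count only use reported values.
money = toy[["budget", "revenue"]]
money = money.mask(money == 0)

summary = (
    money.groupby(toy["certification"])
    .agg(["mean", "median", "count"])
    .sort_values(by=("revenue", "mean"), ascending=False)
)
print(summary)
```

The `count` columns show how many reported values each average rests on, which matters for small categories such as G or NR.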
