# Movies: Exploratory Data Analysis

by Israel Diaz

## Data Description

The data were downloaded from the [IMDb datasets page](https://datasets.imdbws.com/).

**IMDb Dataset Details**

Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name. The available datasets are as follows:

**title.akas.tsv.gz** - Contains the following information for titles:

* titleId (string) - a tconst, an alphanumeric unique identifier of the title
* ordering (integer) – a number to uniquely identify rows for a given titleId
* title (string) – the localized title
* region (string) - the region for this version of the title
* language (string) - the language of the title
* types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
* attributes (array) - Additional terms to describe this alternative title, not enumerated
* isOriginalTitle (boolean) – 0: not original title; 1: original title

**title.basics.tsv.gz** - Contains the following information for titles:

* tconst (string) - alphanumeric unique identifier of the title
* titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
* primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
* originalTitle (string) - original title, in the original language
* isAdult (boolean) - 0: non-adult title; 1: adult title
* startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
* endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
* runtimeMinutes – primary runtime of the title, in minutes
* genres (string array) – includes up to three genres associated with the title

**title.ratings.tsv.gz** - Contains the IMDb rating and votes information for titles:

* tconst (string) - alphanumeric unique identifier of the title
* averageRating – weighted average of all the individual user ratings
* numVotes - number of votes the title has received
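
As a minimal sketch of how files in this format could be read with pandas, the example below writes a tiny gzipped TSV to a temporary folder and loads it back, mapping `\N` to missing values. The sample rows, the temporary file name, and the `genres_list` column are made up for illustration; only the layout follows the description above.

```python
import gzip
import os
import tempfile

import pandas as pd

# Toy stand-in for title.basics.tsv.gz (made-up rows, same layout as described above).
sample = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tmovie\tExample One\t2000\t\\N\t90\tDrama,Romance\n"
    "tt0000002\tmovie\tExample Two\t2000\t\\N\t\\N\t\\N\n"
)

# Write the sample as a gzipped, UTF-8 encoded TSV file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "title.basics.sample.tsv.gz")
with gzip.open(path, "wt", encoding="utf-8") as fh:
    fh.write(sample)

# '\N' marks a missing field, so map it to NaN at read time.
basics = pd.read_csv(path, sep="\t", na_values="\\N", compression="gzip", encoding="utf-8")

# genres is a comma-separated string; split it into one list per title.
basics["genres_list"] = basics["genres"].str.split(",")

print(basics[["tconst", "runtimeMinutes", "genres_list"]])
```

Converting `\N` while reading means later steps such as `isna()` see the gaps directly instead of treating `\N` as a regular string.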

## Loading Data

### Import Libraries

In [2]:
## General
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter('ignore')


### Load Data

For now, I will explore only the data for the year 2000.

In [3]:
FOLDER = 'data/'

In [11]:
data = pd.read_json(path_or_buf=FOLDER + 'tmdb_api_results_2000.json')

In [12]:
data.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,0.0,86.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.5,22.0,
2,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,0.0,5.1,8.0,
3,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,0.0,152.0,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,0.0,4.0,1.0,
4,tt0118694,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,843.0,cn,花樣年華,...,12854953.0,99.0,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,0.0,8.113,2136.0,PG


In [13]:
print(f'Number of instances: {len(data)}')

Number of instances: 1214


## Data Cleaning

### Duplicated

In [14]:
data['imdb_id'].duplicated().sum()

0

### Missing Values

In [15]:
data.isna().sum()

imdb_id                     0
adult                       1
backdrop_path             554
belongs_to_collection    1102
budget                      1
genres                      1
homepage                    1
id                          1
original_language           1
original_title              1
overview                    1
popularity                  1
poster_path               124
production_companies        1
production_countries        1
release_date                1
revenue                     1
runtime                     1
spoken_languages            1
status                      1
tagline                     1
title                       1
video                       1
vote_average                1
vote_count                  1
certification             424
dtype: int64

I will drop the following data:
* Columns: backdrop_path, poster_path, overview

In [16]:
data.drop(columns=['backdrop_path','poster_path', 'overview'], inplace=True)
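
Every column except `imdb_id` reports at least one missing value, and the first row of `data.head()` is empty apart from an `imdb_id` of 0, which suggests a single placeholder row. The sketch below shows one way such a row could be detected and removed. It runs on a toy frame (the name `toy` and its reduced set of columns are assumptions for illustration), not on the real data.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the pattern above: row 0 has an id of 0 and nothing else.
toy = pd.DataFrame({
    "imdb_id": [0, "tt0113026", "tt0118694"],
    "budget": [np.nan, 10000000.0, 150000.0],
    "title": [np.nan, "The Fantasticks", "In the Mood for Love"],
    "certification": [np.nan, np.nan, "PG"],
})

# A row counts as a placeholder if every column other than imdb_id is missing.
is_placeholder = toy.drop(columns="imdb_id").isna().all(axis=1)

# Keep only the real rows and renumber the index.
cleaned = toy.loc[~is_placeholder].reset_index(drop=True)

print(f"Placeholder rows: {is_placeholder.sum()}")
print(cleaned.isna().sum())
```

After removing the placeholder, the remaining missing counts reflect real gaps (here, one missing `certification`) rather than the empty row.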

### Column Transformations

#### genres

In [47]:
test_genre = data.loc[125, 'genres']
print('Outer: ', type(test_genre))
print('Inner: ', type(test_genre[0]))
print(test_genre)

Outer:  <class 'list'>
Inner:  <class 'dict'>
[{'id': 28, 'name': 'Action'}, {'id': 878, 'name': 'Science Fiction'}]


In [56]:
## Define Function
def transform_dtl(list_of_dictionaries):
    ''' Returns the list of genre names from a list of genre dictionaries'''
    # Look up 'name' by key rather than by position, so key order does not matter
    return [d['name'] for d in list_of_dictionaries]

In [58]:
data['list_genres'] = data['genres'].map(transform_dtl, na_action='ignore')
data.head()

Unnamed: 0,imdb_id,adult,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,popularity,...,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification,list_genres
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0113026,0.0,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,3.342,...,86.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.5,22.0,,"[Comedy, Music, Romance]"
2,tt0113092,0.0,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,2.017,...,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,0.0,5.1,8.0,,[Science Fiction]
3,tt0116391,0.0,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,0.6,...,152.0,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,0.0,4.0,1.0,,"[Drama, Action, Crime]"
4,tt0118694,0.0,,150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,843.0,cn,花樣年華,28.733,...,99.0,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,0.0,8.113,2136.0,PG,"[Drama, Romance]"


In [59]:
data.drop(columns='genres', inplace=True)
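
Once genres are stored as plain lists, they can be counted with `explode`, which produces one row per list element. The following is a minimal sketch on a toy frame built from three of the titles shown above; the names `toy` and `genre_counts` are assumptions for illustration.

```python
import pandas as pd

# Toy frame with a list-valued column, as produced by the transformation above.
toy = pd.DataFrame({
    "title": ["The Fantasticks", "For the Cause", "In the Mood for Love"],
    "list_genres": [
        ["Comedy", "Music", "Romance"],
        ["Science Fiction"],
        ["Drama", "Romance"],
    ],
})

# explode() emits one row per list element, repeating the other columns.
exploded = toy.explode("list_genres")

# Count how many titles carry each genre.
genre_counts = exploded["list_genres"].value_counts()
print(genre_counts)
```

A title with several genres contributes to each of them, so the counts sum to the number of title-genre pairs, not the number of titles.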

#### Production_companies

In [62]:
test_prod = data.loc[365, 'production_companies']
print('Outer: ', type(test_prod))
print('Inner: ', type(test_prod[0]))
print(test_prod)

Outer:  <class 'list'>
Inner:  <class 'dict'>
[{'id': 7069, 'logo_path': None, 'name': 'Daly-Harris Productions', 'origin_country': ''}, {'id': 7070, 'logo_path': None, 'name': 'Davis Entertainment Classics', 'origin_country': ''}, {'id': 7071, 'logo_path': None, 'name': 'Sordid Lives LLC', 'origin_country': ''}]


In [70]:
[i['name'] for i in test_prod]

['Daly-Harris Productions', 'Davis Entertainment Classics', 'Sordid Lives LLC']

In [71]:
## Define Function
def transform_prodc(list_of_prod_companies):
    ''' Returns the list of production company names from a list of dictionaries'''
    return [d['name'] for d in list_of_prod_companies]

In [72]:
data['list_production_companies'] = data['production_companies'].map(transform_prodc, na_action='ignore')
data.drop(columns='production_companies', inplace=True)
data.head()

Unnamed: 0,imdb_id,adult,belongs_to_collection,budget,homepage,id,original_language,original_title,popularity,production_countries,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification,list_genres,list_production_companies
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0113026,0.0,,10000000.0,,62127.0,en,The Fantasticks,3.342,"[{'iso_3166_1': 'US', 'name': 'United States o...",...,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.5,22.0,,"[Comedy, Music, Romance]","[Sullivan Street Productions, Michael Ritchie ..."
2,tt0113092,0.0,,0.0,,110977.0,en,For the Cause,2.017,"[{'iso_3166_1': 'US', 'name': 'United States o...",...,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,0.0,5.1,8.0,,[Science Fiction],"[Dimension Films, Grand Design Entertainment, ..."
3,tt0116391,0.0,,0.0,,442869.0,hi,Gang,0.6,"[{'iso_3166_1': 'IN', 'name': 'India'}]",...,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,0.0,4.0,1.0,,"[Drama, Action, Crime]",[]
4,tt0118694,0.0,,150000.0,,843.0,cn,花樣年華,28.733,"[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",...,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,0.0,8.113,2136.0,PG,"[Drama, Romance]","[Block 2 Pictures, Orly Films, Jet Tone Films,..."
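
The `production_countries` and `spoken_languages` columns have the same list-of-dictionaries shape, so a single helper could cover them as well. The sketch below is one possible approach, not part of the notebook: `extract_names`, its `key` parameter, and the output column names are hypothetical, and the toy records contain only the keys visible in the output above.

```python
import pandas as pd

def extract_names(records, key="name"):
    """Return the `key` value of each dict in `records`; [] if `records` is not a list."""
    if not isinstance(records, list):
        return []
    return [r[key] for r in records if isinstance(r, dict) and key in r]

# Toy frame: one populated row and one row with missing values.
toy = pd.DataFrame({
    "production_countries": [[{"iso_3166_1": "IN", "name": "India"}], float("nan")],
    "spoken_languages": [[{"english_name": "Hindi", "iso_639_1": "hi"}], float("nan")],
})

# Countries use the default 'name' key; languages use 'english_name'.
toy["list_production_countries"] = toy["production_countries"].map(extract_names)
toy["list_spoken_languages"] = toy["spoken_languages"].map(
    lambda recs: extract_names(recs, key="english_name")
)

print(toy[["list_production_countries", "list_spoken_languages"]])
```

Unlike `map(..., na_action='ignore')`, which leaves missing values as NaN, this helper returns an empty list for them. That keeps the column uniformly list-typed but hides the difference between "no entries" and "not recorded", so the choice depends on the analysis.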


# WORK IN PROGRESS
