# Movies: Exploratory Data Analysis

by Israel Diaz

## Data Description

The data was downloaded from the [IMDb datasets page](https://datasets.imdbws.com/).

**IMDb Dataset Details**

Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name. The available datasets are as follows:

**title.akas.tsv.gz** - Contains the following information for titles:

* titleId (string) - a tconst, an alphanumeric unique identifier of the title
* ordering (integer) – a number to uniquely identify rows for a given titleId
* title (string) – the localized title
* region (string) - the region for this version of the title
* language (string) - the language of the title
* types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
* attributes (array) - Additional terms to describe this alternative title, not enumerated
* isOriginalTitle (boolean) – 0: not original title; 1: original title

**title.basics.tsv.gz** - Contains the following information for titles:

* tconst (string) - alphanumeric unique identifier of the title
* titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
* primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
* originalTitle (string) - original title, in the original language
* isAdult (boolean) - 0: non-adult title; 1: adult title
* startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
* endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
* runtimeMinutes – primary runtime of the title, in minutes
* genres (string array) – includes up to three genres associated with the title

**title.ratings.tsv.gz** – Contains the IMDb rating and votes information for titles

* tconst (string) - alphanumeric unique identifier of the title
* averageRating – weighted average of all the individual user ratings
* numVotes - number of votes the title has received
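
The file format described above (gzipped, tab-separated, UTF-8, with `\N` marking missing values) can be read directly with pandas. The following is a minimal sketch, not the loading code used in this notebook: the helper name `read_imdb_tsv` and the file path in the usage comment are assumptions made for illustration.

```python
import csv
import pandas as pd

def read_imdb_tsv(path):
    """Read one gzipped IMDb TSV file, treating the literal \\N marker as missing."""
    return pd.read_csv(
        path,
        sep='\t',                # files are tab-separated
        na_values='\\N',         # IMDb writes \N for missing or null fields
        compression='gzip',      # files are distributed as .tsv.gz
        encoding='utf-8',
        quoting=csv.QUOTE_NONE,  # assumption: quote characters in titles are plain text
        low_memory=False,
    )

# Hypothetical usage, assuming the file was saved under data/:
# ratings = read_imdb_tsv('data/title.ratings.tsv.gz')
```

Passing `na_values='\\N'` lets pandas convert the missing marker to `NaN` at load time, so columns such as `runtimeMinutes` or `endYear` do not need a separate cleanup step before being cast to numbers.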

## Loading Data

### Import Libraries

In [1]:
## General
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os, glob, json
import warnings
warnings.simplefilter('ignore')


## suppress scientific notation
pd.options.display.float_format = '{:20,.2f}'.format


### Load Data

For now, I will explore data from the years 2000 and 2001; I will add more years as the project goes forward.

In [2]:
FOLDER = 'data/'
YEARS = [2000,2001]

In [3]:
# Read each year's results, then combine them in a single concat call
frames = []
for YEAR in YEARS:
    FILE_NAME = f'tmdb_api_results_{YEAR}.json'
    frames.append(pd.read_json(path_or_buf=FOLDER+FILE_NAME))
data = pd.concat(frames, ignore_index=True, sort=False)

In [4]:
display(data.head())

|  | imdb_id | adult | backdrop_path | belongs_to_collection | budget | genres | homepage | id | original_language | original_title | ... | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count | certification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 |  |  |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | tt0113026 | 0.0 | /vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg |  | 10000000.0 | [{'id': 35, 'name': 'Comedy'}, {'id': 10402, '... |  | 62127.0 | en | The Fantasticks | ... | 0.0 | 86.0 | [{'english_name': 'English', 'iso_639_1': 'en'... | Released | Try to remember the first time magic happened | The Fantasticks | 0.0 | 5.5 | 22.0 |  |
| 2 | tt0113092 | 0.0 |  |  | 0.0 | [{'id': 878, 'name': 'Science Fiction'}] |  | 110977.0 | en | For the Cause | ... | 0.0 | 100.0 | [{'english_name': 'English', 'iso_639_1': 'en'... | Released | The ultimate showdown on a forbidden planet. | For the Cause | 0.0 | 5.1 | 8.0 |  |
| 3 | tt0116391 | 0.0 |  |  | 0.0 | [{'id': 18, 'name': 'Drama'}, {'id': 28, 'name... |  | 442869.0 | hi | Gang | ... | 0.0 | 152.0 | [{'english_name': 'Hindi', 'iso_639_1': 'hi', ... | Released |  | Gang | 0.0 | 4.0 | 1.0 |  |
| 4 | tt0118694 | 0.0 | /n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg |  | 150000.0 | [{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n... |  | 843.0 | cn | 花樣年華 | ... | 12854953.0 | 99.0 | [{'english_name': 'Cantonese', 'iso_639_1': 'c... | Released | Feel the heat, keep the feeling burning, let t... | In the Mood for Love | 0.0 | 8.11 | 2138.0 | PG |


In [5]:
print(f'Number of instances: {len(data)}')

Number of instances: 2520


### Export Data

In [6]:
data.to_csv(f'{FOLDER}tmdb_results_combined.csv.gz', compression='gzip', index=False)

## Exploratory Data Analysis

### Return movies with budget or revenue greater than 0

In [7]:
filter = (data['budget'] > 0) | (data['revenue'] > 0)

print(f'Number of Instances: {len(data[filter])}')

Number of Instances: 629


There are 629 instances with a budget or revenue greater than 0 in the years 2000 and 2001. Let's save them.

In [8]:
data_budget = data[filter].copy()

### Movies per certification category (G/PG/PG-13/R)

In [9]:
data_budget[['certification', 'imdb_id']].groupby(by='certification').count().sort_values(by='imdb_id',ascending=False)

| certification | imdb_id |
|---|---|
| R | 229 |
| PG-13 | 130 |
|  | 127 |
| PG | 36 |
| NR | 17 |
| G | 14 |


R-rated movies are by far the most common among the 2000 and 2001 releases with a reported budget or revenue (229, versus 130 for PG-13).
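
The counts above include a group with a blank certification, which is easy to overlook in the output. As a hedged sketch, blank and missing certifications can be folded into one explicit label before counting. The toy DataFrame and the `'Missing'` label below are hypothetical stand-ins, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for data_budget
toy = pd.DataFrame({
    'imdb_id': ['tt1', 'tt2', 'tt3', 'tt4', 'tt5'],
    'certification': ['R', 'R', 'PG-13', '', None],
})

# Fold empty strings and missing values into one explicit label
# ('Missing' is an illustrative choice, not a TMDB value)
cert = toy['certification'].fillna('').replace('', 'Missing')

# value_counts sorts by frequency, like the groupby/count/sort_values chain
counts = cert.value_counts()
print(counts)
```

Labelling the group explicitly makes it clear how many titles have no certification at all, as opposed to titles that are explicitly rated `NR`.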

### Average revenue per certification category

In [10]:
data_budget[['certification', 'revenue']].groupby(by='certification').mean().sort_values(by='revenue',ascending=False)

| certification | revenue |
|---|---|
| G | 123746274.93 |
| PG | 109533845.75 |
| PG-13 | 98963541.18 |
| R | 33135234.21 |
|  | 10813612.86 |
| NR | 9588674.35 |


### Average budget per certification category

In [11]:
data_budget[['certification', 'budget']].groupby(by='certification').mean().sort_values(by='budget',ascending=False)


| certification | budget |
|---|---|
| PG | 43819367.75 |
| PG-13 | 42844291.75 |
| G | 40857142.86 |
| R | 19698709.39 |
| NR | 6302358.47 |
|  | 5822588.02 |
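
The revenue and budget averages can also be computed in a single pass, together with the group size and a median, which helps when judging means from small groups such as G (14 titles). This is a minimal sketch on hypothetical toy data, and the output column names are illustrative. Because the earlier filter keeps rows where either budget or revenue is positive, zeros in the other column may still pull a mean down.

```python
import pandas as pd

# Hypothetical toy data standing in for data_budget
toy = pd.DataFrame({
    'certification': ['G', 'G', 'R', 'R', 'R'],
    'budget': [40.0, 60.0, 10.0, 20.0, 30.0],
    'revenue': [100.0, 200.0, 30.0, 30.0, 60.0],
})

# Named aggregation: one output column per (input column, function) pair
summary = (
    toy.groupby('certification')
       .agg(
           n_movies=('revenue', 'size'),
           mean_budget=('budget', 'mean'),
           mean_revenue=('revenue', 'mean'),
           median_revenue=('revenue', 'median'),
       )
       .sort_values('mean_revenue', ascending=False)
)
print(summary)
```

Showing `n_movies` next to each mean makes it easier to see which averages rest on only a handful of titles.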
