# Goals
- Global Goal: Study characteristics of successful movies in the US
- Local Goal: Combine data files from TMDB query into single dataset and perform exploratory analysis

# Data Description
- Results from the TMDB query are stored in individual compressed .csv files, one per movie year.
- The data folder contains several versions of these files. The relevant ones match the pattern ```'Data/final_tmdb_data_*.0.csv.gz'```

# Deliverables
- Single .csv file including all data from TMDB query as well as exploratory analysis results

# EDA Foci
- How many movies had at least some valid financial information (values > 0 for budget OR revenue)?
    - Exclude any movies with 0's for budget AND revenue from the remaining visualizations.
- How many movies are there in each of the certification categories (G/PG/PG-13/R)?
- What is the average revenue per certification category?
- What is the average budget per certification category?

# Import and Load

In [29]:
import os, glob
import pandas as pd
import numpy as np


# Concatenate TMDB results to database

In [5]:
FOLDER = "Data/"
os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

['tmdb_api_results_2017.0.json',
 'final_tmdb_data_2005.0.csv.gz',
 'final_tmdb_data_2004.0.csv.gz',
 'tmdb_api_results_2009.0.json',
 'tmdb_api_results_2005.0.json',
 'tmdb_api_results_2021.0.json',
 'tmdb_api_results_2019.0.json',
 'final_tmdb_data_2018.0.csv.gz',
 'final_tmdb_data_2019.0.csv.gz',
 'tmdb_api_results_2007.0.json',
 'tmdb_api_results_2015.0.json',
 'final_tmdb_data_2012.0.csv.gz',
 'final_tmdb_data_2013.0.csv.gz',
 'final_tmdb_data_2021.0.csv.gz',
 'final_tmdb_data_2020.0.csv.gz',
 'tmdb_api_results_2011.0.json',
 'tmdb_api_results_2003.0.json',
 'final_tmdb_data_2015.0.csv.gz',
 'final_tmdb_data_2014.0.csv.gz',
 'basics.csv.gz',
 'tmdb_api_results_2000.json',
 'final_tmdb_data_2000.csv.gz',
 'final_tmdb_data_2008.0.csv.gz',
 'final_tmdb_data_2009.0.csv.gz',
 'tmdb_api_results_2001.json',
 'ratings.csv.gz',
 'final_tmdb_data_2002.0.csv.gz',
 'final_tmdb_data_2003.0.csv.gz',
 'tmdb_api_results_2001.0.json',
 'tmdb_api_results_2013.0.json',
 'final_tmdb_data_2011.0.csv.g

In [12]:
csv_files = glob.glob('Data/final_tmdb_data_*.0.csv.gz')

In [13]:
csv_files

['Data/final_tmdb_data_2005.0.csv.gz',
 'Data/final_tmdb_data_2004.0.csv.gz',
 'Data/final_tmdb_data_2018.0.csv.gz',
 'Data/final_tmdb_data_2019.0.csv.gz',
 'Data/final_tmdb_data_2012.0.csv.gz',
 'Data/final_tmdb_data_2013.0.csv.gz',
 'Data/final_tmdb_data_2021.0.csv.gz',
 'Data/final_tmdb_data_2020.0.csv.gz',
 'Data/final_tmdb_data_2015.0.csv.gz',
 'Data/final_tmdb_data_2014.0.csv.gz',
 'Data/final_tmdb_data_2008.0.csv.gz',
 'Data/final_tmdb_data_2009.0.csv.gz',
 'Data/final_tmdb_data_2002.0.csv.gz',
 'Data/final_tmdb_data_2003.0.csv.gz',
 'Data/final_tmdb_data_2011.0.csv.gz',
 'Data/final_tmdb_data_2010.0.csv.gz',
 'Data/final_tmdb_data_2006.0.csv.gz',
 'Data/final_tmdb_data_2007.0.csv.gz',
 'Data/final_tmdb_data_2001.0.csv.gz',
 'Data/final_tmdb_data_2000.0.csv.gz',
 'Data/final_tmdb_data_2016.0.csv.gz',
 'Data/final_tmdb_data_2017.0.csv.gz',
 'Data/final_tmdb_data_2022.0.csv.gz']
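`glob.glob` returns paths in an arbitrary, filesystem-dependent order, and the pattern only matches names containing `.0.csv.gz`, so a file such as `final_tmdb_data_2000.csv.gz` from the directory listing is skipped. The sketch below illustrates that matching behaviour with `fnmatch`, which follows the same wildcard rules as `glob`. The `names` list is a hand-picked subset of the listing, and the sorting step is an optional addition rather than part of the original notebook:

```python
import fnmatch

# Hand-picked subset of the directory listing (illustrative only).
names = [
    'final_tmdb_data_2005.0.csv.gz',
    'final_tmdb_data_2004.0.csv.gz',
    'final_tmdb_data_2000.csv.gz',      # a different version without the ".0" suffix
    'tmdb_api_results_2005.0.json',
    'basics.csv.gz',
]

# Same wildcard pattern as the glob call, applied to bare file names.
pattern = 'final_tmdb_data_*.0.csv.gz'

# Sorting makes the order deterministic (chronological for this naming scheme).
matched = sorted(fnmatch.filter(names, pattern))
print(matched)
```

Wrapping the real call as `sorted(glob.glob(...))` would give the same deterministic ordering for the files on disk.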

In [19]:
# Read each file on its own to confirm that every one parses before concatenating
for f in csv_files:
    print(f)
    result = pd.read_csv(f, lineterminator='\n')

Data/final_tmdb_data_2005.0.csv.gz
Data/final_tmdb_data_2004.0.csv.gz
Data/final_tmdb_data_2018.0.csv.gz
Data/final_tmdb_data_2019.0.csv.gz
Data/final_tmdb_data_2012.0.csv.gz
Data/final_tmdb_data_2013.0.csv.gz
Data/final_tmdb_data_2021.0.csv.gz
Data/final_tmdb_data_2020.0.csv.gz
Data/final_tmdb_data_2015.0.csv.gz
Data/final_tmdb_data_2014.0.csv.gz
Data/final_tmdb_data_2008.0.csv.gz
Data/final_tmdb_data_2009.0.csv.gz
Data/final_tmdb_data_2002.0.csv.gz
Data/final_tmdb_data_2003.0.csv.gz
Data/final_tmdb_data_2011.0.csv.gz
Data/final_tmdb_data_2010.0.csv.gz
Data/final_tmdb_data_2006.0.csv.gz
Data/final_tmdb_data_2007.0.csv.gz
Data/final_tmdb_data_2001.0.csv.gz
Data/final_tmdb_data_2000.0.csv.gz
Data/final_tmdb_data_2016.0.csv.gz
Data/final_tmdb_data_2017.0.csv.gz
Data/final_tmdb_data_2022.0.csv.gz


In [20]:
tmdb_results_df = pd.concat((pd.read_csv(f, lineterminator='\n') for f in csv_files), ignore_index=True)

In [21]:
tmdb_results_df.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0088751,0.0,,,350000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 27, 'nam...",,29163.0,en,The Naked Monster,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,The Naked Monster,0.0,3.4,5.0,
2,tt0118141,0.0,/unoJZwLGTlzKc3QkvsERVPLRnFH.jpg,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 14, 'name...",http://www.crispinglover.com/whatisit.htm,54506.0,en,What Is It?,...,0.0,72.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The adventures of a young man whose principle ...,What Is It?,0.0,5.8,22.0,NC-17
3,tt0120667,0.0,/jkBEPKRq4HWlLwsMFMdDiYwaCle.jpg,"{'id': 9744, 'name': 'Fantastic Four Collectio...",100000000.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,9738.0,en,Fantastic Four,...,333535934.0,106.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,4 times the action. 4 times the adventure. 4 t...,Fantastic Four,0.0,5.769,8238.0,PG-13
4,tt0121164,0.0,/r4VumNLSafeGhlieKNhGv0BQ4UD.jpg,,40000000.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 14, ...",http://corpsebridemovie.warnerbros.com/,3933.0,en,Corpse Bride,...,118133252.0,77.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,There's been a grave misunderstanding.,Corpse Bride,0.0,7.488,7411.0,


In [22]:
tmdb_results_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60706 entries, 0 to 60705
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                60706 non-null  object 
 1   adult                  60683 non-null  float64
 2   backdrop_path          37940 non-null  object 
 3   belongs_to_collection  3908 non-null   object 
 4   budget                 60683 non-null  float64
 5   genres                 60683 non-null  object 
 6   homepage               14929 non-null  object 
 7   id                     60683 non-null  float64
 8   original_language      60683 non-null  object 
 9   original_title         60683 non-null  object 
 10  overview               59348 non-null  object 
 11  popularity             60683 non-null  float64
 12  poster_path            55252 non-null  object 
 13  production_companies   60683 non-null  object 
 14  production_countries   60683 non-null  object 
 15  re
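The first row of the preview has `imdb_id` equal to 0 and no other values, and `info()` reports 60706 rows against 60683 non-null values in columns such as `budget`. The difference of 23 matches the 23 files that were concatenated, which suggests one placeholder row per file. A minimal sketch of dropping such rows, assuming the placeholder is always marked by `imdb_id` equal to 0; `toy_df`, `is_placeholder`, and `cleaned_df` are names introduced here for illustration:

```python
import pandas as pd

# Toy stand-in for the concatenated frame: one placeholder row (imdb_id == 0)
# followed by two real records taken from the preview.
toy_df = pd.DataFrame({
    'imdb_id': [0, 'tt0088751', 'tt0118141'],
    'budget': [None, 350000.0, 0.0],
    'revenue': [None, 0.0, 0.0],
})

# Compare as strings so both the integer 0 and the string '0' are caught.
is_placeholder = toy_df['imdb_id'].astype(str) == '0'

# Keep only real records and renumber the index.
cleaned_df = toy_df[~is_placeholder].reset_index(drop=True)
print(len(toy_df), '->', len(cleaned_df))
```

The financial filter used later would exclude these rows anyway, because comparisons against missing values evaluate to `False`, but removing them explicitly keeps the saved file clean.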

# Exploratory Analysis

## Movies with budget info?
- budget > 0 or revenue > 0
- exclude those which have 0 for both

In [25]:
# .copy() makes budget_df an independent frame rather than a slice of tmdb_results_df
budget_df = tmdb_results_df[(tmdb_results_df['budget'] > 0) | (tmdb_results_df['revenue'] > 0)].copy()

In [33]:
# check that all budget = 0 entries have non-zero revenues
budget_df[budget_df['budget']==0]['revenue'].describe()

count    2.437000e+03
mean     9.272926e+06
std      3.125197e+07
min      1.000000e+00
25%      1.112320e+05
50%      1.113277e+06
75%      6.019720e+06
max      6.862576e+08
Name: revenue, dtype: float64

In [34]:
# and the reverse
budget_df[budget_df['revenue']==0]['budget'].describe()

count    6.092000e+03
mean     3.503121e+06
std      1.012984e+07
min      1.000000e+00
25%      2.500000e+04
50%      5.000000e+05
75%      3.000000e+06
max      2.000000e+08
Name: budget, dtype: float64
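The first EDA question asks how many movies have valid financial information. Because the filter is a boolean mask, that count is simply the number of `True` values in it. A small sketch on made-up rows (`toy_df` and its titles are hypothetical); on the real data the same expressions would be applied to `tmdb_results_df`:

```python
import pandas as pd

# Four hypothetical movies covering each combination of budget and revenue.
toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'budget': [0.0, 350000.0, 0.0, 100000000.0],
    'revenue': [0.0, 0.0, 4825184.0, 333535934.0],
})

# True where at least one of the two financial fields is positive.
has_financials = (toy_df['budget'] > 0) | (toy_df['revenue'] > 0)

n_valid = int(has_financials.sum())        # movies kept for the analysis
n_excluded = int((~has_financials).sum())  # movies with 0 for both fields
print(f'{n_valid} with financial info, {n_excluded} excluded')
```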

## Movies in each certification?

In [37]:
budget_df['certification'].unique()

array([nan, 'PG-13', 'PG', 'R', 'G', 'NC-17', 'NR', 'R ', 'PG-13 ',
       'Unrated'], dtype=object)

In [39]:
# Assign the result back; replace(inplace=True) on a .loc selection raises SettingWithCopyWarning
budget_df = budget_df.assign(certification=budget_df['certification'].replace({'R ': 'R'}))


In [40]:
budget_df['certification'].value_counts()

R          3242
PG-13      2030
NR          952
PG          793
G           160
NC-17        37
PG-13         1
Unrated       1
Name: certification, dtype: int64
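The counts still show a separate `PG-13` entry with a single movie, which corresponds to the `'PG-13 '` label with a trailing space seen in `unique()`. One way to normalise every label at once, rather than mapping each variant by hand, is `Series.str.strip()`. A sketch on a hypothetical sample; restricting the result to G/PG/PG-13/R follows the EDA question, and `main_certs` is a name introduced here:

```python
import pandas as pd

# Hypothetical sample containing the whitespace variants found in the data.
certs = pd.Series(['R', 'R ', 'PG-13', 'PG-13 ', 'PG', 'G', 'NR', None])

# Remove leading and trailing whitespace; missing values stay missing.
cleaned = certs.str.strip()

# Keep only the four categories named in the EDA question.
main_certs = ['G', 'PG', 'PG-13', 'R']
counts = cleaned[cleaned.isin(main_certs)].value_counts()
print(counts)
```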

## Average revenue per certification category

In [44]:
budget_df.groupby('certification').mean(numeric_only=True)

certification,adult,budget,id,popularity,revenue,runtime,video,vote_average,vote_count
G,0.0,32177720.0,189196.8375,31.260706,96051170.0,90.70625,0.0,6.124244,1376.325
NC-17,0.027027,3209207.0,253127.621622,14.582432,4905780.0,101.243243,0.0,5.630541,509.081081
NR,0.00105,2852528.0,357739.030462,7.311978,6383152.0,92.137605,0.011555,4.617087,198.192227
PG,0.0,42040110.0,184792.761665,40.360364,126162600.0,100.591425,0.001261,6.191763,1839.723834
PG-13,0.000493,40809500.0,168468.588177,38.308545,118392100.0,108.4867,0.002463,6.232267,2321.76601
PG-13,0.0,0.0,262958.0,6.042,4825184.0,106.0,0.0,6.724,156.0
R,0.000308,14704490.0,178258.082973,28.040854,31056500.0,103.816471,0.001234,5.921849,1192.002776
Unrated,0.0,260.0,407659.0,0.6,0.0,76.0,0.0,2.0,1.0
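To answer the two per-category questions directly, the grouped mean can be limited to the `budget` and `revenue` columns. A minimal sketch with made-up numbers (the values in `toy_df` are not taken from the dataset, and `avg_by_cert` is a name introduced here):

```python
import pandas as pd

# Made-up budgets and revenues for three certification categories.
toy_df = pd.DataFrame({
    'certification': ['G', 'G', 'PG', 'R', 'R'],
    'budget': [10.0, 30.0, 40.0, 5.0, 15.0],
    'revenue': [100.0, 60.0, 120.0, 20.0, 40.0],
})

# Mean budget and revenue per category, highest average revenue first.
avg_by_cert = (
    toy_df.groupby('certification')[['budget', 'revenue']]
    .mean()
    .sort_values('revenue', ascending=False)
)
print(avg_by_cert)
```

Because the financial filter keeps rows where only one of the two values is positive, means computed this way on the real data include zeros. Masking zeros before averaging is one option, depending on how missing financial values should be treated.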


## Average Budget per certification category

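See the `budget` column in the table above: among the main categories, PG (about 42.0 million) and PG-13 (about 40.8 million) have the highest average budgets, followed by G (about 32.2 million) and R (about 14.7 million).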

# Save combined data

In [45]:
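# index=False keeps the row index out of the file (avoids an extra 'Unnamed: 0' column on reload)
tmdb_results_df.to_csv('Data/tmdb_results_combined.csv.gz', index=False)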
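A quick way to confirm that a saved file reloads cleanly is a round trip through a temporary directory. This sketch uses a toy frame and a hypothetical file name. pandas infers gzip compression from the `.gz` extension, and `index=False` keeps the row index out of the file so no `Unnamed: 0` column appears on reload:

```python
import os
import tempfile

import pandas as pd

# Two records taken from the preview, reduced to three columns.
toy_df = pd.DataFrame({
    'imdb_id': ['tt0120667', 'tt0121164'],
    'budget': [100000000.0, 40000000.0],
    'revenue': [333535934.0, 118133252.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_combined.csv.gz')  # hypothetical file name
    toy_df.to_csv(path, index=False)   # compression inferred from ".gz"
    reloaded = pd.read_csv(path)       # read back before the directory is removed

print(reloaded.columns.tolist())
```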