### Explore selected data sets
- tn.movie_budgets.csv.gz
- tmdb.movies.csv.gz
- imdb.title.basics.csv.gz

### Rationale
- unique index: movie name/title
- create 1 joined data set
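
A minimal sketch of the join described above, assuming a normalized title is used as the key. The toy frames, the `add_title_key` helper, and the `title_key` column are hypothetical; only the column names `movie`, `title`, and `primary_title` come from the three files.

```python
import pandas as pd

# Hypothetical toy frames; only the title column names mirror the real files.
budget = pd.DataFrame({"movie": ["Alpha", "Beta"], "production_budget": [100, 200]})
rating = pd.DataFrame({"title": ["ALPHA", "Gamma"], "vote_average": [7.0, 6.0]})
basics = pd.DataFrame({"primary_title": ["alpha ", "Beta"], "runtime_minutes": [90.0, 120.0]})


def add_title_key(df, column):
    """Return a copy with a normalized (lowercase, stripped) join key."""
    out = df.copy()
    out["title_key"] = out[column].str.lower().str.strip()
    return out


# Inner joins keep only titles present in all three data sets.
joined = (
    add_title_key(budget, "movie")
    .merge(add_title_key(rating, "title"), on="title_key", how="inner")
    .merge(add_title_key(basics, "primary_title"), on="title_key", how="inner")
)
print(joined[["title_key", "production_budget", "vote_average", "runtime_minutes"]])
```

An inner join is only one option; a left join from the budget data would keep every film with financials, at the cost of missing values in the other columns.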

### Ideas
- budget vs gross
- create new column for profit (gross - budget)
- release month vs gross/profit
- genre vs vote average (need to investigate movies with multiple genres; will they skew the data?)
- genre vs popularity (need to investigate definition of popularity)
- runtime vs profit (genre?)
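
A rough sketch of two of the ideas above, taking profit as worldwide gross minus production budget (which gross figure to use is an assumption). The rows are hypothetical toy values, and `profit`, `release_month`, and `monthly_profit` are illustrative names.

```python
import pandas as pd

# Hypothetical toy rows; column names follow tn.movie_budgets.
df = pd.DataFrame({
    "release_date": pd.to_datetime(["2019-06-07", "2019-06-21", "2019-12-20"]),
    "production_budget": [100, 50, 80],
    "worldwide_gross": [150, 40, 200],
})

# Profit is what the film earned minus what it cost.
df["profit"] = df["worldwide_gross"] - df["production_budget"]

# Release month as an integer (1-12) for grouping.
df["release_month"] = df["release_date"].dt.month

# Average profit per release month.
monthly_profit = df.groupby("release_month")["profit"].mean()
print(monthly_profit)
```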

### Call-outs
- unique movie name/titles (will there be duplicates? reboots?)
- movie name/titles casing (use all lowercase?)
- does worldwide gross include domestic gross?
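
One way to probe the first two call-outs, sketched on hypothetical toy rows: normalize the casing, then flag every title that appears more than once. The `title_key` and `title_year_key` columns are illustrative names, and adding the release year is just one possible tie-breaker for reboots.

```python
import pandas as pd

# Hypothetical toy rows: two films share a title apart from casing.
df = pd.DataFrame({
    "movie": ["Sample Movie", "sample movie", "Other Movie"],
    "release_date": pd.to_datetime(["2001-03-01", "2015-07-10", "2019-06-07"]),
})

# Lowercase and strip so casing and stray spaces do not hide duplicates.
df["title_key"] = df["movie"].str.lower().str.strip()

# keep=False flags every member of a duplicate group, not just the repeats.
dupes = df[df["title_key"].duplicated(keep=False)]
print(dupes)

# Possible tie-breaker: append the release year to the key.
df["title_year_key"] = (
    df["title_key"] + " (" + df["release_date"].dt.year.astype(str) + ")"
)
print(df["title_year_key"].is_unique)
```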

# 1. Import libraries

In [11]:
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
%matplotlib inline

# 2. Load data

### budget_gross

In [15]:
budget_gross_df = pd.read_csv('zippedData/tn.movie_budgets.csv.gz', compression='gzip')

In [16]:
budget_gross_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [17]:
budget_gross_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


### rating

In [18]:
rating_df = pd.read_csv('zippedData/tmdb.movies.csv.gz', compression='gzip')

In [19]:
rating_df.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [20]:
rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB
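
Two things stand out in this output: an `Unnamed: 0` column that duplicates the index, and `genre_ids` stored as `object`, which after `read_csv` means text such as `"[12, 14, 10751]"` rather than a Python list. The sketch below shows one way to handle both on a hypothetical two-row CSV; it is not applied to `rating_df` in this notebook.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row CSV in the same shape as tmdb.movies.
csv_text = ''',genre_ids,id,title
0,"[12, 14, 10751]",1,Sample A
1,"[]",2,Sample B
'''

# index_col=0 reuses the unnamed first column as the index
# instead of creating an "Unnamed: 0" column.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# read_csv leaves the lists as text; literal_eval turns them into real lists.
df["genre_ids"] = df["genre_ids"].apply(ast.literal_eval)
print(df["genre_ids"].tolist())
```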


### runtime_genre

In [21]:
runtime_genre_df = pd.read_csv('zippedData/imdb.title.basics.csv.gz', compression='gzip')

In [22]:
runtime_genre_df.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [23]:
runtime_genre_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   tconst           146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB
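
Because `genres` holds comma-separated strings and has missing values, any per-genre analysis needs one row per genre. A minimal sketch on hypothetical toy rows (the `genre` column and `by_genre` frame are illustrative names):

```python
import pandas as pd

# Hypothetical toy rows shaped like imdb.title.basics.
df = pd.DataFrame({
    "primary_title": ["Sample A", "Sample B", "Sample C"],
    "runtime_minutes": [175.0, None, 80.0],
    "genres": ["Action,Crime,Drama", "Comedy,Drama", None],
})

# Drop rows with no genre, split the string, then give each genre its own row.
by_genre = (
    df.dropna(subset=["genres"])
      .assign(genre=lambda d: d["genres"].str.split(","))
      .explode("genre")
)

# A film with three genres now appears three times,
# which is the skew to watch for when averaging by genre.
print(by_genre["genre"].value_counts())
```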


# 3. Data Cleaning

### A. budget_gross

In [25]:
budget_gross_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


#### Action items
- update release_date format
- lowercase movie
- remove '$', ',' and convert from str to int
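
The action items above can be sketched as one pass over a small frame. The `currency_to_int` helper is a hypothetical name, not part of the notebook; the two rows are copied from the `head()` output above. The cells that follow perform the same steps one column at a time.

```python
import pandas as pd


def currency_to_int(series):
    """Strip '$' and ',' from strings such as '$425,000,000' and return int64."""
    cleaned = (
        series.str.replace("$", "", regex=False)  # literal '$', not a regex anchor
              .str.replace(",", "", regex=False)
    )
    # int64 is explicit because values above ~2.1 billion overflow a 32-bit int.
    return cleaned.astype("int64")


# Two rows copied from the head() output above.
df = pd.DataFrame({
    "release_date": ["Dec 18, 2009", "Jun 7, 2019"],
    "movie": ["Avatar", "Dark Phoenix"],
    "production_budget": ["$425,000,000", "$350,000,000"],
    "domestic_gross": ["$760,507,625", "$42,762,350"],
    "worldwide_gross": ["$2,776,345,279", "$149,762,350"],
})

df["release_date"] = pd.to_datetime(df["release_date"], format="%b %d, %Y")
df["movie"] = df["movie"].str.lower()
for col in ["production_budget", "domestic_gross", "worldwide_gross"]:
    df[col] = currency_to_int(df[col])

print(df.dtypes)
```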

In [27]:
# Update release_date datetime format
# An explicit format replaces infer_datetime_format, which is deprecated in pandas 2.x
budget_gross_df['release_date'] = pd.to_datetime(budget_gross_df['release_date'], format='%b %d, %Y')
budget_gross_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,2009-12-18,Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,2019-06-07,Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,2015-05-01,Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [29]:
# lowercase movie
budget_gross_df['movie'] = budget_gross_df['movie'].str.lower()
budget_gross_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,2009-12-18,avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,2011-05-20,pirates of the caribbean: on stranger tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,2019-06-07,dark phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,2015-05-01,avengers: age of ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,2017-12-15,star wars ep. viii: the last jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [32]:
# remove '$' from production_budget (regex=False so '$' is a literal character, not an end-of-string anchor)
budget_gross_df['production_budget'] = budget_gross_df['production_budget'].str.replace('$', '', regex=False)
budget_gross_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,2009-12-18,avatar,425000000,"$760,507,625","$2,776,345,279"
1,2,2011-05-20,pirates of the caribbean: on stranger tides,410600000,"$241,063,875","$1,045,663,875"
2,3,2019-06-07,dark phoenix,350000000,"$42,762,350","$149,762,350"
3,4,2015-05-01,avengers: age of ultron,330600000,"$459,005,868","$1,403,013,963"
4,5,2017-12-15,star wars ep. viii: the last jedi,317000000,"$620,181,382","$1,316,721,747"


In [33]:
# remove ',' from production_budget
budget_gross_df['production_budget'] = budget_gross_df['production_budget'].str.replace(',', '')
budget_gross_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,2009-12-18,avatar,425000000,"$760,507,625","$2,776,345,279"
1,2,2011-05-20,pirates of the caribbean: on stranger tides,410600000,"$241,063,875","$1,045,663,875"
2,3,2019-06-07,dark phoenix,350000000,"$42,762,350","$149,762,350"
3,4,2015-05-01,avengers: age of ultron,330600000,"$459,005,868","$1,403,013,963"
4,5,2017-12-15,star wars ep. viii: the last jedi,317000000,"$620,181,382","$1,316,721,747"


In [36]:
# convert production_budget from str to int
budget_gross_df['production_budget'] = budget_gross_df['production_budget'].astype(int)
budget_gross_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 5782 non-null   int64         
 1   release_date       5782 non-null   datetime64[ns]
 2   movie              5782 non-null   object        
 3   production_budget  5782 non-null   int64         
 4   domestic_gross     5782 non-null   object        
 5   worldwide_gross    5782 non-null   object        
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 271.2+ KB


In [37]:
# remove '$' from domestic_gross (regex=False so '$' is a literal character, not an end-of-string anchor)
budget_gross_df['domestic_gross'] = budget_gross_df['domestic_gross'].str.replace('$', '', regex=False)
budget_gross_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,2009-12-18,avatar,425000000,760507625,"$2,776,345,279"
1,2,2011-05-20,pirates of the caribbean: on stranger tides,410600000,241063875,"$1,045,663,875"
2,3,2019-06-07,dark phoenix,350000000,42762350,"$149,762,350"
3,4,2015-05-01,avengers: age of ultron,330600000,459005868,"$1,403,013,963"
4,5,2017-12-15,star wars ep. viii: the last jedi,317000000,620181382,"$1,316,721,747"


In [38]:
# remove ',' from domestic_gross
budget_gross_df['domestic_gross'] = budget_gross_df['domestic_gross'].str.replace(',', '')
budget_gross_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,2009-12-18,avatar,425000000,760507625,"$2,776,345,279"
1,2,2011-05-20,pirates of the caribbean: on stranger tides,410600000,241063875,"$1,045,663,875"
2,3,2019-06-07,dark phoenix,350000000,42762350,"$149,762,350"
3,4,2015-05-01,avengers: age of ultron,330600000,459005868,"$1,403,013,963"
4,5,2017-12-15,star wars ep. viii: the last jedi,317000000,620181382,"$1,316,721,747"


In [39]:
# convert domestic_gross from str to int
budget_gross_df['domestic_gross'] = budget_gross_df['domestic_gross'].astype(int)
budget_gross_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 5782 non-null   int64         
 1   release_date       5782 non-null   datetime64[ns]
 2   movie              5782 non-null   object        
 3   production_budget  5782 non-null   int64         
 4   domestic_gross     5782 non-null   int64         
 5   worldwide_gross    5782 non-null   object        
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 271.2+ KB


In [40]:
# remove '$' from worldwide_gross (regex=False so '$' is a literal character, not an end-of-string anchor)
budget_gross_df['worldwide_gross'] = budget_gross_df['worldwide_gross'].str.replace('$', '', regex=False)
budget_gross_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,2009-12-18,avatar,425000000,760507625,2776345279
1,2,2011-05-20,pirates of the caribbean: on stranger tides,410600000,241063875,1045663875
2,3,2019-06-07,dark phoenix,350000000,42762350,149762350
3,4,2015-05-01,avengers: age of ultron,330600000,459005868,1403013963
4,5,2017-12-15,star wars ep. viii: the last jedi,317000000,620181382,1316721747


In [41]:
# remove ',' from worldwide_gross
budget_gross_df['worldwide_gross'] = budget_gross_df['worldwide_gross'].str.replace(',', '')
budget_gross_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,2009-12-18,avatar,425000000,760507625,2776345279
1,2,2011-05-20,pirates of the caribbean: on stranger tides,410600000,241063875,1045663875
2,3,2019-06-07,dark phoenix,350000000,42762350,149762350
3,4,2015-05-01,avengers: age of ultron,330600000,459005868,1403013963
4,5,2017-12-15,star wars ep. viii: the last jedi,317000000,620181382,1316721747


In [42]:
# convert worldwide_gross from str to int (int64 stated explicitly: values above ~2.1 billion overflow a 32-bit int)
budget_gross_df['worldwide_gross'] = budget_gross_df['worldwide_gross'].astype('int64')
budget_gross_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 5782 non-null   int64         
 1   release_date       5782 non-null   datetime64[ns]
 2   movie              5782 non-null   object        
 3   production_budget  5782 non-null   int64         
 4   domestic_gross     5782 non-null   int64         
 5   worldwide_gross    5782 non-null   int64         
dtypes: datetime64[ns](1), int64(4), object(1)
memory usage: 271.2+ KB
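
With all three money columns numeric, the earlier question of whether worldwide gross includes domestic gross can be checked directly. A small sketch: the first two rows are copied from the cleaned output above, the third is a hypothetical bad row added to show what a violation looks like, and `foreign_gross` is an illustrative column name.

```python
import pandas as pd

# Two rows from the cleaned output above, plus one hypothetical bad row.
df = pd.DataFrame({
    "movie": ["avatar", "dark phoenix", "hypothetical bad row"],
    "domestic_gross": [760507625, 42762350, 500],
    "worldwide_gross": [2776345279, 149762350, 100],
})

# If worldwide gross includes domestic gross, it can never be smaller.
violations = df[df["worldwide_gross"] < df["domestic_gross"]]
print(len(violations), "rows where worldwide < domestic")

# Foreign gross is only meaningful for rows that pass the check.
df["foreign_gross"] = df["worldwide_gross"] - df["domestic_gross"]
print(df[["movie", "foreign_gross"]])
```

Zero violations on the full data would be consistent with worldwide gross including domestic gross, though it would not prove it.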


### B. rating

In [46]:
rating_df.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [47]:
rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


#### Action items:
