Final Project Submission

Please fill out:
* Student name: 
* Student pace: self paced / part time / full time
* Scheduled project review date/time: 
* Instructor name: 
* Blog post URL:

Roshni's Scratch Notebook

# I. Cleaning The Numbers Dataset

## 1. Import packages & data

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

tn_df = pd.read_csv('zippedData/tn.movie_budgets.csv.gz', parse_dates=['release_date'])

## 2. Remove unnecessary column (ID)

In [2]:
tn_df = tn_df.drop('id', axis=1)

## 3. Convert Number Variables to Int type

In [8]:
# Strip '$' and ',' from the currency columns, then cast to 64-bit integers
# (int64 is explicit because worldwide grosses above ~2.1 billion overflow 32-bit ints)
money_cols = ['production_budget', 'domestic_gross', 'worldwide_gross']
for col in money_cols:
    tn_df[col] = tn_df[col].str.replace(r'[$,]', '', regex=True).astype('int64')

tn_df

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,2009-12-18,Avatar,425000000,760507625,2776345279
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
2,3,2019-06-07,Dark Phoenix,350000000,42762350,149762350
3,4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747
...,...,...,...,...,...,...
5777,78,2018-12-31,Red 11,7000,0,0
5778,79,1999-04-02,Following,6000,48482,240495
5779,80,2005-07-13,Return to the Land of Wonders,5000,1338,1338
5780,81,2015-09-29,A Plague So Pleasant,1400,0,0
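
The same cleaning step can be wrapped in a small helper so it is easy to reuse and check. This is only a sketch: `currency_to_int` is a hypothetical name, and the raw string format (`'$425,000,000'`) is assumed from the `$` and `,` characters the cell above strips out.

```python
import pandas as pd

def currency_to_int(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from strings such as '$425,000,000' and cast to int64."""
    return series.str.replace(r'[$,]', '', regex=True).astype('int64')

# Toy frame standing in for the raw file (format assumed, values taken from the table above)
raw = pd.DataFrame({
    'production_budget': ['$425,000,000', '$7,000'],
    'worldwide_gross': ['$2,776,345,279', '$0'],
})

# DataFrame.apply passes each column to the helper as a Series
clean = raw.apply(currency_to_int)
print(clean.dtypes)
```

Casting to `int64` explicitly matters here because the largest worldwide gross does not fit in a 32-bit integer.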


## 4. Excluding Worldwide Gross == 0
- A Worldwide Gross of 0 indicates that either the data is missing or the movie was never released in theaters.

In [9]:
print("Excluding worldwide gross of $0 would remove " + str(round((1 - 5415/5781)*100, 1)) + " percent of datapoints")

# Less than 10% of the data is removed, which is within the rule of thumb

tn_df = tn_df[tn_df['worldwide_gross'] != 0].copy()  # copy so later column assignments don't trigger SettingWithCopyWarning

tn_df

Excluding worldwide gross of $0 would remove 6.3 percent of datapoints


Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,2009-12-18,Avatar,425000000,760507625,2776345279
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
2,3,2019-06-07,Dark Phoenix,350000000,42762350,149762350
3,4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747
...,...,...,...,...,...,...
5775,76,2006-05-26,Cavite,7000,70071,71644
5776,77,2004-12-31,The Mongol King,7000,900,900
5778,79,1999-04-02,Following,6000,48482,240495
5779,80,2005-07-13,Return to the Land of Wonders,5000,1338,1338
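
The percentage printed above is built from hard-coded row counts. As a hedged alternative, the share can be computed from the data itself so it stays correct if the file changes. The toy values below are illustrative, not the real dataset.

```python
import pandas as pd

# Toy data with one zero-gross film out of four
toy = pd.DataFrame({'worldwide_gross': [2776345279, 0, 240495, 1338]})

zero_mask = toy['worldwide_gross'] == 0
# The mean of a boolean mask is the fraction of True values
pct_removed = round(zero_mask.mean() * 100, 1)
print(f"Excluding worldwide gross of $0 would remove {pct_removed} percent of datapoints")

filtered = toy[~zero_mask].copy()
```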


## 5. Calculate Net Profit as a new variable in dataset

In [10]:
# Net profit = worldwide gross - production budget


tn_df = tn_df.assign(net_profit = tn_df['worldwide_gross'] - tn_df['production_budget'])

tn_df

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,net_profit
0,1,2009-12-18,Avatar,425000000,760507625,2776345279,2351345279
1,2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875,635063875
2,3,2019-06-07,Dark Phoenix,350000000,42762350,149762350,-200237650
3,4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963,1072413963
4,5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747,999721747
...,...,...,...,...,...,...,...
5775,76,2006-05-26,Cavite,7000,70071,71644,64644
5776,77,2004-12-31,The Mongol King,7000,900,900,-6100
5778,79,1999-04-02,Following,6000,48482,240495,234495
5779,80,2005-07-13,Return to the Land of Wonders,5000,1338,1338,-3662


## 6. Convert Gross, Profit & Budget to "in millions" to make graphs more readable

In [11]:
tn_df['budget_mils'] = tn_df['production_budget'] / 1000000
tn_df['profit_mils'] = tn_df['net_profit']/1000000
tn_df['domestic_gross_mils'] = tn_df['domestic_gross']/1000000
tn_df['worldwide_gross_mils'] = tn_df['worldwide_gross']/1000000

# II. Relationship between Production Budget and Global Net Profit

In [None]:
sns.set(rc={'figure.figsize':(16,10)})
# Column names must match those created in step 6 (budget_mils, profit_mils)
plot = sns.scatterplot(x='budget_mils', y='profit_mils', data=tn_df)
plt.axvline(0, color='black')
plt.axhline(0, color='black')

plot.set_title('Profit by Budget')
plot.set_xlabel('Production Budget (in millions)')
plot.set_ylabel('Global Profit (in millions)');

plt.show()

### Examining Correlation

In [None]:
# Series.corr avoids DataFrame.corr() failing on the non-numeric columns in pandas >= 2.0
r_budget_profit = tn_df['production_budget'].corr(tn_df['net_profit'])

round(r_budget_profit, 2)

r = 0.61 

- Moderately strong correlation between net profit & production budget
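
For reference, the r reported above is a Pearson correlation coefficient. The sketch below computes it by hand on made-up budget/profit pairs and checks it against `Series.corr`; the numbers are hypothetical and unrelated to the real result.

```python
import numpy as np
import pandas as pd

# Hypothetical budget/profit pairs, in millions
toy = pd.DataFrame({'budget': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'profit': [2.0, 1.0, 4.0, 3.0, 6.0]})

# Pearson r = sum of products of deviations / sqrt(product of sums of squared deviations)
x = toy['budget'] - toy['budget'].mean()
y = toy['profit'] - toy['profit'].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# pandas uses the Pearson method by default
r_pandas = toy['budget'].corr(toy['profit'])
print(round(r_manual, 2), round(r_pandas, 2))
```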

# III. Budget-Profit Relationship in Low, Medium and High Budget Movies

- Higher budget = higher reward, but also higher risk! Less guarantee of a payoff. Which budget range is most likely to have the best payoff?

- Subsetting films into low, medium, and high budget:
 - Low = 0 - 20 mil
 - Med = 20 - 100 mil
 - High = 100+ mil

In [None]:
labels = ["Low","Medium","High"]

# Bin edges follow the definition above: Low 0-20, Medium 20-100, High 100+ (in millions)
tn_df['budget_groups'] = pd.cut(tn_df['budget_mils'], bins=[0, 20, 100, 500], include_lowest=True, labels=labels)


test = tn_df.loc[tn_df['budget_groups'] == 'Low']

test
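
A quick sketch of how `pd.cut` treats the boundaries may help when choosing cut-offs. Intervals are right-inclusive by default, so a budget of exactly 20 mil lands in Low and exactly 100 mil in Medium. The budgets below are illustrative, and using `float('inf')` as the top edge is an assumption made here so that no upper limit has to be guessed.

```python
import pandas as pd

budgets = pd.Series([0.007, 20.0, 20.5, 100.0, 425.0])  # in millions, illustrative

groups = pd.cut(budgets,
                bins=[0, 20, 100, float('inf')],
                labels=['Low', 'Medium', 'High'],
                include_lowest=True)
print(groups.tolist())
```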

## Profit for Low, Medium & High Budget Films:

In [None]:
tn_df['roi'] = tn_df['net_profit'] / tn_df['production_budget'] * 100

# Mean ROI (%) per budget group; seaborn adds its default error bars
sns.barplot(data=tn_df, x='budget_groups', y='roi')
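
Because ROI divides by the budget, a few tiny-budget films can dominate the mean that the bar plot shows. As a rough sketch, comparing mean and median per group makes that visible. The rows below reuse a handful of values from the tables above; the group labels are assigned by hand for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'budget_groups': ['Low', 'Low', 'Low', 'High', 'High'],
    'production_budget': [6000, 5000, 7000, 425_000_000, 350_000_000],
    'net_profit': [234495, -3662, -6100, 2_351_345_279, -200_237_650],
})
toy['roi'] = toy['net_profit'] / toy['production_budget'] * 100

# Mean vs. median ROI (%) per group
summary = toy.groupby('budget_groups')['roi'].agg(['mean', 'median'])
print(summary.round(1))
```

In this toy sample one low-budget hit pulls the Low group's mean far above its median, which is negative.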

## Comparing Correlations by Group:

In [None]:
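# Pairwise budget/profit correlation within each budget group
tn_df_budget_groups = tn_df.groupby('budget_groups', observed=True)[['budget_mils', 'profit_mils']].corr()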



# IV. Qualities of High Performer vs. Low Performer Large-Budget Films

- With great risk comes great reward
- What qualities distinguish the high-reward films from the major losses? (Genre? Director? Actor?)

First, let's look at the distribution of production budgets across all films:

In [None]:
tn_df['production_budget_mil'] = tn_df['production_budget']/1000000

plt.hist(tn_df['production_budget_mil']);

- Most films fall in the lower end of the budget range. Let's subset the high-budget films (budgets of 200 mil or more):

In [None]:
tn_df['production_budget_mil'].describe()

high_budget_df = tn_df.loc[tn_df['production_budget_mil'] >= 200]

high_budget_df

- The relationship between production budget and net profit is probably more murky in this range:

In [None]:
high_budget_df['production_budget'].corr(high_budget_df['net_profit'])

In [None]:
plt.hist(high_budget_df['net_profit'])

# Other Scratch Work

## First Dataset Explorations:

### 1. "Rotten Tomatoes: Movie Info" Dataset

#### Main Takeaways:

##### Potential Predictor Variables
 1. MPAA Rating
 2. Genre
 3. Director
 4. Writer
 5. Theater Date (month released/time of month?)
 6. Runtime

##### Potential Outcome Variable
 1. Box_Office
 - Seems to be 1st weekend collection in US only? 
 - Probably not the best source to use
 
 2. Currency (only USD, useless)
 
##### Connector Variables
 1. id

#### Methodology

###### Import Data & View First 5 Rows

In [None]:
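rt_df = pd.read_csv('zippedData/rt.movie_info.tsv.gz', sep='\t')  # same file that is read as rt_movie_df further down
rt_df.head()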

###### Examine Columns/Variables of Interest

In [None]:
rt_df.info()

- Movie title is not included in this database, just ID.
- So much missing data! (more on that below)

###### Missing Data: 10% Rule

In [None]:
# Columns that have more than 10% of data missing:

print(1560*10/100) 

#156 Missing, 1404 Present

((rt_df.isna().sum())>156)

#Director, Writer, Theater_Date, DVD_Date, Currency, Box_Office, Studio
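
The 10% threshold above is worked out by hand from the row count. A hedged sketch of the same check that derives the share from the frame itself, on a made-up 10-row frame (column names borrowed from the dataset, values invented):

```python
import pandas as pd

toy = pd.DataFrame({
    'director': ['A', None, None, 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'runtime': ['104 minutes'] * 9 + [None],
    'genre': ['Drama'] * 10,
})

# isna().mean() gives the fraction of missing values per column
missing_share = toy.isna().mean()

# Columns with strictly more than 10% missing
over_threshold = missing_share[missing_share > 0.10].index.tolist()
print(over_threshold)
```

Here `runtime` is exactly 10% missing and so is not flagged, matching the strict `>` used in the cell above.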

###### Descriptive Statistics

- Numeric variables: Box_Office, Runtime
- Categorical variables: the rest
- Theater_Date --> convert to DateTime if interesting

In [None]:
rt_df['rating'].value_counts(normalize = True)* 100

In [None]:
rt_df['genre'].value_counts(normalize = True) * 100

#299 different Genres! will have to parse through & convert to lists to use?
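
One possible way to handle the combined genre strings is to split them into lists and count each genre separately. This sketch assumes the genres are joined with a `|` character, which should be checked against the actual column first; the sample values are made up.

```python
import pandas as pd

# Assumed format: genres separated by '|'
genre = pd.Series(['Action and Adventure|Drama', 'Comedy', 'Drama|Comedy', None])

genre_lists = genre.str.split('|')                    # one list of genres per film
genre_counts = genre_lists.explode().value_counts()   # one row per (film, genre) pair, then count
print(genre_counts)
```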

In [None]:
rt_df['runtime'].value_counts()

# need to get rid of minutes, need to convert to int
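
A minimal sketch of that conversion, assuming the values look like `'104 minutes'` (the sample values are invented). The nullable `Int64` dtype is used so that missing runtimes stay missing instead of forcing the column to float.

```python
import pandas as pd

runtime = pd.Series(['104 minutes', '95 minutes', None])

# Pull out the leading digits, convert to numbers, then to a nullable integer dtype
digits = runtime.str.extract(r'(\d+)', expand=False)
runtime_min = pd.to_numeric(digits).astype('Int64')
print(runtime_min)
```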

In [None]:
#Unique movies in dataset:

len(rt_df['id'].value_counts())

<br>

### 2. Rotten Tomatoes: Reviews Dataset

##### Dataset Info

In [None]:



rt_movie_df = pd.read_csv('zippedData/rt.movie_info.tsv.gz', sep='\t')

# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
rt_review_df = pd.read_csv('zippedData/rt.reviews.tsv.gz', sep='\t', encoding='unicode_escape', on_bad_lines='skip')

rt_review_df.info()

##### Missing Data 10% rule:

In [None]:
#10%: Missing data > 5443

((rt_review_df.isna().sum())>5443)

# Column over the threshold: rating

##### How many unique movies are represented in this dataset?

In [None]:
len(rt_review_df['id'].value_counts()) # 1135

- 1135 movies in this dataset vs. 1560 movies in "Movie Info" dataset?

- Missing review data on **425 movies** (1560 - 1135)?

##### Descriptive Statistics

**Categorical Variables**: Fresh, Critic, Top_Critic

**Numeric Variables**: Rating (needs to be converted to int & standardized)
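
One way the rating could be standardized is to turn fraction-style strings into a 0-1 score. This is only a sketch: it assumes some ratings look like `'3/5'` while others (letter grades, for instance) do not, and it simply leaves the non-fraction ones as missing. The sample values are invented.

```python
import pandas as pd

rating = pd.Series(['3/5', '4/4', 'B+', '2.5/4', None])

# Capture "numerator/denominator"; anything else yields NaN in both columns
parts = rating.str.extract(r'^\s*(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)\s*$')
score = pd.to_numeric(parts[0]) / pd.to_numeric(parts[1])
print(score)
```

A denominator of 0 would produce an infinite score, so such rows would need separate handling if they occur.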

In [None]:
rt_review_df['rating'].value_counts()

In [None]:
rt_review_df['fresh'].value_counts()

In [None]:
rt_review_df['top_critic'].value_counts()

<br>

### 3. The Numbers Database

#### Main Takeaways

##### Potential Variables of Interest:
1. Release Date
2. Production Budget

##### Outcome Variables:
1. Domestic Gross
2. Worldwide Gross
3. Production Budget (if we want to look at net profits)

##### Connector Variables:
1. Movie Name

##### Variable Definitions:
1. Domestic Gross = total box office collections in US & Canada
2. Worldwide Gross = total box office collections everywhere
3. Net Profit (to be calculated) = Worldwide Gross - Production Budget

#### Methodology

##### Imports, First 5 Rows & Last 5 Rows

In [None]:

tn_df.head()

In [None]:
tn_df.tail()

###### Questions:
1. How does a movie gross nothing? Should we exclude?

##### Dataset Info

In [None]:
tn_df.info()

- No missing data! 

- All **object type**

## Takeaways from Tues AM Meeting

- First priority: The Numbers Dataset
- Second priority: Rotten Tomatoes (too messy; similar data available in The Movies Database)

#### Next Steps (using The Numbers Dataset)

1. Explore if ID can be used as a connecting variable
2. Convert Production_Budget, Domestic_Gross, Worldwide_Gross to int
3. Calculate Net Profit
4. **Scatter Plot: Production Budget vs. Net Profit**
5. Convert Release_Date to DateTime Object
6. Calculate Quarter, Month, Time of Month (early/mid/late?)
7. Explore correlations between time & profit
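
Steps 5 and 6 could be sketched as below. The early/mid/late cut-offs (days 1-10, 11-20, 21-31) are an assumption, since the notes leave them open; the dates are taken from the first rows of the dataset.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2009-12-18', '2011-05-20', '2019-06-07']))

features = pd.DataFrame({
    'quarter': dates.dt.quarter,
    'month': dates.dt.month,
    # Assumed split: days 1-10 early, 11-20 mid, 21-31 late
    'time_of_month': pd.cut(dates.dt.day, bins=[0, 10, 20, 31],
                            labels=['early', 'mid', 'late']),
})
print(features)
```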

### 1. Exploring ID Further: Is this a connecting variable?

In [None]:
tn_df['id'].value_counts()

 - ID not a unique identifier, seems to just be the last 2 digits of the original index

In [None]:
tn_df.describe()

### 4. Scatterplot: Production Budget by Worldwide Net Profit