# Phase 2 Review

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
import sklearn.metrics as metrics
from random import gauss
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats as st

%matplotlib inline

pd.set_option('display.max_columns', 100)

### Check Your Data … Quickly
The first thing to do when you get a new dataset is to quickly verify its contents with the `.head()` method.

In [2]:
df = pd.read_csv('movie_metadata.csv')
print(df.shape)
df.head(15)
df

(5043, 28)


Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens ...,8,143,,0.0,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,,,,12.0,7.1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5038,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,Eric Mabius,Signed Sealed Delivered,629,2283,Crystal Lowe,2.0,fraud|postal worker|prison|theft|trial,http://www.imdb.com/title/tt3000844/?ref_=fn_t...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
5039,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,Natalie Zea,The Following,73839,1753,Sam Underwood,1.0,cult|fbi|hideout|prison escape|serial killer,http://www.imdb.com/title/tt2071645/?ref_=fn_t...,359.0,English,USA,TV-14,,,593.0,7.5,16.00,32000
5040,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,Eva Boehnke,A Plague So Pleasant,38,0,David Chandler,0.0,,http://www.imdb.com/title/tt2107644/?ref_=fn_t...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
5041,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,Alan Ruck,Shanghai Calling,1255,2386,Eliza Coupe,5.0,,http://www.imdb.com/title/tt2070597/?ref_=fn_t...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660


## Question 1

A Hollywood executive wants to know how much an R-rated movie released after 2000 will earn. The data above is a sample of some of the movies with that rating during that timeframe, as well as other movies. How would you go about answering her question? Talk through it theoretically and then do it in code.

What is the 95% confidence interval for a post-2000 R-rated movie's box office gross?

In [3]:
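# I would filter the sample to R-rated movies released after 2000, drop rows with
# no gross figure, and calculate a 95% t-based confidence interval for the mean gross.
# Then, "I can say with 95% confidence that the average (mean) take for R-rated movies released after
# 2000 falls within that interval." This describes the mean take, not the range
# in which any single movie's gross will fall.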

In [4]:
newer_R_df = df[(df['title_year'] > 2000) & (df['content_rating'] == 'R')].dropna(subset = ['gross'])
newer_R_df

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
94,Color,Jonathan Mostow,280.0,109.0,84.0,191.0,M.C. Gainey,648.0,150350192.0,Action|Sci-Fi,Nick Stahl,Terminator 3: Rise of the Machines,305340,1769,Carolyn Hennesy,0.0,drifter|exploding truck|future|machine|skynet,http://www.imdb.com/title/tt0181852/?ref_=fn_t...,1676.0,English,USA,R,200000000.0,2003.0,284.0,6.4,2.35,0
113,Color,Oliver Stone,248.0,206.0,0.0,591.0,Angelina Jolie Pitt,12000.0,34293771.0,Action|Adventure|Biography|Drama|History|Roman...,Anthony Hopkins,Alexander,138863,24598,Brian Blessed,3.0,ancient greece|conquest|greek|greek myth|king,http://www.imdb.com/title/tt0346491/?ref_=fn_t...,1390.0,English,Germany,R,155000000.0,2004.0,11000.0,5.5,2.35,0
124,Color,Lana Wachowski,245.0,129.0,0.0,233.0,Collin Chou,309.0,139259759.0,Action|Sci-Fi,Essie Davis,The Matrix Revolutions,364948,1062,Nona Gaye,0.0,battle|epic|fight|future|machine,http://www.imdb.com/title/tt0242653/?ref_=fn_t...,2121.0,English,Australia,R,150000000.0,2003.0,269.0,6.7,2.35,0
126,Color,Lana Wachowski,275.0,138.0,0.0,30.0,Daniel Bernhardt,234.0,281492479.0,Action|Sci-Fi,Steve Bastoni,The Matrix Reloaded,421818,534,Helmut Bakaitis,0.0,car motorcycle chase|one against many|oracle|p...,http://www.imdb.com/title/tt0234215/?ref_=fn_t...,2789.0,English,USA,R,150000000.0,2003.0,198.0,7.2,2.35,0
128,Color,George Miller,739.0,120.0,750.0,943.0,Charlize Theron,27000.0,153629485.0,Action|Adventure|Sci-Fi|Thriller,Tom Hardy,Mad Max: Fury Road,552503,40025,Zoë Kravitz,0.0,australia|desert|escape|on the run|post apocal...,http://www.imdb.com/title/tt1392190/?ref_=fn_t...,1588.0,English,Australia,R,150000000.0,2015.0,9000.0,8.1,2.35,191000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5004,Color,Mike Bruce,3.0,78.0,6.0,17.0,Kirpatrick Thomas,32.0,243768.0,Western,Joseph Campanella,The Legend of God's Gun,143,72,Christian Anderson,0.0,,http://www.imdb.com/title/tt1073221/?ref_=fn_t...,9.0,English,USA,R,30000.0,2007.0,17.0,4.1,2.35,13
5007,Color,Ben Wheatley,53.0,93.0,214.0,59.0,Tony Way,177.0,9609.0,Comedy|Crime|Drama,Michael Smiley,Down Terrace,2646,365,David Schaal,4.0,black comedy,http://www.imdb.com/title/tt1489167/?ref_=fn_t...,22.0,English,UK,R,,2009.0,95.0,6.5,,535
5012,Color,David Ayer,233.0,109.0,453.0,120.0,Martin Donovan,1000.0,10499968.0,Action|Crime|Drama|Thriller,Mireille Enos,Sabotage,47502,1458,Maurice Compte,3.0,dea|drug cartel|kicked in the crotch|strip clu...,http://www.imdb.com/title/tt1742334/?ref_=fn_t...,212.0,English,USA,R,35000000.0,2014.0,206.0,5.7,1.85,10000
5021,Color,Jay Duplass,51.0,85.0,157.0,10.0,Katie Aselton,830.0,192467.0,Comedy|Drama|Romance,Mark Duplass,The Puffy Chair,4067,1064,Bari Hyman,0.0,birthday|gift|motel|new york city|upholsterer,http://www.imdb.com/title/tt0436689/?ref_=fn_t...,71.0,English,USA,R,15000.0,2005.0,224.0,6.6,,297


In [5]:
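# do it in code here
# Degrees of freedom: number of non-missing gross values minus one
deg_freedom = newer_R_df['gross'].count() - 1
# 95% t-interval centred on the sample mean, scaled by the standard error of the mean
lower, upper = st.t.interval(0.95, df=deg_freedom, loc=newer_R_df['gross'].mean(), scale=st.sem(newer_R_df['gross'], nan_policy='omit'))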

In [6]:
print(f"{round(lower,2):,}")

25,442,351.71


In [7]:
print('I can say with 95% confidence that the average take for a post-2000, ')
print('R-rated movie falls between $' , (f"{round(lower,2):,}"), 'and $' , (f"{round(upper,2):,}"))

I can say with 95% confidence that the average take for a post-2000, 
R-rated movie falls between $ 25,442,351.71 and $ 29,855,345.17
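
To make the interval calculation concrete, here is a minimal sketch that builds the same kind of 95% t-interval by hand and compares it with `st.t.interval`. The data are synthetic (a right-skewed sample standing in for gross figures), and the names `sample`, `manual_lower`, and `manual_upper` are illustrative assumptions rather than part of the notebook.

```python
import numpy as np
from scipy import stats as st

# Synthetic, right-skewed stand-in for box office gross (NOT the movie data).
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=16, sigma=1.5, size=500)

n = sample.size
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean
t_crit = st.t.ppf(0.975, df=n - 1)      # two-sided 95% critical value

# Interval built by hand: mean +/- t_crit * sem
manual_lower = mean - t_crit * sem
manual_upper = mean + t_crit * sem

# Same interval from scipy's helper
lib_lower, lib_upper = st.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
```

Both routes should agree up to floating point error, which shows that the helper is only packaging the mean, the standard error, and the t critical value.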


## Question 2a

Your ability to answer the first question has the executive excited, and now she has many other questions about the types of movies being made and the differences in those movies' budgets and gross amounts.

Read through the questions below and **determine what type of statistical test you should use** for each question and **write down the null and alternative hypothesis for those tests**.

**- Is there a relationship between the number of Facebook likes for a cast and the box office gross of the movie?**


In [8]:
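# Simple Linear Regression (equivalently, a test of the Pearson correlation)

# H0: no linear relationship (correlation = 0)
# H1: some linear relationship (correlation != 0)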

In [9]:
df_2a1 = df[df['cast_total_facebook_likes'].notna()]
df_2a1 = df_2a1[df_2a1['gross'].notna()]

cast_facebook_likes = list(df_2a1['cast_total_facebook_likes'])
box_office_gross = list(df_2a1['gross'])

In [10]:
len(box_office_gross)

4159

In [11]:
df_2a1['gross'].corr(df_2a1['cast_total_facebook_likes'])

0.2473996379795019

In [12]:
stat, p = st.pearsonr(cast_facebook_likes, box_office_gross)
print('stat=%.3f, p=%.3f' % (stat, p))

stat=0.247, p=0.000


In [13]:
# P < alpha - reject H0. There is likely some relationship. 
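
The hypotheses name simple linear regression while the code tests a Pearson correlation. For one predictor these are the same test, which the following sketch illustrates on synthetic data. The variables `x` and `y` are made up for illustration and are not the movie columns.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(5)
x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)   # synthetic data with a weak linear link

r, p_corr = st.pearsonr(x, y)        # correlation coefficient and its p-value
fit = st.linregress(x, y)            # simple linear regression of y on x

# fit.rvalue is the same correlation, and fit.pvalue tests slope == 0
print(f"r={r:.3f}, p_corr={p_corr:.3g}, p_slope={fit.pvalue:.3g}")
```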

**- Do foreign films perform differently at the box office than non-foreign films?**


In [14]:
# T test

# H0: mu_foreign_gross == mu_domestic_gross
# H1: mu_foreign_gross != mu_domestic_gross

In [15]:
df_2a2 = df[df['gross'].notna()]
df_2a2 = df_2a2[df_2a2['country'].notna()]

df_2a2_domestic = df_2a2[df_2a2['country'] == "USA"]
df_2a2_foreign = df_2a2[df_2a2['country'] != "USA"]

In [16]:
st.ttest_ind(df_2a2_domestic['gross'], df_2a2_foreign['gross'])

Ttest_indResult(statistic=12.098302287742106, pvalue=3.863109466861356e-33)

In [17]:
# P < alpha - reject H0. It is likely that the mean take of foreign films is different to that of domestic films. 
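
The call above uses the default `equal_var=True`, which pools the two group variances. When group sizes and spreads differ, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below compares the two on synthetic groups. The arrays `domestic` and `foreign` are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(1)
domestic = rng.normal(loc=50, scale=30, size=400)   # synthetic group A
foreign = rng.normal(loc=40, scale=10, size=120)    # synthetic group B

pooled = st.ttest_ind(domestic, foreign)                   # assumes equal variances
welch = st.ttest_ind(domestic, foreign, equal_var=False)   # Welch's t-test

print(f"pooled: t={pooled.statistic:.2f}, p={pooled.pvalue:.3g}")
print(f"welch:  t={welch.statistic:.2f}, p={welch.pvalue:.3g}")
```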

**- Of all movies created, are 40% rated R?**

In [18]:
# Z - test for proportion

# H0: R/all = 0.4, H1: R/all != 0.4
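
A one-sample z-test for a proportion can be sketched with the standard library alone. The helper name `proportion_ztest` and the counts passed to it are hypothetical and chosen only to show the arithmetic; they are not taken from the movie data.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ztest(successes, n, p0):
    """Two-sided one-sample z-test for a proportion (normal approximation)."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)    # standard error under H0
    z = (p_hat - p0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts, for illustration only: 450 R-rated films out of 1000.
z, p_value = proportion_ztest(450, 1000, 0.4)
print(f"z={z:.3f}, p={p_value:.4f}")
```

With the real data, `successes` would be the number of R-rated rows and `n` the number of rows with a known rating.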

- Is there a relationship between the language of a film and the content rating (G, PG, PG-13, R) of that film?

In [19]:
# chi_squared

#H0: there is no relationship between the language and content rating (they are independent).
#H1: there is a relationship between the language and content rating.
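
As a sketch of how a chi-squared test of independence could be run, the block below builds a contingency table with `pd.crosstab` and passes it to `st.chi2_contingency`. The `toy` DataFrame is made up for illustration; with the real data, the two columns would be `language` and `content_rating` from `df`.

```python
import pandas as pd
from scipy import stats as st

# Tiny made-up table, for illustration only (NOT the movie data).
toy = pd.DataFrame({
    "language": ["English"] * 60 + ["Other"] * 40,
    "content_rating": (["R"] * 30 + ["PG-13"] * 20 + ["PG"] * 10
                       + ["R"] * 25 + ["PG-13"] * 10 + ["PG"] * 5),
})

# Rows are languages, columns are ratings, cells are counts.
observed = pd.crosstab(toy["language"], toy["content_rating"])
chi2, p_value, dof, expected = st.chi2_contingency(observed)

print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```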

- Is there a relationship between the content rating of a film and its budget? 

In [20]:
# anova

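#H0: there is no relationship (mean budget is the same for every content rating).
#H1: there is a relationship (at least one content rating has a different mean budget).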
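
A one-way ANOVA can be sketched with `st.f_oneway`, which takes one array per group. The three budget arrays below are synthetic and their names are hypothetical; with the real data, each array would hold the non-missing `budget` values for one content rating.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(2)
# Synthetic budgets (in $ millions) for three hypothetical rating groups.
budget_pg = rng.normal(60, 15, size=80)
budget_pg13 = rng.normal(55, 15, size=80)
budget_r = rng.normal(35, 15, size=80)

# H0: all group means are equal. A small p-value suggests at least one differs.
f_stat, p_value = st.f_oneway(budget_pg, budget_pg13, budget_r)
print(f"F={f_stat:.2f}, p={p_value:.3g}")
```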

## Question 2b

Calculate the answer for the second question:

- Do foreign films perform differently at the box office than non-foreign films?

In [21]:
# T_test




## Question 3

Now that you have answered all of those questions, the executive wants you to create a model that predicts the money a movie will make if it is released next year in the US. She wants to use this to evaluate different scripts and then decide which one has the largest revenue potential. 

Below is a list of potential features you could use in the model. Would you use all of these features in the model? Identify which features you might drop and why.


*Remember you want to be able to use this model to predict the box office gross of a film **before** anyone has seen it.*

- **budget**: The amount of money spent to make the movie
- **title_year**: The year the movie first came out in the box office
- **years_old**: How long has it been since the movie was released
- **genre**: Each movie is assigned one genre category like action, horror, comedy
- **avg_user_rating**: This rating is taken from Rotten tomatoes, and is the average rating given to the movie by the audience
- **actor_1_facebook_likes**: The number of likes that the most popular actor in the movie has
- **cast_total_facebook_likes**: The sum of likes for the three most popular actors in the movie
- **language**: the original spoken language of the film
- **director_name**
- **duration**


In [22]:
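# Drop title_year and years_old: every script under evaluation would be released next year,
# so these values are identical across candidates and cannot help rank them.
# Drop avg_user_rating: it is not available before anyone has seen the movie.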

In [23]:
df_3 = df[['budget','actor_1_facebook_likes', 'cast_total_facebook_likes', 'language', 'gross']]

In [24]:
df_3 = df_3[df_3['gross'].notna()]
df_3 = df_3[df_3['language'].notna()]
df_3 = df_3[df_3['budget'].notna()]
df_3 = df_3[df_3['actor_1_facebook_likes'].notna()]

x = df_3['language']
condlist = [x == 'English', x != 'English']
choicelist = [1, 0]
df_3['english?'] = np.select(condlist, choicelist)

df_3 = df_3[['budget','actor_1_facebook_likes', 'cast_total_facebook_likes', 'english?', 'gross']]


In [25]:
y = df_3['gross']
X = sm.add_constant(df_3.drop('gross', axis=1))

model2 = sm.OLS(y, X).fit()
model2.summary()

0,1,2,3
Dep. Variable:,gross,R-squared:,0.138
Model:,OLS,Adj. R-squared:,0.137
Method:,Least Squares,F-statistic:,154.7
Date:,"Wed, 26 May 2021",Prob (F-statistic):,5.2800000000000004e-123
Time:,17:18:13,Log-Likelihood:,-75394.0
No. Observations:,3885,AIC:,150800.0
Df Residuals:,3880,BIC:,150800.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.209e+06,4.89e+06,-0.247,0.805,-1.08e+07,8.38e+06
budget,0.0323,0.005,6.829,0.000,0.023,0.042
actor_1_facebook_likes,-3230.1648,207.861,-15.540,0.000,-3637.693,-2822.637
cast_total_facebook_likes,3304.5649,169.186,19.532,0.000,2972.863,3636.267
english?,3.999e+07,5.02e+06,7.967,0.000,3.01e+07,4.98e+07

0,1,2,3
Omnibus:,2319.828,Durbin-Watson:,1.023
Prob(Omnibus):,0.0,Jarque-Bera (JB):,34837.718
Skew:,2.567,Prob(JB):,0.0
Kurtosis:,16.742,Cond. No.,1520000000.0


## Question 4a

Create the following variables:

- `years_old`: The number of years since the film was released.
- Dummy categories for each of the following ratings:
    - `G`
    - `PG`
    - `R`
    


In [26]:
df['years_old'] = 2021 - df['title_year']
df = df[df['content_rating'].isin(['R', 'PG', 'G', 'PG-13'])]

In [27]:
ohe = OneHotEncoder(sparse = False)
ohe.fit_transform(df[['content_rating']])

array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       ...,
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.]])

In [28]:
comma_df = pd.DataFrame(ohe.fit_transform(df[['content_rating']]), columns=ohe.get_feature_names())
df = df[['cast_total_facebook_likes','budget','years_old', 'gross']].reset_index().join(comma_df)
df.drop('index', axis = 1, inplace = True)
df.dropna(inplace = True)
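
The encoder calls above depend on the scikit-learn version: in newer releases `sparse` has been replaced by `sparse_output` and `get_feature_names()` by `get_feature_names_out()`, and the generated column names differ. As a version-independent sketch, the same dummy columns can be built with `pd.get_dummies`. The `toy` frame is a made-up example, not the movie data.

```python
import pandas as pd

toy = pd.DataFrame({"content_rating": ["R", "PG", "G", "PG-13", "R"]})

# One column per rating, filled with 0.0 / 1.0
dummies = pd.get_dummies(toy["content_rating"], dtype=float)

# Drop PG-13 so it becomes the reference level, leaving G, PG and R
dummies = dummies.drop(columns="PG-13")
print(dummies)
```

Dropping one level avoids perfect collinearity with the constant term, which is why the model below also leaves out the PG-13 column.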


Once you have those variables, create a summary output for the following OLS model:

`gross~cast_total_facebook_likes+budget+years_old+G+PG+R`

In [29]:
# your answer here
y = df['gross']
X = sm.add_constant(df.drop('gross', axis = 1))
X = X.drop('x0_PG-13', axis = 1)

In [30]:
model2 = sm.OLS(y, X).fit()
model2.summary()

0,1,2,3
Dep. Variable:,gross,R-squared:,0.139
Model:,OLS,Adj. R-squared:,0.137
Method:,Least Squares,F-statistic:,100.1
Date:,"Wed, 26 May 2021",Prob (F-statistic):,3.58e-117
Time:,17:18:13,Log-Likelihood:,-72519.0
No. Observations:,3735,AIC:,145100.0
Df Residuals:,3728,BIC:,145100.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.647e+07,2.84e+06,19.889,0.000,5.09e+07,6.2e+07
cast_total_facebook_likes,823.2396,56.453,14.583,0.000,712.557,933.922
budget,0.0258,0.005,5.449,0.000,0.017,0.035
years_old,-5.146e+04,1.27e+05,-0.406,0.685,-3e+05,1.97e+05
x0_G,2.291e+07,7.18e+06,3.191,0.001,8.83e+06,3.7e+07
x0_PG,1.039e+07,3.3e+06,3.146,0.002,3.91e+06,1.69e+07
x0_R,-3.377e+07,2.41e+06,-13.986,0.000,-3.85e+07,-2.9e+07

0,1,2,3
Omnibus:,2228.542,Durbin-Watson:,1.077
Prob(Omnibus):,0.0,Jarque-Bera (JB):,36066.06
Skew:,2.534,Prob(JB):,0.0
Kurtosis:,17.355,Cond. No.,1560000000.0


## Question 4b

Below is the summary output you should have gotten above. Identify any key takeaways from it.
- How ‘good’ is this model?
- Which features help to explain the variance in the target variable? 
    - Which do not? 


<img src="ols_summary.png" style="width:300px;">

In [31]:
# your answer here



## Question 5

**Bayes Theorem**

An advertising executive is studying television viewing habits of married men and women during prime time hours. Based on the past viewing records he has determined that during prime time wives are watching television 60% of the time. It has also been determined that when the wife is watching television, 40% of the time the husband is also watching. When the wife is not watching the television, 30% of the time the husband is watching the television. Find the probability that if the husband is watching the television, the wife is also watching the television.

In [32]:
# The husband is watching TV 30% of 40% of the time plus
# 40% of 60% of the time.

# The wife is watching TV with the husband 40% of 60% of the time. 

P_wife_given_husband = (0.6 * 0.4) / ((0.3 * 0.4) + (0.6 * 0.4))
P_wife_given_husband

# If the husband is watching TV, there is a 2/3 probability that the 
# wife is also watching TV. 

0.6666666666666666
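
The same calculation can be written as a small reusable function. This is a sketch for a binary hypothesis. The function name `posterior` and its parameter names are illustrative assumptions.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """P(H | E) via Bayes' theorem for a binary hypothesis H."""
    # Law of total probability: P(E) = P(E|H)P(H) + P(E|not H)P(not H)
    p_evidence = (p_evidence_given_h * prior
                  + p_evidence_given_not_h * (1 - prior))
    return p_evidence_given_h * prior / p_evidence

# H = wife is watching, E = husband is watching (numbers from the question).
p_wife_given_husband = posterior(0.6, 0.4, 0.3)
print(p_wife_given_husband)
```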

## Question 6

Explain what a Type I error is and how it relates to the significance level when doing a statistical test. 

In [33]:
# your answer here

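# A Type I error is also known as a false positive. It means you have rejected
# a null hypothesis that is actually true. The significance level (alpha) is the
# probability of making a Type I error when the null hypothesis is true, so
# testing at alpha = 0.05 means accepting a 5% chance of a false positive.
# Lowering alpha makes Type I errors rarer, at the cost of more Type II errors.
# A Type I error can arise from sampling variation alone, and a biased sample
# makes a false rejection more likely, for instance, using a school basketball
# team as a sample to determine whether average student height is greater than 75".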
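
The link between alpha and the Type I error rate can be checked by simulation. In this sketch every sample is drawn from a population where the null hypothesis is true, so each rejection is a false positive. The sample size, number of trials, and seed are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(3)
alpha = 0.05
n_trials = 2000

# Every row is a sample of 30 values with true mean 0, so H0 (mu == 0)
# is true by construction and every rejection is a Type I error.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_trials, 30))
p_values = st.ttest_1samp(samples, popmean=0.0, axis=1).pvalue

# Fraction of trials in which H0 was (wrongly) rejected
type_1_rate = np.mean(p_values < alpha)
print(f"observed Type I error rate: {type_1_rate:.3f}")
```

The observed rate should land close to `alpha`, and it should move with `alpha` if that value is changed.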

## Question 7

How is the confidence interval for a sample related to a one sample t-test?

In [34]:
# For the same confidence level, if a hypothesized population mean
# falls within the confidence interval, a two-sided one sample t-test
# will fail to reject it as the true population mean. If it falls outside
# the interval, the test will reject it at alpha = 1 - confidence level.
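
This correspondence can be checked directly. The sketch below computes a 95% interval from a synthetic sample, then runs a two-sided one-sample t-test against one hypothesized mean inside the interval and one outside it. The helper `rejected` and the variable names are illustrative assumptions.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(4)
sample = rng.normal(loc=10.0, scale=2.0, size=50)   # synthetic data

n = sample.size
lower, upper = st.t.interval(0.95, df=n - 1,
                             loc=sample.mean(), scale=st.sem(sample))

def rejected(mu0, alpha=0.05):
    """True if a two-sided one-sample t-test rejects H0: mu == mu0."""
    return st.ttest_1samp(sample, popmean=mu0).pvalue < alpha

inside = (lower + upper) / 2    # a hypothesized mean inside the interval
outside = upper + 1.0           # a hypothesized mean outside the interval

print(rejected(inside), rejected(outside))
```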

