## Apply hypothesis testing to explore what makes a movie "successful"
- The stakeholder's first question is: does the MPAA rating of a movie (G/PG/PG-13/R) affect how much revenue the movie generates?
    - They want you to perform a statistical test to get a mathematically-supported answer.
    - They want you to report whether you found a significant difference between ratings.
    - If so, what was the p-value of your analysis?
    - And which rating earns the most revenue?
    - They want you to prepare a visualization that supports your finding.

In [1]:
# Standard Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Additional Imports
import os, json, math, time, glob
from tqdm.notebook import tqdm_notebook
import tmdbsimple as tmdb
from sqlalchemy import create_engine, text
import pymysql

In [2]:
# Create the sqlalchemy engine and connection
pymysql.install_as_MySQLdb()
with open('/Users/yupfj/.secret/mySQL.json') as f:
    login = json.load(f)
username = login['username']
password = login['password']
# If the password contains special characters, URL-encode it first:
# from urllib.parse import quote_plus; password = quote_plus(password)
db_name = "movie"
connection = f"mysql+pymysql://{username}:{password}@localhost/{db_name}"
engine = create_engine(connection)
conn = engine.connect()
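The comment in the cell above mentions encoding passwords that contain special characters. Below is a minimal sketch of that step using the standard library's `urllib.parse.quote_plus`. The `build_connection_url` helper and the credentials are hypothetical placeholders for illustration, not the values used in this notebook.

```python
from urllib.parse import quote_plus


def build_connection_url(username, password, db_name, host="localhost"):
    """Build a SQLAlchemy-style MySQL URL with a URL-encoded password.

    Hypothetical helper for illustration only.
    """
    # quote_plus escapes characters such as '@' and '!' that could
    # otherwise be misread as URL delimiters
    return f"mysql+pymysql://{username}:{quote_plus(password)}@{host}/{db_name}"


# Hypothetical credentials, not real ones
url = build_connection_url("root", "Myp@ssword!", "movie")
print(url)
```

The resulting string could then be passed to `create_engine` in the same way as the `connection` variable above.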

In [3]:
q = """
SHOW tables;
"""
# Pass the query through the text function before running read_sql
pd.read_sql(text(q), conn)

Unnamed: 0,Tables_in_movie
0,genres
1,ratings
2,title_basics
3,title_genres
4,tmdb_data


In [4]:
q = """
SELECT * FROM tmdb_data;
"""
# Pass the query through the text function before running read_sql
tmdb_data = pd.read_sql(text(q), conn)
tmdb_data

Unnamed: 0,imdb_id,budget,revenue,certification
0,tt0035423,48000000.0,76019000.0,PG-13
1,tt0096056,0.0,0.0,
2,tt0114447,0.0,0.0,
3,tt0116916,0.0,0.0,PG
4,tt0118589,22000000.0,5271670.0,PG-13
...,...,...,...,...
2635,tt8665056,0.0,0.0,
2636,tt8795764,0.0,0.0,NR
2637,tt8825252,0.0,0.0,
2638,tt9071078,127389.0,0.0,


In [5]:
groups = ['G','R','PG','PG-13']
data={}
for i in groups:
    ## Select the revenue Series for this certification
    data[i] = tmdb_data.loc[tmdb_data['certification']==i,'revenue'].copy()
data.keys()

dict_keys(['G', 'R', 'PG', 'PG-13'])

#### Before running the ANOVA test, we need to check its assumptions for each group: no significant outliers, normality, and equal variance

In [6]:
# Remove significant outliers (|z-score| > 3) from each group in a single pass
for i in groups:
    zscores = stats.zscore(data[i])
    outliers = np.abs(zscores) > 3
    num_out = np.sum(outliers)
    if num_out>0:
        print(f"remove {num_out} outliers from the {i} group")
        data[i] = data[i][~outliers]

remove 1 outliers from the G group
remove 15 outliers from the R group
remove 3 outliers from the PG group
remove 2 outliers from the PG-13 group
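To make the outlier rule above concrete, here is a minimal standard-library sketch of what a single pass of the |z| > 3 filter computes. The `remove_zscore_outliers` helper and the toy values are hypothetical. It uses the population standard deviation, which matches the default (`ddof=0`) of `scipy.stats.zscore`, and it assumes the values are not all identical (otherwise the standard deviation is zero).

```python
import statistics


def remove_zscore_outliers(values, threshold=3.0):
    """Return (kept, removed) after one pass of the |z| > threshold rule."""
    mean = statistics.fmean(values)
    # Population standard deviation (ddof=0), as in scipy.stats.zscore's default
    std = statistics.pstdev(values)
    kept, removed = [], []
    for v in values:
        z = (v - mean) / std
        (removed if abs(z) > threshold else kept).append(v)
    return kept, removed


# Hypothetical toy values: 20 typical points and one extreme point
toy_revenues = [10, 11, 12, 13] * 5 + [500]
kept, removed = remove_zscore_outliers(toy_revenues)
print(f"removed {len(removed)} outlier(s): {removed}")
```

Because the mean and standard deviation are themselves pulled by extreme values, a second pass over `kept` could flag further points. The cell above deliberately applies the rule only once.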


In [9]:
## Run the normality test on each group and record each group's size (we want n > 20 per group)
norm_results = {}
for i in groups:
    stat, p = stats.normaltest(data[i])
    ## save the p val, test statistic, and the size of the group
    norm_results[i] = {'n': len(data[i]), 'sig': p < .05, 
                       'p value':p, 'test stat':stat}
## convert to a dataframe
norm_results_df = pd.DataFrame(norm_results).T
norm_results_df

Unnamed: 0,n,sig,p value,test stat
G,21,False,0.138266,3.957147
R,474,True,0.0,311.092628
PG,64,True,0.0,61.249901
PG-13,199,True,0.0,100.759508


#### The normality assumption is NOT met for the R, PG, and PG-13 groups (p < .05), but we can still proceed because each group is considered large enough (n > 20).
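The intuition behind the sample-size rule is that group means behave more normally than the raw values do as the number of observations grows. The sketch below illustrates this with simulated data only: the exponential "revenue" population, the sample size of 50, and the `skewness` helper are all assumptions made for illustration, and the simulation does not justify any particular cutoff.

```python
import random
import statistics


def skewness(values):
    """Skewness: mean of the cubed standardized deviations (population std)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / std) ** 3 for v in values)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed population (exponential with mean 1.0)
raw = [rng.expovariate(1.0) for _ in range(5000)]

# Means of many samples of size 50 drawn from the same distribution
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(50))
    for _ in range(2000)
]

print(f"skewness of raw values:   {skewness(raw):.2f}")
print(f"skewness of sample means: {skewness(sample_means):.2f}")
```

The sample means should come out far less skewed than the raw values, which is why tests that compare means can tolerate non-normal data when groups are large.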

In [10]:
# Test the assumption of equal variance with Levene's test; the * operator unpacks the groups as separate arguments
stats.levene(*data.values())

LeveneResult(statistic=36.667822979945335, pvalue=3.981505763776953e-22)

#### Levene's test is significant (p < .05), so we DO NOT meet the assumption of equal variance.
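When the equal-variance assumption fails, one commonly used alternative to the classic one-way ANOVA is the nonparametric Kruskal-Wallis test, available as `scipy.stats.kruskal`. The sketch below shows the call pattern on toy data. The `compare_groups_kruskal` helper and the toy groups are hypothetical, and this is not necessarily the test the author goes on to use.

```python
import scipy.stats as stats


def compare_groups_kruskal(groups, alpha=0.05):
    """Run a Kruskal-Wallis H-test on a dict of {label: values}.

    Hypothetical helper for illustration only.
    """
    # The * operator unpacks each group's values as a separate argument
    result = stats.kruskal(*groups.values())
    return {
        "test stat": result.statistic,
        "p value": result.pvalue,
        "sig": result.pvalue < alpha,
    }


# Hypothetical toy groups, NOT the movie revenues
toy_groups = {
    "A": list(range(0, 20)),
    "B": list(range(100, 120)),
    "C": list(range(200, 220)),
}
summary = compare_groups_kruskal(toy_groups)
print(summary)
```

With the notebook's data, the same helper could be called as `compare_groups_kruskal(data)`. A significant result would say only that at least one rating differs, so a post-hoc pairwise comparison would still be needed to identify which one.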