# Apply hypothesis testing to explore what makes a movie "successful" based on the 2001-2005 movies
Stakeholders want you to perform a statistical test to get a mathematically-supported answer.
1. Does the MPAA rating of a movie (G/PG/PG-13/R) affect how much revenue the movie generates?
    - They want you to report whether there is a significant difference between MPAA ratings and, if so, which rating earns the most revenue.
    - They want you to prepare a visualization that supports your finding.

In [None]:
# Standard Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Additional Imports
import os, json, math, time, glob
from tqdm.notebook import tqdm_notebook
import tmdbsimple as tmdb
from sqlalchemy import create_engine, text
import pymysql

In [None]:
# Create the sqlalchemy engine and connection
pymysql.install_as_MySQLdb()
with open('/Users/yupfj/.secret/mySQL.json') as f:
    login = json.load(f)
username = login['username']
password = login['password']
# password = quote_plus("Myp@ssword!") # Use quote_plus (from urllib.parse import quote_plus) if the password has special chars
db_name = "movie"
connection = f"mysql+pymysql://{username}:{password}@localhost/{db_name}"
engine = create_engine(connection)
conn = engine.connect()

In [None]:
q = """
SHOW tables;
"""
# Pass the query through the text function before running read_sql
pd.read_sql(text(q), conn)

In [None]:
q = """
SELECT * FROM tmdb_data
WHERE revenue >0 AND certification IS NOT NULL;
"""
# Pass the query through the text function before running read_sql
df=pd.read_sql(text(q), conn)
df

In [None]:
groups = ['G','R','PG','PG-13']
data={}
for i in groups:
    ## Get the revenue series for this group
    data[i] = df.loc[df['certification']==i,'revenue'].copy()
data.keys()

#### Before running the ANOVA test, we need to check each group for significant outliers, normality, and equal variance
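
The outlier check can be sketched as a small reusable helper. This is only a minimal sketch on made-up numbers: `drop_zscore_outliers` and the `demo` series are hypothetical names that are not part of this notebook, and the threshold of 3 mirrors the rule used in the cells that follow.

```python
import numpy as np
import pandas as pd
from scipy import stats


def drop_zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return the rows whose absolute z-score is at most `threshold`."""
    zscores = np.abs(stats.zscore(values))
    return values[zscores <= threshold]


# Hypothetical revenues: 29 ordinary films plus one extreme blockbuster
demo = pd.Series([10.0] * 29 + [1000.0])
cleaned = drop_zscore_outliers(demo)
print(len(demo), "->", len(cleaned))
```

Because the z-scores are computed once from the original group, this removes outliers in a single pass; records that only look extreme after the first removal are kept.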

In [None]:
# Remove significant outliers (|z-score| > 3) in a single pass
for i in groups:
    zscores= stats.zscore(data[i])
    outliers = np.abs(zscores)>3
    num_out =np.sum(outliers)
    if num_out>0:
        print(f"remove {num_out} outliers from the {i} group of {len(data[i])} records")
        data[i] = data[i][~outliers]
        print(f'[!] now [{len(data[i])}] records are left')

In [None]:
## Run the normality test on each group and record n to confirm there are >20 in each group
norm_results = {}
for i in groups:
    stat, p = stats.normaltest(data[i])
    ## save n, the p value, and the test statistic; sig=True rejects the null hypothesis that the sample comes from a normal distribution
    norm_results[i] = {'n': len(data[i]), 'p value':p, 'test stat':stat, 'sig': p < .05 }
## convert to a dataframe
norm_results_df = pd.DataFrame(norm_results).T
norm_results_df

#### Although the normality assumption is NOT met for R/PG/PG-13, we can still proceed as long as each group is large enough (n > 20).
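
The sample-size condition is easier to verify when the group size sits next to the normality result. The following is a minimal sketch on synthetic data; `normality_table` and the `demo` groups are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd
from scipy import stats


def normality_table(samples: dict, alpha: float = 0.05) -> pd.DataFrame:
    """Run scipy's normaltest per group and record the group size."""
    rows = {}
    for name, values in samples.items():
        stat, p = stats.normaltest(values)
        rows[name] = {'n': len(values), 'test stat': stat,
                      'p value': p, 'sig': bool(p < alpha)}
    return pd.DataFrame.from_dict(rows, orient='index')


# Hypothetical data: one roughly normal group and one strongly right-skewed group
rng = np.random.default_rng(0)
demo = {
    'normal-ish': rng.normal(loc=100, scale=15, size=200),
    'skewed': rng.exponential(scale=100, size=200),
}
table = normality_table(demo)
print(table)
```

A row with `sig=True` and a large `n` is the situation described above: normality is rejected, but the group is large enough to proceed.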

In [None]:
# Test the assumption of equal variance; the * operator unpacks the groups into separate arguments
stats.levene(*data.values())

#### We DO NOT meet the assumption of equal variance, so we will not rely on the One-Way ANOVA test. Instead, we can use a non-parametric equivalent of the ANOVA.
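
The decision described here can be written as a small branch on the Levene p-value. This is a simplified sketch, not the notebook's own code: `compare_groups` is a hypothetical helper, the data are synthetic, and a fuller workflow would also weigh normality and group sizes.

```python
import numpy as np
from scipy import stats


def compare_groups(samples, alpha=0.05):
    """Pick one-way ANOVA or Kruskal-Wallis based on Levene's test."""
    levene_p = stats.levene(*samples).pvalue
    if levene_p < alpha:
        # Equal variance is rejected: fall back to the rank-based test
        return 'kruskal', stats.kruskal(*samples)
    return 'f_oneway', stats.f_oneway(*samples)


# Hypothetical groups with clearly different spreads
rng = np.random.default_rng(1)
demo = [
    rng.normal(loc=50, scale=1, size=100),
    rng.normal(loc=50, scale=10, size=100),
    rng.normal(loc=60, scale=30, size=100),
]
test_name, result = compare_groups(demo)
print(test_name, result.pvalue)
```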

In [None]:
# Compute the Kruskal-Wallis H-test
stats.kruskal(*data.values())

In [None]:
# Perform the Alexander-Govern test, which does not assume equal variances
stats.alexandergovern(*data.values())

In [None]:
# Run the One-Way ANOVA test for comparison only (its equal-variance assumption is not met)
stats.f_oneway(*data.values())

#### The result is statistically significant. We reject the null hypothesis and support the alternative hypothesis that ```the MPAA rating of a movie does affect how much revenue the movie generates.```
Tukey's pairwise test will compare every group against every other group.
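
As an aside, when the omnibus test is rank-based, one possible alternative to Tukey's test is a set of pairwise Mann-Whitney U tests with a Bonferroni correction. This is a swapped-in technique shown only as a sketch: it is not what this notebook runs, and `pairwise_mannwhitney` and the `demo` groups are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats


def pairwise_mannwhitney(samples: dict, alpha: float = 0.05) -> pd.DataFrame:
    """Mann-Whitney U test for every pair of groups, Bonferroni-adjusted."""
    pairs = list(combinations(samples, 2))
    rows = []
    for a, b in pairs:
        p = stats.mannwhitneyu(samples[a], samples[b],
                               alternative='two-sided').pvalue
        # Bonferroni: scale each p-value by the number of comparisons
        p_adj = min(p * len(pairs), 1.0)
        rows.append({'group1': a, 'group2': b,
                     'p-adj': p_adj, 'reject': bool(p_adj < alpha)})
    return pd.DataFrame(rows)


# Hypothetical groups: A and B share a distribution, C is shifted upward
rng = np.random.default_rng(2)
demo = {
    'A': rng.normal(loc=0, scale=1, size=50),
    'B': rng.normal(loc=0, scale=1, size=50),
    'C': rng.normal(loc=5, scale=1, size=50),
}
pairwise = pairwise_mannwhitney(demo)
print(pairwise)
```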

In [None]:
# Combine the cleaned groups into one long-format DataFrame
newdf = pd.concat(
    [pd.DataFrame({'certification': i, 'revenue': data[i]}) for i in groups],
    ignore_index=True)
newdf

In [None]:
## perform tukey's multiple comparison test and display the summary
from statsmodels.stats.multicomp import pairwise_tukeyhsd
values = newdf['revenue']
labels = newdf['certification']
tukeys_results = pairwise_tukeyhsd(values,labels)
tukeys_results.summary()

#### We see that there is a significant difference in revenue between 'R' and each of the other three ratings.
Next, let's prepare a visualization to see which rating earns the most revenue: ```R earns the least and PG may earn the most```
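
The ranking can also be checked numerically before plotting. The sketch below uses a tiny made-up table (the `demo` values are hypothetical, not the notebook's data) and ranks the groups by mean, which is the statistic seaborn's `barplot` draws by default.

```python
import pandas as pd

# Hypothetical long-format data with the same column names as newdf
demo = pd.DataFrame({
    'certification': ['G', 'G', 'PG', 'PG', 'R', 'R'],
    'revenue': [100, 300, 400, 600, 50, 150],
})

# Count, mean, and median revenue per rating, highest mean first
summary = (demo.groupby('certification')['revenue']
               .agg(['count', 'mean', 'median'])
               .sort_values('mean', ascending=False))
print(summary)
```

Reporting the median next to the mean is a useful cross-check here, since revenue is strongly skewed and a few large films can pull the mean upward.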

In [None]:
from matplotlib.ticker import FormatStrFormatter, StrMethodFormatter
ax=sns.barplot(data=newdf, x='certification', y='revenue', palette="viridis")
plt.xlabel("MPAA rating", fontsize = 16, weight='bold')
plt.xticks(weight='bold')
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
ax.set_ylabel('Revenue ($)',fontweight='bold',fontsize=14);

2. Do movies that are over 2 hours long earn more revenue than movies that are 1.5 hours long (or less)?
   - Null Hypothesis: The two groups have the same average revenue.
   - For this two-sample t-test, our alpha value is 0.05.
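
Because the question is directional ("earn more"), a one-sided test is one option worth knowing about. The sketch below compares the two-sided and one-sided forms of Welch's t-test on synthetic data; the `long_rev` and `short_rev` arrays are hypothetical, and the `alternative` argument assumes SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats

# Hypothetical revenues (arbitrary units) for long and short movies
rng = np.random.default_rng(3)
long_rev = rng.normal(loc=150, scale=40, size=80)
short_rev = rng.normal(loc=100, scale=20, size=80)

# Two-sided: are the means different?
two_sided = stats.ttest_ind(long_rev, short_rev, equal_var=False)
# One-sided: is the mean of the first group greater than the second?
one_sided = stats.ttest_ind(long_rev, short_rev, equal_var=False,
                            alternative='greater')
print(two_sided.pvalue, one_sided.pvalue)
```

When the test statistic is positive, the one-sided p-value is half the two-sided one, so the choice between them should be made before looking at the data.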

In [None]:
q = """
SELECT revenue, runtime FROM tmdb_data
JOIN title_basics on tconst=imdb_id
WHERE revenue >0;
"""
# Pass the query through the text function before running read_sql
df=pd.read_sql(text(q), conn)
df

In [None]:
# Split the revenue into two groups: long (>= 120 min) and short (<= 90 min) runtimes
data={}
groups=('long','short')
data['long'] =df.loc[df['runtime']>=120,'revenue'].copy()
data['short'] =df.loc[df['runtime']<=90,'revenue'].copy()

In [None]:
# Remove significant outliers (|z-score| > 3) in a single pass
for i in groups:
    zscores= stats.zscore(data[i])
    outliers = np.abs(zscores)>3
    num_out =np.sum(outliers)
    if num_out>0:
        print(f"remove {num_out} outliers from the {i} group of {len(data[i])} records")
        data[i] = data[i][~outliers]
        print(f'[!] now [{len(data[i])}] records are left')

In [None]:
## Run the normality test on each group and record n to confirm there are >20 in each group
norm_results = {}
for i in groups:
    stat, p = stats.normaltest(data[i])
    ## save n, the test statistic, and the p value; sig=True rejects the null hypothesis that the sample comes from a normal distribution
    norm_results[i] = {'n': len(data[i]), 'test stat':stat, 'sig': p < .05, 'p value':p}
## convert to a dataframe
norm_results_df = pd.DataFrame(norm_results).T
norm_results_df

In [None]:
# Test the assumption of equal variance; the * operator unpacks the groups into separate arguments
stats.levene(*data.values())

#### Looks like we don't have equal variances, but that won't stop us! 

In [None]:
## We just need to be sure to include "equal_var = False" when we perform our t-test.
result = stats.ttest_ind(*data.values(), equal_var = False)
print(f"Significant: {result.pvalue <.05}")
result

#### We see that there is a significant difference in revenue between movies that are over 2 hours long and movies that are 1.5 hours long (or less).
Next, let's prepare a visualization to see which group earns more revenue: movies that are over 2 hours long.

In [None]:
# Creating DataFrame by passing Dictionary
result1 = pd.DataFrame({'runtime': '>= 120 min', 'revenue': data['long']})
result2 = pd.DataFrame({'runtime': '<= 90 min', 'revenue': data['short']})
result = pd.concat([result1, result2], ignore_index=True)
result

In [None]:
ax=sns.barplot(data=result, x='runtime', y='revenue')
plt.xlabel("runtime", fontsize = 16, weight='bold')
plt.xticks(weight='bold')
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
ax.set_ylabel('Revenue ($)',fontweight='bold',fontsize=14);

3. Do some movie genres earn more revenue than others?
   - Null Hypothesis: The genre of a movie does not affect how much revenue the movie generates.
   - Set our alpha to 0.05.
   - If parametric: one-way ANOVA and/or post hoc tests.

In [None]:
q = """
SELECT revenue, genre_name FROM tmdb_data
JOIN title_genres on tconst=imdb_id
JOIN genres on title_genres.genre_id=genres.genre_id
WHERE revenue >0;
"""
# Pass the query through the text function before running read_sql
df=pd.read_sql(text(q), conn)
df

In [None]:
# Display the unique values and their counts for this column
df['genre_name'].value_counts()

In [None]:
df['genre_name'].value_counts().index

In [None]:
groups = ['Drama', 'Comedy', 'Action', 'Romance', 'Crime', 'Adventure',
       'Thriller', 'Mystery', 'Horror', 'Fantasy', 'Family', 'Animation',
       'Sci-Fi']
data={}
for i in groups:
    ## Get the revenue series for this group
    data[i] = df.loc[df['genre_name']==i,'revenue'].copy()

In [None]:
# Remove significant outliers (|z-score| > 3) in a single pass and print how many records are left in each affected group
for i in groups:
    zscores= stats.zscore(data[i])
    outliers = np.abs(zscores)>3
    num_out =np.sum(outliers)
    if num_out>0:
        print(f"remove {num_out} outliers from the {i} group of {len(data[i])} records")
        data[i] = data[i][~outliers]
        print(f'[!] now [{len(data[i])}] records are left')

In [None]:
## Run the normality test on each group and record n to confirm there are >20 in each group
norm_results = {}
for i in groups:
    stat, p = stats.normaltest(data[i])
    ## save n, the p value, and the test statistic; sig=True rejects the null hypothesis that the sample comes from a normal distribution
    norm_results[i] = {'n': len(data[i]), 'p value':p, 'test stat':stat, 'sig': p < .05 }
## convert to a dataframe
norm_results_df = pd.DataFrame(norm_results).T
norm_results_df

#### We have large enough groups (each n>20) that we can safely disregard the assumption of normality, even though these groups do NOT come from normal distributions

In [None]:
# Test the assumption of equal variance; the * operator unpacks the groups into separate arguments
stats.levene(*data.values())

In [None]:
# Compute the Kruskal-Wallis H-test because the assumption of equal variance is not met
stats.kruskal(*data.values())

In [None]:
# Combine the cleaned groups into one long-format DataFrame
newdf = pd.concat(
    [pd.DataFrame({'genre_name': i, 'revenue': data[i]}) for i in groups],
    ignore_index=True)
newdf

In [None]:
values = newdf['revenue']
labels = newdf['genre_name']
# Perform tukey's multiple comparison test and display the summary
tukeys_results = pairwise_tukeyhsd(values,labels)
tukeys_results.summary()

#### The result is statistically significant, meaning the genre of a movie does affect how much revenue the movie generates.
A supporting visualization helps display the result: ```Adventure, Animation, Sci-Fi, Fantasy are the top 4 genres```
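
With 13 genres, Tukey's test produces 78 pairwise rows, so it can help to keep only the significant ones. The sketch below shows one way to do that on synthetic data; the `demo` frame and its three groups are hypothetical, and converting the summary through its `data` attribute is a common idiom rather than the only route.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical data: groups A and B are alike, group C is clearly higher
rng = np.random.default_rng(4)
demo = pd.DataFrame({
    'genre_name': np.repeat(['A', 'B', 'C'], 40),
    'revenue': np.concatenate([
        rng.normal(0, 1, 40),
        rng.normal(0, 1, 40),
        rng.normal(10, 1, 40),
    ]),
})

res = pairwise_tukeyhsd(demo['revenue'], demo['genre_name'])
summary = res.summary()
# The first row of summary.data holds the column headers
table = pd.DataFrame(summary.data[1:], columns=summary.data[0])
# res.reject is a boolean array with one entry per pair
significant = table[res.reject]
print(significant[['group1', 'group2', 'meandiff', 'p-adj']])
```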

In [None]:
ax=sns.barplot(data=newdf, x='genre_name', y='revenue')
plt.xlabel("movie genres", fontsize = 16, weight='bold')
plt.xticks(weight='bold')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
ax.set_ylabel('Revenue ($)',fontweight='bold',fontsize=14);