### A Look into Freaky Franchise's Rotten Tomatoes Competition

Podcasts are all the rage, especially in these weird times when many people have more free time than they did in the past. Today we are going to look at data gathered from one particular podcast, Freaky Franchise, where the hosts "unmask horror movies based on quantity over quality." I strongly suggest checking out [Freaky Franchise](http://freakyfranchise.com/about) if you are into horror movies.

In the first part of each episode, the two hosts have a friendly competition in which they guess the Rotten Tomatoes scores of the movies they are discussing; the loser has to sum the movie up in under a minute. In this post, we are going to look at the data surrounding this competition to see if we can predict which host will "win" on any particular episode.

### First, we need to load in the data set and see what we are working with

In [None]:
import pandas as pd
ff_data = pd.read_csv('Freaky_Franchise_data.csv')
ff_data

In [None]:
# Let's take a look at the data type of the columns and where we have null values
ff_data.info()

From this summary we can see a few things we will have to do to the DataFrame before we start using it for statistics.
- We can reset the index to the episode number.
- There appears to be a second table at the bottom that we should remove before continuing.
- Cordie, Theo, Difference in scores, and RT Score are listed as objects, but we will need them as floats or integers.
- In the same vein, we may want to convert Date Aired to DateTime.
- There are some null values that we will have to deal with.
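
The Date Aired conversion mentioned in the list is not carried out later in this post, so here is a minimal sketch of how it could look. The toy rows and the `%m/%d/%Y` format are assumptions for illustration, not the real contents of the sheet.

```python
import pandas as pd

# Toy stand-in for the real sheet; these date strings and their format are assumptions
toy = pd.DataFrame({
    "#": [1, 2, 3],
    "Date Aired": ["10/01/2019", "10/08/2019", "not aired"],
})

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising
toy["Date Aired"] = pd.to_datetime(
    toy["Date Aired"], format="%m/%d/%Y", errors="coerce"
)

print(toy.dtypes)
print(toy)
```

Any row that comes back as NaT is worth inspecting by hand before deciding whether to drop or repair it.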


### Scrubbing the data for modeling

#### First, we are going to drop the extra table at the bottom

In [None]:
# Drop the rows without an episode number by keeping only the rows where
# the '#' column is not empty.
ff_data = ff_data[ff_data['#'].notna()]
ff_data.tail()

#### Next, we can set the index to the episode number

In [None]:
# Reassign instead of using inplace=True, since ff_data came from a filtered selection
ff_data = ff_data.set_index("#")
ff_data

#### Let's look at the null values and decide what to do with them

In [None]:
ff_data.isnull().sum()

Four of these columns have the same number of null values. This could be a coincidence, or the null values could fall in the same rows. We should look deeper, since that will help us decide how to deal with them.
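
One quick way to test the "same rows" idea is to compare how many rows have a null in any of the columns with how many have a null in all of them. The sketch below uses a made-up toy DataFrame (not the real episode data); if the two counts match, the nulls line up row for row.

```python
import numpy as np
import pandas as pd

# Toy data: two rows are missing every score, and Notes is mostly empty
toy = pd.DataFrame({
    "Cordie": [50, np.nan, 30, np.nan],
    "Theo": [40, np.nan, 35, np.nan],
    "RT Score": [45, np.nan, 20, np.nan],
    "Notes": [np.nan, "Retrospective", np.nan, np.nan],
})

score_cols = ["Cordie", "Theo", "RT Score"]
null_mask = toy[score_cols].isnull()

rows_any = null_mask.any(axis=1).sum()  # rows with at least one missing score
rows_all = null_mask.all(axis=1).sum()  # rows where every score is missing

print(rows_any, rows_all)
```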

In [None]:
# First we are going to just look at rows that have null values
ff_data[ff_data.isnull().any(axis=1)]

In [None]:
# This produced more rows than we wanted because the Notes column also contains nulls.
# To check whether the 12 nulls in each column fall on the same rows,
# we are going to create a new df without Notes
no_notes = ff_data.drop(columns='Notes')
no_notes.head()

In [None]:
# run the same code again with no_notes to see all rows with null values
no_notes[no_notes.isnull().any(axis=1)]

We can see that, as we suspected, the 12 null values all fall on the same rows. These episodes are mostly retrospectives and specials, which we can guess (and I can confirm from listening to them) did not include the competition. Since the main thing we are looking at in this post is the Rotten Tomatoes competition, we can safely drop these rows without losing relevant data.

In [None]:
# Using the same method we used to remove the extra table
# Since the null values fall across the row, we just need to choose one column
ff_data = ff_data[ff_data['Cordie'].notna()]
ff_data.head()

In [None]:
# Let's look at ff_data.info() again to check it worked
ff_data.info()

#### Now that we have the data we will be working with, we need to convert it into a format we can work with

In [None]:
# Convert the score columns to floats

# First, create a list of the column names that we need to convert
columns = ['Cordie', 'Theo', 'Difference in scores', 'RT Score']

# Work on an explicit copy so pandas does not raise SettingWithCopyWarning
ff_data = ff_data.copy()

# Loop through the columns; errors='coerce' turns non-numeric entries into NaN
for x in columns:
    ff_data[x] = pd.to_numeric(ff_data[x], errors='coerce')

In [None]:
ff_data.info()

Here we can see that 'Theo', 'Difference in scores', and 'RT Score' each have one fewer non-null value than before. That most likely means there was a non-numeric filler, which was converted to a null value when we coerced the errors. We will need to check for null values again and decide what to do with them.
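
To make the effect of `errors='coerce'` concrete, here is a small sketch on a made-up Series. The `"?"` filler is an assumption; the real sheet may use a different placeholder.

```python
import pandas as pd

# Toy column of scores stored as text, with one non-numeric filler value
raw = pd.Series(["85", "12", "?", "40"])

# Anything that cannot be parsed as a number becomes NaN
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present before the conversion but are null afterwards
lost = raw[converted.isna() & raw.notna()]

print(converted)
print(lost)
```

Looking at `lost` before filling anything in shows exactly which entries were coerced, rather than inferring it from the non-null counts.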

In [None]:
# Checking again for nulls using the same method as above
no_notes = ff_data.copy()
no_notes.drop(labels='Notes', axis=1, inplace=True)
no_notes[no_notes.isnull().any(axis=1)]

It looks like there is one episode where Theo's guess is not listed, and thus the difference is not listed either, and another episode where no Rotten Tomatoes score is listed. Both of these episodes have winners, so we shouldn't drop them outright. Since there are just three null values, we are going to replace them with probable values using the other data we have.

In [None]:
# Cordie won the Sleepaway Camp IV episode with a guess of zero, and a simple search
# shows that the movie does not have an RT score, so we will replace the null with 0
ff_data['RT Score'] = ff_data['RT Score'].fillna(0)

# For Jason Lives, we know Theo won, so we will fill his guess with the RT Score
# Then fill the difference with the gap between his guess and Cordie's
ff_data['Theo'] = ff_data['Theo'].fillna(ff_data['RT Score'])
ff_data['Difference in scores'] = ff_data['Difference in scores'].fillna(
                                   abs(ff_data['Cordie'] - ff_data['Theo']))
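
Column-wide `fillna` works here because each column has only one null. If more nulls ever show up, a row-targeted fill with `.loc` is a safer option. The sketch below uses made-up episode numbers, titles, and scores, and the `Movie` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data indexed by episode number; all values here are invented for illustration
toy = pd.DataFrame(
    {
        "Movie": ["Movie A", "Movie B", "Movie C"],  # hypothetical column name
        "Cordie": [0.0, 60.0, 30.0],
        "Theo": [10.0, np.nan, 25.0],
        "Difference in scores": [10.0, np.nan, 5.0],
        "RT Score": [np.nan, 52.0, 20.0],
    },
    index=pd.Index([1, 2, 3], name="#"),
)

# Episode 1: the movie has no score, so record it as 0
toy.loc[1, "RT Score"] = 0

# Episode 2: fill the missing guess with the actual score, then recompute the gap
toy.loc[2, "Theo"] = toy.loc[2, "RT Score"]
toy.loc[2, "Difference in scores"] = abs(toy.loc[2, "Cordie"] - toy.loc[2, "Theo"])

print(toy)
```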

In [None]:
# Check for nulls once again
ff_data.info()

In [None]:
ff_data.describe()

#### One last thing we will do before we start running tests and models is create boolean columns for who went first and who won, using one-hot encoding.

In [None]:
ff_data.columns = ff_data.columns.str.replace(' ', '_')
ff_data.head()

In [None]:
ff_data

In [None]:
# We are just going to keep the columns with data that will affect the model
feats = ['Cordie','Theo','Difference_in_scores','RT_Score','Goes_First', 'Winner']
ff_data = ff_data[feats]
ff_data = pd.get_dummies(ff_data, drop_first=True)
ff_data
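
As a small illustration of what `drop_first=True` does, here is a sketch on made-up rows. With two hosts, each text column collapses into a single indicator column for Theo, because the alphabetically first category (Cordie) is dropped.

```python
import pandas as pd

# Toy rows; the values are invented for illustration
toy = pd.DataFrame({
    "Goes_First": ["Cordie", "Theo", "Theo"],
    "Winner": ["Theo", "Cordie", "Theo"],
})

# One indicator column per remaining category; the first category is dropped
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded.astype(int))
```

Dropping the first category avoids two perfectly redundant columns (Winner_Cordie is always the opposite of Winner_Theo).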

### Exploring the Data

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

In [None]:
ff_data.describe()

In [None]:
ff_data.hist(figsize = (20,18));

In [None]:
import scipy.stats as stats

In [None]:
feats = ['Cordie','Theo','Difference_in_scores','RT_Score','Goes_First_Theo']
corr = ff_data[feats].corr()
corr
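
For readers who want to see what `.corr()` is computing, here is a sketch that works out Pearson's r by hand on made-up numbers and compares it with the pandas result. The toy values are not taken from the podcast data.

```python
import numpy as np
import pandas as pd

# Toy guesses and scores, invented for illustration
toy = pd.DataFrame({
    "Cordie": [50.0, 20.0, 70.0, 35.0, 10.0],
    "RT_Score": [45.0, 30.0, 80.0, 25.0, 5.0],
})

x, y = toy["Cordie"], toy["RT_Score"]
dx, dy = x - x.mean(), y - y.mean()

# Pearson's r: the sum of products of deviations, scaled by both spreads
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses the Pearson method by default
r_pandas = toy.corr().loc["Cordie", "RT_Score"]

print(r_manual, r_pandas)
```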

In [None]:
fig, ax = plt.subplots(figsize=(7,5))
sns.heatmap(corr, center=0, annot=True)
# Work around a matplotlib 3.1.1 issue that cuts off the top and bottom heatmap rows
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5) ;

In [None]:
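# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.jointplot(x='Goes_First_Theo', y='Winner_Theo', data=ff_data, kind='reg')
# JointGrid.annotate was removed from seaborn, so report Pearson's r separately
r, p = stats.pearsonr(ff_data['Goes_First_Theo'].astype(float), ff_data['Winner_Theo'].astype(float))
print(f"pearsonr = {r:.2f}; p = {p:.3f}")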

In [None]:
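# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.jointplot(x='Difference_in_scores', y='Winner_Theo', data=ff_data, kind='reg')
# JointGrid.annotate was removed from seaborn, so report Pearson's r separately
r, p = stats.pearsonr(ff_data['Difference_in_scores'], ff_data['Winner_Theo'].astype(float))
print(f"pearsonr = {r:.2f}; p = {p:.3f}")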

## Finally, we can model the data and see which variables are related to winning

In [None]:
# Define the problem
outcome = 'Winner_Theo'
x_cols = list(ff_data.columns)
x_cols.remove(outcome)

In [None]:
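# Some brief preprocessing: make the column names formula-safe,
# then standardize each predictor to mean 0 and standard deviation 1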
ff_data.columns = [col.replace(' ', '_') for col in ff_data.columns]
for col in x_cols:
    ff_data[col] = (ff_data[col] - ff_data[col].mean())/ff_data[col].std()
ff_data.head()
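
The loop above is a z-score standardization. Here is a sketch of the same idea on made-up numbers, with a quick check that each column ends up with mean 0 and standard deviation 1. Note that pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Toy predictors, invented for illustration
toy = pd.DataFrame({
    "Cordie": [50.0, 20.0, 70.0, 35.0],
    "Theo": [40.0, 25.0, 60.0, 30.0],
})

# Subtract each column's mean and divide by its sample standard deviation
standardized = (toy - toy.mean()) / toy.std()

print(standardized.mean().round(10))
print(standardized.std().round(10))
```

Standardizing puts the coefficients on a common scale, so each one reads as the change in the outcome per one standard deviation of that predictor.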

In [None]:
from statsmodels.formula.api import ols

In [None]:
# Fitting the actual model
predictors = '+'.join(x_cols)
formula = outcome + '~' + predictors
model = ols(formula=formula, data=ff_data).fit()
model.summary()