# Box Office Mojo Regression Analysis

In this section, we run our regression analysis.

First, we'll clean the data into a format that's workable for regression modeling.

Next, we'll establish baseline models by regressing the dependent variable on each independent variable alone.

After that, we'll use forward selection to choose our best model.

Finally, we'll put all the variables together into a LASSO-regularized model.
Because LASSO tends to find sparse solutions, driving many coefficients to zero, we can use it to do the feature selection for us and to see which features the regularized model prefers.
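
To illustrate why LASSO produces sparse solutions, consider the special case of an orthonormal design matrix. There, each LASSO coefficient is the soft-thresholded OLS coefficient, while a ridge penalty only shrinks each coefficient proportionally. The sketch below assumes that special case; the function names, the example coefficients, and the penalty value are hypothetical and are not taken from this analysis.

```python
def soft_threshold(b, lam):
    """LASSO estimate for one coefficient, assuming an orthonormal design."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are set exactly to zero


def ridge_shrink(b, lam):
    """Ridge estimate for one coefficient, assuming an orthonormal design."""
    return b / (1.0 + lam)  # shrinks toward zero but never reaches it


# Hypothetical OLS coefficients and penalty, for illustration only.
ols_coefs = [2.5, -0.3, 0.1, -1.2]
lam = 0.5

lasso_coefs = [soft_threshold(b, lam) for b in ols_coefs]
ridge_coefs = [ridge_shrink(b, lam) for b in ols_coefs]

n_zero_lasso = sum(c == 0.0 for c in lasso_coefs)
n_zero_ridge = sum(c == 0.0 for c in ridge_coefs)
print("lasso:", lasso_coefs, "zeros:", n_zero_lasso)
print("ridge:", ridge_coefs, "zeros:", n_zero_ridge)
```

In this toy case the two small coefficients are dropped entirely under the L1 penalty, which is the behavior we rely on for feature selection.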

In [1]:
import os
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import luther_util as lu

## Cleaning

In [2]:
# Pick the last matching preprocessed file in sorted order.
fname = sorted([x for x in os.listdir('data')
                if re.match('box_office_mojo_pp', x)])[-1]
df = (pd.read_csv(os.path.join('data', fname))
      .set_index('title')
      # pd.to_datetime avoids the unit-less 'datetime64' cast,
      # which pandas 2.x no longer accepts.
      .assign(release_date=lambda x: pd.to_datetime(x.release_date),
              release_month=lambda x: x.release_date.dt.month,
              release_year=lambda x: x.release_date.dt.year,
              log_gross=lambda x: np.log(x.domestic_total_gross),
              # ROI = gross / budget - 1, so breaking even gives 0
              roi=lambda x: x.domestic_total_gross.div(x.budget) - 1)
      .query('roi < 15'))  # filter out ROI outliers
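
As a small worked illustration of the derived columns above: ROI is the domestic gross divided by the budget, minus one, and rows with an ROI of 15 or more are dropped as outliers. The film names and dollar figures below are made up for the example and do not come from the dataset.

```python
import math


def roi(domestic_total_gross, budget):
    """Return on investment: 0.0 means the film exactly earned back its budget."""
    return domestic_total_gross / budget - 1


# Hypothetical (gross, budget) pairs in dollars, for illustration only.
films = {
    "film_a": (150e6, 50e6),
    "film_b": (40e6, 40e6),
    "film_c": (10e6, 25e6),
    "film_d": (40e6, 2e6),
}

rois = {name: roi(gross, budget) for name, (gross, budget) in films.items()}
log_gross = {name: math.log(gross) for name, (gross, _) in films.items()}

# Same outlier rule as the query above: keep only ROI < 15.
kept = {name for name, value in rois.items() if value < 15}

for name in films:
    print(name, round(rois[name], 2), round(log_gross[name], 2), name in kept)
```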

## Baseline Models

In [3]:
independents = [
  'budget', 
  'domestic_total_gross', 
  'open_wkend_gross',
  'runtime',
  'widest_release', 
  'in_release_days',
  ['rating[T.PG]', 'rating[T.PG-13]', 'rating[T.R]'], 
  ['release_month', 'release_year']]

In [4]:
def results_df(results):
    cols = ['features', 'degree', 'training_r2', 'test_r2', 'mse']
    return (pd.DataFrame(results)
            .reindex(columns=cols)
            .assign(rsme=lambda x: np.sqrt(x.mse))
            .sort_values(['mse', 'test_r2'], ascending=[True, False]))

In [5]:
results = list()
for variable in independents:
  if isinstance(variable, list):
    X = df.loc[:, variable]
    y = df.loc[:, 'roi']
    lr = LinearRegression()
    lu.log_model(results, lr, X, y, variable)
  else:
    X = df.loc[:, variable].values.reshape(-1, 1)
    y = df.loc[:, 'roi']
    for degree in range(1, 4):
      if degree == 1:
        lr = LinearRegression()
        lu.log_model(results, lr, X, y, variable)
      else:
        lr = Pipeline([('poly', PolynomialFeatures(degree)), 
                       ('regr', LinearRegression())])
        lu.log_model(results, lr, X, y, variable, degree)
# Let's also add a bias model
X = np.ones((df.shape[0], 1))
y = df.loc[:, 'roi']
lr = LinearRegression(fit_intercept=False)
lu.log_model(results, lr, X, y, 'bias')
results_df(results)

| | features | degree | training_r2 | test_r2 | mse | rsme |
|---|---|---|---|---|---|---|
| 1 | budget | 2 | 0.298164 | 0.221162 | 2.751055 | 1.658631 |
| 2 | budget | 3 | 0.162094 | 0.142412 | 3.235131 | 1.798647 |
| 0 | budget | 1 | 0.159838 | 0.107614 | 3.261668 | 1.806009 |
| 15 | in_release_days | 1 | 0.029881 | -0.02843 | 3.764404 | 1.940207 |
| 18 | [rating[T.PG], rating[T.PG-13], rating[T.R]] | 1 | 0.02998 | -0.044607 | 3.765328 | 1.940445 |
| 3 | domestic_total_gross | 1 | 0.022591 | -0.004223 | 3.769902 | 1.941624 |
| 17 | in_release_days | 3 | 0.03529 | -0.049908 | 3.772707 | 1.942346 |
| 16 | in_release_days | 2 | 0.035227 | -0.060211 | 3.776756 | 1.943388 |
| 4 | domestic_total_gross | 2 | 0.02754 | -0.025451 | 3.77706 | 1.943466 |
| 13 | widest_release | 2 | 0.022554 | -0.012092 | 3.80883 | 1.951622 |



From the above, we can see that our best baseline model uses _Budget_, with the degree-2 polynomial ranking first.

The number of _Days In Release_ is our second best predictor by MSE, although its test R² is slightly negative.

_Rating_ is our third best predictor, again with a slightly negative test R².

Our budget model isn't great, explaining only about 22% of the variance in the dependent variable on the test set, but let's take a look at it anyway.

At the very least, most of these models are better than our bias term model.
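
To make the comparison with the bias model concrete, the sketch below computes MSE, RMSE, and R² by hand for a model that always predicts the training mean. The ROI values are hypothetical, and the helper names are ours rather than functions from `luther_util`. It shows that such a model scores an R² of exactly 0 on the training data and can score below 0 on held-out data, which is how to read the negative `test_r2` values in the table.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical ROI values, for illustration only.
y_train = [0.5, 1.5, -0.5, 2.5]
y_test = [0.0, 1.0, 3.0]

# The bias model predicts the training mean for every film.
bias = sum(y_train) / len(y_train)

train_r2 = r2(y_train, [bias] * len(y_train))
test_r2 = r2(y_test, [bias] * len(y_test))
test_mse = mse(y_test, [bias] * len(y_test))
test_rmse = math.sqrt(test_mse)

print(train_r2, test_r2, test_mse, test_rmse)
```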

## Forward Selection
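
The cells in this section score every combination of a fixed number of variables. For comparison, the greedy procedure usually meant by forward selection starts from an empty set and, at each step, adds the single candidate that most improves the score, stopping when nothing helps. The sketch below is a minimal version under stated assumptions: `forward_select`, `toy_score`, and the gain values are hypothetical, and in practice the score would be something like a held-out R² from a fitted model.

```python
def forward_select(candidates, score, max_features=None):
    """Greedily add the candidate that most improves score(selected)."""
    selected = []
    remaining = list(candidates)
    best_score = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        # Score each remaining candidate when added to the current set.
        trials = [(score(selected + [c]), c) for c in remaining]
        top_score, top_candidate = max(trials, key=lambda t: t[0])
        if top_score <= best_score:
            break  # no candidate improves the score, so stop
        selected.append(top_candidate)
        remaining.remove(top_candidate)
        best_score = top_score
    return selected, best_score


# Toy additive score with made-up gains, for illustration only.
toy_gain = {"x1": 0.20, "x2": 0.10, "x3": -0.02}


def toy_score(features):
    return sum(toy_gain[f] for f in features)


selected, best = forward_select(toy_gain, toy_score)
print(selected, round(best, 2))
```

Compared with enumerating all combinations, the greedy search fits far fewer models, at the cost of possibly missing a better subset that it never visits.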

In [7]:
import itertools
import functools
import operator

number_independents = 2

independents = [x if isinstance(x, list) else [x] for x in independents]
combs = list(itertools.combinations(independents, number_independents))
variables_list = [functools.reduce(operator.iconcat, x, []) for x in combs]
for variables in variables_list:
    X = df.loc[:, variables]
    y = df.loc[:, 'roi']
    lr = LinearRegression()
    lu.log_model(results, lr, X, y, variables)
    
print(results_df(results).head(5).to_html(index=False))

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>features</th>
      <th>degree</th>
      <th>training_r2</th>
      <th>test_r2</th>
      <th>mse</th>
      <th>rsme</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>[budget, domestic_total_gross]</td>
      <td>1</td>
      <td>0.418192</td>
      <td>0.352395</td>
      <td>2.293047</td>
      <td>1.514281</td>
    </tr>
    <tr>
      <td>[budget, open_wkend_gross]</td>
      <td>1</td>
      <td>0.339812</td>
      <td>0.299092</td>
      <td>2.591965</td>
      <td>1.609958</td>
    </tr>
    <tr>
      <td>budget</td>
      <td>2</td>
      <td>0.298164</td>
      <td>0.221162</td>
      <td>2.751055</td>
      <td>1.658631</td>
    </tr>
    <tr>
      <td>[budget, in_release_days]</td>
      <td>1</td>
      <td>0.243988</td>
      <td>0.128253</td>
      <td>2.988644</td>
      <td>1.728770</td>
    </tr>
    <tr>
      <td>[budget, widest_release]</td>
      <td>1</td>
    </tr>
  </tbody>
</table>

In [9]:
number_independents = 3

independents = [x if isinstance(x, list) else [x] for x in independents]
combs = list(itertools.combinations(independents, number_independents))
variables_list = [functools.reduce(operator.iconcat, x, []) for x in combs]
for variables in variables_list:
    X = df.loc[:, variables]
    y = df.loc[:, 'roi']
    lr = LinearRegression()
    lu.log_model(results, lr, X, y, variables)
    
results_df(results).head(5)

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>features</th>
      <th>degree</th>
      <th>training_r2</th>
      <th>test_r2</th>
      <th>mse</th>
      <th>rsme</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>[budget, domestic_total_gross, release_month, release_year]</td>
      <td>1</td>
      <td>0.422910</td>
      <td>0.346552</td>
      <td>2.281242</td>
      <td>1.510378</td>
    </tr>
    <tr>
      <td>[budget, domestic_total_gross, release_month, release_year]</td>
      <td>1</td>
      <td>0.422910</td>
      <td>0.346552</td>
      <td>2.281242</td>
      <td>1.510378</td>
    </tr>
    <tr>
      <td>[budget, domestic_total_gross, in_release_days]</td>
      <td>1</td>
      <td>0.422280</td>
      <td>0.354251</td>
      <td>2.289236</td>
      <td>1.513022</td>
    </tr>
    <tr>
      <td>[budget, domestic_total_gross, in_release_days]</td>
      <td>1</td>
      <td>0.422280</td>
      <td>0.354251</td>
     