## 5: Linear Regression and Train/Test Split

Use the `2013_movies.csv` data set:

In [77]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from dateutil import parser
from datetime import datetime

import patsy
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

In [78]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

%matplotlib inline
sns.set_style('darkgrid')

## Load and Inspect Data

In [15]:
df = pd.read_csv('./challenges_data/2013_movies.csv')

In [16]:
df.describe()

,Budget,DomesticTotalGross,Runtime
count,89.0,100.0,100.0
mean,74747190.0,100596900.0,112.26
std,59416920.0,87396410.0,18.190696
min,2500000.0,25568250.0,75.0
25%,28000000.0,42704130.0,98.0
50%,55000000.0,69542370.0,112.0
75%,110000000.0,120475900.0,123.0
max,225000000.0,424668000.0,180.0


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
Title                 100 non-null object
Budget                89 non-null float64
DomesticTotalGross    100 non-null int64
Director              96 non-null object
Rating                100 non-null object
Runtime               100 non-null int64
ReleaseDate           100 non-null object
dtypes: float64(1), int64(2), object(4)
memory usage: 5.5+ KB


In [18]:
# Look at head and tail
pd.concat([df.head(), df.tail()], axis=0)

,Title,Budget,DomesticTotalGross,Director,Rating,Runtime,ReleaseDate
0,The Hunger Games: Catching Fire,130000000.0,424668047,Francis Lawrence,PG-13,146,2013-11-22 00:00:00
1,Iron Man 3,200000000.0,409013994,Shane Black,PG-13,129,2013-05-03 00:00:00
2,Frozen,150000000.0,400738009,Chris BuckJennifer Lee,PG,108,2013-11-22 00:00:00
3,Despicable Me 2,76000000.0,368061265,Pierre CoffinChris Renaud,PG,98,2013-07-03 00:00:00
4,Man of Steel,225000000.0,291045518,Zack Snyder,PG-13,143,2013-06-14 00:00:00
95,Rush,38000000.0,26947624,Ron Howard,R,123,2013-09-20 00:00:00
96,The Host,40000000.0,26627201,Andrew Niccol,PG-13,125,2013-03-29 00:00:00
97,The World's End,20000000.0,26004851,Edgar Wright,R,109,2013-08-23 00:00:00
98,21 and Over,13000000.0,25682380,Jon LucasScott Moore,R,93,2013-03-01 00:00:00
99,Her,23000000.0,25568251,Spike Jonze,R,120,2013-12-18 00:00:00


In [19]:
# Look at random sample
df.sample(5)

,Title,Budget,DomesticTotalGross,Director,Rating,Runtime,ReleaseDate
37,Elysium,115000000.0,93050117,Neill Blomkamp,R,109,2013-08-09 00:00:00
40,Oblivion,120000000.0,89107235,Joseph Kosinski,PG-13,125,2013-04-19 00:00:00
92,One Direction: This is Us,10000000.0,28873374,Morgan Spurlock,PG,92,2013-08-30 00:00:00
23,Lone Survivor,40000000.0,125095601,Peter Berg,R,121,2013-12-25 00:00:00
54,The Purge,3000000.0,64473115,James DeMonaco,R,85,2013-06-07 00:00:00


### Remove Null Rows

In [20]:
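# Drop every row with at least one missing value; copy so later column assignments act on an independent frame
df = df.dropna().copy()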

### Transform Date into Month and Year

In [26]:
# Parse the release-date strings into pandas datetimes
df['Date'] = pd.to_datetime(df['ReleaseDate'])

In [28]:
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

## Challenge 1

Build a linear model that uses only a constant term (a column of ones) to predict a continuous outcome (like domestic total gross). How can you interpret the results of this model? What does it predict? Make a plot of predictions against actual outcome. Make a histogram of residuals. How are the residuals distributed?

In [69]:
# Create your feature matrix (X) and target vector (y)
y, X = patsy.dmatrices('DomesticTotalGross ~ 1', data=df, return_type='dataframe')

In [70]:
X.head()

,Intercept
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0


In [71]:
y.head()

,DomesticTotalGross
0,424668047.0
1,409013994.0
2,400738009.0
3,368061265.0
4,291045518.0


### statsmodels

In [43]:
# Create your model
model = sm.OLS(y, X)
# Fit the model to the full dataset (no train/test split yet)
fit = model.fit()
# Print summary statistics of the model's performance
fit.summary()


Dep. Variable:,DomesticTotalGross,R-squared:,-0.0
Model:,OLS,Adj. R-squared:,-0.0
Method:,Least Squares,F-statistic:,-inf
Date:,"Tue, 30 Jan 2018",Prob (F-statistic):,
Time:,07:13:27,Log-Likelihood:,-1714.4
No. Observations:,87,AIC:,3431.0
Df Residuals:,86,BIC:,3433.0
Df Model:,0,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,1.035e+08,9.43e+06,10.975,0.000,8.48e+07,1.22e+08

Omnibus:,47.828,Durbin-Watson:,0.016
Prob(Omnibus):,0.0,Jarque-Bera (JB):,119.528
Skew:,2.032,Prob(JB):,1.11e-26
Kurtosis:,7.058,Cond. No.,1.0


### sklearn

In [44]:
# Create an empty model
lr = LinearRegression()
# Fit the model to the full dataset
lr.fit(X, y)
# Print out the R^2 for the model against the full dataset
lr.score(X,y)

0.0
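
The intercept-only result is easier to interpret with a small check. The sketch below is an illustration, not part of the original notebook: it uses only the ten `DomesticTotalGross` values visible in the head/tail output above (not the full dataset), and all variable names are assumptions. With a column of ones as the only feature, least squares returns the sample mean as the intercept, so every prediction is that mean and R² is 0 by construction.

```python
import numpy as np
import matplotlib.pyplot as plt

# Ten gross values copied from the head/tail output above (illustrative subset only).
gross = np.array([424668047, 409013994, 400738009, 368061265, 291045518,
                  26947624, 26627201, 26004851, 25682380, 25568251], dtype=float)

# Design matrix with a single column of ones (the intercept).
ones = np.ones((gross.size, 1))

# Ordinary least squares via numpy; the only coefficient is the intercept.
coef, *_ = np.linalg.lstsq(ones, gross, rcond=None)
intercept = coef[0]

predictions = ones @ coef        # the same value for every row
residuals = gross - predictions  # actual minus predicted

# R^2 = 1 - SS_res / SS_tot; the two sums coincide here, so R^2 is 0.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((gross - gross.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Predictions against actual values, and a histogram of the residuals.
fig, (ax_pred, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
ax_pred.scatter(gross, predictions)
ax_pred.set_xlabel('Actual gross')
ax_pred.set_ylabel('Predicted gross')
ax_hist.hist(residuals, bins=5)
ax_hist.set_xlabel('Residual (actual - predicted)')
```

Read this way, the `Intercept` of 1.035e+08 in the summary above is the mean gross of the 87 remaining rows, and the reported skew of 2.032 suggests the residuals are right-skewed: a few blockbusters sit far above the mean.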

## Challenge 2

Repeat the process of Challenge 1, but add one continuous (numeric) predictor variable. Also plot the model's predictions against the feature variable and the residuals against the feature variable. How can you interpret what's happening in the model?

In [66]:
# Create your feature matrix (X) and target vector (y)
y, X = patsy.dmatrices('DomesticTotalGross ~ Budget', data=df, return_type='dataframe')

In [67]:
X.head()

,Intercept,Budget
0,1.0,130000000.0
1,1.0,200000000.0
2,1.0,150000000.0
3,1.0,76000000.0
4,1.0,225000000.0


In [68]:
y.head()

,DomesticTotalGross
0,424668047.0
1,409013994.0
2,400738009.0
3,368061265.0
4,291045518.0


### statsmodels

In [48]:
# Create your model
model = sm.OLS(y, X)
# Fit the model to the full dataset (no train/test split yet)
fit = model.fit()
# Print summary statistics of the model's performance
fit.summary()

Dep. Variable:,DomesticTotalGross,R-squared:,0.282
Model:,OLS,Adj. R-squared:,0.274
Method:,Least Squares,F-statistic:,33.43
Date:,"Tue, 30 Jan 2018",Prob (F-statistic):,1.19e-07
Time:,07:13:51,Log-Likelihood:,-1700.0
No. Observations:,87,AIC:,3404.0
Df Residuals:,85,BIC:,3409.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,4.443e+07,1.3e+07,3.418,0.001,1.86e+07,7.03e+07
Budget,0.7831,0.135,5.782,0.000,0.514,1.052

Omnibus:,38.475,Durbin-Watson:,0.666
Prob(Omnibus):,0.0,Jarque-Bera (JB):,92.671
Skew:,1.577,Prob(JB):,7.53e-21
Kurtosis:,6.952,Cond. No.,155000000.0


### sklearn

In [49]:
# Create an empty model
lr = LinearRegression()
# Fit the model to the full dataset
lr.fit(X, y)
# Print out the R^2 for the model against the full dataset
lr.score(X,y)

0.28230037692954857
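
The challenge also asks for plots of the predictions and the residuals against the feature, which the cells above do not produce. The following is a minimal sketch under stated assumptions: it uses only the ten head/tail rows shown earlier (in millions of dollars), it swaps in `np.polyfit` for the notebook's OLS fit, and the variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Budget and gross for the ten head/tail rows shown earlier, in millions of dollars
# (illustrative subset only, not the full dataset).
budget_m = np.array([130, 200, 150, 76, 225, 38, 40, 20, 13, 23], dtype=float)
gross_m = np.array([424668047, 409013994, 400738009, 368061265, 291045518,
                    26947624, 26627201, 26004851, 25682380, 25568251]) / 1e6

# Degree-1 least-squares fit: gross ~ intercept + slope * budget.
slope, intercept = np.polyfit(budget_m, gross_m, deg=1)
predictions = intercept + slope * budget_m
residuals = gross_m - predictions

order = np.argsort(budget_m)  # sort so the fitted line is drawn left to right
fig, (ax_fit, ax_res) = plt.subplots(1, 2, figsize=(10, 4))

# Left: actual values and the fitted line against the feature.
ax_fit.scatter(budget_m, gross_m, label='actual')
ax_fit.plot(budget_m[order], predictions[order], color='red', label='prediction')
ax_fit.set_xlabel('Budget ($M)')
ax_fit.set_ylabel('Domestic gross ($M)')
ax_fit.legend()

# Right: residuals against the feature; structure here hints at a poor fit.
ax_res.scatter(budget_m, residuals)
ax_res.axhline(0, color='grey', linewidth=1)
ax_res.set_xlabel('Budget ($M)')
ax_res.set_ylabel('Residual ($M)')
```

For the full-data fit summarized above, the `Budget` coefficient of 0.7831 can be read as roughly $0.78 of additional domestic gross per additional budget dollar, on average, with budget alone accounting for about 28% of the variance (R² = 0.282).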

## Challenge 3

Repeat the process of Challenge 1, but add a categorical feature (like genre). You'll have to convert a column of text into a number of numerical columns ("dummy variables"). How can you interpret what's happening in the model?

In [63]:
# Create your feature matrix (X) and target vector (y)
rating = patsy.dmatrix('Rating', data=df, return_type='dataframe')
y = df[['DomesticTotalGross']]
X = df[['Budget', 'Runtime']].join(rating)

In [64]:
X.head()

,Budget,Runtime,Intercept,Rating[T.PG-13],Rating[T.R]
0,130000000.0,146,1.0,1.0,0.0
1,200000000.0,129,1.0,1.0,0.0
2,150000000.0,108,1.0,0.0,0.0
3,76000000.0,98,1.0,0.0,0.0
4,225000000.0,143,1.0,1.0,0.0


In [65]:
y.head()

,DomesticTotalGross
0,424668047
1,409013994
2,400738009
3,368061265
4,291045518


### statsmodels

In [56]:
# Create your model
model = sm.OLS(y, X)
# Fit the model to the full dataset (no train/test split yet)
fit = model.fit()
# Print summary statistics of the model's performance
fit.summary()

Dep. Variable:,DomesticTotalGross,R-squared:,0.299
Model:,OLS,Adj. R-squared:,0.264
Method:,Least Squares,F-statistic:,8.73
Date:,"Tue, 30 Jan 2018",Prob (F-statistic):,6.38e-06
Time:,07:15:15,Log-Likelihood:,-1699.0
No. Observations:,87,AIC:,3408.0
Df Residuals:,82,BIC:,3420.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Budget,0.6957,0.172,4.054,0.000,0.354,1.037
Runtime,6.254e+05,5.33e+05,1.173,0.244,-4.35e+05,1.69e+06
Intercept,3.09e+06,5.29e+07,0.058,0.954,-1.02e+08,1.08e+08
Rating[T.PG-13],-2.747e+07,2.5e+07,-1.097,0.276,-7.73e+07,2.24e+07
Rating[T.R],-2.59e+07,2.76e+07,-0.939,0.351,-8.08e+07,2.9e+07

Omnibus:,35.472,Durbin-Watson:,0.733
Prob(Omnibus):,0.0,Jarque-Bera (JB):,76.021
Skew:,1.509,Prob(JB):,3.11e-17
Kurtosis:,6.445,Cond. No.,628000000.0


### sklearn

In [57]:
# Create an empty model
lr = LinearRegression()
# Fit the model to the full dataset
lr.fit(X, y)
# Print out the R^2 for the model against the full dataset
lr.score(X,y)

0.29867705443266701
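
To see what the dummy columns do, it may help to fit a model that contains nothing but an intercept and the rating indicators. The sketch below is illustrative only: it uses the ten head/tail rows shown earlier, swaps in `pd.get_dummies` for the notebook's `patsy.dmatrix` call, and its variable names are assumptions. In that stripped-down model the intercept is the mean gross of the baseline rating (`PG`) and each dummy coefficient is that rating's offset from the baseline.

```python
import numpy as np
import pandas as pd

# Rating and gross for the ten head/tail rows shown earlier (illustrative subset only).
movies = pd.DataFrame({
    'Rating': ['PG-13', 'PG-13', 'PG', 'PG', 'PG-13', 'R', 'PG-13', 'R', 'R', 'R'],
    'DomesticTotalGross': [424668047, 409013994, 400738009, 368061265, 291045518,
                           26947624, 26627201, 26004851, 25682380, 25568251],
})

# One indicator column per rating; drop_first removes the first level ('PG'),
# which becomes the baseline absorbed by the intercept.
dummies = pd.get_dummies(movies['Rating'], prefix='Rating', drop_first=True, dtype=float)

# Least squares on [intercept, Rating_PG-13, Rating_R].
design = np.column_stack([np.ones(len(movies)), dummies.to_numpy()])
target = movies['DomesticTotalGross'].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(design, target, rcond=None)

# Compare with the per-rating means computed directly.
group_means = movies.groupby('Rating')['DomesticTotalGross'].mean()
baseline = group_means['PG']
offsets = group_means.drop('PG') - baseline

print('intercept vs PG mean:', coef[0], baseline)
print('dummy coefficients vs offsets:', coef[1:], offsets.to_numpy())
```

The notebook's model also includes `Budget` and `Runtime`, so there `Rating[T.PG-13]` and `Rating[T.R]` are offsets from `PG` with those features held fixed. Their p-values in the summary above (0.276 and 0.351) give little evidence that either offset differs from zero.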

## Challenge 4

Enhance your model further by adding more features and/or transforming existing features. Think about how you build the model matrix and how to interpret what the model is doing.

In [79]:
# Create your feature matrix (X) and target vector (y)
director = patsy.dmatrix('Director', data=df, return_type='dataframe')
rating = patsy.dmatrix('Rating', data=df, return_type='dataframe')
# Combine rating and director dummies (note: X below joins only `rating`, so `dummy` goes unused)
dummy = rating.join(director.drop(columns='Intercept'))

In [80]:
y = df[['DomesticTotalGross']]
X = df[['Budget', 'Year', 'Month']].join(rating)

In [81]:
X.head()

,Budget,Year,Month,Intercept,Rating[T.PG-13],Rating[T.R]
0,130000000.0,2013,11,1.0,1.0,0.0
1,200000000.0,2013,5,1.0,1.0,0.0
2,150000000.0,2013,11,1.0,0.0,0.0
3,76000000.0,2013,7,1.0,0.0,0.0
4,225000000.0,2013,6,1.0,1.0,0.0


In [82]:
y.head()

,DomesticTotalGross
0,424668047
1,409013994
2,400738009
3,368061265
4,291045518


### statsmodels

In [83]:
# Create your model
model = sm.OLS(y, X)
# Fit the model; y and X are already bound to `model`, so fit() takes no data arguments
fit = model.fit()
# Print summary statistics of the model's performance
fit.summary()

### sklearn

In [84]:
# Create an empty model
lr = LinearRegression()
# Fit the model to the full dataset
lr.fit(X, y)
# Print out the R^2 for the model against the full dataset
lr.score(X, y)

0.29086740172151471
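
One thing to check when interpreting this model is whether `Year` adds any information. If every release date falls in 2013, as in all the rows displayed above, the `Year` column is a constant multiple of the `Intercept` column and the design matrix is rank deficient. The sketch below is an assumption-laden illustration using the months of the ten head/tail rows; the variable names are not from the notebook.

```python
import numpy as np

# Release months of the ten head/tail rows shown earlier; Year is 2013 in each of them.
month = np.array([11, 5, 11, 7, 6, 9, 3, 8, 3, 12], dtype=float)
year = np.full_like(month, 2013.0)
intercept = np.ones_like(month)

# Design matrices with and without the constant Year column.
with_year = np.column_stack([intercept, year, month])
without_year = np.column_stack([intercept, month])

# A full-rank matrix has rank equal to its number of columns.
rank_with_year = np.linalg.matrix_rank(with_year)
rank_without_year = np.linalg.matrix_rank(without_year)

print('columns / rank with Year:   ', with_year.shape[1], rank_with_year)
print('columns / rank without Year:', without_year.shape[1], rank_without_year)
```

When the rank is lower than the number of columns, the individual coefficients for `Intercept` and `Year` are not uniquely determined, so dropping `Year` is probably the simpler choice for a single-year dataset. Note also that this feature set omits `Runtime`, which is consistent with the R² here (0.2909) being slightly below the Challenge 3 value (0.2987).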

## Challenge 5

Fitting a model and checking its predictions on the exact same data set
can be misleading. Divide your data into two sets: a training set and a
test set (roughly 75% training, 25% test is a fine split). Fit a model on
the training set, then check its predictions on the test set by plotting
them against the actual values.
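
One way this could look is sketched below. It is a minimal example under stated assumptions, not the notebook's solution: the data is synthetic (a hypothetical budget/gross relationship with noise, generated here because the CSV is not available in this snippet), and the split uses scikit-learn's `train_test_split` with a fixed `random_state` for repeatability.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, NOT the movie dataset), in millions of dollars.
rng = np.random.default_rng(0)
budget = rng.uniform(5, 225, size=100)
gross = 45 + 0.8 * budget + rng.normal(0, 60, size=100)

X = budget.reshape(-1, 1)  # sklearn expects a 2-D feature matrix
y = gross

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)

# R^2 on the training rows versus the held-out rows.
train_r2 = lr.score(X_train, y_train)
test_r2 = lr.score(X_test, y_test)
print('train R^2:', train_r2)
print('test R^2: ', test_r2)

# Predicted versus actual on the test set; points on the diagonal are perfect predictions.
test_predictions = lr.predict(X_test)
fig, ax = plt.subplots()
ax.scatter(y_test, test_predictions)
lims = [min(y_test.min(), test_predictions.min()),
        max(y_test.max(), test_predictions.max())]
ax.plot(lims, lims, color='grey', linewidth=1)
ax.set_xlabel('Actual gross ($M)')
ax.set_ylabel('Predicted gross ($M)')
```

In the notebook itself, the `X` and `y` built in the earlier challenges could presumably be passed to `train_test_split` in place of the synthetic arrays. A test R² well below the training R² would suggest the model does not generalize as well as the full-data scores imply.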