## 5: Linear Regression and Train/Test Split

Use the `2013_movies.csv` data set:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from dateutil import parser
from datetime import datetime

import patsy
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

%matplotlib inline
sns.set_style('darkgrid')

## Load and Inspect Data

In [None]:
df = pd.read_csv('./challenges_data/2013_movies.csv')

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
# Look at head and tail
pd.concat([df.head(), df.tail()], axis=0)

In [None]:
# Look at random sample
df.sample(5)

### Remove Null Rows

In [None]:
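# Drop every row that has at least one null value; copy so that later column
# assignments modify an independent DataFrame rather than a slice of the original
df = df.dropna().copy() #JB good call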

### Transform Date into Month and Year

In [None]:
df['Date'] = df['ReleaseDate'].map(lambda x: parser.parse(x))

In [None]:
df['Year'] = df['Date'].map(lambda x: x.year)
df['Month'] = df['Date'].map(lambda x: x.month)

## Challenge 1

Build a linear model that uses only a constant term (a column of ones) to predict a continuous outcome (like domestic total gross). How can you interpret the results of this model? What does it predict? Make a plot of predictions against actual outcome. Make a histogram of residuals. How are the residuals distributed?
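As a hint for the interpretation question, here is a minimal sketch of what a constant-only least-squares fit does. The numbers are made up for illustration and are not taken from `2013_movies.csv`; the variable names are likewise hypothetical. Under these assumptions, the fitted constant should equal the mean of the outcome, every prediction is that same value, and R^2 comes out as zero.

```python
import numpy as np

# Made-up outcome values standing in for DomesticTotalGross (in $ millions).
y = np.array([12.0, 35.0, 50.0, 81.0, 122.0, 400.0])

# Design matrix with a single column of ones (the constant term).
X = np.ones((len(y), 1))

# Ordinary least squares: find b minimizing ||X b - y||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept = beta[0]

# Every prediction is the same number: the fitted intercept.
y_pred = X @ beta
residuals = y - y_pred

# R^2 = 1 - SS_res / SS_tot; with only a constant, SS_res equals SS_tot.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"intercept = {intercept:.2f}, mean(y) = {y.mean():.2f}")
print(f"R^2 = {r_squared:.6f}")
```

Because the residuals are just the outcome shifted by its mean, a histogram of residuals has the same shape as a histogram of the outcome itself.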

In [None]:
# Create your feature matrix (X) and target vector (y)
y, X = patsy.dmatrices('DomesticTotalGross ~ 1', data=df, return_type='dataframe')

In [None]:
X.head()

In [None]:
y.head()

### statsmodels

In [None]:
# Create your model
model = sm.OLS(y, X) #JB or could have used smf to do it all in one line!
# Fit your model to the full dataset
fit = model.fit()
# Print summary statistics of the model's performance
fit.summary()

### sklearn

In [None]:
# Create an empty model
lr = LinearRegression()
# Fit the model to the full dataset
lr.fit(X, y)
# Print out the R^2 for the model against the full dataset
lr.score(X,y)

## Challenge 2

Repeat the process of Challenge 1, but also add one continuous (numeric) predictor variable. Also add plots of the model's predictions against your feature variable and of the residuals against the feature variable. How can you interpret what's happening in the model?
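To make the interpretation concrete, the sketch below fits an intercept and a slope on made-up budget and gross figures (hypothetical values, not the real data set). The slope is the change in predicted gross per unit change in budget. With an intercept in the model, least-squares residuals average to zero and are uncorrelated with the feature, so any visible trend or curve in a residuals-versus-feature plot suggests the straight-line form is missing something.

```python
import numpy as np

# Made-up budget and gross figures (in $ millions), for illustration only.
budget = np.array([10.0, 25.0, 40.0, 60.0, 100.0, 150.0, 200.0])
gross = np.array([15.0, 30.0, 90.0, 70.0, 180.0, 210.0, 400.0])

# Design matrix: a column of ones (intercept) plus the numeric feature.
X = np.column_stack([np.ones_like(budget), budget])

# Least-squares estimates of [intercept, slope].
(intercept, slope), *_ = np.linalg.lstsq(X, gross, rcond=None)

y_pred = intercept + slope * budget
residuals = gross - y_pred

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print(f"mean residual = {residuals.mean():.2e}")
print(f"sum(residual * budget) = {np.dot(residuals, budget):.2e}")
```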

In [None]:
# Create your feature matrix (X) and target vector (y)
y, X = patsy.dmatrices('DomesticTotalGross ~ Budget', data=df, return_type='dataframe')

In [None]:
X.head()

In [None]:
y.head()

### statsmodels

In [None]:
# Create your model
model = sm.OLS(y, X)
# Fit your model to the full dataset
fit = model.fit()
# Print summary statistics of the model's performance
fit.summary()

### sklearn

In [None]:
# Create an empty model
lr = LinearRegression()
# Fit the model to the full dataset
lr.fit(X, y)
# Print out the R^2 for the model against the full dataset
lr.score(X,y)

## Challenge 3

Repeat the process of challenge 1, but add a categorical feature (like genre). You'll have to convert a column of text into a number of numerical columns ("dummy variables"). How can you interpret what's happening in the model?
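The dummy-variable idea can be sketched by hand with NumPy. The ratings and gross values below are invented for illustration. One level is dropped from the indicator columns and becomes the baseline absorbed by the intercept, which mirrors the treatment coding that patsy applies to a text column by default. In a model with only the intercept and the dummies, the intercept is the baseline group's mean and each dummy coefficient is that group's mean minus the baseline mean.

```python
import numpy as np

# Made-up ratings and gross values ($ millions), for illustration only.
rating = np.array(["PG", "R", "PG-13", "R", "PG", "PG-13", "R", "PG"])
gross = np.array([80.0, 40.0, 150.0, 60.0, 100.0, 170.0, 50.0, 90.0])

# Map each distinct label to an integer code; levels come back sorted.
levels, codes = np.unique(rating, return_inverse=True)

# One indicator (0/1) column per level.
one_hot = np.eye(len(levels))[codes]

# Drop the first level so it becomes the baseline absorbed by the intercept.
X = np.column_stack([np.ones(len(gross)), one_hot[:, 1:]])

beta, *_ = np.linalg.lstsq(X, gross, rcond=None)

print(f"baseline level: {levels[0]}, intercept = {beta[0]:.1f}")
for name, coef in zip(levels[1:], beta[1:]):
    print(f"{name}: {coef:+.1f} relative to {levels[0]}")
```

Keeping all indicator columns together with an intercept would make the columns linearly dependent, which is why one level is dropped.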

In [None]:
# Create your feature matrix (X) and target vector (y)
rating = patsy.dmatrix('Rating', data=df, return_type='dataframe')
y = df[['DomesticTotalGross']]
X = df[['Budget', 'Runtime']].join(rating)

In [None]:
X.head()

In [None]:
y.head()

### statsmodels

In [None]:
# Create your model
model = sm.OLS(y, X)
# Fit your model to the full dataset
fit = model.fit()
# Print summary statistics of the model's performance
fit.summary()

### sklearn

In [None]:
# Create an empty model
lr = LinearRegression()
# Fit the model to the full dataset
lr.fit(X, y)
# Print out the R^2 for the model against the full dataset
lr.score(X,y)

## Challenge 4

Enhance your model further by adding more features and/or transforming existing features. Think about how you build the model matrix and how to interpret what the model is doing.
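One common transformation to consider is taking logarithms of skewed, strictly positive quantities such as budget and gross. The sketch below uses made-up data that follows an exact power law, chosen only so the effect is easy to verify; real movie data will be much noisier. The helper `fit_ols` is a hypothetical name introduced for this example. In a log-log model the slope is read as an elasticity: the approximate percentage change in gross for a one percent change in budget.

```python
import numpy as np

# Made-up data following an exact power law, gross = 3 * budget^0.8.
budget = np.array([5.0, 10.0, 20.0, 50.0, 100.0, 200.0])
gross = 3.0 * budget ** 0.8

def fit_ols(feature, target):
    """Return (intercept, slope, r_squared) for a one-feature OLS fit."""
    X = np.column_stack([np.ones_like(feature), feature])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((target - target.mean()) ** 2)
    return beta[0], beta[1], r2

# Fit on the raw scale, then on the log-log scale.
raw = fit_ols(budget, gross)
logged = fit_ols(np.log(budget), np.log(gross))

print(f"raw:     slope = {raw[1]:.4f}, R^2 = {raw[2]:.4f}")
print(f"log-log: slope = {logged[1]:.4f}, R^2 = {logged[2]:.4f}")
```

Note that the two R^2 values are measured on different scales of the outcome, so they are a rough guide rather than a strict comparison.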

In [None]:
# Create your feature matrix (X) and target vector (y)
director = patsy.dmatrix('Director', data=df, return_type='dataframe') #JB what were you going for here? Can still use R formula for multi
rating = patsy.dmatrix('Rating', data=df, return_type='dataframe')

In [None]:
y = df[['DomesticTotalGross']]
X = pd.concat([df[['Budget', 'Year', 'Month']], rating], axis=1)

### statsmodels

In [None]:
# Create your model
model = sm.OLS(y, X)
# Fit your model to the full dataset
fit = model.fit()
# Print summary statistics of the model's performance
fit.summary()

### sklearn

In [None]:
# Create an empty model
lr = LinearRegression()
# Fit the model to the full dataset
lr.fit(X, y)
# Print out the R^2 for the model against the full dataset
lr.score(X, y)

## Challenge 5

Fitting and checking predictions on the exact same data set can be
misleading, because the model is scored on points it has already seen.
Divide your data into two sets: a training set and a test set
(roughly 75% training, 25% test is a fine split). Fit a model on the
training set, then check its predictions (by plotting them against the
actual values) on the test set.
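The mechanics of a split can be sketched with plain NumPy: shuffle the row indices, hold out a quarter of them, fit on the rest, and score both parts. The data here is synthetic, and the helpers `design` and `r_squared` are hypothetical names for this example. scikit-learn's `train_test_split` performs the shuffling and slicing in one call.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a noisy linear relationship, for illustration only.
n = 100
x = rng.uniform(0, 200, size=n)
y = 20.0 + 1.5 * x + rng.normal(0, 30, size=n)

# Shuffle the row indices, then hold out the last 25% as the test set.
idx = rng.permutation(n)
n_train = int(0.75 * n)
train_idx, test_idx = idx[:n_train], idx[n_train:]

def design(feature):
    """Build a design matrix with an intercept column and one feature."""
    return np.column_stack([np.ones_like(feature), feature])

# Fit on the training rows only.
beta, *_ = np.linalg.lstsq(design(x[train_idx]), y[train_idx], rcond=None)

def r_squared(feature, target):
    """R^2 of the already-fitted coefficients on the given rows."""
    resid = target - design(feature) @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((target - target.mean()) ** 2)

r2_train = r_squared(x[train_idx], y[train_idx])
r2_test = r_squared(x[test_idx], y[test_idx])
print(f"train R^2 = {r2_train:.3f}, test R^2 = {r2_test:.3f}")
```

The test-set score is the more honest estimate of how the model would do on new movies, since none of those rows influenced the fitted coefficients.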

In [None]:
# Split the data into training (70%) and test (30%) sets; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=129)

### statsmodels

In [None]:
# Create your model
model = sm.OLS(y_train, X_train)
# Fit your model to your training set
fit = model.fit()
# Print summary statistics of the model's performance
fit.summary()

In [None]:
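# Plot actual vs. predicted values on the test set
y_pred = fit.predict(X_test)
# y_test is a one-column DataFrame; select the column so the subtraction aligns by row
y_error = y_pred - y_test.iloc[:, 0]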

fig = plt.figure(figsize=(12, 8))
plt.scatter(np.arange(len(y_pred)), y_pred, color='red', label='predicted')
plt.scatter(np.arange(len(y_test)), y_test.iloc[:, 0], color='black', label='actual')
plt.legend(loc='upper right')
plt.xlabel('Index')
plt.ylabel('Domestic Gross ($)')
plt.title('Predicted vs. Actual')

In [None]:
# Plot residuals (fit.resid holds the residuals on the training set)
fit.resid.plot(style='o', figsize=(12,8));
plt.xlabel('Index')
plt.ylabel('Error ($)')
plt.title('Model Residuals')

### sklearn

In [None]:
lr = LinearRegression()
# Fit the model against the training data
lr.fit(X_train, y_train)
# Evaluate the model (R^2) against the testing data
lr.score(X_test, y_test)

In [None]:
# Plot actual vs. predicted values on the test set, using the sklearn model
y_pred = lr.predict(X_test).ravel()
y_error = y_pred - y_test.iloc[:, 0]

fig = plt.figure(figsize=(12, 8))
plt.scatter(np.arange(len(y_pred)), y_pred, color='red', label='predicted')
plt.scatter(np.arange(len(y_test)), y_test.iloc[:, 0], color='black', label='actual')
plt.legend(loc='upper right')
plt.xlabel('Index')
plt.ylabel('Domestic Gross ($)')
plt.title('Predicted vs. Actual')

In [None]:
# Calculate and plot test-set residuals (predicted minus actual)
y_pred = lr.predict(X_test)
y_error = y_pred - y_test

fig = plt.figure(figsize=(12, 8))
plt.scatter(np.arange(len(y_error)), y_error.iloc[:, 0], color='blue', label='residuals')
plt.xlabel('Index')
plt.ylabel('Error ($)')
plt.title('Model Residuals');