
In [None]:
# Clone the course repository, load the shared preamble, and move to the data directory
!git clone https://github.com/brighamfrandsen/econ484.git
%cd econ484/utilities
from preamble import *
%cd ../data

# Example: Movie-going and Weather

This notebook walks through the full supervised machine learning workflow, using the weather on opening weekend to predict movie attendance.

## 1. Figure out your question

How many people should we expect to attend a movie on a weekend with temperatures in the X1s, precipitation of X2, humidity of X3, ...?

## 2. Obtain a labeled dataset

In [None]:
import pandas as pd
import numpy as np

In [None]:
moviedata=pd.read_csv('./opening_wkend.csv')

print(moviedata.head())
print("Shape: {}".format(str(moviedata.shape)))

Let's define our "label" (y) vector and our "feature" matrix (X):

In [None]:
y = moviedata.filter(items=['tickets_wk1d_r'])
X = moviedata.filter(like='res_own',axis=1)
# head() must be called; printing y.head alone shows the bound method, not the rows
print('our y vector is:\n', y.head())
print('our X matrix is:\n', X.head())
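# Standardize each feature to mean 0 and standard deviation 1,
# so the Lasso penalty treats all features on the same scale
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)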
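
To see what the standardization step does, here is a minimal sketch on a small made-up array (the numbers are illustrative and do not come from `opening_wkend.csv`). `StandardScaler` subtracts each column's mean and divides by its standard deviation, and `transform` reuses the statistics learned in `fit`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values): two columns on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

toy_scaler = StandardScaler().fit(X_toy)
X_toy_std = toy_scaler.transform(X_toy)

# After scaling, each column has mean 0 and (population) standard deviation 1
col_means = X_toy_std.mean(axis=0)
col_stds = X_toy_std.std(axis=0)
print(col_means, col_stds)

# The same result by hand: (x - mean) / std, using the fitted statistics
X_by_hand = (X_toy - toy_scaler.mean_) / toy_scaler.scale_
```

Because the fitted means and standard deviations are stored on the scaler, the same object can later be applied to new observations.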

## 3. Divide into training and test sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42,test_size=.25)
y_train.shape
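
One caveat about the order of steps: the scaler above was fit on all of `X` before the split, so the test rows influenced the means and standard deviations. A common alternative, sketched here on synthetic data from `make_regression` (an assumption, standing in for the movie data), is to split first and wrap the scaler and the model in a pipeline, so the scaler is fit on training rows only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the movie data (assumption: not the real dataset)
X_syn, y_syn = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Split first, so the test rows never influence the scaler
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, random_state=42, test_size=.25)

# fit() standardizes using the training rows only, then fits the Lasso
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=10000))
pipe.fit(Xs_train, ys_train)

# score() reuses the training-set means and standard deviations on the test rows
print("Test R^2: {:.4f}".format(pipe.score(Xs_test, ys_test)))
```

The alpha value of 0.1 is illustrative; in practice it would be chosen by cross-validation as in step 5.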

## 4. Pick an appropriate method

In [None]:
from sklearn.linear_model import Lasso

## 5. Choose regularization parameters via cross-validation on the training set

You can do this by hand, one value of alpha at a time, if you really want:

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
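# Try a single candidate alpha with 5-fold cross-validation;
# cross_val_score uses the estimator's default score, which is R^2 for Lasso
lasso = Lasso(alpha=0.1, max_iter=100)
scores = cross_val_score(lasso,X_train,y_train,cv=5)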
print("Cross-validation scores: {}".format(scores))
print("Average cross-validation score: {:.4f}".format(scores.mean()))
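
Choosing the parameter by hand means repeating the cell above for several candidate values of `alpha` and keeping the one with the highest average score. A minimal sketch on synthetic data (both the data and the candidate grid are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train, y_train (assumption: not the movie data)
X_syn, y_syn = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

candidate_alphas = [0.01, 0.1, 1.0, 10.0]  # illustrative grid
mean_scores = []
for a in candidate_alphas:
    scores = cross_val_score(Lasso(alpha=a, max_iter=10000), X_syn, y_syn, cv=5)
    mean_scores.append(scores.mean())
    print("alpha={:<6} average CV R^2: {:.4f}".format(a, scores.mean()))

# Keep the alpha with the highest average cross-validation score
best_alpha = candidate_alphas[int(np.argmax(mean_scores))]
print("Best alpha by hand:", best_alpha)
```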

Or use GridSearchCV and do it automatically:

In [None]:
from sklearn.model_selection import GridSearchCV
# define grid for alpha
alpha_grid = {
    'alpha': [.0001, .005, .01, .015, .02, .025, .03, .031, .032, .034, .035, .036, .04, .05, .06],
    'max_iter': [1000],
}
grid_search = GridSearchCV(Lasso(), alpha_grid, cv=5, return_train_score=True)
best_model = grid_search.fit(X_train, y_train)
print("Best alpha: ", best_model.best_params_['alpha'])
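
Beyond the single best value, a fitted `GridSearchCV` object stores the scores for every candidate in `cv_results_`. That is useful for checking whether the best `alpha` sits at the edge of the grid, in which case the grid may need widening. A sketch on synthetic data (data and grid are assumptions, not the values used above):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train, y_train (assumption: not the movie data)
X_syn, y_syn = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0], 'max_iter': [10000]}
search = GridSearchCV(Lasso(), param_grid, cv=5, return_train_score=True)
search.fit(X_syn, y_syn)

# One row per candidate alpha, with average train and validation R^2
results = pd.DataFrame(search.cv_results_)
print(results[['param_alpha', 'mean_train_score', 'mean_test_score']])
print("Best alpha:", search.best_params_['alpha'])
```

Note that `mean_test_score` here refers to the held-out fold within cross-validation, not to the separate test set from step 3.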

Or, even easier, just use LassoCV:

In [None]:
from sklearn.linear_model import LassoCV
lassocv = LassoCV(cv=5,max_iter=1000).fit(X_train, y_train)
print(lassocv.alpha_)
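
`LassoCV` builds its own grid of alphas and keeps the one with the lowest average mean squared error across folds. The sketch below, on synthetic data (an assumption), looks inside the fitted object to show how `alpha_` relates to the stored `alphas_` and `mse_path_` attributes.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in for X_train, y_train (assumption: not the movie data)
X_syn, y_syn = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

cv_model = LassoCV(cv=5, max_iter=10000).fit(X_syn, y_syn)

# mse_path_ has one row per candidate alpha and one column per fold
mean_mse = cv_model.mse_path_.mean(axis=1)
best_idx = int(np.argmin(mean_mse))

print("Number of alphas tried:", len(cv_model.alphas_))
print("Chosen alpha:", cv_model.alpha_)
print("Alpha with lowest average CV MSE:", cv_model.alphas_[best_idx])
```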

## 6. Fit model on whole training set using the cross-validated parameters

In [None]:
lassowcvalpha=Lasso(alpha = lassocv.alpha_,max_iter=100000).fit(X_train,y_train)

In [None]:
lassowcvalpha.coef_
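
The raw coefficient array is easier to read once it is paired with feature names and the exact zeros are dropped, since the Lasso penalty sets some coefficients to zero. A minimal sketch on synthetic data; the names `feature_0`, `feature_1`, ... are hypothetical placeholders, and `alpha=5.0` is an illustrative value rather than a cross-validated one.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data in which only 5 of 20 features matter (assumption: not the movie data)
X_syn, y_syn = make_regression(n_samples=200, n_features=20, n_informative=5,
                               noise=10.0, random_state=0)
feature_names = ["feature_{}".format(j) for j in range(X_syn.shape[1])]  # hypothetical names

model = Lasso(alpha=5.0, max_iter=10000).fit(X_syn, y_syn)

# Label the coefficients and keep only those the Lasso did not shrink to zero
coefs = pd.Series(model.coef_, index=feature_names)
selected = coefs[coefs != 0]
print("Features kept: {} of {}".format(len(selected), len(coefs)))
print(selected.sort_values(key=abs, ascending=False))
```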

## 7. Evaluate model by applying it to test set

In [None]:
print('Lasso score on test set: {:.4f}'.format(lassowcvalpha.score(X_test,y_test)))
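
For regressors such as `Lasso`, `score` returns the coefficient of determination, R² = 1 − SS_res / SS_tot, computed on the data passed in. The sketch below reproduces that number by hand on synthetic data (an assumption) and also reports the root mean squared error, which is in the units of the label.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the movie data (assumption: not the real dataset)
X_syn, y_syn = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, random_state=42, test_size=.25)

model = Lasso(alpha=0.1, max_iter=10000).fit(Xs_train, ys_train)
pred = model.predict(Xs_test)

# R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the test-set mean
ss_res = np.sum((ys_test - pred) ** 2)
ss_tot = np.sum((ys_test - ys_test.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot
print("R^2 by hand: {:.4f}, model.score: {:.4f}".format(
    r2_by_hand, model.score(Xs_test, ys_test)))

# Root mean squared error, in the same units as y
rmse = np.sqrt(mean_squared_error(ys_test, pred))
print("Test RMSE: {:.4f}".format(rmse))
```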

## 8. Repeat steps 4-7 for several methods
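
As a rough sketch of what repeating the steps can look like, the loop below fits three candidate methods on the same training set and compares their test-set R². The data are synthetic and the choice of methods is an illustrative assumption. `LassoCV` and `RidgeCV` tune their penalty by cross-validation internally, while the random forest uses fixed settings for brevity.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the movie data (assumption: not the real dataset)
X_syn, y_syn = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, random_state=42, test_size=.25)

# Candidate methods (illustrative choices)
models = {
    "Lasso": LassoCV(cv=5, max_iter=10000),
    "Ridge": RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each method on the training set, then evaluate once on the test set
test_scores = {}
for name, m in models.items():
    m.fit(Xs_train, ys_train)
    test_scores[name] = m.score(Xs_test, ys_test)
    print("{:<14} test R^2: {:.4f}".format(name, test_scores[name]))
```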

## 9. Apply to new observations for which we have no labels

In [None]:
Xnew=pd.read_csv('./newobs.csv')

# Standardize the new features with the scaler fit earlier (same means and standard deviations)
Xnew = scaler.transform(Xnew)
yhatnew=lassowcvalpha.predict(Xnew)
print("predicted residualized ticket sales for new observation: ",yhatnew)
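
Applying the fitted scaler only makes sense if the new data have the same columns, in the same order, as the data the scaler was fit on. A minimal sketch with made-up column names (`temp` and `rain` are hypothetical, not columns of `newobs.csv`) shows one way to align the columns before transforming.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training features (made-up values and hypothetical column names)
train_df = pd.DataFrame({"temp": [50.0, 60.0, 70.0], "rain": [0.0, 1.0, 2.0]})
toy_scaler = StandardScaler().fit(train_df)

# A new observation whose columns arrive in a different order
new_df = pd.DataFrame({"rain": [1.0], "temp": [60.0]})

# Reorder the new columns to match the training data before transforming
new_aligned = new_df[list(train_df.columns)]
X_new_std = toy_scaler.transform(new_aligned)

# The new row equals the training means, so both standardized values are 0
print(X_new_std)
```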
