# Modeling: The Movie

(See the README of this repository for the full write-up.)

For modeling, we took the approach of throwing everything at the wall and seeing what stuck. We imported many different models: regressors including linear regression, LASSO, SGD regressor, bagging regressor, random forest regressor, SVR, and AdaBoost regressor, as well as classifiers including logistic regression, random forest classifier, AdaBoost classifier, k-nearest neighbors classifier, decision tree classifier, and even a neural network.

In [84]:
import imdb
import re
import warnings
warnings.simplefilter("ignore")
import pandas as pd
import numpy as np
import ast
from datetime import datetime, timedelta
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, SGDRegressor
from sklearn.feature_selection import RFE
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.metrics import mean_squared_error, f1_score
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.utils import np_utils
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
import matplotlib.pyplot as plt 
import seaborn as sns

%matplotlib inline

We brought in our six dataframes:
1. 1 df 2 = directors weighted, deleted columns with 1 or fewer terms
2. 2 df 2 = directors and actors weighted, deleted columns with 1 or fewer terms
3. 3 df 2 = directors, actors, and writers weighted, deleted columns with 1 or fewer terms
4. 1 df 3 = directors weighted, deleted columns with 2 or fewer terms
5. 2 df 3 = directors and actors weighted, deleted columns with 2 or fewer terms
6. 3 df 3 = directors, actors, and writers weighted, deleted columns with 2 or fewer terms
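
To make the "df 2" versus "df 3" distinction concrete, here is a minimal sketch of document-frequency pruning. It assumes the labels refer to a CountVectorizer-style `min_df` of 2 and 3, which is our reading of the list above rather than something the saved CSVs confirm. The helper `prune_rare_terms` and the toy cast lists are hypothetical, and the rows are binary indicators rather than counts, to keep the example short and in plain Python.

```python
from collections import Counter


def prune_rare_terms(docs, min_df):
    """Keep only terms that occur in at least `min_df` documents."""
    # Document frequency: each term is counted once per document.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    kept = sorted(term for term, n in doc_freq.items() if n >= min_df)
    # One binary row per document over the surviving vocabulary.
    rows = [[1 if term in doc else 0 for term in kept] for doc in docs]
    return kept, rows


# Hypothetical toy "cast lists", one per movie.
movies = [
    ["nolan", "bale", "caine"],
    ["nolan", "caine", "hardy"],
    ["scott", "crowe"],
]

vocab_df2, rows_df2 = prune_rare_terms(movies, min_df=2)
vocab_df3, rows_df3 = prune_rare_terms(movies, min_df=3)
print(vocab_df2, rows_df2)
print(vocab_df3, rows_df3)
```

With a threshold of 2 only the names shared by two movies survive; with a threshold of 3 nothing survives in this toy data, which is the trade-off between the df 2 and df 3 dataframes in miniature.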

In [81]:
# Pre-made dataframes with directors weighted

# X_train = pd.read_csv('train_everything_director_weights_df2.csv') # 1
# X_test = pd.read_csv('test_everything_director_weights_df2.csv') # 1
# X_train = pd.read_csv('train_everything_director_actor_weights_df2.csv') # 2
# X_test = pd.read_csv('test_everything_director_actor_weights_df2.csv') # 2 
X_train = pd.read_csv('train_everything_director_actor_writer_weights_df2.csv') # 3
X_test = pd.read_csv('test_everything_director_actor_writer_weights_df2.csv') # 3
# X_train = pd.read_csv('train_everything_director_weights_df3.csv') # 4
# X_test = pd.read_csv('test_everything_director_weights_df3.csv') # 4
# X_train = pd.read_csv('train_everything_director_actor_weights_df3.csv') # 5
# X_test = pd.read_csv('test_everything_director_actor_weights_df3.csv') # 5
# X_train = pd.read_csv('train_everything_director_actor_writer_weights_df3.csv') # 6
# X_test = pd.read_csv('test_everything_director_actor_writer_weights_df3.csv') # 6

We then fed each dataframe through the following cell, which produced three regressor scores, then transformed our y variable for classification (splitting at the median Metacritic score) and fed that through three classifiers. Along the way many models were tried and thrown out, and dataframes were changed, saved again, and reloaded. In the end we settled on the following models:

- Regression
    - Bagging Regressor
    - Random Forest Regressor
    - LASSO
- Classification
    - Logistic Regression
    - Bagging Classifier
    - Random Forest Classifier
    
Except for LASSO and logistic regression, there wasn't much rhyme or reason to our modeling choices. These simply gave us the best scores of the models we tried without taking a huge amount of time. The bagging regressor and classifier never scored as well as the other models, but they ran quickly and served as a canary in a coal mine, warning us if something had gone wrong with the models.

In [82]:
y_train = X_train.Metascore
y_test = X_test.Metascore

X_train.drop(['Metascore'], axis=1, inplace=True)
X_test.drop(['Metascore'], axis=1, inplace=True)

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

In [51]:
br = BaggingRegressor()
br.fit(X_train, y_train)
# print('br train score: ')
# print(br.score(X_train, y_train))
print('br test score: ')
print(br.score(X_test, y_test))
print()

rf = RandomForestRegressor()
rf.fit(X_train, y_train)
# print('rf train score: ')
# print(rf.score(X_train, y_train))
print('rf test score: ')
print(rf.score(X_test, y_test))
print()

lasso = Lasso(.15)
lasso.fit(X_train, y_train)
# print('lasso train score: ')
# print(lasso.score(X_train, y_train))
print('lasso test score: ')
print(lasso.score(X_test, y_test))
print()

median = np.median(y_train)

new_y = []
for n in y_train:
    if n > median:
        new_y.append(1)
    else:
        new_y.append(0)
y_train = new_y

new_y = []
for n in y_test:
    if n > median:
        new_y.append(1)
    else:
        new_y.append(0)
y_test = new_y

logreg = LogisticRegression() 
logreg.fit(X_train, y_train)
# print('logreg train score: ')
# print(logreg.score(X_train, y_train))
print('logreg test score: ')
print(logreg.score(X_test, y_test))
print()

br = BaggingClassifier()
br.fit(X_train, y_train)
# print('br train score: ')
# print(br.score(X_train, y_train))
print('br test score: ')
print(br.score(X_test, y_test))
print()

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
# print('rf train score: ')
# print(rf.score(X_train, y_train))
print('rf test score: ')
print(rf.score(X_test, y_test))
print()

# 1 reg 

# br test score: 
# 0.038342711196289625

# rf test score: 
# 0.11832620794676674

# lasso test score: 
# 0.19244316790430385

# 1 class 

# logreg test score: 
# 0.6736577181208053

# br test score: 
# 0.662751677852349

# rf test score: 
# 0.6753355704697986

# 2 reg 

# br test score: 
# 0.006896130293002622

# rf test score: 
# 0.07139091002869702

# lasso test score: 
# 0.1924431679043039

# 2 class

# logreg test score: 
# 0.6736577181208053

# br test score: 
# 0.6375838926174496

# rf test score: 
# 0.6585570469798657

# 3 reg

# br test score: 
# 0.05994540328342234

# rf test score: 
# -0.03186605837286138

# lasso test score: 
# 0.1924431679043039

# 3 class

# logreg test score: 
# 0.6736577181208053

# br test score: 
# 0.6384228187919463

# rf test score: 
# 0.6719798657718121

# 4 reg

# br test score: 
# 0.023266042810753954

# rf test score: 
# 0.07619378931494514

# lasso test score: 
# 0.21460320119560472

# 4 class 

# logreg test score: 
# 0.6854026845637584

# br test score: 
# 0.6434563758389261

# rf test score: 
# 0.6375838926174496

# 5 reg

# br test score: 
# 0.005276011558945859

# rf test score: 
# 0.03497975713168888

# lasso test score: 
# 0.21460320119560497

# 5 class 

# logreg test score: 
# 0.6854026845637584

# br test score: 
# 0.6518456375838926

# rf test score: 
# 0.6610738255033557

# 6 reg

# br test score: 
# 0.03739589734130877

# rf test score: 
# -0.02765041735558893

# lasso test score: 
# 0.21460320119560503

# 6 class

# logreg test score: 
# 0.6854026845637584

# br test score: 
# 0.6593959731543624

# rf test score: 
# 0.6753355704697986

br test score: 
0.03739589734130877

rf test score: 
-0.02765041735558893

lasso test score: 
0.21460320119560503

logreg test score: 
0.6854026845637584

br test score: 
0.6593959731543624

rf test score: 
0.6753355704697986
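
The median split used in the cell above can be sketched on its own. This is a minimal plain-Python illustration with made-up scores, not the notebook's data; `binarize_by_median` is a hypothetical helper name. It mirrors the two details that matter in the original loops: the cutoff comes from the training scores only, and a score exactly equal to the median lands in class 0.

```python
from statistics import median


def binarize_by_median(train_scores, test_scores):
    """Label scores 1 if strictly above the training median, else 0."""
    cutoff = median(train_scores)  # computed on the training set only

    def to_label(score):
        return 1 if score > cutoff else 0

    train_labels = [to_label(s) for s in train_scores]
    test_labels = [to_label(s) for s in test_scores]
    return cutoff, train_labels, test_labels


# Made-up Metascore-like values for illustration.
train = [35, 50, 62, 71, 88]
test = [40, 62, 90]

cutoff, y_train_cls, y_test_cls = binarize_by_median(train, test)
print(cutoff, y_train_cls, y_test_cls)
```

Reusing the training cutoff on the test set keeps information from the test scores out of the labels, at the cost of test classes that may not be perfectly balanced.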



In [None]:
cap_mods = pd.read_csv('capstone_models_1.csv')
cap_mods.columns = ['', '1 df 3', '2 df 3', '3 df 3', '1 df 2', '2 df2 ',
       '3 df 2 ']
cap_mods = cap_mods.set_index('')
cap_mods_class = cap_mods.iloc[3:,:].copy()
cap_mods_reg = cap_mods.iloc[:3,:].copy()

In [None]:
sns.set_style("darkgrid",{"xtick.color":"black", "ytick.color":"black"})
plt.figure(figsize=(10,5))
sns.heatmap(cap_mods_reg, annot = True, cmap="Greens")
# plt.tick_params(color='white', labelcolor='white');

In [None]:
# sns.set_style("dark",{"xtick.color":"white", "ytick.color":"white"})
plt.figure(figsize=(10,5))
sns.heatmap(cap_mods_class, annot = True, cmap = "Blues")
# plt.tick_params(color='white', labelcolor='white');

After analyzing the output from our models, we decided to tune hyperparameters on the 3 df 2 dataframe, aka #3. We tuned random forest, logreg, and LASSO, omitting the others for time. Frankly, the differences in performance are largely negligible, but we might as well take the .02 bump provided by our best models.

In [71]:
cap_mods

| Unnamed: 0.1 | Unnamed: 0 | 1 df 3 | 2 df 3 | 3 df 3 | 1 df 2 | 2 df2 | 3 df 2 |
|---|---|---|---|---|---|---|---|
| 0 | br reg | 0.038343 | 0.006896 | 0.059945 | 0.023266 | 0.005276 | 0.037396 |
| 1 | rf reg | 0.118326 | 0.071391 | -0.031866 | 0.076194 | 0.03498 | -0.02765 |
| 2 | lasso reg | 0.192443 | 0.192443 | 0.192443 | 0.214603 | 0.214603 | 0.214603 |
| 3 | logreg class | 0.673658 | 0.673658 | 0.673658 | 0.685403 | 0.685403 | 0.685403 |
| 4 | br class | 0.662752 | 0.637584 | 0.638423 | 0.643456 | 0.651846 | 0.659396 |
| 5 | rf class | 0.675336 | 0.658557 | 0.67198 | 0.637584 | 0.661074 | 0.675336 |


In [78]:
# y_train = X_train.Metascore
# y_test = X_test.Metascore

# X_train.drop(['Metascore'], axis=1, inplace=True)
# X_test.drop(['Metascore'], axis=1, inplace=True)

# ss = StandardScaler()
# X_train = ss.fit_transform(X_train)
# X_test = ss.transform(X_test)

# median = np.median(y_train)

# new_y = []
# for n in y_train:
#     if n > median:
#         new_y.append(1)
#     else:
#         new_y.append(0)
# y_train = new_y

# new_y = []
# for n in y_test:
#     if n > median:
#         new_y.append(1)
#     else:
#         new_y.append(0)
# y_test = new_y

rf_params = {
    'max_depth': [None],
    'n_estimators': [200],
    'max_features': [2, 10],
}

gs = GridSearchCV(rf, param_grid=rf_params)
gs.fit(X_train, y_train)
print(gs.score(X_test, y_test))
print(gs.best_score_)
print(gs.best_params_)

# 0.7030201342281879
# 0.667
# {'max_depth': None, 'max_features': 10, 'n_estimators': 200}

# 0.697986577181208
# 0.6808888888888889
# {'max_depth': 1000, 'max_features': 2, 'n_estimators': 200}

0.697986577181208
0.6808888888888889
{'max_depth': 1000, 'max_features': 2, 'n_estimators': 200}
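
To see why the grids here were kept small, it helps to count the work a grid search does. The sketch below expands a parameter grid into its candidate combinations in plain Python; `expand_grid` is a hypothetical helper, and `n_folds = 3` is an assumption about the cross-validation setting, since the default number of folds depends on the scikit-learn version.

```python
from itertools import product


def expand_grid(param_grid):
    """List every parameter combination in a grid, one dict per candidate."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


rf_params = {
    'max_depth': [None],
    'n_estimators': [200],
    'max_features': [2, 10],
}

candidates = expand_grid(rf_params)
n_folds = 3  # assumption: the fold count is version-dependent
total_fits = len(candidates) * n_folds

for params in candidates:
    print(params)
print('cross-validation fits:', total_fits)
```

Each extra value in any list multiplies the number of candidates, and every candidate is fit once per fold, so a grid of 200-tree forests grows expensive quickly. With the default `refit=True`, the best candidate is also fit one more time on the full training set.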


In [89]:
lasso = LassoCV()
lasso.fit(X_train, y_train)
lasso.score(X_test, y_test)

0.21850824761335874

In [88]:
lasso = Lasso()
# Lasso's regularization parameter is `alpha`; `alphas` belongs to LassoCV.
# None is not a valid alpha, so the grid uses .15 (from above) and Lasso's default of 1.0.
lasso_params = {
    'alpha': [.15, 1.0],
}

gs = GridSearchCV(lasso, param_grid=lasso_params)
gs.fit(X_train, y_train)
print(gs.score(X_test, y_test))
print(gs.best_score_)
print(gs.best_params_)

In [87]:
logreg_params = {
    'penalty': ['l1'],
    'C': [10, 100],
}

gs = GridSearchCV(logreg, param_grid=logreg_params)
gs.fit(X_train, y_train)
print(gs.score(X_test, y_test))
print(gs.best_score_)
print(gs.best_params_)

# 0.6971476510067114
# 0.6945555555555556
# {'C': 10, 'penalty': 'l1'}

# 0.7055369127516778
# 0.6921111111111111
# {'C': 10, 'penalty': 'l1'}


Our best classifier (logreg) accuracy was 

0.6945555555555556 

using C = 10 with an l1 penalty. 

And our best regression R$^2$ score was 

0.21460320119560503

with $\alpha$ = .15.
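
For context on the regression numbers, the `.score` method of scikit-learn regressors returns R$^2$, which compares a model's squared error against that of always predicting the mean. The plain-Python sketch below uses made-up values and a hypothetical helper name, `r2_score_manual`, to show how the score can reach zero or go negative, as the random forest regressor's did on some dataframes.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up scores for illustration; their mean is 64.
actual = [40, 55, 60, 75, 90]
mean_guess = [64] * len(actual)      # always predict the mean
reversed_guess = [90, 75, 60, 55, 40]  # systematically wrong predictions

print(r2_score_manual(actual, actual))          # perfect predictions
print(r2_score_manual(actual, mean_guess))      # no better than the mean
print(r2_score_manual(actual, reversed_guess))  # worse than the mean
```

Read this way, an R$^2$ of about .21 means the LASSO model removes roughly a fifth of the squared error left by a mean-only baseline, and a negative score means the model did worse than that baseline on the test set.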

There is no reason we shouldn't be able to achieve better than this given more time in the future. 

Future recommendations are numerous. There are many possible ways to improve these scores, the only constraint being time.

In terms of data collection, there are several other large databases to access, including IMDb's own as well as Metacritic's. It is entirely possible we already have all the Metacritic scores, but we could always use more. Plus, Metacritic has statistics such as whether a movie is part of a franchise and how well the previous film did. We could, of course, build that data ourselves, but again, time is a factor here.

We would also like data on more of the cast and crew, including producers, cinematographers, composers, editors, and additional cast members. After all, the theory underlying this entire endeavor is that people make movies and people are consistent in their product.

We could impute null values, especially for things like box office revenue, opening weekend box office revenue, and Rotten Tomatoes scores, any of which could replace Metacritic scores as the target variable. It would then be a simple mapping from one to the other. There could easily be more Rotten Tomatoes scores than Metacritic scores.

In terms of feature engineering, there are always more columns to make. We could use polynomial features on our numerical data. We could use only directors and writers. We could run more n-grams on the titles. We could change our min_df per column. We could sift down our list of actor weights. We could go back and try to get the actors' averages like before.

Finally, there are more models for us to use. Several will allow us to tune hyperparameters to eke out better scores. There are models that work better with NLP. We can try a neural network for both classification and regression. We can try a passive aggressive classifier. And we'll do all that and we'll predict movie scores and eventually they'll make a movie about us.

And that's my capstone! Wasn't it great? 