# Modeling: The Movie

(See the README of this repository for the entire write-up.)

For modeling, we took the approach of throwing everything at the wall and seeing what stuck. We tried many different models: regressors including linear regression, lasso, SGD regressor, bagging regressor, random forest regressor, SVR, and AdaBoost regressor, as well as classifiers including logistic regression, random forest classifier, AdaBoost classifier, k-nearest neighbors classifier, decision tree classifier, and even a neural network.

In [60]:
import imdb
import re
import pandas as pd
import numpy as np
import ast
from datetime import datetime, timedelta
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso, SGDRegressor
from sklearn.feature_selection import RFE
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.metrics import mean_squared_error, f1_score
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.utils import np_utils
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
import matplotlib.pyplot as plt 
import seaborn as sns

%matplotlib inline

We brought in our six dataframes:
1. 1 df 2 = directors weighted, deleted columns with 1 or fewer terms
2. 2 df 2 = directors and actors weighted, deleted columns with 1 or fewer terms
3. 3 df 2 = directors, actors, and writers weighted, deleted columns with 1 or fewer terms
4. 1 df 3 = directors weighted, deleted columns with 2 or fewer terms
5. 2 df 3 = directors and actors weighted, deleted columns with 2 or fewer terms
6. 3 df 3 = directors, actors, and writers weighted, deleted columns with 2 or fewer terms
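
The "deleted columns with N or fewer terms" step can be sketched in pandas roughly as follows. This is a minimal illustration under assumptions, not the code that built the dataframes above: the toy frame, its column names, and the helper `drop_rare_columns` are all hypothetical, and the original pruning may have been done differently (for example, through a vectorizer's minimum document frequency).

```python
import pandas as pd


def drop_rare_columns(df, min_count):
    """Keep only columns with at least `min_count` nonzero entries.

    Hypothetical helper: min_count=2 mirrors "deleted columns with
    1 or fewer terms"; min_count=3 mirrors "2 or fewer terms".
    """
    nonzero_counts = (df != 0).sum(axis=0)
    return df.loc[:, nonzero_counts >= min_count]


# Toy matrix (made up): one row per movie, one column per person.
toy = pd.DataFrame({
    "director_a": [1, 1, 1, 0],  # nonzero in 3 movies
    "actor_b":    [1, 0, 1, 0],  # nonzero in 2 movies
    "writer_c":   [0, 0, 0, 1],  # nonzero in 1 movie
})

df2 = drop_rare_columns(toy, min_count=2)  # analogous to the "df 2" frames
df3 = drop_rare_columns(toy, min_count=3)  # analogous to the "df 3" frames
print(list(df2.columns))
print(list(df3.columns))
```

Counting nonzero entries rather than summing values keeps the rule meaningful even when the columns hold weights instead of 0/1 indicators.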

In [39]:
# Pre-made weighted dataframes: uncomment one train/test pair to load it
# (the trailing number matches the numbered list above)

# X_train = pd.read_csv('train_everything_director_weights_df2.csv') # 1
# X_test = pd.read_csv('test_everything_director_weights_df2.csv') # 1
# X_train = pd.read_csv('train_everything_director_actor_weights_df2.csv') # 2
# X_test = pd.read_csv('test_everything_director_actor_weights_df2.csv') # 2 
X_train = pd.read_csv('train_everything_director_actor_writer_weights_df2.csv') # 3
X_test = pd.read_csv('test_everything_director_actor_writer_weights_df2.csv') # 3
# X_train = pd.read_csv('train_everything_director_weights_df3.csv') # 4
# X_test = pd.read_csv('test_everything_director_weights_df3.csv') # 4
# X_train = pd.read_csv('train_everything_director_actor_weights_df3.csv') # 5
# X_test = pd.read_csv('test_everything_director_actor_weights_df3.csv') # 5
# X_train = pd.read_csv('train_everything_director_actor_writer_weights_df3.csv') # 6
# X_test = pd.read_csv('test_everything_director_actor_writer_weights_df3.csv') # 6

We then fed each dataframe through the following cell, which produced three regressor scores, then binarized the y variable for classification (above or not above the median Metacritic score of the training set) and produced three classifier scores. Along the way, many models were tried and discarded, and the dataframes were changed, re-saved, and reloaded. In the end we settled on the following models:

- Regression
    - Bagging Regressor
    - Random Forest Regressor
    - LASSO
- Classification
    - Logistic Regression
    - Bagging Classifier
    - Random Forest Classifier
    
Except for LASSO and logistic regression, there wasn't much rhyme or reason to these modeling choices: they gave us the best scores of the models we tried, and they didn't take a huge amount of time to fit. The bagging regressor and classifier never scored as well as the other models, but they ran quickly and served as a canary in a coal mine, warning us if something had gone wrong with the models.
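
The workflow described above (scale, score three regressors, split the target at the training median, score three classifiers) can be sketched as a loop over named models. This is a hedged sketch rather than the notebook's actual cell: it runs on synthetic data from `make_regression`, and the dictionary names, `random_state` values, and `max_iter` setting are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (BaggingClassifier, BaggingRegressor,
                              RandomForestClassifier, RandomForestRegressor)
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the movie dataframes).
X, y = make_regression(n_samples=300, n_features=20, noise=10.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

regressors = {
    "br": BaggingRegressor(random_state=0),
    "rf": RandomForestRegressor(random_state=0),
    "lasso": Lasso(alpha=0.15),
}
reg_scores = {}
for name, model in regressors.items():
    model.fit(X_train, y_train)
    # For regressors, .score() returns R^2 on the given data.
    reg_scores[name] = model.score(X_test, y_test)
    print(f"{name} reg test score: {reg_scores[name]:.3f}")

# Binarize the target at the training-set median (1 = above the median).
median = np.median(y_train)
y_train_cls = (y_train > median).astype(int)
y_test_cls = (y_test > median).astype(int)

classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "br": BaggingClassifier(random_state=0),
    "rf": RandomForestClassifier(random_state=0),
}
clf_scores = {}
for name, model in classifiers.items():
    model.fit(X_train, y_train_cls)
    # For classifiers, .score() returns accuracy on the given data.
    clf_scores[name] = model.score(X_test, y_test_cls)
    print(f"{name} class test score: {clf_scores[name]:.3f}")
```

Keeping the models in dictionaries avoids reusing names such as `br` and `rf` for both a regressor and a classifier, and makes it easy to add or drop a model while experimenting.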

In [51]:
y_train = X_train.Metascore
y_test = X_test.Metascore

X_train.drop(['Metascore'], axis=1, inplace=True)
X_test.drop(['Metascore'], axis=1, inplace=True)

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

br = BaggingRegressor()
br.fit(X_train, y_train)
# print('br train score: ')
# print(br.score(X_train, y_train))
print('br test score: ')
print(br.score(X_test, y_test))
print()

rf = RandomForestRegressor()
rf.fit(X_train, y_train)
# print('rf train score: ')
# print(rf.score(X_train, y_train))
print('rf test score: ')
print(rf.score(X_test, y_test))
print()

lasso = Lasso(alpha=0.15)
lasso.fit(X_train, y_train)
# print('lasso train score: ')
# print(lasso.score(X_train, y_train))
print('lasso test score: ')
print(lasso.score(X_test, y_test))
print()

# Binarize the target for classification: 1 if above the training-set
# median Metascore, else 0 (the same threshold is applied to the test set)
median = np.median(y_train)
y_train = (y_train > median).astype(int)
y_test = (y_test > median).astype(int)

logreg = LogisticRegression() 
logreg.fit(X_train, y_train)
# print('logreg train score: ')
# print(logreg.score(X_train, y_train))
print('logreg test score: ')
print(logreg.score(X_test, y_test))
print()

br = BaggingClassifier()
br.fit(X_train, y_train)
# print('br train score: ')
# print(br.score(X_train, y_train))
print('br test score: ')
print(br.score(X_test, y_test))
print()

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
# print('rf train score: ')
# print(rf.score(X_train, y_train))
print('rf test score: ')
print(rf.score(X_test, y_test))
print()

# 1 reg 

# br test score: 
# 0.038342711196289625

# rf test score: 
# 0.11832620794676674

# lasso test score: 
# 0.19244316790430385

# 1 class 

# logreg test score: 
# 0.6736577181208053

# br test score: 
# 0.662751677852349

# rf test score: 
# 0.6753355704697986

# 2 reg 

# br test score: 
# 0.006896130293002622

# rf test score: 
# 0.07139091002869702

# lasso test score: 
# 0.1924431679043039

# 2 class

# logreg test score: 
# 0.6736577181208053

# br test score: 
# 0.6375838926174496

# rf test score: 
# 0.6585570469798657

# 3 reg

# br test score: 
# 0.05994540328342234

# rf test score: 
# -0.03186605837286138

# lasso test score: 
# 0.1924431679043039

# 3 class

# logreg test score: 
# 0.6736577181208053

# br test score: 
# 0.6384228187919463

# rf test score: 
# 0.6719798657718121

# 4 reg

# br test score: 
# 0.023266042810753954

# rf test score: 
# 0.07619378931494514

# lasso test score: 
# 0.21460320119560472

# 4 class 

# logreg test score: 
# 0.6854026845637584

# br test score: 
# 0.6434563758389261

# rf test score: 
# 0.6375838926174496

# 5 reg

# br test score: 
# 0.005276011558945859

# rf test score: 
# 0.03497975713168888

# lasso test score: 
# 0.21460320119560497

# 5 class 

# logreg test score: 
# 0.6854026845637584

# br test score: 
# 0.6518456375838926

# rf test score: 
# 0.6610738255033557

# 6 reg

# br test score: 
# 0.03739589734130877

# rf test score: 
# -0.02765041735558893

# lasso test score: 
# 0.21460320119560503

# 6 class

# logreg test score: 
# 0.6854026845637584

# br test score: 
# 0.6593959731543624

# rf test score: 
# 0.6753355704697986



br test score: 
0.03739589734130877

rf test score: 
-0.02765041735558893

lasso test score: 
0.21460320119560503

logreg test score: 
0.6854026845637584

br test score: 
0.6593959731543624

rf test score: 
0.6753355704697986



In [None]:
cap_mods = pd.read_csv('capstone_models_1.csv')

cap_mods.columns

cap_mods.columns = ['', '1 df 3', '2 df 3', '3 df 3', '1 df 2', '2 df2 ',
       '3 df 2 ']

cap_mods = cap_mods.set_index('')

cap_mods_class = cap_mods.iloc[3:,:].copy()

cap_mods_reg = cap_mods.iloc[:3,:].copy()

cap_mods

In [None]:
sns.set_style("darkgrid",{"xtick.color":"black", "ytick.color":"black"})
plt.figure(figsize=(10,5))
sns.heatmap(cap_mods_reg, annot = True, cmap="Greens")
# plt.tick_params(color='white', labelcolor='white');

In [None]:
# sns.set_style("dark",{"xtick.color":"white", "ytick.color":"white"})
plt.figure(figsize=(10,5))
sns.heatmap(cap_mods_class, annot = True, cmap = "Blues")
# plt.tick_params(color='white', labelcolor='white');

In [71]:
cap_mods

| Unnamed: 0.1 | Unnamed: 0 | 1 df 3 | 2 df 3 | 3 df 3 | 1 df 2 | 2 df2 | 3 df 2 |
|---|---|---|---|---|---|---|---|
| 0 | br reg | 0.038343 | 0.006896 | 0.059945 | 0.023266 | 0.005276 | 0.037396 |
| 1 | rf reg | 0.118326 | 0.071391 | -0.031866 | 0.076194 | 0.03498 | -0.02765 |
| 2 | lasso reg | 0.192443 | 0.192443 | 0.192443 | 0.214603 | 0.214603 | 0.214603 |
| 3 | logreg class | 0.673658 | 0.673658 | 0.673658 | 0.685403 | 0.685403 | 0.685403 |
| 4 | br class | 0.662752 | 0.637584 | 0.638423 | 0.643456 | 0.651846 | 0.659396 |
| 5 | rf class | 0.675336 | 0.658557 | 0.67198 | 0.637584 | 0.661074 | 0.675336 |


In [None]:
# y_train = X_train.Metascore
# y_test = X_test.Metascore

# X_train.drop(['Metascore'], axis=1, inplace=True)
# X_test.drop(['Metascore'], axis=1, inplace=True)

# ss = StandardScaler()
# X_train = ss.fit_transform(X_train)
# X_test = ss.transform(X_test)

# median = np.median(y_train)

# new_y = []
# for n in y_train:
#     if n > median:
#         new_y.append(1)
#     else:
#         new_y.append(0)
# y_train = new_y

# new_y = []
# for n in y_test:
#     if n > median:
#         new_y.append(1)
#     else:
#         new_y.append(0)
# y_test = new_y

rf_params = {
    'max_depth': [None, 1000],
    'n_estimators': [200],
    'max_features': [2, 10],
}

gs = GridSearchCV(rf, param_grid=rf_params)
gs.fit(X_train, y_train)
print(gs.score(X_test, y_test))
print(gs.best_score_)
print(gs.best_params_)

# 0.7030201342281879
# 0.667
# {'max_depth': None, 'max_features': 10, 'n_estimators': 200}

# Capstone Project

Your Capstone project is the culmination of your time at GA. You will be tasked with developing an interesting question, collecting the data required to answer that question, developing the strongest model (or models) for prediction, and communicating those findings to other data scientists and non-technical individuals. This introductory document lays out the five constituent portions of the project and their due dates.

## Your Deliverables

- A well-made predictive model using either structured or unstructured machine learning techniques (or another technique approved in advance by the global instructors), as well as clean, well-written code.
- A technical report aimed at fellow data scientists that explains your process and findings.
- A public presentation of your findings aimed at laypeople.

### **[Capstone, Part 1: Topic Proposals](./part_01/)**

In Part 1, get started by choosing **three potential topics and problems**, describing your goals & criteria for success, potential audience(s), and identifying 1-2 potential datasets. In the field of data science, good projects are practical. Your capstone project should be manageable and affect a real world audience. This might be a domain you are familiar with, a particular interest you have, something that affects a community you are involved in, or an area that relates to a field you wish to work in.

One of the best ways to test ideas quickly is to share them with others. A good data scientist has to be comfortable discussing ideas and presenting to audiences. That's why for Part 1 of your Capstone project, you'll be preparing a lightning talk in addition to your initial notebook outlining the scope of your project.  You will present your candidate topics in a slide deck, and should be prepared to answer questions and defend your data selection(s). Presentations should take no more than 3-5 minutes.

**The ultimate choice of topic for your capstone project is yours!** However, this is research and development work. Sometimes projects that look easy can be difficult and vice versa. It never hurts to have a second (or third) option available.

- **Goal**: Prepare a 3-5 minute lightning talk that covers three potential topics, including potential sources of data, goals, metrics and audience.
- **Due**: Thursday, June 7

### **[Capstone, Part 2: Problem Statement + EDA](./part_02/)**

For Part 2, provide a clear statement of the problem that you have chosen and an overview of your approach to solving that problem. Summarize your objectives, goals & success metrics, and any risks & assumptions. Outline your proposed methods and models, perform your initial EDA, and summarize the process. **Your data should be in hand by this point in the process!**

**Again, your data should be in hand by this point in the process!**

- **Goal**: Describe your proposed approach and summarize your initial EDA in a code submission to your local instructor ([submission link](https://docs.google.com/forms/d/e/1FAIpQLScez-8PsyIgP548fNtsoDpuNTdKxsr6tVvKPDtbr-mQov6NCw/viewform?usp=sf_link))
- **Due**: Wednesday, June 20

### **[Capstone, Part 3: Progress Report + Preliminary Findings](./part_03/)**

In Part 3, you'll create a progress report of your work in order to get feedback along the way. Describe your approach, initial results, and any setbacks or lessons learned so far. Your report should include updated visual and statistical analysis of your data. You’ll also meet with your local instructional team to get feedback on your results so far!

- **Goal**: Discuss progress and setbacks, include visual and statistical analysis, review with instructor. (A submission link for your progress report will be provided prior to the due date.)
- **Due**: Monday, July 2

### **[Capstone, Part 4: Report Writeup + Technical Analysis](./part_04/)**

By now, you're ready to apply your modeling skills to make machine learning predictions. Your goal for Part 4 is to develop a technical document (in the form of a Jupyter notebook) that can be shared among your peers.

Document your research and analysis, including a summary, an explanation of your modeling approach, and the strengths and weaknesses of any variables in the process. You should provide insight into your analysis, using best practices like cross-validation or applicable prediction metrics.

- **Goal**: Detailed report and code with a summary of your statistical analysis, model, and evaluation metrics.
- **Due**: Friday, July 13

### **[Capstone, Part 5: Presentation + Recommendations](./part_05/)**

Whether during an interview or as part of a job, you will frequently have to present your findings to business partners and other interested parties - many of whom won't know anything about data science! That's why for Part 5, you'll create a presentation of your previous findings with a non-technical audience in mind.

You should already have the analytical work complete, so now it's time to clean up and clarify your findings. Come up with a detailed slide deck or interactive demo that explains your data, visualizes your model, describes your approach, articulates strengths and weaknesses, and presents specific recommendations. Be prepared to explain and defend your model to an inquisitive audience!

- **Goal**: Detailed presentation deck that relates your data, model, and findings to a non-technical audience.
- **Due**: Tuesday, July 17
