# Modeling

So far, we have over 700 scripts converted into bag-of-words features, along with ratings (from 1 to 10). Let's start modeling!

We'll start by loading the data from the csv file, deleting the unnecessary columns, dividing into X (features) and y (ratings), and creating a training set and a testing set.

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split

ratingsScripts = pd.read_csv('ratingsAndScriptsBagOfWords.csv')

In [16]:
ratingsScriptsML = ratingsScripts.drop(['movie_name','tconst','titleType','primaryTitle','startYear','genres','movie_title','numVotes'], axis=1)

In [17]:
X = ratingsScriptsML.drop('averageRating', axis=1)
# Round the ratings to whole numbers so they can serve as class labels for a classifier
y = round(ratingsScriptsML['averageRating'], 0)

X_tr, X_te, y_tr, y_te = train_test_split(X,y,test_size = .2, random_state = 42)
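
One detail worth knowing about the rounding step: pandas rounds exact halves to the nearest even integer, so a rating of 6.5 becomes 6.0 while 7.5 becomes 8.0. The sketch below uses made-up ratings (`demo_ratings` is a hypothetical name, not data from the CSV file) to show the behavior next to a round-half-up alternative. Whether this matters here depends on how many ratings end in .5.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings chosen to include exact halves
demo_ratings = pd.Series([5.5, 6.5, 7.5, 6.4, 6.6])

# Same call as in the notebook: exact halves go to the nearest even integer
rounded_half_even = round(demo_ratings, 0)

# Alternative (an assumption, not what the notebook does): always round halves up
rounded_half_up = np.floor(demo_ratings + 0.5)

print(rounded_half_even.tolist())  # [6.0, 6.0, 8.0, 6.0, 7.0]
print(rounded_half_up.tolist())    # [6.0, 7.0, 8.0, 6.0, 7.0]
```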

Confirm there are no duplicates. An empty result (column headers only, no rows) means no duplicated rows were found.

In [5]:
duplicateRowsDF = ratingsScriptsML[ratingsScriptsML.duplicated()]
duplicateRowsDF

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,00,000,10,100,101,102,103,104,...,yourselves,youth,zero,zip,zips,zone,zoo,zoom,zooms,averageRating


We have too many features. Let's reduce them with truncated SVD.

In [18]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components = 27, random_state = 42)
svd.fit(X_tr) # fit on the training set only, so nothing leaks from the test set
X_tr_transformed = svd.transform(X_tr)
X_te_transformed = svd.transform(X_te)
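
The choice of 27 components is not explained in the notebook. One way to sanity-check such a choice is to look at how much variance the retained components keep. The sketch below does this on random stand-in data (`X_demo` is hypothetical; the real input would be `X_tr`), using the `explained_variance_ratio_` attribute of a fitted `TruncatedSVD`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)
# Hypothetical stand-in for a bag-of-words matrix: 200 "scripts" x 500 "words"
X_demo = rng.poisson(lam=1.0, size=(200, 500)).astype(float)

svd_demo = TruncatedSVD(n_components=27, random_state=42)
X_demo_reduced = svd_demo.fit_transform(X_demo)

# Running total of the variance share kept by the components
cumulative = np.cumsum(svd_demo.explained_variance_ratio_)

print(X_demo_reduced.shape)
print("variance retained by 27 components: %.3f" % cumulative[-1])
```

If the retained share is low on the real data, trying a larger `n_components` and re-checking validation scores would be a reasonable next step.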

In [19]:
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
from sklearn.linear_model import LinearRegression
from sklearn.metrics import classification_report,confusion_matrix,roc_curve,roc_auc_score
from sklearn.metrics import accuracy_score,log_loss
from matplotlib import pyplot
from sklearn.ensemble import RandomForestClassifier

Create a validation set so that the test set stays untouched until the very end.

In [20]:
X_train, X_val, y_train, y_val = train_test_split(X_tr_transformed,y_tr,test_size = .2, random_state = 42)
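
Because the validation split is taken from the 80% that remained after the first split, the final proportions are not 80/20. A small arithmetic sketch (the variable names are illustrative only):

```python
test_fraction = 0.2          # test_size of the first split
val_fraction_of_rest = 0.2   # test_size of the second split, applied to the remainder

train_share = (1 - test_fraction) * (1 - val_fraction_of_rest)
val_share = (1 - test_fraction) * val_fraction_of_rest
test_share = test_fraction

print("train=%.2f, validation=%.2f, test=%.2f" % (train_share, val_share, test_share))
# train=0.64, validation=0.16, test=0.20
```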

First, let's try a very simple model: predict the rounded mean of the training ratings for every script.

In [32]:
import numpy as np

y_val_mean = [round(np.mean(y_train),0)] * len(y_val)

In [33]:
y_val_mean[0:5]

[6.0, 6.0, 6.0, 6.0, 6.0]

In [34]:
ac = accuracy_score(y_val, y_val_mean)

f1 = f1_score(y_val, y_val_mean, average='weighted')

print('Simple Average: Accuracy=%.3f' % (ac))

print('Simple Average: f1-score=%.3f' % (f1))

Simple Average: Accuracy=0.241
Simple Average: f1-score=0.094


So we have lots of room for improvement!
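
The F1-score is much lower than the accuracy because the simple model only ever predicts one class. With a weighted average, every class that is never predicted contributes an F1 of 0, so only the predicted class counts. The sketch below reproduces the reported F1 from the reported accuracy alone (`constant_prediction_weighted_f1` is a hypothetical helper, not part of scikit-learn).

```python
def constant_prediction_weighted_f1(accuracy):
    """Weighted F1 when every prediction is the same class.

    `accuracy` is the share of samples that truly belong to that class.
    """
    precision = accuracy  # share of the (single-class) predictions that are right
    recall = 1.0          # every true member of that class is predicted
    f1_predicted_class = 2 * precision * recall / (precision + recall)
    # Classes that are never predicted get F1 = 0, so the support-weighted
    # average keeps only the predicted class, weighted by its share of samples.
    return accuracy * f1_predicted_class

print(round(constant_prediction_weighted_f1(0.241), 3))  # 0.094
```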

Try a random forest.

In [72]:
def rfc_and_print_scores(X_tra, y_tra, X_va, y_va, n_estimators, max_depth, criterion):
    """Fit a random forest on (X_tra, y_tra) and print accuracy and weighted F1 on (X_va, y_va)."""
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, criterion=criterion, random_state=1, n_jobs=-1)
    model_res = clf.fit(X_tra, y_tra)
    y_pred = model_res.predict(X_va)

    ac = accuracy_score(y_va, y_pred)

    f1 = f1_score(y_va, y_pred, average='weighted')

    print('Random Forest: Accuracy=%.3f' % (ac))

    print('Random Forest: f1-score=%.3f' % (f1))

In [73]:
rfc_and_print_scores(X_train, y_train, X_val, y_val,300,None,'gini')

Random Forest: Accuracy=0.345
Random Forest: f1-score=0.309


Well that's better - especially the F1 score!

Let's see what happens when we change the number of estimators - we don't have a lot of scripts

In [74]:
rfc_and_print_scores(X_train, y_train, X_val, y_val,150,None,'gini')

Random Forest: Accuracy=0.362
Random Forest: f1-score=0.324


That's a little better. What happens when we reduce the number of estimators further?

In [75]:
rfc_and_print_scores(X_train, y_train, X_val, y_val,100,None,'gini')

Random Forest: Accuracy=0.345
Random Forest: f1-score=0.314


Accuracy goes down again, so 150 looks like a good number of estimators for the random forest.
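
Differences of one or two percentage points on a single, small validation split can come from the split itself rather than from the hyperparameter. A hedged alternative is k-fold cross-validation, which averages over several splits. The sketch below uses synthetic data from `make_classification` as a stand-in (`X_demo` and `y_demo` are hypothetical; the real inputs would be the SVD-transformed training data and rounded ratings).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for the 27-component training data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=27, n_informative=10,
    n_classes=4, random_state=42)

cv_results = {}
for n_estimators in (100, 150, 300):
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=1)
    # Five folds: each fold is used once for validation
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[n_estimators] = (scores.mean(), scores.std())
    print("%d trees: accuracy=%.3f +/- %.3f" % (n_estimators, scores.mean(), scores.std()))
```

If the spread across folds is larger than the gap between two settings, the settings are hard to tell apart on this amount of data.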

What about linear regression?

In [49]:
lr = LinearRegression()

lr.fit(X_train,y_train)

y_pred_lr = lr.predict(X_val)

ac = accuracy_score(y_val, np.round(y_pred_lr,0))

f1 = f1_score(y_val, np.round(y_pred_lr,0), average='weighted')

print('Linear Regression: Accuracy=%.3f' % (ac))

print('Linear Regression: f1-score=%.3f' % (f1))

Linear Regression: Accuracy=0.250
Linear Regression: f1-score=0.147


Not as good as the random forest - barely better than the simple model!

We can try gradient boosting

In [51]:
from sklearn.ensemble import GradientBoostingClassifier

In [77]:
learning_rates = [0.05, 0.1, 0.25, 0.5, 0.75, 1]
for learning_rate in learning_rates:
    gb = GradientBoostingClassifier(n_estimators=20, learning_rate = learning_rate, max_features=2, max_depth = 2, random_state = 0)
    gb.fit(X_train, y_train)
    y_pred_gb = gb.predict(X_val)
    # The classifier already returns class labels, so no rounding is needed
    ac = accuracy_score(y_val, y_pred_gb)
    f1 = f1_score(y_val, y_pred_gb, average='weighted')
    print("Learning rate: ", learning_rate)
    print("Accuracy score : {0:.3f}".format(ac))
    print("F1 score: {0:.3f}".format(f1))
    print()

Learning rate:  0.05
Accuracy score : 0.328
F1 score: 0.245

Learning rate:  0.1
Accuracy score : 0.345
F1 score: 0.278

Learning rate:  0.25
Accuracy score : 0.336
F1 score: 0.281

Learning rate:  0.5
Accuracy score : 0.293
F1 score: 0.278

Learning rate:  0.75
Accuracy score : 0.310
F1 score: 0.287

Learning rate:  1
Accuracy score : 0.302
F1 score: 0.299



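The random forest performed better: its best run so far (accuracy 0.362, F1-score 0.324) beats every gradient boosting configuration above.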

Let's make sure we have the best random forest.

In [78]:
num_trees_list = [130, 140, 150, 160, 170]
for tree in num_trees_list:
    print('Number of trees: ', tree)
    rfc_and_print_scores(X_train, y_train, X_val, y_val,tree,None,'gini')

Number of trees:  130
Random Forest: Accuracy=0.371
Random Forest: f1-score=0.332
Number of trees:  140
Random Forest: Accuracy=0.362
Random Forest: f1-score=0.322
Number of trees:  150
Random Forest: Accuracy=0.362
Random Forest: f1-score=0.324
Number of trees:  160
Random Forest: Accuracy=0.371
Random Forest: f1-score=0.330
Number of trees:  170
Random Forest: Accuracy=0.345
Random Forest: f1-score=0.309


It looks like 130 trees gives us the best score: it ties with 160 trees on accuracy (0.371) and has a slightly higher F1-score (0.332 vs. 0.330).

Let's tune another hyperparameter, max_depth.

In [79]:
max_depths = [2, 5, 10, 15, None]
for depth in max_depths:
    print('Max depth: ', depth)
    rfc_and_print_scores(X_train, y_train, X_val, y_val,130,depth,'gini')

Max depth:  2
Random Forest: Accuracy=0.319
Random Forest: f1-score=0.236
Max depth:  5
Random Forest: Accuracy=0.379
Random Forest: f1-score=0.292
Max depth:  10
Random Forest: Accuracy=0.362
Random Forest: f1-score=0.320
Max depth:  15
Random Forest: Accuracy=0.379
Random Forest: f1-score=0.338
Max depth:  None
Random Forest: Accuracy=0.371
Random Forest: f1-score=0.332


Looks like limiting the depth of the trees to 15 gives us slightly better results: a depth of 5 reaches the same accuracy (0.379), but with a lower F1-score (0.292 vs. 0.338).

Let's check the criterion, given our other hyperparameters.

In [80]:
crits = ['gini', 'entropy', 'log_loss']
for crit in crits:
    print('Criterion: ', crit)
    rfc_and_print_scores(X_train, y_train, X_val, y_val,130,15,crit)

Criterion:  gini
Random Forest: Accuracy=0.379
Random Forest: f1-score=0.338
Criterion:  entropy
Random Forest: Accuracy=0.328
Random Forest: f1-score=0.304
Criterion:  log_loss
Random Forest: Accuracy=0.328
Random Forest: f1-score=0.304


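The default 'gini' criterion gives the best scores. The 'entropy' and 'log_loss' results are identical because scikit-learn treats both names as the same Shannon information gain criterion.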
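
The hyperparameters above were tuned one at a time, each with the others held fixed. A grid search tries the combinations jointly, and with cross-validation it is less tied to a single validation split. This is a minimal sketch on synthetic stand-in data (`X_demo`, `y_demo`, and the small grid are assumptions chosen to keep it quick), not the procedure used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in for the 27-component training data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=27, n_informative=10,
    n_classes=4, random_state=42)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 15, None],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    scoring="f1_weighted",  # same metric as f1_score(..., average='weighted')
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("best cross-validated weighted F1: %.3f" % search.best_score_)
```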

So it looks like our best model is a random forest with 130 trees, max depth of 15, and the default 'gini' criterion for splitting nodes. Let's test it on the test data.

In [81]:
rfc_and_print_scores(X_train, y_train, X_te_transformed, y_te,130,15,'gini')

Random Forest: Accuracy=0.345
Random Forest: f1-score=0.305


So our model predicts the correct rounded rating for 34.5% of the scripts in the test set, which makes it only a rough guide for picking a script for our new movie!
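
To put the test accuracy in context, it helps to score a constant-prediction baseline on the same test labels, since the 24.1% baseline above was measured on the validation set. The sketch below uses the most frequent training label as the constant prediction (a swapped-in variant; the notebook's baseline uses the rounded mean instead). The helper name and the toy labels are hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(y_train_labels, y_test_labels):
    """Predict the most frequent training label for every test sample."""
    most_common_label, _ = Counter(y_train_labels).most_common(1)[0]
    hits = sum(1 for label in y_test_labels if label == most_common_label)
    return most_common_label, hits / len(y_test_labels)

# Toy labels for illustration only
y_train_demo = [6.0, 6.0, 7.0, 5.0, 6.0, 7.0]
y_test_demo = [6.0, 7.0, 6.0, 5.0]

label, baseline_acc = majority_baseline_accuracy(y_train_demo, y_test_demo)
print(label, baseline_acc)  # 6.0 0.5
```

With the real data, the call would take `list(y_train)` and `list(y_te)`, and the result could be compared directly with the random forest's test accuracy.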