# IMDB Model Development, Tuning, and Evaluation

Goal: Achieve the highest possible accuracy in classifying textual reviews as positive or negative.

Models to be developed:
1. Logistic Regression
2. SVM
3. XGBoost


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
import warnings

warnings.filterwarnings("ignore")

#### Reading in Data
The textual data has already been vectorized using Word2Vec, resulting in 200 features representing each review plus 1 target column, `sentiment`.

In [4]:
df = pd.read_csv('IMDb_stemmed_w2v_data.csv')
print(df.shape)
df.head()

(49582, 201)


| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002828 | -0.116383 | -0.074479 | 0.0143 | -0.028962 | 0.085481 | -0.176025 | -0.081737 | -0.094974 | -0.110703 | ... | -0.124804 | -0.018246 | -0.054315 | 0.1522 | -0.015419 | -0.041798 | -0.23908 | -0.030428 | 0.199432 | 1 |
| 1 | -0.020967 | -0.138727 | -0.011062 | -0.019859 | -0.042811 | 0.035183 | -0.111274 | -0.140352 | -0.015004 | -0.111678 | ... | -0.09844 | -0.077015 | -0.098341 | 0.050779 | 0.050711 | -0.026713 | -0.151631 | 0.099133 | 0.137044 | 1 |
| 2 | -0.003445 | -0.15532 | 0.025571 | 0.011466 | -0.088588 | 0.079316 | -0.112842 | -0.096738 | -0.045185 | -0.160625 | ... | -0.090647 | -0.019206 | -0.081949 | 0.104909 | 0.024483 | -0.075603 | -0.197556 | 0.04476 | 0.225423 | 1 |
| 3 | -0.00719 | -0.010888 | -0.02966 | -0.02142 | -0.057307 | 0.1322 | -0.126503 | -0.119443 | 0.017381 | -0.122474 | ... | -0.107715 | -0.003781 | -0.105711 | 0.042626 | 0.060016 | -0.019369 | -0.24123 | 0.11615 | 0.215752 | 0 |
| 4 | 0.012823 | -0.168378 | -0.008363 | 0.024256 | -0.052891 | 0.029318 | -0.069691 | -0.098672 | -0.060706 | -0.125244 | ... | -0.12274 | -0.071718 | -0.053371 | 0.164652 | 0.043371 | -0.095517 | -0.151275 | 0.049927 | 0.197893 | 1 |


#### Splitting into Training and Testing sets
We split the data into an 80% training set and a 20% testing set.

The __training set__ will be used for __fitting and hyperparameter tuning__ using K-fold cross-validation.

The __testing set__ will be used for __final model evaluations__.

In [5]:
# all features except the target
X = df.drop('sentiment', axis=1)
# only the target feature (sentiment)
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.20, random_state=34)

In [6]:
# ensuring we have the same splits for all models
kf = KFold(n_splits=5)
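
The comment above relies on `KFold` being deterministic when `shuffle` is left at its default of `False`: folds are assigned by row position, so every model tuned with the same splitter sees the same partitions. A minimal sketch of that behaviour on a toy array (`X_demo` and `kf_demo` are hypothetical names, not the IMDb data):

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy stand-in for the feature matrix (hypothetical data, 10 rows)
X_demo = np.arange(20).reshape(10, 2)

# shuffle=False by default, so folds are contiguous blocks of rows
kf_demo = KFold(n_splits=5)

# Collect the validation indices from two independent passes over the splitter
first_pass = [val_idx.tolist() for _, val_idx in kf_demo.split(X_demo)]
second_pass = [val_idx.tolist() for _, val_idx in kf_demo.split(X_demo)]

splits_match = first_pass == second_pass
print(first_pass)
print(splits_match)
```

If shuffled folds are preferred, passing `shuffle=True` together with a fixed `random_state` should keep the splits reproducible across models.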

### 1. Logistic Regression

##### 1.1 Initial Modeling

In [7]:
# LR performance before hyperparameter tuning (default params)
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.8819    0.8728    0.8773      4921
           1     0.8760    0.8849    0.8804      4996

    accuracy                         0.8789      9917
   macro avg     0.8789    0.8788    0.8789      9917
weighted avg     0.8789    0.8789    0.8789      9917



##### 1.2 LR tuning using GridSearchCV

In [9]:
# hyperparameters to be tuned
params = [{
    'penalty': ['l1', 'l2', 'elasticnet', 'none'],
    'C': np.logspace(-4, 4, 5),
    'max_iter': [1000, 2500, 5000],
    'solver': ['lbfgs','newton-cg', 'liblinear', 'sag', 'saga']
}]

In [10]:
lr_GCV = GridSearchCV(LogisticRegression(), param_grid=params, cv = kf, verbose=4)
best_lr = lr_GCV.fit(X_train, y_train)

Fitting 5 folds for each of 300 candidates, totalling 1500 fits
[CV 1/5] END C=0.0001, max_iter=1000, penalty=l1, solver=lbfgs;, score=nan total time=   0.0s
[CV 2/5] END C=0.0001, max_iter=1000, penalty=l1, solver=lbfgs;, score=nan total time=   0.0s
[CV 3/5] END C=0.0001, max_iter=1000, penalty=l1, solver=lbfgs;, score=nan total time=   0.0s
[CV 4/5] END C=0.0001, max_iter=1000, penalty=l1, solver=lbfgs;, score=nan total time=   0.0s
[CV 5/5] END C=0.0001, max_iter=1000, penalty=l1, solver=lbfgs;, score=nan total time=   0.0s
[CV 1/5] END C=0.0001, max_iter=1000, penalty=l1, solver=newton-cg;, score=nan total time=   0.0s
[CV 2/5] END C=0.0001, max_iter=1000, penalty=l1, solver=newton-cg;, score=nan total time=   0.0s
[CV 3/5] END C=0.0001, max_iter=1000, penalty=l1, solver=newton-cg;, score=nan total time=   0.0s
[CV 4/5] END C=0.0001, max_iter=1000, penalty=l1, solver=newton-cg;, score=nan total time=   0.0s
[CV 5/5] END C=0.0001, max_iter=1000, penalty=l1, solver=newton-cg;, score
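
The `score=nan` entries in the log come from penalty/solver pairs that `LogisticRegression` does not support (for example `l1` with `lbfgs` or `newton-cg`); `GridSearchCV` records a failed fit as `nan` by default. One way to avoid spending fits on such pairs is to pass a list of grids that only contains supported combinations. The sketch below illustrates the idea on synthetic data; `X_demo`, `y_demo`, `valid_lr_params`, and the reduced `C` range are illustrative assumptions, and the no-penalty option is left out because its spelling differs between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Small synthetic problem standing in for the Word2Vec features (hypothetical data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

C_values = np.logspace(-2, 2, 3)

# Each dict only pairs solvers with penalties they support
valid_lr_params = [
    {'solver': ['lbfgs', 'newton-cg', 'sag'], 'penalty': ['l2'], 'C': C_values},
    {'solver': ['liblinear', 'saga'], 'penalty': ['l1', 'l2'], 'C': C_values},
]

lr_demo_GCV = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid=valid_lr_params,
    cv=KFold(n_splits=5),
    error_score='raise',  # fail loudly instead of silently recording nan
)
lr_demo_GCV.fit(X_demo, y_demo)

mean_scores = lr_demo_GCV.cv_results_['mean_test_score']
print(len(mean_scores), np.isnan(mean_scores).any())
```

Setting `n_jobs=-1` on `GridSearchCV` would additionally run the folds in parallel, which may shorten long searches.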

Optimal hyperparameters:

In [11]:
# 73 mins to fit
print(best_lr.best_params_)

{'C': 1.0, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'newton-cg'}
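
`best_params_` only reports the winning candidate; here its `C` and `penalty` coincide with the `LogisticRegression` defaults. To see how close the other candidates came, `cv_results_` can be loaded into a DataFrame and sorted by rank. A minimal sketch on synthetic data (`demo_GCV`, `X_demo`, and `y_demo` are hypothetical; in this notebook `lr_GCV` would take the place of `demo_GCV`):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Small synthetic problem (hypothetical data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

demo_GCV = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 1.0, 100.0]},
    cv=KFold(n_splits=5),
)
demo_GCV.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
results = pd.DataFrame(demo_GCV.cv_results_)
ranking = results.sort_values('rank_test_score')[
    ['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(ranking)
print(demo_GCV.best_score_)  # mean cross-validated accuracy of the best candidate
```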


Tuned Logistic Regression performance:

In [12]:
y_pred = best_lr.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.8824    0.8724    0.8774      4921
           1     0.8757    0.8855    0.8806      4996

    accuracy                         0.8790      9917
   macro avg     0.8791    0.8789    0.8790      9917
weighted avg     0.8790    0.8790    0.8790      9917



### 2. SVM


##### 2.1 Initial Modeling

In [13]:
svc_model = LinearSVC()
svc_model.fit(X_train, y_train)
y_pred = svc_model.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.8833    0.8693    0.8763      4921
           1     0.8733    0.8869    0.8800      4996

    accuracy                         0.8782      9917
   macro avg     0.8783    0.8781    0.8782      9917
weighted avg     0.8783    0.8782    0.8782      9917



##### 2.2 SVC tuning using GridSearchCV

In [14]:
params = [{
    'C': [0.1, 1, 10, 100, 1000],
    'max_iter': [1000, 2500, 5000],
    'loss': ['hinge','squared_hinge'],
    'penalty': ['l1', 'l2']
}]

svc_GCV = GridSearchCV(LinearSVC(dual=False), param_grid=params, cv = kf, verbose=4)
best_svc = svc_GCV.fit(X_train, y_train)


Fitting 5 folds for each of 60 candidates, totalling 300 fits
[CV 1/5] END C=0.1, loss=hinge, max_iter=1000, penalty=l1;, score=nan total time=   0.1s
[CV 2/5] END C=0.1, loss=hinge, max_iter=1000, penalty=l1;, score=nan total time=   0.0s
[CV 3/5] END C=0.1, loss=hinge, max_iter=1000, penalty=l1;, score=nan total time=   0.0s
[CV 4/5] END C=0.1, loss=hinge, max_iter=1000, penalty=l1;, score=nan total time=   0.0s
[CV 5/5] END C=0.1, loss=hinge, max_iter=1000, penalty=l1;, score=nan total time=   0.0s
[CV 1/5] END C=0.1, loss=hinge, max_iter=1000, penalty=l2;, score=nan total time=   0.0s
[CV 2/5] END C=0.1, loss=hinge, max_iter=1000, penalty=l2;, score=nan total time=   0.0s
[CV 3/5] END C=0.1, loss=hinge, max_iter=1000, penalty=l2;, score=nan total time=   0.0s
[CV 4/5] END C=0.1, loss=hinge, max_iter=1000, penalty=l2;, score=nan total time=   0.0s
[CV 5/5] END C=0.1, loss=hinge, max_iter=1000, penalty=l2;, score=nan total time=   0.0s
[CV 1/5] END C=0.1, loss=hinge, max_iter=2500, p
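
The `nan` scores for `loss=hinge` follow from `LinearSVC` constraints: with `dual=False` only `squared_hinge` is accepted, plain `hinge` is available only with `penalty='l2'` and `dual=True`, and `l1` with `hinge` is not supported at all. A grid restricted to supported combinations could look like the sketch below, run here on synthetic data (`X_demo`, `y_demo`, `valid_svc_params`, and the shortened `C` list are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import LinearSVC

# Small synthetic problem (hypothetical data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

valid_svc_params = [
    # primal formulation: squared hinge loss with either penalty
    {'dual': [False], 'loss': ['squared_hinge'], 'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]},
    # dual formulation: the only setting that accepts plain hinge loss
    {'dual': [True], 'loss': ['hinge'], 'penalty': ['l2'], 'C': [0.1, 1, 10]},
]

svc_demo_GCV = GridSearchCV(
    LinearSVC(max_iter=5000),
    param_grid=valid_svc_params,
    cv=KFold(n_splits=5),
    error_score='raise',  # surface unsupported combinations as errors
)
svc_demo_GCV.fit(X_demo, y_demo)

svc_scores = svc_demo_GCV.cv_results_['mean_test_score']
print(len(svc_scores), np.isnan(svc_scores).any())
```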

Optimal hyperparameters:

In [15]:
best_svc.best_params_

{'C': 10, 'loss': 'squared_hinge', 'max_iter': 1000, 'penalty': 'l2'}

Tuned SVC performance:

In [16]:
# 79 mins to fit
best_svc.best_estimator_

y_pred = best_svc.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.8825    0.8697    0.8761      4921
           1     0.8735    0.8859    0.8797      4996

    accuracy                         0.8779      9917
   macro avg     0.8780    0.8778    0.8779      9917
weighted avg     0.8780    0.8779    0.8779      9917



### 3. XGBoost

##### 3.1 Initial Modeling

In [17]:
xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)

y_pred = xgb_model.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.8722    0.8504    0.8612      4921
           1     0.8562    0.8773    0.8666      4996

    accuracy                         0.8640      9917
   macro avg     0.8642    0.8639    0.8639      9917
weighted avg     0.8642    0.8640    0.8639      9917



##### 3.2 XGBoost Classifier tuning using Grid Search

In [18]:
params = [{
    # note: 'ns_estimator' is not an XGBClassifier parameter ('n_estimators' sets the number of trees)
    'ns_estimator': [100, 200, 500],
    'max_depth': [3, 5, 9],
    'learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3]
}]

xgb_GCV = GridSearchCV(XGBClassifier(), param_grid=params, cv = kf, verbose=4)
best_xgb = xgb_GCV.fit(X_train, y_train)

Fitting 5 folds for each of 45 candidates, totalling 225 fits
[CV 1/5] END learning_rate=0.001, max_depth=3, ns_estimator=100;, score=0.734 total time=   1.0s
[CV 2/5] END learning_rate=0.001, max_depth=3, ns_estimator=100;, score=0.711 total time=   1.0s
[CV 3/5] END learning_rate=0.001, max_depth=3, ns_estimator=100;, score=0.732 total time=   1.0s
[CV 4/5] END learning_rate=0.001, max_depth=3, ns_estimator=100;, score=0.728 total time=   1.0s
[CV 5/5] END learning_rate=0.001, max_depth=3, ns_estimator=100;, score=0.721 total time=   1.0s
[CV 1/5] END learning_rate=0.001, max_depth=3, ns_estimator=200;, score=0.734 total time=   0.9s
[CV 2/5] END learning_rate=0.001, max_depth=3, ns_estimator=200;, score=0.711 total time=   1.0s
[CV 3/5] END learning_rate=0.001, max_depth=3, ns_estimator=200;, score=0.732 total time=   0.9s
[CV 4/5] END learning_rate=0.001, max_depth=3, ns_estimator=200;, score=0.728 total time=   0.9s
[CV 5/5] END learning_rate=0.001, max_depth=3, ns_estimator=200;,

Optimal hyperparameters:

In [None]:
best_xgb.best_params_

{'learning_rate': 0.2, 'max_depth': 9, 'ns_estimator': 100}
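
In the log above, each fold scores identically for `ns_estimator=100` and `ns_estimator=200`, which suggests that this key had no effect on the fitted models. The scikit-learn wrapper's parameter for the number of boosting rounds is `n_estimators`, so a grid meant to tune the tree count would presumably use that key. The sketch below assumes that was the intent; it only builds and counts the candidates with scikit-learn's `ParameterGrid`, and no results for this corrected grid exist in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same value ranges as above, with the key assumed to be intended as n_estimators
xgb_params = {
    'n_estimators': [100, 200, 500],
    'max_depth': [3, 5, 9],
    'learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3],
}

# Enumerate the candidates without training anything
candidates = list(ParameterGrid(xgb_params))
n_candidates = len(candidates)
print(n_candidates)

# The dict would be passed to the search exactly as before, e.g.
# GridSearchCV(XGBClassifier(), param_grid=xgb_params, cv=kf, verbose=4)
```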

Tuned XGBoost Classifier performance:

In [20]:
# fit took 19 minutes
best_xgb.best_estimator_

y_pred = best_xgb.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.8720    0.8510    0.8614      4921
           1     0.8567    0.8769    0.8667      4996

    accuracy                         0.8641      9917
   macro avg     0.8643    0.8640    0.8640      9917
weighted avg     0.8643    0.8641    0.8640      9917

