<a href="https://colab.research.google.com/github/anastasiaarsky/ML_Capstone/blob/main/ExperimentWithEnsembleModels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Goal:
1. To **experiment with other forms of feature extraction**.
  - So far I have tried TF-IDF Vectorization (on the word level) and a modified Bag of Words model, but I also want to consider TF-IDF Vectorization on the ngram and character levels, as well as Count Vectorization.

2. To **explore other ensemble methods**, specifically boosting methods such as XGBoost, and compare the results to my previously used bagging method (Random Forest).
  - Ensemble methods are generally less complex and faster to train than deep learning methods, so they are preferable if they can achieve comparable results.
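
To make the feature extraction options from item 1 concrete, here is a minimal sketch on a made-up two-sentence corpus. The corpus and the dictionary names are illustrative assumptions, not part of the project data; the vectorizer settings mirror the ones I use in Step 1. It only shows how the number of features changes with each setting:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, only for illustration
toy_corpus = ["the cat sat on the mat", "the dog sat on the log"]

vectorizers = {
    "count (word)": CountVectorizer(),
    "tfidf (word)": TfidfVectorizer(),
    "tfidf (word 2-3 grams)": TfidfVectorizer(ngram_range=(2, 3)),
    "tfidf (char 2-3 grams)": TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),
}

# Fit each vectorizer and record how many features (columns) it produces
feature_counts = {}
for name, vec in vectorizers.items():
    matrix = vec.fit_transform(toy_corpus)
    feature_counts[name] = matrix.shape[1]
    print(f"{name}: {matrix.shape[1]} features")
```

Word-level counts and word-level TF-IDF share the same vocabulary and differ only in how each cell is weighted, while the ngram settings change the vocabulary itself.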

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
import os
import datetime
import time
import matplotlib.pyplot as plt
import xgboost as xgb

from scipy import stats
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import f1_score, recall_score, precision_score, accuracy_score

In [3]:
# Load clean data into a pandas dataframe
DATA_PATH = "/content/drive/My Drive/UCSD Machine Learning Engineering Bootcamp/Capstone Project/"
df = pd.read_csv(DATA_PATH + 'CleanData.csv')

In [4]:
# Split the dataset into training and testing datasets
X_train_all, X_test, y_train_all, y_test = train_test_split(df.Clean_Text, df.Label, test_size=0.3, shuffle=True, random_state=1)

In [5]:
# Split the training dataset into training and validating datasets
X_train, X_val, y_train, y_val = train_test_split(X_train_all, y_train_all, random_state=1)

# Step 1: Experimenting with Feature Extraction

In [6]:
def vectorization(vectorizer, X_train_all, X_train, X_val):
  vectorizer.fit(X_train_all)
  X_train_vectorized = vectorizer.transform(X_train)
  X_val_vectorized = vectorizer.transform(X_val)
  return X_train_vectorized, X_val_vectorized
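
Note that the helper above fits the vectorizer on `X_train_all`, which also contains the validation rows, so the vocabulary and IDF weights have seen the validation text. A stricter variant would fit on the training split only. The sketch below shows that alternative; `vectorize_train_only` and the toy strings are hypothetical names and data I introduce here, and this is not what the rest of the notebook uses:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def vectorize_train_only(vectorizer, X_train, X_val):
    # Learn vocabulary (and IDF weights) from the training split only
    X_train_vec = vectorizer.fit_transform(X_train)
    # Reuse the fitted vocabulary; unseen validation tokens are ignored
    X_val_vec = vectorizer.transform(X_val)
    return X_train_vec, X_val_vec

# Hypothetical toy data, only for illustration
toy_train = ["free prize inside", "meeting at noon"]
toy_val = ["free meeting tomorrow"]

X_tr, X_va = vectorize_train_only(TfidfVectorizer(), toy_train, toy_val)
print(X_tr.shape, X_va.shape)
```

With this variant the validation score is a slightly more conservative estimate, because words that only occur in the validation split never become features.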

### Count Vectorization

In [7]:
%%time
# count vectorization
count_vec = CountVectorizer()
X_train_count, X_val_count = vectorization(count_vec, X_train_all, X_train, X_val)
print('count vec', X_train_count.shape, X_val_count.shape)

count vec (20875, 142617) (6959, 142617)
CPU times: user 10.1 s, sys: 83.8 ms, total: 10.1 s
Wall time: 10.3 s


### TFIDF Vectorization

In [8]:
%%time
# word_level tfidf
tfidf_vec = TfidfVectorizer()
X_train_tfidf, X_val_tfidf = vectorization(tfidf_vec, X_train_all, X_train, X_val)
print('tfidf - word level', X_train_tfidf.shape, X_val_tfidf.shape)

tfidf - word level (20875, 142617) (6959, 142617)
CPU times: user 8.92 s, sys: 87.9 ms, total: 9.01 s
Wall time: 9.09 s


In [14]:
%%time
# ngram level tfidf
tfidf_vect_ngram = TfidfVectorizer(ngram_range=(2,3))
X_train_tfidf_ngram, X_val_tfidf_ngram = vectorization(tfidf_vect_ngram, X_train_all, X_train, X_val)
print('tfidf - ngram level', X_train_tfidf_ngram.shape, X_val_tfidf_ngram.shape)

tfidf - ngram level (20875, 3233364) (6959, 3233364)
CPU times: user 49.2 s, sys: 1.29 s, total: 50.5 s
Wall time: 1min 8s


In [9]:
%%time
# characters level tfidf
tfidf_vect_ngram_char = TfidfVectorizer(analyzer='char', ngram_range=(2,3))
X_train_tfidf_ngram_char, X_val_tfidf_ngram_char = vectorization(tfidf_vect_ngram_char, X_train_all, X_train, X_val)
print('tfidf - char ngram level', X_train_tfidf_ngram_char.shape, X_val_tfidf_ngram_char.shape)

tfidf - char ngram level (20875, 51512) (6959, 51512)
CPU times: user 1min 24s, sys: 957 ms, total: 1min 25s
Wall time: 1min 27s


# Step 2: Experimenting with Ensemble Methods

In [10]:
def train_model(classifier, X_train_vec, y_train, X_val_vec, y_val):

    # fit the training dataset on the classifier
    classifier.fit(X_train_vec, y_train)

    # predict the labels on validation dataset
    y_pred = classifier.predict(X_val_vec)

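    # compute and print metrics on the held-out set that was passed in
    # (validation in Step 2, testing in Step 4); precision, recall, and F1 are macro-averaged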
    accuracy = accuracy_score(y_val, y_pred)
    f1 = f1_score(y_val, y_pred, average='macro')
    recall = recall_score(y_val, y_pred, average='macro')
    precision = precision_score(y_val, y_pred, average='macro')

    print("Test accuracy: {:.3f}".format(accuracy))
    print("F1 Score: {:.3f}".format(f1))
    print("Recall: {:.3f}".format(recall))
    print("Precision: {:.3f}".format(precision))
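
The macro-averaged numbers printed by `train_model` hide how each class performs on its own. As a possible complement, `classification_report` and `confusion_matrix` (both already imported above) give a per-class view. This is a minimal sketch on made-up labels; `y_true_toy` and `y_pred_toy` are hypothetical, not model output from this notebook:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels and predictions, only for illustration
y_true_toy = [0, 0, 1, 1]
y_pred_toy = [0, 1, 1, 1]

# Per-class precision/recall/F1 as a nested dictionary keyed by label
report = classification_report(y_true_toy, y_pred_toy, output_dict=True)
# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true_toy, y_pred_toy)

print(classification_report(y_true_toy, y_pred_toy))
print(cm)
```

In the notebook, the same two calls could be made with `y_val` and `y_pred` inside `train_model`.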

### Random Forest

In [17]:
%%time
# RF on Count Vectors
print("Random Forest, Count Vector Results:")
train_model(RandomForestClassifier(), X_train_count, y_train, X_val_count, y_val)

Random Forest, Count Vector Results:
Test accuracy: 0.980
F1 Score: 0.980
Recall: 0.980
Precision: 0.980
CPU times: user 2min 16s, sys: 293 ms, total: 2min 16s
Wall time: 2min 44s


In [18]:
%%time
# RF on Word Level TF IDF Vectors
print("Random Forest, Word Level TF-IDF Results:")
train_model(RandomForestClassifier(), X_train_tfidf, y_train, X_val_tfidf, y_val)

Random Forest, Word Level TF-IDF Results:
Test accuracy: 0.983
F1 Score: 0.983
Recall: 0.983
Precision: 0.983
CPU times: user 2min 14s, sys: 255 ms, total: 2min 14s
Wall time: 2min 38s


In [19]:
%%time
# RF on Ngram Level TF IDF Vectors
print("Random Forest, Ngram Level TF-IDF Results:")
train_model(RandomForestClassifier(), X_train_tfidf_ngram, y_train, X_val_tfidf_ngram, y_val)

Random Forest, Ngram Level TF-IDF Results:
Test accuracy: 0.915
F1 Score: 0.915
Recall: 0.917
Precision: 0.922
CPU times: user 38min 23s, sys: 2.87 s, total: 38min 26s
Wall time: 40min 7s


In [20]:
%%time
# RF on Char Level TF IDF Vectors
print("Random Forest, Char Level TF-IDF Results:")
train_model(RandomForestClassifier(), X_train_tfidf_ngram_char, y_train, X_val_tfidf_ngram_char, y_val)

Random Forest, Char Level TF-IDF Results:
Test accuracy: 0.971
F1 Score: 0.971
Recall: 0.971
Precision: 0.971
CPU times: user 1min 10s, sys: 162 ms, total: 1min 10s
Wall time: 1min 11s


### XGBoost

In [21]:
%%time
# Extreme Gradient Boosting on Count Vectors
print("XGBoost, Count Vector Results:")
train_model(xgb.XGBClassifier(objective = 'binary:logistic'), X_train_count.tocsc(), y_train, X_val_count.tocsc(), y_val)

XGBoost, Count Vector Results:
Test accuracy: 0.972
F1 Score: 0.972
Recall: 0.973
Precision: 0.972
CPU times: user 1min 37s, sys: 80.3 ms, total: 1min 37s
Wall time: 58.9 s


In [22]:
%%time
# Extreme Gradient Boosting on Word Level TF IDF Vectors
print("XGBoost, Word Level TF-IDF Results:")
train_model(xgb.XGBClassifier(objective = 'binary:logistic'), X_train_tfidf.tocsc(), y_train, X_val_tfidf.tocsc(), y_val)

XGBoost, Word Level TF-IDF Results:
Test accuracy: 0.970
F1 Score: 0.970
Recall: 0.971
Precision: 0.970
CPU times: user 2min 36s, sys: 144 ms, total: 2min 37s
Wall time: 1min 33s


In [23]:
%%time
# Extreme Gradient Boosting on Ngram Level TF IDF Vectors
print("XGBoost, Ngram Level TF-IDF Results:")
train_model(xgb.XGBClassifier(objective = 'binary:logistic'), X_train_tfidf_ngram.tocsc(), y_train, X_val_tfidf_ngram.tocsc(), y_val)

XGBoost, Ngram Level TF-IDF Results:
Test accuracy: 0.891
F1 Score: 0.890
Recall: 0.893
Precision: 0.898
CPU times: user 1h 43min 7s, sys: 6.47 s, total: 1h 43min 14s
Wall time: 1h 1min 26s


In [24]:
%%time
# Extreme Gradient Boosting on Char Level TF IDF Vectors
print("XGBoost, Char Level TF-IDF Results:")
train_model(xgb.XGBClassifier(objective = 'binary:logistic'), X_train_tfidf_ngram_char.tocsc(), y_train, X_val_tfidf_ngram_char.tocsc(), y_val)

XGBoost, Char Level TF-IDF Results:
Test accuracy: 0.978
F1 Score: 0.978
Recall: 0.978
Precision: 0.978
CPU times: user 14min 39s, sys: 1.46 s, total: 14min 41s
Wall time: 8min 58s


# Step 3: Tuning

### Random Forest

For the random forest classifier, I will tune the Count Vectorizer and Word Level TF-IDF versions, as they produced very similar validation accuracy (0.980 and 0.983).
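
For context on the search budget: the tuning cells use a randomized search with `n_iter = 20`, so only a small sample of the full grid is ever tried. The sketch below counts the grid and draws 20 candidates with `ParameterSampler`, which is the sampling utility scikit-learn provides for this kind of search. The `random_state=0` is my assumption for reproducibility; the notebook itself does not fix a seed:

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as the Random Forest tuning cells.
# 'auto' is only a dictionary value here; no model is fitted in this sketch.
param_dist = {'bootstrap': [True, False],
              'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
              'max_features': ['auto', 'sqrt'],
              'min_samples_leaf': [1, 2, 4],
              'min_samples_split': [2, 5, 10],
              'n_estimators': [130, 180, 230]}

# Size of the exhaustive grid versus the number of sampled candidates
n_combinations = len(ParameterGrid(param_dist))
sampled = list(ParameterSampler(param_dist, n_iter=20, random_state=0))

print(f"{len(sampled)} of {n_combinations} combinations sampled")
print(sampled[0])
```

Since only a small fraction of the grid is evaluated, the reported best parameters are the best of the sampled candidates, not necessarily of the whole grid.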

In [25]:
%%time
# Random Forest, Count Vectorizer Tuning
rf = RandomForestClassifier()

param_dist = {'bootstrap': [True, False],
               'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
               'max_features': ['auto', 'sqrt'],  # 'auto' is deprecated (scikit-learn >= 1.1) and behaves like 'sqrt' for classifiers
               'min_samples_leaf': [1, 2, 4],
               'min_samples_split': [2, 5, 10],
               'n_estimators': [130, 180, 230]}

grid_search_rf_count = RandomizedSearchCV(rf,
                                          param_distributions = param_dist,
                                          cv = 3,
                                          n_iter = 20,
                                          scoring = 'accuracy',
                                          verbose = 3)

vectorizer = CountVectorizer()
X_train_all_count = vectorizer.fit_transform(X_train_all)

grid_search_rf_count.fit(X_train_all_count, y_train_all)

best_rf_count = grid_search_rf_count.best_estimator_
print(grid_search_rf_count.best_params_)

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV 1/3] END bootstrap=False, max_depth=110, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=230;, score=0.977 total time= 3.8min
[CV 2/3] END bootstrap=False, max_depth=110, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=230;, score=0.979 total time= 3.8min
[CV 3/3] END bootstrap=False, max_depth=110, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=230;, score=0.981 total time= 3.8min


[CV 1/3] END bootstrap=True, max_depth=50, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=230;, score=0.968 total time=  53.3s


[CV 2/3] END bootstrap=True, max_depth=50, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=230;, score=0.969 total time=  54.1s


[CV 3/3] END bootstrap=True, max_depth=50, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=230;, score=0.970 total time=  54.6s
[CV 1/3] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=10, n_estimators=130;, score=0.973 total time= 1.3min
[CV 2/3] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=10, n_estimators=130;, score=0.974 total time= 1.3min
[CV 3/3] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=10, n_estimators=130;, score=0.978 total time= 1.2min


[CV 1/3] END bootstrap=False, max_depth=50, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=230;, score=0.969 total time= 1.7min


[CV 2/3] END bootstrap=False, max_depth=50, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=230;, score=0.972 total time= 1.7min


[CV 3/3] END bootstrap=False, max_depth=50, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=230;, score=0.973 total time= 1.6min


[CV 1/3] END bootstrap=False, max_depth=100, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=180;, score=0.974 total time= 1.9min


[CV 2/3] END bootstrap=False, max_depth=100, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=180;, score=0.977 total time= 1.9min


[CV 3/3] END bootstrap=False, max_depth=100, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=180;, score=0.978 total time= 1.8min
[CV 1/3] END bootstrap=False, max_depth=90, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=230;, score=0.973 total time= 2.3min
[CV 2/3] END bootstrap=False, max_depth=90, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=230;, score=0.974 total time= 2.3min
[CV 3/3] END bootstrap=False, max_depth=90, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=230;, score=0.977 total time= 2.3min
[CV 1/3] END bootstrap=False, max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=180;, score=0.969 total time=  49.8s
[CV 2/3] END bootstrap=False, max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=180;, score=0.963 total time=  54.3s
[CV 3/3] END bootstrap=False, max_depth=30, max_features=sqrt, min_samples_

[CV 1/3] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=10, n_estimators=180;, score=0.976 total time= 1.5min


[CV 2/3] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=10, n_estimators=180;, score=0.978 total time= 1.4min


[CV 3/3] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=10, n_estimators=180;, score=0.980 total time= 1.5min


[CV 1/3] END bootstrap=True, max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=10, n_estimators=230;, score=0.970 total time=  45.4s


[CV 2/3] END bootstrap=True, max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=10, n_estimators=230;, score=0.967 total time=  45.4s


[CV 3/3] END bootstrap=True, max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=10, n_estimators=230;, score=0.972 total time=  44.3s


[CV 1/3] END bootstrap=True, max_depth=40, max_features=auto, min_samples_leaf=2, min_samples_split=10, n_estimators=180;, score=0.970 total time=  43.8s


[CV 2/3] END bootstrap=True, max_depth=40, max_features=auto, min_samples_leaf=2, min_samples_split=10, n_estimators=180;, score=0.967 total time=  42.2s


[CV 3/3] END bootstrap=True, max_depth=40, max_features=auto, min_samples_leaf=2, min_samples_split=10, n_estimators=180;, score=0.969 total time=  41.6s


[CV 1/3] END bootstrap=False, max_depth=100, max_features=auto, min_samples_leaf=4, min_samples_split=5, n_estimators=130;, score=0.971 total time= 1.1min


[CV 2/3] END bootstrap=False, max_depth=100, max_features=auto, min_samples_leaf=4, min_samples_split=5, n_estimators=130;, score=0.974 total time= 1.1min


[CV 3/3] END bootstrap=False, max_depth=100, max_features=auto, min_samples_leaf=4, min_samples_split=5, n_estimators=130;, score=0.975 total time= 1.1min
[CV 1/3] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=130;, score=0.965 total time=  26.8s
[CV 2/3] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=130;, score=0.964 total time=  25.7s
[CV 3/3] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=130;, score=0.967 total time=  24.9s


[CV 1/3] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=180;, score=0.962 total time=  36.7s


[CV 2/3] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=180;, score=0.966 total time=  36.4s


[CV 3/3] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=180;, score=0.966 total time=  37.3s
[CV 1/3] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=10, n_estimators=180;, score=0.974 total time= 1.8min
[CV 2/3] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=10, n_estimators=180;, score=0.975 total time= 1.8min
[CV 3/3] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=10, n_estimators=180;, score=0.978 total time= 1.7min
[CV 1/3] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=230;, score=0.929 total time=  15.3s
[CV 2/3] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=230;, score=0.929 total time=  15.3s
[CV 3/3] END bootstrap=True, max_depth=10, max_features=sqrt, min_sam

[CV 1/3] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=10, n_estimators=230;, score=0.938 total time=  15.2s


[CV 2/3] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=10, n_estimators=230;, score=0.927 total time=  15.1s


[CV 3/3] END bootstrap=True, max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=10, n_estimators=230;, score=0.954 total time=  14.8s
{'n_estimators': 230, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 110, 'bootstrap': False}
CPU times: user 1h 11min 37s, sys: 4.21 s, total: 1h 11min 42s
Wall time: 1h 11min 58s


In [26]:
%%time
# Random Forest, TFIDF (word level) Vectorizer Tuning
rf = RandomForestClassifier()

param_dist = {'bootstrap': [True, False],
               'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
               'max_features': ['auto', 'sqrt'],  # 'auto' is deprecated (scikit-learn >= 1.1) and behaves like 'sqrt' for classifiers
               'min_samples_leaf': [1, 2, 4],
               'min_samples_split': [2, 5, 10],
               'n_estimators': [130, 180, 230]}

grid_search_rf_tfidf = RandomizedSearchCV(rf,
                                          param_distributions = param_dist,
                                          cv = 3,
                                          n_iter = 20,
                                          scoring = 'accuracy',
                                          verbose = 3)

vectorizer = TfidfVectorizer()
X_train_all_tfidf = vectorizer.fit_transform(X_train_all)

grid_search_rf_tfidf.fit(X_train_all_tfidf, y_train_all)

best_rf_tfidf = grid_search_rf_tfidf.best_estimator_
print(grid_search_rf_tfidf.best_params_)

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV 1/3] END bootstrap=False, max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=180;, score=0.969 total time=  52.8s
[CV 2/3] END bootstrap=False, max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=180;, score=0.969 total time=  51.2s
[CV 3/3] END bootstrap=False, max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=180;, score=0.971 total time=  53.0s


[CV 1/3] END bootstrap=False, max_depth=70, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=130;, score=0.970 total time=  59.6s


[CV 2/3] END bootstrap=False, max_depth=70, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=130;, score=0.971 total time=  59.7s


[CV 3/3] END bootstrap=False, max_depth=70, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=130;, score=0.972 total time=  56.6s


[CV 1/3] END bootstrap=False, max_depth=40, max_features=auto, min_samples_leaf=4, min_samples_split=10, n_estimators=130;, score=0.966 total time=  42.0s


[CV 2/3] END bootstrap=False, max_depth=40, max_features=auto, min_samples_leaf=4, min_samples_split=10, n_estimators=130;, score=0.966 total time=  42.4s


[CV 3/3] END bootstrap=False, max_depth=40, max_features=auto, min_samples_leaf=4, min_samples_split=10, n_estimators=130;, score=0.969 total time=  43.5s


[CV 1/3] END bootstrap=True, max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=10, n_estimators=130;, score=0.967 total time=  25.0s


[CV 2/3] END bootstrap=True, max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=10, n_estimators=130;, score=0.972 total time=  25.7s


[CV 3/3] END bootstrap=True, max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=10, n_estimators=130;, score=0.971 total time=  26.3s


[CV 1/3] END bootstrap=False, max_depth=40, max_features=auto, min_samples_leaf=4, min_samples_split=5, n_estimators=130;, score=0.967 total time=  42.6s


[CV 2/3] END bootstrap=False, max_depth=40, max_features=auto, min_samples_leaf=4, min_samples_split=5, n_estimators=130;, score=0.968 total time=  44.8s


[CV 3/3] END bootstrap=False, max_depth=40, max_features=auto, min_samples_leaf=4, min_samples_split=5, n_estimators=130;, score=0.968 total time=  41.8s


[CV 1/3] END bootstrap=False, max_depth=60, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=180;, score=0.970 total time= 1.2min


[CV 2/3] END bootstrap=False, max_depth=60, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=180;, score=0.969 total time= 1.3min


[CV 3/3] END bootstrap=False, max_depth=60, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=180;, score=0.974 total time= 1.2min


[CV 1/3] END bootstrap=True, max_depth=70, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=230;, score=0.969 total time= 1.3min


[CV 2/3] END bootstrap=True, max_depth=70, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=230;, score=0.971 total time= 1.3min


[CV 3/3] END bootstrap=True, max_depth=70, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=230;, score=0.973 total time= 1.3min
[CV 1/3] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=130;, score=0.969 total time=  24.9s
[CV 2/3] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=130;, score=0.968 total time=  24.7s
[CV 3/3] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=130;, score=0.967 total time=  22.7s


[CV 1/3] END bootstrap=False, max_depth=60, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=180;, score=0.969 total time= 1.5min


[CV 2/3] END bootstrap=False, max_depth=60, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=180;, score=0.974 total time= 1.5min


[CV 3/3] END bootstrap=False, max_depth=60, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=180;, score=0.973 total time= 1.5min
[CV 1/3] END bootstrap=False, max_depth=60, max_features=sqrt, min_samples_leaf=4, min_samples_split=5, n_estimators=180;, score=0.969 total time= 1.2min
[CV 2/3] END bootstrap=False, max_depth=60, max_features=sqrt, min_samples_leaf=4, min_samples_split=5, n_estimators=180;, score=0.970 total time= 1.3min
[CV 3/3] END bootstrap=False, max_depth=60, max_features=sqrt, min_samples_leaf=4, min_samples_split=5, n_estimators=180;, score=0.971 total time= 1.2min


[CV 1/3] END bootstrap=False, max_depth=80, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=130;, score=0.971 total time= 1.3min


[CV 2/3] END bootstrap=False, max_depth=80, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=130;, score=0.975 total time= 1.3min


[CV 3/3] END bootstrap=False, max_depth=80, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=130;, score=0.974 total time= 1.2min


[CV 1/3] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=4, min_samples_split=10, n_estimators=230;, score=0.965 total time=  44.5s


[CV 2/3] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=4, min_samples_split=10, n_estimators=230;, score=0.962 total time=  42.6s


[CV 3/3] END bootstrap=False, max_depth=20, max_features=auto, min_samples_leaf=4, min_samples_split=10, n_estimators=230;, score=0.968 total time=  41.7s


[CV 1/3] END bootstrap=True, max_depth=60, max_features=auto, min_samples_leaf=4, min_samples_split=10, n_estimators=230;, score=0.966 total time= 1.0min


[CV 2/3] END bootstrap=True, max_depth=60, max_features=auto, min_samples_leaf=4, min_samples_split=10, n_estimators=230;, score=0.969 total time=  59.3s


[CV 3/3] END bootstrap=True, max_depth=60, max_features=auto, min_samples_leaf=4, min_samples_split=10, n_estimators=230;, score=0.970 total time=  57.9s


[CV 1/3] END bootstrap=True, max_depth=100, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=180;, score=0.969 total time=  59.3s


[CV 2/3] END bootstrap=True, max_depth=100, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=180;, score=0.972 total time=  58.0s


[CV 3/3] END bootstrap=True, max_depth=100, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=180;, score=0.973 total time=  57.0s


[CV 1/3] END bootstrap=True, max_depth=70, max_features=auto, min_samples_leaf=2, min_samples_split=10, n_estimators=180;, score=0.970 total time= 1.0min


[CV 2/3] END bootstrap=True, max_depth=70, max_features=auto, min_samples_leaf=2, min_samples_split=10, n_estimators=180;, score=0.972 total time=  59.9s


[CV 3/3] END bootstrap=True, max_depth=70, max_features=auto, min_samples_leaf=2, min_samples_split=10, n_estimators=180;, score=0.972 total time=  59.5s


[CV 1/3] END bootstrap=False, max_depth=50, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=180;, score=0.972 total time= 1.5min


[CV 2/3] END bootstrap=False, max_depth=50, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=180;, score=0.972 total time= 1.5min


[CV 3/3] END bootstrap=False, max_depth=50, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=180;, score=0.974 total time= 1.4min
[CV 1/3] END bootstrap=False, max_depth=70, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=230;, score=0.974 total time= 2.6min
[CV 2/3] END bootstrap=False, max_depth=70, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=230;, score=0.975 total time= 2.7min
[CV 3/3] END bootstrap=False, max_depth=70, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=230;, score=0.977 total time= 2.6min
[CV 1/3] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=5, n_estimators=230;, score=0.977 total time= 2.4min
[CV 2/3] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=5, n_estimators=230;, score=0.978 total time= 2.4min
[CV 3/3] END bootstrap=False, max_depth=None, max_features=sqrt, min_sam

### XGBoost

I will only tune the Char Level TF-IDF XGBoost classifier, as it achieved the best validation accuracy (0.978) of the four XGBoost variants.
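
One detail of the search space in the tuning cell is easy to misread: `stats.uniform(loc, scale)` samples from the interval `[loc, loc + scale]`, not `[loc, scale]`, and `stats.randint(low, high)` excludes `high`. The sketch below prints the effective ranges for the distributions used in the cell; the names `search_ranges`, `bounds`, and `n_estimators_dist` are hypothetical and only used here:

```python
from scipy import stats

# Continuous distributions as written in the tuning cell: uniform(loc, scale)
search_ranges = {
    "learning_rate": stats.uniform(0.01, 0.59),
    "subsample": stats.uniform(0.3, 0.6),
    "colsample_bytree": stats.uniform(0.5, 0.4),
}

# ppf(0) and ppf(1) give the lower and upper end of each distribution
bounds = {name: (dist.ppf(0.0), dist.ppf(1.0)) for name, dist in search_ranges.items()}
for name, (low, high) in bounds.items():
    print(f"{name}: [{low:.2f}, {high:.2f}]")

# Discrete distribution: integers from 150 up to and including 999
n_estimators_dist = stats.randint(150, 1000)
print(n_estimators_dist.rvs(size=5, random_state=0))
```

So the learning rate is searched over roughly 0.01 to 0.60, and both sampling fractions stay at or below 0.9.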

In [27]:
%%time
# XGBoost, Char Level TF-IDF Vectorizer Tuning
# Use a separate name for the estimator so the imported xgb module is not shadowed
xgb_clf = xgb.XGBClassifier(objective = 'binary:logistic')
# Note: stats.uniform(loc, scale) samples from [loc, loc + scale]
param_dist = {'n_estimators': stats.randint(150, 1000),
              'learning_rate': stats.uniform(0.01, 0.59),
              'subsample': stats.uniform(0.3, 0.6),
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': stats.uniform(0.5, 0.4),
              'min_child_weight': [1, 2, 3, 4]}

grid_search_xgb = RandomizedSearchCV(xgb_clf,
                                    param_distributions = param_dist,
                                    cv = 3,
                                    n_iter = 20,
                                    scoring = 'accuracy',
                                    verbose = 3)

vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2,3))
X_train_all_tfidf_char = vectorizer.fit_transform(X_train_all)

grid_search_xgb.fit(X_train_all_tfidf_char, y_train_all)

best_xgb = grid_search_xgb.best_estimator_
print(grid_search_xgb.best_params_)

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV 1/3] END colsample_bytree=0.849639276642066, learning_rate=0.219453560661678, max_depth=9, min_child_weight=2, n_estimators=745, subsample=0.6244526870676246;, score=0.978 total time=20.2min
[CV 2/3] END colsample_bytree=0.849639276642066, learning_rate=0.219453560661678, max_depth=9, min_child_weight=2, n_estimators=745, subsample=0.6244526870676246;, score=0.977 total time=20.8min
[CV 3/3] END colsample_bytree=0.849639276642066, learning_rate=0.219453560661678, max_depth=9, min_child_weight=2, n_estimators=745, subsample=0.6244526870676246;, score=0.980 total time=20.7min
[CV 1/3] END colsample_bytree=0.8650400619139674, learning_rate=0.4431296739003133, max_depth=5, min_child_weight=3, n_estimators=277, subsample=0.6297694638296976;, score=0.973 total time= 8.1min
[CV 2/3] END colsample_bytree=0.8650400619139674, learning_rate=0.4431296739003133, max_depth=5, min_child_weight=3, n_estimators=277, subsample=0.6297694638

# Step 4: Results

In the previous section (Reproduction of Available Solutions), I reproduced an RF model with TF-IDF Vectorization. Here are its results on the testing set:

In [11]:
%%time
# Word Level TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train_all)
X_train_all_tfidf = vectorizer.transform(X_train_all)
X_test_tfidf = vectorizer.transform(X_test)

CPU times: user 12.5 s, sys: 216 ms, total: 12.8 s
Wall time: 13 s


In [20]:
%%time
# RF on TFIDF Vectors
print("TUNED Random Forest, TFIDF Vector Results:")
best_rf_tfidf = RandomForestClassifier(n_estimators=230, min_samples_split=5, min_samples_leaf=4, max_features='sqrt', max_depth=None, bootstrap=False)
train_model(best_rf_tfidf, X_train_all_tfidf, y_train_all, X_test_tfidf, y_test)

TUNED Random Forest, TFIDF Vector Results:
Test accuracy: 0.980
F1 Score: 0.980
Recall: 0.980
Precision: 0.979
CPU times: user 1min 1s, sys: 118 ms, total: 1min 1s
Wall time: 1min 12s


In [21]:
%%time
# RF on TFIDF Vectors
print("UNTUNED Random Forest, TFIDF Vector Results:")
train_model(RandomForestClassifier(), X_train_all_tfidf, y_train_all, X_test_tfidf, y_test)

UNTUNED Random Forest, TFIDF Vector Results:
Test accuracy: 0.983
F1 Score: 0.983
Recall: 0.983
Precision: 0.983
CPU times: user 1min 50s, sys: 173 ms, total: 1min 50s
Wall time: 1min 53s


In this section, after experimenting with various feature extraction methods and ensemble models, the Random Forest model with Count Vectorization was among the best performers on the validation set (0.980 accuracy, just behind word-level TF-IDF at 0.983), so let's view its results on the testing set:

In [13]:
%%time
# Count Vectorizer
vectorizer = CountVectorizer()
vectorizer.fit(X_train_all)
X_train_all_count = vectorizer.transform(X_train_all)
X_test_count = vectorizer.transform(X_test)

CPU times: user 12.2 s, sys: 94.4 ms, total: 12.3 s
Wall time: 12.4 s


In [22]:
%%time
# RF on Count Vectors
print("TUNED Random Forest, Count Vector Results:")
best_rf_count = RandomForestClassifier(n_estimators=230, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', max_depth=110, bootstrap=False)
train_model(best_rf_count, X_train_all_count, y_train_all, X_test_count, y_test)

TUNED Random Forest, Count Vector Results:
Test accuracy: 0.980
F1 Score: 0.980
Recall: 0.980
Precision: 0.980
CPU times: user 3min, sys: 183 ms, total: 3min
Wall time: 3min 1s


In [23]:
%%time
# RF on Count Vectors
print("UNTUNED Random Forest, Count Vector Results:")
train_model(RandomForestClassifier(), X_train_all_count, y_train_all, X_test_count, y_test)

UNTUNED Random Forest, Count Vector Results:
Test accuracy: 0.981
F1 Score: 0.981
Recall: 0.981
Precision: 0.981
CPU times: user 2min 2s, sys: 148 ms, total: 2min 2s
Wall time: 2min 3s


Finally, here are the results of the XGBoost classifier with Char Level TF-IDF on the testing set:

In [15]:
%%time
# Char Level TF-IDF Vectorizer
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2,3))
vectorizer.fit(X_train_all)
X_train_all_tfidf_char = vectorizer.transform(X_train_all)
X_test_tfidf_char = vectorizer.transform(X_test)

CPU times: user 1min 50s, sys: 1.2 s, total: 1min 51s
Wall time: 1min 52s


In [24]:
%%time
# XGBoost on Char Level TF-IDF Vectors
print("TUNED XGBoost, Char Level TF-IDF Vector Results:")
best_xgb = xgb.XGBClassifier(objective = 'binary:logistic', n_estimators=558, learning_rate=0.14232127484186657, subsample=0.6118059580832087, max_depth=4, colsample_bytree=0.5415650307519067, min_child_weight=4)
train_model(best_xgb, X_train_all_tfidf_char, y_train_all, X_test_tfidf_char, y_test)

TUNED XGBoost, Char Level TF-IDF Vector Results:
Test accuracy: 0.982
F1 Score: 0.982
Recall: 0.983
Precision: 0.982
CPU times: user 28min 32s, sys: 1.77 s, total: 28min 33s
Wall time: 17min 20s


In [25]:
%%time
# XGBoost on Char Level TF-IDF Vectors
print("UNTUNED XGBoost, Char Level TF-IDF Vector Results:")
train_model(xgb.XGBClassifier(), X_train_all_tfidf_char, y_train_all, X_test_tfidf_char, y_test)

UNTUNED XGBoost, Char Level TF-IDF Vector Results:
Test accuracy: 0.978
F1 Score: 0.978
Recall: 0.978
Precision: 0.978
CPU times: user 19min 39s, sys: 1.18 s, total: 19min 40s
Wall time: 12min 2s


### Summary

Overall, I found that the **untuned Random Forest models using TF-IDF (on the word level) and Count Vectorization, as well as the tuned XGBoost model using TF-IDF (on the character level), produced approximately equal results** on the testing set: between **98.1% and 98.3% accuracy**. Tuning did not help the Random Forest models, which both scored 98.0% after tuning.

Furthermore, I found that TF-IDF feature extraction on the ngram level resulted in a less accurate model for both Random Forest and XGBoost (0.915 and 0.891 validation accuracy, respectively), while also being by far the slowest to train.
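
To compare the final models side by side, the test-set accuracies printed in Step 4 can be gathered into one table. This is a minimal sketch; the DataFrame and its column names are my own labels, and the numbers are copied by hand from the outputs above rather than recomputed:

```python
import pandas as pd

# Test-set accuracies copied from the Step 4 outputs
test_results = pd.DataFrame(
    [
        ("Random Forest", "TF-IDF (word)", "tuned", 0.980),
        ("Random Forest", "TF-IDF (word)", "untuned", 0.983),
        ("Random Forest", "Count", "tuned", 0.980),
        ("Random Forest", "Count", "untuned", 0.981),
        ("XGBoost", "TF-IDF (char)", "tuned", 0.982),
        ("XGBoost", "TF-IDF (char)", "untuned", 0.978),
    ],
    columns=["model", "features", "tuning", "test_accuracy"],
)

# Sort so the best-performing configuration is listed first
ranked = test_results.sort_values("test_accuracy", ascending=False).reset_index(drop=True)
print(ranked.to_string(index=False))
```

The gap between the best and worst row is only half a percentage point, which is why I describe the results as approximately equal.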