<h1 align="center"><span style='font-weight: bold;'>Whole-Food Plant-Based vs. Paleo:</span><br />Identifying the best-performing classification model</h1>

---
<h2 align="center"><span style='font-weight: bold;'>03: Model Exploration</span></h2>

---

## Imports

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

from sklearn.metrics import accuracy_score

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

pd.options.display.max_colwidth = 400

## Load Data

In [2]:
diets = pd.read_csv('../data/diets.csv')

## Basic Re-Checks

In [3]:
diets.shape

(9993, 2)

In [4]:
diets.isnull().sum()

subreddit    0
title        0
dtype: int64

In [5]:
diets[diets['title'].isnull()]

Unnamed: 0,subreddit,title


**All of the null values come from the plant_based DataFrame. Interestingly, they did not appear when I ran the check in the *02_EDA_and_Cleaning* notebook.**

In [6]:
# Importing the plant_based DataFrame to see what the null values originally were.

plant_based = pd.read_csv('../data/plant_based.csv').drop(columns = ['selftext', 'created_utc'])

plant_based.iloc[[1821, 3108, 4383, 4604], :]

Unnamed: 0,subreddit,title
1821,PlantBasedDiet,https://beautifulingredient.com/quinoa-bacon-bits/
3108,PlantBasedDiet,sucavu.com/?wa=Koken
4383,PlantBasedDiet,https://filmi-beats.blogspot.com/2021/05/IRON-MAN-ROBERT-DOWNEY-JR.-HD-WALLPAPER.html
4604,PlantBasedDiet,plantstrong.com/


**The null values were all originally URLs that were purposely removed in the cleaning process. Since these values contained only URLs and nothing else, they'll need to be dropped.**

In [7]:
diets = diets.dropna()
diets.shape

(9993, 2)

In [8]:
# Re-saving the diets DataFrame

diets.to_csv('../data/diets.csv', index=False)

## Train/Test Split

In [9]:
X = diets['title']
y = diets['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.2, 
                                                    random_state=42, 
                                                    stratify=y)

## Baseline Accuracies

In [10]:
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

1    0.50015
0    0.49985
Name: subreddit, dtype: float64
1    0.500125
0    0.499875
Name: subreddit, dtype: float64
1    0.50025
0    0.49975
Name: subreddit, dtype: float64


**The baseline accuracy is 0.50, so we'll need a model that can beat that score, at the very minimum.**
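
The baseline is the accuracy of a model that always predicts the most frequent class. A minimal sketch of that calculation follows; the `labels` list is a small hypothetical stand-in for the `subreddit` column, not the actual data.

```python
from collections import Counter

# Hypothetical labels standing in for the 'subreddit' column (two classes: 1 and 0)
labels = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class
baseline_accuracy = majority_count / len(labels)
print(majority_class, baseline_accuracy)
```

With classes as evenly split as they are in this notebook, the same calculation lands at roughly 0.50.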

## Model Exploration

### TfidfVectorizer Parameter Set-Up

In [11]:
# Creating a Lemmatizer object to include as a tokenizer in the TfidfVectorizer
# Help from here: https://stackoverflow.com/questions/47423854/sklearn-adding-lemmatizer-to-countvectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

In [12]:
# Modifying the stopwords list per a warning that appeared listing these words

stopwords = stopwords.words('english')
new_stop_words = ['amp', "'d", "'ll", "'re", "'s", "'ve", 'could', 'doe', 'ha', 'might', 'must', "n't", 'need', 'sha', 'wa', 'wo', 'would']
stopwords.extend(new_stop_words)
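
Tokens such as 'wa', 'ha' and 'doe' need to be in the list because stop words are filtered after tokenization, so the list has to contain the lemmatized forms the tokenizer actually produces. The sketch below illustrates the idea with a toy tokenizer. `toy_lemmatize` is a hypothetical stand-in that only strips a trailing "s"; it is not the `WordNetLemmatizer` used in this notebook.

```python
def toy_lemmatize(token):
    # Hypothetical stand-in for a lemmatizer: strip one trailing "s"
    return token[:-1] if token.endswith("s") and len(token) > 2 else token

def tokenize(text):
    # Lowercase, split on whitespace, then lemmatize each token
    return [toy_lemmatize(t) for t in text.lower().split()]

raw_stop_words = ["was", "has", "does", "the"]
tokens = tokenize("The recipe was what she has")

# Filtering with the raw list misses the lemmatized forms 'wa' and 'ha'
kept_with_raw = [t for t in tokens if t not in raw_stop_words]

# Passing the stop words through the same lemmatizer closes the gap
normalized_stop_words = {toy_lemmatize(w) for w in raw_stop_words}
kept_with_normalized = [t for t in tokens if t not in normalized_stop_words]

print(kept_with_raw)
print(kept_with_normalized)
```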

In [13]:
# Identifying how many features are produced with the TfidfVectorizer that will be used for all models

# tvec = TfidfVectorizer(lowercase=True, 
#                        preprocessor=None, 
#                        tokenizer=LemmaTokenizer(), 
#                        stop_words=stopwords, 
#                        analyzer='word')

# X_train = tvec.fit_transform(X_train)

In [14]:
# X_train.todense().shape

**The TfidfVectorizer identifies 6,867 distinct words after lemmatizing and removing stopwords.**

### RandomForestClassifier

In [15]:
rf_pipe = Pipeline([
    ('rf_tvec', TfidfVectorizer(lowercase=True, 
                             preprocessor=None,
                             tokenizer=LemmaTokenizer(),
                             stop_words=stopwords,
                             analyzer='word')), 
    ('rf', RandomForestClassifier(random_state=42))
])

In [16]:
cross_val_score(rf_pipe, X_train, y_train, cv=3).mean()

0.8019767891719111

**The accuracy score we can expect on the test data without any hyper-parameter tuning is 0.8019.**

In [17]:
rf_pipe.fit(X_train, y_train)
print(rf_pipe.score(X_train, y_train))
print(rf_pipe.score(X_test, y_test))

0.9897423067300475
0.8084042021010506


**Without any hyper-parameter tuning, the Random Forest model achieved a testing accuracy of 0.8084 but is heavily overfit: its training accuracy (0.9897) is higher by 0.1813.**
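
The overfit figure quoted throughout this notebook is the training accuracy minus the testing accuracy. A minimal sketch of that arithmetic, using the two Random Forest scores printed above; the helper name `overfit_gap` is an illustrative assumption and not part of the notebook.

```python
def overfit_gap(train_score, test_score):
    """Return training accuracy minus testing accuracy, rounded to 4 places."""
    return round(train_score - test_score, 4)

# Untuned Random Forest scores copied from the output above
rf_gap = overfit_gap(0.9897423067300475, 0.8084042021010506)
print(rf_gap)
```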

In [18]:
# Setting hyper-parameters that RandomizedSearchCV will search through

rf_pipe_params = {
    'rf_tvec__max_features' : [None, 1000, 2000, 4000],
    'rf_tvec__min_df' : [1, 2, 3, 5],
    'rf_tvec__max_df' : [0.25, 0.75, .90],
    'rf_tvec__ngram_range' : [(1,1),(1, 2)],
    'rf__n_estimators': [10, 50, 100],
    'rf__max_depth': [None, 10, 100]
}

In [19]:
# RandomizedSearchCV will sample 10 parameter settings to reduce runtime

rf_rs = RandomizedSearchCV(rf_pipe,
                 param_distributions = rf_pipe_params, 
                 n_iter=10,
                 cv = 3,
                 random_state=42)

In [20]:
rf_rs.fit(X_train, y_train)

print(rf_rs.best_params_)
print(rf_rs.best_score_)
print(rf_rs.score(X_train, y_train))
print(rf_rs.score(X_test, y_test))

{'rf_tvec__ngram_range': (1, 1), 'rf_tvec__min_df': 1, 'rf_tvec__max_features': 2000, 'rf_tvec__max_df': 0.75, 'rf__n_estimators': 50, 'rf__max_depth': 100}
0.7998500376549157
0.9019264448336253
0.8009004502251126


**After hyper-parameter tuning, the testing accuracy did not improve (0.8009 vs. 0.8084), but the variance, i.e. the gap between training and testing accuracy, decreased to 0.101.**
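
One possible contributor to the lack of improvement (an assumption, not something tested here) is that the randomized search sees only a small part of the search space. The sketch below counts the combinations implied by `rf_pipe_params` and the share covered by `n_iter=10`; the counts are copied from the lists defined above.

```python
from math import prod

# Number of candidate values per hyper-parameter, copied from rf_pipe_params
rf_grid_sizes = {
    'rf_tvec__max_features': 4,
    'rf_tvec__min_df': 4,
    'rf_tvec__max_df': 3,
    'rf_tvec__ngram_range': 2,
    'rf__n_estimators': 3,
    'rf__max_depth': 3,
}

total_combinations = prod(rf_grid_sizes.values())
sampled = 10  # n_iter passed to RandomizedSearchCV
coverage = sampled / total_combinations

print(total_combinations, f"{coverage:.2%}")
```

Raising `n_iter` would widen the coverage at the cost of a longer runtime.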

### LogisticRegression

In [21]:
logr_pipe = Pipeline([
    ('logr_tvec', TfidfVectorizer(lowercase=True, 
                             preprocessor=None,
                             tokenizer=LemmaTokenizer(),
                             stop_words=stopwords,
                             analyzer='word')), 
    ('logr', LogisticRegression())
])

In [22]:
cross_val_score(logr_pipe, X_train, y_train, cv=3).mean()

0.8091062450818548

**The accuracy score we can expect on the test data without any hyper-parameter tuning is 0.8091.**

In [23]:
logr_pipe.fit(X_train, y_train)
print(logr_pipe.score(X_train, y_train))
print(logr_pipe.score(X_test, y_test))

0.9014260695521641
0.8204102051025512


**Without any hyper-parameter tuning, the Logistic Regression model achieved a testing accuracy of 0.8204 but is overfit by 0.081.**

In [24]:
# Setting hyper-parameters that RandomizedSearchCV will search through

logr_pipe_params = {
    'logr_tvec__max_features' : [None, 1000, 2000, 4000],
    'logr_tvec__min_df' : [1, 2, 3, 5],
    'logr_tvec__max_df' : [0.25, 0.75, 0.90],
    'logr_tvec__ngram_range' : [(1,1),(1, 2)],
}

In [25]:
# RandomizedSearchCV will sample 10 parameter settings to reduce runtime

logr_rs = RandomizedSearchCV(logr_pipe,
                 param_distributions = logr_pipe_params, 
                 n_iter=10,
                 cv = 3,
                 random_state=42)

In [26]:
logr_rs.fit(X_train, y_train)

print(logr_rs.best_params_)
print(logr_rs.best_score_)
print(logr_rs.score(X_train, y_train))
print(logr_rs.score(X_test, y_test))

{'logr_tvec__ngram_range': (1, 2), 'logr_tvec__min_df': 1, 'logr_tvec__max_features': None, 'logr_tvec__max_df': 0.75}
0.8157370785419565
0.9419564673505129
0.8219109554777388


**After hyper-parameter tuning, there was hardly an improvement in accuracy and the variance increased to 0.12.**

### KNeighborsClassifier

In [27]:
knn_pipe = Pipeline([
    ('knn_tvec', TfidfVectorizer(lowercase=True, 
                             preprocessor=None,
                             tokenizer=LemmaTokenizer(),
                             stop_words=stopwords,
                             analyzer='word')), 
    ('knn', KNeighborsClassifier())
])

In [28]:
cross_val_score(knn_pipe, X_train, y_train, cv=3).mean()

0.599695380183185

**The accuracy score we can expect on the test data without any hyper-parameter tuning is 0.5996.**

In [29]:
knn_pipe.fit(X_train, y_train)
print(knn_pipe.score(X_train, y_train))
print(knn_pipe.score(X_test, y_test))

0.648861646234676
0.5757878939469735


**Without any hyper-parameter tuning, the k-Nearest Neighbors model achieved a testing accuracy of 0.5757 and is overfit by 0.0731.**

In [30]:
# Setting hyper-parameters that RandomizedSearchCV will search through

knn_pipe_params = {
    'knn_tvec__max_features' : [None, 1000, 2000, 4000],
    'knn_tvec__min_df' : [1, 2, 3, 5],
    'knn_tvec__max_df' : [0.25, 0.75, .90],
    'knn_tvec__ngram_range' : [(1,1),(1, 2)],
    'knn__n_neighbors': [3, 5, 11],
    'knn__weights': ['uniform', 'distance']
}

In [31]:
# RandomizedSearchCV will sample 10 parameter settings to reduce runtime

knn_rs = RandomizedSearchCV(knn_pipe,
                 param_distributions = knn_pipe_params, 
                 n_iter=10,
                 cv = 3,
                 random_state=42)

In [32]:
knn_rs.fit(X_train, y_train)

print(knn_rs.best_params_)
print(knn_rs.best_score_)
print(knn_rs.score(X_train, y_train))
print(knn_rs.score(X_test, y_test))

{'knn_tvec__ngram_range': (1, 2), 'knn_tvec__min_df': 3, 'knn_tvec__max_features': 4000, 'knn_tvec__max_df': 0.25, 'knn__weights': 'uniform', 'knn__n_neighbors': 3}
0.6591204713155933
0.8986740055041281
0.6983491745872936


**After hyper-parameter tuning, the testing accuracy improved by 0.12, but the model became heavily overfit, by 0.2003.**

### AdaBoostClassifier

In [33]:
ada_pipe = Pipeline([
    ('ada_tvec', TfidfVectorizer(lowercase=True, 
                             preprocessor=None,
                             tokenizer=LemmaTokenizer(),
                             stop_words=stopwords,
                             analyzer='word')), 
    ('ada', AdaBoostClassifier(random_state=42))
])

In [34]:
cross_val_score(ada_pipe, X_train, y_train, cv=3).mean()

0.7855897642483008

**The accuracy score we can expect on the test data without any hyper-parameter tuning is 0.7855.**

In [35]:
ada_pipe.fit(X_train, y_train)
print(ada_pipe.score(X_train, y_train))
print(ada_pipe.score(X_test, y_test))

0.8038528896672504
0.7938969484742371


**Without any hyper-parameter tuning, the AdaBoost model achieved a testing accuracy of 0.7938 and can hardly be considered overfit since the scores only differ by 0.01.**

In [36]:
# Setting hyper-parameters that RandomizedSearchCV will search through

ada_pipe_params = {
    'ada_tvec__max_features' : [None, 1000, 2000, 4000],
    'ada_tvec__min_df' : [1, 2, 3, 5],
    'ada_tvec__max_df' : [0.25, 0.75, .90],
    'ada_tvec__ngram_range' : [(1,1),(1, 2)],
    'ada__n_estimators': [10, 50, 100],
    'ada__learning_rate': [1.0, 2.0, 3.0],
}

In [37]:
# RandomizedSearchCV will sample 10 parameter settings to reduce runtime

ada_rs = RandomizedSearchCV(ada_pipe,
                 param_distributions = ada_pipe_params, 
                 n_iter=10,
                 cv = 3, 
                 random_state=42)

In [38]:
ada_rs.fit(X_train, y_train)

print(ada_rs.best_params_)
print(ada_rs.best_score_)
print(ada_rs.score(X_train, y_train))
print(ada_rs.score(X_test, y_test))

{'ada_tvec__ngram_range': (1, 1), 'ada_tvec__min_df': 5, 'ada_tvec__max_features': None, 'ada_tvec__max_df': 0.25, 'ada__n_estimators': 100, 'ada__learning_rate': 1.0}
0.7868408746457526
0.8164873655241431
0.7953976988494247


**After hyper-parameter tuning, the testing accuracy increased only slightly and the variance increased to 0.0211. Regardless, this seems to be the best-performing model thus far in terms of how well it works on unseen data.**

### GradientBoostingClassifier

In [39]:
gb_pipe = Pipeline([
    ('gb_tvec', TfidfVectorizer(lowercase=True, 
                             preprocessor=None,
                             tokenizer=LemmaTokenizer(),
                             stop_words=stopwords,
                             analyzer='word')), 
    ('gb', GradientBoostingClassifier(random_state=42))
])

In [40]:
cross_val_score(gb_pipe, X_train, y_train, cv=3).mean()

0.7833384979726442

**The accuracy score we can expect on the test data without any hyper-parameter tuning is 0.7833.**

In [41]:
gb_pipe.fit(X_train, y_train)
print(gb_pipe.score(X_train, y_train))
print(gb_pipe.score(X_test, y_test))

0.8028521391043283
0.7873936968484242


**Without any hyper-parameter tuning, the Gradient Boosting model achieved a testing accuracy of 0.7873 and can hardly be considered overfit since the scores only differ by 0.0155.**

In [42]:
# Setting hyper-parameters that RandomizedSearchCV will search through

gb_pipe_params = {
    'gb_tvec__max_features' : [None, 1000, 2000, 4000],
    'gb_tvec__min_df' : [1, 2, 3, 5],
    'gb_tvec__max_df' : [0.25, 0.75, .90],
    'gb_tvec__ngram_range' : [(1,1),(1, 2)],
    'gb__learning_rate': [1.0, 2.0, 3.0],
    'gb__n_estimators': [10, 50, 100]
}

In [43]:
# RandomizedSearchCV will sample 10 parameter settings to reduce runtime

gb_rs = RandomizedSearchCV(gb_pipe,
                 param_distributions = gb_pipe_params, 
                 n_iter=10,
                 cv = 3,
                 random_state=42)

In [44]:
gb_rs.fit(X_train, y_train)

print(gb_rs.best_params_)
print(gb_rs.best_score_)
print(gb_rs.score(X_train, y_train))
print(gb_rs.score(X_test, y_test))

{'gb_tvec__ngram_range': (1, 1), 'gb_tvec__min_df': 5, 'gb_tvec__max_features': None, 'gb_tvec__max_df': 0.25, 'gb__n_estimators': 100, 'gb__learning_rate': 1.0}
0.7760822736432492
0.8925444083062297
0.8024012006003002


**After hyper-parameter tuning, the testing accuracy increased slightly, but at the expense of becoming pretty overfit by 0.0901.**

### XGBClassifier

In [45]:
xgb_pipe = Pipeline([
    ('xgb_tvec', TfidfVectorizer(lowercase=True, 
                             preprocessor=None,
                             tokenizer=LemmaTokenizer(),
                             stop_words=stopwords,
                             analyzer='word')), 
    ('xgb', XGBClassifier(random_state=42))
])

In [46]:
cross_val_score(xgb_pipe, X_train, y_train, cv=3).mean()

0.7935958472543838

**The accuracy score we can expect on the test data without any hyper-parameter tuning is 0.7935.**

In [47]:
xgb_pipe.fit(X_train, y_train)
print(xgb_pipe.score(X_train, y_train))
print(xgb_pipe.score(X_test, y_test))

0.8551413560170128
0.807903951975988


**Without any hyper-parameter tuning, the XGBoost model achieved a testing accuracy of 0.8079 and is a little overfit by 0.0472.**

In [48]:
# Setting hyper-parameters that RandomizedSearchCV will search through

xgb_pipe_params = {
    'xgb_tvec__max_features' : [None, 1000, 2000, 4000],
    'xgb_tvec__min_df' : [1, 2, 3, 5],
    'xgb_tvec__max_df' : [0.25, 0.75, .90],
    'xgb_tvec__ngram_range' : [(1,1),(1, 2)],
    'xgb__learning_rate': [1.0, 2.0, 3.0],
    'xgb__n_estimators': [10, 50, 100]
}

In [49]:
# RandomizedSearchCV will sample 10 parameter settings to reduce runtime

xgb_rs = RandomizedSearchCV(xgb_pipe,
                 param_distributions = xgb_pipe_params, 
                 n_iter=10,
                 cv = 3,
                 random_state=42)

In [50]:
xgb_rs.fit(X_train, y_train)

print(xgb_rs.best_params_)
print(xgb_rs.best_score_)
print(xgb_rs.score(X_train, y_train))
print(xgb_rs.score(X_test, y_test))

{'xgb_tvec__ngram_range': (1, 2), 'xgb_tvec__min_df': 5, 'xgb_tvec__max_features': 2000, 'xgb_tvec__max_df': 0.25, 'xgb__n_estimators': 10, 'xgb__learning_rate': 1.0}
0.7824624342917025
0.8124843632724543
0.7968984492246123


**After hyper-parameter tuning, the testing accuracy decreased a tad, and the variance decreased to a 0.0156 difference. This makes this model a strong contender for best-performing alongside the AdaBoost model.**

### SVC

In [51]:
svc_pipe = Pipeline([
    ('svc_tvec', TfidfVectorizer(lowercase=True, 
                             preprocessor=None,
                             tokenizer=LemmaTokenizer(),
                             stop_words=stopwords,
                             analyzer='word')), 
    ('svc', SVC(random_state=42))
])

In [52]:
cross_val_score(svc_pipe, X_train, y_train, cv=3).mean()

0.8138603894701455

**The accuracy score we can expect on the test data without any hyper-parameter tuning is 0.8138.**

In [53]:
svc_pipe.fit(X_train, y_train)
print(svc_pipe.score(X_train, y_train))
print(svc_pipe.score(X_test, y_test))

0.9691018263697774
0.8244122061030515


**Without any hyper-parameter tuning, the Support Vector Classifier model achieved a testing accuracy of 0.8244 and is overfit by 0.1447.**

In [54]:
# Setting hyper-parameters that RandomizedSearchCV will search through

svc_pipe_params = {
    'svc_tvec__max_features' : [None, 1000, 2000, 4000],
    'svc_tvec__min_df' : [1, 2, 3, 5],
    'svc_tvec__max_df' : [0.25, 0.75, 0.9],
    'svc_tvec__ngram_range' : [(1,1),(1, 2)],
    'svc__C' : [1.0, 2.0, 3.0],
    'svc__kernel' : ['linear','rbf','poly'],
    'svc__degree' : [2, 3],
    'svc__class_weight' : [None, 'balanced']
}

In [55]:
# RandomizedSearchCV will sample 10 parameter settings to reduce runtime

svc_rs = RandomizedSearchCV(svc_pipe,
                 param_distributions = svc_pipe_params, 
                 n_iter=10,
                 cv = 3,
                 random_state=42)

In [56]:
svc_rs.fit(X_train, y_train)

print(svc_rs.best_params_)
print(svc_rs.best_score_)
print(svc_rs.score(X_train, y_train))
print(svc_rs.score(X_test, y_test))

{'svc_tvec__ngram_range': (1, 1), 'svc_tvec__min_df': 2, 'svc_tvec__max_features': 2000, 'svc_tvec__max_df': 0.9, 'svc__kernel': 'rbf', 'svc__degree': 3, 'svc__class_weight': None, 'svc__C': 1.0}
0.8083566681127657
0.9490868151113335
0.8189094547273637


**After hyper-parameter tuning, the testing accuracy decreased slightly (0.8189 vs. 0.8244) and the variance was reduced only minimally, from 0.1447 to about 0.13.**

### BernoulliNB

In [57]:
bnb_pipe = Pipeline([
    ('bnb_tvec', TfidfVectorizer(lowercase=True, 
                             preprocessor=None,
                             tokenizer=LemmaTokenizer(),
                             stop_words=stopwords,
                             analyzer='word')), 
    ('bnb', BernoulliNB())
])

In [58]:
cross_val_score(bnb_pipe, X_train, y_train, cv=3).mean()

0.8063539618417668

**The accuracy score we can expect on the test data without any hyper-parameter tuning is 0.8063.**

In [59]:
bnb_pipe.fit(X_train, y_train)
print(bnb_pipe.score(X_train, y_train))
print(bnb_pipe.score(X_test, y_test))

0.9016762571928947
0.8119059529764883


**Without any hyper-parameter tuning, the Bernoulli Naive Bayes model achieved a testing accuracy of 0.8119 and is overfit by 0.0897.**

In [60]:
# Setting hyper-parameters that RandomizedSearchCV will search through

bnb_pipe_params = {
    'bnb_tvec__max_features' : [None, 1000, 2000, 4000],
    'bnb_tvec__min_df' : [1, 2, 3, 5],
    'bnb_tvec__max_df' : [0.25, 0.75, .90],
    'bnb_tvec__ngram_range' : [(1,1),(1, 2)],
}

In [61]:
# RandomizedSearchCV will sample 10 parameter settings to reduce runtime

bnb_rs = RandomizedSearchCV(bnb_pipe,
                 param_distributions = bnb_pipe_params, 
                 n_iter=10,
                 cv = 3,
                 random_state=42)

In [62]:
bnb_rs.fit(X_train, y_train)

print(bnb_rs.best_params_)
print(bnb_rs.best_score_)
print(bnb_rs.score(X_train, y_train))
print(bnb_rs.score(X_test, y_test))

{'bnb_tvec__ngram_range': (1, 2), 'bnb_tvec__min_df': 1, 'bnb_tvec__max_features': None, 'bnb_tvec__max_df': 0.75}
0.8106065258504281
0.9578433825369027
0.8254127063531765


**After hyper-parameter tuning, there was hardly an improvement in testing accuracy, and the variance increased to a difference of 0.1324.**

### MultinomialNB

In [63]:
mnb_pipe = Pipeline([
    ('mnb_tvec', TfidfVectorizer(lowercase=True, 
                             preprocessor=None,
                             tokenizer=LemmaTokenizer(),
                             stop_words=stopwords,
                             analyzer='word')), 
    ('mnb', MultinomialNB())
])

In [64]:
cross_val_score(mnb_pipe, X_train, y_train, cv=3).mean()

0.7950967383894213

**The accuracy score we can expect on the test data without any hyper-parameter tuning is 0.7950.**

In [65]:
mnb_pipe.fit(X_train, y_train)
print(mnb_pipe.score(X_train, y_train))
print(mnb_pipe.score(X_test, y_test))

0.9076807605704278
0.7963981990995498


**Without any hyper-parameter tuning, the Multinomial Naive Bayes model achieved a testing accuracy of 0.7963 and is overfit by 0.1113.**

In [66]:
# Setting hyper-parameters that RandomizedSearchCV will search through

mnb_pipe_params = {
    'mnb_tvec__max_features' : [None, 1000, 2000, 4000],
    'mnb_tvec__min_df' : [1, 2, 3, 5],
    'mnb_tvec__max_df' : [0.25, 0.75, .90],
    'mnb_tvec__ngram_range' : [(1,1),(1, 2)],
}

In [67]:
# RandomizedSearchCV will sample 10 parameter settings to reduce runtime

mnb_rs = RandomizedSearchCV(mnb_pipe,
                 param_distributions = mnb_pipe_params, 
                 n_iter=10,
                 cv = 3,
                 random_state=42)

In [68]:
mnb_rs.fit(X_train, y_train)

print(mnb_rs.best_params_)
print(mnb_rs.best_score_)
print(mnb_rs.score(X_train, y_train))
print(mnb_rs.score(X_test, y_test))

{'mnb_tvec__ngram_range': (1, 2), 'mnb_tvec__min_df': 1, 'mnb_tvec__max_features': None, 'mnb_tvec__max_df': 0.75}
0.8016015546503352
0.9667250437828371
0.8124062031015508


**After hyper-parameter tuning, the testing accuracy increased slightly, but the variance also increased, to a difference of 0.1543.**

## Model Exploration Insights

Each pipeline was first fitted with default hyper-parameters. The testing scores of all the models, with the exception of k-Nearest Neighbors, were between 78% and 82%, so the deciding factor was the amount of variance, measured here as the gap between training and testing accuracy. As a result, the AdaBoost, Gradient Boosting and XGBoost models emerged as the strongest contenders for best-performing based on their competitive testing accuracy scores and low variance.

Of the three strongest contenders, XGBoost achieved the highest testing accuracy score of 0.8079, but was the most overfit by 0.0472. AdaBoost achieved the second highest testing score of 0.7938, but achieved the lowest variance of 0.01. Gradient Boosting had the lowest testing score of 0.7873, but achieved the second lowest variance of 0.0155.

During the subsequent passes with a Randomized Search of hyper-parameters, XGBoost emerged as the best performer with a testing accuracy score of 0.7968 and the lowest variance of 0.0156. This suggests it is the model most likely to perform consistently on unseen data.
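
The tuned results can be lined up side by side to check this comparison. The sketch below is one way to do it, assuming the scores are copied by hand from the `RandomizedSearchCV` outputs in this notebook; the dictionary layout and variable names are illustrative choices.

```python
# (training accuracy, testing accuracy) after RandomizedSearchCV,
# copied from the outputs printed earlier in this notebook
tuned_scores = {
    "Random Forest": (0.9019264448336253, 0.8009004502251126),
    "Logistic Regression": (0.9419564673505129, 0.8219109554777388),
    "k-Nearest Neighbors": (0.8986740055041281, 0.6983491745872936),
    "AdaBoost": (0.8164873655241431, 0.7953976988494247),
    "Gradient Boosting": (0.8925444083062297, 0.8024012006003002),
    "XGBoost": (0.8124843632724543, 0.7968984492246123),
    "SVC": (0.9490868151113335, 0.8189094547273637),
    "Bernoulli NB": (0.9578433825369027, 0.8254127063531765),
    "Multinomial NB": (0.9667250437828371, 0.8124062031015508),
}

# Rank models by the gap between training and testing accuracy (smallest first)
ranked = sorted(
    ((name, train - test, test) for name, (train, test) in tuned_scores.items()),
    key=lambda row: row[1],
)

for name, gap, test in ranked:
    print(f"{name:<22} test={test:.4f}  gap={gap:.4f}")

smallest_gap_model = ranked[0][0]
highest_test_model = max(tuned_scores, key=lambda name: tuned_scores[name][1])
```

Sorting by the gap puts XGBoost first and AdaBoost second. Sorting by testing accuracy alone would order the models differently, so the ranking depends on how the two criteria are weighted.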