# Models with GridSearch

##### TABLE OF CONTENTS
 - [Observations and Overview for Models with GridSearch](#Observations-and-Overview-for-Models-with-GridSearch)
 - [Import and Define our Variables for Models with GridSearch](#Import-and-Define-our-Variables-for-Models-with-GridSearch)
 - [Make Classification Class with GridSearch](#Make-Classification-Class-with-GridSearch)
 - [Logistic Regression for Models with GridSearch](#Logistic-Regression-for-Models-with-GridSearch)
 - [Naive Bayes for Models with GridSearch](#Naive-Bayes-for-Models-with-GridSearch)
 - [K Nearest Neighbors for Models with GridSearch](#K-Nearest-Neighbors-for-Models-with-GridSearch)


### Observations and Overview for Models with GridSearch
[(back to top)](#Models-with-GridSearch) <br />

__*Logistic Regression*__ performed best overall. With the TFIDF Vectorizer it reached an F1 Score of 0.6796 on the held-out Test Data, which was not part of the train/test split (TTS). Although Logistic Regression had a bias towards predicting 'AMA', it performed better than the other two models outlined here.

The __*KNN Classifier*__ also did well, with an F1 Score of 0.6800 on the same held-out data using the Count Vectorizer. That is marginally above the Logistic Regression F1 Score, but its Accuracy (0.6352 vs. 0.6674) and ROC AUC Score (0.6664 vs. 0.7351) were lower, and the KNN Model was more heavily biased towards the AMA subreddit, so it ranks below the Logistic Regression Model.

The __*Naive Bayes model*__ underperformed both the Logistic Regression and KNN Classification Models, with an F1 Score of 0.5860 on the held-out data. The best Naive Bayes model used the TFIDF Vectorizer and was biased towards the AskReddit subreddit.
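
The ranking above depends on which metric is used. As a small worked check, the snippet below ranks the three best held-out ("Xy/new") models by F1, Accuracy and ROC AUC. The numbers are copied from the comparison tables further down in this notebook; the dictionary layout and the short model labels are only illustrative.

```python
# Held-out ("Xy/new") metrics copied from the comparison tables in this notebook.
held_out = {
    "LogReg + Tfidf": {"f1": 0.679641, "accuracy": 0.667358, "roc_auc": 0.735102},
    "KNN + cVect": {"f1": 0.68, "accuracy": 0.635233, "roc_auc": 0.666435},
    "nBayes + Tfidf": {"f1": 0.585987, "accuracy": 0.629534, "roc_auc": 0.691077},
}

best = {}
for metric in ("f1", "accuracy", "roc_auc"):
    # Sort model names from highest to lowest value of this metric.
    ranked = sorted(held_out, key=lambda name: held_out[name][metric], reverse=True)
    best[metric] = ranked[0]
    print(metric, ranked)
```

F1 alone puts KNN a hair ahead of Logistic Regression, while Accuracy and ROC AUC both put Logistic Regression first.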


### Import and Define our Variables for Models with GridSearch
[(back to top)](#Models-with-GridSearch) <br />


In [1]:
# Star import: brings in pandas (pd), the scikit-learn names and the helper functions used below
from ipynb.fs.full.functions import *

In [2]:
# Data to create our model
df = pd.read_csv('../data/clean_data.csv')
df = add_binary_and_drop(df, drop='subreddit', repl_w_zero='AMA')
df = remove_keywords(df, col_to_modify='body', remove_from='AMA')
df = remove_keywords(df, col_to_modify='body', remove_from='AskReddit')
df = remove_deleted_comments(df, col_to_modify='body', repl_w_nan='[deleted]')

In [3]:
# Model X, and y

X = df['body']
y = df['subreddit_binary']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=3)

In [4]:
# TEST data (not part of train/test/split)

df1 = pd.read_csv('../data/2021-04-27_1812_AMA_comments.csv')
df2 = pd.read_csv('../data/2021-04-27_1812_AskReddit_comments.csv')
df_test_pred = pd.concat([df1, df2], axis=0)

In [5]:
df_test_pred = drop_cols_cleaning(df_test_pred)
df_test_pred = add_binary_and_drop(df_test_pred, drop='subreddit', repl_w_zero='AMA')


In [6]:
df_test_pred = remove_deleted_comments(df_test_pred, col_to_modify='body', repl_w_nan='[deleted]')


In [7]:
# Shuffle the rows: sampling every row without replacement, with a fixed seed
df_test_pred = df_test_pred.sample(n=df_test_pred.shape[0], random_state=3)

X_new = df_test_pred['body']
y_new = df_test_pred['subreddit_binary'] 

### Make Classification Class with GridSearch
[(back to top)](#Models-with-GridSearch) <br />


A quick note: I made a `ClassificationModel` class to keep this information organized in DataFrames. Please head over to the functions Notebook to see the code for it.
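
The class itself is not reproduced here. As a rough idea of what such a wrapper could look like, the sketch below assumes it runs a scikit-learn `GridSearchCV` over the pipeline and stores held-out metrics in a one-column DataFrame, which matches how `.model` and `.df` are used in the cells that follow. The name `ClassificationModelSketch`, the `cv` argument, the choice of four metrics and the toy data are all hypothetical; the real class may differ.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline


class ClassificationModelSketch:
    """Hypothetical stand-in for ClassificationModel; not the author's code."""

    def __init__(self, pipeline, X_train, X_test, y_train, y_test,
                 params, verbose=0, mod_name="model", cv=5):
        # Grid-search the pipeline on the training data (cv is an assumed argument).
        self.model = GridSearchCV(pipeline, param_grid=params, cv=cv, verbose=verbose)
        self.model.fit(X_train, y_train)
        # Score the refitted best estimator on the held-out data.
        preds = self.model.predict(X_test)
        self.df = pd.DataFrame({mod_name: {
            "F1 Score": f1_score(y_test, preds),
            "Recall Score": recall_score(y_test, preds),
            "Accuracy": accuracy_score(y_test, preds),
            "Precision Score": precision_score(y_test, preds),
        }})


# Toy data only, so the sketch runs on its own (0 = AMA-like, 1 = AskReddit-like).
toy_X_train = [
    "ask me anything about my job", "i am a pilot ask me anything",
    "i worked as a chef ask away", "ask me about being a nurse",
    "what is your favorite movie", "what food do you hate most",
    "which song do you love", "what is the best advice you got",
]
toy_y_train = [0, 0, 0, 0, 1, 1, 1, 1]
toy_X_test = ["i am a doctor ask me anything", "what is your favorite song"]
toy_y_test = [0, 1]

toy = ClassificationModelSketch(
    make_pipeline(CountVectorizer(), LogisticRegression()),
    toy_X_train, toy_X_test, toy_y_train, toy_y_test,
    params={"countvectorizer__ngram_range": [(1, 1), (1, 2)]},
    mod_name="toy cVect LogReg", cv=2,
)
print(toy.model.best_params_)
print(toy.df)
```

Keeping each model's metrics in a one-column DataFrame is what makes the later `pd.concat([...], axis=1)` comparisons a one-liner.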

# Logistic Regression for Models with GridSearch
[(back to top)](#Models-with-GridSearch) <br />


<h2> (gridsearch) CountVectorizer(), LogisticRegression() </h2>


In [8]:
gs_cv_lgr = ClassificationModel(make_pipeline(
    CountVectorizer(), 
    LogisticRegression()), 
    X_train, X_test, y_train, y_test,
    params={
        'countvectorizer__ngram_range': [ (1, 1), (1, 2), (2, 2), (1, 3), (2, 3), (3, 3) ],
        'countvectorizer__stop_words': [ 'english', None ],
        'countvectorizer__max_features': [ 500, 1000, 2000, 5000 ]
}, verbose=3, mod_name='Train/Test cVect LogReg')

print(gs_cv_lgr.model.best_score_)
print(gs_cv_lgr.model.best_estimator_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
0.7052590909517684
Pipeline(steps=[('countvectorizer',
                 CountVectorizer(max_features=5000, ngram_range=(1, 2))),
                ('logisticregression', LogisticRegression())])
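
The "48 candidates, totalling 240 fits" in the log follows directly from the grid: 6 n-gram ranges times 2 stop-word settings times 4 feature limits, each fitted on 5 folds. A quick check with the standard library, using the same values as the cell above:

```python
from itertools import product

# Same hyperparameter values as the grid passed to ClassificationModel above.
param_grid = {
    "countvectorizer__ngram_range": [(1, 1), (1, 2), (2, 2), (1, 3), (2, 3), (3, 3)],
    "countvectorizer__stop_words": ["english", None],
    "countvectorizer__max_features": [500, 1000, 2000, 5000],
}

# Each combination of one value per parameter is one candidate.
candidates = list(product(*param_grid.values()))
n_folds = 5  # taken from the "Fitting 5 folds" log line

print(len(candidates))            # 6 * 2 * 4 = 48
print(len(candidates) * n_folds)  # 48 * 5 = 240
```

The same count applies to every grid in this notebook, since all of them vary the same three vectorizer parameters.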


In [9]:
X_gs_cv_lgr = ClassificationModel(make_pipeline(
    CountVectorizer(), 
    LogisticRegression()), 
    X, X_new, y, y_new,
    params={
        'countvectorizer__ngram_range': [ (1, 1), (1, 2), (2, 2), (1, 3), (2, 3), (3, 3) ],
        'countvectorizer__stop_words': [ 'english', None ],
        'countvectorizer__max_features': [ 500, 1000, 2000, 5000 ]
}, verbose=3, mod_name='Xy/new cVect LogReg')

print(X_gs_cv_lgr.model.best_score_)
print(X_gs_cv_lgr.model.best_estimator_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
0.6989007253365258
Pipeline(steps=[('countvectorizer',
                 CountVectorizer(max_features=5000, ngram_range=(1, 3))),
                ('logisticregression', LogisticRegression())])


<h2 style="color:red;"> (gridsearch) TfidfVectorizer(), LogisticRegression() </h2>
<h2 style="color:red;"> BEST! best score (gs): 0.7137560911489386 </h2>
Pipeline(steps=[('tfidfvectorizer',
                 TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
                ('logisticregression', LogisticRegression())])

In [10]:
# ### Best Train ###

gs_tv_lgr = ClassificationModel(make_pipeline(
    TfidfVectorizer(), 
    LogisticRegression()), 
    X_train, X_test, y_train, y_test,
    params={
        'tfidfvectorizer__ngram_range': [ (1, 1), (1, 2), (2, 2), (1, 3), (2, 3), (3, 3) ],
        'tfidfvectorizer__stop_words': [ 'english', None ],
        'tfidfvectorizer__max_features': [ 500, 1000, 2000, 5000 ]
}, verbose=3, mod_name='Train/Test Tfidf LogReg')

print(gs_tv_lgr.model.best_score_)
print(gs_tv_lgr.model.best_estimator_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
0.7137560911489386
Pipeline(steps=[('tfidfvectorizer',
                 TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
                ('logisticregression', LogisticRegression())])
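
The step names in the printed pipeline explain the keys used in `params`: `make_pipeline` names each step after its lowercased class name, and a grid key is the step name, two underscores, then the parameter name. A minimal sketch of that naming, independent of the data in this notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Step names are generated from the lowercased class names.
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['tfidfvectorizer', 'logisticregression']

# "<step name>__<parameter>" addresses a parameter of one step.
key = "tfidfvectorizer__ngram_range"
print(key in pipe.get_params())  # True

# Setting it through the pipeline changes the vectorizer itself.
pipe.set_params(**{key: (1, 2)})
print(pipe.named_steps["tfidfvectorizer"].ngram_range)  # (1, 2)
```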


In [11]:
# Fit the best pipeline on all of X, y and try it on the fresh data

X_gs_tv_lgr = ClassificationModel(make_pipeline(
    TfidfVectorizer(), 
    LogisticRegression()), 
    X_train=X, X_test=X_new, y_train=y, y_test=y_new,
    params={
        'tfidfvectorizer__ngram_range': [ (1, 1), (1, 2), (2, 2), (1, 3), (2, 3), (3, 3) ],
        'tfidfvectorizer__stop_words': [ 'english', None ],
        'tfidfvectorizer__max_features': [ 500, 1000, 2000, 5000 ]
}, verbose=3, mod_name='Xy/new Tfidf LogReg')

print(X_gs_tv_lgr.model.best_score_)
print(X_gs_tv_lgr.model.best_estimator_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
0.71102542135343
Pipeline(steps=[('tfidfvectorizer',
                 TfidfVectorizer(max_features=5000, ngram_range=(1, 3))),
                ('logisticregression', LogisticRegression())])


In [12]:
compare_lgr = pd.concat([gs_cv_lgr.df, X_gs_cv_lgr.df, gs_tv_lgr.df, X_gs_tv_lgr.df], axis=1)
compare_lgr

| Metric | Train/Test cVect LogReg | Xy/new cVect LogReg | Train/Test Tfidf LogReg | Xy/new Tfidf LogReg |
|---|---|---|---|---|
| F1 Score | 0.724341 | 0.65098 | 0.724428 | 0.679641 |
| Recall Score | 0.758025 | 0.693835 | 0.74321 | 0.711599 |
| Accuracy | 0.711878 | 0.631088 | 0.717633 | 0.667358 |
| Balanced Accuracy | 0.711935 | 0.631604 | 0.717664 | 0.667721 |
| Precision Score | 0.693524 | 0.613112 | 0.706573 | 0.65043 |
| Average Precision Score | 0.757351 | 0.66457 | 0.790564 | 0.707247 |
| ROC AUC Score | 0.78003 | 0.696798 | 0.798083 | 0.735102 |
| True Positive | 811.0 | 554.0 | 843.0 | 607.0 |
| False Negative | 407.0 | 419.0 | 375.0 | 366.0 |
| False Positive | 294.0 | 293.0 | 312.0 | 276.0 |
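
The three count rows can be cross-checked against the ratio metrics. The sketch below uses the first column ("Train/Test cVect LogReg"). Read by their printed labels, the counts give a recall of about 0.666 rather than the reported 0.758. They do reproduce the reported recall, precision and accuracy if the rows are read as tn, fp, fn, which is the order returned by scikit-learn's `confusion_matrix(y_true, y_pred).ravel()` (tn, fp, fn, tp). That reading is an assumption on my part, since the code that builds the table lives in the functions Notebook.

```python
# Values copied from the "Train/Test cVect LogReg" column of the table above.
row1, row2, row3 = 811, 407, 294
recall, precision, accuracy = 0.758025, 0.693524, 0.711878

# Reading the rows by their printed labels (row1 = TP, row2 = FN):
recall_as_labelled = row1 / (row1 + row2)

# Assumption: the rows are really tn, fp, fn (the ravel() order).
tn, fp, fn = row1, row2, row3
# tp is not in the table; recover it from recall = tp / (tp + fn).
tp = round(recall * fn / (1 - recall))

precision_check = tp / (tp + fp)
accuracy_check = (tp + tn) / (tp + tn + fp + fn)

print(round(recall_as_labelled, 4))
print(tp, round(precision_check, 6), round(accuracy_check, 6))
```

Under the tn/fp/fn reading the first two rows also add up to the same total (1218) in every Train/Test column, as expected for the number of class-0 comments in a fixed test split.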


# Naive Bayes for Models with GridSearch
[(back to top)](#Models-with-GridSearch) <br />


<h2> (gridsearch) CountVectorizer(), MultinomialNB() </h2>


In [13]:
gs_cv_nb = ClassificationModel(make_pipeline(
    CountVectorizer(), 
    MultinomialNB()), 
    X_train, X_test, y_train, y_test,
    params={
        'countvectorizer__ngram_range': [ (1, 1), (1, 2), (2, 2), (1, 3), (2, 3), (3, 3) ],
        'countvectorizer__stop_words': [ 'english', None ],
        'countvectorizer__max_features': [ 500, 1000, 2000, 5000 ]
}, verbose=3, mod_name='Train/Test cVect nBayes')


print(gs_cv_nb.model.best_score_)
print(gs_cv_nb.model.best_estimator_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
0.691009135549776
Pipeline(steps=[('countvectorizer', CountVectorizer(max_features=5000)),
                ('multinomialnb', MultinomialNB())])


In [14]:
X_gs_cv_nb = ClassificationModel(make_pipeline(
    CountVectorizer(), 
    MultinomialNB()), 
    X, X_new, y, y_new,
    params={
        'countvectorizer__ngram_range': [ (1, 1), (1, 2), (2, 2), (1, 3), (2, 3), (3, 3) ],
        'countvectorizer__stop_words': [ 'english', None ],
        'countvectorizer__max_features': [ 500, 1000, 2000, 5000 ]
}, verbose=3, mod_name='Xy/new cVect nBayes')


print(X_gs_cv_nb.model.best_score_)
print(X_gs_cv_nb.model.best_estimator_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
0.6862602543982864
Pipeline(steps=[('countvectorizer', CountVectorizer(max_features=5000)),
                ('multinomialnb', MultinomialNB())])


<h2> (gridsearch) TfidfVectorizer(), MultinomialNB() </h2>


In [15]:
gs_tv_nb = ClassificationModel(make_pipeline(
    TfidfVectorizer(), 
    MultinomialNB()), 
    X_train, X_test, y_train, y_test,
    params={
        'tfidfvectorizer__ngram_range': [ (1, 1), (1, 2), (2, 2), (1, 3), (2, 3), (3, 3) ],
        'tfidfvectorizer__stop_words': [ 'english', None ],
        'tfidfvectorizer__max_features': [ 500, 1000, 2000, 5000 ]
}, verbose=3, mod_name='Train/Test Tfidf nBayes')


print(gs_tv_nb.model.best_score_)
print(gs_tv_nb.model.best_estimator_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
0.7010116705944209
Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(max_features=5000)),
                ('multinomialnb', MultinomialNB())])


In [16]:
X_gs_tv_nb = ClassificationModel(make_pipeline(
    TfidfVectorizer(), 
    MultinomialNB()), 
    X, X_new, y, y_new,
    params={
        'tfidfvectorizer__ngram_range': [ (1, 1), (1, 2), (2, 2), (1, 3), (2, 3), (3, 3) ],
        'tfidfvectorizer__stop_words': [ 'english', None ],
        'tfidfvectorizer__max_features': [ 500, 1000, 2000, 5000 ]
}, verbose=3, mod_name='Xy/new Tfidf nBayes')


print(X_gs_tv_nb.model.best_score_)
print(X_gs_tv_nb.model.best_estimator_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
0.6917065863048061
Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(max_features=5000)),
                ('multinomialnb', MultinomialNB())])


In [17]:
compare_nb = pd.concat([gs_cv_nb.df, X_gs_cv_nb.df, gs_tv_nb.df, X_gs_tv_nb.df], axis=1)
compare_nb

| Metric | Train/Test cVect nBayes | Xy/new cVect nBayes | Train/Test Tfidf nBayes | Xy/new Tfidf nBayes |
|---|---|---|---|---|
| F1 Score | 0.640535 | 0.573113 | 0.661064 | 0.585987 |
| Recall Score | 0.55144 | 0.507837 | 0.582716 | 0.528736 |
| Accuracy | 0.690917 | 0.62487 | 0.701603 | 0.629534 |
| Balanced Accuracy | 0.690745 | 0.623908 | 0.701457 | 0.628705 |
| Precision Score | 0.763968 | 0.657645 | 0.763754 | 0.657143 |
| Average Precision Score | 0.772473 | 0.657542 | 0.803198 | 0.655863 |
| ROC AUC Score | 0.786829 | 0.668039 | 0.8002 | 0.691077 |
| True Positive | 1011.0 | 720.0 | 999.0 | 709.0 |
| False Negative | 207.0 | 253.0 | 219.0 | 264.0 |
| False Positive | 545.0 | 471.0 | 507.0 | 451.0 |


# K Nearest Neighbors for Models with GridSearch
[(back to top)](#Models-with-GridSearch) <br />


<h2> (gridsearch) CountVectorizer(), KNeighborsClassifier() </h2>


In [18]:
gs_cv_knn = ClassificationModel(make_pipeline(
    CountVectorizer(), 
    KNeighborsClassifier()), 
    X_train, X_test, y_train, y_test,
    params={
        'countvectorizer__ngram_range': [ (1, 1), (1, 2), (2, 2), (1, 3), (2, 3), (3, 3) ],
        'countvectorizer__stop_words': [ 'english', None ],
        'countvectorizer__max_features': [ 500, 1000, 2000, 5000 ]
}, verbose=3, mod_name='Train/Test cVect KNN')


print(gs_cv_knn.model.best_score_)
print(gs_cv_knn.model.best_estimator_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
0.6334600542687335
Pipeline(steps=[('countvectorizer', CountVectorizer(max_features=500)),
                ('kneighborsclassifier', KNeighborsClassifier())])


In [19]:
X_gs_cv_knn = ClassificationModel(make_pipeline(
    CountVectorizer(), 
    KNeighborsClassifier()), 
    X, X_new, y, y_new,
    params={
        'countvectorizer__ngram_range': [ (1, 1), (1, 2), (2, 2), (1, 3), (2, 3), (3, 3) ],
        'countvectorizer__stop_words': [ 'english', None ],
        'countvectorizer__max_features': [ 500, 1000, 2000, 5000 ]
}, verbose=3, mod_name='Xy/new cVect KNN')

print(X_gs_cv_knn.model.best_score_)
print(X_gs_cv_knn.model.best_estimator_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
0.6343620855021903
Pipeline(steps=[('countvectorizer',
                 CountVectorizer(max_features=2000, ngram_range=(1, 2))),
                ('kneighborsclassifier', KNeighborsClassifier())])


<h2> (gridsearch) TfidfVectorizer(), KNeighborsClassifier() </h2>


In [20]:
gs_tv_knn = ClassificationModel(make_pipeline(
    TfidfVectorizer(), 
    KNeighborsClassifier()), 
    X_train, X_test, y_train, y_test,
    params={
        'tfidfvectorizer__ngram_range': [ (1, 1), (1, 2), (2, 2), (1, 3), (2, 3), (3, 3) ],
        'tfidfvectorizer__stop_words': [ 'english', None ],
        'tfidfvectorizer__max_features': [ 500, 1000, 2000, 5000 ]
}, verbose=3, mod_name='Train/Test Tfidf KNN')

print(gs_tv_knn.model.best_score_)
print(gs_tv_knn.model.best_estimator_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
0.6052326138188101
Pipeline(steps=[('tfidfvectorizer',
                 TfidfVectorizer(max_features=500, ngram_range=(2, 3))),
                ('kneighborsclassifier', KNeighborsClassifier())])


In [21]:
X_gs_tv_knn = ClassificationModel(make_pipeline(
    TfidfVectorizer(), 
    KNeighborsClassifier()), 
    X, X_new, y, y_new,
    params={
        'tfidfvectorizer__ngram_range': [ (1, 1), (1, 2), (2, 2), (1, 3), (2, 3), (3, 3) ],
        'tfidfvectorizer__stop_words': [ 'english', None ],
        'tfidfvectorizer__max_features': [ 500, 1000, 2000, 5000 ]
}, verbose=3, mod_name='Xy/new Tfidf KNN')

print(X_gs_tv_knn.model.best_score_)
print(X_gs_tv_knn.model.best_estimator_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
0.5982938412642107
Pipeline(steps=[('tfidfvectorizer',
                 TfidfVectorizer(max_features=1000, ngram_range=(2, 2))),
                ('kneighborsclassifier', KNeighborsClassifier())])


In [22]:
compare_knn = pd.concat([gs_cv_knn.df, X_gs_cv_knn.df, gs_tv_knn.df, X_gs_tv_knn.df], axis=1)
compare_knn

| Metric | Train/Test cVect KNN | Xy/new cVect KNN | Train/Test Tfidf KNN | Xy/new Tfidf KNN |
|---|---|---|---|---|
| F1 Score | 0.598753 | 0.68 | 0.655532 | 0.631148 |
| Recall Score | 0.592593 | 0.781609 | 0.775309 | 0.724138 |
| Accuracy | 0.60337 | 0.635233 | 0.593095 | 0.580311 |
| Balanced Accuracy | 0.603357 | 0.636437 | 0.593319 | 0.581493 |
| Precision Score | 0.605042 | 0.60177 | 0.567812 | 0.559322 |
| Average Precision Score | 0.608313 | 0.608445 | 0.581013 | 0.562692 |
| ROC AUC Score | 0.650144 | 0.666435 | 0.628324 | 0.610387 |
| True Positive | 748.0 | 478.0 | 501.0 | 427.0 |
| False Negative | 470.0 | 495.0 | 717.0 | 546.0 |
| False Positive | 495.0 | 209.0 | 273.0 | 264.0 |


In [23]:
compare_lgr

| Metric | Train/Test cVect LogReg | Xy/new cVect LogReg | Train/Test Tfidf LogReg | Xy/new Tfidf LogReg |
|---|---|---|---|---|
| F1 Score | 0.724341 | 0.65098 | 0.724428 | 0.679641 |
| Recall Score | 0.758025 | 0.693835 | 0.74321 | 0.711599 |
| Accuracy | 0.711878 | 0.631088 | 0.717633 | 0.667358 |
| Balanced Accuracy | 0.711935 | 0.631604 | 0.717664 | 0.667721 |
| Precision Score | 0.693524 | 0.613112 | 0.706573 | 0.65043 |
| Average Precision Score | 0.757351 | 0.66457 | 0.790564 | 0.707247 |
| ROC AUC Score | 0.78003 | 0.696798 | 0.798083 | 0.735102 |
| True Positive | 811.0 | 554.0 | 843.0 | 607.0 |
| False Negative | 407.0 | 419.0 | 375.0 | 366.0 |
| False Positive | 294.0 | 293.0 | 312.0 | 276.0 |


In [24]:
compare_nb

| Metric | Train/Test cVect nBayes | Xy/new cVect nBayes | Train/Test Tfidf nBayes | Xy/new Tfidf nBayes |
|---|---|---|---|---|
| F1 Score | 0.640535 | 0.573113 | 0.661064 | 0.585987 |
| Recall Score | 0.55144 | 0.507837 | 0.582716 | 0.528736 |
| Accuracy | 0.690917 | 0.62487 | 0.701603 | 0.629534 |
| Balanced Accuracy | 0.690745 | 0.623908 | 0.701457 | 0.628705 |
| Precision Score | 0.763968 | 0.657645 | 0.763754 | 0.657143 |
| Average Precision Score | 0.772473 | 0.657542 | 0.803198 | 0.655863 |
| ROC AUC Score | 0.786829 | 0.668039 | 0.8002 | 0.691077 |
| True Positive | 1011.0 | 720.0 | 999.0 | 709.0 |
| False Negative | 207.0 | 253.0 | 219.0 | 264.0 |
| False Positive | 545.0 | 471.0 | 507.0 | 451.0 |
