## Modeling Stemmed Text (Words and Potentially Significant Digits)

As the title indicates, the third set of text data modeled was stemmed text consisting of words and digits. The logistic regression model fit below on TF-IDF-vectorized text (`TfidfVectorizer`) turned out to be the strongest model overall.

Imports, data read-in, and a `train_test_split`:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

In [2]:
df = pd.read_csv('../datasets/main_final.csv')[['comp_stem', 'subreddit']].reset_index(drop=True)

In [3]:
X = df['comp_stem']
y = df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y)

The input data is run through both `CountVectorizer()` and `TfidfVectorizer()`:

In [4]:
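cvec = CountVectorizer()
cvec.fit(X_train)  # learn the vocabulary from the training text only
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
X_train_cv = pd.DataFrame(cvec.transform(X_train).toarray(), columns = cvec.get_feature_names_out())
X_test_cv = pd.DataFrame(cvec.transform(X_test).toarray(), columns = cvec.get_feature_names_out())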

In [5]:
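tvec = TfidfVectorizer()
tvec.fit(X_train)  # learn the vocabulary and idf weights from the training text only
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
X_train_tv = pd.DataFrame(tvec.transform(X_train).toarray(), columns = tvec.get_feature_names_out())
X_test_tv = pd.DataFrame(tvec.transform(X_test).toarray(), columns = tvec.get_feature_names_out())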
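
To make the difference between the two vectorizers concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. The corpus and the helper names `count_vector` and `tfidf_vector` are illustrative assumptions, not part of the project. The sketch assumes the scikit-learn defaults for `TfidfVectorizer`: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization. It also splits on whitespace, which is simpler than the library's default tokenizer.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the project's 'comp_stem' column is not reproduced here.
docs = ["gpu driver crash", "gpu price drop", "driver update crash crash"]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}

def count_vector(doc):
    """Raw term counts, one entry per vocabulary term."""
    counts = Counter(doc)
    return [counts[t] for t in vocab]

def tfidf_vector(doc):
    """Counts scaled by smoothed idf, then L2-normalized."""
    counts = Counter(doc)
    raw = [counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

count_rows = [count_vector(d) for d in tokenized]
tfidf_rows = [tfidf_vector(d) for d in tokenized]

print(vocab)
for c_row, t_row in zip(count_rows, tfidf_rows):
    print(c_row, [round(v, 3) for v in t_row])
```

Terms that appear in fewer documents, such as `price` here, receive more weight than common ones such as `gpu`, even when their raw counts are equal.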

Logistic regression:

In [6]:
logreg = LogisticRegression()

pipe_params = {
    'penalty' : ['l1', 'l2'],
    'C' : np.linspace(0.1, 1, 10),
    'solver' : ['liblinear']
}

In [7]:
gs = GridSearchCV(logreg, param_grid = pipe_params, cv = 5)
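
As a rough guide to the cost of this search, the sketch below enumerates the same grid with `itertools.product`. The variable names are illustrative, and the `C` values are written out by hand rather than taken from `np.linspace`.

```python
from itertools import product

penalties = ['l1', 'l2']
# Same ten values as np.linspace(0.1, 1, 10), written out to avoid a NumPy dependency.
c_values = [round(0.1 * i, 1) for i in range(1, 11)]
solvers = ['liblinear']
cv_folds = 5

candidates = list(product(penalties, c_values, solvers))
n_candidates = len(candidates)        # 2 penalties * 10 C values * 1 solver
n_cv_fits = n_candidates * cv_folds   # one fit per candidate per fold

print(n_candidates, n_cv_fits)
```

With the default `refit=True`, `GridSearchCV` adds one final fit of the best candidate on the full training set.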

In [8]:
gs.fit(X_train_cv, y_train)

GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
                         'penalty': ['l1', 'l2'], 'solver': ['liblinear']})

In [9]:
gs.best_params_

{'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}

In [10]:
print('Best cross-validation score: ', gs.best_score_, '\n',
      'Best training score: ', gs.score(X_train_cv, y_train), '\n',
      'Best testing score: ', gs.score(X_test_cv, y_test))

Best cross-validation score:  0.8241024225189657 
 Best training score:  0.937306158617634 
 Best testing score:  0.8232558139534883


In [11]:
gs.fit(X_train_tv, y_train)

GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
                         'penalty': ['l1', 'l2'], 'solver': ['liblinear']})

In [12]:
gs.best_params_

{'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear'}

Note that the results below are the best obtained for any model, on any version of the text data, over the course of the project. The model overfits the training data, which is a concern, but its cross-validation score is the highest of any model, and its test score is among the highest:

In [13]:
print('Best cross-validation score: ', gs.best_score_, '\n',
      'Best training score: ', gs.score(X_train_tv, y_train), '\n',
      'Best testing score: ', gs.score(X_test_tv, y_test))

Best cross-validation score:  0.8357334584932344 
 Best training score:  0.9175897208684094 
 Best testing score:  0.8315614617940199
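
The size of the overfit can be read off the reported scores. The short sketch below copies the numbers from the two logistic regression runs and computes the gap between training and testing accuracy. The dictionary layout is only an illustration.

```python
# Scores copied from the two logistic regression outputs above.
scores = {
    'CountVectorizer': {'cv': 0.8241024225189657, 'train': 0.937306158617634, 'test': 0.8232558139534883},
    'TfidfVectorizer': {'cv': 0.8357334584932344, 'train': 0.9175897208684094, 'test': 0.8315614617940199},
}

# Training accuracy minus testing accuracy: a larger gap suggests more overfitting.
gaps = {name: s['train'] - s['test'] for name, s in scores.items()}

for name, gap in gaps.items():
    print(f'{name}: train - test = {gap:.4f}')
```

The gap is roughly 0.11 for the count-vectorized text and roughly 0.09 for the TF-IDF text, so the TF-IDF model both scores higher and overfits somewhat less.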


A `BaggingClassifier()` is fit to both vectorized datasets, with results consistent with those from the other text datasets:

In [14]:
bc = BaggingClassifier(n_estimators = 11)

In [15]:
bc.fit(X_train_cv, y_train)

BaggingClassifier(n_estimators=11)

In [16]:
print('Training score: ', bc.score(X_train_cv, y_train))
print('Cross-validation score: ', cross_val_score(bc, X_train_cv, y_train, cv= 5).mean())
print('Testing score: ', bc.score(X_test_cv, y_test))

Training score:  0.9906956136464333
Cross-validation score:  0.8078213281060668
Testing score:  0.8102990033222591


In [17]:
bc.fit(X_train_tv, y_train)

BaggingClassifier(n_estimators=11)

In [18]:
print('Training score: ', bc.score(X_train_tv, y_train))
print('Cross-validation score: ', cross_val_score(bc, X_train_tv, y_train, cv= 5).mean())
print('Testing score: ', bc.score(X_test_tv, y_test))

Training score:  0.9902525476295968
Cross-validation score:  0.8070444777795162
Testing score:  0.8196013289036544


Similarly, AdaBoost:

In [19]:
abc = AdaBoostClassifier(n_estimators = 50)
abc.fit(X_train_cv, y_train)

AdaBoostClassifier()

In [20]:
abc.score(X_train_cv, y_train)

0.819672131147541

In [21]:
abc.score(X_test_cv, y_test)

0.8136212624584718

In [22]:
abc = AdaBoostClassifier(n_estimators = 100)
abc.fit(X_train_cv, y_train)

AdaBoostClassifier(n_estimators=100)

In [23]:
abc.score(X_train_cv, y_train)

0.8389455028799291

In [24]:
abc.score(X_test_cv, y_test)

0.8129568106312293

In [25]:
abc = AdaBoostClassifier(n_estimators = 150)
abc.fit(X_train_cv, y_train)

AdaBoostClassifier(n_estimators=150)

In [26]:
abc.score(X_train_cv, y_train)

0.8535666814355339

In [27]:
abc.score(X_test_cv, y_test)

0.8186046511627907

In [28]:
abc = AdaBoostClassifier(n_estimators = 50)
abc.fit(X_train_tv, y_train)

AdaBoostClassifier()

In [29]:
abc.score(X_train_tv, y_train)

0.82233052724856

In [30]:
abc.score(X_test_tv, y_test)

0.815282392026578

In [31]:
abc = AdaBoostClassifier(n_estimators = 100)
abc.fit(X_train_tv, y_train)

AdaBoostClassifier(n_estimators=100)

In [32]:
abc.score(X_train_tv, y_train)

0.8447053610988037

In [33]:
abc.score(X_test_tv, y_test)

0.8225913621262458

In [34]:
abc = AdaBoostClassifier(n_estimators = 150)
abc.fit(X_train_tv, y_train)

AdaBoostClassifier(n_estimators=150)

In [35]:
abc.score(X_train_tv, y_train)

0.8581081081081081

In [36]:
abc.score(X_test_tv, y_test)

0.8212624584717608
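
The twelve AdaBoost scores above are easier to compare side by side. The sketch below simply copies them into a dictionary (the layout and the short labels `count` and `tfidf` are illustrative) and prints how the scores change between 50 and 150 estimators.

```python
# (vectorizer, n_estimators) -> (training score, testing score), copied from the cells above.
adaboost_scores = {
    ('count', 50):  (0.819672131147541, 0.8136212624584718),
    ('count', 100): (0.8389455028799291, 0.8129568106312293),
    ('count', 150): (0.8535666814355339, 0.8186046511627907),
    ('tfidf', 50):  (0.82233052724856, 0.815282392026578),
    ('tfidf', 100): (0.8447053610988037, 0.8225913621262458),
    ('tfidf', 150): (0.8581081081081081, 0.8212624584717608),
}

print(f"{'vectorizer':<10} {'n_est':>5} {'train':>7} {'test':>7}")
for (vec, n), (train, test) in adaboost_scores.items():
    print(f'{vec:<10} {n:>5} {train:>7.4f} {test:>7.4f}')

# Change in each score when going from 50 to 150 estimators.
deltas = {}
for vec in ('count', 'tfidf'):
    train_50, test_50 = adaboost_scores[(vec, 50)]
    train_150, test_150 = adaboost_scores[(vec, 150)]
    deltas[vec] = (train_150 - train_50, test_150 - test_50)
    print(f'{vec}: train {deltas[vec][0]:+.4f}, test {deltas[vec][1]:+.4f}')
```

For both vectorizers, tripling the number of estimators raises training accuracy by about 0.03 while testing accuracy moves by less than 0.01.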

Random forests are fit next, and again come in a fairly close second to logistic regression when measured by cross-validation score:

In [37]:
rf = RandomForestClassifier()

rf_params = {
    'n_estimators' : [100, 150],
    'max_depth'    : [None, 4]
}

gs = GridSearchCV(rf, param_grid=rf_params, cv= 5)

In [38]:
gs.fit(X_train_cv, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [None, 4], 'n_estimators': [100, 150]})

In [39]:
gs.best_params_

{'max_depth': None, 'n_estimators': 100}

In [40]:
gs.best_score_

0.8272057745342549

In [41]:
gs.score(X_train_cv, y_train)

0.9993354009747453

In [42]:
gs.score(X_test_cv, y_test)

0.8308970099667774

In [43]:
gs.fit(X_train_tv, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [None, 4], 'n_estimators': [100, 150]})

In [44]:
gs.best_params_

{'max_depth': None, 'n_estimators': 100}

In [45]:
gs.best_score_

0.8235506759554946

In [46]:
gs.score(X_train_tv, y_train)

0.9993354009747453

In [47]:
gs.score(X_test_tv, y_test)

0.8395348837209302
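
To check the ranking by cross-validation score, the sketch below collects the scores reported in this section and sorts them. AdaBoost is left out because no cross-validation score was computed for it, and the model labels are illustrative.

```python
# Cross-validation scores copied from the outputs in this section.
cv_scores = {
    'LogisticRegression + TfidfVectorizer': 0.8357334584932344,
    'LogisticRegression + CountVectorizer': 0.8241024225189657,
    'RandomForest + CountVectorizer': 0.8272057745342549,
    'RandomForest + TfidfVectorizer': 0.8235506759554946,
    'BaggingClassifier + CountVectorizer': 0.8078213281060668,
    'BaggingClassifier + TfidfVectorizer': 0.8070444777795162,
}

# Highest cross-validation score first.
ranked = sorted(cv_scores.items(), key=lambda item: item[1], reverse=True)

for rank, (name, score) in enumerate(ranked, start=1):
    print(f'{rank}. {name}: {score:.4f}')
```

Logistic regression on TF-IDF text ranks first, followed by the random forest on count-vectorized text, with a difference of less than 0.01 between them.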