## Modeling Stemmed Text (Words)

This notebook follows the same process as the previous three; details are in `01-wds-dgts-lem.ipynb`. It ends with a random forest model (fit to count-vectorized text) that became one of two models fine-tuned for the final comparison, where it lost only to a slightly more accurate logistic regression model.

Imports, data read-in, and a `train_test_split`:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

In [2]:
df = pd.read_csv('../datasets/main_final.csv')[['simp_stem', 'subreddit']].reset_index(drop=True)

In [3]:
X = df['simp_stem']
y = df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
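
As a small illustration of what `stratify` does in the split above, here is a sketch on hypothetical toy data (the `toy` frame and its 80/20 class balance are made up for the example). The `random_state` argument is an addition for reproducibility; the original cell does not set one, so its exact scores will vary from run to run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 posts from one subreddit, 20 from another.
toy = pd.DataFrame({
    'text': [f'post {i}' for i in range(100)],
    'subreddit': ['a'] * 80 + ['b'] * 20,
})

# random_state is an assumption added here; the notebook's split is unseeded.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy['text'], toy['subreddit'], stratify=toy['subreddit'], random_state=42
)

# With stratify, both splits keep the original 80/20 class ratio.
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```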

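The text is vectorized two ways: by raw token counts (`CountVectorizer`) and by term frequency-inverse document frequency, or TF-IDF (`TfidfVectorizer`). Both vectorizers are fit on the training set only: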

In [4]:
cvec = CountVectorizer()
cvec.fit(X_train)
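# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, use it in place of get_feature_names_out()
X_train_cv = pd.DataFrame(cvec.transform(X_train).toarray(), columns=cvec.get_feature_names_out())
X_test_cv = pd.DataFrame(cvec.transform(X_test).toarray(), columns=cvec.get_feature_names_out())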

In [5]:
tvec = TfidfVectorizer()
tvec.fit(X_train)
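# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, use it in place of get_feature_names_out()
X_train_tv = pd.DataFrame(tvec.transform(X_train).toarray(), columns=tvec.get_feature_names_out())
X_test_tv = pd.DataFrame(tvec.transform(X_test).toarray(), columns=tvec.get_feature_names_out())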
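
To make the difference between the two vectorizations concrete, here is a minimal sketch on a three-document toy corpus (the `docs` list is hypothetical, not drawn from the dataset). Counts are raw integers, while TF-IDF down-weights terms that appear in many documents and, by default, scales each row to unit length.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical stemmed posts, for illustration only.
docs = ['cat sat', 'cat ran', 'dog ran fast']

cvec_toy = CountVectorizer()
counts = cvec_toy.fit_transform(docs).toarray()

tvec_toy = TfidfVectorizer()
weights = tvec_toy.fit_transform(docs).toarray()

# Columns follow the alphabetically sorted vocabulary.
vocab = sorted(cvec_toy.vocabulary_)
print(vocab)
print(counts)            # raw token counts per document
print(weights.round(2))  # TF-IDF weights; rarer terms score higher within a row
```

In the first document, `sat` appears in only one document while `cat` appears in two, so `sat` receives the larger TF-IDF weight even though both have a count of 1.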

First, logistic regression. Fit to the TF-IDF features, it edges out the best random forest model even in this notebook: test accuracy of about 0.840 against about 0.838.

In [6]:
logreg = LogisticRegression()

pipe_params = {
    'penalty' : ['l1', 'l2'],
    'C' : np.linspace(0.1, 1, 10),
    'solver' : ['liblinear']
}

In [7]:
gs = GridSearchCV(logreg, param_grid = pipe_params, cv = 5)

In [8]:
gs.fit(X_train_cv, y_train)

GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
                         'penalty': ['l1', 'l2'], 'solver': ['liblinear']})

In [9]:
gs.best_params_

{'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}

In [10]:
print('Best cross-validation score: ', gs.best_score_, '\n',
      'Best training score: ', gs.score(X_train_cv, y_train), '\n',
      'Best testing score: ', gs.score(X_test_cv, y_test))

Best cross-validation score:  0.8218872763303608 
 Best training score:  0.9352015950376606 
 Best testing score:  0.8255813953488372


In [11]:
gs.fit(X_train_tv, y_train)

GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
                         'penalty': ['l1', 'l2'], 'solver': ['liblinear']})

In [12]:
gs.best_params_

{'C': 0.9, 'penalty': 'l2', 'solver': 'liblinear'}

In [13]:
print('Best cross-validation score: ', gs.best_score_, '\n',
      'Best training score: ', gs.score(X_train_tv, y_train), '\n',
      'Best testing score: ', gs.score(X_test_tv, y_test))

Best cross-validation score:  0.8322982486816797 
 Best training score:  0.91116526362428 
 Best testing score:  0.8401993355481727
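
One caveat about the grid searches above: the vectorizers were fit on all of `X_train` before cross-validation, so each validation fold contributes to the vocabulary and IDF weights the model is trained with. A possible variant, not what this notebook ran, is to put the vectorizer inside a `Pipeline` (already imported above) so it is refit on each training fold. The sketch below assumes a tiny hypothetical corpus in place of `X_train` and `y_train`; the step names `tvec` and `logreg` are arbitrary labels.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for X_train / y_train (raw text, not pre-vectorized).
texts = (['cat purr meow'] * 10 + ['cat nap sun'] * 10
         + ['dog bark fetch'] * 10 + ['dog walk park'] * 10)
labels = ['cats'] * 20 + ['dogs'] * 20

pipe = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('logreg', LogisticRegression(solver='liblinear')),
])

# Same grid as above, with parameters addressed through the step name.
pipe_grid = {
    'logreg__penalty': ['l1', 'l2'],
    'logreg__C': np.linspace(0.1, 1, 10),
}

# The vectorizer is now refit inside every cross-validation fold.
gs_pipe = GridSearchCV(pipe, param_grid=pipe_grid, cv=5)
gs_pipe.fit(texts, labels)
print(gs_pipe.best_params_, gs_pipe.best_score_)
```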


Next, bagging classifiers (11 estimators each). The results are unremarkable: test accuracy is about 0.81 on either vectorization, below logistic regression, though again far above the baseline.

In [14]:
bc = BaggingClassifier(n_estimators = 11)

In [15]:
bc.fit(X_train_cv, y_train)

BaggingClassifier(n_estimators=11)

In [16]:
print('Training score: ', bc.score(X_train_cv, y_train))
print('Cross-validation score: ', cross_val_score(bc, X_train_cv, y_train, cv= 5).mean())
print('Testing score: ', bc.score(X_test_cv, y_test))

Training score:  0.9881479840496233
Cross-validation score:  0.8103686388554003
Testing score:  0.8086378737541529


In [17]:
bc.fit(X_train_tv, y_train)

BaggingClassifier(n_estimators=11)

In [18]:
print('Training score: ', bc.score(X_train_tv, y_train))
print('Cross-validation score: ', cross_val_score(bc, X_train_tv, y_train, cv= 5).mean())
print('Testing score: ', bc.score(X_test_tv, y_test))

Training score:  0.9914709791758972
Cross-validation score:  0.807045336720013
Testing score:  0.8046511627906977


AdaBoost also beats the baseline but falls short of the best logistic regression and random forest models; its best test accuracy here is about 0.826, with 100 estimators on count-vectorized text.

In [19]:
abc = AdaBoostClassifier(n_estimators = 50)
abc.fit(X_train_cv, y_train)

AdaBoostClassifier()

In [20]:
abc.score(X_train_cv, y_train)

0.8190075321222863

In [21]:
abc.score(X_test_cv, y_test)

0.8089700996677741

In [22]:
abc = AdaBoostClassifier(n_estimators = 100)
abc.fit(X_train_cv, y_train)

AdaBoostClassifier(n_estimators=100)

In [23]:
abc.score(X_train_cv, y_train)

0.8407177669472752

In [24]:
abc.score(X_test_cv, y_test)

0.8262458471760797

In [25]:
abc = AdaBoostClassifier(n_estimators = 150)
abc.fit(X_train_cv, y_train)

AdaBoostClassifier(n_estimators=150)

In [26]:
abc.score(X_train_cv, y_train)

0.8517944173681878

In [27]:
abc.score(X_test_cv, y_test)

0.8186046511627907

In [28]:
abc = AdaBoostClassifier(n_estimators = 50)
abc.fit(X_train_tv, y_train)

AdaBoostClassifier()

In [29]:
abc.score(X_train_tv, y_train)

0.8259858218874613

In [30]:
abc.score(X_test_tv, y_test)

0.8129568106312293

In [31]:
abc = AdaBoostClassifier(n_estimators = 100)
abc.fit(X_train_tv, y_train)

AdaBoostClassifier(n_estimators=100)

In [32]:
abc.score(X_train_tv, y_train)

0.843265396544085

In [33]:
abc.score(X_test_tv, y_test)

0.8122923588039868

In [34]:
abc = AdaBoostClassifier(n_estimators = 150)
abc.fit(X_train_tv, y_train)

AdaBoostClassifier(n_estimators=150)

In [35]:
abc.score(X_train_tv, y_train)

0.8617634027470094

In [36]:
abc.score(X_test_tv, y_test)

0.8146179401993355
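
The six AdaBoost fits above repeat the same three steps for each estimator count and feature set. A compact way to express that, sketched here on hypothetical synthetic data from `make_classification` rather than the notebook's text features, is a nested loop that collects the scores. The `feature_sets` dictionary and `results` list are names introduced for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Hypothetical numeric stand-in for the vectorized text features.
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=0
)

# In the notebook this would hold the count-vectorized and TF-IDF pairs.
feature_sets = {'toy': (X_tr, X_te)}

results = []
for name, (train, test) in feature_sets.items():
    for n in [50, 100, 150]:
        model = AdaBoostClassifier(n_estimators=n).fit(train, y_tr)
        results.append(
            (name, n, model.score(train, y_tr), model.score(test, y_te))
        )

# One row per (feature set, n_estimators): train accuracy, test accuracy.
for row in results:
    print(row)
```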

Random forests:

In [37]:
rf = RandomForestClassifier()

rf_params = {
    'n_estimators' : [100, 150],
    'max_depth'    : [None, 4]
}

gs = GridSearchCV(rf, param_grid=rf_params, cv= 5)

The fit in the cell directly below is the strongest random forest model developed during this project:

In [38]:
gs.fit(X_train_cv, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [None, 4], 'n_estimators': [100, 150]})

In [39]:
gs.best_params_

{'max_depth': None, 'n_estimators': 150}

In [40]:
gs.best_score_

0.8314153192037621

In [41]:
gs.score(X_train_cv, y_train)

0.9990031014621179

In [42]:
gs.score(X_test_cv, y_test)

0.8382059800664452

In [43]:
gs.fit(X_train_tv, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [None, 4], 'n_estimators': [100, 150]})

In [44]:
gs.best_params_

{'max_depth': None, 'n_estimators': 150}

In [45]:
gs.best_score_

0.8295306196948922

In [46]:
gs.score(X_train_tv, y_train)

0.9990031014621179

In [47]:
gs.score(X_test_tv, y_test)

0.8325581395348837
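
For a side-by-side view, the test accuracies reported in the outputs above can be collected into one table. This is only a convenience sketch: the numbers are copied from this notebook's cell outputs (taking the best AdaBoost run per vectorization), and since they come from a single unseeded split, small gaps between models should be read with caution.

```python
import pandas as pd

# Test accuracies copied from the cell outputs in this notebook.
# AdaBoost rows use the best run per vectorization
# (100 estimators for counts, 150 for TF-IDF).
scores = pd.DataFrame(
    [
        ('logistic regression', 'count', 0.8255813953488372),
        ('logistic regression', 'tf-idf', 0.8401993355481727),
        ('bagging', 'count', 0.8086378737541529),
        ('bagging', 'tf-idf', 0.8046511627906977),
        ('adaboost', 'count', 0.8262458471760797),
        ('adaboost', 'tf-idf', 0.8146179401993355),
        ('random forest', 'count', 0.8382059800664452),
        ('random forest', 'tf-idf', 0.8325581395348837),
    ],
    columns=['model', 'features', 'test_accuracy'],
)

# Rank models from most to least accurate on the test set.
summary = scores.sort_values('test_accuracy', ascending=False).reset_index(drop=True)
print(summary.round(4))
```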