# NLP Project: Subreddit Binary Text Classification 
- In this binary classification problem, I scrape data from two subreddits (*r/userexperience* and *r/UXResearch*) using [Pushshift’s API](https://github.com/pushshift/api), then use Natural Language Processing (NLP) to train a classifier that predicts which subreddit a given post came from.
- This is the second of two notebooks for this project. In this notebook, I assess various classification models and their ability to correctly classify which subreddit a post came from. Models are evaluated using accuracy.  
---

# Contents
- [Train/Test Split & Baseline Accuracy](#Train/Test-Split-&-Baseline-Accuracy)
- [Model 1: Logistic Regression, CountVectorizer()](#Model-1:-Logistic-Regression,-CountVectorizer())
- [Model 2: Multinomial Naive Bayes, CountVectorizer()](#Model-2:-Multinomial-Naive-Bayes,-CountVectorizer())
- [Model 3: Logistic Regression, TfidfVectorizer()](#Model-3:-Logistic-Regression,-TfidfVectorizer())
- [Model 4: Logistic Regression, TfidfVectorizer() (2)](#Model-4:-Logistic-Regression,-TfidfVectorizer())
- [Model 5: Multinomial NB, TfidfVectorizer()](#Model-5:-Mulitnomial-NB,-TfidfVectorizer())
- [Model 6: Random Forest, CountVectorizer()](#Model-6:-Random-Forest,-CountVectorizer())
- [Model 7: Random Forest, CountVectorizer() (2)](#Model-7:-Random-Forest,-CountVectorizer())
- [Model 8: AdaBoost, CountVectorizer()](#Model-8:-AdaBoost,-CountVectorizer())
- [Model 9: Voting Classifier, CountVectorizer()](#Model-9:-Voting-Classifier,-CountVectorizer())
- [Conclusions](#Conclusions)
---
### Note: 
*The print() statement for each model includes:*
- *best parameters*
- *train accuracy score*
- *test accuracy score*


# Import Libraries and Load Data

In [37]:
# the magic trio
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# processing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text 
import time

# models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, accuracy_score

In [2]:
# positive (1) = userexperience
# negative (0) = UXResearch
df = pd.read_csv('./datasets/merged_processed.csv')
df.shape

(13611, 5)

# Train/Test Split & Baseline Accuracy

In [None]:
# set up X matrix and y vector
X = df['text']
y = df['subreddit']

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify=y,
                                                    random_state=42)

In [4]:
# baseline accuracy
df['subreddit'].value_counts(normalize=True)

1    0.531408
0    0.468592
Name: subreddit, dtype: float64
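
The baseline here is the accuracy of always predicting the majority class, which is why the larger of the two proportions above (about 53.1%) serves as the number to beat. A minimal sketch of that idea in plain Python, using hypothetical toy labels rather than the project data (`majority_baseline` is an illustrative helper, not part of the notebook):

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_label, accuracy of always predicting that label)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# hypothetical toy labels, not the project data
toy_labels = [1, 1, 1, 0, 0, 1, 0, 1]
baseline_label, baseline_acc = majority_baseline(toy_labels)
print(baseline_label, baseline_acc)
```

Any model worth keeping should score above this value on the test set.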

In [5]:
# create custom stopwords
additional_stopwords = ['user', 'experience', 'ux']
# convert to a list: newer scikit-learn versions reject a frozenset for `stop_words`
my_swords = list(text.ENGLISH_STOP_WORDS.union(additional_stopwords))

# https://stackoverflow.com/questions/26826002/adding-words-to-stop-words-list-in-tfidfvectorizer-in-sklearn
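
To check that an extended stop-word list behaves as intended, it can help to fit a vectorizer on a couple of sentences and inspect the resulting vocabulary. The sketch below assumes two made-up documents (`toy_docs`) and passes the stop words as a plain list; it is only an illustration, not a cell from the original notebook.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# same extra words as above, combined with sklearn's built-in English list
extra_words = ['user', 'experience', 'ux']
stop_list = list(text.ENGLISH_STOP_WORDS.union(extra_words))

# hypothetical documents, for illustration only
toy_docs = [
    'the user experience of this app is confusing',
    'ux research interviews need a discussion guide',
]

vec = CountVectorizer(stop_words=stop_list)
vec.fit(toy_docs)

# vocabulary_ maps each kept token to its column index
vocab = set(vec.vocabulary_)
print(sorted(vocab))
```

The three custom words should be absent from the printed vocabulary, while content words such as `research` remain.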

In [8]:
# create pipeline
cpipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(max_iter=1_000, random_state=42))
])

## Model 1: Logistic Regression, CountVectorizer()

In [7]:
# cvec, logreg
cpipe_params = {
    'cvec__stop_words': ['english', my_swords],
    'cvec__ngram_range': [(1,1), (1,2), (1,3)],
}

gs = GridSearchCV(cpipe,
                  cpipe_params,
                  cv=5)

gs.fit(X_train, y_train)
gs_model = gs.best_estimator_

print('best parameters:', gs.best_params_)
print('')
print('train accuracy:', gs_model.score(X_train, y_train))
print('test accuracy:', gs_model.score(X_test, y_test))

best parameters: {'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english'}

train accuracy: 0.9850117554858934
test accuracy: 0.742873934763444


## Model 2: Multinomial Naive Bayes, CountVectorizer()

In [9]:
# cvec, multinomialNB
cpipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('mnb', MultinomialNB())
])

cpipe_params = {
    'cvec__stop_words': ['english'],
    'cvec__ngram_range': [(1,3)],
}

gs = GridSearchCV(cpipe,
                  cpipe_params,
                  cv=5)

gs.fit(X_train, y_train)
gs_model = gs.best_estimator_

print('best parameters:', gs.best_params_)
print('')
print('train accuracy:', gs_model.score(X_train, y_train))
print('test accuracy:', gs_model.score(X_test, y_test))

best parameters: {'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english'}

train accuracy: 0.9695336990595611
test accuracy: 0.7587422862180428


## Model 3: Logistic Regression, TfidfVectorizer()

In [10]:
pipe = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('lr', LogisticRegression(max_iter=1_000, random_state=42))
])

In [11]:
pipe_params = {
    'tvec__stop_words': [None, 'english'],
    'tvec__ngram_range': [(1, 2)],
}

grid = GridSearchCV(pipe,
                    pipe_params,
                    cv=5)

grid.fit(X_train, y_train)
grid_model = grid.best_estimator_

print(grid.best_params_)
print('')
print('train accuracy:', grid_model.score(X_train, y_train))
print('test accuracy:', grid_model.score(X_test, y_test))

{'tvec__ngram_range': (1, 2), 'tvec__stop_words': 'english'}

train accuracy: 0.9234913793103449
test accuracy: 0.7558037026153394


## Model 4: Logistic Regression, TfidfVectorizer() 
- w/ L1 penalty

In [12]:
pipe_params = {
    'tvec__stop_words': ['english'],
    'tvec__ngram_range': [(1, 2)],
    'lr__solver': ['liblinear'],
    'lr__penalty': ['l1']

}

grid = GridSearchCV(pipe,
                    pipe_params,
                    cv=5)

grid.fit(X_train, y_train)
grid_model = grid.best_estimator_

print(grid.best_params_)
print('')
print('train accuracy:', grid_model.score(X_train, y_train))
print('test accuracy:', grid_model.score(X_test, y_test))

{'lr__penalty': 'l1', 'lr__solver': 'liblinear', 'tvec__ngram_range': (1, 2), 'tvec__stop_words': 'english'}

train accuracy: 0.7400078369905956
test accuracy: 0.7220099911842492
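
The much smaller train/test gap is consistent with how an `l1` penalty works: it can push coefficients to exactly zero, which effectively removes features from the model. A rough sketch of that effect, assuming synthetic data from `make_classification` and an arbitrary `C=0.1` (neither is taken from the project):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic data (assumption for illustration): 200 features, only 5 informative
X_toy, y_toy = make_classification(n_samples=300, n_features=200,
                                   n_informative=5, n_redundant=0,
                                   random_state=42)

zero_counts = {}
for penalty in ['l1', 'l2']:
    # liblinear supports both penalties, so only the penalty differs between fits
    lr = LogisticRegression(penalty=penalty, solver='liblinear',
                            C=0.1, random_state=42)
    lr.fit(X_toy, y_toy)
    # count coefficients that are exactly zero
    zero_counts[penalty] = int(np.sum(lr.coef_ == 0))

print(zero_counts)
```

On data like this, the `l1` fit typically zeroes out most coefficients while the `l2` fit only shrinks them.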


## Model 5: Mulitnomial NB, TfidfVectorizer()

In [18]:
pipe = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('mnb', MultinomialNB())
])

pipe_params = {
    'tvec__stop_words': ['english'],
    'tvec__ngram_range': [(1, 2)],
}

grid = GridSearchCV(pipe,
                    pipe_params,
                    cv=5)

grid.fit(X_train, y_train)
grid_model = grid.best_estimator_

print(grid.best_params_)
print('')
print('train accuracy:', grid_model.score(X_train, y_train))
print('test accuracy:', grid_model.score(X_test, y_test))

{'tvec__ngram_range': (1, 2), 'tvec__stop_words': 'english'}

train accuracy: 0.9625783699059561
test accuracy: 0.7516896855715545


## Model 6: Random Forest, CountVectorizer()

In [32]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('rf', RandomForestClassifier(random_state=42))
])

pipe_params = {
    'cvec__stop_words': ['english'],
    'cvec__ngram_range': [(1,2)],
#     'rf__max_depth': [(None, 5, 10)],
}

grid = GridSearchCV(pipe,
                    pipe_params,
                    cv=3)

grid.fit(X_train, y_train)
grid_model = grid.best_estimator_

print(grid.best_params_)
print('')
print('train accuracy:', grid_model.score(X_train, y_train))
print('test accuracy:', grid_model.score(X_test, y_test))

{'cvec__ngram_range': (1, 2), 'cvec__stop_words': 'english'}

train accuracy: 0.9890282131661442
test accuracy: 0.7346459006758742


In [33]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('rf', RandomForestClassifier(random_state=42))
])

pipe_params = {
    'cvec__stop_words': ['english'],
    'cvec__ngram_range': [(1,2)],
    'rf__max_depth': [None, 5, 10],
}

grid = GridSearchCV(pipe,
                    pipe_params,
                    cv=3)

grid.fit(X_train, y_train)
grid_model = grid.best_estimator_

print(grid.best_params_)
print('')
print('train accuracy:', grid_model.score(X_train, y_train))
print('test accuracy:', grid_model.score(X_test, y_test))

{'cvec__ngram_range': (1, 2), 'cvec__stop_words': 'english', 'rf__max_depth': None}

train accuracy: 0.9890282131661442
test accuracy: 0.7346459006758742


## Model 7: Random Forest, CountVectorizer() 
### tweaking `max_features` hyperparameter

In [34]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('rf', RandomForestClassifier(random_state=42))
])

pipe_params = {
    'cvec__stop_words': ['english'],
    'cvec__ngram_range': [(1,3)],
    'rf__max_depth': [None],
    'rf__max_features': ['log2', 'sqrt'],
}

grid = GridSearchCV(pipe,
                    pipe_params,
                    cv=3)

grid.fit(X_train, y_train)
grid_model = grid.best_estimator_

print(grid.best_params_)
print('')
print('train accuracy:', grid_model.score(X_train, y_train))
print('test accuracy:', grid_model.score(X_test, y_test))

{'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english', 'rf__max_depth': None, 'rf__max_features': 'sqrt'}

train accuracy: 0.9890282131661442
test accuracy: 0.7290625918307376
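
For context, `max_features` controls how many candidate features each split in a tree considers. The sketch below mirrors, as far as I understand it, the rule scikit-learn applies for the two string options; `features_per_split` and the vocabulary size are hypothetical and only meant to show the scale of the difference.

```python
import math

def features_per_split(n_features, max_features):
    """Candidate features per split for the 'sqrt' and 'log2' options."""
    if max_features == 'sqrt':
        return max(1, int(math.sqrt(n_features)))
    if max_features == 'log2':
        return max(1, int(math.log2(n_features)))
    raise ValueError(f'unsupported option: {max_features!r}')

# hypothetical vocabulary size for an n-gram CountVectorizer
n_vocab = 100_000
sqrt_k = features_per_split(n_vocab, 'sqrt')
log2_k = features_per_split(n_vocab, 'log2')
print(sqrt_k, log2_k)
```

With a large n-gram vocabulary, `'log2'` leaves each split very few features to choose from. Since `'sqrt'` is the classifier's default, that may also explain why selecting it reproduces the Model 6 scores.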


In [36]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('rf', RandomForestClassifier(random_state=42))
])

pipe_params = {
    'cvec__stop_words': ['english'],
    'cvec__ngram_range': [(1,2)],
    'rf__max_depth': [None],
    'rf__max_features': ['sqrt'],
}

grid = GridSearchCV(pipe,
                    pipe_params,
                    cv=3)

grid.fit(X_train, y_train)
grid_model = grid.best_estimator_

print(grid.best_params_)
print('')
print('train accuracy:', grid_model.score(X_train, y_train))
print('test accuracy:', grid_model.score(X_test, y_test))

{'cvec__ngram_range': (1, 2), 'cvec__stop_words': 'english', 'rf__max_depth': None, 'rf__max_features': 'sqrt'}

train accuracy: 0.9890282131661442
test accuracy: 0.7346459006758742


In [48]:
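# note: per the execution counts, this cell ran after the Model 8 cell (In [45]),
# so X_train here is already the count-vectorized matrix, not raw text
gboost = GradientBoostingClassifier()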
gboost_params = {
    'max_depth': [3,6],
    'n_estimators': [100, 150],
}
gb_gs = GridSearchCV(gboost, param_grid=gboost_params, cv=3)
gb_gs.fit(X_train, y_train)
print(gb_gs.best_score_)
gb_gs.best_params_

0.7131661674875179


{'max_depth': 6, 'n_estimators': 150}

## Model 8: AdaBoost, CountVectorizer() 

In [45]:
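# vectorize once up front; this overwrites X_train/X_test with sparse count matrices
cvec = CountVectorizer()
X_train = cvec.fit_transform(X_train)
X_test = cvec.transform(X_test)

# `base_estimator` was renamed to `estimator` in scikit-learn 1.2 and removed in 1.4
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())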
ada_params = {
    'n_estimators': [50,100],
    'base_estimator__max_depth': [1,2],
    'learning_rate': [.9, .1]
}
gs = GridSearchCV(ada, param_grid=ada_params, cv=3)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_

0.6954359068887644


{'base_estimator__max_depth': 1, 'learning_rate': 0.9, 'n_estimators': 100}
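
An alternative to vectorizing up front is to keep the vectorizer inside a `Pipeline`, as in the earlier models, so each cross-validation fold learns its vocabulary from its own training portion and `X_train` is never overwritten. The sketch below is one possible arrangement, not the approach used above; the toy corpus, labels, and grid values are assumptions for illustration.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# hypothetical toy corpus and labels, for illustration only
toy_text = [
    'wireframe feedback for my portfolio', 'portfolio review for a design job',
    'design system and wireframe tips', 'job advice for a junior designer',
    'interview guide for usability study', 'recruiting participants for a survey',
    'survey analysis and interview synthesis', 'usability study with participants',
]
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]

# the vectorizer is refit on the training part of every fold
ada_pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('ada', AdaBoostClassifier(random_state=42)),  # default weak learner: depth-1 tree
])

ada_pipe_params = {
    'ada__n_estimators': [50, 100],
    'ada__learning_rate': [0.9, 0.1],
}

ada_gs = GridSearchCV(ada_pipe, param_grid=ada_pipe_params, cv=2)
ada_gs.fit(toy_text, toy_y)
print(ada_gs.best_params_)
```

Scores on a corpus this small are meaningless; the point is only the structure of the pipeline and the `step__parameter` naming in the grid.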

## Model 9: Voting Classifier, CountVectorizer() 

In [50]:
vote = VotingClassifier([
    ('tree', DecisionTreeClassifier()),
    ('ada', AdaBoostClassifier()),
    ('gb', GradientBoostingClassifier())
])
vote_params = {
    'ada__n_estimators': [100, 125],
    'gb__n_estimators': [100,125],
    'tree__max_depth': [5, 10]
}
gs = GridSearchCV(vote, param_grid=vote_params, cv=2)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_

0.6960227272727273


{'ada__n_estimators': 125, 'gb__n_estimators': 125, 'tree__max_depth': 5}

# Conclusions
- *Model #4: Logistic Regression, TfidfVectorizer* performed best
  - Train accuracy score: 74%
  - Test accuracy score: 72%
  - Parameters:
    - `stopwords: 'english'` (native to sklearn)
    - `n-gram range: (1,2)`
    - `solver: liblinear`
    - `penalty: l1` (lasso regression)
  - Analysis:
    - Though 72% was not the highest test accuracy, it was by far the closest to its train accuracy (a gap of about 2 percentage points). For the other models with reported train and test scores, test accuracy trailed train accuracy by roughly 17 to 26 percentage points.
    - Therefore, I selected this model as the best, because its performance on unseen data should be closest to what its training score suggests.
    - The most notable hyperparameter adjustment was changing the penalty from the default `l2` (ridge regression) to `l1` (lasso regression). All else held equal (apart from switching to the `liblinear` solver, which supports `l1`), the train/test accuracy was 92.3%/75.6% with the default `l2` penalty (Model 3). Allowing the coefficients to zero out using `l1` makes the train/test accuracy scores far more consistent with one another (74.0%/72.2%).
- Though I would have liked to see a higher accuracy score from my best model, 72% accuracy isn't bad considering how closely related the two subreddits are. Compared to the baseline accuracy (53%, the share of the positive class), my best model performed 19 percentage points higher.
- The largest challenge with this project was computing constraints. Given the complexity of some of the models and their hyperparameters, my computer wasn't able to process everything. For example, when I tried to compare the `elasticnet` and `l1` penalties for Logistic Regression, my computer ran for 20+ minutes before the Jupyter kernel froze. Given sufficient time and computing resources, I believe I could continue tuning the models to get an accuracy score well above 72%.
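
The comparison behind the model choice can be made explicit by computing each model's train/test gap from the scores printed above. The sketch below hard-codes those scores rounded to four decimals; the labels and dictionary layout are my own, and Models 8 and 9 are left out because only their cross-validation scores were printed.

```python
# (train accuracy, test accuracy) as printed in the cells above, rounded
scores = {
    'Model 1: LogReg, CountVectorizer': (0.9850, 0.7429),
    'Model 2: MultinomialNB, CountVectorizer': (0.9695, 0.7587),
    'Model 3: LogReg (l2), TfidfVectorizer': (0.9235, 0.7558),
    'Model 4: LogReg (l1), TfidfVectorizer': (0.7400, 0.7220),
    'Model 5: MultinomialNB, TfidfVectorizer': (0.9626, 0.7517),
    'Model 6: Random Forest, CountVectorizer': (0.9890, 0.7346),
}
baseline = 0.5314  # share of the majority class

# train/test gap in percentage points
gaps = {name: round((train - test) * 100, 1)
        for name, (train, test) in scores.items()}

smallest_gap = min(gaps, key=gaps.get)
best_test = max(scores, key=lambda name: scores[name][1])

for name, (train, test) in scores.items():
    lift = round((test - baseline) * 100, 1)
    print(f'{name}: gap {gaps[name]} pts, {lift} pts above baseline')
print('smallest gap:', smallest_gap)
print('highest test accuracy:', best_test)
```

Laid out this way, the trade-off is visible: Model 4 has the smallest gap, while Model 2 has the highest raw test accuracy.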