# Project 3: Reddit API Classification & Natural Language Processing

## Tom Ludlow, DSI-NY-6

Using NLP to identify posts from **r/audioengineering** and **r/livesound**

In [1]:
# library imports
import requests
import time
import pandas as pd
import numpy as np
import ast
import re
from tqdm import tqdm

# preprocessing imports
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# modeling imports
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import confusion_matrix, accuracy_score


In [2]:
# random state var
r = 1220

# Notebook 3: Model Selection

In [4]:
X_train = pd.read_csv('./csv/181220_X_train.csv', index_col=0)
X_test = pd.read_csv('./csv/181220_X_test.csv', index_col=0)
y_train = pd.read_csv('./csv/181220_y_train.csv', index_col=0)
y_test = pd.read_csv('./csv/181220_y_test.csv', index_col=0)

In [5]:
y_train = pd.DataFrame(y_train, columns=['is_ls'])
y_test = pd.DataFrame(y_test, columns=['is_ls'])

In [6]:
y_train.shape

(1427, 1)

In [7]:
X_train.shape

(1427, 2)

In [8]:
X_test.shape

(476, 2)

In [9]:
y_test.shape

(476, 1)

## GridSearchCV

`GridSearchCV` lets us search over multiple hyperparameter values for each of our models.  It fits a model for every combination of the hyperparameters we specify, scores each combination with cross-validation, and refits the highest-scoring combination on the full training set.
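
As a rough illustration of what the search does internally, the sketch below expands a parameter grid into its candidate combinations and counts the resulting fits. It uses only the standard library and does not call scikit-learn; `expand_grid` is a hypothetical helper, and the fit count assumes the default `refit=True` behavior.

```python
from itertools import product

# Hypothetical grid mirroring the parameter names used later in this notebook.
param_grid = {
    "cv__stop_words": ["english"],
    "cv__ngram_range": [(1, 1), (1, 2)],
}
n_folds = 3  # same number of folds as the cv=3 argument used in this notebook


def expand_grid(grid):
    """Return every combination of values in the grid as a list of dicts."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_grid)
cv_fits = len(candidates) * n_folds   # one fit per candidate per fold
total_fits = cv_fits + 1              # plus one refit of the best candidate

for params in candidates:
    print(params)
print("cross-validation fits:", cv_fits, "| total fits:", total_fits)
```

With two candidate n-gram ranges and three folds, each model is fit seven times in total, which is why even a small grid adds noticeably to run time.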

We will run a single model for each of the following 6 classifiers:
 - Multinomial Naive Bayes
 - K-Nearest Neighbors
 - Logistic Regression
 - Random Forest
 - AdaBoost (adaptive boost)
 - Gradient Boost
 
We will run two GridSearches to benchmark these models for two feature extraction techniques: `CountVectorizer` and `TfidfVectorizer`.  We can use the accuracy of the results to narrow our model selection to the most effective approaches.

As these models execute, the results will be displayed, then stored into a DataFrame for final comparison.
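
To make the difference between the two feature extraction techniques concrete, here is a minimal standard-library sketch on three made-up documents. It assumes the smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as `TfidfVectorizer` defaults; it is an approximation for illustration, not the library's implementation.

```python
import math
from collections import Counter

# Toy, made-up documents (not drawn from the subreddit data).
docs = [
    "mic preamp gain",
    "stage monitor mix",
    "mic stage feedback",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# CountVectorizer-style features: raw term counts per document.
counts = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(docs)
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}


def tfidf_row(count_row):
    """Weight counts by IDF, then scale the row to unit L2 length."""
    weighted = [c * idf[t] for c, t in zip(count_row, vocab)]
    norm = math.sqrt(sum(w * w for w in weighted))
    return [w / norm for w in weighted]


tfidf = [tfidf_row(row) for row in counts]

print(vocab)
print(counts[0])
print([round(w, 3) for w in tfidf[0]])
```

In the first document every term has a count of 1, but after TF-IDF weighting "mic" receives less weight than "gain" because "mic" also appears in another document.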

### CountVectorizer

In [10]:
steps_list_gr_cv = [ # list of pipeline steps for each model combo
    [('cv',CountVectorizer()),('multi_nb',MultinomialNB())],
    [('cv',CountVectorizer()),('scaler',StandardScaler(with_mean=False)),('knn',KNeighborsClassifier())], 
    [('cv',CountVectorizer()),('scaler',StandardScaler(with_mean=False)),('logreg',LogisticRegression())],
    [('cv',CountVectorizer()),('rf',RandomForestClassifier())],
    [('cv',CountVectorizer()),('ada',AdaBoostClassifier())],
    [('cv',CountVectorizer()),('gb',GradientBoostingClassifier())]
]
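
The KNN and logistic regression pipelines scale their features with `StandardScaler(with_mean=False)`. A likely reason is that vectorizer output is a sparse matrix, and subtracting a column mean would turn every zero into a non-zero value. The sketch below shows that effect on a single made-up column using only the standard library; it is an analogy for the scaler's behavior, not a call into scikit-learn.

```python
from statistics import pstdev

# Toy count column for one term across five documents (made-up values).
column = [0, 0, 3, 0, 1]

std = pstdev(column)               # population standard deviation
mean = sum(column) / len(column)

scaled_only = [x / std for x in column]         # analogous to with_mean=False
centered = [(x - mean) / std for x in column]   # analogous to with_mean=True

zeros_kept = sum(1 for x in scaled_only if x == 0)
zeros_after_centering = sum(1 for x in centered if x == 0)
print("zeros kept without centering:", zeros_kept)
print("zeros left after centering:", zeros_after_centering)
```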

In [11]:
steps_titles = ['multi_nb','knn','logreg','rf','ada','gb']

In [12]:
pipe_params_cv = [
    {"cv__stop_words":['english'], "cv__ngram_range":[(1,1),(1,2)]},
    {"cv__stop_words":['english'], "cv__ngram_range":[(1,1),(1,2)]},
    {"cv__stop_words":['english'], "cv__ngram_range":[(1,1),(1,2)]},
    {"cv__stop_words":['english'], "cv__ngram_range":[(1,1),(1,2)]},
    {"cv__stop_words":['english'], "cv__ngram_range":[(1,1),(1,2)]},
    {"cv__stop_words":['english'], "cv__ngram_range":[(1,1),(1,2)]}
]
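
The two `ngram_range` settings control whether the vectorizer builds features from single words only, `(1,1)`, or from single words plus adjacent word pairs, `(1,2)`. A minimal sketch of that idea, where `ngrams` is a hypothetical helper and the input text is made up:

```python
def ngrams(tokens, ngram_range):
    """Return all n-grams for n in the inclusive range (min_n, max_n)."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


tokens = "wireless mic feedback".split()  # made-up example text
unigrams = ngrams(tokens, (1, 1))
uni_and_bigrams = ngrams(tokens, (1, 2))
print(unigrams)
print(uni_and_bigrams)
```

Adding bigrams lets a model pick up phrases such as "wireless mic", at the cost of a much larger vocabulary.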


In [13]:
# instantiate results DataFrame

grid_results = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy','tn','fp','fn','tp'])
grid_results.head()

| | model | best_params | train_accuracy | test_accuracy | tn | fp | fn | tp |
|---|---|---|---|---|---|---|---|---|


In [14]:
X_train_pre_post = X_train['post_lm']
X_test_pre_post = X_test['post_lm']

In [15]:
for i in tqdm(range(len(steps_list_gr_cv))):           # loop over each model's pipeline steps, with a progress bar
    pipe = Pipeline(steps=steps_list_gr_cv[i])         # configure pipeline for each model
    grid = GridSearchCV(pipe, pipe_params_cv[i], cv=3) # set up a 3-fold grid search over this model's params

    model_results = {}

    # pass y as a 1-D Series; a single-column DataFrame triggers a DataConversionWarning
    grid.fit(X_train_pre_post, y_train['is_ls'])
    
    print('Model: ',steps_titles[i])
    model_results['model'] = steps_titles[i]

    print('Best Params: ', grid.best_params_)
    model_results['best_params'] = grid.best_params_

    print(grid.score(X_train_pre_post, y_train), '\n')
    model_results['train_accuracy'] = grid.score(X_train_pre_post, y_train)
    
    print(grid.score(X_test_pre_post, y_test), '\n')
    model_results['test_accuracy'] = grid.score(X_test_pre_post, y_test)

    # Display the confusion matrix results showing true/false positive/negative
    tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test_pre_post)).ravel() 
    print("True Negatives: %s" % tn)
    model_results['tn'] = tn

    print("False Positives: %s" % fp)  
    model_results['fp'] = fp

    print("False Negatives: %s" % fn)
    model_results['fn'] = fn

    print("True Positives: %s" % tp, '\n')
    model_results['tp'] = tp

    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    grid_results = pd.concat([grid_results, pd.DataFrame([model_results])], ignore_index=True)

Model:  multi_nb
Best Params:  {'cv__ngram_range': (1, 2), 'cv__stop_words': 'english'}
0.9957953749124037 

0.8361344537815126 

True Negatives: 195
False Positives: 36
False Negatives: 42
True Positives: 203 

Model:  knn
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
0.547302032235459 

0.5336134453781513 

True Negatives: 10
False Positives: 221
False Negatives: 1
True Positives: 244 

Model:  logreg
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
0.9992992291520673 

0.7668067226890757 

True Negatives: 174
False Positives: 57
False Negatives: 54
True Positives: 191 

Model:  rf
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
0.9936930623686054 

0.7415966386554622 

True Negatives: 180
False Positives: 51
False Negatives: 72
True Positives: 173 

Model:  ada
Best Params:  {'cv__ngram_range': (1, 2), 'cv__stop_words': 'english'}
0.8626489138051857 

0.7773109243697479 

True Negatives: 178
False Positives: 53
False Negatives: 53
True Positives: 192 

Model:  gb
Best Params:  {'cv__ngram_range': (1, 2), 'cv__stop_words': 'english'}
0.9138051857042747 

0.8067226890756303 

True Negatives: 167
False Positives: 64
False Negatives: 28
True Positives: 217 

100%|██████████| 6/6 [00:21<00:00,  5.00s/it]

In [16]:
grid_results_cv = grid_results

In [17]:
grid_results.sort_values('test_accuracy',ascending=False)

| | model | best_params | train_accuracy | test_accuracy | tn | fp | fn | tp |
|---|---|---|---|---|---|---|---|---|
| 0 | multi_nb | {'cv__ngram_range': (1, 2), 'cv__stop_words': ... | 0.995795 | 0.836134 | 195 | 36 | 42 | 203 |
| 5 | gb | {'cv__ngram_range': (1, 2), 'cv__stop_words': ... | 0.913805 | 0.806723 | 167 | 64 | 28 | 217 |
| 4 | ada | {'cv__ngram_range': (1, 2), 'cv__stop_words': ... | 0.862649 | 0.777311 | 178 | 53 | 53 | 192 |
| 2 | logreg | {'cv__ngram_range': (1, 1), 'cv__stop_words': ... | 0.999299 | 0.766807 | 174 | 57 | 54 | 191 |
| 3 | rf | {'cv__ngram_range': (1, 1), 'cv__stop_words': ... | 0.993693 | 0.741597 | 180 | 51 | 72 | 173 |
| 1 | knn | {'cv__ngram_range': (1, 1), 'cv__stop_words': ... | 0.547302 | 0.533613 | 10 | 221 | 1 | 244 |
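
Accuracy alone hides how each model is making its errors. The sketch below derives a few standard metrics from the confusion-matrix counts reported for the CountVectorizer runs; `summarize` is a hypothetical helper, and precision, recall, and specificity are derived here rather than reported by the notebook.

```python
def summarize(tn, fp, fn, tp):
    """Derive common classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),    # share of predicted positives that are correct
        "recall": tp / (tp + fn),       # share of actual positives that are found
        "specificity": tn / (tn + fp),  # share of actual negatives that are found
    }


# Counts taken from the CountVectorizer results for two of the models.
multi_nb = summarize(tn=195, fp=36, fn=42, tp=203)
knn = summarize(tn=10, fp=221, fn=1, tp=244)

for name, metrics in [("multi_nb", multi_nb), ("knn", knn)]:
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

The KNN counts show near-perfect recall with almost no specificity: the model labels nearly every post as positive, which is why its accuracy sits so close to the share of positive posts.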


### TF-IDF

In [18]:
steps_list_gr_tf = [ # list of pipeline steps for each model combo
    [('tf',TfidfVectorizer()),('multi_nb',MultinomialNB())],
    [('tf',TfidfVectorizer()),('scaler',StandardScaler(with_mean=False)),('knn',KNeighborsClassifier())], 
    [('tf',TfidfVectorizer()),('scaler',StandardScaler(with_mean=False)),('logreg',LogisticRegression())],
    [('tf',TfidfVectorizer()),('rf',RandomForestClassifier())],
    [('tf',TfidfVectorizer()),('ada',AdaBoostClassifier())],
    [('tf',TfidfVectorizer()),('gb',GradientBoostingClassifier())]
]

In [19]:
steps_titles = ['multi_nb','knn','logreg','rf','ada','gb']

In [20]:
pipe_params_tf = [
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]},
    {"tf__stop_words":['english'], "tf__ngram_range":[(1,1),(1,2)]}
]


In [21]:
# instantiate results DataFrame

grid_results = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy','tn','fp','fn','tp'])
grid_results.head()

| | model | best_params | train_accuracy | test_accuracy | tn | fp | fn | tp |
|---|---|---|---|---|---|---|---|---|


In [22]:
X_train_pre_post = X_train['post_lm']
X_test_pre_post = X_test['post_lm']

In [23]:
for i in tqdm(range(len(steps_list_gr_tf))):           # loop over each model's pipeline steps, with a progress bar
    pipe = Pipeline(steps=steps_list_gr_tf[i])         # configure pipeline for each model
    grid = GridSearchCV(pipe, pipe_params_tf[i], cv=3) # set up a 3-fold grid search over this model's params

    model_results = {}

    # pass y as a 1-D Series; a single-column DataFrame triggers a DataConversionWarning
    grid.fit(X_train_pre_post, y_train['is_ls'])
    
    print('Model: ',steps_titles[i])
    model_results['model'] = steps_titles[i]

    print('Best Params: ', grid.best_params_)
    model_results['best_params'] = grid.best_params_

    print(grid.score(X_train_pre_post, y_train), '\n')
    model_results['train_accuracy'] = grid.score(X_train_pre_post, y_train)
    
    print(grid.score(X_test_pre_post, y_test), '\n')
    model_results['test_accuracy'] = grid.score(X_test_pre_post, y_test)

    tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test_pre_post)).ravel()
    print("True Negatives: %s" % tn)
    model_results['tn'] = tn

    print("False Positives: %s" % fp)
    model_results['fp'] = fp

    print("False Negatives: %s" % fn)
    model_results['fn'] = fn

    print("True Positives: %s" % tp, '\n')
    model_results['tp'] = tp

    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    grid_results = pd.concat([grid_results, pd.DataFrame([model_results])], ignore_index=True)

Model:  multi_nb
Best Params:  {'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}
0.9572529782761037 

0.819327731092437 

True Negatives: 195
False Positives: 36
False Negatives: 50
True Positives: 195 

Model:  knn
Best Params:  {'tf__ngram_range': (1, 2), 'tf__stop_words': 'english'}
0.5409950946040645 

0.5294117647058824 

True Negatives: 7
False Positives: 224
False Negatives: 0
True Positives: 245 

Model:  logreg
Best Params:  {'tf__ngram_range': (1, 2), 'tf__stop_words': 'english'}
0.9992992291520673 

0.8214285714285714 

True Negatives: 183
False Positives: 48
False Negatives: 37
True Positives: 208 

Model:  rf
Best Params:  {'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}
0.9922915206727401 

0.7563025210084033 

True Negatives: 182
False Positives: 49
False Negatives: 67
True Positives: 178 

Model:  ada
Best Params:  {'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}
0.8689558514365803 

0.7605042016806722 

True Negatives: 176
False Positives: 55
False Negatives: 59
True Positives: 186 

Model:  gb
Best Params:  {'tf__ngram_range': (1, 1), 'tf__stop_words': 'english'}
0.9327259985984583 

0.7962184873949579 

True Negatives: 170
False Positives: 61
False Negatives: 36
True Positives: 209 

100%|██████████| 6/6 [00:21<00:00,  4.84s/it]

In [24]:
grid_results_tf = grid_results

## Results assessment

Adding columns for the gap between train and test accuracy scores.  A large gap suggests that a model is overfitting the training data.

In [25]:
grid_results_tf['tt_gap'] = grid_results_tf['train_accuracy'] - grid_results_tf['test_accuracy']
grid_results_cv['tt_gap'] = grid_results_cv['train_accuracy'] - grid_results_cv['test_accuracy']

The **baseline accuracy** is the accuracy we would get by always predicting the majority class, here `is_ls=1`.  It depends only on the share of our dataset that carries the target value.  Here, we normalize our value counts to show a baseline accuracy of **51.4%**.

In [26]:
# baseline accuracy
y_train.is_ls.value_counts(normalize=True)

1    0.514366
0    0.485634
Name: is_ls, dtype: float64
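
As a sanity check on the baseline, the sketch below applies the same majority-class idea to the test set, using class counts read off the confusion matrices earlier in this notebook (negatives = `tn + fp`, positives = `fn + tp`). The variable names are illustrative only.

```python
# Test-set class counts, taken from the CountVectorizer Multinomial NB confusion matrix.
negatives = 195 + 36   # tn + fp
positives = 42 + 203   # fn + tp
total = negatives + positives

# A majority-class classifier labels every post with the more common class.
majority_accuracy = max(negatives, positives) / total
print("test-set majority-class accuracy:", round(majority_accuracy, 4))

# How far a model's test accuracy sits above the training-set baseline.
train_baseline = 0.514366
multi_nb_test_accuracy = 0.836134
gap_over_baseline = multi_nb_test_accuracy - train_baseline
print("gap over baseline:", round(gap_over_baseline, 6))
```

The test-set figure comes out close to the 51.4% training baseline, so the two splits have similar class balance.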

In [27]:
grid_results_tf['ba_gap'] = grid_results_tf['test_accuracy'] - y_train.is_ls.value_counts(normalize=True)[1]
grid_results_cv['ba_gap'] = grid_results_cv['test_accuracy'] - y_train.is_ls.value_counts(normalize=True)[1]

By consolidating and sorting our results by `test_accuracy`, we can assess which models will be the best starting points.  Overall, the CountVectorizer and TF-IDF models performed similarly.  Because a CountVectorizer model registered the highest test accuracy (0.836 for Multinomial Naive Bayes, versus 0.821 for the best TF-IDF model), we will use CountVectorizer as our main vectorizer.

In [28]:
grid_results_cv.sort_values('test_accuracy',ascending=False)

| | model | best_params | train_accuracy | test_accuracy | tn | fp | fn | tp | tt_gap | ba_gap |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | multi_nb | {'cv__ngram_range': (1, 2), 'cv__stop_words': ... | 0.995795 | 0.836134 | 195 | 36 | 42 | 203 | 0.159661 | 0.321769 |
| 5 | gb | {'cv__ngram_range': (1, 2), 'cv__stop_words': ... | 0.913805 | 0.806723 | 167 | 64 | 28 | 217 | 0.107082 | 0.292357 |
| 4 | ada | {'cv__ngram_range': (1, 2), 'cv__stop_words': ... | 0.862649 | 0.777311 | 178 | 53 | 53 | 192 | 0.085338 | 0.262945 |
| 2 | logreg | {'cv__ngram_range': (1, 1), 'cv__stop_words': ... | 0.999299 | 0.766807 | 174 | 57 | 54 | 191 | 0.232493 | 0.252441 |
| 3 | rf | {'cv__ngram_range': (1, 1), 'cv__stop_words': ... | 0.993693 | 0.741597 | 180 | 51 | 72 | 173 | 0.252096 | 0.227231 |
| 1 | knn | {'cv__ngram_range': (1, 1), 'cv__stop_words': ... | 0.547302 | 0.533613 | 10 | 221 | 1 | 244 | 0.013689 | 0.019248 |


In [29]:
grid_results_tf.sort_values('test_accuracy',ascending=False)

| | model | best_params | train_accuracy | test_accuracy | tn | fp | fn | tp | tt_gap | ba_gap |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | logreg | {'tf__ngram_range': (1, 2), 'tf__stop_words': ... | 0.999299 | 0.821429 | 183 | 48 | 37 | 208 | 0.177871 | 0.307063 |
| 0 | multi_nb | {'tf__ngram_range': (1, 1), 'tf__stop_words': ... | 0.957253 | 0.819328 | 195 | 36 | 50 | 195 | 0.137925 | 0.304962 |
| 5 | gb | {'tf__ngram_range': (1, 1), 'tf__stop_words': ... | 0.932726 | 0.796218 | 170 | 61 | 36 | 209 | 0.136508 | 0.281853 |
| 4 | ada | {'tf__ngram_range': (1, 1), 'tf__stop_words': ... | 0.868956 | 0.760504 | 176 | 55 | 59 | 186 | 0.108452 | 0.246138 |
| 3 | rf | {'tf__ngram_range': (1, 1), 'tf__stop_words': ... | 0.992292 | 0.756303 | 182 | 49 | 67 | 178 | 0.235989 | 0.241937 |
| 1 | knn | {'tf__ngram_range': (1, 2), 'tf__stop_words': ... | 0.540995 | 0.529412 | 7 | 224 | 0 | 245 | 0.011583 | 0.015046 |
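
The two result tables can also be merged into a single ranking, which makes the cross-vectorizer comparison easier to read. The sketch below copies the reported train and test accuracies into plain tuples and sorts them; it uses only the standard library, and the tuple layout is an illustrative choice rather than the notebook's DataFrame structure.

```python
# (vectorizer, model, train_accuracy, test_accuracy) copied from the two result tables.
results = [
    ("cv", "multi_nb", 0.995795, 0.836134),
    ("cv", "gb",       0.913805, 0.806723),
    ("cv", "ada",      0.862649, 0.777311),
    ("cv", "logreg",   0.999299, 0.766807),
    ("cv", "rf",       0.993693, 0.741597),
    ("cv", "knn",      0.547302, 0.533613),
    ("tf", "logreg",   0.999299, 0.821429),
    ("tf", "multi_nb", 0.957253, 0.819328),
    ("tf", "gb",       0.932726, 0.796218),
    ("tf", "ada",      0.868956, 0.760504),
    ("tf", "rf",       0.992292, 0.756303),
    ("tf", "knn",      0.540995, 0.529412),
]

# Rank every vectorizer/model pair by test accuracy, highest first.
ranked = sorted(results, key=lambda row: row[3], reverse=True)
for vec, model, train_acc, test_acc in ranked:
    gap = train_acc - test_acc
    print(f"{vec:>2} {model:<9} test={test_acc:.4f} train-test gap={gap:.4f}")

top_two = [(vec, model) for vec, model, _, _ in ranked[:2]]
```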


Looking at model types, we can see that the CountVectorizer Multinomial Naive-Bayes and TF-IDF Logistic Regression models performed best on an initial run.  We will select these two, along with the Random Forest model, which the project requires, and the Gradient-Boost Decision Tree model, which had the next-highest test accuracy among the CountVectorizer models.  We will continue to optimize each of these models.

### Model Selections: 
#### 1. Lemmatized CountVectorizer Multinomial Naive-Bayes
  - `cv__ngram_range=(1,2)`
  - `cv__stop_words='english'`
  
#### 2. Lemmatized CountVectorizer Random Forest 
*(project requirement)*
  - `cv__ngram_range=(1,1)`
  - `cv__stop_words='english'`
  
#### 3. Lemmatized CountVectorizer Gradient-Boost Decision Tree
  - `cv__ngram_range=(1,2)`
  - `cv__stop_words='english'`
  
#### 4. Lemmatized TF-IDF Scaled Logistic Regression
  - `tf__ngram_range=(1,2)`
  - `tf__stop_words='english'`

## Continue to Notebook 4: Model Optimization