# Notebook 04: Random Forest, Extremely Randomized Trees, Support Vector Classifier

# Introduction

In this notebook, the subreddit posts are tokenized, and Random Forest, ExtraTrees, and Support Vector Classifier models are built to predict which subreddit each post came from, using a train/test split together with cross-validation. Model performance is compared against the best logistic regression model. The vectorizer is not re-optimized here; instead, the TfidfVectorizer is constrained to the same parameters as model 1, which was built with LogisticRegression. A model is considered successful if it performs better than the best Logistic Regression model.

The best model (SVC, at 85.4% testing accuracy) is then fit to the entire dataset, and pickled for use on posts scraped from 17 subsequent months. In other words, these 2000 posts are treated as the training set and posts from the other 17 subsequent months are treated as the test sets (in the next notebook). 

A summary of the observations is listed at the end.

## Contents:

1. Generate models

    1.1 Random Forest
    
    1.2 Extremely Randomized Trees
    
    1.3 Support Vector Classifier

2. Rebuild SVC Model

3. Summary

In [1]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import pickle

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC # support vector classifier
from sklearn.pipeline import Pipeline

# Import CountVectorizer and TFIDFVectorizer from feature_extraction.text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 350

In [2]:
df = pd.read_csv('../data/df_model.csv')

## 1. Generate Models

In [3]:
# create a model based just on 'title' - there are lots of words in 'title', and some 'selftext' is empty
X = df['title']
y = df['subreddit']

In [4]:
# Define training and testing sets
# Choose not to explicitly stratify on y since dataset is pretty evenly split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=42)
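
The decision not to stratify rests on the classes being roughly balanced. A minimal sketch of how that could be checked, and of what `stratify=y` would change, is shown here on made-up data. The names `X_toy`, `y_toy`, `sub_a`, and `sub_b` are placeholders and not the real posts or subreddit names.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical, perfectly balanced toy data standing in for df['title'] / df['subreddit']
X_toy = [f"post {i}" for i in range(16)]
y_toy = ["sub_a"] * 8 + ["sub_b"] * 8

# Class balance of the full label column
balance = Counter(y_toy)
print(balance)

# With stratify=y, the test split keeps the same class proportions as y
_, _, _, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)
test_balance = Counter(y_test_toy)
print(test_balance)
```

On a dataset that is already close to evenly split, a plain random split usually lands near these proportions anyway, which is consistent with the choice made above.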

### 1.1 Random Forest

In [5]:
# order of tuples tells the pipeline what order to execute the different things in
# use tfidf, Random Forest

pipe = Pipeline([
    ('tf', TfidfVectorizer()),
    ('rfc', RandomForestClassifier())
])

pipe_params = {
    'tf__stop_words': [['lpt']], # the 'tf__' prefix targets the pipeline step named 'tf' (the TfidfVectorizer)
    'tf__min_df': [2],
    'rfc__n_estimators': [200, 500, 1000],
    'rfc__max_depth': [10] # constrain max_depth else it would select None
}
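
The `step__parameter` naming used in `pipe_params` can be inspected directly on a pipeline. The sketch below rebuilds the same two steps under a separate name (`pipe_demo`, chosen here only for illustration) and shows how a double-underscore key is routed to the matching step.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

pipe_demo = Pipeline([
    ('tf', TfidfVectorizer()),
    ('rfc', RandomForestClassifier())
])

# Every tunable parameter is exposed as '<step name>__<parameter name>'
param_names = sorted(pipe_demo.get_params().keys())
print([name for name in param_names if name.startswith('tf__min')])

# Setting a parameter through the pipeline updates the underlying step
pipe_demo.set_params(tf__min_df=2, rfc__max_depth=10)
print(pipe_demo.named_steps['tf'].min_df, pipe_demo.named_steps['rfc'].max_depth)
```

GridSearchCV uses the same mechanism: each key in the parameter grid is passed to `set_params` before the pipeline is fit on a fold.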

In [6]:
# Instantiate GridSearchCV.

gs = GridSearchCV(pipe, # what object are we optimizing? An estimator to be optimized
                  pipe_params, # parameter values we are searching over
                  cv = 7) # 7-fold cross-validation

In [7]:
# gridsearch now has the pipeline
# Fit GridSearch to training data.

gs.fit(X_train, y_train)

GridSearchCV(cv=7,
             estimator=Pipeline(steps=[('tf', TfidfVectorizer()),
                                       ('rfc', RandomForestClassifier())]),
             param_grid={'rfc__max_depth': [10],
                         'rfc__n_estimators': [200, 500, 1000],
                         'tf__min_df': [2], 'tf__stop_words': [['lpt']]})

In [8]:
# Which parameter combination scored best in cross-validation?
gs.best_params_

{'rfc__max_depth': 10,
 'rfc__n_estimators': 200,
 'tf__min_df': 2,
 'tf__stop_words': ['lpt']}

In [9]:
# Score model on training set.
# Score model on testing set.
gs.score(X_train, y_train), gs.score(X_test, y_test)

(0.8740690589031821, 0.8052738336713996)

With unlimited max_depth, this model was essentially perfect on the training set (99% accuracy), so I constrained max_depth to 10. With that constraint it is not grossly overfit (87.4% training vs. 80.5% testing accuracy), although its testing accuracy is still about 5 percentage points below the selected LogisticRegression model (test score 85.2%).
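
The effect of capping tree depth can be illustrated on synthetic data. This is only a sketch: the data comes from `make_classification`, not from the subreddit posts, and the depth cap of 3 and the 50 trees are arbitrary values picked to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data; NOT the subreddit posts
X_syn, y_syn = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)

results = {}
for depth in (None, 3):  # None = unlimited depth, 3 = arbitrary cap for the demo
    rf = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    rf.fit(X_tr, y_tr)
    # Depth actually reached by the deepest tree in the forest
    deepest = max(tree.get_depth() for tree in rf.estimators_)
    results[depth] = {
        'deepest_tree': deepest,
        'train_acc': rf.score(X_tr, y_tr),
        'test_acc': rf.score(X_te, y_te),
    }
    print(depth, results[depth])
```

Comparing `train_acc` and `test_acc` for the two settings is the same check that motivated `max_depth=10` in this notebook.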

### 1.2 Extremely Randomized Trees

In [10]:
# use tfidf, ExtraTrees

pipe = Pipeline([
    ('tf', TfidfVectorizer()),
    ('etc', ExtraTreesClassifier())
])

pipe_params = {
    'tf__stop_words': [['lpt']], # the 'tf__' prefix targets the pipeline step named 'tf' (the TfidfVectorizer)
    'tf__min_df': [2],
    'etc__n_estimators': [200, 500, 1000],
    'etc__max_depth': [10]
}

In [11]:
# Instantiate GridSearchCV.

gs = GridSearchCV(pipe, # what object are we optimizing? An estimator to be optimized
                  pipe_params, # parameter values we are searching over
                  cv = 7) # 7-fold cross-validation

In [12]:
# gridsearch now has the pipeline
# Fit GridSearch to training data.

gs.fit(X_train, y_train)

GridSearchCV(cv=7,
             estimator=Pipeline(steps=[('tf', TfidfVectorizer()),
                                       ('etc', ExtraTreesClassifier())]),
             param_grid={'etc__max_depth': [10],
                         'etc__n_estimators': [200, 500, 1000],
                         'tf__min_df': [2], 'tf__stop_words': [['lpt']]})

In [13]:
# Which parameter combination scored best in cross-validation?
gs.best_params_

{'etc__max_depth': 10,
 'etc__n_estimators': 500,
 'tf__min_df': 2,
 'tf__stop_words': ['lpt']}

In [14]:
# Score model on training set.
# Score model on testing set.
gs.score(X_train, y_train), gs.score(X_test, y_test)

(0.9052132701421801, 0.8174442190669371)

This model performs very similarly to the Random Forest above. The testing accuracy is slightly higher (81.7% vs. 80.5%), and the gap between training and testing accuracy is comparable: about 9 percentage points here, versus about 7 for the Random Forest.
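
The train/test gaps for the two tree models can be computed directly from the scores printed above. The dictionary below simply copies those outputs; the variable names are illustrative.

```python
# (train accuracy, test accuracy) copied from the cell outputs above
scores = {
    'RandomForest': (0.8740690589031821, 0.8052738336713996),
    'ExtraTrees': (0.9052132701421801, 0.8174442190669371),
}

# Train-test gap in percentage points, rounded to one decimal place
gaps = {name: round((train - test) * 100, 1) for name, (train, test) in scores.items()}
print(gaps)  # {'RandomForest': 6.9, 'ExtraTrees': 8.8}
```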

### 1.3 Support Vector Classifier

In [15]:
# use tfidf, Support Vector Classifier

pipe = Pipeline([
    ('tf', TfidfVectorizer()),
    ('svc', SVC())
])

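# Note: SVC requires C > 0 and names the polynomial kernel 'poly', so the C=0 and
# 'polynomial' candidates below fail to fit and are scored as NaN by GridSearchCV.
# A rerun should start C above 0 and use 'poly'.
pipe_params = {
    'tf__stop_words': [['lpt']], # the 'tf__' prefix targets the pipeline step named 'tf' (the TfidfVectorizer)
    'tf__min_df': [2],
    'svc__C': np.linspace(0, 5, 20),
    'svc__kernel':['rbf','polynomial'],
    'svc__degree':list(range(4)) # degree is only used by the polynomial kernel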
}

In [16]:
# Instantiate GridSearchCV.

gs = GridSearchCV(pipe, # what object are we optimizing? An estimator to be optimized
                  pipe_params, # parameter values we are searching over
                  n_jobs = 4, 
                  cv = 7) # 7-fold cross-validation

In [17]:
# gridsearch now has the pipeline
# Fit GridSearch to training data.

gs.fit(X_train, y_train)

GridSearchCV(cv=7,
             estimator=Pipeline(steps=[('tf', TfidfVectorizer()),
                                       ('svc', SVC())]),
             n_jobs=4,
             param_grid={'svc__C': array([0.        , 0.26315789, 0.52631579, 0.78947368, 1.05263158,
       1.31578947, 1.57894737, 1.84210526, 2.10526316, 2.36842105,
       2.63157895, 2.89473684, 3.15789474, 3.42105263, 3.68421053,
       3.94736842, 4.21052632, 4.47368421, 4.73684211, 5.        ]),
                         'svc__degree': [0, 1, 2, 3],
                         'svc__kernel': ['rbf', 'polynomial'],
                         'tf__min_df': [2], 'tf__stop_words': [['lpt']]})

In [18]:
# Which parameter combination scored best in cross-validation?
gs.best_params_

{'svc__C': 1.0526315789473684,
 'svc__degree': 0,
 'svc__kernel': 'rbf',
 'tf__min_df': 2,
 'tf__stop_words': ['lpt']}

In [19]:
# Score model on training set.
# Score model on testing set.
gs.score(X_train, y_train), gs.score(X_test, y_test)

(0.991198375084631, 0.8539553752535497)

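The training score is nearly perfect on the SVC (99.1%), and the testing score (85.4%) is higher than Random Forest and ExtraTrees and slightly above the best LogisticRegression model (85.2%). The selected degree of 0 has no effect, because degree is only used by the polynomial kernel and the chosen kernel is rbf. Use this model on subsequent test datasets.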

## 2. Rebuild SVC Model

Rebuild the model with the whole dataset.

In [20]:
pipe = Pipeline([
    ('tf', TfidfVectorizer(min_df=2, stop_words = ['lpt'])),
    ('svc', SVC(C=1.0526315789473684, degree=0, kernel='rbf'))
])

pipe.fit(X,y)
pipe.score(X,y) # training accuracy on the full dataset; no data is held out in this cell

0.9923857868020305

## 3. Summary

* 3 additional models were evaluated. With unlimited max_depth, the Random Forest model was essentially perfect on the training set (99% accuracy), so I constrained max_depth to 10 (similarly for ExtraTrees). This way, the tree models are not grossly overfit, but their testing accuracy remains a few percentage points below the selected LogisticRegression model (test score 85.2%).

| Model | Transformer |   Estimator  |            Details           | Accuracy (train) |  Accuracy (test) |
|:-----:|:-----------:|:------------:|:----------------------------:|:----------------:|:----------------:|
|   1   |  TfidfVect  | RandomForest | GridSearchCV, 'lpt' stopword |       0.881      |       0.809      |
|   2   |  TfidfVect  |  ExtraTrees  | GridSearchCV, 'lpt' stopword |       0.903      |       0.822      |
|   3   |  TfidfVect  |      SVC     | GridSearchCV, 'lpt' stopword |       0.991      |       0.854      |

* With 99% training accuracy and 85.4% testing accuracy, the SVC was selected for future use.

## Pickle the selected model

Pickle the optimal model for use with other datasets.

In [21]:
with open('./models/subreddit_model_svc.pkl', mode='wb') as pickle_out:
    pickle.dump(pipe, pickle_out) # arguments: the object to serialize, then the open file handle
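
A pickled pipeline can later be loaded and applied to raw text in one call, since the vectorizer is stored together with the classifier. The round trip is sketched below on a toy pipeline; the titles, the labels `sub_a` / `sub_b`, and the temporary file name are all made up for illustration and are not the notebook's data or paths.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy titles and labels standing in for the real posts
titles = ['boil water before adding pasta', 'keep a spare key at work',
          'why is the sky blue', 'how do magnets work']
labels = ['sub_a', 'sub_a', 'sub_b', 'sub_b']

toy_pipe = Pipeline([('tf', TfidfVectorizer()), ('svc', SVC(kernel='rbf'))])
toy_pipe.fit(titles, labels)

# Write the fitted pipeline to disk, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')
    with open(path, mode='wb') as pickle_out:
        pickle.dump(toy_pipe, pickle_out)
    with open(path, mode='rb') as pickle_in:
        loaded_pipe = pickle.load(pickle_in)

# The loaded pipeline vectorizes raw text and predicts in one call
new_titles = ['keep water at work', 'why do magnets work']
original_preds = list(toy_pipe.predict(new_titles))
loaded_preds = list(loaded_pipe.predict(new_titles))
print(loaded_preds)
```

Pickled scikit-learn models are generally only safe to load with the same library version that wrote them, so it is worth recording the version alongside the file.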