In this notebook I will fit a random forest model on TF-IDF vectorized data, count vectorized data, and count vectorized data with sentiment analysis features. This is my last attempt at creating a model that can accurately (accuracy score greater than 75 percent) predict which subreddit a post came from.

In [2]:
import pandas as pd
import numpy as np
import time
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text as stop_words  # text exposes ENGLISH_STOP_WORDS; the stop_words module is gone in newer scikit-learn

### Loading my train/test split dataframes

In [3]:
reddit = pd.read_csv('../Data/reddit.csv', index_col=0)
X_train = pd.read_csv('../Data/X_train.csv', header = None, index_col=0)
X_test = pd.read_csv('../Data/X_test.csv', header = None, index_col=0)
y_train = pd.read_csv('../Data/y_train.csv', header = None, index_col=0)
y_test = pd.read_csv('../Data/y_test.csv', header = None, index_col=0)
X_train_sen = pd.read_csv('../Data/X_train_sen.csv', index_col=0)
X_test_sen = pd.read_csv('../Data/X_test_sen.csv', index_col=0)
y_train_sen = pd.read_csv('../Data/y_train_sen.csv', header = None, index_col=0)
y_test_sen = pd.read_csv('../Data/y_test_sen.csv', header = None, index_col=0)
reddit.fillna('', inplace=True)

Creating a custom list of stop words that includes numbers and common terms found in URLs

In [13]:
custom_stopwords = list(stop_words.ENGLISH_STOP_WORDS)
custom_stopwords.extend(['10', '12', '13', '14', '15', '18', '25','200', '000','https', 'com', 'youtube', 'www'])
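As a quick sanity check of what this list does, here is a minimal sketch that runs a CountVectorizer with the same custom stop words over a single made-up title. The title is purely illustrative and not from the dataset, and importing ENGLISH_STOP_WORDS from sklearn.feature_extraction.text is an assumption about the installed scikit-learn version rather than the import used above.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Built-in English stop words plus the extra numbers and URL fragments
custom_stopwords = list(ENGLISH_STOP_WORDS)
custom_stopwords.extend(['10', '12', '13', '14', '15', '18', '25', '200', '000',
                         'https', 'com', 'youtube', 'www'])

# Made-up example title (not from the dataset)
toy_titles = ["Watch https www youtube com senate debate 10 times"]

cvec = CountVectorizer(stop_words=custom_stopwords)
cvec.fit(toy_titles)

# Terms that survive the stop word filter
kept_terms = sorted(cvec.vocabulary_)
print(kept_terms)
```

The URL fragments and the number never reach the vocabulary, so the forest cannot split on them.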

### Random Forest on TFIDF data

After extensive grid searching, the best parameters I found are as follows:

tfidf__min_df: Keep every term that appears in at least 1 document (no rare terms are dropped)

tfidf__max_df: Ignore terms that appear in over $85 \%$ of the documents

tfidf__norm: Normalize each document vector to unit l2 length

rf__n_estimators: Create 100 decision trees for voting

rf__max_features: Consider at most log2 of the total number of features when looking for the best split

rf__min_samples_split: At least 3 samples are required to split an internal node

rf__min_samples_leaf: At least 1 sample is required at a leaf node
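To make the max features setting concrete, here is a small sketch of how many candidate features each split would look at under the sqrt and log2 rules. The helper name and the vocabulary size of 5000 are hypothetical, and the rounding follows my understanding of how scikit-learn resolves these strings, so treat it as an approximation rather than the library's exact code.

```python
import math

def max_features_per_split(n_features, rule):
    """Number of candidate features examined at each split for a given rule."""
    if rule == 'sqrt':
        return max(1, int(math.sqrt(n_features)))
    if rule == 'log2':
        return max(1, int(math.log2(n_features)))
    raise ValueError("unknown rule: %s" % rule)

n_features = 5000  # hypothetical vocabulary size, not measured on my data
for rule in ('sqrt', 'log2'):
    print(rule, max_features_per_split(n_features, rule))
```

With a vocabulary this large, log2 gives each split far fewer features to choose from than sqrt, which makes the individual trees more random. In the scikit-learn versions that accept it, 'auto' behaves like 'sqrt' for classifiers, so those two grid entries are equivalent.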

In [14]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=custom_stopwords)),
    ('rf', RandomForestClassifier())
])

params = {
    'tfidf__min_df': [1,3],
    'tfidf__max_df': [.80,.85],
    'tfidf__norm': ['l1','l2'],
    'rf__n_estimators': [90,100],
    'rf__max_features': ['auto', 'log2', 'sqrt'],
    'rf__min_samples_split': [3,4],
    'rf__min_samples_leaf': [1]
}

gs = GridSearchCV(pipe, params)

gs.fit(X_train[1], y_train[1])

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'tfidf__min_df': [1, 3], 'tfidf__max_df': [0.8, 0.85], 'tfidf__norm': ['l1', 'l2'], 'rf__n_estimators': [90, 100], 'rf__max_features': ['auto', 'log2', 'sqrt'], 'rf__min_samples_split': [3, 4], 'rf__min_samples_leaf': [1]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
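To get a feel for how much work this grid search does, the sketch below counts the parameter combinations in the grid above. The number of cross-validation folds used when cv=None depends on the scikit-learn version (3 in older releases and 5 in newer ones, as far as I know), so both cases are shown.

```python
from itertools import product

# Same grid as the cell above
params = {
    'tfidf__min_df': [1, 3],
    'tfidf__max_df': [.80, .85],
    'tfidf__norm': ['l1', 'l2'],
    'rf__n_estimators': [90, 100],
    'rf__max_features': ['auto', 'log2', 'sqrt'],
    'rf__min_samples_split': [3, 4],
    'rf__min_samples_leaf': [1],
}

# Every combination of one value per parameter is one candidate model
n_candidates = len(list(product(*params.values())))
print('candidates:', n_candidates)

for n_folds in (3, 5):
    # One fit per candidate per fold, plus one final refit on all training data
    total_fits = n_candidates * n_folds + 1
    print('%d folds -> %d pipeline fits' % (n_folds, total_fits))
```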

### Scoring my model

In [15]:
gs.score(X_train[1],y_train[1])

0.9906103286384976

In [16]:
gs.score(X_test[1],y_test[1])

0.6197183098591549

In [17]:
gs.best_params_

{'rf__max_features': 'log2',
 'rf__min_samples_leaf': 1,
 'rf__min_samples_split': 3,
 'rf__n_estimators': 100,
 'tfidf__max_df': 0.85,
 'tfidf__min_df': 1,
 'tfidf__norm': 'l2'}

Our first random forest model scored an accuracy of $62.0\%$ on the test set but $99.1\%$ on the training set, so it is severely overfit

### Random Forest on count vectorized data

Now I'll see if a random forest model performs better on data that is represented as raw term counts rather than TF-IDF weights
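To show the difference between the two representations, here is a minimal sketch on two made-up titles (not from the dataset). It only illustrates what each vectorizer outputs and is not the pipeline fitted in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up titles for illustration only
toy_titles = [
    "senate passes budget bill",
    "senate debate on budget budget cuts",
]

counts = CountVectorizer().fit_transform(toy_titles).toarray()
tfidf = TfidfVectorizer(norm='l2').fit_transform(toy_titles).toarray()

print(counts)           # integer number of occurrences per term
print(tfidf.round(2))   # weights that discount terms shared by both titles

# With norm='l2' every document vector is scaled to unit length
row_norms = np.linalg.norm(tfidf, axis=1)
print(row_norms)
```

Counts keep the raw magnitude of each term, while TF-IDF rescales every row, so the trees see different split thresholds on the same text.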

In [44]:
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words=custom_stopwords)),
    ('rf', RandomForestClassifier())
])

After extensive grid searching, the best parameters I found are as follows:

cvec__min_df: Keep every word that appears in at least 1 document (no rare words are dropped)

cvec__max_df: Ignore words that appear in over $90\%$ of the documents

rf__n_estimators: Make 100 decision trees for voting

rf__max_features: Consider at most the square root of the total number of features when looking for the best split

rf__min_samples_split: There need to be at least 3 samples to split an internal node

rf__min_samples_leaf: There need to be at least 2 samples at a leaf node

In [45]:
params = {
    'cvec__min_df': [1,3],
    'cvec__max_df': [.85,.9],
    'rf__n_estimators': [90,100],
    'rf__max_features': ['auto', 'log2', 'sqrt'],
    'rf__min_samples_split': [3,4],
    'rf__min_samples_leaf': [1,2]
}

In [46]:
gs_cvec = GridSearchCV(pipe, params)

In [47]:
gs_cvec.fit(X_train[1], y_train[1])

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['ie', 'mor...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'cvec__min_df': [1, 3], 'cvec__max_df': [0.85, 0.9], 'rf__n_estimators': [90, 100], 'rf__max_features': ['auto', 'log2', 'sqrt'], 'rf__min_samples_split': [3, 4], 'rf__min_samples_leaf': [1, 2]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

### Scoring my model

In [48]:
gs_cvec.score(X_train[1],y_train[1])

0.8835680751173709

In [49]:
gs_cvec.score(X_test[1],y_test[1])

0.6591549295774648

In [24]:
gs_cvec.best_params_

{'cvec__max_df': 0.9,
 'cvec__min_df': 1,
 'rf__max_features': 'sqrt',
 'rf__min_samples_leaf': 2,
 'rf__min_samples_split': 3,
 'rf__n_estimators': 100}

My random forest model performed better on count vectorized data, with a test accuracy of $65.9 \%$. It is also less overfit, with a training accuracy of $88.4 \%$ compared to $99.1 \%$ for the TF-IDF model
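The overfitting comparison can be made explicit by computing the gap between training and test accuracy for both models, using the scores reported above. The dictionary layout is just for illustration.

```python
# Accuracy scores copied from the gs.score and gs_cvec.score outputs above
scores = {
    'TF-IDF': {'train': 0.9906103286384976, 'test': 0.6197183098591549},
    'count': {'train': 0.8835680751173709, 'test': 0.6591549295774648},
}

# A larger train-test gap means the model memorized more of the training data
gaps = {name: s['train'] - s['test'] for name, s in scores.items()}

for name, gap in gaps.items():
    print('%s: train-test gap = %.1f percentage points' % (name, 100 * gap))
```

The gap shrinks from roughly 37 to roughly 22 percentage points, which is still a large amount of overfitting.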

### Random Forest on count vectorized data with sentiment analysis

Let's see if adding features for sentiment analysis will help our random forest model differentiate between the subreddits
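The sentiment dataframes were built in an earlier step that is not shown in this notebook, so the sketch below is only an assumption about how such a feature matrix could be assembled: word counts with three sentiment columns appended. The titles and the sentiment scores are made up; real scores would come from whatever sentiment tool was used.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up titles for illustration only
toy_titles = ["senate passes budget bill", "voters angry over budget cuts"]

# Hypothetical sentiment scores, one row per title
sentiment = pd.DataFrame({
    'negative': [0.0, 0.4],
    'neutral': [1.0, 0.6],
    'positive': [0.0, 0.0],
})

cvec = CountVectorizer()
counts = cvec.fit_transform(toy_titles)

# Order the terms by their column index so names line up with the matrix
terms = sorted(cvec.vocabulary_, key=cvec.vocabulary_.get)
counts_df = pd.DataFrame(counts.toarray(), columns=terms)

# Word counts and sentiment scores side by side in one feature matrix
X_sen = pd.concat([counts_df, sentiment], axis=1)
print(X_sen.shape)
```

Because the result is already a numeric dataframe, the random forest can be grid searched on it directly without a vectorizer step in a pipeline.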

After extensive grid searching, the best parameters I found are as follows:

n_estimators: Create 90 decision trees for voting

max_features: Consider at most log2 of the total number of features when looking for the best split

min_samples_split: There need to be at least 4 samples to split an internal node

min_samples_leaf: There needs to be at least 1 sample at a leaf node

In [28]:
params = {
    'n_estimators': [90,100],  # matches the param_grid recorded in the fit output
    'max_features': ['auto', 'log2', 'sqrt'],
    'min_samples_split': [3,4],
    'min_samples_leaf': [1]
}

In [29]:
gs = GridSearchCV(RandomForestClassifier(), params)

In [30]:
gs.fit(X_train_sen,y_train_sen[1])

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [90, 100], 'max_features': ['auto', 'log2', 'sqrt'], 'min_samples_split': [3, 4], 'min_samples_leaf': [1]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

### Scoring my model

In [31]:
gs.score(X_train_sen,y_train_sen[1])

0.9586854460093897

In [32]:
gs.score(X_test_sen,y_test_sen[1])

0.5915492957746479

In [34]:
gs.best_params_

{'max_features': 'log2',
 'min_samples_leaf': 1,
 'min_samples_split': 4,
 'n_estimators': 90}

Adding sentiment features did not improve my model: test accuracy dropped to $59.2 \%$, so sentiment analysis did not help differentiate between the subreddits. Let's look at the most important features from my random forest model

### Most important features

In [100]:
weights = pd.DataFrame(data = [gs.best_estimator_.feature_importances_], columns = X_train_sen.columns)

In [101]:
weights = weights.abs().T

In [102]:
weights.sort_values(ascending=False, by = 0).head()

| Feature | Importance |
|---|---|
| neutral | 0.072381 |
| negative | 0.057525 |
| positive | 0.052037 |
| clinton | 0.020468 |
| trump | 0.015775 |
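For reference, here is a minimal sketch of reading feature importances from a fitted random forest with a pandas Series. It uses synthetic data from make_classification and hypothetical feature names, so the numbers have nothing to do with my Reddit data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the real feature matrix
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=0)
feature_names = ['f%d' % i for i in range(X.shape[1])]  # hypothetical names

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One importance per feature, labelled by column name
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head())

# Importances are non-negative and normalized to sum to 1
print(importances.sum())
```

Since the values are never negative, taking the absolute value does not change them. In the table above, the three sentiment features together account for about 0.18 of the total importance, with the rest spread thinly over individual words.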


It seems that in this model the sentiments are the most important features, but this model only scored about $9$ percentage points better than a coin flip, leaving me to conclude that they are not very good features.
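The coin flip comparison assumes the two subreddits are equally represented in the data. A quick way to check is to compute the accuracy of always predicting the most common class. The label counts below are hypothetical and only show the calculation.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical label distribution, not the real class balance
toy_labels = ['democrat'] * 55 + ['republican'] * 45

baseline = majority_baseline(toy_labels)
test_accuracy = 0.5915492957746479  # sentiment model test score from above

print('baseline: %.3f' % baseline)
print('lift over baseline: %.1f percentage points'
      % (100 * (test_accuracy - baseline)))
```

If the classes turned out to be imbalanced like this, the real improvement over the baseline would be even smaller than 9 points.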

### Conclusion

Random forest performed the best out of all of my models, with its best accuracy score being $65.9 \%$ on a count vectorized dataframe. However, this score is only marginally better than those of my other two models, logistic regression and k-nearest neighbors, which had accuracy scores of $63 \%$ and $60 \%$ respectively. In general, it was hard to differentiate between the two subreddits, r/democrat and r/republican, leaving me to conclude that the topics discussed on those two subreddits are largely the same. I suspect this is because a large majority of posts are news headlines, and thus the same title is posted on both subreddits. This is a good thing, because it suggests that neither party has a bias or preference as to what news or topics are discussed.
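One way to test the suspicion about shared headlines would be to measure how many posts have a title that appears in both subreddits. The sketch below does this on a tiny made-up dataframe; the column names subreddit and title are assumptions and may not match the columns in my reddit dataframe.

```python
import pandas as pd

# Made-up posts for illustration only
posts = pd.DataFrame({
    'subreddit': ['democrat', 'democrat', 'republican', 'republican'],
    'title': ['Senate passes budget bill', 'Local rally draws crowd',
              'Senate passes budget bill', 'Governor signs tax law'],
})

# Number of distinct subreddits each title was posted to
subs_per_title = posts.groupby('title')['subreddit'].nunique()
shared_titles = subs_per_title[subs_per_title > 1].index

# Fraction of all posts whose title shows up in both subreddits
shared_fraction = posts['title'].isin(shared_titles).mean()
print(list(shared_titles))
print(shared_fraction)
```

A high fraction on the real data would support the idea that identical headlines put a ceiling on how well any model can separate the two subreddits.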