# Random Forest
Random Forest is an ensemble learning method. It tends to perform well because it builds an ensemble of decision trees, each based on a random selection of features, and combines the varied results of those trees to improve the accuracy of the overall model. In theory, Random Forests are a good choice when the goal is to minimize overfitting.
In practice the model was not overfit, but its accuracy was only marginally better than a coin flip on both my train and test sets. I do think this model could perform well on my data with some more optimization, but that would be very expensive to compute. The model took a very long time to fit and would probably be better run in a cloud computing environment. My grid search indicated that a max_df of 0.95 would be optimal. If I chose to run this model again in the cloud, I would definitely try adjusting min_df, because the search determined that not having one was better than the values I passed. In hindsight, I might have tried a range between 10% and 20%. I would also have liked to try a range between 100 and 500 for n_estimators, since my result was the maximum value in my parameter grid.
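The ensemble idea described above can be illustrated with a minimal sketch of hard majority voting. This is a simplification and not how the fitted model here works internally: scikit-learn's RandomForestClassifier averages the class probabilities of its trees rather than counting votes. The function name and the five example votes are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    label, _count = Counter(tree_predictions).most_common(1)[0]
    return label


# Hypothetical predictions from five trees for a single comment.
votes = [
    "Libertarian",
    "LateStageCapitalism",
    "Libertarian",
    "Libertarian",
    "LateStageCapitalism",
]

forest_prediction = majority_vote(votes)
print(forest_prediction)
```

Because each tree sees a different random slice of the problem, individual trees can be wrong in different ways while the combined vote stays stable.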

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
import re
import pickle
import matplotlib.pyplot as plt



# Loading data

In [3]:
combined = pd.read_pickle('../assets/combined.pkl')


In [4]:
X_train = pd.read_pickle('../assets/X_train.pkl')
X_test = pd.read_pickle('../assets/X_test.pkl')
y_train = pd.read_pickle('../assets/y_train.pkl')
y_test = pd.read_pickle('../assets/y_test.pkl')

# Setting up the pipeline

In [3]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('rfc',RandomForestClassifier(n_jobs=3,max_depth=53))
])
    

# Setting up the parameter grid

In [6]:
param_grid =  {
    'tfidf__min_df': np.arange(1,3,2),  # evaluates to array([1]), so only min_df=1 is searched
    'tfidf__max_df': [0.9, 0.95],
    'rfc__n_estimators':[50,100],
    'rfc__min_samples_leaf': [1,2]
}
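Since fit time was the main constraint, it helps to estimate the cost of a grid before running it. The sketch below counts candidates and total fits for the grid in this notebook, where np.arange(1, 3, 2) contains only the value 1, and for a wider grid. The wider grid's values are illustrative assumptions, not settings I ran, and count_fits is a hypothetical helper. The fold count of 3 is taken from the fit output in this notebook.

```python
from math import prod


def count_fits(grid, n_folds):
    """Return (number of candidate settings, total model fits) for a grid."""
    n_candidates = prod(len(values) for values in grid.values())
    return n_candidates, n_candidates * n_folds


# The grid used in this notebook, written out as plain lists.
notebook_grid = {
    "tfidf__min_df": [1],
    "tfidf__max_df": [0.9, 0.95],
    "rfc__n_estimators": [50, 100],
    "rfc__min_samples_leaf": [1, 2],
}

# Hypothetical wider grid (assumed values, for cost estimation only).
wider_grid = {
    "tfidf__min_df": [1, 2, 5],
    "tfidf__max_df": [0.9, 0.95],
    "rfc__n_estimators": [100, 200, 300, 400, 500],
    "rfc__min_samples_leaf": [1, 2],
}

print(count_fits(notebook_grid, n_folds=3))
print(count_fits(wider_grid, n_folds=3))
```

The wider grid needs several times as many fits as the original, and the larger forests will each be slower to fit, which is why I would move a search like this to the cloud.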

# Running a gridsearch

In [7]:
gs = GridSearchCV(pipe,param_grid=param_grid, verbose=1)

# Fitting the model

In [8]:
gs.fit(X_train,y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:  7.2min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...n_jobs=3,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'tfidf__min_df': array([1]), 'tfidf__max_df': [0.9, 0.95], 'rfc__n_estimators': [50, 100], 'rfc__min_samples_leaf': [1, 2]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

# Scoring the model
This model performed decently on the data. The amount of time it takes to fit was the only thing that kept me from tuning it further.

In [9]:
gs.score(X_train,y_train)

0.6857056855141356

In [10]:
gs.score(X_test,y_test)

0.6586372475098218

In [None]:
combined['subreddit'].value_counts()/combined.shape[0]
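The cell above computes each subreddit's share of the data. The largest share is the accuracy a model would get by always predicting the majority class, so it is the baseline the scores should be compared against. A minimal sketch of that calculation, using hypothetical labels rather than my actual data (in pandas, value_counts(normalize=True) gives the same shares):

```python
from collections import Counter


def class_shares(labels):
    """Return each label's share of the total number of rows."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


# Hypothetical labels: 6 rows from one subreddit, 4 from the other.
labels = ["Libertarian"] * 6 + ["LateStageCapitalism"] * 4

shares = class_shares(labels)
# Accuracy of a model that always predicts the most common class.
baseline_accuracy = max(shares.values())
print(shares, baseline_accuracy)
```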

# Feature importances


In [6]:
# Note: on scikit-learn 1.0 and later this method is get_feature_names_out()
features = gs.best_estimator_.steps[0][1].get_feature_names()

In [7]:
importance = gs.best_estimator_.steps[1][1].feature_importances_

In [8]:
important_features = list(zip(importance,features))

In [9]:
important_features.sort(reverse=True)
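Sorting the full list works, but with tens of thousands of features only the top few are needed. A small sketch of an alternative that pulls the k largest pairs directly is shown here. The helper name top_features is hypothetical, and the four values are importances from this notebook's output, rounded.

```python
import heapq


def top_features(importances, names, k=3):
    """Return the k (importance, name) pairs with the highest importance."""
    return heapq.nlargest(k, zip(importances, names))


# Rounded importances taken from this notebook's output.
importance = [0.0126, 0.0075, 0.0225, 0.0358]
features = ["capitalism", "freedom", "government", "libertarian"]

top = top_features(importance, features, k=2)
print(top)
```

Putting the importance first in each pair means the pairs are ordered by importance, with the feature name only used to break ties.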

# Exploring top features
Below I examine some of the important features. Although the interpretations aren't as straightforward as with a linear model, the list does give me a pulse on words and relationships worth exploring. The word libertarian in all its forms comes up quite a few times, often self-referentially in r/Libertarian and in a critical way in r/LateStageCapitalism (r/LSC). Other interesting words to note are trump, speech, gun, rights, working, capitalism, government, rich and freedom. None of these words surprise me, as they center around different libertarian philosophies like personal freedom, property and gun rights, and rule of law. Identifying important keywords for Reddit subs can help the site's moderators improve the content on the boards by giving them a clue into popular areas of discourse. This model could also help in targeted marketing, where marketers could slyly embed important keywords into their content to potentially increase engagement.

In [11]:
important_features[:20]

[(0.035843533652117315, 'libertarian'),
 (0.02245607732244579, 'government'),
 (0.019337309513915057, 'libertarians'),
 (0.01437608923886426, 'trump'),
 (0.01261113443903594, 'capitalism'),
 (0.01051449965954342, 'work'),
 (0.009577902309721059, 'libertarianism'),
 (0.009396957071501535, 'speech'),
 (0.009065005928113129, 'rights'),
 (0.008122084556353272, 'gun'),
 (0.008021595035872076, 'rich'),
 (0.007534477370745697, 'stupid'),
 (0.007516658547972291, 'freedom'),
 (0.007352504657517883, 'job'),
 (0.006925330778571558, 'working'),
 (0.006322979517875078, 'liberty'),
 (0.006077281604461981, 'workers'),
 (0.005938518588745745, 'law'),
 (0.005727109834183893, 'property'),
 (0.005692148593999743, 'state')]

In [12]:
df = {
    "features":features,
    "importance":importance,
   
}
df = pd.DataFrame(df)

In [13]:
df[df['importance']>.007].head(20)

Unnamed: 0,features,importance
10100,capitalism,0.012611
26590,freedom,0.007517
28911,government,0.022456
29674,gun,0.008122
36443,job,0.007353
39238,libertarian,0.035844
39254,libertarianism,0.009578
39264,libertarians,0.019337
58070,rich,0.008022
58193,rights,0.009065


In [14]:
X_train_preds = gs.predict(X_train)

In [15]:
preds = pd.DataFrame({
    "preds":X_train_preds,
    "features":X_train,
    "truth":y_train
})

In [16]:
preds.head()

Unnamed: 0,preds,features,truth
30147,LateStageCapitalism,all white meat. babies of colour are kept in a...,LateStageCapitalism
57685,Libertarian,it was illegal for people of other religions t...,Libertarian
29468,Libertarian,the usa is already spending way more per capit...,LateStageCapitalism
110065,Libertarian,"overall, yes. the genocide of the natives was ...",LateStageCapitalism
120518,Libertarian,&; this is basic burden of proof lol ...,Libertarian


# Exploring misclassified posts
Looking through some of these posts is a bit revealing. I notice a small trend of my model misclassifying phrases with swear words in them as coming from r/Libertarian when in fact they did not. In a future iteration I might try fitting a model with swear words included as stop words to see how it affects the overall accuracy.
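The stop-word idea above could look roughly like the sketch below. Both word lists are tiny placeholders I made up for illustration, and the tokenizer is a simple stand-in for the one TfidfVectorizer uses. In the pipeline itself, the combined list would be passed to TfidfVectorizer through its stop_words parameter instead of 'english'.

```python
import re


def remove_stop_words(text, stop_words):
    """Lowercase the text, split it into word tokens and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stop_words]


# Placeholder lists (assumptions): a real run would use a full English
# stop-word list plus a curated list of swear words.
base_stop_words = {"the", "a", "i", "so", "just", "then"}
swear_words = {"fuck", "fucked"}
stop_words = base_stop_words | swear_words

# One of the misclassified comments shown in this notebook.
kept = remove_stop_words("so just...fuck disabled people, i guess?", stop_words)
print(kept)
```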


In [23]:
preds[(preds['preds'] != preds['truth'])].head(100)


Unnamed: 0,preds,features,truth
29468,Libertarian,the usa is already spending way more per capit...,LateStageCapitalism
110065,Libertarian,"overall, yes. the genocide of the natives was ...",LateStageCapitalism
108695,Libertarian,"so just...fuck disabled people, i guess?",LateStageCapitalism
56582,Libertarian,you literally just refrased my own insult,LateStageCapitalism
194812,Libertarian,tribalism.,LateStageCapitalism
6383,LateStageCapitalism,i guess fuck white kids then? that's your argu...,Libertarian
151147,LateStageCapitalism,lol you're simply fucked in the head if you th...,Libertarian
84395,Libertarian,the article you linked showed me that f &lt;-&...,LateStageCapitalism
106849,Libertarian,"it pays in image boosting/advertisement, and a...",LateStageCapitalism
105209,Libertarian,you have to laugh at some of these rankings an...,LateStageCapitalism


# Misclassified Predictions

In [24]:
len(preds[(preds['preds'] != preds['truth'])])

80399

# Total Predictions
This means I have a misclassification rate of around 31% on the training set.

In [26]:
len(preds)

255808
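As a quick check on the arithmetic, the two counts above can be turned into an error rate and compared with the training score reported earlier. The numbers are copied from this notebook's outputs; only the variable names are mine.

```python
# Counts and score copied from the notebook outputs.
n_misclassified = 80399
n_total = 255808
train_accuracy = 0.6857056855141356

# Share of training predictions that were wrong, for either class.
error_rate = n_misclassified / n_total

print(round(error_rate, 3))
print(round(1 - train_accuracy, 3))
```

The error rate is the complement of the training accuracy, and it counts mistakes in both directions. A false positive rate for one subreddit would need the per-class counts from a confusion matrix.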

# Best estimators

In [126]:
gs.best_estimator_

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.95, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...n_jobs=3,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

In [18]:
gs.best_params_

{'rfc__min_samples_leaf': 2,
 'rfc__n_estimators': 100,
 'tfidf__max_df': 0.95,
 'tfidf__min_df': 1}

In [27]:
gs.predict(['The jobs numbers were great, the numbers have been incredible.'])

array(['LateStageCapitalism'], dtype='<U19')

In [34]:
# with open('../assets/random_forest.pkl','wb+') as f:
#     pickle.dump(gs,f)

In [5]:
# with open('../assets/random_forest.pkl', 'rb') as f:
#     gs = pickle.load(f)