# Multinomial Naive Bayes - Production
I chose to fit a Multinomial Naive Bayes model because of the simplicity of its assumptions and its straightforward implementation. Of all the models I fit, the Naive Bayes model performed best with the least tuning and in the shortest amount of time.

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
import nltk
import re
import pickle



# Loading in pickled data

In [3]:
combined = pd.read_pickle('../assets/combined.pkl')


# Loading pickled train test split set

In [4]:
X_train = pd.read_pickle('../assets/X_train.pkl')
X_test = pd.read_pickle('../assets/X_test.pkl')
y_train = pd.read_pickle('../assets/y_train.pkl')
y_test = pd.read_pickle('../assets/y_test.pkl')

# Setting up my pipeline
I fed my data through a pipeline that includes a TF-IDF vectorizer and a Multinomial Naive Bayes classifier.
I passed the vectorizer English stop words and grid searched for the best parameters. For min_df I searched over integer document counts rather than float proportions generated with np.linspace; this was simply due to limits on computing power given the volume of data. Ultimately the grid search determined that a min_df of 1 was appropriate. The max_df, which ignores words that appear in more than the set proportion of documents, ended up being 0.95. This means max_df should be filtering out highly repetitive words that may not be included in my stop words. The best alpha for the Naive Bayes classifier was 0.50. In retrospect I may have been able to increase alpha to gain accuracy, since 0.50 was the largest value in my grid, but 76.7% accuracy on the training data seemed reasonable given my dataset; any higher and I would suspect an overfit model.
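
To make the two thresholds concrete, here is a minimal pure-Python sketch of the document-frequency rule as I understand it from the scikit-learn documentation: an integer min_df is an absolute document count, and a float max_df is a proportion of documents. The toy corpus and the helper name `filter_by_df` are made up for illustration; this is not the vectorizer's actual code.

```python
from collections import Counter

def filter_by_df(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies between min_df and max_df.

    min_df is an absolute document count (int); max_df is a proportion
    of documents (float), the same units used in this notebook's grid.
    """
    tokenized = [set(doc.lower().split()) for doc in docs]
    # Count the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in tokens)
    max_count = max_df * len(docs)
    return sorted(t for t, n in doc_freq.items() if min_df <= n <= max_count)

# Hypothetical toy corpus, not taken from the subreddit data.
docs = [
    "markets need freedom",
    "workers need unions",
    "markets need regulation",
    "workers need freedom",
]

kept_min = filter_by_df(docs, min_df=2)     # drops terms seen in only one document
kept_max = filter_by_df(docs, max_df=0.95)  # drops terms seen in more than 95% of documents
print(kept_min)
print(kept_max)
```

With min_df=1 no term is dropped for being rare, which would explain why one-off tokens survive in the fitted vocabulary.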

In [4]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('nb',MultinomialNB()),
])


In [5]:
param_grid =  {
    'tfidf__min_df': np.arange(1,10,2),
    'tfidf__max_df': [.95, .98, 1.0],
    'nb__alpha': [.01,.10,.20,.50]
}

# Running a grid search for the best parameters

In [6]:
gs = GridSearchCV(pipe, param_grid=param_grid,verbose=1)

In [7]:
gs.fit(X_train,y_train)

Fitting 3 folds for each of 60 candidates, totalling 180 fits


[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed: 19.2min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...True,
        vocabulary=None)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'tfidf__min_df': array([1, 3, 5, 7, 9]), 'tfidf__max_df': [0.95, 0.98, 1.0], 'nb__alpha': [0.01, 0.1, 0.2, 0.5]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)
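
The "60 candidates, totalling 180 fits" line in the log follows directly from the size of the grid. A small sketch of that arithmetic, using the values from param_grid and the fold count reported in the log (the variable names are my own):

```python
from itertools import product

# Values copied from param_grid in this notebook.
min_df_values = [1, 3, 5, 7, 9]          # np.arange(1, 10, 2)
max_df_values = [0.95, 0.98, 1.0]
alpha_values = [0.01, 0.10, 0.20, 0.50]
n_folds = 3                               # folds reported in the fit log

# Every combination of the three parameters is one candidate model.
candidates = list(product(min_df_values, max_df_values, alpha_values))
total_fits = len(candidates) * n_folds

# Average wall-clock time per fit, from the 19.2 minutes in the log.
seconds_per_fit = 19.2 * 60 / total_fits
print(len(candidates), total_fits, round(seconds_per_fit, 1))
```

Adding one more value to any of the three lists multiplies the work, which is why the grid was kept small.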

# Best Parameters



In [8]:
gs.best_params_

{'nb__alpha': 0.5, 'tfidf__max_df': 0.95, 'tfidf__min_df': 1}

# Scoring the model

In [9]:
gs.score(X_train,y_train)

0.7667547535651739

In [10]:
gs.score(X_test,y_test)

0.7234255327592365

In [20]:
gs.predict_proba(X_test)

array([[0.15037602, 0.84962398],
       [0.43696979, 0.56303021],
       [0.25381307, 0.74618693],
       ...,
       [0.47298755, 0.52701245],
       [0.39041347, 0.60958653],
       [0.23486574, 0.76513426]])
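
These probabilities are what the predicted labels are derived from. The sketch below reuses the six rows printed above and shows how a stricter cutoff for the 1 class would change the labels. It assumes column 0 is r/LateStageCapitalism and column 1 is r/Libertarian (the alphabetical class order); the helper `label_with_threshold` and the 0.6 cutoff are hypothetical, not something the model was run with.

```python
# Rows copied from the predict_proba output shown in this notebook.
# Assumption: column 0 is LateStageCapitalism and column 1 is Libertarian.
probs = [
    [0.15037602, 0.84962398],
    [0.43696979, 0.56303021],
    [0.25381307, 0.74618693],
    [0.47298755, 0.52701245],
    [0.39041347, 0.60958653],
    [0.23486574, 0.76513426],
]

def label_with_threshold(rows, threshold):
    """Label a row Libertarian only if its column-1 probability reaches the threshold."""
    return ["Libertarian" if p_lib >= threshold else "LateStageCapitalism"
            for _, p_lib in rows]

default_labels = label_with_threshold(probs, 0.5)   # same as taking the larger column
stricter_labels = label_with_threshold(probs, 0.6)  # hypothetical stricter cutoff
print(default_labels.count("Libertarian"), stricter_labels.count("Libertarian"))
```

A higher cutoff trades some true positives for fewer false positives, so it would need to be checked against held-out data before use.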

In [21]:
gs.best_score_

0.7204700400300225

# Calculating the baseline score

In [25]:
combined['subreddit'].value_counts()/combined.shape[0]

LateStageCapitalism    0.502404
Libertarian            0.497596
Name: subreddit, dtype: float64
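
The baseline is the accuracy of always guessing the larger class. A short sketch using the proportions printed above and the test score reported earlier in this notebook (variable names are my own):

```python
# Class proportions copied from the value_counts output above.
class_share = {"LateStageCapitalism": 0.502404, "Libertarian": 0.497596}
test_accuracy = 0.7234255327592365  # gs.score(X_test, y_test) from this notebook

# Always predicting the majority class gives the baseline accuracy.
baseline_class = max(class_share, key=class_share.get)
baseline_accuracy = class_share[baseline_class]

# How far the model's test accuracy sits above that baseline.
lift = test_accuracy - baseline_accuracy
print(baseline_class, round(baseline_accuracy, 4), round(lift, 4))
```

The model's test accuracy is roughly 22 percentage points above the majority-class baseline.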

# Feeding simple keywords into the model for prediction
A neat feature of setting up a pipeline is the ability to run new data through the model. I pulled newer comments and articles from each sub to test the accuracy. The model recognizes very specific keywords from each sub with impressive accuracy, but when the two subs talk about overlapping subjects there is an increase in false positives. My 1 class is whether or not a post is from r/Libertarian.

In [27]:
foo = ["""
'I love Marx'
"""]

In [28]:
gs.predict(foo)

array(['LateStageCapitalism'], dtype='<U19')

In [29]:
preds = gs.predict(X_test)
preds

array(['Libertarian', 'Libertarian', 'Libertarian', ..., 'Libertarian',
       'Libertarian', 'Libertarian'], dtype='<U19')

# Setting up a confusion matrix
Although this model has a 72% accuracy score on the test set, it suffers from a serious drawback. When predicting my 1 class the model produces too many false positives: in the matrix below, 20,604 of the 59,658 r/LateStageCapitalism comments are labeled r/Libertarian, so specificity is only about 65%, while sensitivity is about 79%. In future iterations I would do more feature engineering and tuning.

In [30]:
cm = confusion_matrix(y_test,preds,labels=['LateStageCapitalism','Libertarian'])

In [31]:
cm_df = pd.DataFrame(data=cm,columns=['pred_LSC','pred_LIB'],index=['actual_Negative','actual_Positive'])
cm_df

Unnamed: 0,pred_LSC,pred_LIB
actual_Negative,39054,20604
actual_Positive,14243,52094
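
The usual summary metrics can be read straight off these four counts. A minimal sketch, treating r/Libertarian as the positive class as the notebook does (the variable names are my own):

```python
# Counts copied from the confusion matrix above (positive class = Libertarian).
tn, fp = 39054, 20604   # actual LateStageCapitalism row
fn, tp = 14243, 52094   # actual Libertarian row

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # share of actual Libertarian comments that are caught
specificity = tn / (tn + fp)   # share of actual LateStageCapitalism comments that are caught
precision = tp / (tp + fp)     # share of Libertarian predictions that are correct

for name, value in [("accuracy", accuracy), ("sensitivity", sensitivity),
                    ("specificity", specificity), ("precision", precision)]:
    print(f"{name}: {value:.3f}")
```

The accuracy computed this way matches the gs.score(X_test, y_test) value reported earlier, which is a useful check that the matrix and the score describe the same predictions.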


In [35]:
y_test.count('Libertarian'),y_test.count('LateStageCapitalism')

(66337, 59658)

In [218]:
with open('../assets/naive_bayes_model.pkl','wb+') as f:
    pickle.dump(gs,f)

In [219]:
with open('../assets/naive_bayes_model.pkl','rb') as f:
    gs = pickle.load(f)

In [16]:
gs.best_params_

{'nb__alpha': 0.5, 'tfidf__max_df': 0.95, 'tfidf__min_df': 1}

In [11]:
gs.best_estimator_


Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.95, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...True,
        vocabulary=None)), ('nb', MultinomialNB(alpha=0.5, class_prior=None, fit_prior=True))])

In [221]:
weights = gs.best_estimator_.steps[-1][1].coef_[0]

In [222]:
features = gs.best_estimator_.steps[0][1].get_feature_names()

In [223]:
import pandas as pd

In [224]:
df = {
    "features":features,
    "weights":weights,
}
df = pd.DataFrame(df)

In [230]:
df[df.features.str.contains("karl")].head(100)

Unnamed: 0,features,weights
16018,culturalkarlmarxism,-13.789317
36974,karl,-10.393648
36975,karlos,-13.789317
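
For context on the weights column: for a two-class MultinomialNB, my understanding is that coef_ holds the smoothed log-probability of each feature given the second class (here r/Libertarian), so every value is negative and rarer terms sit further from zero. Below is a minimal sketch of additive smoothing. Only alpha = 0.5 comes from the grid search; `class_total`, `vocab_size` and the term counts are hypothetical numbers chosen for illustration.

```python
import math

def smoothed_log_prob(term_count, class_total, vocab_size, alpha):
    """log P(term | class) with additive (Lidstone) smoothing."""
    return math.log((term_count + alpha) / (class_total + alpha * vocab_size))

# Hypothetical totals, chosen only for illustration.
class_total = 1000.0   # summed weight of all terms in the class
vocab_size = 5000      # number of features in the vocabulary
alpha = 0.5            # best alpha found by the grid search

unseen = smoothed_log_prob(0.0, class_total, vocab_size, alpha)   # term never seen in the class
common = smoothed_log_prob(40.0, class_total, vocab_size, alpha)  # term seen often in the class
print(round(unseen, 3), round(common, 3))
```

Terms with the same count in a class receive the same weight, which would be consistent with several features above sharing the value -13.789317.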


In [217]:
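def keyworder(frame,keyword_list):
    # Match rows whose feature contains any of the keywords; the keywords
    # are escaped and joined into a single regex alternation.
    pattern = '|'.join(re.escape(word) for word in keyword_list)
    return frame[frame.features.str.contains(pattern)]
# The output recorded below shows only the 'marx' matches.
keyworder(df,['marx','socialism','freedom','democracy'])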




Unnamed: 0,features,weights
7952,bookmarx,-12.660116
16018,culturalkarlmarxism,-13.789317
39258,libertarianmarxist,-13.096649
41547,marx,-8.95376
41548,marxerino,-13.325928
41549,marxian,-11.98138
41550,marxim,-13.789317
41551,marximum,-13.789317
41552,marxism,-9.357587
41553,marxist,-9.218564



In [134]:
X_train_preds = gs.predict(X_train)

In [114]:
preds = pd.DataFrame({
    "preds":X_train_preds,
    "features":X_train,
    "truth":y_train
})

In [116]:
preds.head()

Unnamed: 0,preds,features,truth
30147,LateStageCapitalism,all white meat. babies of colour are kept in a...,LateStageCapitalism
57685,Libertarian,it was illegal for people of other religions t...,Libertarian
29468,Libertarian,the usa is already spending way more per capit...,LateStageCapitalism
110065,Libertarian,"overall, yes. the genocide of the natives was ...",LateStageCapitalism
120518,Libertarian,&; this is basic burden of proof lol ...,Libertarian


In [129]:
preds[(preds['preds'] != preds['truth'])]

Unnamed: 0,preds,features,truth
29468,Libertarian,the usa is already spending way more per capit...,LateStageCapitalism
110065,Libertarian,"overall, yes. the genocide of the natives was ...",LateStageCapitalism
164912,LateStageCapitalism,i don't know.... shared services sounds alot l...,Libertarian
194812,Libertarian,tribalism.,LateStageCapitalism
6383,LateStageCapitalism,i guess fuck white kids then? that's your argu...,Libertarian
178569,LateStageCapitalism,cheers!,Libertarian
151147,LateStageCapitalism,lol you're simply fucked in the head if you th...,Libertarian
84395,Libertarian,the article you linked showed me that f &lt;-&...,LateStageCapitalism
184323,LateStageCapitalism,. war is the only real problem on this list. ....,Libertarian
105209,Libertarian,you have to laugh at some of these rankings an...,LateStageCapitalism
