# Naive Bayes


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/processed_tweets.csv')

In [3]:
df.dropna(inplace=True)

In [4]:
df.reset_index(drop=True, inplace=True)

In [5]:
df = df[[
    'audience_feature',
    'bias_feature',
    'message_feature',
    'label_feature',
    'source_feature',
    'text_feature'
]]

In [6]:
df['label_feature'].value_counts()

From: Ileana Ros-Lehtinen (Representative from Florida)         76
From: Kevin Brady (Representative from Texas)                   69
From: John Fleming (Representative from Louisiana)              48
From: Cory Booker (Senator from New Jersey)                     47
From: Bernard Sanders (Senator from Vermont)                    40
From: Kyrsten Sinema (Representative from Arizona)              37
From: Todd Rokita (Representative from Indiana)                 35
From: Michael Crapo (Senator from Idaho)                        34
From: John Cornyn (Senator from Texas)                          33
From: Darrell Issa (Representative from California)             33
From: Niki Tsongas (Representative from Massachusetts)          32
From: Bill Flores (Representative from Texas)                   31
From: Eric Swalwell (Representative from California)            29
From: Michael Fitzpatrick (Representative from Pennsylvania)    28
From: John Boehner (Representative from Ohio)                 

In [7]:
df

| | audience_feature | bias_feature | message_feature | label_feature | source_feature | text_feature |
|---|---|---|---|---|---|---|
| 0 | national | partisan | policy | From: Trey Radel (Representative from Florida) | twitter | RT @nowthisnews: Rep. Trey Radel (R- #FL) slam... |
| 1 | national | partisan | attack | From: Mitch McConnell (Senator from Kentucky) | twitter | VIDEO - #Obamacare: Full of Higher Costs and ... |
| 2 | national | neutral | support | From: Kurt Schrader (Representative from Oregon) | twitter | Please join me today in remembering our fallen... |
| 3 | national | neutral | policy | From: Michael Crapo (Senator from Idaho) | twitter | RT @SenatorLeahy: 1st step toward Senate debat... |
| 4 | national | partisan | policy | From: Mark Udall (Senator from Colorado) | twitter | .@amazon delivery #drones show need to update ... |
| 5 | national | neutral | mobilization | From: Frederica Wilson (Representative from Fl... | twitter | @BBCWorld, help us keep the kidnapped Nigerian... |
| 6 | constituency | neutral | mobilization | From: Ron Barber (Representative from Arizona) | twitter | Show your Arizona pride-choose your favorite S... |
| 7 | national | neutral | personal | From: Chuck Fleischmann (Representative from T... | twitter | What a wonderful night at State Senator Ken Ya... |
| 8 | national | partisan | support | From: Steny Hoyer (Representative from Maryland) | twitter | Great op-ed by Pres. Clinton about signing #FM... |
| 9 | national | partisan | policy | From: John Fleming (Representative from Louisi... | twitter | As POTUS golfs, pushes amnesty &amp; ignores K... |


In [8]:
from sklearn.model_selection import train_test_split

In [9]:
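# Target: political bias of the tweet, encoded as partisan = 1, neutral = 0.
y = df['bias_feature']
y = y.replace({'partisan': 1, 'neutral': 0})
# Keep X as a one-column DataFrame so DataFrameMapper can select the column by name.
X = df[['text_feature']]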

In [10]:
1 - y.mean()

0.7395309882747069
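
Because partisan tweets are coded as 1, `1 - y.mean()` is the share of neutral tweets, which is the accuracy of a model that always predicts the majority class. The sketch below checks that reading against `sklearn`'s `DummyClassifier` on a tiny made-up label vector; `X_toy` and `y_toy` are illustrative assumptions, not the tweet data, and the equality only holds when class 0 is the majority.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: three neutral (0) and one partisan (1).
y_toy = np.array([0, 0, 1, 0])
# The dummy model ignores features, so a placeholder column is enough.
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)

# Share of the 0 class, computed the same way as in the cell above.
majority_share = 1 - y_toy.mean()

print(baseline_accuracy, majority_share)
```

Any real classifier should beat this number on held-out data before its accuracy is worth reporting.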

In [11]:
# Default split: 75% train / 25% test. Pass random_state=<int> for a reproducible split.
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [12]:
from sklearn_pandas import DataFrameMapper
from sklearn.feature_extraction.text import CountVectorizer

In [13]:
mapper = DataFrameMapper([
    ['text_feature', CountVectorizer()]
], df_out=True)

In [14]:
Z_train = mapper.fit_transform(X_train)
Z_test = mapper.transform(X_test)

#### 4. Fit a Naive Bayes model!

<details><summary> Which Naive Bayes model should we pick, and why? </summary>
```
- The columns of X are all non-negative integer word counts, so MultinomialNB is the best choice here.
- BernoulliNB is best when every column of X is binary (0/1), a.k.a. dummy variables.
- GaussianNB assumes the columns of X are Normally distributed within each class. (Practically, though, it gets used whenever BernoulliNB and MultinomialNB are inappropriate.)
```
</details>
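
To make the count-versus-binary distinction concrete, here is a minimal sketch on a made-up two-document corpus. The strings, labels, and variable names are illustrative assumptions, not the tweet data. `CountVectorizer()` produces integer word counts, which suits `MultinomialNB`, while `CountVectorizer(binary=True)` produces 0/1 presence indicators, which suits `BernoulliNB`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical corpus and labels, for illustration only.
docs = ["vote vote today", "join us today"]
labels = [1, 0]

# Integer counts: a repeated word contributes more than once.
counts = CountVectorizer().fit_transform(docs).toarray()

# Binary indicators: a word is either present (1) or absent (0).
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

# Each model is paired with the feature type it assumes.
count_model = MultinomialNB().fit(counts, labels)
binary_model = BernoulliNB().fit(binary, labels)

print(counts)
print(binary)
```

The two matrices differ only where a word repeats within a document, so for short texts such as tweets the two models often see very similar inputs.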

In [15]:
from sklearn.naive_bayes import MultinomialNB

In [16]:
model = MultinomialNB()

Recall from earlier that Naive Bayes lets us set the class priors ourselves. We could do so here through the `class_prior` argument, but we'll stick with the defaults (`fit_prior=True`, `class_prior=None`) and let `sklearn` estimate the priors directly from the class frequencies in the training data.
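
For reference, this is roughly what overriding the priors would look like. The sketch uses a tiny made-up count matrix (`X_toy`, `y_toy` are assumptions for illustration, not the tweet features) and compares the priors `MultinomialNB` learns from class frequencies with priors fixed through `class_prior`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 documents, 3 vocabulary terms.
X_toy = np.array([[2, 0, 1], [3, 1, 0], [0, 2, 2], [2, 1, 0]])
y_toy = np.array([0, 0, 1, 0])

# Default behaviour: priors come from the class frequencies in y_toy.
learned = MultinomialNB().fit(X_toy, y_toy)

# Override: treat both classes as equally likely a priori.
fixed = MultinomialNB(class_prior=[0.5, 0.5]).fit(X_toy, y_toy)

# class_log_prior_ stores log priors, so exponentiate to read them.
learned_priors = np.exp(learned.class_log_prior_)
fixed_priors = np.exp(fixed.class_log_prior_)

print(learned_priors, fixed_priors)
```

Fixing the priors can be useful when the training class balance is known not to reflect the population the model will be applied to.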

In [17]:
model.fit(Z_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [18]:
predictions = model.predict(Z_test)

In [19]:
model.score(Z_train, y_train)

0.9154103852596315

In [20]:
model.score(Z_test, y_test)

0.7705192629815746

In [21]:
from sklearn.metrics import confusion_matrix

In [22]:
confusion_matrix(y_test, predictions)

array([[776,  86],
       [188, 144]])

In [23]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

In [24]:
print(f"True Negatives: {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives: {tp}")

True Negatives: 776
False Positives: 86
False Negatives: 188
True Positives: 144
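
The four counts above are enough to derive the usual classification metrics by hand. The sketch below plugs in the numbers from this particular run; the metric variable names are my own, and the figures will change with a different train/test split.

```python
# Counts copied from the confusion matrix output above.
tn, fp, fn, tp = 776, 86, 188, 144

# Share of all test tweets classified correctly.
accuracy = (tp + tn) / (tn + fp + fn + tp)
# Share of truly partisan tweets the model catches (recall).
sensitivity = tp / (tp + fn)
# Share of truly neutral tweets the model leaves alone.
specificity = tn / (tn + fp)
# Share of predicted-partisan tweets that really are partisan.
precision = tp / (tp + fp)

print(f"Accuracy:    {accuracy:.4f}")
print(f"Sensitivity: {sensitivity:.4f}")
print(f"Specificity: {specificity:.4f}")
print(f"Precision:   {precision:.4f}")
```

The accuracy reproduces the test score reported by `model.score(Z_test, y_test)`, while the sensitivity shows that the model identifies fewer than half of the partisan tweets in the test set.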
