## In this notebook, I explore the use of Naïve Bayes classification to tackle a particularly famous authorship question: who wrote the Federalist Papers?

# Loading modules

In [1]:
import numpy as np
import json
from sklearn.feature_extraction import text
import scipy
from matplotlib import pyplot as plt

# Loading data

In [2]:
# Read the JSON file inside a context manager so the file handle is closed afterwards
with open('/Users/daveyap/Desktop/github/Federalist_paper/fedpapers_split.txt') as f:
    papers = json.load(f)

papersH = papers[0] # papers by Hamilton 
papersM = papers[1] # papers by Madison
papersD = papers[2] # disputed papers

nH, nM, nD = len(papersH), len(papersM), len(papersD)

In [3]:
print(nH, nM, nD)

51 17 12


# Bag-of-words model


In [4]:
# This allows you to ignore certain common words in English.
# You may want to experiment by choosing the second option or your own
# list of stop words, but be sure to keep 'hamilton' and 'madison' in
# this list at a minimum, as their names appear in the text of the papers
# and leaving them in could lead to unpredictable results.
# Stop words are lowercase because the vectorizer lowercases the text first;
# the set is converted to a list because newer scikit-learn versions expect one.
my_stop_words = list(text.ENGLISH_STOP_WORDS.union({'hamilton','madison'}))
#my_stop_words = ['hamilton','madison']

## Form bag of words model using words that appear in at least 10 papers (min_df=10)
vectorizer = text.CountVectorizer(lowercase=True,stop_words=my_stop_words,min_df=10)
X = vectorizer.fit_transform(papersH+papersM+papersD).toarray()
# Split word counts into separate matrices (rows are papers, columns are words)
XH, XM, XD = X[:nH,:], X[nH:nH+nM,:], X[nH+nM:,:]
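
To make the `min_df` filter concrete, here is a minimal plain-Python sketch of document-frequency filtering on three made-up sentences. The `docs` list, the `min_df_toy` threshold and the whitespace tokenizer are assumptions for illustration only; `CountVectorizer` uses its own tokenizer, so this mirrors the counting rule rather than the exact vocabulary it would produce.

```python
from collections import Counter

# Hypothetical toy corpus (not taken from the Federalist Papers).
docs = [
    "the union and the states",
    "the union of the people",
    "the people and the government",
]
min_df_toy = 2  # keep words that appear in at least 2 documents

# Document frequency: each word is counted once per document,
# however many times it repeats inside that document.
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

# Keep only the words whose document frequency reaches the threshold.
vocab = sorted(word for word, df in doc_freq.items() if df >= min_df_toy)
print(vocab)
```

Here "the" occurs six times in total but has a document frequency of 3; it is the document count, not the raw number of occurrences, that is compared against the threshold.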

In [5]:
# Show the words remaining after filtering out stop words and words that
# appear in fewer than min_df papers (each word maps to its column index in X)
vectorizer.vocabulary_

{'general': 541,
 'independent': 614,
 'journal': 659,
 'saturday': 1096,
 '1787': 0,
 'people': 873,
 'state': 1156,
 'new': 801,
 'york': 1303,
 'experience': 462,
 'federal': 499,
 'government': 549,
 'called': 139,
 'deliberate': 307,
 'constitution': 247,
 'united': 1244,
 'states': 1158,
 'america': 58,
 'subject': 1164,
 'importance': 595,
 'consequences': 233,
 'existence': 455,
 'union': 1242,
 'safety': 1090,
 'welfare': 1289,
 'parts': 861,
 'composed': 205,
 'fate': 492,
 'empire': 393,
 'respects': 1067,
 'interesting': 643,
 'world': 1297,
 'frequently': 533,
 'remarked': 1039,
 'country': 275,
 'conduct': 221,
 'example': 439,
 'decide': 291,
 'important': 596,
 'question': 994,
 'men': 756,
 'capable': 141,
 'establishing': 422,
 'good': 547,
 'reflection': 1019,
 'choice': 167,
 'forever': 518,
 'depend': 315,
 'political': 896,
 'constitutions': 249,
 'force': 515,
 'truth': 1230,
 'remark': 1038,
 'crisis': 283,
 'propriety': 971,
 'regarded': 1022,
 'decision': 293,

In [8]:
XH, XM, XD = np.array(XH), np.array(XM), np.array(XD)

# Why is Laplace smoothing important for text classification in this project?

Suppose we are solving a spam classification task, and in the training data the word ‘buy’ appears only in spam emails and never in non-spam emails. The unsmoothed estimate is then 𝑃(buy|not-spam)=0, that is, no non-spam email can ever contain the word ‘buy’. Clearly, this does not make sense: the probability of this event is low, but it is not zero. Further, because we multiply all the word probabilities together during inference, even one such zero term forces the whole product to zero (or, in log space, to negative infinity), no matter what the other words suggest.

Therefore, we need smoothing: the goal is to raise the zero probability estimates to a small positive number and lower the other values slightly so that the probabilities still sum to 1. Laplace (add-one) smoothing is one such method.
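
As a small illustration of the zero-probability problem and its fix, the sketch below compares unsmoothed and add-one estimates on made-up counts. The three-word `vocabulary` and the `counts` are hypothetical and are not drawn from the Federalist data.

```python
from collections import Counter

# Hypothetical word counts for one class (e.g. non-spam); 'buy' was never observed.
vocabulary = ["buy", "meeting", "report"]
counts = Counter({"meeting": 3, "report": 1})
total = sum(counts.values())   # number of observed words in this class
V = len(vocabulary)            # vocabulary size

# Unsmoothed (maximum-likelihood) estimate: unseen words get probability 0.
p_raw = {w: counts[w] / total for w in vocabulary}

# Laplace (add-one) smoothing: add 1 to every count and V to the denominator.
p_smooth = {w: (counts[w] + 1) / (total + V) for w in vocabulary}

print(p_raw["buy"], p_smooth["buy"])
print(sum(p_smooth.values()))
```

The smoothed estimate for the unseen word becomes small but positive, and the smoothed values still form a valid probability distribution over the vocabulary.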

# What priors do I use for the two classes?

The prior probability for each author is the number of papers known to be written by that author divided by the total number of papers by Hamilton and Madison: 51/68 = 0.75 for Hamilton and 17/68 = 0.25 for Madison.

In [9]:
#compute prior probability
prior_H = nH  / (nH+nM)  # prior prob for Hamilton
prior_M = nM  / (nH+nM)  # prior prob for Madison
print('priors of Hamilton is = '+ str(prior_H))
print('priors of Madison is = '+ str(prior_M))

priors of Hamilton is = 0.75
priors of Madison is = 0.25


# How I build the Naive Bayes classifier for bag-of-words classification

step 1: Form a bag-of-words model using words that appear in at least 10 papers, and split the word counts into three matrices: disputed papers (XD), papers by Hamilton (XH), and papers by Madison (XM).<br>
step 2: Calculate the prior for each class (Hamilton and Madison) from the number of papers known to be by each author.<br>
step 3: Estimate the probability of each vocabulary word being used by Hamilton and by Madison, with Laplace smoothing.<br>
step 4: For each disputed paper, compute the log posterior score under each author: the log prior plus the sum of the paper's word counts times the log word probabilities.<br>
step 5: Compare the two scores and attribute each of the twelve disputed essays to the author with the higher one.<br>
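
The steps above can be sketched end to end on a tiny made-up example. This is a minimal plain-Python illustration under stated assumptions, not the notebook's actual pipeline: the helper names `train` and `log_score`, the three-word vocabulary and all counts are hypothetical.

```python
import math

def train(count_rows):
    """Laplace-smoothed log word probabilities from a list of count vectors."""
    V = len(count_rows[0])
    word_totals = [sum(row[j] for row in count_rows) for j in range(V)]
    grand_total = sum(word_totals)
    return [math.log((1 + c) / (V + grand_total)) for c in word_totals]

def log_score(log_prior, log_probs, doc_counts):
    """Log prior plus count-weighted log likelihood (log posterior up to a constant)."""
    return log_prior + sum(lp * c for lp, c in zip(log_probs, doc_counts))

# Hypothetical word-count vectors over a 3-word vocabulary.
docs_a = [[4, 1, 0], [3, 0, 1]]   # author A favours word 0
docs_b = [[0, 2, 3]]              # author B favours words 1 and 2
unknown = [0, 1, 2]               # document of unknown authorship

# Priors from the number of documents per author.
n_a, n_b = len(docs_a), len(docs_b)
log_prior_a = math.log(n_a / (n_a + n_b))
log_prior_b = math.log(n_b / (n_a + n_b))

score_a = log_score(log_prior_a, train(docs_a), unknown)
score_b = log_score(log_prior_b, train(docs_b), unknown)
predicted = "A" if score_a > score_b else "B"
print(predicted)
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when a document contains many words. In this toy case the unknown document is assigned to author B even though A has the larger prior, because its words are much more likely under B.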

In [10]:
# Estimate the probability of each vocabulary word being used by Hamilton, with Laplace smoothing:
# (1 + count of the word) / (vocabulary size + total word count)
fH = (1.0 + XH.sum(axis=0)) / (XH.shape[1] + XH.sum())

# Same estimate for Madison
fM = (1.0 + XM.sum(axis=0)) / (XM.shape[1] + XM.sum())
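
Written out, the smoothed estimate computed in this cell for a word $w$ and an author $a \in \{H, M\}$ is given below. The symbols $c_a(w)$ (total count of $w$ across that author's papers) and $V$ (vocabulary size) are notation introduced here for illustration, not names from the code.

$$\hat{P}(w \mid a) = \frac{1 + c_a(w)}{V + \sum_{w'} c_a(w')}$$

Summing the numerator over all $V$ words gives $V + \sum_{w'} c_a(w')$, so the estimates for each author sum to 1.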

# How many of the essays do you think were written by Hamilton and how many were by Madison?

In [13]:
# Compute the log posterior score (up to a shared constant) of each disputed paper
# under Hamilton and Madison, and attribute the paper to the higher-scoring author
for essay_no, counts in enumerate(XD, start=1):
    post_H = np.log(prior_H) + np.sum(np.log(fH) * counts)
    post_M = np.log(prior_M) + np.sum(np.log(fM) * counts)

    if post_H > post_M:
        print('Hamilton wrote the disputed essay {}'.format(essay_no))
    else:
        print('Madison wrote the disputed essay {}'.format(essay_no))

Madison wrote the disputed essay 1
Madison wrote the disputed essay 2
Madison wrote the disputed essay 3
Madison wrote the disputed essay 4
Hamilton wrote the disputed essay 5
Hamilton wrote the disputed essay 6
Madison wrote the disputed essay 7
Hamilton wrote the disputed essay 8
Hamilton wrote the disputed essay 9
Madison wrote the disputed essay 10
Madison wrote the disputed essay 11
Madison wrote the disputed essay 12


Under this classifier, 8 of the 12 disputed essays are attributed to Madison and 4 to Hamilton. Note that there is no verifiable correct answer; no one knows for certain.