# Naive Bayes text classifier 

This notebook illustrates how to apply the "Naive Bayes" method to build a simple text classifier.

The data folder contains 70 files:

- thirty files labelled p0.txt -- p29.txt, with text extracted from ten recent news articles about politics
- thirty files labelled s0.txt -- s29.txt, with text extracted from ten recent news articles about sports
- ten files labelled t0.txt -- t9.txt, which will be the "test set"

The data were downloaded on 10 Feb 2016, so the politics articles are dominated by the Iowa and New Hampshire primaries, and the sports articles by the Super Bowl.

We will train a binary classifier (politics/sports) based on the first sixty "training" files (p0.txt -- p29.txt and s0.txt -- s29.txt), and test it on the ten "test" files (t0.txt -- t9.txt).

### Import the texts into our notebook.
The file contents can be read into three Python lists as follows:

In [1]:
ptexts=[open('data/p{}.txt'.format(i)).read() for i in range(30)]
stexts=[open('data/s{}.txt'.format(i)).read() for i in range(30)]
ttexts=[open('data/t{}.txt'.format(i)).read() for i in range(10)]

The file p0.txt is now available to us as `ptexts[0]`; to view its contents, type `print(ptexts[0])`.

Next we need to split the texts into something like words. To do this, we convert all characters to lowercase and then extract every string of letters a-z and apostrophes as a list. For example, for the first politics text `ptexts[0]`, using a regular expression that finds all strings made up of a-z and ':


In [None]:
import re
re.findall("[a-z']+",ptexts[0].lower())

`re.findall("[a-z']+",txt)` just finds all contiguous strings made up of the characters a-z or ', and returns them as a list.
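To make the behaviour of this pattern concrete, here is a minimal sketch on a made-up sentence. The helper name `tokenize` and the sample text are illustrative assumptions, not part of the data set.

```python
import re

def tokenize(text):
    """Lowercase the text and return every run of a-z letters or apostrophes."""
    return re.findall("[a-z']+", text.lower())

# Hypothetical sample sentence (not taken from the data folder)
sample = "Rubio's campaign isn't over; it's on to New Hampshire in 2016!"
tokens = tokenize(sample)
print(tokens)
```

Note that digits and punctuation are dropped entirely, while apostrophes keep forms such as "rubio's" and "isn't" together as single tokens.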

This will return a list of all the 'words' in that document. For the time being, we only want to count the number of documents in which a word occurs (not the number of occurrences within the document), so we apply set() to the above list so that each word only appears once. Finally, it's convenient to use the Counter object (http://docs.python.org/2/library/collections.html), which is just a dictionary to accumulate counts for the words. The result is:

In [None]:
from collections import Counter
Counter(set(re.findall("[a-z']+",ptexts[0].lower())))
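The effect of applying `set()` before `Counter` can be seen in a small sketch. The sentence and the variable names `term_counts` and `doc_presence` are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical one-document example
doc = "The team won. The team celebrated, and the fans celebrated too."
words = re.findall("[a-z']+", doc.lower())

term_counts = Counter(words)        # occurrences of each word within the document
doc_presence = Counter(set(words))  # each distinct word counted exactly once

print(term_counts["the"], doc_presence["the"])
```

With `set()`, every value in the counter is 1, which is what we want when the counters are later added up across documents to give the number of documents containing each word.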

### Define two counters

To accumulate the document counts (the number of documents in which each word occurs), we define two counters: `n_p` for the politics training set and `n_s` for the sports training set. Each is built by summing the per-document counters over the corresponding training set:

In [4]:
import numpy as np
n_p = np.sum([Counter(set(re.findall("[a-z']+",txt.lower()))) for txt in ptexts])
n_s = np.sum([Counter(set(re.findall("[a-z']+",txt.lower()))) for txt in stexts])

<font size="-1">[Note: the cell above uses `np.sum()`, which requires `import numpy as np`. Alternatively, the numpy functions can be loaded into the primary namespace, e.g., using `%pylab inline` (which executes `from numpy import *`), in which case a plain `sum()` refers to the numpy version. Each `Counter()` is a dictionary whose keys are words and whose values are their document counts, so adding two `Counter()`s means taking the set union of their keys and adding the counts of any coincident ones. The built-in Python `sum()`, called with the list alone, fails with a "TypeError: unsupported operand type(s) for +" because it starts from the integer 0 and cannot add a `Counter()` to it, whereas the numpy `sum()` adds the counters to one another and works properly. Passing an empty counter as the start value, `sum(counters, Counter())`, also works without numpy.]</font>
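As a sketch of how document counts accumulate without numpy, the following uses three made-up headlines. The names `toy_docs`, `doc_words`, `doc_freq` and `doc_freq_update` are assumptions for illustration only.

```python
import re
from collections import Counter

# Hypothetical miniature "training set"
toy_docs = [
    "Trump wins the primary",
    "The primary is close",
    "Voters decide the primary today",
]

def doc_words(text):
    """Return the set of distinct words in one document."""
    return set(re.findall("[a-z']+", text.lower()))

# Built-in sum() works once an empty Counter is given as the start value
doc_freq = sum((Counter(doc_words(t)) for t in toy_docs), Counter())

# Equivalent accumulation with Counter.update(), which adds counts in place
doc_freq_update = Counter()
for t in toy_docs:
    doc_freq_update.update(doc_words(t))

print(doc_freq.most_common(2))
```

Both approaches give the same counter; here "the" and "primary" appear in all three toy documents, while "trump" appears in only one.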

The Counter object has a convenient 'most_common' method, so we can look at the numbers for the most frequent words (where 30 means they occurred in all thirty training documents):

In [5]:
n_p.most_common(10),n_s.most_common(10)

([('his', 30),
  ('of', 30),
  ('be', 30),
  ('for', 30),
  ('the', 30),
  ('and', 30),
  ('that', 30),
  ('from', 30),
  ('but', 30),
  ('on', 30)],
 [('be', 30),
  ('the', 30),
  ('all', 30),
  ('was', 30),
  ('at', 30),
  ('that', 30),
  ('but', 30),
  ('on', 30),
  ('a', 30),
  ('and', 30)])

### Write the classifier 

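The only thing left at this point is to write the code that implements the Naive Bayes classifier. The formula is the following, where the approximation treats the words as independent given the class (the "naive" assumption) and takes the priors p(politics) and p(sports) to be equal, as suggested by the thirty training documents in each class, so that they cancel: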

$$
p({\rm politics}\ |\ words) = \frac{p(words\ |\ {\rm politics})\,p({\rm politics})}
{p(words\ |\ {\rm politics})\,p({\rm politics})+p(words\ |\ {\rm sports})\,p({\rm sports})}
\approx\frac{\prod_i p(w_i\ |\ {\rm politics})}{\prod_i p(w_i\ |\ {\rm politics})+\prod_i p(w_i\ |\ {\rm sports})}
$$
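A small worked example may help in reading the formula. The three words and their document counts below are invented for illustration (they are not taken from `n_p` or `n_s`), and each p(w_i | class) is estimated as a document count divided by the thirty training documents of that class.

```python
n_train = 30.0  # training documents per class, as in this notebook

# Hypothetical document counts: word -> (count in politics, count in sports).
# The 0.1 stands in for a word never seen in the sports training set.
toy_counts = {"iowa": (20, 0.1), "team": (3, 25), "win": (15, 18)}

p_pol = 1.0
p_spt = 1.0
for word, (n_pol, n_spt) in toy_counts.items():
    p_pol *= n_pol / n_train  # running product of p(w_i | politics)
    p_spt *= n_spt / n_train  # running product of p(w_i | sports)

# Equal priors cancel, leaving the ratio of the two products
posterior = p_pol / (p_pol + p_spt)
print(round(posterior, 4))
```

The factors of 1/30 cancel as well, so the result is 900/(900+45), roughly 0.95: a single strongly political word outweighs the sports-leaning "team" in this toy case.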

In [6]:
p_train=30.  # number of politics training documents
s_train=30.  # number of sports training documents
def bayes_classifier(txt):
    """Return (label, probability) for txt, using the document counts n_p and n_s."""
    p=1.
    s=1.
    word_counts=Counter(re.findall("[a-z']+",txt.lower()))
    # Skip stopwords (words found in 26 or more training documents of either class)
    # and keep the 30 most frequent remaining words of this text.
    keywords=[w for w,c in word_counts.most_common() if n_p[w]< 26 and n_s[w]<26][:30]
    print('\tbased on "{} ...":'.format(', '.join(keywords[:9])))
    for word in keywords:
        # Words not seen in a training set get a count of .1 (".1 smoothing");
        # .get() avoids writing these values back into the global counters.
        p *= n_p.get(word,.1)/p_train
        s *= n_s.get(word,.1)/s_train
    prob=p/(p+s)
    if prob >= .5:
        return 'POLITICS',prob
    else:
        return 'SPORT',1-prob
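With at most 30 keywords the products above stay well within floating-point range, but with many more factors they could underflow to zero. One possible variant, sketched here under that assumption, accumulates logarithms instead. The function name `posterior_politics`, its parameters, and the toy counters are hypothetical and not part of the original notebook.

```python
import math
from collections import Counter

def posterior_politics(keywords, n_pol, n_spt, n_train=30.0, unseen=0.1):
    """Estimate p(politics | words) using sums of log probabilities."""
    log_p = 0.0
    log_s = 0.0
    for w in keywords:
        # Unseen words get the same .1 pseudo-count as in the classifier above
        log_p += math.log(n_pol.get(w, unseen) / n_train)
        log_s += math.log(n_spt.get(w, unseen) / n_train)
    diff = log_s - log_p
    # p/(p+s) = 1/(1+exp(log_s-log_p)); branch so exp() never overflows
    if diff > 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))

# Hypothetical document counts for illustration
toy_pol = Counter({"iowa": 20, "team": 3, "win": 15})
toy_spt = Counter({"team": 25, "win": 18})
print(posterior_politics(["iowa", "team", "win"], toy_pol, toy_spt))
```

For short keyword lists this gives the same value as the direct product, so it is only worth considering if the 30-keyword cutoff is raised substantially.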

### Results

In [7]:
for i in range(10):
    print('t{}.txt:'.format(i),bayes_classifier(ttexts[i]))

	based on "rubio, him, he's, rubio's, there's, could, question, suggest, florida ...":
t0.txt: ('POLITICS', 0.9999999164340486)
	based on "warriors, i, ever, spurs, about, greatest, were, been, maybe ...":
t1.txt: ('SPORT', 1.0)
	based on "saban, alabama, football, coach, college, team, will, state, national ...":
t2.txt: ('SPORT', 1.0)
	based on "trump, will, iowa, cruz, gop, trump's, republican, rubio, there ...":
t3.txt: ('POLITICS', 1.0)
	based on "sanders, democratic, left, will, win, party, people, s, lose ...":
t4.txt: ('POLITICS', 1.0)
	based on "georgia, world, russia, rugby, cup, team, georgia's, war, lelos ...":
t5.txt: ('SPORT', 0.9999998466773602)
	based on "bush, campaign, hampshire, about, jeb, trump, he's, him, how ...":
t6.txt: ('POLITICS', 0.9999999999996046)
	based on "curry, play, season, after, been, warriors, five, better, team ...":
t7.txt: ('SPORT', 0.9999999999998045)
	based on "scholarships, athletes, college, players, scholarship, year, could, her, she ...":
