#### Assignment
It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  [UCI Machine Learning Repository: Spambase Data Set](http://archive.ics.uci.edu/dataset/94/spambase)

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

#### Approach
For this assignment, we decided to use text readily available in the Brown corpus of the NLTK package. The UCI data lacked direct column headers, and its word-count variables did not specify which word was being counted, which complicated interpretation. Instead, we trained a classifier on the `humor` and `science_fiction` categories in the corpus and measured its classification accuracy on a withheld test set from those categories. To do this, we had to create multiple shorter documents out of these two corpora.

In [1]:
import pandas as pd
import random
import matplotlib.pyplot as plt
import numpy as np
import nltk
from nltk.corpus import brown

humor = [w.lower() for w in brown.words(categories='humor') if w.isalpha()]
scifi = [w.lower() for w in brown.words(categories='science_fiction') if w.isalpha()]
print(len(humor), len(scifi))

17776 11762


We based our code heavily on chapter 6 of *Natural Language Processing with Python*, using the 2000 most frequent words across both categories combined as the vocabulary for our feature extractor.

In [2]:
combined = humor + scifi
all_words = nltk.FreqDist(combined)
word_features = list(all_words)[:2000]

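def document_features(document):
    """Return a dict mapping each feature word to whether it occurs in the document."""
    document_words = set(document)  # set membership checks are faster than list lookups
    features = {}
    for word in word_features:
        features[word] = (word in document_words)
    return features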

document_features(combined)

{'the': True,
 'of': True,
 'and': True,
 'to': True,
 'a': True,
 'in': True,
 'was': True,
 'he': True,
 'that': True,
 'it': True,
 'i': True,
 'had': True,
 'for': True,
 'his': True,
 'you': True,
 'on': True,
 'with': True,
 'as': True,
 'but': True,
 'not': True,
 'is': True,
 'they': True,
 'at': True,
 'be': True,
 'her': True,
 'were': True,
 'she': True,
 'this': True,
 'would': True,
 'said': True,
 'all': True,
 'have': True,
 'one': True,
 'an': True,
 'my': True,
 'by': True,
 'from': True,
 'him': True,
 'or': True,
 'no': True,
 'them': True,
 'which': True,
 'we': True,
 'their': True,
 'when': True,
 'what': True,
 'there': True,
 'up': True,
 'so': True,
 'could': True,
 'out': True,
 'been': True,
 'time': True,
 'me': True,
 'if': True,
 'are': True,
 'did': True,
 'who': True,
 'do': True,
 'like': True,
 'more': True,
 'into': True,
 'your': True,
 'now': True,
 'then': True,
 'about': True,
 'only': True,
 'even': True,
 'other': True,
 'such': True,
 'over': True,
 ...}
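
The extractor above depends on `list(all_words)` returning words in frequency order, which the output suggests it does. As a minimal sketch of the same idea with the ranking made explicit, the following uses only the standard library's `collections.Counter`. The toy word list and the names `top_words` and `bag_of_words_features` are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def top_words(words, n):
    """Return the n most frequent words, most frequent first."""
    return [word for word, _count in Counter(words).most_common(n)]

def bag_of_words_features(document, vocabulary):
    """Map each vocabulary word to True/False depending on presence in the document."""
    present = set(document)
    return {word: (word in present) for word in vocabulary}

# Toy data standing in for the Brown corpus tokens (an assumption for illustration).
toy_corpus = "the ship left the dock and the crew cheered and waved".split()
vocabulary = top_words(toy_corpus, 2)
features = bag_of_words_features(["the", "crew", "slept"], vocabulary)
print(vocabulary)
print(features)
```

Because every feature word occurs somewhere in `combined`, calling the extractor on the full word list returns `True` for every key, as in the output above. The features only become informative once the extractor is applied to shorter documents.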

Then, to have something to classify and to be able to split our documents into training and test sets, we had to subdivide each corpus into shorter documents. Using an arbitrary document length of 1000 words meant subdividing the humor corpus into 18 documents and the scifi corpus into 12 documents; the last document in each category is shorter (776 and 762 words, respectively). This gave us 30 documents to work with.

In [3]:
humor_subdiv=[]
for i in range(18):
    humor_subdiv.append([humor[i*1000:(i+1)*1000], 'humor'])

scifi_subdiv=[]
for i in range(12):
    scifi_subdiv.append([scifi[i*1000:(i+1)*1000], 'scifi'])

documents = humor_subdiv+scifi_subdiv
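
The two loops above hard-code the number of documents per category. A possible generalization, sketched here under the assumption that any trailing partial chunk should be kept, derives the count from the corpus length. The helper names `chunk_words` and `label_chunks` are hypothetical, and the placeholder token lists only reproduce the corpus sizes printed earlier.

```python
def chunk_words(words, size):
    """Split a list of words into consecutive chunks of at most `size` words."""
    return [words[start:start + size] for start in range(0, len(words), size)]

def label_chunks(words, label, size=1000):
    """Pair every chunk with its category label, giving (words, label) documents."""
    return [(chunk, label) for chunk in chunk_words(words, size)]

# Placeholder tokens with the same lengths as the two corpora printed earlier.
humor_like = ["h"] * 17776
scifi_like = ["s"] * 11762

docs = label_chunks(humor_like, "humor") + label_chunks(scifi_like, "scifi")
print(len(docs))                             # total number of documents
print(len(docs[17][0]), len(docs[-1][0]))    # sizes of the last humor and last scifi chunk
```

If the shorter final chunks turn out to skew the presence/absence features, they could be dropped instead. That is a design choice the notebook does not address.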

Now we can run our feature extractor on each of the documents we've created, and split the resultant `featuresets` into training and test sets. We shuffle the documents to get a random split.

In [4]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]

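# Note: with 30 feature sets, featuresets[24:] holds only 6 items, so the classifier
# trains on 6 documents and is tested on 6; featuresets[6:] would train on the other 24.
train_set, test_set = featuresets[24:], featuresets[:6]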
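
With 30 feature sets, the slice `featuresets[24:]` contains 6 items, the same number as the test slice `featuresets[:6]`, so 18 documents end up in neither set. The sketch below shows one way to hold out a fixed number of test items and train on everything else. The function name `split_train_test` and its `seed` parameter are hypothetical, and the placeholder tuples stand in for the real feature sets.

```python
import random

def split_train_test(items, test_size, seed=None):
    """Shuffle a copy of items and return (train, test) with no overlap."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

# 30 placeholder documents: 18 'humor' and 12 'scifi', as in the notebook.
labeled = [(i, "humor") for i in range(18)] + [(i, "scifi") for i in range(12)]
train, test = split_train_test(labeled, test_size=6, seed=0)
print(len(train), len(test))
```

Passing a fixed `seed` makes the split reproducible between runs, which helps when comparing accuracies on such a small dataset.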

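The final step is to train and evaluate our Naive Bayes classifier. Unfortunately, we don't get great accuracy: 0.5 means that 3 of the 6 test documents were classified correctly. This could be due to the relatively small size of our input corpora (about 29,500 words in total, or 30 documents), and the split above trains on only 6 of those documents.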

In [10]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

print('Accuracy of our Naive Bayes classifier: ', nltk.classify.accuracy(classifier, test_set), '\n')

classifier.show_most_informative_features()

Accuracy of our Naive Bayes classifier:  0.5 

Most Informative Features
                   above = True            scifi : humor  =      3.0 : 1.0
                   after = False           scifi : humor  =      3.0 : 1.0
                     age = True            scifi : humor  =      3.0 : 1.0
                    also = False           scifi : humor  =      3.0 : 1.0
                although = True            scifi : humor  =      3.0 : 1.0
                 assumed = True            scifi : humor  =      3.0 : 1.0
                    away = True            scifi : humor  =      3.0 : 1.0
                  became = True            scifi : humor  =      3.0 : 1.0
                  become = True            scifi : humor  =      3.0 : 1.0
                  better = True            scifi : humor  =      3.0 : 1.0
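
With only 6 test documents, accuracy can move only in steps of 1/6, so a single split gives a noisy estimate. One common remedy is k-fold cross-validation, sketched below in plain Python. The names `k_fold_splits`, `mean_accuracy`, and `majority_baseline` are hypothetical, and the scorer is a stand-in that always predicts `humor` rather than the notebook's classifier.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    fold_size, remainder = divmod(len(items), k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds get one extra item.
        end = start + fold_size + (1 if fold < remainder else 0)
        yield items[:start] + items[end:], items[start:end]
        start = end

def mean_accuracy(items, k, train_and_score):
    """Average the score returned by train_and_score(train, test) over k folds."""
    scores = [train_and_score(train, test) for train, test in k_fold_splits(items, k)]
    return sum(scores) / len(scores)

def majority_baseline(train, test):
    """Stand-in scorer (an assumption): always predicts 'humor'."""
    return sum(1 for _features, label in test if label == "humor") / len(test)

# Placeholder feature sets with the notebook's class balance (18 humor, 12 scifi).
items = [({}, "humor")] * 18 + [({}, "scifi")] * 12
print(mean_accuracy(items, 5, majority_baseline))
```

In the notebook, the scorer could instead train `nltk.NaiveBayesClassifier` on `train` and return `nltk.classify.accuracy` on `test`, with `featuresets` shuffled beforehand. The always-`humor` baseline is a useful reference point: a classifier scoring near or below it has learned little from the features.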
