#### Assignment
It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  [UCI Machine Learning Repository: Spambase Data Set](http://archive.ics.uci.edu/dataset/94/spambase)

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

#### Approach
For this assignment, we used text readily available in the Brown corpus of the NLTK package. The UCI data lacked direct column headers, and its word-count variables did not specify which word was being counted, which complicated interpretation. Instead, we trained a classifier on the `humor` and `science_fiction` categories of the corpus and measured its classification accuracy on a withheld test set from those categories. To do this, we had to create multiple shorter documents out of these two corpora.

In [57]:
import pandas as pd
import random
import matplotlib.pyplot as plt
import numpy as np
import nltk
from nltk.corpus import brown

humor = [w.lower() for w in brown.words(categories='humor') if w.isalpha()]
scifi = [w.lower() for w in brown.words(categories='science_fiction') if w.isalpha()]
print(len(humor), len(scifi))

17776 11762


We based our code heavily on chapter 6 of *Natural Language Processing with Python*: the feature extractor checks each document for the presence of the 2000 most frequent words across both categories combined.

In [58]:
random.seed(2)

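combined = humor + scifi
all_words = nltk.FreqDist(combined)
# Feature vocabulary: the 2000 most frequent words across both categories.
word_features = [word for word, _ in all_words.most_common(2000)]

def document_features(document):
    """Map a document (list of words) to {word: True/False} over word_features."""
    document_words = set(document)  # set membership tests are fast
    features = {}
    for word in word_features:
        features[word] = (word in document_words)
    return features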

import itertools
dict(itertools.islice(document_features(combined).items(), 20))

{'the': True,
 'of': True,
 'and': True,
 'to': True,
 'a': True,
 'in': True,
 'was': True,
 'he': True,
 'that': True,
 'it': True,
 'i': True,
 'had': True,
 'for': True,
 'his': True,
 'you': True,
 'on': True,
 'with': True,
 'as': True,
 'but': True,
 'not': True}
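
The extractor above turns a document into one boolean per vocabulary word. A minimal standalone sketch of the same idea on a toy corpus is shown here; it assumes `collections.Counter` as a stand-in for `nltk.FreqDist`, and the names `toy_corpus`, `top_words`, and `toy_features` are hypothetical, not part of the notebook.

```python
from collections import Counter

# Toy stand-in for the combined corpus (hypothetical data).
toy_corpus = "the ship left the dock and the crew cheered and sang".split()

# Rank words by frequency and keep the top 2 as the feature vocabulary.
counts = Counter(toy_corpus)
top_words = [word for word, _ in counts.most_common(2)]

def toy_features(document, vocabulary):
    """Return {word: True/False} marking which vocabulary words occur in document."""
    present = set(document)
    return {word: (word in present) for word in vocabulary}

features = toy_features("the crew sang".split(), top_words)
print(top_words)
print(features)
```

Words outside the vocabulary (`crew`, `sang`) are ignored entirely, which is why the choice of vocabulary matters so much for the classifier.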

Then, to have something to classify and to be able to split our documents into training and test sets, we had to subdivide each corpus into shorter documents. Keeping each document at a fairly arbitrary 1000 words meant subdividing the humor corpus into 18 documents and the scifi corpus into 12 (the last document in each is shorter than 1000 words). This gave us 30 documents to work with.

In [59]:
humor_subdiv=[]
for i in range(round(len(humor)/1000)):
    humor_subdiv.append([humor[i*1000:(i+1)*1000], 'humor'])

scifi_subdiv=[]
for i in range(round(len(scifi)/1000)):
    scifi_subdiv.append([scifi[i*1000:(i+1)*1000], 'scifi'])

documents = humor_subdiv+scifi_subdiv

# for i in documents:
#     for x in i:
#         print(len(x), end=' ')
#         print(x[-1])
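
As a cross-check on the document counts, the subdivision can also be sketched with a small helper that steps through the word list in fixed strides. This is an illustrative alternative, not the notebook's code: `chunk_words` is a hypothetical name, and the placeholder lists only reproduce the corpus lengths printed earlier. Unlike `round(len(words)/1000)`, stepping by `size` always keeps the trailing remainder, even when it is shorter than 500 words.

```python
def chunk_words(words, size=1000):
    """Split a list of words into consecutive chunks of at most `size` words."""
    return [words[start:start + size] for start in range(0, len(words), size)]

# Placeholder tokens with the same lengths as the two corpora (17776 and 11762 words).
humor_like = ["w"] * 17776
scifi_like = ["w"] * 11762

humor_chunks = chunk_words(humor_like)
scifi_chunks = chunk_words(scifi_like)

# Number of chunks, then the size of the final (shorter) chunk.
print(len(humor_chunks), len(humor_chunks[-1]))
print(len(scifi_chunks), len(scifi_chunks[-1]))
```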

Now we can run our feature extractor on each of the documents we've created, and split the resultant `featuresets` into training and test sets, using an 80-20 split. We shuffle the documents to make sure this split is random.

In [60]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]

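# 80-20 split: hold out the first 6 documents for testing, train on the other 24.
# The accuracy recorded below came from training on featuresets[24:] (only 6 documents).
train_set, test_set = featuresets[6:], featuresets[:6]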
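
An 80-20 split of 30 documents should leave 24 for training and 6 for testing, with no document in both sets. A small sketch of a helper that derives the cut point from the fraction rather than from hand-written indices follows; `shuffle_and_split` and its parameters are hypothetical names, and integers stand in for the feature sets.

```python
import random

def shuffle_and_split(items, test_fraction=0.2, seed=2):
    """Shuffle a copy of items and return (train, test) with about test_fraction held out."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = shuffle_and_split(range(30))
print(len(train), len(test))
```

Because both slices use the same `n_test` boundary, every item ends up in exactly one of the two sets.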

The final step is to train and evaluate our Naive Bayes classifier. We got an accuracy of 0.83 (5 of the 6 test documents), which isn't too bad, but we can probably do better if we remove stopwords, and possibly if we lemmatize our training corpus first.

In [70]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

print('Accuracy of our Naive Bayes classifier: ', nltk.classify.accuracy(classifier, test_set), '\n')

classifier.show_most_informative_features()

Accuracy of our Naive Bayes classifier:  0.8333333333333334 

Most Informative Features
                  actual = False           humor : scifi  =      2.3 : 1.0
                   after = True            humor : scifi  =      2.3 : 1.0
                   again = False           scifi : humor  =      2.3 : 1.0
                although = False           humor : scifi  =      2.3 : 1.0
                  around = True            scifi : humor  =      2.3 : 1.0
                     ask = False           humor : scifi  =      2.3 : 1.0
                audience = False           scifi : humor  =      2.3 : 1.0
                 because = False           scifi : humor  =      2.3 : 1.0
                  beside = False           humor : scifi  =      2.3 : 1.0
                  bitter = False           scifi : humor  =      2.3 : 1.0
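
With only 6 test documents, accuracy can move only in steps of 1/6, so the 0.8333 above corresponds to 5 correct predictions out of 6. The short sketch below lists every value the score can take for a test set of that size; `possible_accuracies` is a hypothetical helper written for illustration.

```python
def possible_accuracies(n_test):
    """Map each possible count of correct predictions to its accuracy."""
    return {correct: correct / n_test for correct in range(n_test + 1)}

accuracies = possible_accuracies(6)
for correct, accuracy in accuracies.items():
    print(f"{correct}/6 correct -> {accuracy:.4f}")
```

A single misclassified document shifts the score by roughly 17 percentage points, so differences between runs on a test set this small should be read cautiously.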


Let's remove stopwords. First, we check how much of each corpus they make up: about 50% of the words in each category. That's a lot!

In [62]:
from nltk.corpus import stopwords

# Build the stopword set once; set lookups are much faster than list lookups.
english_stopwords = set(stopwords.words('english'))

def stopword_fraction(text):
    """Return the fraction of tokens in text that are English stopwords."""
    matches = [w for w in text if w.lower() in english_stopwords]
    return len(matches) / len(text)

print("Fraction of stopwords in humor corpus: ", stopword_fraction(humor), "\n",
      "Fraction of stopwords in scifi corpus ", stopword_fraction(scifi))

Fraction of stopwords in humor corpus:  0.5015751575157515 
 Fraction of stopwords in scifi corpus  0.5006801564359803
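
To make the calculation concrete, here is the same ratio on a single sentence. This is only a sketch: `TOY_STOPWORDS` is a tiny hypothetical set chosen for the example, whereas the notebook uses NLTK's English stopword list.

```python
# Tiny hypothetical stopword set; the notebook uses NLTK's English list instead.
TOY_STOPWORDS = {"the", "a", "of", "and", "to", "was"}

def toy_stopword_fraction(words, stopwords):
    """Return the share of tokens in words that appear in stopwords."""
    hits = sum(1 for w in words if w.lower() in stopwords)
    return hits / len(words)

sample = "The captain of the ship was glad to land".split()
fraction = toy_stopword_fraction(sample, TOY_STOPWORDS)
print(fraction)
```

Five of the nine tokens (`The`, `of`, `the`, `was`, `to`) are in the toy set, so the fraction is 5/9.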


In [110]:
def remove_stopwords(text):
    """Return the tokens of text that are not English stopwords."""
    stopword_set = set(nltk.corpus.stopwords.words('english'))
    return [w for w in text if w.lower() not in stopword_set]

humor_nostop = remove_stopwords(humor)
scifi_nostop = remove_stopwords(scifi)

len(scifi_nostop)

humor_nostop_subdiv=[]
for i in range(round(len(humor_nostop)/1000)):
    humor_nostop_subdiv.append([humor_nostop[i*1000:(i+1)*1000], 'humor'])

scifi_nostop_subdiv=[]
for i in range(round(len(scifi_nostop)/1000)):
    scifi_nostop_subdiv.append([scifi_nostop[i*1000:(i+1)*1000], 'scifi'])

documents_nostop = humor_nostop_subdiv+scifi_nostop_subdiv

random.shuffle(documents_nostop)
len(documents_nostop)

15
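
The count of 15 can be reproduced from numbers already printed in the notebook. The sketch below assumes the printed stopword fractions are exact ratios of whole-word counts, recovers the number of remaining content words, and applies the same `round(len / 1000)` rule; all variable names are hypothetical.

```python
# Figures printed earlier in the notebook.
humor_total, scifi_total = 17776, 11762
humor_stop_frac = 0.5015751575157515
scifi_stop_frac = 0.5006801564359803

# Recover whole-word stopword counts (assumption: the fractions are exact ratios),
# then subtract to get the number of content words left in each corpus.
humor_content = humor_total - round(humor_stop_frac * humor_total)
scifi_content = scifi_total - round(scifi_stop_frac * scifi_total)

# Same round(len / 1000) rule the notebook uses to count documents.
humor_docs = round(humor_content / 1000)
scifi_docs = round(scifi_content / 1000)
print(humor_content, scifi_content, humor_docs + scifi_docs)
```

Under that assumption, about 8860 humor words and 5873 scifi words remain, giving 9 + 6 = 15 documents, half as many as before stopword removal.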

In this case we get 15 documents, which we shuffle and will again split 80-20. Before splitting, we rebuild our feature extractor so that its vocabulary also excludes stopwords.

In [91]:
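combined_ns = humor_nostop + scifi_nostop
all_words_ns = nltk.FreqDist(combined_ns)
# Feature vocabulary: the 2000 most frequent non-stopwords across both categories.
word_features_ns = [word for word, _ in all_words_ns.most_common(2000)]

def document_features_ns(document):
    """Map a document (list of words) to {word: True/False} over word_features_ns."""
    document_words = set(document)  # set membership tests are fast
    features = {}
    for word in word_features_ns:
        features[word] = (word in document_words)
    return features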

import itertools
dict(itertools.islice(document_features_ns(combined_ns).items(), 20))

{'would': True,
 'said': True,
 'one': True,
 'could': True,
 'time': True,
 'like': True,
 'even': True,
 'long': True,
 'people': True,
 'way': True,
 'man': True,
 'know': True,
 'get': True,
 'made': True,
 'little': True,
 'years': True,
 'never': True,
 'two': True,
 'back': True,
 'much': True}

In [111]:
featuresets2 = [(document_features_ns(d), c) for (d,c) in documents_nostop]

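# 80-20 split: hold out the first 3 documents for testing, train on the other 12.
# The accuracy recorded below came from training on featuresets2[12:] (only 3 documents).
train_set_ns, test_set_ns = featuresets2[3:], featuresets2[:3]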

classifier2 = nltk.NaiveBayesClassifier.train(train_set_ns)

print('Accuracy of our Naive Bayes classifier: ', nltk.classify.accuracy(classifier2, test_set_ns), '\n')

classifier2.show_most_informative_features(50)

Accuracy of our Naive Bayes classifier:  0.3333333333333333 

Most Informative Features
                    able = True            scifi : humor  =      1.5 : 1.0
                   abuse = False           scifi : humor  =      1.5 : 1.0
               according = True            scifi : humor  =      1.5 : 1.0
                  action = False           scifi : humor  =      1.5 : 1.0
              activities = False           scifi : humor  =      1.5 : 1.0
                   actor = False           scifi : humor  =      1.5 : 1.0
                 actress = False           scifi : humor  =      1.5 : 1.0
                  advice = False           scifi : humor  =      1.5 : 1.0
               afternoon = False           scifi : humor  =      1.5 : 1.0
                     age = False           scifi : humor  =      1.5 : 1.0
                    aged = False           scifi : humor  =      1.5 : 1.0
                     ago = True            scifi : humor  =      1.5 : 1.0
            