#### Assignment
It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  [UCI Machine Learning Repository: Spambase Data Set](http://archive.ics.uci.edu/dataset/94/spambase)

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source, such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

#### Approach
For this assignment, we decided to use text readily available in the Brown corpus of the NLTK package, since the UCI data lacked direct column headers and its word-count variables did not specify which word was being counted, which complicated interpretation. Instead, we trained a classifier on the `humor` and `science_fiction` categories of the corpus and measured its accuracy on a withheld test set drawn from those same categories. To do this, we had to create multiple shorter documents out of these two corpora.

In [147]:
import pandas as pd
import random
import matplotlib.pyplot as plt
import numpy as np
import nltk
from nltk.corpus import brown

humor = [w.lower() for w in brown.words(categories='humor') if w.isalpha()]
scifi = [w.lower() for w in brown.words(categories='science_fiction') if w.isalpha()]
print(len(humor), len(scifi))

17776 11762


We based our code heavily on chapter 6 of *Natural Language Processing with Python*. Our feature extractor takes the 2000 most frequent words across both categories combined and records, for each one, whether it appears in a document.

In [148]:
random.seed(2)

combined = humor + scifi
all_words = nltk.FreqDist(combined)
word_features = list(all_words)[:2000]

def document_features(document):
    # One boolean feature per vocabulary word: does the word occur in this document?
    document_words = set(document)
    features = {}
    for word in word_features:
        features[word] = (word in document_words)
    return features

import itertools
dict(itertools.islice(document_features(combined).items(), 20))

{'the': True,
 'of': True,
 'and': True,
 'to': True,
 'a': True,
 'in': True,
 'was': True,
 'he': True,
 'that': True,
 'it': True,
 'i': True,
 'had': True,
 'for': True,
 'his': True,
 'you': True,
 'on': True,
 'with': True,
 'as': True,
 'but': True,
 'not': True}
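
To make the feature extractor easier to follow, here is a minimal sketch of the same idea on a toy corpus, using only the standard library. The helper names `build_vocabulary` and `presence_features` and the toy tokens are hypothetical; they are not part of NLTK or of our notebook.

```python
from collections import Counter

def build_vocabulary(tokens, size):
    """Return the `size` most frequent tokens, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(size)]

def presence_features(document, vocabulary):
    """Map every vocabulary word to True/False: does it occur in the document?"""
    document_words = set(document)  # set lookup is fast for repeated membership tests
    return {word: (word in document_words) for word in vocabulary}

# Toy corpus (hypothetical tokens, not taken from Brown).
tokens = "the ship left the earth and the crew watched the earth shrink".split()
vocabulary = build_vocabulary(tokens, 2)
features = presence_features(["the", "crew", "slept"], vocabulary)
print(vocabulary)
print(features)
```

Note that the features only record presence, not counts, so a word that appears ten times in a document looks the same as one that appears once.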

Then, to have something to classify and to be able to split our documents into training and test sets, we had to subdivide each corpus into shorter documents. Keeping each document short was key, so that the Naive Bayes classifier had many documents to train on; fewer, longer documents led to worse results. So we subdivided the corpora into 100-word documents: the humor corpus into 178 documents and the scifi corpus into 118 documents (the last document in each is shorter than 100 words). This gave us 296 documents to work with.

In [163]:
humor_subdiv=[]
for i in range(round(len(humor)/100)):
    humor_subdiv.append([humor[i*100:(i+1)*100], 'humor'])

scifi_subdiv=[]
for i in range(round(len(scifi)/100)):
    scifi_subdiv.append([scifi[i*100:(i+1)*100], 'scifi'])

documents = humor_subdiv+scifi_subdiv

#len(scifi_subdiv)
# for i in documents:
#     for x in i:
#         print(len(x), end=' ')
#         print(x[-1])

len(documents)

296
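
The subdivision step can also be sketched as a small helper. This version is an assumption on our part, not the code we ran: it uses `math.ceil` for the number of chunks instead of `round`. The two agree for our corpora, whose leftovers are 76 and 62 words, but `round` would silently drop a leftover of fewer than 50 words. The name `chunk_tokens` and the stand-in token list are hypothetical.

```python
import math

def chunk_tokens(tokens, size):
    """Split `tokens` into consecutive lists of at most `size` items."""
    n_chunks = math.ceil(len(tokens) / size)
    return [tokens[i * size:(i + 1) * size] for i in range(n_chunks)]

tokens = list(range(17776))  # stand-in for the 17776 humor tokens
chunks = chunk_tokens(tokens, 100)
print(len(chunks), len(chunks[-1]))

# With a leftover below half a chunk, round() and ceil() disagree.
print(round(17740 / 100), math.ceil(17740 / 100))
```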

Now we can run our feature extractor on each of the documents we've created and split the resulting `featuresets` into training and test sets using an 80-20 split. We shuffle the documents first so that the split is random.

In [165]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]

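# Hold out 20% (59 feature sets) for testing and train on the remaining 80% (237).
# The recorded run trained on featuresets[237:], only 59 feature sets; the output below is from that run.
n_test = round(0.2 * len(featuresets))
train_set, test_set = featuresets[n_test:], featuresets[:n_test]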
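
Hard-coded slice indices are easy to get wrong, so one option is to derive them from the test fraction. The helper below is a hypothetical sketch using only the standard library (`train_test_split` here is our own illustrative function, not an NLTK call), with a list of integers standing in for the 296 feature sets.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=None):
    """Shuffle a copy of `items` and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test items are held out; everything after them is training data.
    return shuffled[n_test:], shuffled[:n_test]

items = list(range(296))  # stand-in for the 296 feature sets
train, test = train_test_split(items, test_fraction=0.2, seed=2)
print(len(train), len(test))
```

Because both slices use the same cut point, the two sets cannot overlap and together cover every document.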

The final step is to train and evaluate our Naive Bayes classifier. We got an accuracy of 0.83. This isn't bad, but perhaps we can do better if we remove stopwords.

In [166]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

print('Accuracy of our Naive Bayes classifier: ', nltk.classify.accuracy(classifier, test_set), '\n')

classifier.show_most_informative_features()

Accuracy of our Naive Bayes classifier:  0.8305084745762712 

Most Informative Features
                       a = False           scifi : humor  =      6.2 : 1.0
                     any = True            scifi : humor  =      6.2 : 1.0
                   earth = True            scifi : humor  =      6.2 : 1.0
                   since = True            scifi : humor  =      6.2 : 1.0
                  before = True            scifi : humor  =      4.8 : 1.0
                     cut = True            scifi : humor  =      4.8 : 1.0
               equipment = True            scifi : humor  =      4.8 : 1.0
                   going = True            scifi : humor  =      4.8 : 1.0
                    help = True            scifi : humor  =      4.8 : 1.0
                    here = True            scifi : humor  =      4.5 : 1.0


Let's remove stopwords. Stopwords account for about 50% of the tokens in each category's corpus. That's a lot!

In [167]:
from nltk.corpus import stopwords

def stopword_fraction(text):
    # Fraction of the tokens in `text` that are English stopwords.
    stop_set = set(stopwords.words('english'))  # a set makes each lookup fast
    stop_tokens = [w for w in text if w.lower() in stop_set]
    return len(stop_tokens) / len(text)

print("Fraction of stopwords in humor corpus: ", stopword_fraction(humor), "\n",
      "Fraction of stopwords in scifi corpus ", stopword_fraction(scifi))

Fraction of stopwords in humor corpus:  0.5015751575157515 
 Fraction of stopwords in scifi corpus  0.5006801564359803
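
The same filtering idea can be tried without downloading any corpus data. The sketch below uses a tiny hypothetical stopword set in place of NLTK's English list, and `split_stopwords` is an illustrative helper of our own. It returns both the remaining content words and the fraction that was removed.

```python
def split_stopwords(tokens, stopword_set):
    """Return (content_tokens, stopword_fraction) for a list of tokens."""
    content = [w for w in tokens if w.lower() not in stopword_set]
    fraction = 1 - len(content) / len(tokens) if tokens else 0.0
    return content, fraction

# Tiny hypothetical stopword set; the notebook uses NLTK's English list instead.
stopword_set = {"the", "a", "of", "and", "to", "was"}
tokens = "the crew of the ship was eager to leave".split()
content, fraction = split_stopwords(tokens, stopword_set)
print(content, fraction)
```

Keeping the stopwords in a set rather than a list avoids scanning the whole list for every token, which matters once the corpus has tens of thousands of tokens.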


In [168]:
def remove_stopwords(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords]
    return content

humor_nostop = remove_stopwords(humor)
scifi_nostop = remove_stopwords(scifi)

len(scifi_nostop)

humor_nostop_subdiv=[]
for i in range(round(len(humor_nostop)/100)):
    humor_nostop_subdiv.append([humor_nostop[i*100:(i+1)*100], 'humor'])

scifi_nostop_subdiv=[]
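# The loop bound must use the same width (100) as the slices below. The recorded run divided by 50,
# which appended empty 'scifi' documents and is why the count printed below is 206.
for i in range(round(len(scifi_nostop)/100)):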
    scifi_nostop_subdiv.append([scifi_nostop[i*100:(i+1)*100], 'scifi'])

documents_nostop = humor_nostop_subdiv+scifi_nostop_subdiv

random.shuffle(documents_nostop)
len(documents_nostop)

206
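
Python slices that start past the end of a list return an empty list instead of raising an error, so a loop bound that disagrees with the slice width can add empty documents without any warning. The sketch below shows the effect on a stand-in token list (the numbers are illustrative, not our corpus) and a way to build chunks that cannot overrun.

```python
tokens = list(range(250))  # stand-in for a 250-token corpus
width = 100

# Loop bound computed with a different divisor (50) than the slice width (100).
mismatched = [tokens[i * width:(i + 1) * width]
              for i in range(round(len(tokens) / 50))]
empty = [doc for doc in mismatched if not doc]
print(len(mismatched), len(empty))

# Stepping the start index by the slice width cannot run past the end.
matched = [tokens[start:start + width] for start in range(0, len(tokens), width)]
print(len(matched), all(matched))
```

A quick check such as `all(documents_of_one_label)` before training would catch this kind of mismatch.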

In this case we get 206 documents, which we have already shuffled and will again split 80-20. Before we do that, though, we want to change our feature extractor so that its vocabulary also excludes stopwords.

In [169]:
combined_ns = humor_nostop + scifi_nostop
all_words_ns = nltk.FreqDist(combined_ns)
word_features_ns = list(all_words_ns)[:2000]

def document_features_ns(document):
    # Same extractor as before, but over the stopword-free vocabulary.
    document_words = set(document)
    features = {}
    for word in word_features_ns:
        features[word] = (word in document_words)
    return features

import itertools
dict(itertools.islice(document_features_ns(combined_ns).items(), 20))

{'would': True,
 'said': True,
 'one': True,
 'could': True,
 'time': True,
 'like': True,
 'even': True,
 'long': True,
 'people': True,
 'way': True,
 'man': True,
 'know': True,
 'get': True,
 'made': True,
 'little': True,
 'years': True,
 'never': True,
 'two': True,
 'back': True,
 'much': True}

In [171]:
featuresets2 = [(document_features_ns(d), c) for (d,c) in documents_nostop]

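# Hold out 20% of the feature sets for testing and train on the remaining 80%.
# The recorded run trained on featuresets2[165:], only 41 feature sets; the output below is from that run.
n_test_ns = round(0.2 * len(featuresets2))
train_set_ns, test_set_ns = featuresets2[n_test_ns:], featuresets2[:n_test_ns]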

classifier2 = nltk.NaiveBayesClassifier.train(train_set_ns)

print('Accuracy of our Naive Bayes classifier: ', nltk.classify.accuracy(classifier2, test_set_ns), '\n')

classifier2.show_most_informative_features(25)

Accuracy of our Naive Bayes classifier:  0.8048780487804879 

Most Informative Features
                  things = True            humor : scifi  =      9.3 : 1.0
                  always = True            humor : scifi  =      5.6 : 1.0
                  little = True            humor : scifi  =      5.6 : 1.0
                   thing = True            humor : scifi  =      5.6 : 1.0
                   woman = True            humor : scifi  =      5.6 : 1.0
                    eyes = True            humor : scifi  =      4.4 : 1.0
                   right = True            humor : scifi  =      4.4 : 1.0
                  seemed = True            humor : scifi  =      4.4 : 1.0
                 waiting = True            humor : scifi  =      4.4 : 1.0
                  become = True            humor : scifi  =      4.1 : 1.0
                 another = True            humor : scifi  =      3.4 : 1.0
                    good = True            humor : scifi  =      3.4 : 1.0
            

The accuracy actually went down slightly in this case, from 0.83 to 0.80, so removing stopwords did not improve our classifier.
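
To put the two accuracies in perspective, here is a rough estimate of their sampling uncertainty. It is a sketch that assumes each test document is an independent trial, so accuracy has a binomial standard error. The test-set sizes of 59 and 41 come from the slices above, and `accuracy_standard_error` is an illustrative helper.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test items."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)

runs = {
    "with stopwords": (0.8305084745762712, 59),
    "without stopwords": (0.8048780487804879, 41),
}
for name, (acc, n) in runs.items():
    se = accuracy_standard_error(acc, n)
    print(f"{name}: {acc:.3f} +/- {se:.3f} ({round(acc * n)} of {n} correct)")
```

Under this assumption the standard errors are roughly 0.05 and 0.06, larger than the 0.03 gap between the two runs, so the difference is within what we would expect from chance on test sets this small.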