# Assignment Week 10 – Document Classification

### Betsy Rosalen and Mikhail Groysman

## Project Overview

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data: [UCI Machine Learning Repository: Spambase Data Set](http://archive.ics.uci.edu/ml/datasets/Spambase)

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

This assignment is due end of day on Monday 11/11.

NOTE: This is a two week assignment.

## Choosing Documents for Classification

Let's look at the texts available in the Gutenberg corpus.

In [14]:
import nltk
import random
random.seed(250)
import pandas as pd
pd.set_option('display.max_rows', 500)

nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

We have three books by Jane Austen, the Bible, one book by Blake, and so on. Each author writes in his or her own style. Can we use samples of their work to predict who wrote a specific passage?

## Austen vs Blake

### Create texts

First we need to take all three of Austen's works and combine them to create one text.  We will also remove punctuation and convert everything to lowercase to eliminate duplicate words.  Then we can take that and create a list of text segments.  Each segment will have a length of 1000 words.

In [3]:
austen = nltk.corpus.gutenberg.words('austen-emma.txt')+nltk.corpus.gutenberg.words('austen-persuasion.txt')+nltk.corpus.gutenberg.words('austen-sense.txt')
austen = [word.lower() for word in austen if word.isalpha()]
austen1=[]
for i in range(366):
    austen1.append([austen[i*1000:(i+1)*1000],'au'])
len(austen)

366454

In [4]:
len(austen1)

366

We now have a list of 366 1000-word segments of text written by Jane Austen.
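The segmenting loop above is repeated for every author later in the notebook. As a minimal sketch, it could be wrapped in a small helper; `make_segments` is a hypothetical name introduced here for illustration and is not part of NLTK. It assumes, as the notebook does, that leftover words that do not fill a whole segment are dropped.

```python
def make_segments(words, label, size=1000):
    """Split a word list into consecutive full-length segments tagged with a label.

    Hypothetical helper (not an NLTK function). Words left over after the
    last full segment are discarded, matching the loops in the notebook.
    """
    n_segments = len(words) // size
    return [[words[i * size:(i + 1) * size], label] for i in range(n_segments)]


# Toy example: 10 words in segments of 3 gives 3 full segments, 1 word dropped.
toy = ["w{}".format(i) for i in range(10)]
segments = make_segments(toy, "au", size=3)
print(len(segments))   # 3
print(segments[0])     # [['w0', 'w1', 'w2'], 'au']
```

With the 366454-word Austen list and the default size of 1000, integer division gives the same 366 segments as the hard-coded `range(366)`.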

We will skip the Bible since it was written by many different authors using many different styles, but let's take the next text in the Gutenberg corpus, poems by Blake, and do the same thing we did with Austen.

In [5]:
blake = nltk.corpus.gutenberg.words('blake-poems.txt')
blake = [word.lower() for word in blake if word.isalpha()]
blake1=[]
for i in range(7):
    blake1.append([blake[i*990:(i+1)*990],'bl'])
len(blake)

6934

Since there are just shy of 7000 words total in the Blake text, we will make each segment 990 words in order to get 7 equal segments for Blake.

In [6]:
len(blake1)

7

We now have a list of seven 990-word segments of text written by William Blake. 

### Create Feature Extractor

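Now let's combine the two original word lists into one longer list and build a frequency distribution, from which we take 2000 words to use as the feature list for our classifier. Note that `list(all_words)[:2000]` takes the first 2000 distinct words in order of first appearance (in NLTK 3, `FreqDist` is a `Counter` subclass), not the 2000 most frequent words; `all_words.most_common(2000)` would return the frequency-ranked list.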

In [15]:
ab=austen+blake
all_words = nltk.FreqDist(w.lower() for w in ab)
word_features = list(all_words)[:2000] 

# Display the feature words in ten columns of 200 words each
wlist = []
for i in range(0, 2000, 200):
    df = pd.DataFrame(word_features[i:(i+200)])
    df.columns=['200 words']
    wlist.append(df)

pd.concat(wlist, axis=1)

| Row | Words 1-200 | Words 201-400 | Words 401-600 | Words 601-800 | Words 801-1000 | Words 1001-1200 | Words 1201-1400 | Words 1401-1600 | Words 1601-1800 | Words 1801-2000 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | emma | after | wish | beautiful | encouragements | avowed | hate | moved | entertaining | brain |
| 1 | by | dinner | impossible | moonlight | smoothed | adoption | outward | forwards | wakefield | during |
| 2 | jane | usual | things | mild | matters | assume | boasted | alacrity | romance | expediency |
| 3 | austen | then | till | draw | enough | unlikely | beauty | impulse | forest | low |
| 4 | volume | only | awoke | back | comprehend | assistance | cleverness | indifferent | ride | fairly |
| 5 | i | sit | made | fire | straightforward | apprehension | distinction | credit | kingston | doubtful |
| 6 | chapter | lost | necessary | found | open | aunt | middle | attentively | fifty | sufficient |
| 7 | woodhouse | event | cheerful | damp | hearted | capricious | failing | honours | horseback | vicarage |
| 8 | handsome | every | spirits | dirty | unaffected | governed | possible | meal | foot | deficiency |
| 9 | clever | promise | required | catch | safely | nature | named | recommend | raise | meetings |
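The first column of the table starts with the opening words of *Emma*, which is what slicing a counter in insertion order produces. The sketch below uses the standard-library `collections.Counter` on a made-up word list (an assumption for illustration, not the notebook's data) to show the difference between slicing the counter and asking for the most common entries. To our understanding `nltk.FreqDist` inherits this behaviour from `Counter`.

```python
from collections import Counter

# Toy word list: "the" is the most frequent word but is first seen late.
words = ["emma", "by", "jane", "the", "the", "the", "by"]
counts = Counter(words)

# Slicing the counter keeps the order in which words were first seen.
first_seen = list(counts)[:2]

# most_common() ranks by count, highest first.
most_frequent = [w for w, _ in counts.most_common(2)]

print(first_seen)      # ['emma', 'by']
print(most_frequent)   # ['the', 'by']
```

Switching the notebook to `most_common(2000)` would change the feature list and therefore every accuracy figure reported here, so the outputs shown would need to be regenerated.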


We will use the function from page 228 of the *Natural Language Processing with Python* textbook to create a feature generator that takes the 2000-word feature list and records, for each word, whether or not it is present in the text.

In [16]:
def document_features(document): 
    document_words = set(document) 
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

Let's test it on the full Blake text...

In [25]:
features = document_features(blake)
list(features.items())[:20]

[('contains(emma)', False),
 ('contains(by)', True),
 ('contains(jane)', False),
 ('contains(austen)', False),
 ('contains(volume)', False),
 ('contains(i)', True),
 ('contains(chapter)', False),
 ('contains(woodhouse)', False),
 ('contains(handsome)', False),
 ('contains(clever)', False),
 ('contains(and)', True),
 ('contains(rich)', True),
 ('contains(with)', True),
 ('contains(a)', True),
 ('contains(comfortable)', False),
 ('contains(home)', True),
 ('contains(happy)', True),
 ('contains(disposition)', False),
 ('contains(seemed)', True),
 ('contains(to)', True)]
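Because `document_features` reads the global `word_features`, its output changes whenever that list is reassigned, which happens each time an author is added. A minimal sketch of a variant that takes the word list as an explicit argument is shown below; the `vocabulary` parameter and the toy data are assumptions introduced here for illustration.

```python
def document_features(document, vocabulary):
    """Return presence/absence features for each word in `vocabulary`."""
    document_words = set(document)  # set membership checks are fast
    return {"contains({})".format(w): (w in document_words) for w in vocabulary}


vocabulary = ["emma", "by", "sing"]                       # illustrative words only
segment = ["little", "lamb", "sing", "by", "the", "stream"]
print(document_features(segment, vocabulary))
# {'contains(emma)': False, 'contains(by)': True, 'contains(sing)': True}
```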

### Create Test Train Dataset

Now we need to create a list of all text segments from both Austen and Blake and shuffle them to create the text corpus that we will use to train and test our classifier model.

In [26]:
documents=austen1+blake1
documents

[[['emma', 'by', 'jane', 'austen', 'volume', 'i', 'chapter', 'i', 'emma', 'woodhouse',
   'handsome', 'clever', 'and', 'rich', 'with', 'a', 'comfortable', 'home', 'and', 'happy',
   'disposition', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of',
   'existence', 'and', 'had', 'lived', 'nearly', 'twenty', 'one', 'years', 'in', 'the',
   'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', 'she',
   'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most',
   'affectionate', 'indulgent', 'father', 'and', 'had', 'in', 'consequence', 'of', 'her', 'sister',
   's', 'marriage', 'been', 'mistress', 'of', 'his', 'house', 'from', 'a', 'very',
   'early', 'period', 'her', 'mother', 'had', 'died', 'too', ...


In [27]:
import random
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

373

Next we split our dataset into test and train sections, train our classifier on the training set, and check the accuracy of our model on the test set.

In [28]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [29]:
print(nltk.classify.accuracy(classifier, test_set)) 

1.0


It is very easy for NLTK to distinguish between Austen and Blake. Let's try more authors.
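A caveat worth checking before moving on: only 7 of the 373 segments are Blake, so a classifier that always answered "Austen" would already score very highly. The sketch below computes that majority-class baseline from the segment counts reported above; `majority_baseline` is a hypothetical helper name, not an NLTK function.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Segment counts from the notebook: 366 Austen, 7 Blake.
labels = ["au"] * 366 + ["bl"] * 7
label, accuracy = majority_baseline(labels)
print(label, round(accuracy, 3))   # au 0.981
```

Comparing each reported accuracy against this kind of baseline gives a better sense of how much the word features are contributing.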

## Adding Bryant

In [30]:
bryant = nltk.corpus.gutenberg.words('bryant-stories.txt')
bryant = [word.lower() for word in bryant if word.isalpha()]
bryant1=[]
for i in range(46):
    bryant1.append([bryant[i*1000:(i+1)*1000],'br'])
len(bryant)

46611

In [31]:
abb=austen+blake+bryant
all_words = nltk.FreqDist(w.lower() for w in abb)
word_features = list(all_words)[:2000] 

documents=austen1+blake1+bryant1

In [32]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

419

In [33]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [34]:
print(nltk.classify.accuracy(classifier, test_set)) 

0.97


We still get pretty good results in distinguishing between Austen, Blake, and Bryant.

## Adding Burgess

In [35]:
burgess = nltk.corpus.gutenberg.words('burgess-busterbrown.txt')
burgess = [word.lower() for word in burgess if word.isalpha()]
burgess1=[]
for i in range(16):
    burgess1.append([burgess[i*1000:(i+1)*1000],'bu'])
len(burgess)

16327

In [36]:
abbb=austen+blake+bryant+burgess
all_words = nltk.FreqDist(w.lower() for w in abbb)
word_features = list(all_words)[:2000] 

documents=austen1+blake1+bryant1+burgess1

In [37]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

435

In [38]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.95


Accuracy declined a bit, but we are still in the mid-90s.

Let's see what features are most important in training our model...

In [39]:
classifier.show_most_informative_features(25)

Most Informative Features
           contains(her) = False              bu : au     =    169.2 : 1.0
           contains(had) = False              bl : au     =    116.7 : 1.0
          contains(fish) = True               bu : au     =     87.5 : 1.0
        contains(forest) = True               bu : au     =     77.5 : 1.0
            contains(as) = False              bl : au     =     70.0 : 1.0
           contains(cow) = True               bl : au     =     70.0 : 1.0
         contains(awoke) = True               bl : au     =     70.0 : 1.0
          contains(sing) = True               bl : au     =     70.0 : 1.0
          contains(mild) = True               bl : au     =     54.4 : 1.0
          contains(very) = False              bl : au     =     42.0 : 1.0
        contains(farmer) = True               bu : au     =     39.8 : 1.0
         contains(sleep) = True               bl : au     =     32.7 : 1.0
         contains(angel) = True               bl : au     =     31.8 : 1.0

The absence of the word 'her' is about 169 times more likely in a Burgess segment than in an Austen segment, while the presence of 'fish' or 'forest' is about 87 and 77 times more likely, respectively, in a Burgess segment than in an Austen segment. The words 'cow', 'awoke', and 'sing' are each 70 times more likely to appear in a Blake segment than in an Austen segment.
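Each ratio in the listing compares how probable a feature value is under one label versus another, so it is not by itself the probability that a text belongs to an author; the class priors still matter. The sketch below reproduces that kind of ratio from raw counts. It assumes the classifier smooths counts with an expected-likelihood estimate (add 0.5 per outcome), which to our understanding is the NLTK default, and the counts are hypothetical values chosen for illustration rather than taken from the notebook.

```python
def ele_prob(count, total, bins=2, gamma=0.5):
    """Smoothed estimate of P(feature value | label), adding gamma per outcome."""
    return (count + gamma) / (total + gamma * bins)


# Hypothetical counts (not from the notebook): number of training segments
# per author, and how many of those segments lack a given word.
n_a, missing_a = 300, 0    # author A: the word appears in every segment
n_b, missing_b = 12, 9     # author B: the word is absent from 9 of 12 segments

p_a = ele_prob(missing_a, n_a)
p_b = ele_prob(missing_b, n_b)
print(round(p_b / p_a, 1))   # 439.9
```

The example also shows why very large ratios appear: when a feature value never occurs for one author, the denominator is only the small smoothing term.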

## Adding Carroll

In [40]:
carroll = nltk.corpus.gutenberg.words('carroll-alice.txt')
carroll = [word.lower() for word in carroll if word.isalpha()]
carroll1=[]
for i in range(27):
    carroll1.append([carroll[i*1000:(i+1)*1000],'ca'])
len(carroll)

27333

In [41]:
abbbc=austen+blake+bryant+burgess+carroll
all_words = nltk.FreqDist(w.lower() for w in abbbc)
word_features = list(all_words)[:2000] 

documents=austen1+blake1+bryant1+burgess1+carroll1

In [42]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

462

In [43]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.98


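Interestingly, adding Carroll actually improved accuracy. I assume his style is very different from the others and easy to differentiate, although with a fresh shuffle and only 100 test segments each time, a difference of a few points may also be due to chance.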

In [44]:
classifier.show_most_informative_features(25)

Most Informative Features
           contains(her) = False              bu : au     =    183.9 : 1.0
        contains(forest) = True               bu : au     =     78.8 : 1.0
            contains(as) = False              bl : au     =     68.1 : 1.0
           contains(had) = False              bl : au     =     57.2 : 1.0
          contains(sing) = True               bl : au     =     49.9 : 1.0
          contains(very) = False              bl : au     =     40.9 : 1.0
          contains(flew) = True               bl : au     =     40.9 : 1.0
          contains(mild) = True               bl : au     =     40.9 : 1.0
          contains(lies) = True               bl : au     =     40.9 : 1.0
          contains(tend) = True               bl : au     =     40.9 : 1.0
          contains(wool) = True               bl : au     =     40.9 : 1.0
          contains(food) = True               br : au     =     38.6 : 1.0
      contains(frighten) = True               bu : au     =     34.0 : 1.0

Common words whose presence suggests Blake include "sing", "mild", "flew", "wool", and "lies". For Burgess the indicator words are "forest", "frighten", and "forever"; for Bryant, "food" and "worked"; and for Austen, "her", "as", "had", and "very".

## Adding Chesterton

In [45]:
chesterson = nltk.corpus.gutenberg.words('chesterton-ball.txt')+nltk.corpus.gutenberg.words('chesterton-brown.txt')+nltk.corpus.gutenberg.words('chesterton-thursday.txt')
chesterson = [word.lower() for word in chesterson if word.isalpha()]
chesterson1=[]
for i in range(214):
    chesterson1.append([chesterson[i*1000:(i+1)*1000],'ch'])
len(chesterson)

214692

In [46]:
abbbcc=austen+blake+bryant+burgess+carroll+chesterson
all_words = nltk.FreqDist(w.lower() for w in abbbcc)
word_features = list(all_words)[:2000] 

documents=austen1+blake1+bryant1+burgess1+carroll1+chesterson1

In [47]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

676

In [48]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.99


Wow, we get 99%!

In [49]:
classifier.show_most_informative_features(25)

Most Informative Features
           contains(her) = False              bu : au     =    187.9 : 1.0
         contains(arise) = True               bl : ch     =     98.0 : 1.0
            contains(as) = False              bl : au     =     90.7 : 1.0
        contains(forest) = True               bu : au     =     86.1 : 1.0
         contains(fight) = True               ch : au     =     53.3 : 1.0
          contains(maid) = True               bl : ch     =     52.8 : 1.0
      contains(youthful) = True               bl : ch     =     52.8 : 1.0
           contains(had) = False              bl : au     =     50.0 : 1.0
        contains(smiles) = True               bl : ch     =     49.8 : 1.0
          contains(very) = False              bl : au     =     47.5 : 1.0
          contains(sing) = True               bl : au     =     45.9 : 1.0
         contains(visit) = True               au : ch     =     41.7 : 1.0
  contains(acquaintance) = True               au : ch     =     41.3 : 1.0

## Adding the rest of the authors

In [50]:
edgeworth = nltk.corpus.gutenberg.words('edgeworth-parents.txt')
edgeworth = [word.lower() for word in edgeworth if word.isalpha()]
edgeworth1=[]
for i in range(170):
    edgeworth1.append([edgeworth[i*1000:(i+1)*1000],'ed'])
len(edgeworth)

170737

In [51]:
melville = nltk.corpus.gutenberg.words('melville-moby_dick.txt')
melville = [word.lower() for word in melville if word.isalpha()]
melville1=[]
for i in range(218):
    melville1.append([melville[i*1000:(i+1)*1000],'me'])
len(melville)

218361

In [52]:
shakespeare = nltk.corpus.gutenberg.words('shakespeare-caesar.txt')+nltk.corpus.gutenberg.words('shakespeare-hamlet.txt')+nltk.corpus.gutenberg.words('shakespeare-macbeth.txt')
shakespeare = [word.lower() for word in shakespeare if word.isalpha()]
shakespeare1=[]
for i in range(69):
    shakespeare1.append([shakespeare[i*1000:(i+1)*1000],'sh'])
len(shakespeare)

69340

In [53]:
whitman = nltk.corpus.gutenberg.words('whitman-leaves.txt')
whitman = [word.lower() for word in whitman if word.isalpha()]
whitman1=[]
for i in range(126):
    whitman1.append([whitman[i*1000:(i+1)*1000],'wh'])
len(whitman)

126276

In [54]:
abbbccemsw=austen+blake+bryant+burgess+carroll+chesterson+edgeworth+melville+shakespeare+whitman
all_words = nltk.FreqDist(w.lower() for w in abbbccemsw)
word_features = list(all_words)[:2000] 

documents=austen1+blake1+bryant1+burgess1+carroll1+chesterson1+edgeworth1+melville1+shakespeare1+whitman1

In [55]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

1259

In [56]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.99


Very nice results!

In [57]:
classifier.show_most_informative_features(40)

Most Informative Features
          contains(have) = False              sh : au     =    222.2 : 1.0
         contains(arise) = True               bl : ch     =    109.4 : 1.0
          contains(sing) = True               bl : ch     =    109.4 : 1.0
           contains(had) = False              wh : au     =    104.2 : 1.0
          contains(miss) = True               au : me     =     99.3 : 1.0
            contains(as) = False              bl : au     =     98.0 : 1.0
        contains(farmer) = True               bu : me     =     97.3 : 1.0
        contains(forest) = True               bu : au     =     93.0 : 1.0
       contains(forever) = True               wh : au     =     71.2 : 1.0
           contains(mrs) = True               au : me     =     69.2 : 1.0
        contains(seemed) = True               ch : wh     =     59.4 : 1.0
          contains(sigh) = True               bl : me     =     59.2 : 1.0
          contains(maid) = True               bl : ch     =     58.9 : 1.0

## Final notes

Interestingly, when training the models we found that using significantly fewer texts in the training set appeared to greatly improve performance over using a larger portion of the texts. For example, when using about 700 texts to train the last model we got an accuracy of about 74%, but when using only 100 texts the accuracy increased to 99%! Note, however, that in the cells shown above `featuresets[:100]` is the test set and `featuresets[100:]` is the training set, so the 99% figure shown here comes from training on 1159 segments and testing on 100.
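Since the result depends on which slice is used for training, it may help to make the split explicit and reproducible. The following is a minimal sketch; `split_train_test` and its `seed` parameter are hypothetical names introduced here, and a list of integers stands in for the 1259 feature sets of the last model.

```python
import random


def split_train_test(items, n_test, seed=250):
    """Shuffle a copy of `items` and hold out the first `n_test` for testing."""
    rng = random.Random(seed)   # local generator, so the global seed is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


items = list(range(1259))       # stand-in for the real feature sets
train_set, test_set = split_train_test(items, n_test=100)
print(len(train_set), len(test_set))   # 1159 100
```

Calling the helper with different `n_test` values and the same seed would make a comparison between training-set sizes easier to repeat.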

## YouTube Link