# Project 4
IS620 | Text Analysis  
Aaron Palumbo  
November 20, 2015  

## Objective

Using the movie review document classifier discussed in this chapter, generate a list of the 30 features that the classifier finds to be most informative. Can you explain why these particular features are informative? Do you find any of them surprising?

We will be pulling from the example in "Natural Language Processing with Python" Section 6.1, Document Classification.

## Dependencies / Setup

In [39]:
import nltk
import random

In [40]:
# silly utility to launch a qtconsole if one doesn't exist
import psutil

def returnPyIDs():
    pyids = set()
    for pid in psutil.pids():
        try:
            if "python" in psutil.Process(pid).name():
                pyids.add(pid)
        except psutil.Error:
            # process exited or is not accessible; skip it
            pass
    return pyids

def launchConsole():
    before_pyids = returnPyIDs()
    %qtconsole
    after_pyids = returnPyIDs()
    newid = after_pyids.difference(before_pyids)
    assert len(newid) == 1
    return list(newid)[0]

try:
    qtid
except NameError:
    qtid = launchConsole()
    
if qtid not in returnPyIDs():
    qtid = launchConsole()
    
qtid

7000

## Load the Data

In [41]:
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]
random.seed(1)
random.shuffle(documents)

## Extract Features

In [42]:
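all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# Caveat: this slice is "the 2000 most frequent words" only if keys() is
# sorted by frequency (NLTK 2 behavior); NLTK 3 does not guarantee that order.
word_features = all_words.keys()[:2000]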

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
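
The slice above selects the most frequent words only when `keys()` comes back sorted by frequency. Here is a minimal sketch of an order-independent alternative. It uses `collections.Counter` as a stand-in for `nltk.FreqDist` (which subclasses `Counter` in NLTK 3, so `most_common` should behave the same way there). The toy token list and the cutoff of 3 are illustrative assumptions, not part of the original run.

```python
from collections import Counter

# Toy tokens standing in for movie_reviews.words() (illustrative only).
tokens = "the film was good the plot was thin the film was the end".split()

all_words = Counter(w.lower() for w in tokens)

# most_common(n) returns (word, count) pairs sorted by descending count,
# so the selection does not depend on dictionary ordering.
top_n = 3  # the notebook uses 2000
word_features = [word for (word, count) in all_words.most_common(top_n)]

print(word_features)
```

Rerunning the notebook with a selection like this would change the feature set, so the accuracy and the feature list below would no longer match.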

## Train and Test Classifier

In [43]:
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(30)

0.63
Most Informative Features
           contains(ugh) = True              neg : pos    =      9.7 : 1.0
          contains(sans) = True              neg : pos    =      8.4 : 1.0
    contains(mediocrity) = True              neg : pos    =      7.7 : 1.0
     contains(dismissed) = True              pos : neg    =      7.0 : 1.0
   contains(bruckheimer) = True              neg : pos    =      6.3 : 1.0
         contains(wires) = True              neg : pos    =      6.3 : 1.0
        contains(fabric) = True              pos : neg    =      6.3 : 1.0
     contains(testament) = True              pos : neg    =      6.2 : 1.0
     contains(uplifting) = True              pos : neg    =      5.8 : 1.0
        contains(doubts) = True              pos : neg    =      5.8 : 1.0
       contains(topping) = True              pos : neg    =      5.7 : 1.0
   contains(overwhelmed) = True              pos : neg    =      5.7 : 1.0
  contains(effortlessly) = True              pos : neg    =      5.6 

In [67]:
%%capture output
classifier.show_most_informative_features(30)

In [68]:
most_informative = output.stdout.splitlines()
most_informative_neg = [i for i in most_informative if 'neg : pos' in i]
most_informative_pos = [i for i in most_informative if 'pos : neg' in i]
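
Filtering on the substrings `'neg : pos'` and `'pos : neg'` keeps each line as raw text. If we want the word and the ratio as data, one option is a regular expression. This is a sketch that assumes the output format shown above; `LINE_RE` and `parse_feature_line` are hypothetical names, and the two sample lines are copied from the classifier output.

```python
import re

# Two lines copied from the classifier output above.
sample_lines = [
    "           contains(ugh) = True              neg : pos    =      9.7 : 1.0",
    "     contains(dismissed) = True              pos : neg    =      7.0 : 1.0",
]

# Captures: word, feature value, favored label, other label, ratio.
LINE_RE = re.compile(
    r"contains\((?P<word>[^)]*)\)\s*=\s*(?P<value>\w+)\s+"
    r"(?P<label>\w+)\s*:\s*(?P<other>\w+)\s*=\s*(?P<ratio>[\d.]+)\s*:\s*1\.0"
)

def parse_feature_line(line):
    """Return (word, favored_label, ratio), or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return (match.group("word"), match.group("label"), float(match.group("ratio")))

parsed = [parse_feature_line(line) for line in sample_lines]
print(parsed)
```

Lines that do not fit the pattern, such as the "Most Informative Features" header or a truncated line, come back as `None` and can be filtered out.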

## Analysis

First let's take a look at the negative indicators:

In [69]:
most_informative_neg

[u'           contains(ugh) = True              neg : pos    =      9.7 : 1.0',
 u'          contains(sans) = True              neg : pos    =      8.4 : 1.0',
 u'    contains(mediocrity) = True              neg : pos    =      7.7 : 1.0',
 u'   contains(bruckheimer) = True              neg : pos    =      6.3 : 1.0',
 u'         contains(wires) = True              neg : pos    =      6.3 : 1.0',
 u'           contains(hal) = True              neg : pos    =      5.0 : 1.0',
 u'         contains(pairs) = True              neg : pos    =      5.0 : 1.0',
 u'        contains(beware) = True              neg : pos    =      5.0 : 1.0',
 u'         contains(tripe) = True              neg : pos    =      4.6 : 1.0',
 u'    contains(derivative) = True              neg : pos    =      4.4 : 1.0',
 u'      contains(chopping) = True              neg : pos    =      4.3 : 1.0',
 u'         contains(gooey) = True              neg : pos    =      4.3 : 1.0',
 u'         contains(locks) = True      

Some of these make sense and would seem out of place in a positive review:
* ugh
* mediocrity
* beware
* tripe
* derivative

The words that are less obvious but still fathomable (in my opinion) are:
* sans
* chopping

The words that don't make a lot of sense, and are probably examples of overfitting, are:
* bruckheimer
* wires
* hal
* pairs
* gooey
* locks

Let's dig a little deeper into this last set.

In [137]:
def getStats(word_list):
    """Summarize how many pos/neg reviews contain each word in word_list."""
    neg_overfit = {}
    for w in word_list:
        neg_overfit[w] = {}

    # count the reviews (not the tokens) containing each word, split by label
    for w in neg_overfit.keys():
        ratings = [rating for (words, rating) in documents if w in words]
        neg_overfit[w]['count'] = len(ratings)
        neg_overfit[w]['numPos'] = sum([i == 'pos' for i in ratings])
        neg_overfit[w]['numNeg'] = len(ratings) - neg_overfit[w]['numPos']
    
    # one formatted line per word
    summary = []
    for w in neg_overfit.keys():
        obj = neg_overfit[w]
        summary.append("{:15} => Num Appearances: {:2}, Pos: {:2}, Neg: {:2}".format(
            w, obj['count'], obj['numPos'], obj['numNeg']))
    return "\n".join(summary)

In [138]:
print getStats("bruckheimer wires hal pairs gooey locks".split())

pairs           => Num Appearances:  9, Pos:  2, Neg:  7
wires           => Num Appearances: 10, Pos:  1, Neg:  9
hal             => Num Appearances: 10, Pos:  2, Neg:  8
bruckheimer     => Num Appearances: 10, Pos:  1, Neg:  9
gooey           => Num Appearances:  8, Pos:  2, Neg:  6
locks           => Num Appearances:  7, Pos:  1, Neg:  6


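These words appear in very few reviews (7 to 10 each) and should probably not be used as features. In the beginning, we meant to take the 2000 most frequently appearing words, but words this rare would not normally make such a list. That suggests `all_words.keys()[:2000]` did not return a frequency-sorted list: in NLTK 3, `FreqDist` is a `Counter` subclass and `keys()` is not ordered by frequency, so the slice may be an arbitrary set of 2000 words.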
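
One way to keep such rare words out of the feature set is to require a minimum number of reviews per word. The sketch below runs on a toy corpus shaped like `documents`; the helper names `document_frequency` and `frequent_enough` and the threshold of 2 are assumptions for illustration.

```python
from collections import Counter

# Toy corpus in the same shape as `documents`: (list of words, label) pairs.
toy_documents = [
    (["a", "dull", "derivative", "plot"], "neg"),
    (["dull", "and", "gooey"], "neg"),
    (["a", "uplifting", "plot"], "pos"),
    (["uplifting", "and", "effortless"], "pos"),
]

def document_frequency(docs):
    """Count, for each word, the number of documents it appears in."""
    df = Counter()
    for words, _label in docs:
        df.update(set(words))  # set() so a word counts once per document
    return df

def frequent_enough(docs, min_docs):
    """Return the sorted words that appear in at least `min_docs` documents."""
    df = document_frequency(docs)
    return sorted(w for w, n in df.items() if n >= min_docs)

print(frequent_enough(toy_documents, min_docs=2))
```

On the real corpus the threshold would have to be tuned; the "Num Appearances" column above is the same per-review count that `document_frequency` computes.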

For a comparison, let's take a look at the words we flagged as making sense:

In [139]:
print getStats("ugh mediocrity beware tripe derivative".split())

tripe           => Num Appearances: 13, Pos:  2, Neg: 11
mediocrity      => Num Appearances: 12, Pos:  1, Neg: 11
beware          => Num Appearances: 10, Pos:  2, Neg:  8
ugh             => Num Appearances: 16, Pos:  2, Neg: 14
derivative      => Num Appearances: 22, Pos:  5, Neg: 17


Better, but the difference is smaller than you might expect: even the words that make sense appear in only 10 to 22 reviews. Maybe what we're really seeing is just a lack of data.
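
To see how fragile a ratio built on so few reviews is, we can recompute it by hand. This sketch assumes add-0.5 smoothing over two feature values (True/False), which to my understanding is the expected likelihood estimate that NLTK's Naive Bayes trainer uses by default. The class sizes of 950 and the counts are hypothetical, loosely modeled on `ugh`.

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    """Additive-smoothing estimate of P(feature is True | label)."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio P(True | label a) / P(True | label b) under the same smoothing."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical class sizes: 950 training reviews per label.
n_neg = n_pos = 950

# A word seen in 14 negative reviews and 1 positive review...
ratio_one_pos = likelihood_ratio(14, n_neg, 1, n_pos)
# ...versus the same word with just one more positive review.
ratio_two_pos = likelihood_ratio(14, n_neg, 2, n_pos)

print(round(ratio_one_pos, 1))
print(round(ratio_two_pos, 1))
```

Under these assumptions a single extra positive review moves the ratio from about 9.7 to about 5.8, which is why rankings built on a dozen or so appearances should be read with caution.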

Now let's look at the positive indicators:

In [110]:
most_informative_pos

[u'     contains(dismissed) = True              pos : neg    =      7.0 : 1.0',
 u'        contains(fabric) = True              pos : neg    =      6.3 : 1.0',
 u'     contains(testament) = True              pos : neg    =      6.2 : 1.0',
 u'     contains(uplifting) = True              pos : neg    =      5.8 : 1.0',
 u'        contains(doubts) = True              pos : neg    =      5.8 : 1.0',
 u'       contains(topping) = True              pos : neg    =      5.7 : 1.0',
 u'   contains(overwhelmed) = True              pos : neg    =      5.7 : 1.0',
 u'  contains(effortlessly) = True              pos : neg    =      5.6 : 1.0',
 u'          contains(wits) = True              pos : neg    =      5.0 : 1.0',
 u'          contains(lang) = True              pos : neg    =      5.0 : 1.0',
 u'          contains(hugo) = True              pos : neg    =      4.6 : 1.0',
 u'     contains(overboard) = True              pos : neg    =      4.6 : 1.0',
 u'      contains(matheson) = True      

In [142]:
pos_words = [i.split("(")[1][:-1] for i in 
             [j for j in "".join(most_informative_pos).split() if 'contains' in j]]
print getStats(pos_words)

lang            => Num Appearances:  9, Pos:  8, Neg:  1
topping         => Num Appearances:  9, Pos:  8, Neg:  1
effortlessly    => Num Appearances: 22, Pos: 19, Neg:  3
fabric          => Num Appearances: 10, Pos:  9, Neg:  1
wits            => Num Appearances:  9, Pos:  8, Neg:  1
hugo            => Num Appearances: 13, Pos: 11, Neg:  2
matheson        => Num Appearances:  7, Pos:  6, Neg:  1
dismissed       => Num Appearances: 11, Pos: 10, Neg:  1
rico            => Num Appearances: 10, Pos:  8, Neg:  2
edges           => Num Appearances:  8, Pos:  6, Neg:  2
doubts          => Num Appearances: 16, Pos: 14, Neg:  2
testament       => Num Appearances: 21, Pos: 17, Neg:  4
leia            => Num Appearances:  8, Pos:  6, Neg:  2
overboard       => Num Appearances: 15, Pos: 11, Neg:  4
overwhelmed     => Num Appearances: 10, Pos:  9, Neg:  1
spins           => Num Appearances:  7, Pos:  6, Neg:  1
uplifting       => Num Appearances: 24, Pos: 21, Neg:  3


Again, some of the words in this list make sense (effortlessly, overwhelmed), but some of them do not (leia, spins). There is some indication that the words that appear more often are stronger predictors, but what we really need is more data.