## Project 4 

#### In this assignment we evaluate the results of the movie review classifier found in chapter 6. Because the accuracy was found to only be 65% I also looked at bigrams which increased the accuracy to nearly 77%


### References:
[NLTK](http://www.nltk.org/book/ch06.html) Chapter 6   
[Streamhacker](http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/) 

In [16]:
import random
import nltk
from nltk.corpus import movie_reviews


In [2]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

In [4]:
words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(words)[:5000]

In [5]:
def features(document): 
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [6]:
featuresets = [(features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)


In [7]:
print(nltk.classify.accuracy(classifier, test_set)) 

0.688


#### Streamhacker has some useful information on using other nltk packages

In [14]:
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
 
def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=1500):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])
 

In [15]:
featuresets = [(bigram_word_feats(d), c) for (d,c) in documents]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [12]:
print(nltk.classify.accuracy(classifier, test_set))

0.774


In [13]:
classifier.show_most_informative_features(30)

Most Informative Features
               ludicrous = True              neg : pos    =     18.6 : 1.0
             magnificent = True              pos : neg    =     13.5 : 1.0
     (u'matt', u'damon') = True              pos : neg    =     12.1 : 1.0
               fictional = True              pos : neg    =     11.5 : 1.0
             uninvolving = True              neg : pos    =     11.2 : 1.0
         (u's', u'just') = True              neg : pos    =     10.5 : 1.0
              schumacher = True              neg : pos    =     10.5 : 1.0
                depicted = True              pos : neg    =     10.2 : 1.0
                   damon = True              pos : neg    =     10.0 : 1.0
                  seagal = True              neg : pos    =      9.8 : 1.0
                  finger = True              neg : pos    =      9.8 : 1.0
                 idiotic = True              neg : pos    =      9.6 : 1.0
                  avoids = True              pos : neg    =      9.5 : 1.0

Most of these make sense, even the comical fact that Matt Damon is a positive indicator and Steven Seagal and Schumacher are negative indicators. I tried to think of anything positive that could be said before or after "be funny" and could not think of anything (tries to be funny is probably most common). "quite frankly" is something which is generally said within a negative statement. 

Some that were rather surprising:
"work with", "was made", "finger", "fictional" at first glance would seem to have neither a negative or positive connotation, which aside from "fictional", all ended up being negative indicators. 

