## IS620 - Project 4
### Brian Chu | November 22, 2015

### Using the movie review document classifier discussed in this chapter, generate a list of the 30 features that the classifier finds to be most informative. Can you explain why these particular features are informative? Do you find any of them surprising?

*Note: most of the code below is copied or modified from NLTK Chapter 6*

In [1]:
import nltk
import sklearn as sk
from nltk.corpus import movie_reviews
import pandas as pd

In [2]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Sample row
documents[0]

([u'plot', u':', u'two', u'teen', u'couples', u'go', u'to', u'a', u'church',
  u'party', u',', u'drink', u'and', u'then', u'drive', u'.', u'they', u'get',
  u'into', u'an', u'accident', u'.', u'one', u'of', u'the', u'guys', u'dies',
  u',', u'but', u'his', u'girlfriend', u'continues', u'to', u'see', u'him',
  u'in', u'her', u'life', u',', u'and', u'has', u'nightmares', u'.', u'what',
  u"'", u's', u'the', u'deal', u'?', u'watch', u'the', u'movie', u'and', u'"',
  u'sorta', u'"', u'find', u'out', u'.', u'.', u'.', u'critique', u':', u'a',
  u'mind', u'-', u'fuck', u'movie', u'for', u'the', u'teen', u'generation',
  u'that', u'touches', u'on', u'a', u'very', u'cool', u'idea', u',', u'but',
  u'presents', u'it', u'in', u'a', u'very', u'bad', u'package', u'.', u'which',
  u'is', u'what', u'makes', u'this', u'review', u'an', ...

In [3]:
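words = movie_reviews.words()
all_words = nltk.FreqDist(w.lower() for w in words) # {word: frequency} counts over the whole corpus
# Limit to 2000 words; otherwise training is too slow.
# In NLTK 3, keys() is not sorted by frequency, so this slice is an arbitrary 2000 words;
# all_words.most_common(2000) would give the 2000 most frequent ones.
word_features = all_words.keys()[:2000]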

# Sample data
word_features[:10]

[u'sucess',
 u'sonja',
 u'askew',
 u'woods',
 u'spiders',
 u'bazooms',
 u'hanging',
 u'francesca',
 u'comically',
 u'localized']
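
The sample above does not look like a list of the corpus's most frequent words. In NLTK 3, `FreqDist` subclasses `collections.Counter`, so `keys()` carries no frequency ordering, while `most_common(n)` does. A minimal sketch of frequency-based selection follows; it uses a plain `Counter` and a toy token list as stand-ins for `all_words` and the corpus, and `top_n_words` is a hypothetical helper, not part of NLTK:

```python
from collections import Counter

def top_n_words(tokens, n):
    """Return the n most frequent lowercased tokens, most frequent first."""
    counts = Counter(w.lower() for w in tokens)
    # most_common(n) yields (word, count) pairs sorted by descending count
    return [word for word, _ in counts.most_common(n)]

# Toy tokens (made up for illustration; not taken from movie_reviews)
toy_tokens = ["the", "movie", "The", "plot", "the", "movie", "ugh"]
top_words = top_n_words(toy_tokens, 2)
print(top_words)  # ['the', 'movie']
```

With the real corpus, the equivalent call would be along the lines of `[w for w, _ in all_words.most_common(2000)]`.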

In [4]:
# Map a document to {'contains(word)': True/False} for every word in word_features

def document_features(document):
    document_words = set(document) # set membership tests are much faster than list membership tests
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
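
To make the shape of the extracted features concrete, here is a small sketch of the same idea on a toy vocabulary. It is a variant of the function above, not the original: the vocabulary is passed in as a parameter rather than read from a global, and `toy_vocab` and the sample document are made up for illustration.

```python
def document_features(document, word_features):
    """Return {'contains(word)': bool} for each word in word_features."""
    document_words = set(document)
    return {'contains({})'.format(w): (w in document_words)
            for w in word_features}

# Hypothetical three-word vocabulary and a four-token document
toy_vocab = ['good', 'bad', 'plot']
feats = document_features(['a', 'very', 'bad', 'plot'], toy_vocab)
print(feats)
```

Every document yields one boolean per vocabulary word, so each of the 2000 features is present (as `True` or `False`) in every feature set.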

In [5]:
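# Build a feature set for every document
featuresets = [(document_features(d), c) for (d,c) in documents]
# Hold out the first 100 documents for testing and train on the rest.
# documents is built one category at a time and is not shuffled,
# so the 100 held-out rows all come from the first category.
train_set, test_set = featuresets[100:], featuresets[:100]

# Train a Naive Bayes classifier on the training set
import random
random.seed(212) # no effect here: nothing after this line draws random numbers
classifier = nltk.NaiveBayesClassifier.train(train_set)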
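
Because `documents` is assembled category by category, slicing off the first 100 rows without shuffling yields a held-out set with a single label. NLTK Chapter 6 shuffles the documents before extracting features. A minimal sketch of a seeded shuffle-then-split follows; `shuffled_split` is a hypothetical helper and the toy documents are invented for illustration:

```python
import random

def shuffled_split(labeled_docs, n_test, seed=212):
    """Shuffle a copy of labeled_docs reproducibly, then return (train, test)."""
    rng = random.Random(seed)      # private generator; global random state is untouched
    shuffled = list(labeled_docs)  # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

# Toy labeled documents: 10 'neg' followed by 10 'pos', mimicking the corpus ordering
toy_docs = ([(['doc', str(i)], 'neg') for i in range(10)] +
            [(['doc', str(i)], 'pos') for i in range(10, 20)])
train_docs, test_docs = shuffled_split(toy_docs, n_test=5)
print(len(train_docs), len(test_docs))  # 15 5
```

Seeding before the shuffle (rather than after the split) is what makes the partition reproducible.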

In [6]:
# Show the 30 most informative features of the trained classifier
classifier.show_most_informative_features(30)

Most Informative Features
          contains(sans) = True              neg : pos    =     10.0 : 1.0
    contains(mediocrity) = True              neg : pos    =      8.5 : 1.0
         contains(wires) = True              neg : pos    =      7.0 : 1.0
          contains(hugo) = True              pos : neg    =      6.9 : 1.0
     contains(dismissed) = True              pos : neg    =      6.3 : 1.0
   contains(bruckheimer) = True              neg : pos    =      6.3 : 1.0
        contains(fabric) = True              pos : neg    =      5.7 : 1.0
   contains(overwhelmed) = True              pos : neg    =      5.7 : 1.0
   contains(understands) = True              pos : neg    =      5.6 : 1.0
           contains(ugh) = True              neg : pos    =      5.6 : 1.0
     contains(uplifting) = True              pos : neg    =      5.5 : 1.0
        contains(doubts) = True              pos : neg    =      5.2 : 1.0
         contains(tripe) = True              neg : pos    =      5.1 : 1.0
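
Each ratio above compares how likely a feature value is under one label versus the other. The sketch below shows how such a ratio can arise, assuming the default estimator of `NaiveBayesClassifier.train` (expected likelihood estimation, which adds 0.5 to every count). The counts are hypothetical, and `ele_prob` and `likelihood_ratio` are illustrative helpers rather than NLTK functions:

```python
def ele_prob(count, total, bins=2, gamma=0.5):
    """Smoothed P(feature value | label): add gamma to each of `bins` outcomes."""
    return (count + gamma) / (total + bins * gamma)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """P(feature=True | label a) / P(feature=True | label b)."""
    return ele_prob(count_a, total_a) / ele_prob(count_b, total_b)

# Hypothetical counts: a word appears in 9 of 100 'neg' reviews and 0 of 100 'pos' reviews
ratio = likelihood_ratio(9, 100, 0, 100)
print(round(ratio, 1))  # 19.0
```

Under these assumptions, a word that shows up in only a handful of reviews of one class and never in the other receives a large ratio, which may be why rare words and proper names can rank so highly.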

## Not surprising

**Mediocrity | negative**: By definition, not a word for a positive review  
**Ugh | negative**: Self-explanatory, though I'm surprised this 'word' came up so frequently  
**Uplifting | positive**: Hard to call a bad movie 'uplifting'  
**Accomplishes | positive**: Usually used in a positive sense: "director/movie/actor accomplishes the goal of telling the story"  
**Effortlessly | positive**: Also more of a positive descriptor  
**Leia | positive**: I mean, if it's Princess Leia we're talking about, this is a no-brainer :)


## Surprising

**Sans | negative**: Only surprised that this is the most informative feature. Sure, it means 'without', but is it used that often? Maybe in movie reviews.  
**Wires | negative**: Are there a lot of bad movies involving wires?  
**Hugo | positive**: I assume Victor Hugo? I haven't seen Les Miserables, but I didn't know it was so universally liked  
**Bruckheimer | negative**: The Rock, Bad Boys, Top Gun, Beverly Hills Cop?! C'mon, this isn't Cannes.  
**Fabric | positive**: "Uncovers the true fabric of our society". I guess this is a popular positive movie cliche?  
**Doubts | positive**: If anything, I would have defaulted this to a negative connotation  
**Quicker | negative**: Not sure how this is so related to negative reviews. Maybe "I wish the movie would end quicker"?  
**Wcw | negative**: I don't even know what this means or stands for  
**Minnie | positive**: Minnie Mouse? Minnie Driver? Ok...  
**Wang | positive**: Not sure if I'm more surprised it made the top 30, or that it's 4:1 positive. Boogie Nights?  

## Repeat exercise but with word pairs