Project 4

Using the Senseval corpus, we will train a classifier that predicts the sense tag for each instance of the word 'interest', based on the word form at the target position and the words directly preceding and following it.

In [126]:
import nltk
from nltk.corpus import senseval
instances = senseval.instances('interest.pos')
size = int(len(instances)*0.1)
train_set, test_set = instances[size:], instances[:size]
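The split above takes the first 10% of the instances as the test set without shuffling. If the corpus happens to be ordered in some way, a seeded shuffle before splitting may give a more representative test set. The following is a minimal sketch under that assumption; `split_instances` and its parameters are hypothetical names, and the stand-in data replaces the real Senseval instances.

```python
import random

def split_instances(items, test_fraction=0.1, seed=0):
    """Return (train, test) after a reproducible shuffle (hypothetical helper)."""
    shuffled = list(items)                 # copy so the caller's sequence is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    size = int(len(shuffled) * test_fraction)
    return shuffled[size:], shuffled[:size]

# Stand-in data; in the notebook this would be senseval.instances('interest.pos').
train_demo, test_demo = split_instances(range(100))
print(len(train_demo), len(test_demo))
```

Fixing the seed keeps reported accuracy comparable between runs.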

Here we see that there are 6 sense categories for the word 'interest'.

In [127]:
hardTypes = []
for each in instances:
    if each.senses[0] not in hardTypes:
        hardTypes.append(each.senses[0])
hardTypes

['interest_6',
 'interest_5',
 'interest_4',
 'interest_1',
 'interest_3',
 'interest_2']
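The loop above collects the distinct senses; counting how often each one occurs is also useful, because a skewed distribution affects how an accuracy figure should be read. A minimal sketch with `collections.Counter` follows. The list `demo_senses` is hypothetical toy data, not the real corpus; in the notebook the input would be `each.senses[0] for each in instances`.

```python
from collections import Counter

def sense_counts(first_senses):
    """Count how many times each sense label occurs."""
    return Counter(first_senses)

# Hypothetical toy labels standing in for each.senses[0] over the corpus.
demo_senses = ['interest_6', 'interest_5', 'interest_6', 'interest_1']
counts = sense_counts(demo_senses)
print(len(counts))            # number of distinct senses in the toy data
print(counts.most_common(1))  # most frequent sense and its count
```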

In [147]:
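def addFeatures(each):
    """Build a feature dict from the target word and its immediate neighbours."""
    features = {}
    n = each.position  # index of the target word within the context
    # Context items are expected to be (word, POS tag) pairs; [0] keeps the word.
    features['context at n'] = each.context[n][0]
    features['context at n-1'] = each.context[n-1][0]
    features['context at n+1'] = each.context[n+1][0]
    return features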

train = [(addFeatures(each),each.senses[0]) for each in train_set]
test = [(addFeatures(each),each.senses[0]) for each in test_set]
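`addFeatures` assumes the target word always has a neighbour on each side. If the target were the first token, index `n-1` would silently wrap around to the last token, and if it were the last token, `n+1` would raise an `IndexError`. The sketch below shows one possible guard. The `Instance` namedtuple, `token_at`, `safe_features`, and the `'<PAD>'` value are all hypothetical stand-ins, and the string check covers the case where a context item is a plain string rather than a pair.

```python
from collections import namedtuple

# Hypothetical stand-in for a Senseval instance (position, context, senses).
Instance = namedtuple('Instance', ['position', 'context', 'senses'])

def token_at(context, i, pad='<PAD>'):
    """Return the word at index i, or a padding marker when i is out of range."""
    if i < 0 or i >= len(context):
        return pad
    item = context[i]
    # Items are normally (word, tag) pairs; fall back to the item itself if it is a string.
    return item if isinstance(item, str) else item[0]

def safe_features(inst):
    """Same three features as addFeatures, with boundary checks."""
    n = inst.position
    return {
        'context at n': token_at(inst.context, n),
        'context at n-1': token_at(inst.context, n - 1),
        'context at n+1': token_at(inst.context, n + 1),
    }

# Toy instance whose target word is the first token.
demo = Instance(0, [('interest', 'NN'), ('rates', 'NNS')], ('interest_6',))
print(safe_features(demo))
```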

In [148]:
classifier = nltk.NaiveBayesClassifier.train(train)
classifier.show_most_informative_features()

Most Informative Features
            context at n = 'interests'    intere : intere =     70.7 : 1.0
          context at n+1 = 'in'           intere : intere =     63.6 : 1.0
          context at n-1 = 'other'        intere : intere =     53.7 : 1.0
          context at n+1 = 'of'           intere : intere =     39.8 : 1.0
          context at n-1 = 'and'          intere : intere =     18.3 : 1.0
          context at n-1 = 'in'           intere : intere =     17.6 : 1.0
          context at n-1 = 'own'          intere : intere =     15.1 : 1.0
          context at n+1 = '.'            intere : intere =     13.8 : 1.0
          context at n+1 = 'rose'         intere : intere =     13.8 : 1.0
          context at n+1 = 'because'      intere : intere =     12.7 : 1.0
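Each row above compares how likely a feature value is under one sense versus another. The label columns appear to be truncated to six characters, which would explain why every sense prints as 'intere'. The sketch below shows how such a ratio could be computed from raw counts. It assumes add-0.5 (expected likelihood) smoothing, which we believe is the default estimator for NLTK's Naive Bayes trainer but have not confirmed here; the function names and all counts are hypothetical.

```python
def smoothed_prob(count, total, bins, gamma=0.5):
    """Smoothed estimate of P(value | label); gamma=0.5 is an assumption."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b, bins):
    """Ratio P(value | label a) / P(value | label b) for one feature value."""
    return (smoothed_prob(count_a, total_a, bins)
            / smoothed_prob(count_b, total_b, bins))

# Hypothetical counts: the value occurs 30 times in 100 instances of sense a,
# and 2 times in 100 instances of sense b, with 50 possible feature values.
ratio = likelihood_ratio(30, 100, 2, 100, bins=50)
print(round(ratio, 1))
```

A ratio well above 1.0 means the feature value is much more typical of the first label than of the second.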


Here we see that the accuracy on the held-out test set is about 88.6%.

In [149]:
print(nltk.classify.accuracy(classifier, test))

0.885593220339


Now we will use the movie review document classifier to list the 30 features that the classifier finds most informative. First we load the code from Chapter 6 of the NLTK book:

In [152]:
from nltk.corpus import movie_reviews
import random

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000] 

def document_features(document): 
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.73
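One caveat about the vocabulary: `list(all_words)[:2000]` takes the first 2000 keys in whatever order the frequency distribution iterates, which in recent NLTK versions is not necessarily frequency order. That could let rare words such as proper names into the feature set. Since `FreqDist` subclasses `collections.Counter`, `most_common(2000)` would select the most frequent words instead. The sketch below illustrates the idea with a plain `Counter`; `top_words` and the toy tokens are hypothetical, and switching to it would change the accuracy and feature list reported in this notebook.

```python
from collections import Counter

def top_words(tokens, k):
    """Return the k most frequent lowercased tokens, most frequent first."""
    freq = Counter(w.lower() for w in tokens)
    return [w for w, _ in freq.most_common(k)]

# Toy tokens standing in for movie_reviews.words().
demo_tokens = "The the the good good film".split()
print(top_words(demo_tokens, 2))
```

In the notebook, the analogous line would be `word_features = [w for w, _ in all_words.most_common(2000)]`.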


Some of the most informative features make sense, and some are rather surprising. A few in particular ('maxwell', 'matheson', 'prescott', 'leia', 'wang', 'lang', and 'minnie') appear to be first or last names and are probably unreliable indicators: frequent mention in one or a few reviews may have been enough to land them on the most informative features list. Other words, such as 'sans', 'mediocrity', 'tripe', and 'ugh', are negative, so their neg : pos ratios make sense. 'gadget', 'wires', and 'bruckheimer' also lean negative, perhaps because of their association with action movies with clumsy narratives. What surprises me most is that only 8 of the 30 features are more indicative of negativity than positivity; I would have guessed that movie reviews are more often critical than positive. This may still be the case, however: reviewers may use more diverse and specific language when criticizing a film, so that each negative word occurs less often and is therefore less indicative, whereas when praising a film they may rely on generic vocabulary (e.g. 'uplifting', 'effortlessly', 'testament', 'admired').

In [153]:
classifier.show_most_informative_features(30)

Most Informative Features
          contains(sans) = True              neg : pos    =      9.1 : 1.0
    contains(mediocrity) = True              neg : pos    =      7.7 : 1.0
   contains(bruckheimer) = True              neg : pos    =      6.4 : 1.0
         contains(wires) = True              neg : pos    =      6.4 : 1.0
        contains(fabric) = True              pos : neg    =      6.3 : 1.0
   contains(overwhelmed) = True              pos : neg    =      6.3 : 1.0
     contains(dismissed) = True              pos : neg    =      6.3 : 1.0
   contains(understands) = True              pos : neg    =      6.1 : 1.0
        contains(gadget) = True              neg : pos    =      5.7 : 1.0
           contains(ugh) = True              neg : pos    =      5.4 : 1.0
     contains(uplifting) = True              pos : neg    =      5.2 : 1.0
       contains(topping) = True              pos : neg    =      5.0 : 1.0
          contains(wits) = True              pos : neg    =      5.0 : 1.0