# Lab 2: Sentiment analysis using NLTK

You are going to create a sentiment classifier from the movie reviews provided in NLTK.

## Background reading

* http://www.nltk.org/api/nltk.sentiment.html#module-nltk.sentiment.util
* https://www.nltk.org/book/ch06.html
    * section 6.1
    * section 6.3

We are first going to load the movie_reviews data set from NLTK.

In [14]:
# Load NLTK and the Naive Bayes classifier
import nltk
from nltk.classify import NaiveBayesClassifier

In [15]:
from nltk.corpus import movie_reviews
 
# Turn an iterable of words into a bag-of-words feature dict: {word: True}
def word_feats(words):
    return {word: True for word in words}
 
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

print(len(negids), len(posids))


1000 1000


We now have two lists of file IDs: one for the files that are negative reviews and one for the files that are positive reviews, with 1000 files each.

In [16]:
# First negative review:
negids[0]

'neg/cv000_29416.txt'

Next we extract the words of each review in both sets and create (features, label) tuples: the first element is the feature representation of the words of the review, and the second element is the label 'neg' or 'pos'.

In [17]:
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# let's print the first tuple from the negative set
negfeats[0]

({'plot': True,
  ':': True,
  'two': True,
  'teen': True,
  'couples': True,
  'go': True,
  'to': True,
  'a': True,
  'church': True,
  'party': True,
  ',': True,
  'drink': True,
  'and': True,
  'then': True,
  'drive': True,
  '.': True,
  'they': True,
  'get': True,
  'into': True,
  'an': True,
  'accident': True,
  'one': True,
  'of': True,
  'the': True,
  'guys': True,
  'dies': True,
  'but': True,
  'his': True,
  'girlfriend': True,
  'continues': True,
  'see': True,
  'him': True,
  'in': True,
  'her': True,
  'life': True,
  'has': True,
  'nightmares': True,
  'what': True,
  "'": True,
  's': True,
  'deal': True,
  '?': True,
  'watch': True,
  'movie': True,
  '"': True,
  'sorta': True,
  'find': True,
  'out': True,
  'critique': True,
  'mind': True,
  '-': True,
  'fuck': True,
  'for': True,
  'generation': True,
  'that': True,
  'touches': True,
  'on': True,
  'very': True,
  'cool': True,
  'idea': True,
  'presents': True,
  'it': True,
  'bad': True
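To see the shape of these tuples without loading the corpus, here is a minimal sketch on a made-up token list. The names `toy_review` and `labeled` are assumptions for illustration only; `word_feats` mirrors the function defined above.

```python
# Same idea as word_feats above: every distinct word becomes a key with value True.
def word_feats(words):
    return {word: True for word in words}

# Hypothetical, hand-written token list standing in for movie_reviews.words(...)
toy_review = ["a", "very", "cool", "idea", ",", "a", "bad", "movie"]

features = word_feats(toy_review)
labeled = (features, "neg")  # (feature dict, label), as in negfeats

print(labeled)
print(len(toy_review), "tokens ->", len(features), "features")
```

Because the features are stored in a dict, a repeated word such as "a" appears only once: the representation records which words occur, not how often.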

In [18]:
# Define a split over the data for creating a train and test set
negcutoff = int(len(negfeats)*3/4)
poscutoff = int(len(posfeats)*3/4)

print(negcutoff)
print(poscutoff)

750
750


In [19]:
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

train on 1500 instances, test on 500 instances
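The split above takes the first three quarters of each class in corpus order. As a possible variation, the sketch below shuffles each class with a fixed seed before cutting. The helper name `split_items`, the seed, and the placeholder data are assumptions for illustration and are not part of the lab.

```python
import random

def split_items(items, train_fraction=0.75, seed=0):
    """Shuffle a copy of items reproducibly and cut it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]

# Placeholder data standing in for negfeats and posfeats (1000 items each)
neg_items = [("neg_review_%d" % i, "neg") for i in range(1000)]
pos_items = [("pos_review_%d" % i, "pos") for i in range(1000)]

# Split each class separately so both sets keep a 50/50 class balance
neg_train, neg_test = split_items(neg_items)
pos_train, pos_test = split_items(pos_items)

train_items = neg_train + pos_train
test_items = neg_test + pos_test
print('train on %d instances, test on %d instances' % (len(train_items), len(test_items)))
```

Splitting per class, as the original cell does, keeps the same proportion of negative and positive reviews in the training and test sets.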


In [20]:
classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()

accuracy: 0.728
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0
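An accuracy of 0.728 on 500 test instances means that 364 of them were labelled correctly. In the feature list, a line such as `magnificent = True   pos : neg = 15.0 : 1.0` says that the feature is estimated to be about 15 times more likely in positive than in negative training reviews. To make the training and evaluation calls concrete, here is a minimal sketch on a made-up data set. The feature sets and the names `toy_train` and `toy_test` are assumptions for illustration, not lab data.

```python
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier

# Hypothetical, hand-written training data in the same (features, label) format
toy_train = [
    ({'great': True, 'movie': True}, 'pos'),
    ({'wonderful': True, 'acting': True}, 'pos'),
    ({'boring': True, 'movie': True}, 'neg'),
    ({'awful': True, 'acting': True}, 'neg'),
]
toy_test = [
    ({'great': True, 'acting': True}, 'pos'),
    ({'boring': True, 'acting': True}, 'neg'),
]

toy_classifier = NaiveBayesClassifier.train(toy_train)

# classify() returns the most probable label for one feature dict
predicted = toy_classifier.classify({'great': True, 'acting': True})
print('predicted:', predicted)

# accuracy() is the fraction of test items whose predicted label matches the gold label
toy_accuracy = nltk.classify.util.accuracy(toy_classifier, toy_test)
print('accuracy:', toy_accuracy)
```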


## Training and testing with your own data

In [22]:
import nltk  # needed for nltk.word_tokenize below
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier

# simple function that turns a list of words into word_feats (word features)
def word_feats(words):
    return {word: True for word in words}

# In a lexical approach, you would predefine the positive, negative and neutral words and use only these to train a classifier.
# Note: positive_vocab and negative_vocab are shown for illustration; only neutral_vocab is used below.
positive_vocab = ['awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)']
negative_vocab = ['bad', 'terrible', 'useless', 'hate', ':(']
neutral_vocab = ['movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not']

# Assume you have a collection of texts that are negative and positive
negsentence = "I do not like green eggs and ham, and I do not like them too!"
possentence = "I like green eggs and ham, and I like them too!"
# By using the tokenization function, you can turn them into lists of negative and positive tokens
negtokens = nltk.word_tokenize(negsentence)
postokens = nltk.word_tokenize(possentence)

# Next we use the simple word feature function to turn them into features that can be used for training the classifier.
# Note: each token is passed to word_feats as a string, so word_feats iterates over its characters
# and the resulting features are character-level (see the printed output).
# Pass a one-element list, e.g. word_feats([pos]), to get word-level features instead.
positive_features = [(word_feats(pos), 'pos') for pos in postokens]
negative_features = [(word_feats(neg), 'neg') for neg in negtokens]
# for neutral we now take the vocabulary given above
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]
print(positive_features) 

[({'I': True}, 'pos'), ({'l': True, 'i': True, 'k': True, 'e': True}, 'pos'), ({'g': True, 'r': True, 'e': True, 'n': True}, 'pos'), ({'e': True, 'g': True, 's': True}, 'pos'), ({'a': True, 'n': True, 'd': True}, 'pos'), ({'h': True, 'a': True, 'm': True}, 'pos'), ({',': True}, 'pos'), ({'a': True, 'n': True, 'd': True}, 'pos'), ({'I': True}, 'pos'), ({'l': True, 'i': True, 'k': True, 'e': True}, 'pos'), ({'t': True, 'h': True, 'e': True, 'm': True}, 'pos'), ({'t': True, 'o': True}, 'pos'), ({'!': True}, 'pos')]
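The output above contains single letters as features because `word_feats` iterates over whatever it is given, and iterating over a string yields its characters. The sketch below contrasts the two calls. The names `tokens` and `word_level` are assumptions for illustration, and the token list is written by hand rather than produced by `nltk.word_tokenize`.

```python
def word_feats(words):
    return {word: True for word in words}

# A string is iterated character by character ...
char_level = word_feats("like")
print(char_level)   # {'l': True, 'i': True, 'k': True, 'e': True}

# ... while a one-element list yields a single word feature
single_word = word_feats(["like"])
print(single_word)  # {'like': True}

# Word-level training tuples for a hand-written token list
tokens = ["I", "like", "green", "eggs"]
word_level = [(word_feats([tok]), 'pos') for tok in tokens]
print(word_level)
```

With word-level features, the classifier learns from whole tokens such as "like" rather than from the letters they share with other words.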


What would be another way to obtain neutral word features?

How would you do this for a data set where positive and negative texts are stored in two separate directories?
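As a starting point for the second question, here is one possible sketch, not the only answer. It assumes one review per `.txt` file and uses whitespace splitting instead of `nltk.word_tokenize` to stay self-contained. The helper name `load_labeled_feats`, the directory names, and the file contents are all hypothetical.

```python
import tempfile
from pathlib import Path

def word_feats(words):
    return {word: True for word in words}

def load_labeled_feats(directory, label):
    """Read every .txt file in directory and return (features, label) tuples."""
    feats = []
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        tokens = text.lower().split()  # assumption: simple whitespace tokenization
        feats.append((word_feats(tokens), label))
    return feats

# Build a tiny throwaway data set so the sketch runs anywhere
with tempfile.TemporaryDirectory() as root:
    pos_dir = Path(root) / "pos"
    neg_dir = Path(root) / "neg"
    pos_dir.mkdir()
    neg_dir.mkdir()
    (pos_dir / "review1.txt").write_text("a great movie", encoding="utf-8")
    (neg_dir / "review1.txt").write_text("a boring movie", encoding="utf-8")

    toy_train_set = (load_labeled_feats(pos_dir, "pos")
                     + load_labeled_feats(neg_dir, "neg"))

print(toy_train_set)
```

The resulting list has the same (features, label) format as `negfeats` and `posfeats`, so it could be passed to `NaiveBayesClassifier.train` in the same way.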

In [23]:
# we simply concatenate the features to create a training set
train_set = negative_features + positive_features + neutral_features
classifier = NaiveBayesClassifier.train(train_set) 

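We are going to test this classifier on a single sentence: each token is classified separately, and we report the fraction of tokens labelled positive and negative.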

In [24]:
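# Counters for the number of tokens classified as negative and positive
neg = 0
pos = 0
testsentence = "Awesome eggs, I do not liked them"
words = nltk.word_tokenize(testsentence)
for word in words:
    # word is a string, so word_feats(word) yields character-level features,
    # matching how the training features were built above
    classResult = classifier.classify(word_feats(word))
    if classResult == 'neg':
        neg += 1
    if classResult == 'pos':
        pos += 1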
 
print("Sentence: '{}'\n--------------\n".format(testsentence))
print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))

Sentence: 'Awesome eggs, I do not liked them'
--------------

Positive: 0.375
Negative: 0.25
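The sentence has 8 tokens, so these fractions correspond to 3 tokens labelled 'pos' and 2 labelled 'neg'; the remaining 3 were labelled 'neu'. The counting loop can be wrapped in a small reusable function, sketched below. The names `label_proportions`, `toy_lexicon`, and `toy_classify` are assumptions for illustration, and the lookup-based `toy_classify` merely stands in for `classifier.classify`.

```python
from collections import Counter

def label_proportions(tokens, classify):
    """Classify each token and return the fraction of tokens per label."""
    if not tokens:
        return {}
    counts = Counter(classify(token) for token in tokens)
    return {label: count / len(tokens) for label, count in counts.items()}

# Hypothetical stand-in for a trained classifier: a tiny lexicon lookup
toy_lexicon = {"awesome": "pos", "hate": "neg"}

def toy_classify(token):
    return toy_lexicon.get(token.lower(), "neu")

toy_tokens = ["Awesome", "eggs", ",", "I", "hate", "them"]
proportions = label_proportions(toy_tokens, toy_classify)

for label in ("pos", "neg", "neu"):
    print(label, proportions.get(label, 0.0))
```

Reporting all labels, including 'neu', makes it visible that the fractions sum to 1.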
