# Lab2 - Sentiment analysis using NLTK

In this notebook, we show you how to create a sentiment classifier from the movie reviews provided in NLTK.

**At the end of this notebook, you will be able to**:
* inspect the training data, i.e., the movie reviews
* extract features from training data
* train and evaluate the *NaiveBayesClassifier*
* apply the classifier
* train the *NaiveBayesClassifier* on your own data

**If you want to learn more, you might find the following links useful:**
* http://www.nltk.org/api/nltk.sentiment.html#module-nltk.sentiment.util

In [1]:
from random import choice
import nltk
from nltk import NaiveBayesClassifier

## Inspect movie reviews

We are going to use the [movie reviews](http://www.cs.cornell.edu/people/pabo/movie-review-data/) ([README](http://www.cs.cornell.edu/people/pabo/movie-review-data/poldata.README.2.0.txt)) dataset, which we first inspect.

In [2]:
from nltk.corpus import movie_reviews

Which sentiment categories are in the dataset?

In [3]:
categories = movie_reviews.categories()
print(categories)

['neg', 'pos']


There are just two. A review is marked either **positive** or **negative**.

How many positive and negative reviews are there in the dataset?

In [4]:
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

In [5]:
print('number of negative reviews', len(negids))
print('number of positive reviews', len(posids))

number of negative reviews 1000
number of positive reviews 1000


Ok, the dataset is balanced, meaning that each category has the same number of instances!

Let's also look at an example of a positive and a negative movie review.

In [6]:
random_negid = choice(negids)
negative_example = movie_reviews.words(fileids=[random_negid])
print('random negative file id', random_negid)
print('first 30 words of review', negative_example[:30])

random negative file id neg/cv335_16299.txt
first 30 words of review ['the', 'most', 'interesting', 'thing', 'about', 'virus', 'is', 'that', 'the', 'title', 'of', 'the', 'film', 'does', 'not', 'refer', 'to', 'the', 'clunky', 'robotic', 'animals', 'that', 'try', 'to', 'kill', 'our', 'heroes', '.', 'alas', ',']


In [7]:
random_posid = choice(posids)
positive_example = movie_reviews.words(fileids=[random_posid])
print('random positive file id', random_posid)
print('first 30 words of review', positive_example[:30])

random positive file id pos/cv908_16009.txt
first 30 words of review ['i', 'actually', 'am', 'a', 'fan', 'of', 'the', 'original', '1961', 'or', 'so', 'live', '-', 'action', '-', 'disney', 'flick', 'of', 'the', 'same', 'name', 'starring', 'hayley', 'mills', 'twice', 'as', 'a', 'pair', 'of', 'twins']


## Extracting features from training data
We show how to train the classifier using the simplest feature: the words.
Our feature representation will be a dictionary, for which we use the following function.
Obviously, more complex features are possible, but for now, we focus on a simple feature for the sake of clarity.

In [8]:
def word_feats(words):
    return {word: True for word in words}

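For each movie review, we are going to extract its features, i.e., its words. Together, the words form a bag-of-words representation of the review: word order and word frequency are discarded, and only the presence of a word is recorded.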
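To make the effect of this representation concrete, here is a minimal sketch that applies `word_feats` to a made-up token list (the tokens are an illustrative assumption, not taken from the corpus). Repeated words collapse into a single feature.

```python
def word_feats(words):
    # same definition as in the notebook: every distinct word becomes a True-valued feature
    return {word: True for word in words}

# made-up example review, for illustration only
tokens = ['a', 'good', 'film', 'is', 'a', 'good', 'thing']
features = word_feats(tokens)

print(features)
print(len(tokens), 'tokens ->', len(features), 'features')
```

Seven tokens yield five features, because `a` and `good` each occur twice.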

In [9]:
negfeats = []
label = 'neg'
for neg_fileid in negids:
    features = word_feats(movie_reviews.words(fileids=[neg_fileid]))
    negfeats.append((features, label))

In [10]:
posfeats = []
label = 'pos'
for pos_fileid in posids:
    features = word_feats(movie_reviews.words(fileids=[pos_fileid]))
    posfeats.append((features, label))

Let's inspect a training example.

In [None]:
example_negfeat = negfeats[0]

In [None]:
print(type(example_negfeat), len(example_negfeat))

So it's a tuple of length 2.

The first element in the tuple is a dictionary containing all the words from the movie review.
The second element is the sentiment category annotated for the movie review.

In [None]:
counter = 10
features, label = example_negfeat
print('label', label)
print('features', type(features))
print()
# print only the first `counter` features
for index, (word, boolean) in enumerate(features.items()):
    if index == counter:
        break
    print(word, boolean)

**Question**: Would you include all of the features as shown above?
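One possible direction, sketched under assumptions: function words and punctuation occur in almost every review and carry little sentiment, so they could be filtered out. The stopword list below is a small hand-picked set used only for illustration, and `filtered_word_feats` is a hypothetical helper, not part of NLTK (NLTK also ships a stopwords corpus that could be used instead).

```python
import string

# Hypothetical, hand-picked stopword list for illustration only
SMALL_STOPWORDS = {'the', 'a', 'an', 'of', 'is', 'to', 'and', 'that', 'in', 'it'}

def filtered_word_feats(words, stopwords=SMALL_STOPWORDS):
    """Like word_feats, but skips stopwords and tokens made up only of punctuation."""
    features = {}
    for word in words:
        lowered = word.lower()
        if lowered in stopwords:
            continue
        if all(char in string.punctuation for char in lowered):
            continue
        features[lowered] = True
    return features

example = ['the', 'title', 'of', 'the', 'film', 'does', 'not', 'refer', 'to', 'the', 'robots', '.']
filtered = filtered_word_feats(example)
print(filtered)
```

Whether this helps is an empirical question: words such as `not` are often treated as stopwords, yet they matter for sentiment, so the list should be chosen with care.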

To train and evaluate the classifier, we need a training part and a test part.
We will use 80% of both the positive and negative reviews for training and the remaining 20% for testing.

In [None]:
perc_training = 0.8
perc_test = 0.2

In [None]:
# use the percentage defined above instead of a hard-coded value
negcutoff = int(len(negfeats) * perc_training)
poscutoff = int(len(posfeats) * perc_training)

print(negcutoff)
print(poscutoff)

In [None]:
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

Great, we have **inspected** the data, **extracted features** from it, and split the data into training and test!
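The split above takes the first 80% of each category in file order. A minimal sketch of a variant that shuffles each category first, so that the split does not depend on file order, could look as follows. The helper name `split_per_class`, the seed, and the toy data are assumptions made for illustration.

```python
import random

def split_per_class(featuresets, train_fraction=0.8, seed=42):
    """Shuffle one category's instances reproducibly and split them into train and test."""
    shuffled = list(featuresets)           # copy, so the input list is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]

# toy stand-ins for negfeats and posfeats
toy_neg = [({'negword%d' % i: True}, 'neg') for i in range(10)]
toy_pos = [({'posword%d' % i: True}, 'pos') for i in range(10)]

neg_train, neg_test = split_per_class(toy_neg)
pos_train, pos_test = split_per_class(toy_pos)
toy_train = neg_train + pos_train
toy_test = neg_test + pos_test

print('train on %d instances, test on %d instances' % (len(toy_train), len(toy_test)))
```

Splitting each category separately keeps the training and test parts balanced, just like the slicing used in the notebook.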

## Training and evaluating classifier
Most of the work has been done. All that is left is to call the `train` method of the classifier.

In [None]:
movie_review_classifier = NaiveBayesClassifier.train(trainfeats)

NLTK has a method to evaluate the performance on the test data.

In [None]:
print('accuracy:', nltk.classify.util.accuracy(movie_review_classifier, testfeats))

Using only words as features already yields a high accuracy!
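Accuracy is the fraction of test instances for which the predicted label equals the annotated label. A minimal sketch of that computation, using a hypothetical rule-based `toy_predict` function and made-up test instances instead of the trained classifier:

```python
def manual_accuracy(predict, labeled_featuresets):
    """Fraction of (features, gold_label) pairs for which predict(features) == gold_label."""
    correct = sum(1 for features, gold in labeled_featuresets if predict(features) == gold)
    return correct / len(labeled_featuresets)

def toy_predict(features):
    # hypothetical stand-in for a classifier: 'pos' whenever the word 'good' is present
    return 'pos' if 'good' in features else 'neg'

toy_testfeats = [
    ({'good': True, 'film': True}, 'pos'),
    ({'bad': True, 'film': True}, 'neg'),
    ({'good': True, 'idea': True, 'wasted': True}, 'neg'),  # misclassified by toy_predict
    ({'dull': True}, 'neg'),
]

toy_accuracy = manual_accuracy(toy_predict, toy_testfeats)
print('accuracy:', toy_accuracy)  # 3 of 4 correct
```

With the classifier trained in this notebook, `movie_review_classifier.classify` could be passed as `predict`.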

We can gain more insight by listing the features that best separate the two categories:

In [None]:
movie_review_classifier.show_most_informative_features()
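Roughly speaking, a feature is informative when it is much more likely under one label than under the other. The sketch below ranks words by the ratio of smoothed estimates of P(word present | label). The add-0.5 smoothing and the helper name `likelihood_ratios` are assumptions for illustration; the numbers only approximate what NLTK computes internally.

```python
from collections import Counter

def likelihood_ratios(labeled_featuresets, smoothing=0.5):
    """Rank words by how much more likely they are under one label than under another."""
    doc_counts = Counter()   # number of documents per label
    word_counts = {}         # label -> Counter of documents containing each word
    for features, label in labeled_featuresets:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(features.keys())

    labels = sorted(doc_counts)
    vocabulary = set().union(*(word_counts[label] for label in labels))
    ratios = {}
    for word in vocabulary:
        # smoothed estimate of P(word present | label); two outcomes: present or absent
        probs = {
            label: (word_counts[label][word] + smoothing) / (doc_counts[label] + 2 * smoothing)
            for label in labels
        }
        best = max(probs, key=probs.get)
        worst = min(probs, key=probs.get)
        ratios[word] = (best, probs[best] / probs[worst])
    # most uneven distribution first
    return sorted(ratios.items(), key=lambda item: item[1][1], reverse=True)

toy_data = [
    ({'great': True, 'film': True}, 'pos'),
    ({'great': True, 'plot': True}, 'pos'),
    ({'bad': True, 'film': True}, 'neg'),
    ({'bad': True, 'plot': True}, 'neg'),
]
ranked = likelihood_ratios(toy_data)
for word, (label, ratio) in ranked:
    print('%-6s favours %s with ratio %.1f' % (word, label, ratio))
```

In this toy data, `great` and `bad` each appear in only one category and rank highest, while `film` and `plot` appear equally often in both and get a ratio of 1.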

## Applying our sentiment analyzer to unseen text
Our classifier predicts a single label for a whole text, based on the words that occur in it. Here is a code snippet that tokenizes a sentence, converts the tokens into features, and classifies the result.

In [None]:
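testsentence = "Awesome eggs, I do not liked them"
# the movie review corpus is lowercased, so we lowercase the input to match the training features
words = nltk.word_tokenize(testsentence.lower())
predicted_class = movie_review_classifier.classify(word_feats(words))
print('testsentence', predicted_class)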

## Training and testing with your own data
Below we show how you can train and test with your own data in a simple way.

In [None]:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier

# simple function that turns a list of words into word_feats (word features)
def word_feats(words):
    return dict([(word, True) for word in words])

# In a lexical approach, you would predefine the positive, negative and neutral words and only use these to train a classifier
positive_vocab = ['awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)']
negative_vocab = ['bad', 'terrible', 'useless', 'hate', ':(']
neutral_vocab = ['movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not']

# Assume you have a collection of texts that are negative, positive and neutral
negsentence = "I do not like green eggs and ham, and I do not like them too!"
possentence = "I like green eggs and ham, and I like them too!"
neusentence = "It exists and it is, there why would it be?"

# By using the tokenization function, you can turn them into lists of negative, positive and neutral words
negtokens = nltk.word_tokenize(negsentence)
postokens = nltk.word_tokenize(possentence)
neutokens = nltk.word_tokenize(neusentence)

# Next we use the simple word feature function to turn them into features that can be used for training the classifier 
positive_features = []
negative_features = []
neutral_features = []

for information in [positive_vocab, postokens]:
    positive_features.append((word_feats(information), 'pos'))
    
for information in [negative_vocab, negtokens]:
    negative_features.append((word_feats(information), 'neg'))

for information in [neutral_vocab, neutokens]:   
    neutral_features.append((word_feats(information), 'neu'))

**Question**: What would be another way to obtain neutral word features?

**Question**: How would you do this for a data set where positive and negative texts are stored in two separate directories?
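One way to approach the directory case, sketched under assumptions: read every text file in a directory, tokenize it, and attach the label that belongs to that directory. The folder names `pos` and `neg`, the `.txt` extension, the whitespace tokenization, and the helper `load_labeled_texts` are all illustrative choices, not part of NLTK. The demo writes two tiny files to a temporary directory so that it runs on its own.

```python
from pathlib import Path
import tempfile

def load_labeled_texts(directory, label):
    """Read every .txt file in `directory` and pair its word features with `label`."""
    examples = []
    for path in sorted(Path(directory).glob('*.txt')):
        # simple lowercased whitespace tokenization; nltk.word_tokenize could be used instead
        tokens = path.read_text(encoding='utf-8').lower().split()
        examples.append(({token: True for token in tokens}, label))
    return examples

# Demo with a throwaway directory tree
with tempfile.TemporaryDirectory() as root:
    for label, text in [('pos', 'I like green eggs'), ('neg', 'I do not like them')]:
        folder = Path(root) / label
        folder.mkdir()
        (folder / 'review1.txt').write_text(text, encoding='utf-8')

    labeled = (load_labeled_texts(Path(root) / 'pos', 'pos')
               + load_labeled_texts(Path(root) / 'neg', 'neg'))

print(labeled)
```

The resulting list has the same `(features, label)` structure as the feature lists used elsewhere in this notebook, so it can be split and passed to `NaiveBayesClassifier.train` in the same way. NLTK's `CategorizedPlaintextCorpusReader` is another option worth exploring for this layout.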

In [None]:
perc_training = 0.8
perc_test = 0.2

training = []
test = []

for feature_set in [negative_features, neutral_features, positive_features]:
    num_items = len(feature_set)
    cutoff = int(num_items * perc_training)
    training_part = feature_set[:cutoff]
    test_part = feature_set[cutoff:]
    print(num_items, len(training_part), len(test_part))
    
    training.extend(training_part)
    test.extend(test_part)

In [None]:
classifier = NaiveBayesClassifier.train(training)

In [None]:
print('accuracy:', nltk.classify.util.accuracy(classifier, test))

In [None]:
classifier.show_most_informative_features()

In [None]:
classifier.labels()

In [None]:
testsentence = "Awesome eggs, I do not liked them"
# the training vocabulary is lowercase ('awesome'), so we lowercase the input to match it
words = nltk.word_tokenize(testsentence.lower())
predicted_class = classifier.classify(word_feats(words))
print('testsentence', predicted_class)