Sentiment analysis with NLTK:

http://www.nltk.org/api/nltk.sentiment.html#module-nltk.sentiment.util

https://www.nltk.org/book/ch06.html


In the next assignment you are going to work with a variety of sentiment analysis tools. To test these tools, you need to make a test set of 60 tweets. Divide these into 3 sets: positive, negative and neutral. Describe how you obtained the tweets and how they were divided.

## We first are going to use the VADER package inside NLTK to get the sentiment for some input texts
More information on VADER can be found in the original source code repository:

https://github.com/cjhutto/vaderSentiment

In [1]:
import nltk
from nltk import sentiment
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import word_tokenize
nltk.download('vader_lexicon')
nltk.download('punkt')



[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/piek/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /Users/piek/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Initialize VADER so we can use it within our Python script.

In [2]:
vader = SentimentIntensityAnalyzer()

Initialize a tokenizer to split a text into sentences

In [3]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

We take an arbitrary text and split it into sentences

In [4]:
#define some text
sometext = "Here are my sentences. It's a nice day. It's a rainy day." 
sentences = tokenizer.tokenize(sometext)
sentences

['Here are my sentences.', "It's a nice day.", "It's a rainy day."]

The next for loop assigns a sentiment score from VADER to each sentence

In [5]:
for sentence in sentences:
    print(sentence)
    scores = vader.polarity_scores(sentence)
    for key in sorted(scores):
        print('{0}: {1}, '.format(key, scores[key]), end='')
        print()

Here are my sentences.
compound: 0.0516, 
neg: 0.0, 
neu: 0.714, 
pos: 0.286, 
It's a nice day.
compound: 0.4215, 
neg: 0.0, 
neu: 0.417, 
pos: 0.583, 
It's a rainy day.
compound: -0.0772, 
neg: 0.394, 
neu: 0.606, 
pos: 0.0, 
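
To turn these scores into a positive/negative/neutral label for a tweet, one common convention (suggested in the VADER repository README, not enforced by NLTK) is to threshold the compound score at +0.05 and -0.05. The following is a minimal sketch under that assumption. The function name `label_from_compound` is hypothetical, and the scores are copied from the output above so the example runs without the lexicon.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label.

    The 0.05 threshold is a convention, not a fixed rule;
    tune it on your own data.
    """
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Compound scores copied from the output above
examples = {
    "Here are my sentences.": 0.0516,
    "It's a nice day.": 0.4215,
    "It's a rainy day.": -0.0772,
}
for sentence, compound in examples.items():
    print(sentence, '->', label_from_compound(compound))
```

In the notebook you would pass `vader.polarity_scores(text)['compound']` to the function instead of a hard-coded number.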


## Assignment
Now run the VADER package on your tweet test set and report on the result.

Though sentiment analysis can be a powerful tool for quickly determining the emotions expressed through text, there are limitations to what sentiment analysis can provide. Additionally, like all text analysis, we need to be cautious in interpreting the results. For example, sentences that contain profanity have a tendency to be interpreted by NLTK as negative; this can be a problem when using texts from social media, where profanity is often used for emphasis.


PIEK: This is probably too ambitious
Download the VADER package from the original GitHub repository and try to build a local installation.
Modify the lexicon and try to run it.

## Train a NaiveBayesClassifier with 

https://www.nltk.org/book/ch06.html

* section 6.1
* section 6.3

In [1]:
# Loading stuff
import nltk
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import * # needed for the mark_negation function




## Creating the datasets (subjective and objective sentences)

We will first obtain the subjectivity corpus that is included in NLTK.

In [2]:
from nltk.corpus import subjectivity

From this data set we are going to select 200 sentences for training and testing.
The function subjectivity.sents returns the sentences of a given category: subjective ('subj') or objective ('obj').

In [3]:
n_instances = 100
subj_docs = [(sent, 'subj') for sent in subjectivity.sents(categories='subj')[:n_instances]]
obj_docs = [(sent, 'obj') for sent in subjectivity.sents(categories='obj')[:n_instances]]
len(subj_docs), len(obj_docs)

(100, 100)

The data is now balanced. Why is this important for a NaiveBayesClassifier? 

Each document is represented by a tuple of the form (sentence, label). The sentence is tokenised, so it is represented by a list of strings. The label is 'subj' or 'obj'.

In [4]:
subj_docs[50]

(["there's",
  'lots',
  'of',
  'cool',
  'stuff',
  'packed',
  'into',
  "espn's",
  'ultimate',
  'x',
  '.'],
 'subj')

Subjective and objective instances were split separately, to keep a balanced uniform class distribution in both train and test sets. We create the train and test set by taking the first 80 sentences as train and the last 20 sentences as test. We then concatenate the subjective and objective sets.

In [5]:
train_subj_docs = subj_docs[:80]
test_subj_docs = subj_docs[80:100]
train_obj_docs = obj_docs[:80]
test_obj_docs = obj_docs[80:100]
training_docs = train_subj_docs+train_obj_docs
testing_docs = test_subj_docs+test_obj_docs
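
The per-class slicing above can be generalised to any number of labels. The sketch below is one possible way to do that, not part of NLTK. The helper name `stratified_split` and the toy documents are assumptions made for illustration.

```python
from collections import defaultdict


def stratified_split(docs, train_fraction=0.8):
    """Split (item, label) pairs so each label keeps the same train/test ratio."""
    by_label = defaultdict(list)
    for item, label in docs:
        by_label[label].append((item, label))

    train, test = [], []
    for label, items in by_label.items():
        cutoff = int(len(items) * train_fraction)
        train.extend(items[:cutoff])   # first part of each class for training
        test.extend(items[cutoff:])    # remainder of each class for testing
    return train, test


# Toy documents standing in for subj_docs + obj_docs
toy_docs = [(['sentence', str(i)], 'subj') for i in range(100)]
toy_docs += [(['sentence', str(i)], 'obj') for i in range(100)]

train, test = stratified_split(toy_docs)
print(len(train), len(test))
```

Like the slicing in the cell above, this takes the items in corpus order. If the corpus is sorted in some meaningful way, shuffling each class first may give a more representative split.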

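We now initialize a SentimentAnalyzer and apply the mark_negation function to the training documents. mark_negation is a utility function that appends _NEG to the words that follow a negation, up to the next clause-level punctuation mark. Words inside the scope of a negation, whose polarity may be switched, thereby become separate features.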

In [6]:
sentim_analyzer = SentimentAnalyzer()
all_words_neg = sentim_analyzer.all_words([mark_negation(doc) for doc in training_docs])
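
To get a feel for what negation marking does, here is a simplified pure-Python sketch of the idea. It is not NLTK's implementation: the negation word list is a small hand-picked assumption (NLTK uses a regular expression that covers more forms), and the function name `mark_negation_sketch` is hypothetical.

```python
# Small illustrative lists; NLTK's own patterns cover more cases
NEGATIONS = {'not', 'no', 'never', "n't", 'cannot'}
CLAUSE_PUNCT = {'.', ':', ';', '!', '?'}


def mark_negation_sketch(tokens):
    """Append _NEG to tokens between a negation word and the next clause punctuation."""
    marked = []
    in_scope = False
    for token in tokens:
        if token.lower() in NEGATIONS and not in_scope:
            in_scope = True              # the negation itself is left unmarked
            marked.append(token)
        elif token in CLAUSE_PUNCT:
            in_scope = False             # clause punctuation closes the scope
            marked.append(token)
        elif in_scope:
            marked.append(token + '_NEG')
        else:
            marked.append(token)
    return marked


tokens = "i did not like the movie , but the music was great .".split()
print(mark_negation_sketch(tokens))
```

Note that a comma does not close the scope in this sketch, so tokens such as `,_NEG` and `but_NEG` are produced. Tokens of that shape also show up in the unigram feature list further down.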

Simple unigram word features are then used, handling negation:

In [7]:
unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, min_freq=4)
len(unigram_feats)

83

In [8]:
unigram_feats

['.',
 'the',
 ',',
 'a',
 'and',
 'of',
 'to',
 'is',
 'in',
 'with',
 'it',
 'that',
 'his',
 'on',
 'for',
 'an',
 'who',
 'by',
 'he',
 'from',
 'her',
 '"',
 'film',
 'as',
 'this',
 'movie',
 'their',
 'but',
 'one',
 'at',
 'about',
 'the_NEG',
 'a_NEG',
 'to_NEG',
 'are',
 "there's",
 '(',
 'story',
 'when',
 'so',
 'be',
 ',_NEG',
 ')',
 'they',
 'you',
 'not',
 'have',
 'like',
 'will',
 'all',
 'into',
 'out',
 'she',
 'what',
 'life',
 'has',
 'its',
 'only',
 'more',
 'even',
 '--',
 ':',
 'can',
 ';',
 'home',
 'look',
 "it's",
 'if',
 'where',
 'most',
 'him',
 'search',
 'but_NEG',
 'love',
 'both',
 'make',
 'begins',
 'some',
 'two',
 'of_NEG',
 'made',
 'which',
 'them']
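
The selection of unigram features by frequency can be sketched with a plain `Counter`. This is an approximation written for illustration: the name `unigram_word_feats_sketch` is hypothetical, and the strict "more than `min_freq`" comparison is an assumption about how the NLTK method behaves that is worth checking against its source.

```python
from collections import Counter


def unigram_word_feats_sketch(words, min_freq=0, top_n=None):
    """Return words ordered by frequency, keeping those seen more than min_freq times."""
    counts = Counter(words)
    return [word for word, count in counts.most_common(top_n) if count > min_freq]


# Toy word list; in the notebook this would be all_words_neg
words = ['the', 'film', 'the', 'not', 'the', 'film', 'plot_NEG']
print(unigram_word_feats_sketch(words, min_freq=1))
```

With such a threshold, very frequent function words and punctuation dominate the list, which is consistent with the features printed above.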

In [9]:
sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

Then, features are applied to obtain a feature-value representation of the datasets

In [10]:
training_set = sentim_analyzer.apply_features(training_docs)
test_set = sentim_analyzer.apply_features(testing_docs)

Check out the feature representation of the test_set. Do you understand what it represents? Why are so many features False?

In [11]:
test_set

[({'contains(.)': True, 'contains(the)': True, 'contains(,)': False, 'contains(a)': True, 'contains(and)': False, 'contains(of)': True, 'contains(to)': False, 'contains(is)': False, 'contains(in)': False, 'contains(with)': True, 'contains(it)': False, 'contains(that)': False, 'contains(his)': False, 'contains(on)': False, 'contains(for)': True, 'contains(an)': False, 'contains(who)': False, 'contains(by)': False, 'contains(he)': False, 'contains(from)': False, 'contains(her)': False, 'contains(")': False, 'contains(film)': False, 'contains(as)': False, 'contains(this)': False, 'contains(movie)': False, 'contains(their)': False, 'contains(but)': False, 'contains(one)': False, 'contains(at)': False, 'contains(about)': False, 'contains(the_NEG)': False, 'contains(a_NEG)': False, 'contains(to_NEG)': False, 'contains(are)': False, "contains(there's)": False, 'contains(()': False, 'contains(story)': False, 'contains(when)': False, 'contains(so)': False, 'contains(be)': False, 'contains(,_NEG
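
When inspecting this representation, it may help to see how such a dictionary can be built for a single sentence. The sketch below mimics the `contains(...)` keys shown above. It is an illustration only: the function name `extract_unigram_feats_sketch`, the four-word vocabulary, and the example sentence are assumptions.

```python
def extract_unigram_feats_sketch(document, unigrams):
    """Build one boolean feature per vocabulary word for a tokenised sentence."""
    tokens = set(document)
    return {'contains({0})'.format(word): word in tokens for word in unigrams}


# Tiny vocabulary standing in for the 83 unigram features
unigrams = ['.', 'the', 'film', 'not']
sentence = ['the', 'plot', 'is', 'thin', '.']

for name, value in extract_unigram_feats_sketch(sentence, unigrams).items():
    print(name, value)
```

Every sentence gets a value for every vocabulary word, whether or not the word occurs in it.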

At this stage, we are ready to train our classifier on the training set, and output the evaluation results:

In [12]:
trainer = NaiveBayesClassifier.train
classifier = sentim_analyzer.train(trainer, training_set)
# output: Training classifier
for key,value in sorted(sentim_analyzer.evaluate(test_set).items()):
    print('{0}: {1}'.format(key, value))

Training classifier
Evaluating NaiveBayesClassifier results...
Accuracy: 0.8
F-measure [obj]: 0.8
F-measure [subj]: 0.8
Precision [obj]: 0.8
Precision [subj]: 0.8
Recall [obj]: 0.8
Recall [subj]: 0.8
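
The reported numbers can be reproduced by hand from counts. The test set holds 20 subjective and 20 objective sentences, so a recall of 0.8 for each class implies 16 correct and 4 missed per class. The sketch below works this out for the 'subj' class; the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-measure from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of precision and recall
    return precision, recall, f1


# Counts for the 'subj' class implied by the results above:
# tp = 16 subjective sentences labelled 'subj'
# fn = 4 subjective sentences labelled 'obj'
# fp = 4 objective sentences labelled 'subj'
print(precision_recall_f1(tp=16, fp=4, fn=4))
```

Accuracy follows from the same counts: (16 + 16) correct out of 40 test sentences gives 0.8.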


## Create a positive and negative classifier from movie_reviews

We first are going to load the movie_reviews data set from NLTK

In [29]:
from nltk.corpus import movie_reviews
 
def word_feats(words):
    return dict([(word, True) for word in words])
 
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

print(len(negids), len(posids))


1000 1000


We now have two data sets, one with the files that are negative reviews and one with the files that are positive reviews

In [37]:
#First negative review:
negids[0]

'neg/cv000_29416.txt'

We next are going to extract texts from each sub data set and create tuples with the labels 'neg' and 'pos', where the first element is the feature representation of the words of the review.

In [36]:
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# let's print the first tuple from the negative set
print(negfeats[0])


({'plot': True, ':': True, 'two': True, 'teen': True, 'couples': True, 'go': True, 'to': True, 'a': True, 'church': True, 'party': True, ',': True, 'drink': True, 'and': True, 'then': True, 'drive': True, '.': True, 'they': True, 'get': True, 'into': True, 'an': True, 'accident': True, 'one': True, 'of': True, 'the': True, 'guys': True, 'dies': True, 'but': True, 'his': True, 'girlfriend': True, 'continues': True, 'see': True, 'him': True, 'in': True, 'her': True, 'life': True, 'has': True, 'nightmares': True, 'what': True, "'": True, 's': True, 'deal': True, '?': True, 'watch': True, 'movie': True, '"': True, 'sorta': True, 'find': True, 'out': True, 'critique': True, 'mind': True, '-': True, 'fuck': True, 'for': True, 'generation': True, 'that': True, 'touches': True, 'on': True, 'very': True, 'cool': True, 'idea': True, 'presents': True, 'it': True, 'bad': True, 'package': True, 'which': True, 'is': True, 'makes': True, 'this': True, 'review': True, 'even': True, 'harder': True, 'wr

In [38]:
# Define a split over the data for creating a train and test set
negcutoff = int(len(negfeats)*3/4)
poscutoff = int(len(posfeats)*3/4)

print(negcutoff)
print(poscutoff)

750
750


In [39]:
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

train on 1500 instances, test on 500 instances


In [40]:
classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()

accuracy: 0.728
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0


Try to create your own train and test set with labels SPAM and NOTSPAM or POS and NEG and see if you can train a NaiveBayesClassifier in the same way

You first need to find out how to create a data set with labels for training and testing.
Check out the subjectivity corpus on your local disk that is included in the NLTK download.
On a Mac you can find it below /Users/<your username>, e.g.:

/Users/piek/nltk_data/corpora/subjectivity

On a Windows or Linux machine it is in a slightly different path also in your user directory.

Read the README.txt file that comes with the data:

  * quote.tok.gt9.5000 contains 5000 subjective sentences (or snippets);

  * plot.tok.gt9.5000 contains 5000 objective sentences.
  
https://www.nltk.org/_modules/nltk/corpus/reader/categorized_sents.html
  
In order to create another data set you need to create the tuples consisting of a sentence and a label.
Remember that we used the subjectivity.sents function to load the sentences from the corpus and paired each sentence with its label:

n_instances = 100
subj_docs = [(sent, 'subj') for sent in subjectivity.sents(categories='subj')[:n_instances]]
obj_docs = [(sent, 'obj') for sent in subjectivity.sents(categories='obj')[:n_instances]]

The subjectivity package uses a specific format and function to create the tuples.

In [19]:
#Check out the first two items to see how it is structured
subj_docs[:2]

[(['smart',
   'and',
   'alert',
   ',',
   'thirteen',
   'conversations',
   'about',
   'one',
   'thing',
   'is',
   'a',
   'small',
   'gem',
   '.'],
  'subj'),
 (['color',
   ',',
   'musical',
   'bounce',
   'and',
   'warm',
   'seas',
   'lapping',
   'on',
   'island',
   'shores',
   '.',
   'and',
   'just',
   'enough',
   'science',
   'to',
   'send',
   'you',
   'home',
   'thinking',
   '.'],
  'subj')]

Here is a very simple example that shows how you can create tuples from two simple sentences, turn them into word features and train a NaiveBayesClassifier

In [1]:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
#from nltk.corpus import names

# simple function that turns a list of words into word_feats (word features)
def word_feats(words):
    return dict([(word, True) for word in words])

# In a lexical approach, you would predefine the positive, negative and neutral words and only use these to train a classifier
positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not' ]

# Assume you have a collection of texts that are negative and positive
negsentence = "I do not like green eggs and ham, and I do not like them too!"
possentence = "I like green eggs and ham, and I like them too!"
# By using the tokenization function, you can turn them into lists of negative and positive tokens
negtokens = nltk.word_tokenize(negsentence)
postokens = nltk.word_tokenize(possentence)

# Next we use the simple word feature function to turn them into features that can be used for training the classifier
# Note: each token is passed to word_feats as a single string, so the function iterates over
# its characters and the resulting features are letters (see the printed output), not whole words.
positive_features = [(word_feats(pos), 'pos') for pos in postokens]
negative_features = [(word_feats(neg), 'neg') for neg in negtokens]
# for neutral we now take the vocabulary given above
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]
print(positive_features) 

[({'I': True}, 'pos'), ({'l': True, 'i': True, 'k': True, 'e': True}, 'pos'), ({'g': True, 'r': True, 'e': True, 'n': True}, 'pos'), ({'e': True, 'g': True, 's': True}, 'pos'), ({'a': True, 'n': True, 'd': True}, 'pos'), ({'h': True, 'a': True, 'm': True}, 'pos'), ({',': True}, 'pos'), ({'a': True, 'n': True, 'd': True}, 'pos'), ({'I': True}, 'pos'), ({'l': True, 'i': True, 'k': True, 'e': True}, 'pos'), ({'t': True, 'h': True, 'e': True, 'm': True}, 'pos'), ({'t': True, 'o': True}, 'pos'), ({'!': True}, 'pos')]
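
The output above contains single letters as features because `word_feats` receives one token (a string) at a time and iterates over its characters. The sketch below contrasts that with two word-level alternatives. It reuses the `word_feats` definition from the cell above; the variable names are chosen for illustration only.

```python
def word_feats(words):
    return dict([(word, True) for word in words])


tokens = ['I', 'like', 'green', 'eggs']

# Passing a bare string makes the function iterate over its characters
per_character = [(word_feats(tok), 'pos') for tok in tokens]

# Wrapping each token in a list yields one feature per word
per_word = [(word_feats([tok]), 'pos') for tok in tokens]

# One feature dictionary for the whole sentence
per_sentence = (word_feats(tokens), 'pos')

print(per_character[1])
print(per_word[1])
print(per_sentence)
```

Which variant is appropriate depends on what you want to classify: single words or whole texts. Switching variants changes the trained model, so the results reported further down would no longer apply as printed.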


What would be another way to obtain neutral word features?

How would you do this for a data set where positive and negative texts are stored in two separate directories?

In [2]:
# we simply concatenate the features to create a training set
train_set = negative_features + positive_features + neutral_features
classifier = NaiveBayesClassifier.train(train_set) 

We are going to test this classifier on a single sentence.

In [5]:
neg = 0
pos = 0
testsentence = "Awesome eggs, I do not liked them"
words = nltk.word_tokenize(testsentence)
for word in words:
    classResult = classifier.classify(word_feats(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
 
print("Sentence: '{}'\n--------------\n".format(testsentence))
print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))

Sentence: 'Awesome eggs, I do not liked them'
--------------

Positive: 0.375
Negative: 0.25


Here are some data sets for sentiment analysis
https://www.cs.cornell.edu/people/pabo/movie-review-data/
https://github.com/nltk/nltk/wiki/Sentiment-Analysis

## Assignment
Take one of these data sets and train a NaiveBayesClassifier on it. You need to define a loop to read the texts from each file and get the word features from each. Make sure you concatenate the features and do not lose them along the way.

At the end provide statistics on the data set: how many files per category, how many features.
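
As a starting point, the reading loop and the statistics could look like the sketch below. It assumes one directory of `.txt` files per category, which may not match the layout of the data set you pick. It also splits on whitespace instead of using `nltk.word_tokenize` so that it runs without extra downloads. The helper names `load_labeled_dir` and `dataset_stats` are hypothetical, and the demo files are created in a temporary directory.

```python
import tempfile
from pathlib import Path


def word_feats(words):
    return {word: True for word in words}


def load_labeled_dir(directory, label):
    """Read every .txt file in a directory and pair its word features with a label."""
    labeled = []
    for path in sorted(Path(directory).glob('*.txt')):
        tokens = path.read_text(encoding='utf-8').lower().split()
        labeled.append((word_feats(tokens), label))
    return labeled


def dataset_stats(labeled_feats):
    """Return the number of files per label and the number of distinct features."""
    per_label = {}
    vocabulary = set()
    for feats, label in labeled_feats:
        per_label[label] = per_label.get(label, 0) + 1
        vocabulary.update(feats)
    return per_label, len(vocabulary)


# Demo with a few made-up files in a temporary directory
demo_texts = {'pos': ['a great film', 'great fun'], 'neg': ['a dull film']}
with tempfile.TemporaryDirectory() as root:
    for label, texts in demo_texts.items():
        folder = Path(root) / label
        folder.mkdir()
        for i, text in enumerate(texts):
            (folder / '{0}.txt'.format(i)).write_text(text, encoding='utf-8')
    data = load_labeled_dir(Path(root) / 'pos', 'pos')
    data += load_labeled_dir(Path(root) / 'neg', 'neg')

print(dataset_stats(data))
```

The resulting list has the same shape as `negfeats + posfeats` in the movie_reviews example, so it can be split and passed to `NaiveBayesClassifier.train` in the same way.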

Split the data into a training set and a test set. Train the classifier using the training set and evaluate the classifier on the test set.

Obtain 50 tweets and create a gold data set from these tweets. How would you create the gold data automatically?
Test the classifier on your tweets. Provide a contingency table for it.

Why would it be better to use an independent test set instead of a split over the data set into test and train?
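
For the contingency table asked for above, counting (gold, predicted) pairs is enough. The sketch below uses made-up toy labels, not real results, and the helper names `contingency_table` and `print_table` are hypothetical.

```python
from collections import Counter


def contingency_table(gold, predicted):
    """Count how often each gold label was assigned each predicted label."""
    return Counter(zip(gold, predicted))


def print_table(table, labels):
    """Print rows for gold labels and columns for predicted labels."""
    print('gold \\ predicted'.ljust(18) + ''.join(label.rjust(6) for label in labels))
    for gold_label in labels:
        cells = ''.join(str(table[(gold_label, pred)]).rjust(6) for pred in labels)
        print(gold_label.ljust(18) + cells)


# Toy labels for illustration; replace them with your gold and system labels
labels = ['pos', 'neg', 'neu']
gold = ['pos', 'pos', 'neg', 'neg', 'neu', 'neu']
predicted = ['pos', 'neg', 'neg', 'neg', 'neu', 'pos']

table = contingency_table(gold, predicted)
print_table(table, labels)

# Accuracy is the sum of the diagonal divided by the number of items
accuracy = sum(table[(label, label)] for label in labels) / len(gold)
print('accuracy:', accuracy)
```

The off-diagonal cells show which classes get confused with each other, which a single accuracy figure hides.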

In [7]:
## Not being used
## How can you use the pandas package to read files and create a data set
#import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
#from sklearn.model_selection import train_test_split # function for splitting data to train and test sets