# Lab 2: Subjectivity analysis using NLTK

You are going to create a subjectivity classifier from data provided in NLTK.

## Background reading

* http://www.nltk.org/api/nltk.sentiment.html#module-nltk.sentiment.util
* https://www.nltk.org/book/ch06.html
    * section 6.1
    * section 6.3

## Creating the datasets (subjective and objective sentences)

In [1]:
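# Import NLTK, the classifier, and the sentiment-analysis helpers
import nltk
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentAnalyzer
# mark_negation and extract_unigram_feats are both used later in this lab
from nltk.sentiment.util import mark_negation, extract_unigram_feats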




We will first obtain the subjectivity corpus that is included in NLTK.

In [2]:
from nltk.corpus import subjectivity

From this data set we are going to select 200 sentences (100 subjective and 100 objective) for training and testing.
The method subjectivity.sents returns tokenised sentences; its categories argument selects the subjective ('subj') or the objective ('obj') ones.

In [3]:
n_instances = 100
subj_docs = [(sent, 'subj') for sent in subjectivity.sents(categories='subj')[:n_instances]]
obj_docs = [(sent, 'obj') for sent in subjectivity.sents(categories='obj')[:n_instances]]
len(subj_docs), len(obj_docs)

(100, 100)

The data is now balanced. Why is this important for a NaiveBayesClassifier? 
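One way to approach the question above is to look at how class priors come out of label counts. The following is a minimal sketch, not NLTK's implementation: `class_priors` is a hypothetical helper that uses plain relative frequencies, whereas NLTK's classifier applies smoothing, so its exact values may differ slightly.

```python
from collections import Counter


def class_priors(labels):
    """Estimate P(label) as the relative frequency of each label (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Balanced data, as in this lab: 100 sentences per class
balanced = ['subj'] * 100 + ['obj'] * 100
# A made-up skewed data set for comparison
skewed = ['subj'] * 180 + ['obj'] * 20

print(class_priors(balanced))
print(class_priors(skewed))
```

With the skewed labels the prior alone already favours 'subj', so a sentence with little evidence either way would tend to be labelled 'subj'.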

Each document is represented by a tuple of the form (sentence, label). The sentence is tokenised, so it is a list of strings; the label is either 'subj' or 'obj'.

In [4]:
subj_docs[50]

(["there's",
  'lots',
  'of',
  'cool',
  'stuff',
  'packed',
  'into',
  "espn's",
  'ultimate',
  'x',
  '.'],
 'subj')

Subjective and objective instances are split separately, so that the class distribution stays balanced in both the train and the test set. For each class, we take the first 80 sentences as train and the last 20 sentences as test. We then concatenate the subjective and objective sets.

In [5]:
train_subj_docs = subj_docs[:80]
test_subj_docs = subj_docs[80:100]
train_obj_docs = obj_docs[:80]
test_obj_docs = obj_docs[80:100]
training_docs = train_subj_docs+train_obj_docs
testing_docs = test_subj_docs+test_obj_docs
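The per-class split above can also be written as a small function, which makes it easier to change the split size later. This is a sketch on toy data: `split_per_class` and the toy documents are hypothetical and not part of NLTK.

```python
def split_per_class(docs_by_label, n_train):
    """Split each class separately so train and test keep the same class ratio."""
    train, test = [], []
    for label, docs in docs_by_label.items():
        train.extend(docs[:n_train])   # first n_train documents of this class
        test.extend(docs[n_train:])    # the remaining documents of this class
    return train, test


# Toy documents in the same (tokens, label) form as subj_docs and obj_docs
toy = {
    'subj': [(['s{0}'.format(i)], 'subj') for i in range(5)],
    'obj': [(['o{0}'.format(i)], 'obj') for i in range(5)],
}
train, test = split_per_class(toy, n_train=4)
print(len(train), len(test))
```

In the lab itself the equivalent call would pass the two lists of 100 documents with `n_train=80`.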

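We now initialize a SentimentAnalyzer and apply the mark_negation function to the training documents. mark_negation is a utility function that appends _NEG to the words that follow a negation, up to the next clause-level punctuation mark, because a negation can switch the polarity of those words.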

In [6]:
sentim_analyzer = SentimentAnalyzer()
all_words_neg = sentim_analyzer.all_words([mark_negation(doc) for doc in training_docs])
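To see what negation marking does to a sentence, here is a simplified pure-Python approximation. It is not NLTK's code: `simple_mark_negation`, the `NEGATIONS` word list and the `CLAUSE_PUNCT` set are assumptions made for illustration, and the real mark_negation uses a richer pattern and has extra options.

```python
# Assumed, deliberately small lists; NLTK's own patterns are more complete
NEGATIONS = {"not", "no", "never", "cannot"}
CLAUSE_PUNCT = {".", ":", ";", "!", "?"}


def simple_mark_negation(tokens):
    """Append _NEG to tokens between a negation and the next clause punctuation."""
    marked = []
    in_scope = False
    for tok in tokens:
        if tok in CLAUSE_PUNCT:
            in_scope = False            # punctuation closes the negation scope
            marked.append(tok)
        elif tok.lower() in NEGATIONS or tok.lower().endswith("n't"):
            in_scope = True             # a negation opens the scope
            marked.append(tok)
        elif in_scope:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
    return marked


sent = "I didn't like this movie . It was bad .".split()
print(simple_mark_negation(sent))
# ['I', "didn't", 'like_NEG', 'this_NEG', 'movie_NEG', '.', 'It', 'was', 'bad', '.']
```

Marked tokens such as like_NEG are counted as separate words, which is why features like contains(the_NEG) can appear in the vocabulary.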

Simple unigram word features are then used, handling negation:

In [7]:
unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, min_freq=4)
len(unigram_feats)

83

In [13]:
# Show the first 10
unigram_feats[0:10]

['.', 'the', ',', 'a', 'and', 'of', 'to', 'is', 'in', 'with']
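The frequency filter behind this vocabulary can be sketched with a Counter. `frequent_unigrams` is a hypothetical stand-in for unigram_word_feats; it treats min_freq as a strict lower bound, which is an assumption worth checking against the NLTK source.

```python
from collections import Counter


def frequent_unigrams(words, min_freq=0):
    """Return words occurring more than min_freq times, most frequent first."""
    counts = Counter(words)
    return [word for word, count in counts.most_common() if count > min_freq]


# Toy word list: 'the' x6, 'film' x5, 'plot' x4, 'odd' x1
toy_words = ['the'] * 6 + ['film'] * 5 + ['plot'] * 4 + ['odd']
print(frequent_unigrams(toy_words, min_freq=4))
# ['the', 'film']
```

Because the list is ordered by frequency, the first entries are dominated by punctuation and function words, as in the output above.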

In [14]:
sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

Then, the features are applied to obtain a feature-value representation of the datasets.

In [15]:
training_set = sentim_analyzer.apply_features(training_docs)
test_set = sentim_analyzer.apply_features(testing_docs)
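The feature-value representation can be pictured as one boolean per vocabulary word. The sketch below mimics that idea on toy data; `contains_features`, the vocabulary and the sentence are hypothetical and only illustrate the shape of the result, not NLTK's internals.

```python
def contains_features(tokens, vocabulary):
    """Map every vocabulary word to True/False depending on whether the sentence has it."""
    present = set(tokens)
    return {'contains({0})'.format(word): word in present for word in vocabulary}


# Toy vocabulary and sentence, made up for this example
vocab = ['.', 'the', 'film', 'story', 'the_NEG']
sentence = ['the', 'plot', 'is', 'thin', '.']

feats = contains_features(sentence, vocab)
print(feats)
print(sum(feats.values()), 'of', len(feats), 'features are True')
```

Every sentence gets an entry for every vocabulary word, while a single sentence only contains a handful of them.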

Check out the feature representation of the test_set. Do you understand what it represents? Why are so many features False?

In [20]:
# Show the first test instance
test_set[0]

({'contains(.)': True,
  'contains(the)': True,
  'contains(,)': False,
  'contains(a)': True,
  'contains(and)': False,
  'contains(of)': True,
  'contains(to)': False,
  'contains(is)': False,
  'contains(in)': False,
  'contains(with)': True,
  'contains(it)': False,
  'contains(that)': False,
  'contains(his)': False,
  'contains(on)': False,
  'contains(for)': True,
  'contains(an)': False,
  'contains(who)': False,
  'contains(by)': False,
  'contains(he)': False,
  'contains(from)': False,
  'contains(her)': False,
  'contains(")': False,
  'contains(film)': False,
  'contains(as)': False,
  'contains(this)': False,
  'contains(movie)': False,
  'contains(their)': False,
  'contains(but)': False,
  'contains(one)': False,
  'contains(at)': False,
  'contains(about)': False,
  'contains(the_NEG)': False,
  'contains(a_NEG)': False,
  'contains(to_NEG)': False,
  'contains(are)': False,
  "contains(there's)": False,
  'contains(()': False,
  'contains(story)': False,
  'contains(w

At this stage, we are ready to train our classifier on the training set, and output the evaluation results:

In [21]:
trainer = NaiveBayesClassifier.train
classifier = sentim_analyzer.train(trainer, training_set)
# output: Training classifier
for key,value in sorted(sentim_analyzer.evaluate(test_set).items()):
    print('{0}: {1}'.format(key, value))

Training classifier
Evaluating NaiveBayesClassifier results...
Accuracy: 0.8
F-measure [obj]: 0.8
F-measure [subj]: 0.8
Precision [obj]: 0.8
Precision [subj]: 0.8
Recall [obj]: 0.8
Recall [subj]: 0.8
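To make the reported metrics concrete, the sketch below computes accuracy, precision, recall and F-measure by hand on a made-up set of predictions. `per_label_scores` and the toy labels are hypothetical; they are not the lab's actual predictions.

```python
def per_label_scores(gold, predicted, label):
    """Return (precision, recall, f_measure) for one label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)  # correctly found
    fp = sum(1 for g, p in pairs if g != label and p == label)  # wrongly assigned
    fn = sum(1 for g, p in pairs if g == label and p != label)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure


# Toy data: 5 sentences per class, one mistake in each class
gold = ['subj'] * 5 + ['obj'] * 5
predicted = ['subj'] * 4 + ['obj'] + ['obj'] * 4 + ['subj']

accuracy = sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)
print('Accuracy: {0:.2f}'.format(accuracy))
for label in ('obj', 'subj'):
    p, r, f = per_label_scores(gold, predicted, label)
    print('{0}: precision={1:.2f} recall={2:.2f} f={3:.2f}'.format(label, p, r, f))
```

With 20 test sentences per class, a recall of 0.8 corresponds to 16 of the 20 sentences of that class being labelled correctly.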
