# Workshop Week 6

# Demonstration: Sentiment analysis

In this part of the workshop we will walk through a system that uses the Naive Bayes classifiers of NLTK and scikit-learn to predict the review scores of NLTK's corpus of movie reviews. This corpus is used in [chapter 6 of the NLTK book](http://www.nltk.org/book/ch06.html#document-classification). The corpus contains a selection of movie reviews, each labelled as positive or negative. Classifying reviews as "positive" or "negative" is a task related to [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis).

### Read the movie review corpus

The following code reads the movie reviews corpus and shows some statistics on the labels given to the data.

In [1]:
from nltk.corpus import movie_reviews
movie_reviews.categories()

['neg', 'pos']

In [2]:
print("Number of negative reviews:", len(movie_reviews.fileids('neg')))
print("Number of positive reviews:", len(movie_reviews.fileids('pos')))

Number of negative reviews: 1000
Number of positive reviews: 1000


The following code partitions the movie review corpus into a training and a test set.

In [3]:
import random
documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]
random.seed(1234)
random.shuffle(documents)
threshold1 = int(len(documents)*.6)
threshold2 = int(len(documents)*.8)
train = documents[:threshold1]
devtest = documents[threshold1:threshold2]
test = documents[threshold2:]
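
As a quick check of the partition logic above, here is a minimal sketch that applies the same 60/20/20 thresholds to a toy list of 2000 integers instead of the real documents. The helper name `split_60_20_20` and the `toy_` variables are assumptions made for illustration; they are not part of the workshop code.

```python
import random

def split_60_20_20(items, seed=1234):
    """Shuffle a copy of items and split it into train, devtest and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # local generator, leaves the global seed alone
    threshold1 = int(len(shuffled) * .6)
    threshold2 = int(len(shuffled) * .8)
    return (shuffled[:threshold1],
            shuffled[threshold1:threshold2],
            shuffled[threshold2:])

# Toy data: 2000 integers standing in for the 2000 reviews
toy_train, toy_devtest, toy_test = split_60_20_20(range(2000))
print(len(toy_train), len(toy_devtest), len(toy_test))  # 1200 400 400
```

The three parts do not overlap and together cover every item, so no review can appear in both the training and the evaluation data.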

### Document features

The following code defines a feature set of the 2000 most frequent non-stop words of the training set. The value of each feature is 1 if the word is present in the document, and 0 otherwise. This is what is generally called [one-hot encoding](https://en.wikipedia.org/wiki/One-hot). To build the list we will ignore word casing.

In [6]:
import nltk
import collections
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set gives fast membership tests
c = collections.Counter(w.lower() for (words,category) in train
                        for w in words if w.lower() not in stop)
top2000words = [w for (w,count) in c.most_common(2000)]

In [7]:
def document_features(words):
    "Return the document features for an NLTK classifier"
    words_lower = set(w.lower() for w in words)  # a set gives fast membership tests
    result = dict()
    for w in top2000words:
        result['has(%s)' % w] = (w in words_lower)
    return result
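
To see what these features look like, here is a minimal sketch of the same idea on a toy vocabulary of three words. The function `presence_features` and the `toy_` variables are hypothetical names used only for this illustration; the function takes the vocabulary as a parameter instead of reading `top2000words`.

```python
def presence_features(words, vocabulary):
    """Map each vocabulary word to True or False depending on whether it occurs in words."""
    words_lower = {w.lower() for w in words}   # ignore word casing
    return {'has(%s)' % w: (w in words_lower) for w in vocabulary}

toy_vocabulary = ['plot', 'funny', 'boring']   # stand-in for top2000words
toy_review = ['The', 'PLOT', 'was', 'funny', '.']
print(presence_features(toy_review, toy_vocabulary))
# {'has(plot)': True, 'has(funny)': True, 'has(boring)': False}
```

Words of the review that are not in the vocabulary (here 'The', 'was' and '.') do not produce any feature.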

### NLTK Naive Bayes

The following code trains an NLTK Naive Bayes classifier using the training set, and reports the evaluation results on the training set and the dev-test set.

In [8]:
train_features = [(document_features(x),y) for (x,y) in train]
devtest_features = [(document_features(x),y) for (x,y) in devtest]
classifier = nltk.NaiveBayesClassifier.train(train_features)

In [9]:
nltk.classify.accuracy(classifier,devtest_features)

0.78

In [10]:
nltk.classify.accuracy(classifier,train_features)

0.8883333333333333

The accuracy on the training set (0.888) is higher than on the dev-test set (0.78): the classifier fits the data it was trained on better than data it has not seen.

### Matrix features for sklearn

The following code defines a second feature extractor that uses one-hot encoding on the same list of 2000 words, and which is suitable for sklearn.

In [11]:
def vector_features(words):
    "Return a vector of features for sklearn"
    words_lower = set(w.lower() for w in words)  # a set gives fast membership tests
    result = []
    for w in top2000words:
        if w in words_lower:
            result.append(1)
        else:
            result.append(0)
    return result

### sklearn Naive Bayes

This code generates the features for sklearn and trains and evaluates a multinomial Naive Bayes classifier.

In [12]:
train_vectors = [vector_features(x) for (x,y) in train]
devtest_vectors = [vector_features(x) for (x,y) in devtest]

In [14]:
from sklearn.naive_bayes import MultinomialNB
sklearn_classifier = MultinomialNB()
sklearn_classifier.fit(train_vectors, [y for (x,y) in train])

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [15]:
from sklearn.metrics import accuracy_score

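In [16]:
predictions = sklearn_classifier.predict(devtest_vectors)
# The true labels must come from devtest, the same split we predicted on
true = [y for (x,y) in devtest]
accuracy_score(true,predictions)

In [17]:
predictions = sklearn_classifier.predict(train_vectors)
true = [y for (x,y) in train]
accuracy_score(true,predictions)

0.92249999999999999

Note that the predictions for the dev-test set must be scored against the labels of `devtest`. Scoring them against the labels of `test` instead gives 0.535, which is close to chance for a balanced two-class problem and says nothing about overfitting. Any remaining difference between train and devtest, or between NLTK and sklearn, might be because of differences in the default settings of the implementations.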
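
Accuracy is the fraction of predictions that match the true labels, so both lists must come from the same split and be in the same order. The following minimal sketch computes it by hand on toy labels; the function `accuracy` and the `toy_` lists are illustrative names, not part of sklearn.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of positions where the two label lists agree."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("the two label lists differ in length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

toy_true = ['pos', 'neg', 'pos', 'neg']
toy_predicted = ['pos', 'neg', 'neg', 'neg']
print(accuracy(toy_true, toy_predicted))  # 0.75
```

The length check catches some mistakes, but not the case where the labels come from a different split of the same size, so it is worth reading the variable names carefully.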

### Question

When a classifier scores much higher on the training set than on the dev-test set, it overfits. What do you think you could do to reduce overfitting in sklearn's classifier?

### tfidf with Naive Bayes

The following code computes tf.idf of the training set and uses sklearn's multinomial Naive Bayes classifier.

Note that sklearn's `TfidfVectorizer` takes a list of strings as the input, but in our previous experiments we used the tokenised information, that is, the lists of tokens provided by NLTK. We could use `TfidfTransformer`, which computes tf.idf from a matrix of token counts (see the [sklearn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)), but since we haven't covered it in the lectures, let's use the raw strings as our data.
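
For reference, here is a minimal sketch of the token-based route, assuming we first build a count matrix with `CountVectorizer` and then pass it to `TfidfTransformer`. The toy documents are made up for illustration, and passing a callable as `analyzer` is one possible way to make sklearn accept lists of tokens as they are; the rest of this workshop does not depend on it.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy pre-tokenised documents, standing in for the output of movie_reviews.words
token_docs = [
    ["a", "great", "movie"],
    ["a", "dull", "movie"],
    ["great", "acting", "great", "plot"],
]

# A callable analyzer receives each document unchanged, so returning it
# as-is skips sklearn's own tokenisation and lowercasing
counter = CountVectorizer(analyzer=lambda tokens: tokens)
counts = counter.fit_transform(token_docs)

# TfidfTransformer turns the matrix of counts into a matrix of tf.idf values
token_tfidf = TfidfTransformer().fit_transform(counts)

print(sorted(counter.vocabulary_))  # ['a', 'acting', 'dull', 'great', 'movie', 'plot']
print(token_tfidf.shape)            # (3, 6)
```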

In [18]:
texts = [(movie_reviews.raw(fileid), category)
         for category in movie_reviews.categories()
         for fileid in movie_reviews.fileids(category)]
random.seed(1234)
random.shuffle(texts)
threshold1 = int(len(texts)*.6)
threshold2 = int(len(texts)*.8)
text_train = texts[:threshold1]
text_devtest = texts[threshold1:threshold2]
text_test = texts[threshold2:]

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
# input='content' (the default) means each item is the text itself, not a file name
tfidf = TfidfVectorizer(input='content',stop_words='english',max_features=2000)
text_tfidf_train = tfidf.fit_transform([x for (x,y) in text_train])
text_tfidf_devtest = tfidf.transform([x for (x,y) in text_devtest])

Note that we used `tfidf.fit_transform` on the training set and `tfidf.transform` on the dev-test set. This is because we use the training set to learn the tfidf parameters (the 2000 most frequent words and their IDF), and then we apply those parameters when we compute the tfidf of the dev-test set.
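
The following minimal sketch shows the effect on a toy example with default `TfidfVectorizer` settings. The `small_` variables and the sentences are made up for illustration. A word that appears only in the dev-test text is not part of the learned vocabulary, so it is ignored, and both matrices end up with the same columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

small_train = ["a great movie with great acting",
               "a dull movie with a dull plot"]
small_devtest = ["a surprisingly great plot"]

small_tfidf = TfidfVectorizer()
small_train_matrix = small_tfidf.fit_transform(small_train)    # learns vocabulary and IDF
small_devtest_matrix = small_tfidf.transform(small_devtest)    # reuses what was learnt

print("surprisingly" in small_tfidf.vocabulary_)               # False
print(small_train_matrix.shape, small_devtest_matrix.shape)    # (2, 6) (1, 6)
```

Calling `fit_transform` on the dev-test texts instead would learn a different vocabulary, and the resulting columns would no longer match the ones the classifier was trained on.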

In [20]:
sklearn_tfidfclassifier = MultinomialNB()
sklearn_tfidfclassifier.fit(text_tfidf_train,[y for (x,y) in text_train])

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [21]:
accuracy_score([y for (x,y) in text_devtest], sklearn_tfidfclassifier.predict(text_tfidf_devtest))

0.80500000000000005

In [22]:
accuracy_score([y for (x,y) in text_train], sklearn_tfidfclassifier.predict(text_tfidf_train))

0.91416666666666668

### Question

Compare the devtest and train accuracies obtained with tf.idf (0.805 and 0.914) with those of the previous classifiers. What do the differences mean in terms of overfitting, bias and variance?

# Your Turn

## Naive Bayes on Question Classification

NLTK has a corpus of questions with their label under a particular classification scheme (e.g. `DESC` refers to a question expecting a descriptive answer, such as one starting with "How"; `HUM` refers to a question expecting an answer referring to a human). Here is an example of how to use the corpus:

In [23]:
from nltk.corpus import qc
train = qc.tuples("train.txt")
test = qc.tuples("test.txt")

In [24]:
train[:3]

[('DESC:manner', 'How did serfdom develop in and then leave Russia ?'),
 ('ENTY:cremat', 'What films featured the character Popeye Doyle ?'),
 ('DESC:manner', "How can I find a list of celebrities ' real names ?")]

In [25]:
test[:3]

[('NUM:dist', 'How far is it from Denver to Aspen ?'),
 ('LOC:city', 'What county is Modesto , California in ?'),
 ('HUM:desc', 'Who was Galileo ?')]

### Exercise: All question types

Write Python code that lists all the possible question types of the training set (remember: _never look at the test set_).

### Exercise: All general types

The question types have two parts. The first part describes a general type, and the second part defines a subtype. For example, the question type `DESC:manner` belongs to the general `DESC` type and within that type to the `manner` subtype. Let's focus on the general types only. Write Python code that lists all the possible general types (there are 6 of them).

### Exercise: Feature extractor

Write a feature extractor function that uses individual words as features ("one-hot encoding"). To obtain the list of words, use the 100 most frequent words in the training set (since you aren’t supposed to use the test set to extract features). Note that we do not use a list of stop words now since the questions are very short, and some words such as 'how' are useful for question classification but are listed as stop words.

### Exercise: NLTK Naive Bayes classifier

Train an NLTK Naive Bayes classifier with the features of the training set, and test it on the test set. What accuracy do you obtain?

### Exercise: sklearn Naive Bayes classifier

Convert the feature set to a document matrix suitable for sklearn, and train again using sklearn's Multinomial Naive Bayes classifier. Are the results different?

### Exercise: Majority baseline

What is the accuracy if we use a majority baseline?

# Naive Bayes by Hand

The lecture notes give a walk-through for the computation of Naive Bayes of a document given the following small corpus of training documents. The table shows the frequency of specific keywords in each document of the corpus, together with the document class.

| Doc. | Class | "computer" | "machine" | "spreadsheet" | "money" | "budget" |
| ---- | ----- | ---------- | --------- | ------------- | ------- | -------- |
| d1   | c1    | 10         | 3         | 2             | 0       | 1        |
| d2   | c1    | 4          | 6         | 8             | 2       | 0        |
| d3   | c2    | 2          | 1         | 0             | 5       | 4        |
| d4   | c2    | 1          | 2         | 1             | 8       | 9        |

The lecture notes show how to classify the document with word frequencies (1,2,3,4,5).
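
As a way to check a hand computation, here is a minimal sketch for the document (1,2,3,4,5). It assumes a multinomial Naive Bayes model in which P(word | class) is estimated from the summed keyword counts of each class; the lecture notes may set up the computation differently. The function `log_scores` and its `alpha` parameter are illustrative names, with `alpha=0` for MLE and `alpha=1` for "add-1" smoothing.

```python
import math

# Keyword counts per training document, copied from the table above
keywords = ["computer", "machine", "spreadsheet", "money", "budget"]
training = [
    ("c1", [10, 3, 2, 0, 1]),
    ("c1", [4, 6, 8, 2, 0]),
    ("c2", [2, 1, 0, 5, 4]),
    ("c2", [1, 2, 1, 8, 9]),
]

def log_scores(doc, alpha=0):
    """Return log P(c) + sum_i doc[i] * log P(w_i | c) for each class c."""
    scores = {}
    for c in sorted({label for label, _ in training}):
        docs_c = [counts for label, counts in training if label == c]
        totals = [sum(column) for column in zip(*docs_c)]   # counts per keyword in class c
        denominator = sum(totals) + alpha * len(keywords)
        score = math.log(len(docs_c) / len(training))       # log prior
        for n, total in zip(doc, totals):
            p = (total + alpha) / denominator
            if n > 0:
                # a zero probability for a word that occurs rules the class out
                score += n * math.log(p) if p > 0 else -math.inf
        scores[c] = score
    return scores

scores = log_scores([1, 2, 3, 4, 5])
print(max(scores, key=scores.get))  # c2
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow and does not change which class has the highest score.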

### Exercise: Apply Naive Bayes to classify the document with word vector (2,4,1,0,3) using MLE

### Exercise: Apply Naive Bayes to classify the document with word vector (2,4,1,0,3) using "add-1" smoothing