# Topic 2: Basic Document Classification

## Preliminaries 

This topic (Topic 2) and Topic 3 concern the task of sentiment analysis. You will be using a corpus of **book reviews** within an **Amazon review corpus**.

You will be exploring various techniques that can be used to classify Amazon book reviews as either positive or negative.

You will be developing your own Word List classifiers and then comparing them to the NLTK Naïve Bayes classifier.

### Something for you to do
- The first thing you need to do is run the following cell. This will give you access to the Sussex NLTK package.

In [None]:
# Edit this cell to uncomment one line and remove the one that follows

import sys
#sys.path.append(r'T:\Departments\Informatics\LanguageEngineering') 
sys.path.append(r'/Users/davidw/Documents/teach/NLE/resources')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import collections

## Creating training and testing sets

During the next two lab sessions you will be training and testing various document classifiers. It is important that the data used in the testing phase is not used during the training phase, because evaluating on data the classifier has already seen tends to overestimate its performance. This section describes how to use the `split_data` function in order to get separate training and testing sets.

In [None]:
from random import sample
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader

 
def split_data(data, ratio=0.7):
    data = list(data)
 
    n = len(data)  #Find out the number of samples present
    train_indices = sample(range(n), int(n * ratio))           #Randomly select training indices (range works on Python 2 and 3)
    test_indices = list(set(range(n)) - set(train_indices))    #Testing indices are all the remaining indices
 
    training_data = [data[i] for i in train_indices]           #Use training indices to select data
    testing_data = [data[i] for i in test_indices]             #Use testing indices to select data
 
    return (training_data, testing_data)                       #Return split data
 
#Create an Amazon corpus reader pointing at only book reviews
book_reader = AmazonReviewCorpusReader().category("book")

#Split the positive and negative reviews separately, so that both classes are split in the same ratio. Each resulting data set is a list of AmazonReview objects.
pos_training_data, pos_testing_data = split_data(book_reader.positive().documents()) #See the note below this code snippet 
neg_training_data, neg_testing_data = split_data(book_reader.negative().documents())

#You can also combine the training data
training_data = pos_training_data + neg_training_data
testing_data = pos_testing_data + neg_testing_data

### Note

Calling the `documents` method on the Amazon corpus reader returns a generator over the reviews in the corpus (each document in the Amazon corpus is a product review). Each review is an instance of a Python class called `AmazonReview`, which we have defined. An `AmazonReview` object contains all the data about a review.


**Function**: `split_data`

- Arguments
  - An iterable over data (e.g. list, generator)
  - The proportion of the data to use for training. The default (0.7) returns 70% training and 30% testing
- Returns
  - A split of the original data into two lists, training data first and testing data second (stored in a tuple)
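To see the specification in action without loading the corpus, the following is a minimal sketch that applies the same splitting logic to a toy list. The names `toy_reviews`, `toy_train` and `toy_test` are made up for illustration, and the strings merely stand in for `AmazonReview` objects.

```python
from random import sample

def split_data(data, ratio=0.7):
    # Same logic as the cell above, written with range so it runs on Python 2 and 3
    data = list(data)
    n = len(data)
    train_indices = sample(range(n), int(n * ratio))           # random training indices
    test_indices = list(set(range(n)) - set(train_indices))    # everything that is left over
    return ([data[i] for i in train_indices], [data[i] for i in test_indices])

# Hypothetical stand-in data: 20 strings instead of AmazonReview objects
toy_reviews = ["review_%d" % i for i in range(20)]

toy_train, toy_test = split_data(toy_reviews, ratio=0.75)
print(len(toy_train))   # 15, i.e. int(20 * 0.75)
print(len(toy_test))    # 5, the remaining items
```

Because the training size is computed with `int(n * ratio)`, it is rounded down, so for corpus sizes that do not divide evenly the observed proportions will only approximately match the requested ratio.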

### Something for you to do

- Use the code snippet above to split the book review corpus in various ways and, by measuring the sizes of the resulting splits, check that they match the specified ratios.


## Creating word lists

The next section will explain how to use a sentiment classifier that bases its decisions on word lists. The classifier requires a list of words indicating positive sentiment, and a second list of words indicating negative sentiment. Given positive and negative word lists, a document's overall sentiment is determined based on counts of occurrences of words that occur in the two lists. In this section we are concerned with the creation of the word lists. We will be considering both hand-crafted lists and automatically generated lists.

### Something for you to do

- Create a reasonably long hand-crafted list of words that you think indicate positive sentiment.
- Create a reasonably long hand-crafted list of words that indicate negative sentiment.

Next, you should try to derive word lists from the data. One way to do this is to use the most frequent words in positive reviews as your positive list, and the most frequent words in negative reviews as your negative list. This can be done with the [NLTK `FreqDist`](http://www.nltk.org/api/nltk.html#module-nltk.probability) object. The following code should get you started.

In [None]:
from nltk.probability import FreqDist
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader

#Helper function. Given a list of reviews, return a list of all the words in those reviews
def get_all_words(amazon_reviews):
    #Flatten the word lists of all reviews into one list (avoids reduce, which is not a builtin on Python 3)
    return [word for review in amazon_reviews for word in review.words()]

#A frequency distribution over all words in positive book reviews
pos_book_freqdist = FreqDist(get_all_words(pos_training_data))
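As a rough illustration of what a frequency distribution holds, here is a small sketch that uses the standard library's `collections.Counter` as a stand-in for `FreqDist`. The token list `toy_words` is invented for the example. In recent NLTK releases `FreqDist` is built on `Counter`, so the same calls should carry over, but that depends on the NLTK version you have installed, so treat it as an assumption to check.

```python
from collections import Counter

# Hypothetical token list standing in for get_all_words(pos_training_data)
toy_words = ["great", "book", "great", "read", "book", "great", "the"]

toy_counts = Counter(toy_words)      # maps each word to its number of occurrences
print(toy_counts["great"])           # 3
print(toy_counts.most_common(2))     # [('great', 3), ('book', 2)]
```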

### Something for you to do

- Extend the above code to construct positive and negative word lists consisting of the top k most frequent positive words and the top k most frequent negative words.
- Implement an alternative approach that creates positive and negative word lists consisting of all positive words occurring more than k times, and negative words occurring more than k times.
- Using the training data, create word lists using both of the above approaches. Do not use the test data for this!

## Creating a word list based classifier

Now you have a number of word lists for use with a classifier. The following code can be used as the basis for creating a word list based classifier.

In [None]:
from nltk.classify.api import ClassifierI
import random

class SimpleClassifier(ClassifierI): 

    def __init__(self, pos, neg): 
        self._pos = pos 
        self._neg = neg 

    def classify(self, words): 
        score = 0
        
        # add code here that assigns an appropriate value to score
        
        return "N" if score < 0 else "P" 

    def batch_classify(self, docs): 
        return [self.classify(doc.words() if hasattr(doc, 'words') else doc) for doc in docs] 

    def labels(self): 
        return ("P", "N")

#Example usage:

book_classifier = SimpleClassifier(positive_book_words_list, negative_book_words_list)

### Something for you to do

- Complete the `classify` method in the above code as specified below.
- Test your classifier on several very simple hand-crafted examples to verify that you have implemented `classify` correctly.

The classifier is initialised with a list of positive words and a list of negative words. The words of a document are passed to the `classify` method, which is only partially completed in the above code fragment. Define `classify` so that each occurrence of a positive word increments `score` and each occurrence of a negative word decrements `score`. If the final value of `score` is less than 0, return `"N"` (negative); if it is greater than 0, return `"P"` (positive); if it is exactly 0, make the classification decision randomly. Note that the code fragment as given returns `"P"` when `score` is 0, so its final `return` statement needs changing as well.

## Evaluating a word list based classifier

Below is code that uses an evaluation function to determine how well your classifier performs. The function returns the **accuracy** of a classifier, defined as the proportion of documents that were correctly classified.

In [None]:
from sussex_nltk.stats import evaluate_wordlist_classifier
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader

#Create a new classifier with your word lists
book_classifier = SimpleClassifier(positive_book_words_list, negative_book_words_list)

#Evaluate classifier
#The function requires three arguments:
# 1. Word list based classifier
# 2. A list (or generator) of positive AmazonReview objects
# 3. A list (or generator) of negative AmazonReview objects
accuracy = evaluate_wordlist_classifier(book_classifier, pos_testing_data, neg_testing_data)  
print(accuracy)
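The accuracy definition above can be made concrete with a few lines of code. This is only a sketch of the metric itself, not the implementation of `evaluate_wordlist_classifier`; the function `toy_accuracy` and the two label lists are hypothetical.

```python
def toy_accuracy(predicted_labels, true_labels):
    # Proportion of documents whose predicted label equals the true label
    correct = sum(1 for p, t in zip(predicted_labels, true_labels) if p == t)
    return float(correct) / len(true_labels)

# Hypothetical predictions for five documents
predicted = ["P", "P", "N", "N", "P"]
actual    = ["P", "N", "N", "N", "P"]

print(toy_accuracy(predicted, actual))   # 4 of 5 correct: 0.8
```

A classifier that guesses at random between two equally frequent classes would be expected to score around 0.5 on this metric, which is a useful baseline to keep in mind when reading your results.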

### Something for you to do  

You have two experiments to perform:
- Evaluate the performance of a classifier using hand-crafted lists.
- Evaluate the performance of a classifier using lists derived from the training data.

## Setting up training/testing data for Naïve Bayes (NB) classifiers

The NLTK Naïve Bayes classifier requires the data to be in a particular format, so the data should be formatted using the `format_data` function below.

### Note

For now, ignore the third argument to `format_data`; it will be used in the next lab session.

In [None]:
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader

def format_data(reviews, label, feature_extraction_fn=None):
    if feature_extraction_fn is None: #If a feature extraction function is not provided, use simply the words of the review as features
        data = [(dict([(feature, True) for feature in review.words()]), label) for review in reviews]
    else:
        data = [(dict([(feature, True) for feature in feature_extraction_fn(review)]), label) for review in reviews]
    return data

#After you've split the data up as shown earlier, you can use the split data like this:
#Format the positive and negative separately
formatted_pos_training = format_data(pos_training_data, "pos") 
formatted_neg_training = format_data(neg_training_data, "neg") 
#Combine them
formatted_training_data = formatted_pos_training + formatted_neg_training

#Same again but for the testing data
formatted_pos_testing = format_data(pos_testing_data, "pos") 
formatted_neg_testing = format_data(neg_testing_data, "neg") 
#Combine them
formatted_testing_data = formatted_pos_testing + formatted_neg_testing

### Something for you to do

- Look carefully at the above code for `format_data` and make sure that you understand why the implementation satisfies the specification below.


**Function**: `format_data`
- Arguments
  - An iterable (e.g. list, generator) over `AmazonReview` objects.
  - A label to assign to every review in the iterable ("pos" or "neg" for positive and negative respectively).
  - (optional) A function for extracting features from a review (no need to use this until the next lab session).
- Returns
  - A list of (feature dictionary, label) pairs, one per review, where the dictionary maps each feature extracted from the review to `True` and the label is the sentiment of the review. Formatted ready for the NB classifier.
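To check your understanding of the specification, it can help to run the function on something tiny. The sketch below uses a hypothetical `ToyReview` class that provides only a `words()` method, which is all that `format_data` needs when no feature extraction function is given. The real `AmazonReview` class holds more than this, and the condensed `format_data` here is intended to behave like the one above rather than replace it.

```python
class ToyReview(object):
    # Hypothetical stand-in for AmazonReview: it only provides words()
    def __init__(self, text):
        self._text = text

    def words(self):
        return self._text.split()

def format_data(reviews, label, feature_extraction_fn=None):
    # Condensed version of the cell above: default to the words of the review
    if feature_extraction_fn is None:
        feature_extraction_fn = lambda review: review.words()
    return [(dict((feature, True) for feature in feature_extraction_fn(review)), label)
            for review in reviews]

toy_reviews = [ToyReview("a great great book"), ToyReview("dull plot")]
toy_formatted = format_data(toy_reviews, "pos")

print(len(toy_formatted))   # 2, one (features, label) pair per review
print(toy_formatted[0])     # features of the first review paired with "pos"
```

Notice that "great" occurs twice in the first toy review but appears only once as a dictionary key: the features record which words are present, not how often they occur.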

## Creating and evaluating a Naïve Bayes classifier

This section shows how to train and test a NB classifier.

In [None]:
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

#Train on a list of reviews
nb_classifier = NaiveBayesClassifier.train(formatted_training_data)

#Test on another list of reviews
print("Accuracy: %s" % accuracy(nb_classifier, formatted_testing_data))

#Print the features that the NB classifier found to be most important in making classifications
nb_classifier.show_most_informative_features()

### Something for you to do
- Investigate the performance of the various wordlist approaches and the Naïve Bayes (NB) classifier.
- Compare the most informative features in Naïve Bayes training with your word lists (use the `show_most_informative_features` method of the NB classifier).