# Topic 3: Further Document Classification

## Preliminaries 
Run the cell below to execute the `%load` command, then run the cell a second time to execute the code it has loaded.

In [None]:
%load ../setup

Also, run the following cell to have access to the functions we defined in the previous topic.

### Overview
This topic builds on the activities of the previous topic on sentiment analysis. You will be focussing on the Amazon review corpus with a view to investigating the following issues.

- What is the impact of varying training data size? To what extent does increasing the quantity of training data improve classifier performance?
- What is the impact of changing domain (i.e. book, dvd, electronics, kitchen)? In particular, what happens if you train a classifier with reviews in one domain (or product category) and test the classifier on reviews from a different domain? Does performance degrade, and if so, by how much? Are some pairs of product categories more similar than others?
- What is the impact on classifier accuracy of various feature extraction methods?

By this stage, you should be very comfortable with Python's [list comprehensions](http://docs.python.org/tutorial/datastructures.html#list-comprehensions) and [slice](http://bergbom.blogspot.co.uk/2011/04/python-slice-notation.html) notation.

You will need to run the next cell so that you can use the `split_data` and `format_data` functions  covered in the previous topic.

## Investigating the impact of the quantity of training data
We will begin by exploring the impact on classification accuracy of using different quantities of training data.

We have assembled code from the notebook for Topic 2 in a module called `classification_utils`. In the next cell, we load this module.

We now measure the performance of both the word list classifier (the version that uses the 100 most frequent words in each category) and Naïve Bayes classifiers on all of the dvd reviews in the extended dvd review corpus, with 70% of the corpus being used for training and the remainder for testing.

### Exercise
Run this cell several times and observe the output.

In [None]:
from classification_utils import *

reader = AmazonReviewCorpusReader().category("dvd")
#stopwords = stopwords.words('english')
word_list_size = 100
pos_train,neg_train,pos_test,neg_test = get_train_test_data(reader)
WL_accuracy = run_WL(pos_train,neg_train,pos_test,neg_test,word_list_size)
NB_accuracy = run_NB(pos_train,neg_train,pos_test,neg_test)
print("The accuracy of the Word List classifier is {0:.2f}.".format(WL_accuracy))
print("The accuracy of the Naive Bayes classifier is {0:.2f}.".format(NB_accuracy))
df = pd.DataFrame([("Word List",WL_accuracy),("NB",NB_accuracy)])
display(df)
ax = df.plot.bar(title="Experimental Results",legend=False,x=0)
ax.set_ylabel("Classifier Accuracy")
ax.set_xlabel("Classifier")
ax.set_ylim(0.5,1.0)

As you can see, the classifiers have different accuracies on different runs. 

### Exercise
Copy the cell above and move the copy to be positioned below this cell. Then adapt the code so that the accuracy reported for each classifier is the average across multiple runs.
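
The averaging pattern itself can be sketched independently of the classifiers. The block below is a minimal illustration, not the model solution: `noisy_accuracy` is a hypothetical stand-in for one train/test run (it is not part of `classification_utils`), and the numbers it returns are synthetic.

```python
import random
from statistics import mean

def noisy_accuracy(rng):
    """Hypothetical stand-in for one train/test run.

    Returns a synthetic accuracy that varies from call to call,
    mimicking the effect of a different random train/test split.
    """
    return 0.75 + rng.uniform(-0.05, 0.05)

def average_accuracy(run_once, repetitions, rng):
    """Call run_once `repetitions` times and return the mean result."""
    return mean(run_once(rng) for _ in range(repetitions))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
single = noisy_accuracy(rng)
averaged = average_accuracy(noisy_accuracy, repetitions=20, rng=rng)
print("single run: {0:.2f}, average of 20 runs: {1:.2f}".format(single, averaged))
```

In your own cell, the body of the loop would be the calls to `get_train_test_data`, `run_WL` and `run_NB`, so that each repetition uses a fresh split of the data.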

In [None]:
# uncomment the next line and then run the cell to load a solution
#%load solutions/average_performance

### Exercise

The next step involves measuring the performance of both the word list and Naïve Bayes classifiers on a range of subsets of the dvd reviews in the extended dvd review corpus.

- The full data set has 1000 positive and 1000 negative reviews. 
- You should continue to use 30% of the data for testing, so this means that we have up to 700 positive and 700 negative reviews to sample from.
- Consider (at least) the following sample sizes: 1, 10, 50, 100, 200, 400, 600 and 700.
- Note that the sample size is not the total number of reviews, but the number of positive reviews (which is also equal to the number of negative reviews).

### Exercise
Copy the code cell that you created for the last exercise, and place the copy below this cell. Then adapt the code to determine the accuracy of each classifier on each subset.

Use the `sample` function from the random module, which means you should include the line:  
`from random import sample`
- Make sure that you are selecting samples that have an equal number of positive and negative reviews.
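
One way to keep the two classes balanced is to draw the same number of items from each list separately. The sketch below assumes the training data is held as two plain Python lists; `balanced_sample` is a hypothetical helper name and the review strings are stand-ins for real documents.

```python
from random import sample

def balanced_sample(pos_docs, neg_docs, sample_size):
    """Draw sample_size items, without replacement, from each class."""
    if sample_size > min(len(pos_docs), len(neg_docs)):
        raise ValueError("sample_size exceeds the number of available reviews")
    return sample(pos_docs, sample_size), sample(neg_docs, sample_size)

# Stand-in training data: 700 positive and 700 negative "reviews"
pos_train = ["pos_review_{}".format(i) for i in range(700)]
neg_train = ["neg_review_{}".format(i) for i in range(700)]

for size in [1, 10, 50, 100, 200, 400, 600, 700]:
    pos_sample, neg_sample = balanced_sample(pos_train, neg_train, size)
    print(size, len(pos_sample), len(neg_sample))
```

Sampling each class separately guarantees an exact 50/50 balance, which sampling from the concatenated list would not.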

Use a Pandas dataframe to display the results in a table.
- The table should have three columns:
 - the first for the sample sizes, 
 - the second for the Word List classifier accuracies, and 
 - the third for the Naïve Bayes classifier accuracies.
- There are examples of this in the model solutions to exercises in Topic 1 that you can adapt.
- You can use `pd.set_option('display.precision', 2)` to display real numbers with 2 digits after the decimal point.
- Create a dataframe like this:
```
pd.DataFrame(list(zip(<column 1 list>, <column 2 list>, ...)),
             columns=<a list of the column headings>)
```

In [None]:
# uncomment the next line and then run the cell to load a solution
#%load solutions/different_sample_sizes

### Exercise

Make a copy of the cell you created for the previous exercise and move it to be positioned below this cell. Using the new cell, repeat the above for each of the product categories.
- The available categories are `'dvd'`, `'book'`, `'kitchen'` and `'electronics'`. 

In [None]:
# uncomment the next line and then run the cell to load a solution
#%load solutions/different_categories

## Cross-domain sentiment analysis
We now consider the extent to which the performance of a Naïve Bayes classifier is degraded by differences between the data it is trained on and the data it is tested on.

For example, suppose we train a classifier on book reviews and then test that classifier on a collection of dvd reviews. Does it perform as well as it would when trained on dvd reviews?

We will refer to the domain or product category that the classifier is trained on as the **source domain** and the domain or product category that the classifier is tested on as the **target domain**. You will be experimenting with different combinations of source and target domains.

There are 4 product categories so there are 16 different ways in which these can be combined to create training and testing datasets.

### Exercise
In the empty cell below, write code that determines the accuracy of the Naïve Bayes classifier for each of these 16 combinations.
- use a pandas dataframe to report the results in a table with three columns:
 - the first column is the source category,
 - the second column is the target category, and
 - the third column is the accuracy for the corresponding source and target categories.
- So each row of the table gives the accuracy for one of the combinations.
 - There should, therefore, be 16 rows.
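
The 16 ordered (source, target) pairs can be enumerated with `itertools.product`. This is only a sketch of the loop structure: `evaluate` is a hypothetical placeholder that returns `None`, and you would replace it with code that trains on the source category and tests on the target category.

```python
from itertools import product

categories = ["book", "dvd", "kitchen", "electronics"]

def evaluate(source, target):
    """Hypothetical placeholder: train on `source`, test on `target`.

    Returns None here; replace the body with a real accuracy computation.
    """
    return None

# product(categories, repeat=2) yields every ordered (source, target) pair,
# including the four in-domain pairs where source == target.
rows = [(source, target, evaluate(source, target))
        for source, target in product(categories, repeat=2)]

print(len(rows))
for row in rows[:4]:
    print(row)
```

Each tuple in `rows` corresponds to one row of the results table, so the list can be passed to `pd.DataFrame` together with a list of three column headings.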

Ideally, the accuracies that you report should be averaged over multiple runs (as we saw above). However, since there are 16 combinations to consider, in order to avoid overly long running times, you do not need to run the classifier more than once for each combination.

Now that we are just running the Naïve Bayes classifier, it makes sense to format the data before passing it to our `run_NB` function. 

We have included in `classification_utils.py` a variant of `get_train_test_data` called `get_formatted_train_test_data` that is defined as follows:

```
def get_formatted_train_test_data(category, feature_extractor=None, split=0.7):
    '''
    Helper function. Splits data evenly across positive and negative, and then formats it
    ready for naive bayes. You can also optionally pass in your custom feature extractor 
    (see next section), and a custom split ratio.
    '''
    arcr = AmazonReviewCorpusReader()
    pos_train, pos_test = split_data(arcr.positive().category(category).documents(), split)
    neg_train, neg_test = split_data(arcr.negative().category(category).documents(), split)
    train = format_data(pos_train, "pos", feature_extractor) + format_data(neg_train, "neg", feature_extractor)
    test  = format_data(pos_test, "pos", feature_extractor) + format_data(neg_test, "neg", feature_extractor)
    return test, train
```

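Note that `get_formatted_train_test_data` returns the test set first and the training set second, so its result should be unpacked as `test, train = ...`. To go with it, we also have a variant of `run_NB` called `run_NB_preformatted`, which takes the formatted training data followed by the formatted test data and is defined as follows: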

```
def run_NB_preformatted(train,test):
    c_priors = class_priors(train)
    c_probs = cond_probs(train)
    known_vocab = known_vocabulary(train)
    return NB_evaluate(test,c_priors,c_probs,known_vocab)
```

In [None]:
# uncomment the next line and then run the cell to load a solution
#%load solutions/crossing_domains

### Exercise
Make a copy of the cell that contains your solution to the previous exercise and position the copy below this cell.

Adapt the code so that you explore the use of training sets built from multiple categories. For example, you might consider the following:

```
source = dvd_train + book_train + kitchen_train
target = electronics_test

source = dvd_train + book_train + kitchen_train + electronics_train
target = electronics_test
```

One thing to bear in mind when considering the impact of using multiple product categories is the extent to which improvements are due to an increase in the quantity of training data.
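
One possible way to separate the two effects is to compare a combined source at its full size with the same combined source subsampled down to the size of a single category. The sketch below is an illustration under assumptions: `combine_sources` and `fake_train` are hypothetical helpers, and the training items are stand-in strings rather than formatted reviews.

```python
from random import Random

def combine_sources(train_sets, max_size=None, seed=0):
    """Concatenate several training sets; optionally subsample to max_size items."""
    combined = [item for train in train_sets for item in train]
    if max_size is not None and len(combined) > max_size:
        combined = Random(seed).sample(combined, max_size)
    return combined

def fake_train(category, n=1400):
    """Stand-in training set: n placeholder items for one category."""
    return ["{}_review_{}".format(category, i) for i in range(n)]

dvd_train, book_train, kitchen_train = (fake_train(c) for c in ["dvd", "book", "kitchen"])

# Full combined source: three times the data of a single category
bigger = combine_sources([dvd_train, book_train, kitchen_train])
# Same mixture of domains, but no more data than one category provides
same_size = combine_sources([dvd_train, book_train, kitchen_train],
                            max_size=len(dvd_train))
print(len(bigger), len(same_size))
```

Note that this simple subsampling does not guarantee an exact balance of positive and negative reviews; sampling each class separately would be needed for that.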

In [None]:
# uncomment the next line and then run the cell to load a solution
#%load solutions/multi_category_source

## Feature extraction
So far, the Naïve Bayes classifiers you've been training have been using all of the tokens in a review as features. You will now be exploring whether it is possible to improve classification accuracy by extracting different features from the reviews.

### Exercise
First, establish the accuracy of the Naïve Bayes classifier on each of the product categories.

To do this, simply run the following cell. This code creates a dictionary `baseline`  that stores the accuracy for each product category. You can use this dictionary later when considering the impact of various feature extractors.

In [None]:
from classification_utils import *

prod_cats = ["book","dvd","kitchen","electronics"]
baseline = {}
for prod_cat in prod_cats:
    repetitions = 2 # accuracy figures are averaged over this many repetitions
    NB_accuracy_tot = 0
    for i in range(repetitions): # average the accuracy for this category over several repetitions
        test, train   = get_formatted_train_test_data(prod_cat)
        NB_accuracy_tot += run_NB_preformatted(train,test)
    baseline[prod_cat] = NB_accuracy_tot/repetitions
    
pd.set_option('display.precision',2)
df = pd.DataFrame.from_dict(baseline,orient='index')
display(df)
ax = df.plot.bar(title="Experimental Results",legend=False)
ax.set_ylabel("Classifier Accuracy")
ax.set_xlabel("Product Category")
ax.set_ylim(0.5,1.0)

### Exercise
In the empty cell below, define the feature extractor function `FE_all`, which takes as input an `AmazonReview` object, and outputs a list of strings, where each string is a feature to be used by the Naïve Bayes classifier. 

Initially, define it to return all of the tokens in the review. We will adapt it in the exercises below.
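
The shape of a feature extractor can be sketched as follows. This is an illustration under an assumption, not the model solution: `StandInReview` is a hypothetical substitute for `AmazonReview`, and it is assumed that the review exposes its tokens through a `words()` method. Check the real class before relying on that method name.

```python
class StandInReview:
    """Hypothetical substitute for AmazonReview.

    Assumes the review exposes its tokens through a words() method.
    """
    def __init__(self, tokens):
        self._tokens = list(tokens)

    def words(self):
        return list(self._tokens)

def FE_all_sketch(review):
    """Return every token of the review, unchanged, as a list of strings."""
    return list(review.words())

review = StandInReview(["This", "DVD", "cost", "10", "pounds", "!"])
print(FE_all_sketch(review))
```

Later feature extractors keep the same signature (review in, list of strings out) and differ only in how they transform or filter the tokens.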

In [None]:
# uncomment the next line and then run the cell to load a solution
#%load solutions/FE_all

### Exercise 
Now that you have defined a feature extractor function, you can pass it, via the `feature_extractor` parameter, to the helper function `get_formatted_train_test_data` that was used in the last exercise.

Copy the code cell above that you used to produce the `baseline` dictionary, and adapt it to make use of your feature_extractor function, `FE_all`. 

Run the new code cell, and save the results in a new dictionary called `FE_all_results`. This feature extractor does not change the features that the classifier uses, so it should not have a significant impact on the accuracy of the classifier. Check that this is true.

In [None]:
# uncomment the next line and then run the cell to load a solution
#%load solutions/test_FE_all

### Exercise
In a new cell, define a feature extraction function that converts all tokens to lowercase. 
- Call your new feature extractor `FE_lower`.
- See the last section of the Topic 1 notebook for guidance on how to convert tokens to lowercase.

In [None]:
# uncomment the next line and then run the cell to load a solution
#%load solutions/FE_lower

### Exercise
In a new cell, define a feature extraction function that converts all numbers to "NUM". 
- Call your new feature extractor `FE_NUM`.
- See the last section of the Topic 1 notebook for guidance on how to normalise tokens, including how to replace numbers.
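
The core of the number normalisation can be sketched on a plain list of tokens. The helper name `normalise_numbers` is hypothetical, and the choice of `str.isdigit()` as the test for "is a number" is an assumption; your feature extractor would apply the same idea to the tokens of a review.

```python
def normalise_numbers(tokens):
    """Replace every token made up solely of digits with the placeholder "NUM"."""
    return ["NUM" if token.isdigit() else token for token in tokens]

tokens = ["I", "bought", "2", "copies", "in", "2009", "for", "9.99"]
print(normalise_numbers(tokens))
# → ['I', 'bought', 'NUM', 'copies', 'in', 'NUM', 'for', '9.99']
```

As the output shows, `isdigit()` does not match tokens such as `9.99`; whether decimals and similar forms should also become "NUM" is a design decision for you to make.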

In [None]:
# uncomment the next line and then run the cell to load a solution
#%load solutions/FE_NUM

### Exercise
In a new cell, define a feature extraction function that filters out non-alphabetic words and stopwords.
- Call your new feature extractor `FE_puncstop`.
- See the last section of the Topic 1 notebook for guidance on how to normalise tokens, including how to filter out punctuation and stopwords.
- Note that the following two lines must be placed in a cell where this feature extraction function is being used:  
`from nltk.corpus import stopwords`  
`stopwords = stopwords.words('english')`
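
The filtering step can be sketched on a plain list of tokens. To keep the example self-contained it uses a tiny hand-written stopword list rather than the NLTK one; `filter_tokens` and `tiny_stopwords` are hypothetical names.

```python
def filter_tokens(tokens, stopword_list):
    """Keep only alphabetic tokens that are not stopwords.

    Tokens are lowercased before the stopword check, on the assumption
    that the stopword list itself is in lowercase.
    """
    stop = set(stopword_list)  # set membership tests are fast
    return [t for t in tokens if t.isalpha() and t.lower() not in stop]

tiny_stopwords = ["the", "a", "is", "it", "and"]
tokens = ["The", "plot", "is", "thin", ",", "and", "it", "drags", "!", "2"]
print(filter_tokens(tokens, tiny_stopwords))
# → ['plot', 'thin', 'drags']
```

Converting the stopword list to a set once, outside the comprehension, avoids scanning the whole list for every token.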


In [None]:
# uncomment the next line and then run the cell to load a solution
#%load solutions/FE_puncstop

### Exercise
In a new cell, define a feature extraction function that stems all of the tokens.
- Call your new feature extractor `FE_stem`.
- The code snippet below shows you how to set up a stemmer.

In [None]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer() #Create a new stemmer
stemmed = stemmer.stem("complications") #Example usage, stemming a single word

#You will need to stem all of the words in a review, 
#this will require iterating over them with a loop or list comprehension

In [None]:
# uncomment the next line and then run the cell to load a solution
#%load solutions/FE_stem

### Exercise
Now that you have defined several feature extraction functions, it is time to look at their impact on performance.

Make a copy of the cell that you used to determine the accuracy of your first feature extractor, and position the copied cell below this one.

Extend the code in the cell to do the following:
- Create a dictionary for each of your feature extraction functions.
- Display the results in a table using a Pandas dataframe. This table should have one row for each product category, with one column for the product category names and an additional column for each of the five feature extraction functions.
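
One possible way to organise the results before building the dataframe is sketched below. It assumes one dictionary per feature extractor, keyed by product category; the `None` values are placeholders for accuracies that you still have to measure.

```python
prod_cats = ["book", "dvd", "kitchen", "electronics"]
extractor_names = ["FE_all", "FE_lower", "FE_NUM", "FE_puncstop", "FE_stem"]

# One dictionary per extractor, mapping product category -> accuracy.
# None marks values that still have to be measured.
results = {name: {cat: None for cat in prod_cats} for name in extractor_names}

# One row per product category: the category name followed by
# the accuracy obtained with each feature extractor.
columns = ["Product Category"] + extractor_names
rows = [[cat] + [results[name][cat] for name in extractor_names]
        for cat in prod_cats]

print(columns)
for row in rows:
    print(row)
```

The table can then be displayed with `pd.DataFrame(rows, columns=columns)`.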

In [None]:
# uncomment the next line and then run the cell to load a solution
#%load solutions/test_FS