Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Teo Kai Wen"
COLLABORATORS = "Prof Goodman, Ser Han"

---

# Assignment 9

**due:** Wednesday, 13 November

This assignment gives you practice with supervised classification.

## Introduction

We will build models to classify the genre of paragraphs of text. The data will come from the Brown corpus, since its categories are divided by genre.

First let's inspect the size of each genre:

In [2]:
from nltk.corpus import brown

print('Category         # Paras  # Sents  # Words')
print('---------------  -------  -------  -------')
for genre in brown.categories():
    print('{:<15}  {:>7}  {:>7}  {:>7}'.format(
        genre,
        len(brown.paras(categories=genre)),
        len(brown.sents(categories=genre)),
        len(brown.words(categories=genre))))

Category         # Paras  # Sents  # Words
---------------  -------  -------  -------
adventure           1387     4637    69342
belles_lettres      1405     7209   173096
editorial           1003     2997    61604
fiction             1043     4249    68488
government           851     3032    70117
hobbies             1119     4193    82345
humor                254     1053    21695
learned             1418     7734   181888
lore                1203     4881   110299
mystery             1164     3886    57169
news                2234     4623   100554
religion             369     1716    39399
reviews              629     1751    40704
romance             1253     4431    70022
science_fiction      335      948    14470


For this assignment let's just use the smaller genres: `humor`, `religion`, `reviews`, and `science_fiction`:

In [3]:
genres = ['humor', 'religion', 'reviews', 'science_fiction']

Finally, note that `brown.paras()` returns a list of paragraphs, where each paragraph is a list of sentences and each sentence is a list of words. Schematically, that is:

```
brown.paras()       ->  [para, ...]
[para, ...]         ->  [[sent, ...], ...]
[[sent, ...], ...]  ->  [[[word, ...], ...], ...]
```

Since our classifiers will need to work with lists of words, the following function should help:

In [4]:
def flatten(list_of_lists):
    return [item for lst in list_of_lists for item in lst]

# now to test it out
sents = [['w1', 'w2'], ['w3', 'w4']]  # list of sentences
paras = [sents, sents]                # list of paragraphs

print(flatten(sents))
print(flatten(paras))
print(flatten(flatten(paras)))

['w1', 'w2', 'w3', 'w4']
[['w1', 'w2'], ['w3', 'w4'], ['w1', 'w2'], ['w3', 'w4']]
['w1', 'w2', 'w3', 'w4', 'w1', 'w2', 'w3', 'w4']


## Q1: Create the labeled data (2 points)

In order to build a classifier, we need to create some labeled test and training data. Do the following:

* Use `brown.paras()` as above to get all paragraphs for a genre
  * Make a tuple that pairs each paragraph with its genre: `(para, genre)`
  * Use `k` to calculate how many pairs make up `1/k` of the total for the genre
  * Put that `1/k` of the pairs (1/10 when `k = 10`) for each genre in `test_paras`
  * Put the remaining pairs (9/10 when `k = 10`) in `train_paras`
* Repeat for each genre

**Note:** you cannot just do `brown.paras(categories=genres)` because then you won't know which category each paragraph belongs to.
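
The per-genre split described above can be tried on toy data before touching the corpus. This is a minimal sketch, not the reference solution: `split_by_genre` and `toy_paras` are hypothetical names introduced here, and the sketch assumes that the first `1/k` of each genre's pairs goes to the test set.

```python
def split_by_genre(paras_by_genre, k=10):
    """Split (para, genre) pairs so each genre contributes 1/k to the test set."""
    train_pairs, test_pairs = [], []
    for genre, paras in paras_by_genre.items():
        pairs = [(para, genre) for para in paras]
        n_test = len(pairs) // k      # size of the held-out 1/k slice
        test_pairs += pairs[:n_test]
        train_pairs += pairs[n_test:]
    return train_pairs, test_pairs

# Hypothetical toy data: 20 and 10 one-sentence "paragraphs".
toy_paras = {
    'humor': [[['joke', str(i)]] for i in range(20)],
    'reviews': [[['film', str(i)]] for i in range(10)],
}
toy_train, toy_test = split_by_genre(toy_paras, k=10)
print(len(toy_train), len(toy_test))
```

With these toy sizes, `humor` contributes 2 test pairs and `reviews` contributes 1, so the split is 27 training pairs to 3 test pairs. Splitting per genre rather than over the pooled list keeps each genre's share of the test set proportional to its size.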

In [7]:
train_paras = []
test_paras = []
k = 10

for genre in genres:
    # Pair every paragraph of this genre with its label.
    pairs = [(para, genre) for para in brown.paras(categories=genre)]
    # Size of the held-out 1/k slice (integer division avoids float rounding).
    n_test = len(pairs) // k
    test_paras += pairs[:n_test]
    train_paras += pairs[n_test:]

print('train: {:>5}'.format(len(train_paras)))
print('test:  {:>5}'.format(len(test_paras)))

train:  1431
test:    156


In [8]:
# Check that train and test are the right size for each genre.
# Don't check exactly, but +- c to account for rounding error.
c = 3  # a small constant
for genre in genres:
    total = len(brown.paras(categories=genre))
    expected_test = int(total * (1 / k))
    expected_train = int(total * ((k-1) / k))
    observed_test = len([pair for pair in test_paras if pair[1] == genre])
    observed_train = len([pair for pair in train_paras if pair[1] == genre])
    print(genre + ':')
    print('       expected  observed')
    print('TRAIN  {:>8}  {:>8}'.format(expected_train, observed_train))
    print('TEST   {:>8}  {:>8}'.format(expected_test, observed_test))
    print()
    assert (observed_train - c) < expected_train < (observed_train + c)
    assert (observed_test - c) < expected_test < (observed_test + c)

print('All tests passed!')

humor:
       expected  observed
TRAIN       228       229
TEST         25        25

religion:
       expected  observed
TRAIN       332       333
TEST         36        36

reviews:
       expected  observed
TRAIN       566       567
TEST         62        62

science_fiction:
       expected  observed
TRAIN       301       302
TEST         33        33

All tests passed!


## Q2: Create a feature extractor function (2 points)

We have pairs of `(para, genre)`, but training a classifier requires `(feature_dict, genre)` pairs. Create a function `features()` that takes a paragraph and returns a dictionary of features. For now, create one feature per word (the `flatten()` function will help), such as `contains(film)` with the value `True`. For instance, if the paragraph were just `[['Great', 'film', '!']]`, the feature dict would be `{'contains(Great)': True, 'contains(film)': True, 'contains(!)': True}`. Do not normalize the strings (e.g., by downcasing them). Whether a word appears once or multiple times, the feature value is `True`. You do not need to add missing features with the value `False`; the classifier handles absent features automatically.

**Note:** the `features()` function takes a paragraph, not a `(para, genre)` pair.

In [9]:
def features(para):
    """Return the features for each word in each sentence in *para*."""
    d = {}
    for word in flatten(para):
        key = 'contains({})'.format(word)
        d[key] = True
    return d

print(features(train_paras[0][0]))

{'contains(The)': True, 'contains(incident)': True, 'contains(,)': True, 'contains(aside)': True, 'contains(from)': True, 'contains(reflecting)': True, 'contains(on)': True, "contains(Welch's)": True, 'contains(political)': True, 'contains(career)': True, 'contains(had)': True, 'contains(all)': True, 'contains(but)': True, 'contains(wrecked)': True, 'contains(his)': True, 'contains(home)': True, 'contains(life)': True, 'contains(.)': True, 'contains(He)': True, 'contains(never)': True, 'contains(rested)': True, 'contains(until)': True, 'contains(he)': True, 'contains(discovered)': True, 'contains(who)': True, 'contains(the)': True, 'contains(culprit)': True, 'contains(was)': True, 'contains(and)': True, 'contains(when)': True, 'contains(did)': True, 'contains(vowed)': True, 'contains(vengeance)': True, 'contains(Viola)': True, 'contains(Lake)': True, 'contains(if)': True, 'contains(ever)': True, 'contains(chance)': True, 'contains(came)': True, 'contains(way)': True, 'contains(And)': T

In [10]:
assert features([['Great', 'film', '!']]) == {'contains(Great)': True, 'contains(film)': True, 'contains(!)': True}
assert features([['yeet', 'yeet', 'yeet']]) == {'contains(yeet)': True}
print('All tests passed!')

All tests passed!


## Q3: Train a Naive Bayes model (2 points)

Do the following to train a Naive Bayes classifier:

* Apply `features()` to the paragraphs in the `(para, genre)` pairs of `train_paras` and create `(feature_dict, genre)` pairs; store in a new list `train`
* Do the same for `test_paras`; store in a new list `test`
* Train an `nltk.NaiveBayesClassifier()` with `train`
* Evaluate on `test`

In [11]:
import nltk

train = [(features(para), genre) for para, genre in train_paras]
test = [(features(para), genre) for para, genre in test_paras]

nb = nltk.NaiveBayesClassifier.train(train)
nltk.classify.accuracy(nb, test)

0.4807692307692308

You should see an accuracy of around 48%, which beats both random guessing among four genres (25%) and always predicting the majority genre (`reviews`: 567/1431 ≈ 40%).
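
The two baselines quoted above can be recomputed from the per-genre training counts printed by the size check in Q1. This is a small sketch of that arithmetic only; `train_counts` is a name introduced here, and its values are copied by hand from that output rather than read from the corpus.

```python
from collections import Counter

# Per-genre training counts, copied from the size check in Q1.
train_counts = Counter({'humor': 229, 'religion': 333,
                        'reviews': 567, 'science_fiction': 302})

total = sum(train_counts.values())
majority_genre, majority_n = train_counts.most_common(1)[0]

random_baseline = 1 / len(train_counts)   # uniform guess over the four genres
majority_baseline = majority_n / total    # always predict the largest genre

print('random:   {:.1%}'.format(random_baseline))
print('majority: {:.1%} ({})'.format(majority_baseline, majority_genre))
```

The majority share here is taken from the training counts, as in the text; computed on the test set instead it would be 62/156, which is also close to 40%.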

## Q4: Create an alternative feature extractor function (2 points)

Create a new feature extractor function, `features2()`, and do as before with the following changes:

* downcase each word
* remove stopwords
* lemmatize each remaining word with [`nltk.stem.WordNetLemmatizer`](http://www.nltk.org/api/nltk.stem.html#nltk.stem.wordnet.WordNetLemmatizer)

You can use the stopwords like this:

```python
>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))
>>> 'the' in stop
True
```

And the WordNetLemmatizer like this:

```python
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('laughs')
'laugh'
```

In [14]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
stop = set(stopwords.words('english'))

def features2(para):
    """Return the features for each normalized word in each sentence in *para*."""
    d2 = {}
    for word in flatten(para):
        word = word.lower()
        if word not in stop:
            word = wnl.lemmatize(word)
            key = 'contains({})'.format(word)
            d2[key] = True
    return d2
    
print(features2(train_paras[0][0]))

{'contains(incident)': True, 'contains(,)': True, 'contains(aside)': True, 'contains(reflecting)': True, "contains(welch's)": True, 'contains(political)': True, 'contains(career)': True, 'contains(wrecked)': True, 'contains(home)': True, 'contains(life)': True, 'contains(.)': True, 'contains(never)': True, 'contains(rested)': True, 'contains(discovered)': True, 'contains(culprit)': True, 'contains(vowed)': True, 'contains(vengeance)': True, 'contains(viola)': True, 'contains(lake)': True, 'contains(ever)': True, 'contains(chance)': True, 'contains(came)': True, 'contains(way)': True, 'contains(!)': True, 'contains(innocent)': True, 'contains(action)': True, 'contains(human)': True, 'contains(tragedy)': True, 'contains(sometimes)': True, 'contains(set)': True, 'contains(motion)': True}


In [15]:
assert features2([['Great', 'film', '!']]) == {'contains(great)': True, 'contains(film)': True, 'contains(!)': True}
assert features2([['Lots', 'of', 'laughs']]) == {'contains(lot)': True, 'contains(laugh)': True}
print('All tests passed!')

All tests passed!


## Q5: Train a Naive Bayes model with normalized features (2 points)

Now do as you did for Q3 but using `features2()` instead of `features()`.

In [16]:
import nltk

train2 = [(features2(para), genre) for para, genre in train_paras]
test2 = [(features2(para), genre) for para, genre in test_paras]

nb2 = nltk.NaiveBayesClassifier.train(train2)
nltk.classify.accuracy(nb2, test2)

0.5512820512820513

You should see an accuracy of around 55%, an improvement of about 7 percentage points over Q3.
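
The size of this gain can be expressed in more than one way. The sketch below uses only the two accuracies and the test-set size printed earlier in this notebook; the variable names are introduced here for illustration. It separates the absolute gain in percentage points from the gain relative to the Q3 accuracy, and converts the difference back into a count of test paragraphs.

```python
acc_q3 = 0.4807692307692308   # accuracy reported for features()
acc_q5 = 0.5512820512820513   # accuracy reported for features2()
n_test = 156                  # size of the test set from Q1

point_gain = (acc_q5 - acc_q3) * 100         # absolute gain, in percentage points
relative_gain = (acc_q5 - acc_q3) / acc_q3   # gain relative to the Q3 accuracy
extra_correct = round((acc_q5 - acc_q3) * n_test)

print('{:.1f} percentage points'.format(point_gain))
print('{:.1%} relative improvement'.format(relative_gain))
print('{} more test paragraphs classified correctly'.format(extra_correct))
```

With only 156 test paragraphs, the difference amounts to 11 additional correct predictions, so small changes to the split could shift these figures noticeably.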