## Part Of Speech Tagging

<img src=https://i.stack.imgur.com/6pdIT.png width=320>

Unlike our previous experience with language modelling, this time around we learn the mapping between two different kinds of elements.

This setting is common for a range of useful problems:
* Speech Recognition - processing human voice into text
* Part Of Speech Tagging - for morphology-aware search and as an auxiliary task for most NLP problems
* Named Entity Recognition - for chat bots and web crawlers
* Protein structure prediction - for bioinformatics

In this programming assignment we will work with part-of-speech tagging. As the name suggests, it's about converting a sequence of words into a sequence of part-of-speech tags. We'll use a reduced tag set for simplicity:

### POS-tags
- ADJ - adjective (new, good, high, ...)
- ADP - adposition (on, of, at, ...)
- ADV - adverb (really, already, still, ...)
- CONJ - conjunction (and, or, but, ...)
- DET - determiner, article (the, a, some, ...)
- NOUN - noun (year, home, costs, ...)
- NUM - numeral (twenty-four, fourth, 1991, ...)
- PRT - particle (at, on, out, ...)
- PRON - pronoun (he, their, her, ...)
- VERB - verb (is, say, told, ...)
- . - punctuation marks (. , ;)
- X - other (ersatz, esprit, dunno, ...)
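
To make the input and output format concrete, here is a small sketch of how one tagged sentence is represented in the rest of this notebook: a list of `(word, tag)` pairs that `zip(*sentence)` splits into two parallel sequences. The sentence below is hand-written for illustration and is not taken from the corpus.

```python
# A hand-written example sentence (hypothetical, not from the Brown corpus)
sentence = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
            ("on", "ADP"), ("the", "DET"), ("mat", "NOUN"), (".", ".")]

# zip(*pairs) "unzips" the pairs into a tuple of words and a tuple of tags
words, tags = zip(*sentence)

print(words)
print(tags)
```

The tagger's job is to produce the second tuple given only the first, one tag per word.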

__Disclaimer:__ This assignment is ungraded.

In [None]:
import nltk
import sys
import numpy as np

nltk.download('brown')
nltk.download('universal_tagset')

data = nltk.corpus.brown.tagged_sents(tagset='universal')
all_tags = ['#EOS#','#UNK#','ADV', 'NOUN', 'ADP', 'PRON', 'DET', '.', 'PRT', 'VERB', 'X', 'NUM', 'CONJ', 'ADJ']

# Sentences have different lengths, so store them in a 1-D object array
data = np.array([[(word.lower(),tag) for word, tag in sentence] for sentence in data], dtype=object)

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


In [None]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data, test_size=0.25, random_state=42)

In [None]:
from IPython.display import HTML, display
def draw(sentence):
    words,tags = zip(*sentence)
    display(HTML('<table><tr>{tags}</tr><tr>{words}</tr></table>'.format(
                words = '<td>{}</td>'.format('</td><td>'.join(words)),
                tags = '<td>{}</td>'.format('</td><td>'.join(tags)))))
    
    
draw(data[11])
draw(data[10])
draw(data[7])

NOUN,ADP,NOUN,NOUN,NOUN,NOUN,VERB,ADV,VERB,ADP,DET,ADJ,NOUN,.

PRON,VERB,ADP,DET,NOUN,.,VERB,NOUN,PRT,VERB,.,DET,NOUN,.

NOUN,VERB


### Building vocabularies

Just like before, we have to build a mapping from tokens to integer ids. This time around, our model operates on the word level, processing one word per RNN step. This means we'll have to deal with a far larger vocabulary.

Luckily for us, we only receive those words as input, i.e. we don't have to predict them. This means we can afford a large vocabulary almost for free by using word embeddings.

In [None]:
from collections import Counter
word_counts = Counter()
for sentence in data:
    words,tags = zip(*sentence)
    word_counts.update(words)

all_words = ['#EOS#','#UNK#'] + list(list(zip(*word_counts.most_common(10000)))[0])

#let's measure what fraction of data words are in the dictionary
print("Coverage = %.5f"%(float(sum(word_counts[w] for w in all_words)) / sum(word_counts.values())))

Coverage = 0.92876


In [None]:
from collections import defaultdict
word_to_id = defaultdict(lambda: 1, {word: i for i, word in enumerate(all_words)})
tag_to_id = {tag:i for i, tag in enumerate(all_tags)}
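
As a rough illustration of the vocabulary logic above, the sketch below repeats the same steps on a tiny made-up corpus: count words, keep the most frequent ones after the two special tokens, and map everything else to the `#UNK#` id. The names `toy_sentences` and `toy_word_to_id` are hypothetical and exist only for this example.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical); the notebook itself counts words over the Brown corpus
toy_sentences = [["the", "cat", "sat"],
                 ["the", "cat", "ran"],
                 ["the", "dog", "slept"]]

counts = Counter(word for sent in toy_sentences for word in sent)

# Keep only the 2 most frequent words, placed after the two special tokens
vocab = ["#EOS#", "#UNK#"] + [word for word, _ in counts.most_common(2)]

# Words outside the vocabulary fall back to id 1, the position of '#UNK#'
toy_word_to_id = defaultdict(lambda: 1, {word: i for i, word in enumerate(vocab)})

# Fraction of word occurrences covered by the vocabulary
coverage = sum(counts[w] for w in vocab) / sum(counts.values())
ids = [toy_word_to_id[w] for w in ["the", "zebra", "cat"]]

print(vocab)     # ['#EOS#', '#UNK#', 'the', 'cat']
print(ids)       # [2, 1, 3]
print(coverage)  # 5 of 9 occurrences are in the vocabulary
```

Note that looking up a missing key in a `defaultdict` also stores it, so the mapping grows as unknown words are queried; this does not change the ids that are returned.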

Convert words and tags into a fixed-size matrix:

In [None]:
def to_matrix(lines, token_to_id, max_len=None, pad=0, dtype='int32', time_major=False):
    """Converts a list of token sequences into an RNN-digestible matrix of ids, padded after the end of each sequence"""
    
    max_len = max_len or max(map(len,lines))
    matrix = np.empty([len(lines),max_len],dtype)
    matrix.fill(pad)

    for i in range(len(lines)):
        line_ix = list(map(token_to_id.__getitem__,lines[i]))[:max_len]
        matrix[i,:len(line_ix)] = line_ix

    return matrix.T if time_major else matrix
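
The padding behaviour is easiest to see on a tiny input. The sketch below is a simplified stand-in for `to_matrix` that takes sequences of ids directly (so no token lookup); the helper name `pad_to_matrix` is hypothetical and used only here.

```python
import numpy as np

def pad_to_matrix(sequences, pad=0, dtype="int32"):
    """Right-pad variable-length id sequences into a [batch, max_len] matrix."""
    max_len = max(map(len, sequences))
    matrix = np.full((len(sequences), max_len), pad, dtype=dtype)
    for i, seq in enumerate(sequences):
        matrix[i, :len(seq)] = seq
    return matrix

batch = pad_to_matrix([[5, 3, 7, 1], [4], [2, 9]])
print(batch)
# [[5 3 7 1]
#  [4 0 0 0]
#  [2 9 0 0]]

# Transposing gives the time-major layout: [max_len, batch]
print(batch.T.shape)  # (4, 3)
```

Every row is as long as the longest sequence in the batch, and the unused tail is filled with the padding id.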

In [None]:
batch_words, batch_tags = zip(*[zip(*sentence) for sentence in data[-3:]])

print("Word ids:")
print(to_matrix(batch_words,word_to_id))
print("Tag ids:")
print(to_matrix(batch_tags,tag_to_id))

Word ids:
[[   2 3057    5    2 2238 1334 4238 2454    3    6   19   26 1070   69
     8 2088    6    3    1    3  266   65  342    2    1    3    2  315
     1    9   87  216 3322   69 1558    4    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0]
 [  45   12    8  511 8419    6   60 3246   39    2    1    1    3    2
   845    1    3    1    3   10 9910    2    1 3470    9   43    1    1
     3    6    2 1046  385   73 4562    3    9    2    1    1 3250    3
    12   10    2  861 5240   12    8 8936  121    1    4]
 [  33   64   26   12  445    7 7346    9    8 3337    3    1 2811    3
     2  463  572    2    1    1 1649   12    1    4    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0]]
Tag ids:
[[ 6  3  4  6  3  3  9  9  7 12  4  5  9  4  6  3 12  7  9  7  9  8  4  6
   3  7  6 13  3  4  6  3  9  4  3  7  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0

### Build and train a simple model

In this lab we'll focus on a high-level PyTorch interface to recurrent neural networks, which we tried at the end of the previous lab.

In [None]:
import torch
from torch import nn

In [None]:
class SimpleModel(nn.Module):
    
    def __init__(self):
        super().__init__()
        
        self.rnn = nn.Sequential(
            nn.Embedding(len(all_words), 64),
            nn.RNN(64, 64, batch_first=True)
        )
        self.classifier = nn.Sequential(
            nn.Linear(64, len(all_tags)),
        )
    
    def forward(self, input):
        output, _ = self.rnn(input)
        return self.classifier(output)

We will use a data generator for batch training:

In [None]:
BATCH_SIZE = 128

def generate_batches(sentences, batch_size=BATCH_SIZE, max_len=None, pad=0):
    assert isinstance(sentences, np.ndarray), "Make sure sentences is a numpy array"
    
    while True:
        indices = np.random.permutation(np.arange(len(sentences)))
        for start in range(0, len(indices) - 1, batch_size):
            batch_indices = indices[start:start + batch_size]
            batch_words, batch_tags = [],[]
            for sent in sentences[batch_indices]:
                words,tags = zip(*sent)
                batch_words.append(words)
                batch_tags.append(tags)

            batch_words = to_matrix(batch_words, word_to_id, max_len, pad)
            batch_tags = to_matrix(batch_tags, tag_to_id, max_len, pad)

            yield batch_words, batch_tags
        

In [None]:
# import stuff
from torch import optim

from tqdm import tqdm
from itertools import islice

#auxiliary stuff
class AverageMeter:
    
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

### Training
Padded positions should not contribute to the loss, so we pass `ignore_index=0` (the padding id) to `CrossEntropyLoss`.

In [None]:
NUM_EPOCH = 10
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = SimpleModel().to(DEVICE)
optimizer = optim.Adam(model.parameters(), 1e-3)

# ignore padding index
criterion = nn.CrossEntropyLoss(ignore_index=0)
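
To see what `ignore_index=0` does, the following sketch compares the criterion against a manual computation that averages the per-token negative log-likelihood over non-padded positions only. The tensors are random toy data with assumed shapes (2 sentences, 4 steps, 5 tags). It also shows why the logits are transposed before the loss is computed: `CrossEntropyLoss` expects the class dimension second.

```python
import torch
from torch import nn

torch.manual_seed(0)

logits = torch.randn(2, 4, 5)            # [batch, time, num_tags]
targets = torch.tensor([[3, 1, 2, 0],    # 0 marks padding
                        [4, 0, 0, 0]])

criterion = nn.CrossEntropyLoss(ignore_index=0)

# CrossEntropyLoss wants [batch, num_tags, time], hence the transpose
loss = criterion(logits.transpose(1, 2), targets)

# Manual version: per-token negative log-likelihood, averaged over real tokens
log_probs = torch.log_softmax(logits, dim=-1)
token_nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
mask = targets != 0
manual_loss = token_nll[mask].mean()

print(loss.item(), manual_loss.item())
```

The two values agree, so the average is taken over the four real tokens and the four padded positions are skipped entirely.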

In [None]:
for _ in range(NUM_EPOCH):

    loss_meter = AverageMeter()
    for batch in islice(generate_batches(train_data), 0, len(train_data) // BATCH_SIZE):
        word_id, tag_id = batch
        word_id = torch.from_numpy(word_id).long().to(DEVICE)
        tag_id = torch.from_numpy(tag_id).long().to(DEVICE)
        
        logits = model(word_id).transpose(-1, -2)
        
        loss = criterion(logits, tag_id)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        loss_meter.update(loss.item())
    
    print(loss_meter.avg)

0.9433784815802503
0.3713996912116435
0.2485774821309901
0.19659109039982753
0.17134290141845818
0.15717434082458268
0.148181233410515
0.1418686525590384
0.1371195256487647
0.13305384988215432


In [None]:
def compute_test_accuracy(model):
    test_words, test_tags = zip(*[zip(*sentence) for sentence in test_data])
    test_words, test_tags = to_matrix(test_words, word_to_id), to_matrix(test_tags, tag_to_id)

    test_words = torch.from_numpy(test_words).long().to(DEVICE)
    test_tags = torch.from_numpy(test_tags).long().to(DEVICE)

    # No gradients are needed for evaluation
    with torch.no_grad():
        predicted_tags = model(test_words).argmax(dim=-1)

    # Count only real tokens: id 0 marks padding
    is_token = test_tags != 0
    numerator = torch.sum((predicted_tags == test_tags) & is_token)
    denominator = torch.sum(is_token)
    accuracy = (numerator.float() / denominator.float()).item()
    return accuracy

In [None]:
accuracy = compute_test_accuracy(model)

In [None]:
print("Final accuracy: %.5f" % accuracy)

assert accuracy > 0.94

Final accuracy: 0.94673


### Task I: getting all bidirectional

Since we're analyzing a full sequence, it's legal for us to look into future data.

A simple way to achieve that is to go both directions at once, making a __bidirectional RNN__.

Try setting the `bidirectional` argument to `True` in `nn.RNN`. You will need to adjust layer dimensions too, since a bidirectional RNN outputs `2 * hidden_size` features per step!

Your first task is to use such a layer for our POS-tagger.
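
Before writing the model, it may help to check the output shapes of a bidirectional layer in isolation. The sketch below uses arbitrary illustrative sizes (8 input features, 16 hidden units) that are not the ones used in the notebook's model.

```python
import torch
from torch import nn

# Illustrative sizes only (assumed): 8 input features, 16 hidden units
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)

x = torch.randn(2, 5, 8)      # [batch, time, features]
output, h_n = rnn(x)

# Forward and backward states are concatenated along the last axis
print(output.shape)  # torch.Size([2, 5, 32])

# One final hidden state per direction: [num_directions, batch, hidden]
print(h_n.shape)     # torch.Size([2, 2, 16])
```

Whatever layer consumes `output` therefore needs an input size of twice the hidden size.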

In [None]:
# Define a model that utilizes bidirectional nn.RNN

### Your code here ###

In [None]:
acc = compute_test_accuracy(model)
print("\nFinal accuracy: %.5f"%acc)

assert acc>0.96, "Bidirectional RNNs are better than this!"
print("Well done!")

### Task II: now go and improve it

You guessed it. We're now going to ask you to come up with a better network.

Here are a few tips:

* __Go beyond nn.RNN__: there's `nn.LSTM` and `nn.GRU`
  * You can also use 1D Convolutions (`nn.Conv1d`). They are often as good as recurrent layers but with less overfitting.
* __Stack more layers__: if there is a common motif to this course it's about stacking layers
  * Try to add recurrent and 1dconv layers on top of one another
  * Just remember that bigger networks may need more epochs to train
* __Gradient clipping__: If your training isn't as stable as you'd like, try to use `nn.utils.clip_grad_norm_`.
  * Which is to say, it's a good idea to watch over your loss curve at each minibatch. 
* __Regularization__: you can apply dropout as usual, but also in an RNN-specific way
  * `nn.Dropout` works in between RNN layers
  * Recurrent layers also have a `dropout` parameter
* __More words!__: You can obtain greater performance by expanding your model's input dictionary from 10,000 words up to every single word!
  * Just make sure your model doesn't overfit due to so many parameters.
  * Combined with regularizers or pre-trained word vectors this could be really good, because right now our model is blind to about 7% of word occurrences (coverage is 0.92876).
* __The most important advice__: don't cram in everything at once!
  * If you stuff in a lot of modifications, some of them will almost inevitably be detrimental, and you'll never know which ones.
  * Try to instead go in small iterations and record experiment results to guide further search.
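
As one example of the tips above, here is a minimal sketch of where gradient clipping fits in a training step: after `backward()` and before `optimizer.step()`. The toy LSTM, the random input, the artificial loss and the threshold `MAX_NORM = 5.0` are all assumptions chosen for illustration, not recommended settings.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy model and data (hypothetical sizes, unrelated to the tagger above)
toy_model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-3)
x = torch.randn(4, 10, 8)

output, _ = toy_model(x)
loss = (output ** 2).sum() * 1000   # artificial loss, scaled up to get large gradients

optimizer.zero_grad()
loss.backward()

MAX_NORM = 5.0
# Rescales all gradients in place so their joint L2 norm is at most MAX_NORM;
# the return value is the norm measured before clipping
norm_before = nn.utils.clip_grad_norm_(toy_model.parameters(), MAX_NORM)

# Recompute the joint gradient norm to confirm the effect
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in toy_model.parameters()))

optimizer.step()
print(float(norm_before), float(norm_after))
```

Logging `norm_before` at each minibatch is a cheap way to spot the spikes that make training unstable.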
  
There's some advanced stuff waiting at the end of the notebook.
  
Good hunting!

In [None]:
# <Your code here!>

In [None]:
acc = compute_test_accuracy(model)
print("\nFinal accuracy: %.5f"%acc)

if acc >= 0.99:
    print("Awesome! Sky was the limit and yet you scored even higher!")
elif acc >= 0.98:
    print("Excellent! Whatever dark magic you used, it certainly did its trick.")
elif acc >= 0.97:
    print("Well done! If this was a graded assignment, you would have gotten a 100% score.")
elif acc > 0.96:
    print("Just a few more iterations!")
else:
    print("There seems to be something broken in the model. Unless you know what you're doing, try taking bidirectional RNN and adding one enhancement at a time to see where's the problem.")


#### Some advanced stuff
Here are a few more tips on how to improve training that are a bit trickier to implement. We strongly suggest that you try them _after_ you've got a good initial model.
* __Use pre-trained embeddings__: you can use pre-trained weights from [there](http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/) to kickstart your Embedding layer.
  * Use `nn.Embedding.from_pretrained` to initialize the embedding layer with a pre-trained matrix.
  * When using pre-trained embeddings, pay attention to the fact that the pre-trained model's dictionary is different from your own.
  * You may want to freeze the embedding layer's parameters for the first several epochs of fine-tuning, or not fine-tune them at all. In the first case you can set a zero learning rate for this parameter group; in the second case just use the `freeze` argument of `nn.Embedding.from_pretrained`.
* __More efficient batching__: right now the model spends a lot of time iterating over "0"s
  * This happens because each batch is always padded to the length of its longest sentence
  * You can speed things up by pre-generating batches of sentences with similar lengths and feeding the model a randomly chosen pre-generated batch.
  * This technically breaks the i.i.d. assumption, but it works unless you come up with some insane RNN architectures.
* __Structured loss functions__: since we're tagging the whole sequence at once, we might as well train our network to do so.
  * There's more than one way to do so, but we'd recommend starting with [Conditional Random Fields](http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/)
  * You can read an [official PyTorch tutorial on Bi-LSTM Conditional Random Field](https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html#bi-lstm-conditional-random-field-discussion).
