
In [1]:
!pip install watermark allennlp



In [2]:
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import os
import time
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
from overrides import overrides
from typing import Iterator, List, Dict, Tuple

%watermark -a 'Ethen' -d -t -v -p numpy,torch,allennlp,nltk

Ethen 2019-10-16 17:58:33 

CPython 3.6.8
IPython 5.5.0

numpy 1.16.5
torch 1.2.0
allennlp 0.9.0
nltk 3.2.5


# AllenNLP WalkThrough - Part of Speech Tagging

The problem of **Part of Speech (POS) Tagging** looks like the following: given a sentence (e.g. "The dog ate the apple"), we want to predict the part-of-speech tag for each word (e.g. ["DET", "NN", "V", "DET", "NN"]). In this documentation, we'll walk through the process of using the AllenNLP framework to solve this particular problem.

AllenNLP at its core is a framework for constructing pipelines to train NLP models. We can leverage different components of the framework or implement our custom components at different steps to tackle various NLP problems.

A typical AllenNLP pipeline is composed of the following components:

- DatasetReader: Extracts the necessary information from the data and turns it into a list of `Instance` objects.
- Model: The model to be trained. This is where we define the architecture of the neural network.
- Iterator: Batches the data.
- Trainer: Handles training/optimization and metric recording.
- Predictor: Generates predictions from raw strings.

Each of these components is loosely coupled, meaning it is easy to swap different components in without having to change other parts of your code. To take full advantage of all the features available, we'll need to spend some time and understand what each component is responsible for and what protocols it must respect for these parts to work well together.

In [3]:
# some constant parameters we can tweak along the way
seed = 1
batch_size = 32
embedding_dim = 100
hidden_dim = 128
lr = 0.1
num_epochs = 300

torch.manual_seed(seed)

<torch._C.Generator at 0x7f47cfb7a4d0>

## [DatasetReader](https://allenai.github.io/allennlp-docs/api/allennlp.data.dataset_readers.dataset_reader.html)

In AllenNLP each training example is represented as an `Instance` consisting of `Fields` of various types. A `DatasetReader` defines the logic to generate those instances (typically) from data stored on disk.

Typically to create a `DatasetReader` we'd implement two methods:

- `text_to_instance`. The naming for this method is slightly misleading, as it handles not only our text, but also labels, metadata, and anything else that our model will need later on. The essence of this method is to take the data for a single example and pack it into an `Instance` object. `Instance` objects are very similar to dictionaries, and all we need to know about them in practice is that they are instantiated with a dictionary mapping field names to `Field` objects.

- `_read` takes the path to an input file and returns an Iterator of `Instances`. (It will probably delegate most of its work to `text_to_instance`.)

We'll introduce more as we go along.

In [0]:
from allennlp.data import Instance
from allennlp.data.fields import TextField, SequenceLabelField
from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

In [0]:
class PosDatasetReader(DatasetReader):
    """
    DatasetReader for PoS tagging data, one sentence per line, like

    The\u0001DET dog\u0001NN ate\u0001V the\u0001DET apple\u0001NN
    
    i.e. we have the corresponding part of speech tagging for each token in the sentence
    where each token and tag is delimited by a \u0001 symbol and each pair is then delimited
    by a white-space.
    """

    def __init__(self,
                 token_indexers: Dict[str, TokenIndexer] = None,
                 lazy: bool = False,
                 token_tag_delimiter: str = '\u0001') -> None:
        super().__init__(lazy=lazy)
        self.token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()}
        self.token_tag_delimiter = token_tag_delimiter

    @overrides
    def text_to_instance(self, tokens: List[Token], tags: List[str] = None) -> Instance:
        sentence_field = TextField(tokens, self.token_indexers)
        fields = {'sentence': sentence_field}

        if tags:
            label_field = SequenceLabelField(labels=tags, sequence_field=sentence_field)
            fields['labels'] = label_field

        return Instance(fields)

    @overrides
    def _read(self, file_path: str) -> Iterator[Instance]:
        """takes a filename and produces a stream of Instance."""
        with open(file_path) as f:
            for line in f:
                pairs = line.strip().split()
                sentence, tags = zip(*(pair.split(self.token_tag_delimiter) for pair in pairs))
                yield self.text_to_instance([Token(word) for word in sentence], tags) 

### text_to_instance

A couple of things to notice. The first is that the `tokens` variable is a `List[Token]` (and not a `List[str]`). If we use one of the built-in tokenizers, e.g. `WordTokenizer`, that's already the output we'll get. If we have pre-tokenized data, we just need to wrap each string token in a call to `Token`.

Another thing to notice is that the tags/labels are optional. This is so that after we train a model we can use it to make predictions on untagged data (which clearly won't have any tags).

We mentioned in the previous section that an `Instance` is a dictionary of `Field`s. Now is a good time to touch upon `Field`. `Field` objects in AllenNLP handle the conversion of our data into tensors, which are then fed into the model.

Here we're using the `TextField` for our tokens. The `TextField` does what all good NLP libraries do: it converts a sequence of tokens into integers. Be careful here though, since this is all the `TextField` does. It doesn't clean the text, tokenize the text, etc.; we'll need to do that ourselves. `TextField` takes an additional argument on `init`: the token indexer. Though the `TextField` handles converting tokens to integers, we need to tell it how to do this. Why? Because we might want to use a character-level model instead of a word-level model. AllenNLP gives us the flexibility of specifying these attributes.

As for the tags/labels, we put them in a `SequenceLabelField`, which is for labels corresponding to each element of a sequence. If we had a label that applied to the entire sentence, for example "sentiment", we would instead use a `LabelField`.

Finally, we just return an `Instance` containing the dict `field_name -> Field`.

Usually a `DatasetReader` will need to have a `dict` of `TokenIndexer`s that specify how we want to convert text tokens into indices. For instance, we will usually have a `SingleIdTokenIndexer`, which maps each word to a unique ID, and we might also (or instead) have a `TokenCharactersIndexer`, which maps each word to a sequence of indices corresponding to its characters.
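
To make the difference between word-level and character-level indexing concrete, here is a minimal pure-Python sketch. It does not use AllenNLP's indexer classes; the toy vocabularies, the `<pad>`/`<unk>` names, the lowercasing step, and the helper functions `index_words` and `index_characters` are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical toy vocabularies: index 0 is reserved for padding, 1 for unknown items.
word_to_id: Dict[str, int] = {"<pad>": 0, "<unk>": 1, "the": 2, "dog": 3}
char_to_id: Dict[str, int] = {"<pad>": 0, "<unk>": 1, "t": 2, "h": 3,
                              "e": 4, "d": 5, "o": 6, "g": 7}


def index_words(tokens: List[str]) -> List[int]:
    """One id per token (the idea behind a single-id indexer)."""
    return [word_to_id.get(token.lower(), word_to_id["<unk>"]) for token in tokens]


def index_characters(tokens: List[str]) -> List[List[int]]:
    """One list of character ids per token (the idea behind a character indexer)."""
    return [[char_to_id.get(ch, char_to_id["<unk>"]) for ch in token.lower()]
            for token in tokens]


tokens = ["The", "dog", "ate"]
word_ids = index_words(tokens)
char_ids = index_characters(tokens)
print(word_ids)
print(char_ids)
```

The word-level view yields one integer per token, while the character-level view yields a nested list, which is why the choice of indexer changes the shape of the tensors that the model eventually receives.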

### _read

The main purpose of this method is to produce a stream of `Instance`.

We split each line on spaces to get `word\u0001TAG` pairs, split each pair on the delimiter to get `(word, tag)` tuples, use `zip` to break those into a sequence of words and a sequence of tags, wrap each word in `Token` (as described in the previous section), and then call `text_to_instance`.

The reason for splitting the logic into two functions is that `text_to_instance` is useful on its own, for instance, if you build an interactive demo for your model and want to produce Instances from user-supplied sentences.
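
The parsing steps described in this section can be tried in isolation, without AllenNLP. The sketch below is a standalone helper (the name `parse_tagged_line` is hypothetical) that mirrors the split/zip logic on a single line; returning empty lists for a blank line is an added assumption, not something the reader above does.

```python
from typing import List, Tuple


def parse_tagged_line(line: str, delimiter: str = "\u0001") -> Tuple[List[str], List[str]]:
    """Split one 'word<delim>TAG word<delim>TAG ...' line into words and tags."""
    pairs = line.strip().split()
    if not pairs:
        # assumption: treat a blank line as an empty sentence
        return [], []
    # each pair becomes [word, tag]; zip(*...) transposes them into two tuples
    words, tags = zip(*(pair.split(delimiter) for pair in pairs))
    return list(words), list(tags)


line = "The\u0001DET dog\u0001NN ate\u0001V the\u0001DET apple\u0001NN\n"
words, tags = parse_tagged_line(line)
print(words)
print(tags)
```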

### init

The `__init__` method takes in the token indexer that we'll use for the `TextField` and a `lazy` parameter.

- `lazy`. If we're working with datasets that don't fit into memory, AllenNLP can lazily load the data (only reading it into memory when we actually need it). This does impose some additional complexity and runtime overhead.

We can also add other parameters to make our reader more customizable, e.g. provide various options for the tokenization part.
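
As a rough illustration of what lazy loading means, the sketch below contrasts materializing every example up front with pulling examples from a generator on demand. It is plain Python and only an analogy for the `lazy` flag; `read_lines` and the sample sentences are made up for this example.

```python
from typing import Iterator, List


def read_lines(lines: List[str]) -> Iterator[List[str]]:
    """Generator: each line is tokenized only when the consumer asks for it."""
    for line in lines:
        yield line.split()


raw_lines = ["The dog ate the apple", "Pierre Vinken will join the board"]

eager = list(read_lines(raw_lines))  # everything is processed and held in memory
lazy = read_lines(raw_lines)         # nothing has been processed yet
first = next(lazy)                   # only the first line has been processed
print(len(eager), first)
```

The eager version is simpler to index into (e.g. `eager[0]`), while the lazy version keeps memory usage flat at the cost of only being able to move forward through the data.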

## Data

The [Penn Treebank](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.9.8216&rep=rep1&type=pdf) contains a corpus of annotated POS tags. A sample is available in the NLTK Python library, which ships a number of corpora that can be used to train and test NLP models.

In [6]:
import nltk
from nltk.corpus import treebank

nltk.download('treebank')
nltk.download('universal_tagset')

sentences = treebank.tagged_sents(tagset='universal')
print('number of sentences: ', len(sentences))

# a sentence consists of multiple tuples, where the tuples
# are a pair of token and their corresponding POS tag
sentences[0]

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
number of sentences:  3914


[('Pierre', 'NOUN'),
 ('Vinken', 'NOUN'),
 (',', '.'),
 ('61', 'NUM'),
 ('years', 'NOUN'),
 ('old', 'ADJ'),
 (',', '.'),
 ('will', 'VERB'),
 ('join', 'VERB'),
 ('the', 'DET'),
 ('board', 'NOUN'),
 ('as', 'ADP'),
 ('a', 'DET'),
 ('nonexecutive', 'ADJ'),
 ('director', 'NOUN'),
 ('Nov.', 'NOUN'),
 ('29', 'NUM'),
 ('.', '.')]

We'll write the data out to a file so that each line of our dataset looks something like:

`The\u0001DET dog\u0001NN ate\u0001V the\u0001DET apple\u0001NN`

Here there is a `\u0001` delimiter between each token and its tag, and each pair is then delimited by a white space.

In [0]:
def write_nltk_treebank_tagged_sents(sentences: List[List[Tuple[str, str]]], output_filename: str,
                                     token_tag_delimiter: str = '\u0001') -> None:
    # create the output directory, but only if the filename has a directory component
    directory = os.path.dirname(output_filename)
    if directory:
        os.makedirs(directory, exist_ok=True)

    with open(output_filename, 'w') as f:
        for sentence in sentences:
            line = ' '.join(token + token_tag_delimiter + tag for token, tag in sentence)
            f.write(line + '\n')

In [0]:
# split the data into training and validation
validation_size = 0.2
sentences = treebank.tagged_sents(tagset='universal')

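# note: train_index is negative here (1 - 782 = -781), so the slices below count from
# the end of the corpus: the last 781 sentences become the validation set
train_index = 1 - int(validation_size * len(sentences))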
training_sentences = sentences[:train_index]
validation_sentences = sentences[train_index:]

data_dir = 'data'
train_dataset_path = os.path.join(data_dir, 'treebank_tagged_sents_train.txt')
validation_dataset_path = os.path.join(data_dir, 'treebank_tagged_sents_validation.txt')
write_nltk_treebank_tagged_sents(training_sentences, train_dataset_path)
write_nltk_treebank_tagged_sents(validation_sentences, validation_dataset_path)

In [9]:
reader = PosDatasetReader()
train_dataset = reader.read(train_dataset_path)
validation_dataset = reader.read(validation_dataset_path)

# for example, we can access the first element of the Stream of Instance
# and access the TextField, which we keyed as 'sentence', and access the
# tokens attribute to look at the tokenized word
train_dataset[0].fields['sentence'].tokens

3133it [00:00, 10107.68it/s]
781it [00:00, 3568.37it/s]


[Pierre,
 Vinken,
 ,,
 61,
 years,
 old,
 ,,
 will,
 join,
 the,
 board,
 as,
 a,
 nonexecutive,
 director,
 Nov.,
 29,
 .]

## [Vocabulary](https://allenai.github.io/allennlp-docs/api/allennlp.data.vocabulary.html)

The other thing that goes hand in hand with our `DatasetReader` and the `Iterator` (which we'll discuss in the next section) is the `Vocabulary`. To build the vocabulary, we need to pass through all the text. We can only convert fields into tensors once we know what the vocabulary is.
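
Conceptually, building a vocabulary is a counting pass followed by assigning an integer to each distinct item. The sketch below is a simplified pure-Python stand-in, not AllenNLP's implementation: `build_vocab` is a hypothetical helper, and it assumes the token namespace reserves index 0 for padding and index 1 for unknown tokens while the label namespace reserves nothing.

```python
from collections import Counter
from typing import Dict, List, Tuple

PAD, UNK = "@@PADDING@@", "@@UNKNOWN@@"


def build_vocab(tagged_sentences: List[List[Tuple[str, str]]]
                ) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Count tokens and tags, then assign indices from most to least frequent."""
    token_counts: Counter = Counter()
    label_counts: Counter = Counter()
    for sentence in tagged_sentences:
        for token, tag in sentence:
            token_counts[token] += 1
            label_counts[tag] += 1

    # tokens: index 0 is padding and index 1 is the unknown token
    token_to_index = {PAD: 0, UNK: 1}
    for token, _ in token_counts.most_common():
        token_to_index[token] = len(token_to_index)

    # labels: no padding or unknown entries are reserved
    label_to_index = {label: i for i, (label, _) in enumerate(label_counts.most_common())}
    return token_to_index, label_to_index


toy_sentences = [[("the", "DET"), ("dog", "NOUN"), ("ate", "VERB"),
                  ("the", "DET"), ("apple", "NOUN")]]
token_to_index, label_to_index = build_vocab(toy_sentences)
print(token_to_index)
print(label_to_index)
# an unseen word falls back to the unknown index
print(token_to_index.get("cat", token_to_index[UNK]))
```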

In [10]:
from allennlp.data.vocabulary import Vocabulary

vocab = Vocabulary.from_instances(train_dataset + validation_dataset)
vocab

100%|██████████| 3914/3914 [00:00<00:00, 34066.56it/s]


Vocabulary with namespaces:  tokens, Size: 12410 || labels, Size: 12 || Non Padded Namespaces: {'*labels', '*tags'}

In [11]:
# look at the distinct labels we have in our dataset
[vocab.get_token_from_index(i, 'labels') for i in range(vocab.get_vocab_size('labels'))]

['NOUN',
 'VERB',
 '.',
 'ADP',
 'DET',
 'X',
 'ADJ',
 'NUM',
 'PRT',
 'ADV',
 'PRON',
 'CONJ']

## [Iterators](https://allenai.github.io/allennlp-docs/api/allennlp.data.iterators.html)

Neural networks are usually trained on mini batches of tensors, not lists of data. Therefore, datasets need to be batched and converted to tensors. This seems trivial at first glance, but there is a lot of subtlety here. To list just a few things we have to consider:

- Sequences of different lengths need to be padded
- To minimize padding, sequences of similar lengths can be put in the same batch
- Tensors need to be sent to the GPU if using the GPU
- Data needs to be shuffled at the end of each epoch during training, but we don't want to shuffle in the midst of an epoch in order to cover all examples evenly

Thankfully, AllenNLP has several convenient iterators that take care of all of these problems behind the scenes, so we will rarely have to implement our own iterators from scratch.

In [0]:
from allennlp.data.iterators import BucketIterator

iterator = BucketIterator(batch_size=batch_size, sorting_keys=[('sentence', 'num_tokens')])
iterator.index_with(vocab)

The `BucketIterator` batches sequences of similar lengths together to minimize padding. The `sorting_keys` keyword argument tells the iterator which field to reference when determining the text length of each instance. Here `sentence` is the key to our `TextField` and `num_tokens` is the padding key that tells it to sort according to the number of tokens in that field.
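
The idea behind bucketing can be sketched in a few lines of plain Python: sort the sequences by length, cut the sorted list into batches, and pad each batch only up to its own longest sequence. This is a simplified illustration under those assumptions; `bucket_batches` is a hypothetical helper, and it leaves out things a real iterator would handle, such as shuffling the order of the batches.

```python
from typing import List


def bucket_batches(sequences: List[List[int]], batch_size: int,
                   pad_id: int = 0) -> List[List[List[int]]]:
    """Group sequences of similar length and pad each batch to its own max length."""
    ordered = sorted(sequences, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        max_len = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_id] * (max_len - len(seq)) for seq in chunk])
    return batches


sequences = [[5, 6, 7, 8, 9], [3], [4, 2], [7, 7, 7, 7]]
batches = bucket_batches(sequences, batch_size=2)
for batch in batches:
    print(batch)
```

With this toy input only two padding positions are needed in total, whereas batching the sequences in their original order would pad `[3]` all the way up to length five.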

Iterators are responsible for converting our text to numerical ids. We pass the vocabulary we built earlier so that the Iterator knows how to map our text to integers.

In [13]:
# have a peek at what the output from the iterator looks like
# when passing in the train_dataset from earlier
batch = next(iter(iterator(train_dataset)))
batch

{'labels': tensor([[ 4,  0,  0,  ...,  0,  0,  0],
         [ 6,  4,  0,  ...,  0,  0,  0],
         [ 7,  0,  3,  ...,  0,  0,  0],
         ...,
         [ 2,  4,  1,  ...,  0,  2,  2],
         [10,  1,  1,  ...,  0,  0,  0],
         [ 5,  1,  3,  ...,  0,  0,  0]]),
 'sentence': {'tokens': tensor([[  93, 5378, 1020,  ...,    0,    0,    0],
          [2631,   59,  171,  ...,    0,    0,    0],
          [3157, 3820,   28,  ...,    0,    0,    0],
          ...,
          [  20,  325,  133,  ..., 8208,    4,   21],
          [  99,   54, 5529,  ...,    0,    0,    0],
          [  10, 4534,   27,  ...,    0,    0,    0]])}}

Note the `tokens` key is the key name that we've specified for our `token_indexers`.

## [Model](https://allenai.github.io/allennlp-docs/api/allennlp.models.model.html#allennlp.models.model.Model)

In [0]:
from allennlp.models import Model
from allennlp.modules.text_field_embedders import TextFieldEmbedder, BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.seq2seq_encoders import Seq2SeqEncoder, PytorchSeq2SeqWrapper
from allennlp.nn.util import get_text_field_mask, sequence_cross_entropy_with_logits
from allennlp.training.metrics import CategoricalAccuracy

Now that we've prepared the data, the next part is the `Model`. An AllenNLP `Model` is a subclass of `torch.nn.Module`, and what the forward pass looks like is mainly up to us. The key difference is that AllenNLP models are required to compute the loss within the `forward` method during training and return a dictionary for every forward pass that includes that loss value. This output is what the downstream `Trainer` expects (we'll touch upon it in a later section).

AllenNLP models are generally composed from the following components:

- A token embedder
- An encoder
- (For seq-to-seq models) A decoder

In this example, we will create a model that consists of an embedding layer, a sequence encoder, and a feedforward network that predicts the tags.

### Embedder

The embedder maps a sequence of token ids (or character ids) into a sequence of tensors.

For embedding the tokens we'll use the `BasicTextFieldEmbedder` which takes a mapping from index names to embeddings. If we go back to where we defined our `DatasetReader`, the key to our token_indexers was called `tokens`, so our mapping just needs an embedding corresponding to that index. 

We'll also use the `Vocabulary` to find how many embeddings we need and our `embedding_dim` parameter to specify the output dimension.
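
At its core, an embedding layer is a lookup table with one row per vocabulary entry. The following pure-Python sketch shows how a batch of token ids turns into a batch of vectors; the table sizes, the random initialization, and the `embed` helper are illustrative assumptions rather than AllenNLP's `Embedding` module.

```python
import random
from typing import List

random.seed(1)
num_embeddings, toy_embedding_dim = 6, 4

# one row of toy_embedding_dim floats per vocabulary index
embedding_table = [[random.uniform(-1, 1) for _ in range(toy_embedding_dim)]
                   for _ in range(num_embeddings)]


def embed(token_ids: List[List[int]]) -> List[List[List[float]]]:
    """Replace every token id with its row from the embedding table."""
    return [[embedding_table[i] for i in sentence] for sentence in token_ids]


batch_ids = [[2, 3, 4], [5, 2, 0]]  # (batch_size=2, sequence_length=3)
embedded = embed(batch_ids)
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (batch_size, sequence_length, embedding_dim)
```

The same id always maps to the same vector, so the two occurrences of id 2 in `batch_ids` share one row of the table.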

In [0]:
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=embedding_dim)
word_embeddings = BasicTextFieldEmbedder({'tokens': token_embedding})

### Encoder

We next need to specify the sequence encoder. AllenNLP provides a `PytorchSeq2SeqWrapper` that adds some extra functionality on top of the built-in PyTorch module. In AllenNLP, we do everything batch first, so we specify that as well.

In [0]:
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(embedding_dim, hidden_dim, batch_first=True))

We can now define and instantiate the model.

In [0]:
class LstmTagger(Model):

    def __init__(self,
                 word_embeddings: TextFieldEmbedder,
                 encoder: Seq2SeqEncoder,
                 vocab: Vocabulary) -> None:

        super().__init__(vocab)
        self.word_embeddings = word_embeddings
        self.encoder = encoder
        self.hidden2class = nn.Linear(encoder.get_output_dim(),
                                      vocab.get_vocab_size('labels'))
        self.accuracy = CategoricalAccuracy()

    @overrides
    def forward(self,
                sentence: Dict[str, torch.Tensor],
                labels: torch.Tensor = None) -> Dict[str, torch.Tensor]:

        mask = get_text_field_mask(sentence)
        embeddings = self.word_embeddings(sentence)
        encoder_out = self.encoder(embeddings, mask)
        class_logits = self.hidden2class(encoder_out)
        output = {'class_logits': class_logits}

        if labels is not None:
            self.accuracy(class_logits, labels, mask)
            output['loss'] = sequence_cross_entropy_with_logits(class_logits, labels, mask)

        return output

    @overrides
    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}

In [18]:
model = LstmTagger(word_embeddings, lstm, vocab)

# we can now unpack the batch and feed it to the model
model(**batch)

{'class_logits': tensor([[[-0.0594,  0.0619, -0.0828,  ...,  0.0825, -0.0795,  0.0140],
          [-0.0688,  0.0606, -0.0804,  ...,  0.0798, -0.0787,  0.0130],
          [-0.0745,  0.0592, -0.0799,  ...,  0.0812, -0.0779,  0.0128],
          ...,
          [-0.0422,  0.0618, -0.0874,  ...,  0.0809, -0.0773,  0.0100],
          [-0.0422,  0.0618, -0.0874,  ...,  0.0809, -0.0773,  0.0100],
          [-0.0422,  0.0618, -0.0874,  ...,  0.0809, -0.0773,  0.0100]],
 
         [[-0.0611,  0.0610, -0.0846,  ...,  0.0808, -0.0799,  0.0152],
          [-0.0692,  0.0625, -0.0824,  ...,  0.0820, -0.0788,  0.0139],
          [-0.0747,  0.0622, -0.0799,  ...,  0.0816, -0.0773,  0.0125],
          ...,
          [-0.0422,  0.0618, -0.0874,  ...,  0.0809, -0.0773,  0.0100],
          [-0.0422,  0.0618, -0.0874,  ...,  0.0809, -0.0773,  0.0100],
          [-0.0422,  0.0618, -0.0874,  ...,  0.0809, -0.0773,  0.0100]],
 
         [[-0.0607,  0.0608, -0.0812,  ...,  0.0833, -0.0796,  0.0149],
          [-

### init

Notice that the `word_embeddings` and `encoder` parameters are typed with the base classes of the embedder and encoder that we're using. This is so that we can mix and match different embedders and encoders to see which combination works best for our problem.

### forward

Each Instance in our dataset will get (batched with other instances and) fed into forward. The forward method expects dicts of tensors as input, and it expects their names to be the names of the fields in our Instance. In this case we have a sentence field and (possibly) a labels field, so we'll construct our forward accordingly.
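
The naming requirement is easiest to see through Python's keyword unpacking: the batch is a dictionary keyed by field name, and `model(**batch)` passes each entry as the argument of the same name. The toy function below stands in for a model's `forward`; its body and return values are placeholders invented for this illustration.

```python
from typing import Dict, List, Optional


def forward(sentence: Dict[str, List[List[int]]],
            labels: Optional[List[List[int]]] = None) -> Dict[str, object]:
    """Toy stand-in for a model's forward method (no real computation)."""
    output: Dict[str, object] = {"num_sentences": len(sentence["tokens"])}
    if labels is not None:
        # a real model would compute and store the loss here
        output["has_loss"] = True
    return output


# the keys match the field names: 'sentence' and (optionally) 'labels'
training_batch = {"sentence": {"tokens": [[2, 3, 0]]}, "labels": [[4, 0, 0]]}
prediction_batch = {"sentence": {"tokens": [[2, 3, 0]]}}

training_output = forward(**training_batch)
prediction_output = forward(**prediction_batch)
print(training_output)
print(prediction_output)
```

If a field were keyed under a name that `forward` does not accept, the unpacking call would raise a `TypeError`, which is why the field names and the argument names must agree.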

The core logic inside the forward method is then pretty much identical to a regular PyTorch module: we feed the input through each layer and, if labels were provided, compute the loss. The actual parameter update based on that loss is handled by the `Trainer`.

The other interesting part is the call to `get_text_field_mask`. AllenNLP is designed to operate on batched inputs, but different input sequences have different lengths. Behind the scenes AllenNLP is padding the shorter inputs so that the batch has uniform shape, which means our computations should use a mask to exclude the padding. Here we use the utility function `get_text_field_mask`, which returns a tensor of 0s and 1s corresponding to the padded and unpadded locations.
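
To see why the mask matters, here is a small pure-Python sketch of both steps: deriving a 0/1 mask from padded token ids, and using it to average a cross-entropy loss over real tokens only. It assumes padding uses id 0, and the helpers `text_field_mask` and `masked_cross_entropy` are hypothetical. It averages over all unpadded tokens, which may differ from the default averaging of the library function used in the model.

```python
import math
from typing import List


def text_field_mask(token_ids: List[List[int]]) -> List[List[int]]:
    """1 for real tokens, 0 for padding (assumes padding id is 0)."""
    return [[1 if token_id != 0 else 0 for token_id in row] for row in token_ids]


def masked_cross_entropy(logits: List[List[List[float]]],
                         labels: List[List[int]],
                         mask: List[List[int]]) -> float:
    """Mean negative log-likelihood over unmasked positions only."""
    total, count = 0.0, 0
    for seq_logits, seq_labels, seq_mask in zip(logits, labels, mask):
        for token_logits, label, keep in zip(seq_logits, seq_labels, seq_mask):
            if not keep:
                continue  # padded position: contributes nothing to the loss
            log_norm = math.log(sum(math.exp(x) for x in token_logits))
            total += log_norm - token_logits[label]
            count += 1
    return total / max(count, 1)


token_ids = [[4, 7, 0], [5, 0, 0]]
mask = text_field_mask(token_ids)
# two classes; the logits at padded positions are arbitrary on purpose
logits = [[[0.0, 0.0], [0.0, 0.0], [9.0, -9.0]],
          [[0.0, 0.0], [3.0, 1.0], [2.0, 2.0]]]
labels = [[0, 1, 0], [1, 0, 0]]
loss = masked_cross_entropy(logits, labels, mask)
print(mask, loss)
```

Because the three unpadded positions all have uniform logits over two classes, the loss equals `log(2)` regardless of what values sit in the padded positions.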

The output of the forward pass should be a dictionary that contains the `loss` key if we wish to use this `Model` alongside the `Trainer` (which we'll discuss in the next section).

Note that for this model, we are using the various modules/building blocks that come with AllenNLP. This doesn't mean we can't implement our own if we desire.

### get_metrics

This method gives us the flexibility to log additional metrics that we care about. In this case, we track an accuracy metric that gets updated on every forward pass in which labels are provided.
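
The pattern of "accumulate on every call, report on request, optionally reset" can be sketched without any framework. The class below is a simplified stand-in written for this walkthrough, not the library's `CategoricalAccuracy`: it takes already-decoded class ids rather than logits, and the name `RunningAccuracy` is made up.

```python
from typing import List


class RunningAccuracy:
    """Accumulates token-level accuracy across batches, ignoring masked positions."""

    def __init__(self) -> None:
        self.correct = 0
        self.total = 0

    def __call__(self, predictions: List[List[int]], labels: List[List[int]],
                 mask: List[List[int]]) -> None:
        for pred_row, label_row, mask_row in zip(predictions, labels, mask):
            for pred, label, keep in zip(pred_row, label_row, mask_row):
                if keep:
                    self.correct += int(pred == label)
                    self.total += 1

    def get_metric(self, reset: bool = False) -> float:
        accuracy = self.correct / self.total if self.total else 0.0
        if reset:
            # start counting from scratch, e.g. at the end of an epoch
            self.correct, self.total = 0, 0
        return accuracy


metric = RunningAccuracy()
metric([[1, 2, 0]], [[1, 3, 0]], [[1, 1, 0]])  # 1 of 2 unpadded tokens correct
metric([[4]], [[4]], [[1]])                    # 1 of 1 correct
epoch_accuracy = metric.get_metric(reset=True)
print(epoch_accuracy)
```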

## Trainer

One of the pain points in a flexible framework such as PyTorch is writing the training code after defining the model and dataset. Oftentimes, we need to write a lot of boilerplate code just to get a training loop. AllenNLP includes a `Trainer` class that handles most of the gory details of training models. After passing in all the necessary parameters, we can call `.train` to train the model.

Of course, this type of abstraction can be a double-edged sword. Some people enjoy having the ability to customize the logic of the training loop. Some frameworks such as `fastai` and `keras` give users the flexibility of adding different callbacks to customize the training loop.

The next code chunk contains boilerplate code to move our model to a GPU if we have access to one and to define an optimizer to train the model. After that, we have the bare minimum set of parameters we need to use the `Trainer` class.

In [0]:
if torch.cuda.is_available():
    cuda_device = 0
    model = model.cuda(cuda_device)
else:
    cuda_device = -1

optimizer = optim.SGD(model.parameters(), lr=lr)

In [20]:
from allennlp.training.trainer import Trainer

trainer = Trainer(model=model,
                  optimizer=optimizer,
                  iterator=iterator,
                  train_dataset=train_dataset,
                  validation_dataset=validation_dataset,
                  patience=10,
                  num_epochs=num_epochs,
                  cuda_device=cuda_device)
trainer.train()

accuracy: 0.2724, loss: 2.2897 ||: 100%|██████████| 98/98 [00:01<00:00, 64.43it/s]
accuracy: 0.2942, loss: 2.1997 ||: 100%|██████████| 25/25 [00:00<00:00, 123.04it/s]
accuracy: 0.2849, loss: 2.1952 ||: 100%|██████████| 98/98 [00:01<00:00, 69.66it/s]
accuracy: 0.2942, loss: 2.1874 ||: 100%|██████████| 25/25 [00:00<00:00, 153.95it/s]
accuracy: 0.2849, loss: 2.1880 ||: 100%|██████████| 98/98 [00:01<00:00, 73.00it/s]
accuracy: 0.2942, loss: 2.1845 ||: 100%|██████████| 25/25 [00:00<00:00, 146.71it/s]
accuracy: 0.2849, loss: 2.1859 ||: 100%|██████████| 98/98 [00:01<00:00, 73.46it/s]
accuracy: 0.2942, loss: 2.1830 ||: 100%|██████████| 25/25 [00:00<00:00, 142.30it/s]
accuracy: 0.2849, loss: 2.1843 ||: 100%|██████████| 98/98 [00:01<00:00, 73.52it/s]
accuracy: 0.2942, loss: 2.1801 ||: 100%|██████████| 25/25 [00:00<00:00, 149.85it/s]
accuracy: 0.2849, loss: 2.1824 ||: 100%|██████████| 98/98 [00:01<00:00, 73.58it/s]
accuracy: 0.2942, loss: 2.1805 ||: 100%|██████████| 25/25 [00:00<00:00, 154.95it/s

{'best_epoch': 134,
 'best_validation_accuracy': 0.9477555922534154,
 'best_validation_loss': 0.16635991930961608,
 'epoch': 143,
 'peak_cpu_memory_MB': 2129.736,
 'peak_gpu_0_memory_MB': 389,
 'training_accuracy': 0.9837160596334255,
 'training_cpu_memory_MB': 2129.736,
 'training_duration': '0:03:46.738170',
 'training_epochs': 143,
 'training_gpu_0_memory_MB': 389,
 'training_loss': 0.05842211561239496,
 'training_start_epoch': 0,
 'validation_accuracy': 0.9492568683380874,
 'validation_loss': 0.16979450821876527}

The `patience` parameter tells the trainer to stop training early if it ever spends 10 epochs without the validation loss improving.
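
The early-stopping rule itself is simple enough to sketch in plain Python. The function below is a hypothetical helper, not the `Trainer`'s code: it replays a list of made-up validation losses and reports the best epoch and the epoch at which training would stop. The exact counting convention inside the library may differ by one.

```python
from typing import List, Tuple


def train_with_patience(validation_losses: List[float], patience: int) -> Tuple[int, int]:
    """Return (best_epoch, last_epoch_run) under a simple early-stopping rule."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # stop early
    return best_epoch, len(validation_losses) - 1


# made-up losses: the best value (0.7) occurs at epoch 4
toy_losses = [1.0, 0.8, 0.9, 0.85, 0.7, 0.75, 0.72, 0.71, 0.6]
best_epoch, stopped_at = train_with_patience(toy_losses, patience=3)
print(best_epoch, stopped_at)
```

Note that the final loss of 0.6 is never reached: with a patience of 3 the loop gives up after three consecutive epochs without improvement, which is the trade-off early stopping makes.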

## Predictor

To look at the predictions that our model is making, AllenNLP provides a `Predictor` abstraction that takes inputs, converts them to `Instance`s, feeds them through our model, and returns JSON-serializable results.

Often we would need to implement our own Predictor, but AllenNLP already has a `SentenceTaggerPredictor` that works perfectly here, so we can use it:

In [21]:
from allennlp.predictors import SentenceTaggerPredictor

predictor = SentenceTaggerPredictor(model, dataset_reader=reader)

# we can provide completely new text to the predict method and access
# the 'class_logits' key to get the per-class scores (logits) for each token
class_logits = predictor.predict('The dog ate the apple')['class_logits']
class_logits

[[10.796961784362793,
  -5.007230281829834,
  -15.54598617553711,
  9.824734687805176,
  16.484712600708008,
  -12.386752128601074,
  9.272146224975586,
  -4.658377647399902,
  -4.870985984802246,
  4.820064544677734,
  -6.4194793701171875,
  -2.0905094146728516],
 [7.959966659545898,
  3.6185107231140137,
  -7.2956743240356445,
  -0.41348427534103394,
  -0.6415822505950928,
  -1.552140474319458,
  5.120409965515137,
  0.7968708872795105,
  -3.256251335144043,
  1.9163012504577637,
  -2.056368112564087,
  -4.145318984985352],
 [7.676870346069336,
  6.545048713684082,
  -7.966844081878662,
  -0.5770882368087769,
  -4.420485019683838,
  -0.2390516698360443,
  3.7002615928649902,
  2.178879499435425,
  -1.561387538909912,
  2.284492254257202,
  -2.8406355381011963,
  -5.06860876083374],
 [17.334558486938477,
  -7.458230972290039,
  -28.466014862060547,
  15.429902076721191,
  27.2178897857666,
  -22.110519409179688,
  18.125486373901367,
  -6.4928765296936035,
  -6.579277515411377,
  9.07

In [22]:
class_ids = np.argmax(class_logits, axis=-1)
print([vocab.get_token_from_index(i, 'labels') for i in class_ids])

['DET', 'NOUN', 'NOUN', 'DET', 'NOUN']


## Serialization & DeSerialization

We need to save both the model weights and the vocabulary.

In [0]:
# Here's how to save the model.
model_checkpoint_dir = 'models'
if not os.path.exists(model_checkpoint_dir):
    os.makedirs(model_checkpoint_dir, exist_ok=True)

model_filename = os.path.join(model_checkpoint_dir, 'model.pt')
with open(model_filename, 'wb') as f:
    torch.save(model.state_dict(), f)

vocab_filename = os.path.join(model_checkpoint_dir, 'vocabulary.allennlp')
vocab.save_to_files(vocab_filename)

In [0]:
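# And here's how to reload the model.
# note: for brevity we reuse the already-trained word_embeddings and lstm modules from above;
# in a fresh process we would re-create them before loading the saved state dict
vocab2 = Vocabulary.from_files(vocab_filename)
model2 = LstmTagger(word_embeddings, lstm, vocab2)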

with open(model_filename, 'rb') as f:
    model2.load_state_dict(torch.load(f))

if cuda_device > -1:
    model2.cuda(cuda_device)

In [0]:
# generate the predictions again
predictor2 = SentenceTaggerPredictor(model2, dataset_reader=reader)
class_logits2 = predictor2.predict("The dog ate the apple")['class_logits']
np.testing.assert_array_almost_equal(class_logits2, class_logits)

There are a lot of different improvements that we can work on, but hopefully this gives a taste of what an NLP pipeline looks like in AllenNLP.

# Reference

- [Github: AllenNLP Tutorial - Getting Started with AllenNLP](https://allennlp.org/tutorials)
- [Blog: An In-Depth Tutorial to AllenNLP (From Basics to ELMo and BERT)](https://mlexplained.com/2019/01/30/an-in-depth-tutorial-to-allennlp-from-basics-to-elmo-and-bert/)
- [Blog: Part-of-Speech tagging tutorial with the Keras Deep Learning library](https://becominghuman.ai/part-of-speech-tagging-tutorial-with-the-keras-deep-learning-library-d7f93fa05537)
- [PyTorch Documentation: Saving And Loading Models](https://pytorch.org/tutorials/beginner/saving_loading_models.html)