# Natural Language Processing

## AllenNLP

AllenNLP is an open source library for building deep learning models for natural language processing, developed by the Allen Institute for Artificial Intelligence. It is built on top of PyTorch and is designed to support researchers, engineers, students, etc., who wish to build high quality deep NLP models with ease. It provides high-level abstractions and APIs for common components and models in modern NLP. It also provides an extensible framework that makes it easy to run and manage NLP experiments.

In a nutshell, AllenNLP is

- a library with well-thought-out abstractions encapsulating the common data and model operations that are done in NLP research
- a commandline tool for training PyTorch models
- a collection of pre-trained models that you can use to make predictions
- a collection of readable reference implementations of common / recent NLP models
- an experiment framework for doing replicable science
- a way to demo your research
- open source and community driven

In part 1, geared towards someone who is brand new to the library, we give you a quick walk-through of the main AllenNLP concepts and features. We'll build a complete, working NLP model (a text classifier) along the way.

## Text Classification


### Fields

The first step for building an NLP model is to define its input and output. In AllenNLP, each training example is represented by an `Instance` object. An `Instance` consists of one or more `Fields`, where each `Field` represents one piece of data used by your model, either as an input or an output. `Fields` will get converted to tensors and fed to your model.

For text classification, the input and the output are very simple. The model takes a `TextField` that represents the input text and predicts its label, which is represented by a `LabelField`.

Note that AllenNLP uses Python 3 **type hints**, which are written after a colon following a parameter or variable name. AllenNLP relies on these annotations internally, for example to automatically construct the embedder and encoder from a configuration file.


In [1]:
#a bit about type hint

def some_func(text: str):
    print(text)
    
some_func("hello world")
some_func(3)  # no error: type hints are not enforced at runtime; editors and type checkers use them to flag mistakes before running

hello world
3


In [2]:
from allennlp.data.fields import LabelField,  TextField

# Inputs
text: TextField

# Outputs
label: LabelField

### 1. Reading data

The first step for building an NLP application is to read the dataset and represent it with some internal data structure.

AllenNLP uses `DatasetReaders` to read the data. Their job is to transform raw data files into `Instances` that match the input / output spec.

Here we assume the dataset has a simple data file format: `[text] [TAB] [label]`, for example:

- I like this movie a lot! [TAB] positive

- This was a monstrous waste of time [TAB] negative

- AllenNLP is amazing [TAB] positive

- Why does this have to be so complicated? [TAB] negative

- This sentence expresses no sentiment [TAB] neutral
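As a rough illustration of this file format, independent of AllenNLP, the sketch below parses tab-separated lines into `(text, label)` pairs. The helper name `parse_tsv_lines` and the choice to skip blank or malformed lines are assumptions made for this example, not part of the library.

```python
import io
from typing import Iterable, Iterator, Tuple


def parse_tsv_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (text, label) pairs from `[text] [TAB] [label]` lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip lines that do not have exactly two columns
        text, label = parts
        yield text.strip(), label.strip()


# An in-memory stand-in for a small data file.
sample = io.StringIO(
    "I like this movie a lot!\tpositive\n"
    "This was a monstrous waste of time\tnegative\n"
    "\n"
    "AllenNLP is amazing\tpositive\n"
)
pairs = list(parse_tsv_lines(sample))
print(pairs)
```

Silently skipping bad lines keeps the sketch short; a real reader might prefer to raise an error so that malformed data is noticed.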

You can implement your own `DatasetReader` by inheriting from the `DatasetReader` class. At minimum, you need to override the `_read()` method, which reads the input dataset and yields `Instances`.

Note that we are making the label parameter of `text_to_instance()` optional. During training and evaluation, all the instances are labeled, i.e., they include the `LabelFields` that contain gold labels. However, when you are making predictions for unseen inputs, the instances are unlabeled. By making the label parameter optional, the dataset reader can support both cases.

In [3]:
from typing import Dict, Iterable, List

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import LabelField, TextField
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token, Tokenizer, SpacyTokenizer


@DatasetReader.register("classification-tsv")
class ClassificationTsvReader(DatasetReader):
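    def __init__(
        self,
        tokenizer: Tokenizer = None,
        token_indexers: Dict[str, TokenIndexer] = None,
        max_tokens: int = None,
        **kwargs
    ):
        super().__init__(**kwargs)
        # Accept a tokenizer and token indexers so they can be set from a
        # configuration file; fall back to defaults when none are given.
        self.tokenizer = tokenizer or SpacyTokenizer()
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}
        self.max_tokens = max_tokens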

    def text_to_instance(self, text: str, label: str = None) -> Instance:
        tokens = self.tokenizer.tokenize(text)
        if self.max_tokens:
            tokens = tokens[: self.max_tokens]
        text_field = TextField(tokens, self.token_indexers)
        fields = {"text": text_field}
        if label:
            fields["label"] = LabelField(label)
        return Instance(fields)

    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path, "r") as lines:
            for line in lines:
                text, sentiment = line.strip().split("\t")
                yield self.text_to_instance(text, sentiment)

This is a minimal `DatasetReader` that yields classification `Instances` when you call `reader.read(file)`. The reader takes each line in the input file, splits the text into words using a tokenizer (the `SpacyTokenizer` shown here relies on spaCy), and represents those words as tensors using word IDs from a vocabulary that AllenNLP constructs for you.

Pay special attention to the `text` and `label` keys that are used in the fields dictionary passed to the `Instance`: these keys will be used as parameter names when passing tensors into your `Model` later.

The output label is **optional** when we create the `Instances`, so that we can use the same code to make **predictions on unlabeled data (say, in a demo)**.

There are many ways this could be improved to make a more flexible and fully featured reader, but let's keep it simple for now.

### 2. Designing the model

The next thing we need is a `Model` that will take a batch of `Instances`, predict the outputs from the inputs, and compute a loss.

Also, remember that we used these names (`text` and `label`) for the fields in the `DatasetReader`. AllenNLP passes those fields by name to the model code, so we need to use the same names in our model.

Conceptually, a generic model for classifying text does the following:

- Get some features corresponding to each word in your input
- Combine those word-level features into a document-level feature vector
- Classify that document-level feature vector into one of your labels.

In AllenNLP, we make each of these conceptual steps into a generic abstraction that you can use in your code, so that you can have a very flexible model that can use different concrete components for each step.

#### First step: Token IDs

<img src = "../../../figures/allentokenid.svg">

The first step is changing the strings in the input text into token IDs. This is handled by the `SingleIdTokenIndexer` that we used previously, as part of the data processing pipeline, which you don't have to write code for.
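To make the idea of token IDs concrete, here is a minimal pure-Python sketch of a token vocabulary. It is not AllenNLP's `Vocabulary`; the special tokens `<pad>` and `<unk>`, their indices, and the helper names are assumptions chosen for illustration.

```python
from typing import Dict, List

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens for this sketch


def build_token_vocab(sentences: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct token an integer ID, in order of first appearance."""
    vocab = {PAD: 0, UNK: 1}
    for tokens in sentences:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def tokens_to_ids(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to IDs, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab[UNK]) for token in tokens]


train = [["i", "like", "this", "movie"], ["this", "was", "a", "waste"]]
token_vocab = build_token_vocab(train)
ids = tokens_to_ids(["i", "like", "this", "film"], token_vocab)
print(ids)  # "film" was never seen, so it maps to the unknown-token ID
```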

#### Second step: Embedding

<img src = "../../../figures/allenembedding.svg">

Next, we apply an embedding function that converts each token ID that we got as input into a vector.

#### Third step: Seq2Vec encoder

<img src = "../../figures/allenseq2vec.svg">

Next we apply some function that takes the sequence of vectors for each input token and squashes it into a single vector. Before the days of pretrained language models like BERT, this was typically an LSTM or convolutional encoder. With BERT we might just take the embedding of the [CLS] token.
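The second and third steps can be sketched without any deep learning library. The example below uses a tiny hand-written embedding table and sums the token vectors, a simple bag-of-embeddings style encoder. The table values and function names are made up for illustration; a real model learns the table during training.

```python
from typing import List

# A tiny, hand-written embedding table: one 2-dimensional vector per token ID.
embedding_table = [
    [0.0, 0.0],  # ID 0
    [1.0, 2.0],  # ID 1
    [3.0, 4.0],  # ID 2
    [5.0, 6.0],  # ID 3
]


def embed(token_ids: List[int]) -> List[List[float]]:
    """Second step: look up one vector per token ID."""
    return [embedding_table[i] for i in token_ids]


def bag_of_embeddings(vectors: List[List[float]]) -> List[float]:
    """Third step: sum the token vectors into a single vector."""
    return [sum(dim) for dim in zip(*vectors)]


embedded = embed([1, 2, 3])
encoded = bag_of_embeddings(embedded)
print(embedded)  # three vectors, one per token
print(encoded)   # one vector for the whole sequence
```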

#### Fourth step: Predict

<img src = "../../../figures/allendist.svg">

Finally, we take that single feature vector (for each `Instance` in the batch), and classify it as a label, which will give us a categorical probability distribution over our label space.

### 3. Implementing the model

Now that we know what our model is going to do, we need to implement it. First, we'll say a few words about how `Models` work in AllenNLP:

- An AllenNLP Model is just a **PyTorch Module**
- It implements a `forward()` method, and requires the output to be a **dictionary**
- Its output contains a loss key during training, which is used to optimize the model

Our training loop takes a batch of `Instances`, passes it through `Model.forward()`, grabs the `loss` key from the resulting dictionary, and uses backprop to compute gradients and update the model's parameters. You don't have to implement the training loop—all this will be taken care of by AllenNLP (though you can if you want to).

#### 3.1 Constructor

In the `Model` constructor, we need to instantiate all of the parameters that we will want to train. In AllenNLP, we recommend taking most of these parameters as constructor arguments, so that we can configure the behavior of our model without changing the model code itself, and so that we can think at a higher level about what our model is doing. Let's look at different components:

`Vocabulary` manages mappings between vocabulary items (such as words and labels) and their integer IDs. In our prebuilt training loop, the vocabulary gets created by AllenNLP after reading your training data, then passed to the `Model` when it gets constructed. We'll find all tokens and labels that you use and assign them all integer IDs in separate namespaces.

To get an initial word embedding, we'll use AllenNLP's `TextFieldEmbedder`. This abstraction takes the tensors created by a `TextField` and embeds each one. This is our most complex abstraction, because there are a lot of ways to do this particular operation in NLP, and we want to be able to switch between these without changing our code. We won't go into the details here; all you need to know for now is that you apply this to the text parameter you get in `forward()`, and you get out a tensor that has a single embedding vector for each input token, with shape `(batch_size, num_tokens, embedding_dim)`.

To squash our sequence of token vectors into a single vector, we use AllenNLP's `Seq2VecEncoder` abstraction. As the name implies, this encapsulates an operation that takes a sequence of vectors and returns a single vector. Because all of our modules operate on batched input, this will take a tensor shaped like `(batch_size, num_tokens, embedding_dim)` and return a tensor shaped like `(batch_size, encoding_dim)`.

In AllenNLP, you implement the logic to compute the metrics in your `Model` class. AllenNLP includes an abstraction called `Metric` that gives some useful functionality for tracking metrics during training. Here, we'll be using an accuracy `Metric`, `CategoricalAccuracy`, which computes the fraction of instances for which our model predicted the label correctly.

#### 3.2 forward

In `forward`, we use the parameters that we created in our constructor to transform the inputs into outputs. After we've predicted the outputs, we compute some loss function based on how close we got to the true outputs, and then return that loss (along with whatever else we want) so that we can use it to train the parameters.

The first thing to notice is the inputs to this function. The way the AllenNLP training loop works is that we will take the field names that you used in your `DatasetReader` and give you a batch of instances with those same field names in `forward`. So, because we used `text` and `label` as our field names, we need to name our arguments to `forward` the same way.
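The name matching described here behaves much like Python keyword arguments. The sketch below uses a stand-in class (`ToyModel`, not an AllenNLP class) and plain lists instead of tensors to show why the field names and the `forward` argument names must agree.

```python
from typing import Dict, List, Optional


class ToyModel:
    """Stand-in for a model; not an AllenNLP class."""

    def forward(
        self, text: List[List[int]], label: Optional[List[int]] = None
    ) -> Dict[str, object]:
        output: Dict[str, object] = {"batch_size": len(text)}
        if label is not None:
            output["has_label"] = True
        return output


# A batch is keyed by the field names used when building instances.
batch = {"text": [[2, 3, 4], [5, 6, 0]], "label": [1, 0]}
toy_output = ToyModel().forward(**batch)
print(toy_output)

# Renaming a field breaks the call, because no argument has that name.
try:
    ToyModel().forward(**{"tokens": [[2, 3, 4]]})
    mismatch_error = None
except TypeError as err:
    mismatch_error = str(err)
print(mismatch_error)
```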

Second, notice the types of these arguments. Each type of `Field` knows how to convert itself into a `torch.Tensor`, then create a batched `torch.Tensor` from all of the `Fields` with the same name from a batch of `Instances`. The types you see for `text` and `label` are the tensors produced by `TextField` and `LabelField`. The important part to know is that our `TextFieldEmbedder`, which we created in the constructor, expects this type of object as input and will return an embedded tensor as output.

The first actual modeling operation that we do is embed the text, getting a vector for each input token. Notice here that we're not specifying anything about how that operation is done, just that a `TextFieldEmbedder` that we got in our constructor is going to do it. This lets us be very flexible later, changing between various kinds of embedding methods or pretrained representations (including ELMo and BERT) without changing our model code.

After we have embedded our text, we next have to squash the sequence of vectors (one per token) into a single vector for the whole text. We do that using the `Seq2VecEncoder` that we got as a constructor argument. In order to behave properly when we're batching pieces of text together that could have different lengths, we need to mask elements in the `embedded_text` tensor that are only there due to padding. We use a utility function to get a mask from the `TextField` output, then pass that mask into the encoder.

At the end of these lines, we have a single vector for each instance in the batch.
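To see why the mask matters, consider the following pure-Python sketch of padding and masking. It assumes that ID 0 marks padding and uses hypothetical helper names; in the model itself the mask comes from the `util.get_text_field_mask` call rather than from code like this.

```python
from typing import List

PAD_ID = 0  # assumption for this sketch: ID 0 marks padding


def pad_batch(batch: List[List[int]]) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]


def get_mask(padded: List[List[int]]) -> List[List[bool]]:
    """True for real tokens, False for padding."""
    return [[token_id != PAD_ID for token_id in seq] for seq in padded]


def masked_mean(values: List[float], mask: List[bool]) -> float:
    """Average only the positions where the mask is True."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept)


padded = pad_batch([[7, 8, 9], [4]])
mask = get_mask(padded)
print(padded)
print(mask)
# Without the mask, the two padded positions would drag the mean down.
print(masked_mean([10.0, 0.0, 0.0], mask[1]))
```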

The last step of our model is to take the vector for each instance in the batch and predict a label for it. Our classifier is a `torch.nn.Linear` layer that gives a score (commonly called a logit) for each possible label. We normalize those scores using a `softmax` operation to get a probability distribution over labels that we can return to a consumer of this model. For computing the loss, PyTorch has a built in function that computes the cross entropy between the logits that we predict and the true label distribution, and we use that as our loss function.
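For readers who want to see the arithmetic, here is a small sketch of softmax and cross-entropy for a single instance, written with the standard library only. The model code uses PyTorch's batched implementations; the function names below are local to this example.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max improves numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], gold: int) -> float:
    """Negative log-probability assigned to the gold label."""
    return -math.log(softmax(logits)[gold])


logits = [2.0, 0.0]
probs = softmax(logits)
print(probs)
print(cross_entropy(logits, 0))  # smaller loss: the gold label has the higher score
print(cross_entropy(logits, 1))  # larger loss: the gold label has the lower score
```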

Then, for each forward pass, you need to update the metric by feeding the predictions and the gold labels to `self.accuracy`. The way metrics work in AllenNLP is that, behind the scenes, each `Metric` instance holds "counts" that are necessary and sufficient to compute the metric. For accuracy, these counts are the total number of predictions and the number of correct predictions. These counts get updated after every call to the instance itself, i.e., the `self.accuracy(logits, label)` line. You can pull out the computed metric by calling `get_metrics()` with a flag specifying whether to reset the counts. This allows you to compute the metric over the entire training or validation dataset. AllenNLP's default training loop calls this method at the appropriate times and logs the current metric values, so you don't have to call it explicitly.
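The counting behaviour can be illustrated with a toy metric. `ToyAccuracy` below is a simplified stand-in written for this example, not AllenNLP's `CategoricalAccuracy`, and it works on plain lists rather than tensors.

```python
from typing import List


class ToyAccuracy:
    """Accumulates counts across calls; a stand-in for a real metric class."""

    def __init__(self) -> None:
        self.correct = 0
        self.total = 0

    def __call__(self, logits: List[List[float]], labels: List[int]) -> None:
        for scores, gold in zip(logits, labels):
            predicted = scores.index(max(scores))  # argmax over labels
            self.correct += int(predicted == gold)
            self.total += 1

    def get_metric(self, reset: bool = False) -> float:
        value = self.correct / self.total if self.total else 0.0
        if reset:
            self.correct = 0
            self.total = 0
        return value


accuracy = ToyAccuracy()
accuracy([[0.1, 0.9], [0.8, 0.2]], [1, 1])  # first batch: 1 of 2 correct
accuracy([[0.3, 0.7], [0.6, 0.4]], [1, 0])  # second batch: 2 of 2 correct
epoch_accuracy = accuracy.get_metric(reset=True)
print(epoch_accuracy)
```

Because the counts persist between calls, the value covers every batch seen since the last reset, which is what makes an epoch-level metric possible.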

And that's it! This is all you need for a simple classifier. After you've written a `DatasetReader` and `Model`, AllenNLP takes care of the rest: connecting your input files to the dataset reader, intelligently batching together your instances and feeding them to the model, and optimizing the model's parameters by using backprop on the loss.

Note that in order to support prediction, first you need to make the label parameter optional by specifying a default value of `None`. This will let you feed unlabeled instances to the model. Second, you need to compute the loss and accuracy only when the label is supplied.



In [4]:
from allennlp.data import Vocabulary
from allennlp.models import Model
from allennlp.modules import TextFieldEmbedder, Seq2VecEncoder
from allennlp.data import TextFieldTensors
from allennlp.nn import util
import torch
import torch.nn.functional as F
from allennlp.training.metrics import CategoricalAccuracy

@Model.register('simple_classifier')
class SimpleClassifier(Model):
    
    ##constructor
    def __init__(self,
                 vocab: Vocabulary,
                 embedder: TextFieldEmbedder,
                 encoder: Seq2VecEncoder):
        super().__init__(vocab)
        self.embedder = embedder
        self.encoder = encoder
        num_labels = vocab.get_vocab_size("labels") 
        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)
        self.accuracy = CategoricalAccuracy()

    ##forward    
    def forward(self,
                text: TextFieldTensors,
                label: torch.Tensor = None) -> Dict[str, torch.Tensor]:
        # need to set label = None, in case we are predicting
        
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)
        # Shape: (batch_size, num_tokens)
        mask = util.get_text_field_mask(text)
        # Shape: (batch_size, encoding_dim)
        encoded_text = self.encoder(embedded_text, mask)
        # Shape: (batch_size, num_labels)
        logits = self.classifier(encoded_text)
        # Shape: (batch_size, num_labels)
        probs = F.softmax(logits, dim=1)
        output = {"probs": probs}
        if label is not None:
            self.accuracy(logits, label)
            # Scalar loss, averaged over the batch
            output["loss"] = F.cross_entropy(logits, label)
        return output
    
    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}


### 4. Putting everything together

In this section we'll put together a simple example of reading in data, feeding it to the model, and training the model, using your own Python script instead of `allennlp train`. While we recommend `allennlp train` for most use cases, writing the script by hand makes the training loop easier to understand. Once you get a handle on this, switching to AllenNLP's built-in command should be easy, if you want to.


#### Testing your dataset reader

In the first example, we'll simply instantiate the dataset reader, read the movie review dataset using it, and inspect the AllenNLP Instances produced by the dataset reader.

In [5]:
dataset_reader = ClassificationTsvReader(max_tokens=64)
instances = list(dataset_reader.read("../data/imdb/train.tsv"))

for instance in instances[:3]:
    print(instance)

Instance with fields:
 	 text: TextField of length 64 with text: 
 		[it, is, movies, like, these, that, make, a, jaded, movie, viewer, thankful, for, the, invention,
		of, the, timex, indiglo, watch, ., based, on, the, late, 1960, 's, television, show, by, the, same,
		name, ,, the, mod, squad, tells, the, tale, of, three, reformed, criminals, under, the, employ, of,
		the, police, to, go, undercover, ., however, ,, things, go, wrong, as, evidence, gets, stolen, and]
 		and TokenIndexers : {'tokens': 'SingleIdTokenIndexer'} 
 	 label: LabelField with label: neg in namespace: 'labels'. 

Instance with fields:
 	 text: TextField of length 64 with text: 
 		[", quest, for, camelot, ", is, warner, bros, ., ', first, feature, -, length, ,, fully, -,
		animated, attempt, to, steal, clout, from, disney, 's, cartoon, empire, ,, but, the, mouse, has, no,
		reason, to, be, worried, ., the, only, other, recent, challenger, to, their, throne, was, last,
		fall, 's, promising, ,, if, flawed, ,, 20

In [6]:
#you can access the instance information like this
print("First two tokens: ", instance['text'][:2])
print("Label: ", instance['label'].label)

First two tokens:  [synopsis, :]
Label:  neg


#### Feeding instances to the model

The `Model` needs to have a `Vocabulary` computed from data before we can build it, but we don't really want to put the details of our model construction inside our training loop function. So to keep things sane, we'll pull out the model building into a separate function that we call inside the main training function.

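When you run this, you should see the outputs returned from the model. Each returned dict contains the `probs` key, which holds the probabilities for each label. The model also computes a `loss`, but because it is a single scalar for the whole batch it cannot be split per instance, so `forward_on_instances` drops it with a warning.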

In [7]:
from allennlp.modules.seq2vec_encoders import BagOfEmbeddingsEncoder
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding

def run_training_loop():
    dataset_reader = ClassificationTsvReader(max_tokens=64)
    print("Reading data")
    instances = list(dataset_reader.read("../data/imdb/train.tsv"))

    vocab = build_vocab(instances)
    model = build_model(vocab)

    outputs = model.forward_on_instances(instances[:4])
    print(outputs)


def build_vocab(instances: Iterable[Instance]) -> Vocabulary:
    print("Building the vocabulary")
    return Vocabulary.from_instances(instances)


def build_model(vocab: Vocabulary) -> Model:
    print("Building the model")
    vocab_size = vocab.get_vocab_size("tokens")
    embedder = BasicTextFieldEmbedder(
        {"tokens": Embedding(embedding_dim=10, num_embeddings=vocab_size)}
    )
    encoder = BagOfEmbeddingsEncoder(embedding_dim=10)
    return SimpleClassifier(vocab, embedder, encoder)

In [8]:
run_training_loop()

Reading data
Building the vocabulary


building vocab:   0%|          | 0/1600 [00:00<?, ?it/s]

Building the model


Encountered the loss key in the model's return dictionary which couldn't be split by the batch size. Key will be ignored.


[{'probs': array([0.52598524, 0.47401473], dtype=float32)}, {'probs': array([0.5340886 , 0.46591136], dtype=float32)}, {'probs': array([0.5437718 , 0.45622817], dtype=float32)}, {'probs': array([0.5525191, 0.4474809], dtype=float32)}]


#### Training the model

Finally, we'll run backpropagation and train the model. AllenNLP uses a `Trainer` for this, which is responsible for connecting necessary components (including your model, optimizer, instances, data loader, etc.) and executing the training loop.

When you run this, the `Trainer` goes over the training data up to five times (`num_epochs=5`), stopping earlier if the validation metric fails to improve for two consecutive epochs (`patience=2`). After each epoch, AllenNLP runs your model against the validation set to monitor how well (or badly) it's doing. This is useful if you want to do, e.g., early stopping, and for monitoring in general. Observe that the training loss decreases gradually; this is a sign that your model and the training pipeline are doing what they are supposed to do (that is, to minimize the loss).
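Early stopping with a patience value can be sketched as follows. This is a simplified rule written for illustration; the function name and the validation losses are made up, and AllenNLP's own bookkeeping may differ in details such as which metric is tracked.

```python
from typing import List


def epochs_run(validation_losses: List[float], num_epochs: int, patience: int) -> int:
    """Return how many epochs run before patience-based early stopping."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses[:num_epochs], start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch  # stop early
    return min(num_epochs, len(validation_losses))


# Made-up validation losses, for illustration only.
losses = [0.70, 0.60, 0.61, 0.62, 0.50]
stopped_at = epochs_run(losses, num_epochs=5, patience=2)
full_run = epochs_run(losses, num_epochs=5, patience=3)
print(stopped_at, full_run)
```

With a patience of 2 the run stops before reaching the later improvement, which is the usual trade-off: a small patience saves time but can stop too soon.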

In [9]:
from typing import Dict, Iterable, List, Tuple
import tempfile

from allennlp.data import DataLoader
from allennlp.training.trainer import Trainer
from allennlp.training.gradient_descent_trainer import GradientDescentTrainer
from allennlp.data.data_loaders import SimpleDataLoader

from allennlp.training.optimizers import AdamOptimizer

def build_dataset_reader() -> DatasetReader:
    return ClassificationTsvReader()

def read_data(reader: DatasetReader) -> Tuple[List[Instance], List[Instance]]:
    print("Reading data")
    training_data   = list(reader.read("../data/imdb/train.tsv"))
    validation_data = list(reader.read("../data/imdb/dev.tsv"))
    return training_data, validation_data


def run_training_loop():
    dataset_reader = build_dataset_reader()

    train_data, dev_data = read_data(dataset_reader)

    vocab = build_vocab(train_data + dev_data)
    model = build_model(vocab)

    train_loader, dev_loader = build_data_loaders(train_data, dev_data)
    train_loader.index_with(vocab)
    dev_loader.index_with(vocab)

    # You obviously won't want to create a temporary file for your training
    # results, but for execution in binder for this guide, we need to do this.
    with tempfile.TemporaryDirectory() as serialization_dir:
        trainer = build_trainer(model, serialization_dir, train_loader, dev_loader)
        print("Starting training")
        trainer.train()
        print("Finished training")

    return model, dataset_reader


def build_data_loaders(
    train_data: List[Instance],
    dev_data: List[Instance],
) -> Tuple[DataLoader, DataLoader]:
    train_loader = SimpleDataLoader(train_data, 8, shuffle=True)
    dev_loader   = SimpleDataLoader(dev_data, 8, shuffle=False)
    return train_loader, dev_loader


def build_trainer(
    model: Model,
    serialization_dir: str,
    train_loader: DataLoader,
    dev_loader: DataLoader,
) -> Trainer:
    parameters = [(n, p) for n, p in model.named_parameters() if p.requires_grad]
    optimizer = AdamOptimizer(parameters)  # type: ignore
    trainer = GradientDescentTrainer(
        model=model,
        serialization_dir=serialization_dir,
        data_loader=train_loader,
        validation_data_loader=dev_loader,
        num_epochs=5,
        optimizer=optimizer,
        patience=2 #after 2 epochs with no improvement, stop (basically early stopping)
    )
    return trainer


model, dataset_reader = run_training_loop()

Reading data
Building the vocabulary


building vocab:   0%|          | 0/1800 [00:00<?, ?it/s]

Building the model
Starting training


  0%|          | 0/200 [00:00<?, ?it/s]



  0%|          | 0/25 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/25 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/25 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/25 [00:00<?, ?it/s]

Finished training
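
As a quick sanity check, the progress-bar totals above are consistent with a batch size of 8. The short calculation below assumes that the 1600 shown in the earlier run is the number of training instances and that the 1800 shown here is training plus validation instances, since the vocabulary was built from both.

```python
import math

batch_size = 8
num_train = 1600          # total of the earlier "building vocab" progress bar
num_train_and_dev = 1800  # total of the "building vocab" bar in this run
num_dev = num_train_and_dev - num_train

train_batches = math.ceil(num_train / batch_size)
dev_batches = math.ceil(num_dev / batch_size)
print(train_batches, dev_batches)  # compare with the 200 and 25 in the bars
```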


### Evaluate test data

In [10]:
from allennlp.training.util import evaluate

# Now we can evaluate the model on a new dataset.
test_data = list(dataset_reader.read("../data/imdb/test.tsv"))
data_loader = SimpleDataLoader(test_data, 8)
data_loader.index_with(model.vocab)

results = evaluate(model, data_loader)
print(results)

0it [00:00, ?it/s]

{'accuracy': 0.875, 'loss': 0.32959430426359176}


### Inference

For making predictions, AllenNLP uses `Predictors`, which are a thin wrapper around your trained model. A `Predictor`'s main job is to take a JSON representation of an instance, convert it to an `Instance` using the dataset reader (the `text_to_instance`), pass it through the model, and return the prediction in a JSON serializable format.

In order to build a `Predictor` for your task, you only need to inherit from `Predictor` and implement a few methods (see `predict()` and `_json_to_instance()` below); the rest will be taken care of by the base class.

AllenNLP provides implementations of `Predictors` for common tasks. In fact, it includes `TextClassifierPredictor`, a generic `Predictor` for text classification tasks, so you don't even need to write your own! Here, we are writing one from scratch solely for demonstration, but you should always check whether the predictor for your task is already there.

To implement, wrap the model with a `SentenceClassifierPredictor` to make predictions for new instances. Because the returned result (`output['probs']`) is just an array of probabilities for class labels, we use `vocab.get_token_from_index()` to convert a label ID back to its label string.
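
If you only want the single most likely label rather than the full distribution, an argmax over the probabilities is enough. The sketch below uses a hypothetical index-to-label dictionary in place of the vocabulary lookup, so it runs without a trained model.

```python
from typing import Dict, List, Tuple

# Hypothetical index-to-label mapping, standing in for the "labels" namespace.
index_to_label: Dict[int, str] = {0: "neg", 1: "pos"}


def best_label(probs: List[float]) -> Tuple[str, float]:
    """Return the most probable label string and its probability."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return index_to_label[best_index], probs[best_index]


print(best_label([0.42, 0.58]))
print(best_label([0.91, 0.09]))
```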



In [11]:
from allennlp.predictors import Predictor
from allennlp.common.util import JsonDict

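# Register the predictor so it can be selected by name, e.g. with `--predictor sentence_classifier`.
@Predictor.register("sentence_classifier")
class SentenceClassifierPredictor(Predictor):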
    def predict(self, sentence: str) -> JsonDict:
        return self.predict_json({"sentence": sentence})

    def _json_to_instance(self, json_dict: JsonDict) -> Instance:
        sentence = json_dict["sentence"]
        return self._dataset_reader.text_to_instance(sentence)

vocab = model.vocab
predictor = SentenceClassifierPredictor(model=model, dataset_reader=dataset_reader)

output = predictor.predict("A good movie!")
print(
    [
        (vocab.get_token_from_index(label_id, "labels"), prob)
        for label_id, prob in enumerate(output["probs"])
    ]
)
output = predictor.predict("This was a monstrous waste of time.")
print(
    [
        (vocab.get_token_from_index(label_id, "labels"), prob)
        for label_id, prob in enumerate(output["probs"])
    ]
)

[('neg', 0.497977614402771), ('pos', 0.502022385597229)]
[('neg', 0.5784565806388855), ('pos', 0.4215433895587921)]


### 5. Training the model with command line

OK, we've seen how to set up a simple training loop. As you can see, there is a lot of boilerplate code.

AllenNLP has a built-in training script that handles all of these things for you, so that the only code you have to write is your `DatasetReader` and `Model` classes. Instead of writing all of the `build_*` methods that we had above, we write a JSON configuration file specifying all necessary parameters. The training script takes those parameters, creates all of the objects in the right order, and runs the training loop.

#### Configuration files

In a nutshell, configuration files in allennlp just take constructor parameters for various objects and put them into a JSON dictionary. Recall that we had a `build_model` method that looked like this:

In [12]:
# def build_model(vocab: Vocabulary) -> Model:
#     print("Building the model")
#     vocab_size = vocab.get_vocab_size("tokens")
#     embedder = BasicTextFieldEmbedder(
#         {"tokens": Embedding(embedding_dim=10, num_embeddings=vocab_size)})
#     encoder = BagOfEmbeddingsEncoder(embedding_dim=10)
#     return SimpleClassifier(vocab, embedder, encoder)

This gets converted into a JSON dictionary that looks like this:

In [13]:
# "model": {
#     "type": "simple_classifier",
#     "embedder": {
#         "token_embedders": {
#             "tokens": {
#                 "type": "embedding",
#                 "embedding_dim": 10
#             }
#         }
#     },
#     "encoder": {
#         "type": "bag_of_embeddings",
#         "embedding_dim": 10
#     }
# }

The constructor parameters to all of the objects that were created in `build_model` are translated directly to keys in this dictionary. AllenNLP relies on the type annotations in the model's constructor code in order to construct these objects correctly.

There are two special things to note. First, to select a particular subclass of a base type (e.g., `SimpleClassifier` as a subclass of `Model`, or `BagOfEmbeddingsEncoder` as a subclass of `Seq2VecEncoder`) we need an additional `"type": "simple_classifier"` key. The string `"simple_classifier"` comes from the call to `Model.register`.

Second, the vocab argument is missing here. That's for the same reason that vocab was an argument to the `build_model` method, not constructed inside it—the vocabulary gets constructed separately, based on data, then passed in to the model. Generally, the sequential dependencies between objects that show up as arguments to your `build_*` methods are left out of the configuration file, as they are handled in a different way. Again, there's a lot more detail which we will cover later.
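
To give a feel for how a name registry and type annotations can turn a configuration dictionary into objects, here is a heavily simplified sketch. It is not AllenNLP's implementation: the `Registrable` class below, its `from_config` method, and the toy classes are all written for this example, and the real mechanism handles many more cases.

```python
import inspect
from typing import Any, Callable, Dict


class Registrable:
    """Toy base class with a name registry per base class."""

    _registry: Dict[type, Dict[str, type]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[type], type]:
        def decorator(subclass: type) -> type:
            Registrable._registry.setdefault(cls, {})[name] = subclass
            return subclass
        return decorator

    @classmethod
    def from_config(cls, config: Dict[str, Any], **extras: Any):
        config = dict(config)
        subclass = Registrable._registry[cls][config.pop("type")]
        kwargs: Dict[str, Any] = {}
        for name, param in inspect.signature(subclass.__init__).parameters.items():
            if name == "self":
                continue
            if name in extras:  # e.g. a vocabulary built separately from data
                kwargs[name] = extras[name]
            elif name in config:
                value = config[name]
                annotation = param.annotation
                # The type annotation says how to build nested objects.
                if inspect.isclass(annotation) and issubclass(annotation, Registrable):
                    value = annotation.from_config(value)
                kwargs[name] = value
        return subclass(**kwargs)


class Encoder(Registrable):
    pass


@Encoder.register("bag_of_embeddings")
class ToyBagEncoder(Encoder):
    def __init__(self, embedding_dim: int) -> None:
        self.embedding_dim = embedding_dim


class ToyModelBase(Registrable):
    pass


@ToyModelBase.register("simple_classifier")
class ToyClassifier(ToyModelBase):
    def __init__(self, vocab: Dict[str, int], encoder: Encoder) -> None:
        self.vocab = vocab
        self.encoder = encoder


config = {
    "type": "simple_classifier",
    "encoder": {"type": "bag_of_embeddings", "embedding_dim": 10},
}
# The vocabulary is passed in separately, mirroring its absence from the config.
toy = ToyModelBase.from_config(config, vocab={"neg": 0, "pos": 1})
print(type(toy).__name__, type(toy.encoder).__name__, toy.encoder.embedding_dim)
```

The `encoder: Encoder` annotation is what lets the sketch know that the nested dictionary should be built through the `Encoder` registry, which is the role the type annotations play in the description above.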

We do this not just for the model, but for the `dataset reader`, the `data loaders`, the `trainer`, and everything else that goes into a training loop. This gives us a single JSON file that holds all of the configuration for an experiment that was run (we actually use a superset of JSON called `Jsonnet`, which supports fancier features like variables and imports, but a plain JSON file works too).

For our simple classifier, that configuration file looks like this:

In [14]:
# {
#     "dataset_reader" : {
#         "type": "classification-tsv",
#         "token_indexers": {
#             "tokens": {
#                 "type": "single_id"
#             }
#         }
#     },
#     "train_data_path": "../data/imdb/train.tsv",
#     "validation_data_path": "../data/imdb/dev.tsv",
#     "model": {
#         "type": "simple_classifier",
#         "embedder": {
#             "token_embedders": {
#                 "tokens": {
#                     "type": "embedding",
#                     "embedding_dim": 10
#                 }
#             }
#         },
#         "encoder": {
#             "type": "bag_of_embeddings",
#             "embedding_dim": 10
#         }
#     },
#     "data_loader": {
#         "batch_size": 8,
#         "shuffle": true
#     },
#     "trainer": {
#         "optimizer": "adam",
#         "num_epochs": 5
#     }
# }

With this configuration file, we can train the model by running 

    allennlp train [config.json] -s [serialization_directory] 

from a command line. In order for your dataset reader, model, and other custom components to be recognized by the allennlp command, the calls to `.register()` have to be run, which happens when the classes are imported. So you typically have to also add the flag `--include-package [my_python_module]`, or use allennlp's plugin functionality, when you run this command. There is more detail on how this works in the chapter on configuration files.

Note: add `"cuda_device": 0` inside the trainer config if you want to use CUDA.

In [None]:
# "trainer": {
#         ...
#         "cuda_device": 0
#         ...
#     }

If you have multiple GPUs, add a new top-level `"distributed"` section that lists the GPU ids to use:

In [None]:
# "trainer": {
#         ...
#     },
# "distributed": {
#        "cuda_devices": [0, 1, 2, 3]
#     }

This will use PyTorch's DistributedDataParallel to aggregate losses and synchronize parameter updates across multiple GPUs. The speedup you get, however, might not be exactly proportional to the number of GPUs, due to synchronization and other overhead.
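
If the configuration is built in Python, both GPU settings can be applied with a small helper. This is only a sketch: `with_gpus` is a hypothetical name, and it does nothing more than place the `"cuda_device"` and `"distributed"` keys where the notes above say they go.

```python
import copy


def with_gpus(config, devices):
    """Return a copy of `config` set up for the given list of GPU ids."""
    new_config = copy.deepcopy(config)
    if len(devices) == 1:
        # Single GPU: the setting lives inside the trainer section.
        new_config.setdefault("trainer", {})["cuda_device"] = devices[0]
    elif len(devices) > 1:
        # Multiple GPUs: a separate top-level "distributed" section.
        new_config["distributed"] = {"cuda_devices": list(devices)}
    return new_config


base = {"trainer": {"optimizer": "adam", "num_epochs": 5}}
single_gpu = with_gpus(base, [0])
multi_gpu = with_gpus(base, [0, 1, 2, 3])
```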

Last note:  When you are evaluating and making predictions with your model, you can specify the `--cuda-device` option from the command line to make your model run on GPUs.

#### Let's do it!

To train:

    allennlp train classification.jsonnet -s model --include-package my_text_classifier

We can also evaluate the trained model. Note that the `evaluate` command takes the model archive saved in the serialization directory, which is the `model` directory we passed to `-s` above.

    allennlp evaluate model/model.tar.gz ../data/imdb/test.tsv --include-package my_text_classifier

Finally, we can make predictions:

    allennlp predict model/model.tar.gz ../data/imdb/test.jsonl --include-package my_text_classifier --predictor sentence_classifier
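
The predict command above reads a `.jsonl` file rather than the TSV used for training. If you need to build one, the sketch below converts a TSV file into JSON lines. It rests on two assumptions that are not confirmed in this section: that each TSV row is `text<TAB>label`, and that the predictor expects the text under a `"sentence"` key. The function name `tsv_to_jsonl` is hypothetical.

```python
import json


def tsv_to_jsonl(tsv_path, jsonl_path, field_name="sentence"):
    """Write one JSON object per TSV row, keeping only the text column.

    Assumes each row is `text<TAB>label`; returns the number of rows written.
    """
    count = 0
    with open(tsv_path, encoding="utf-8") as src, \
         open(jsonl_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            text = line.split("\t")[0]
            dst.write(json.dumps({field_name: text}) + "\n")
            count += 1
    return count
```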

### 6. Running a demo

First, install `allennlp-server`:

    pip install allennlp-server

Then you can spin up the server by running:

    allennlp serve \
    --archive-path model/model.tar.gz \
    --predictor sentence_classifier \
    --field-name sentence \
    --include-package my_text_classifier

Note that you need to specify the name of the field(s) to accept input. You can access `localhost:8000` in your browser to see the simple demo.
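
Besides the browser, the running server can presumably be queried from code. The sketch below only builds the HTTP request, so it runs without a server. The `/predict` path and the JSON POST body are assumptions about the server's HTTP interface, not something stated in this tutorial; the JSON key matches the `--field-name sentence` flag above.

```python
import json
import urllib.request


def build_predict_request(sentence, host="localhost", port=8000,
                          field_name="sentence"):
    """Build (but do not send) a JSON POST request for the demo server.

    The `/predict` path is an assumption about the server's HTTP API.
    """
    payload = json.dumps({field_name: sentence}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# With the server running, the request could be sent like this:
# with urllib.request.urlopen(build_predict_request("A great movie!")) as resp:
#     print(json.load(resp))
```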

### 7. Try BERT

AllenNLP supports many pre-trained models out of the box. You can try changing `classification.jsonnet` like this:

First, instead of the default tokenizer, we are going to use the BERT tokenizer (`bert_model` below is a Jsonnet variable holding the name of the pretrained model).

In [16]:
# "dataset_reader" : {
#         "type": "classification-tsv",
#         "tokenizer": {
#             "type": "pretrained_transformer",
#             "model_name": bert_model,
#         },
#         "token_indexers": {
#             "bert": {
#                 "type": "pretrained_transformer",
#                 "model_name": bert_model,
#             }
#         },
#         "max_tokens": 512
#     }, 

Then we can use the BERT embedder together with the BERT pooler as the encoder:

In [None]:
# "model": {
#         "type": "simple_classifier",
#         "embedder": {
#             "token_embedders": {
#                 "bert": {
#                     "type": "pretrained_transformer",
#                     "model_name": bert_model
#                 }
#             }
#         },
#         "encoder": {
#             "type": "bert_pooler",
#             "pretrained_model": bert_model
#         }
#     },

In a nutshell, you need to:

- Use a `PretrainedTransformerTokenizer` ("`pretrained_transformer`"), which tokenizes the string into wordpieces and adds special tokens like [CLS] and [SEP]
- Use a `PretrainedTransformerIndexer` ("`pretrained_transformer`"), which converts those wordpieces into ids using BERT's vocabulary
- Replace the embedder layer with a `PretrainedTransformerEmbedder` ("`pretrained_transformer`"), which uses a pretrained BERT model to embed the tokens, returning the top layer from BERT
- Replace the encoder with a `BertPooler` ("`bert_pooler`"), which adds another (pretrained) linear layer on top of the [CLS] token and returns the result
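
To make the first two steps concrete, here is a toy illustration of wordpiece tokenization and indexing in plain Python. It is a sketch of the general greedy longest-match-first idea with a made-up ten-entry vocabulary, not the code that `PretrainedTransformerTokenizer` or HuggingFace actually runs; the real BERT vocabulary is far larger and the ids differ.

```python
# Toy vocabulary for illustration only; the real BERT vocabulary is far larger.
TOY_VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]",
             "a", "great", "movie", "un", "##watch", "##able"]
TOKEN_TO_ID = {token: i for i, token in enumerate(TOY_VOCAB)}


def wordpiece(word):
    """Greedy longest-match-first split of one word into wordpieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in TOKEN_TO_ID:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text):
    """Lowercase, split on whitespace, apply wordpiece, add special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word))
    tokens.append("[SEP]")
    return tokens


def index(tokens):
    """Map each wordpiece to its id in the vocabulary."""
    return [TOKEN_TO_ID[t] for t in tokens]


tokens = tokenize("A great unwatchable movie")
# tokens: ['[CLS]', 'a', 'great', 'un', '##watch', '##able', 'movie', '[SEP]']
ids = index(tokens)
# ids: [2, 4, 5, 7, 8, 9, 6, 3]
```

The embedder and pooler then operate on these ids: the embedder produces one vector per wordpiece, and the pooler reads the vector at the `[CLS]` position.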

Also note that we switched the optimizer to use `AdamW` from HuggingFace's Transformers library.

The tokenizer and the embedder are thin wrappers around **HuggingFace's Transformers** library, so switching between different transformer architectures (BERT, RoBERTa, XLNet, etc.) is as simple as changing the `model_name` parameter in the config file.
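
As a sketch of what "changing the `model_name`" amounts to, the hypothetical helper below generates the two config sections shown earlier from a single name. The model names `bert-base-uncased` and `roberta-base` are examples chosen here, not taken from this tutorial, and whether the `bert_pooler` encoder suits every architecture is not something this sketch verifies.

```python
def transformer_sections(model_name, max_tokens=512):
    """Return the dataset_reader and model sections for one transformer name."""
    return {
        "dataset_reader": {
            "type": "classification-tsv",
            "tokenizer": {"type": "pretrained_transformer",
                          "model_name": model_name},
            "token_indexers": {
                "bert": {"type": "pretrained_transformer",
                         "model_name": model_name}
            },
            "max_tokens": max_tokens,
        },
        "model": {
            "type": "simple_classifier",
            "embedder": {
                "token_embedders": {
                    "bert": {"type": "pretrained_transformer",
                             "model_name": model_name}
                }
            },
            "encoder": {"type": "bert_pooler", "pretrained_model": model_name},
        },
    }


# Only the name changes between architectures (example names, see above).
bert_config = transformer_sections("bert-base-uncased")
roberta_config = transformer_sections("roberta-base")
```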

I have already created this config file, called `classification_bert.jsonnet`. Try it; you should see increased accuracy.