<a href="https://colab.research.google.com/github/ekdnam/allennlp-guide/blob/add-notebooks-quick-start/notebooks/quick_start/your_first_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 1: Quick Start

Part 1 gives you a quick walk-through of the main AllenNLP concepts and features. We’ll build a complete, working NLP model (a text classifier) along the way.

# Introduction

## 1. What is text classification?

Text classification is one of the simplest NLP tasks, where the model, given some input text, predicts a label for the text. See the figure below for an illustration.

![text-classification.svg](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/introduction/text-classification.svg)

There are a variety of applications of text classification, such as spam filtering, sentiment analysis, and topic detection. Some examples are shown in the table below.

|Application| Description | Input | Output |
|---|---|---| ---|
| Spam filtering | Detect and filter spam emails | Email | Spam / Not spam |
| Sentiment analysis | Detect the polarity of text | Tweet, review | Positive / Negative |
|Topic detection | Detect the topic of text | News article, blog post | Business / Tech / Sports |

## 2. Defining input and output

The first step for building an NLP model is to define its input and output. In AllenNLP, each training example is represented by an `Instance` object. An `Instance` consists of one or more `Fields`, where each `Field` represents one piece of data used by your model, either as an input or an output. `Fields` will get converted to tensors and fed to your model. The [Reading Data Chapter](https://guide.allennlp.org/reading-data) provides more details on using `Instances` and `Fields` to represent textual data.

For text classification, the input and the output are very simple. The model takes a `TextField` that represents the input text and predicts its label, which is represented by a `LabelField`:

```
# Input
text: TextField

# Output
label: LabelField
```

## 3. Reading data

![dataset-reader.png](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/dataset-reader.svg)

The first step for building an NLP application is to read the dataset and represent it with some internal data structure.

AllenNLP uses `DatasetReaders` to read the data. Their job is to transform raw data files into `Instances` that match the input/output spec. Our spec for text classification is:

```
# Inputs
text: TextField

# Outputs
label: LabelField
```

We'll want one `Field` for the input and another for the output, and our model will use the inputs to predict the outputs.

We assume the dataset has a simple data file format: 
```
[text] [TAB] [label]
```

for example:

```
I like this movie a lot! [TAB] positive
This was a monstrous waste of time [TAB] negative
AllenNLP is amazing [TAB] positive
Why does this have to be so complicated? [TAB] negative
This sentence expresses no sentiment [TAB] neutral
```
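
To make the file format concrete, here is a minimal plain-Python sketch of splitting such lines into text and label. It assumes that `[TAB]` in the examples above stands for a literal tab character; `parse_line` and `parse_lines` are hypothetical helper names, and splitting at the last tab is a choice made for this sketch, not something the format prescribes.

```python
from typing import List, Tuple

# Sample lines in the "[text] [TAB] [label]" format, with a real tab character.
SAMPLE_LINES = [
    "I like this movie a lot!\tpositive",
    "This was a monstrous waste of time\tnegative",
    "This sentence expresses no sentiment\tneutral",
]


def parse_line(line: str) -> Tuple[str, str]:
    """Split one line into (text, label) at the last tab character."""
    text, label = line.rstrip("\n").rsplit("\t", 1)
    return text.strip(), label.strip()


def parse_lines(lines: List[str]) -> List[Tuple[str, str]]:
    """Parse every non-blank line into a (text, label) pair."""
    return [parse_line(line) for line in lines if line.strip()]


examples = parse_lines(SAMPLE_LINES)
for text, label in examples:
    print(f"{label:>8} <- {text}")
```

Splitting at the last tab keeps the label intact even if the text itself happens to contain a tab.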

# Let's begin to code

# Imports

First, we import the required libraries.

In [None]:
import tempfile
from typing import Dict, Iterable, List, Tuple

In [None]:
!pip install allennlp

In [None]:
import allennlp
import torch
from allennlp.data import (
    DataLoader,
    DatasetReader,
    Instance,
    Vocabulary,
    TextFieldTensors,
)
from allennlp.data.data_loaders import SimpleDataLoader
from allennlp.data.fields import LabelField, TextField
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token, Tokenizer, WhitespaceTokenizer
from allennlp.models import Model
from allennlp.modules import TextFieldEmbedder, Seq2VecEncoder
from allennlp.modules.seq2vec_encoders import BagOfEmbeddingsEncoder
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.nn import util
from allennlp.training.trainer import GradientDescentTrainer, Trainer
from allennlp.training.optimizers import AdamOptimizer
from allennlp.training.metrics import CategoricalAccuracy

# Making a DatasetReader

You can implement your own `DatasetReader` by inheriting from the `DatasetReader` class. At minimum, you need to override the `_read()` method, which reads the input and yields `Instances`.

In [None]:
class ClassificationTsvReader(DatasetReader):
    def __init__(
        self,
        tokenizer: Tokenizer = None,
        token_indexers: Dict[str, TokenIndexer] = None,
        max_tokens: int = None,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.tokenizer = tokenizer or WhitespaceTokenizer()
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}
        self.max_tokens = max_tokens

    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path, "r") as lines:
            for line in lines:
                text, sentiment = line.strip().split("\t")
                tokens = self.tokenizer.tokenize(text)
                if self.max_tokens:
                    tokens = tokens[: self.max_tokens]
                text_field = TextField(tokens, self.token_indexers)
                label_field = LabelField(sentiment)
                yield Instance({"text": text_field, "label": label_field})


This is a minimal `DatasetReader` that produces classification `Instances` when you call `reader.read(file)`. The reader takes each line in the input file, splits the `text` into words using a tokenizer (the default `WhitespaceTokenizer` used here simply splits on whitespace), and represents those words as tensors using a word id in a vocabulary we construct for you.

Pay special attention to the `text` and `label` keys that are used in the fields dictionary passed to the `Instance` - these keys will be used as parameter names when passing tensors into your `Model` later.

Ideally, the output label would be optional when we create the `Instances`, so that we can use the same code to make predictions on unlabeled data (say, in a demo), but for the rest of this chapter we’ll keep things simple and ignore that.

There are lots of places where this could be made better for a more flexible and fully-featured reader; see the section on [DatasetReaders](https://guide.allennlp.org/reading-data#2) for a deeper dive.
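
One such improvement, making the label optional as mentioned above, might look roughly like the sketch below. Plain dictionaries and lists stand in for `Instance`, `TextField`, and `LabelField` so that the example runs without AllenNLP; `text_to_fields` is a hypothetical name, not part of the reader shown earlier.

```python
from typing import Dict, List, Optional, Union

FieldValue = Union[List[str], str]


def text_to_fields(
    text: str,
    label: Optional[str] = None,
    max_tokens: Optional[int] = None,
) -> Dict[str, FieldValue]:
    """Build the fields for one example; "label" is added only when given."""
    tokens = text.split()  # whitespace tokenization, as in the reader above
    if max_tokens:
        tokens = tokens[:max_tokens]
    fields: Dict[str, FieldValue] = {"text": tokens}
    if label is not None:
        fields["label"] = label
    return fields


labeled = text_to_fields("AllenNLP is amazing", "positive")
unlabeled = text_to_fields("AllenNLP is amazing", max_tokens=2)
print(labeled)
print(unlabeled)
```

With a helper like this, the same code path can serve both training data (with labels) and unlabeled text at prediction time.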

# Building your model


![designing-a-model.svg](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/designing-a-model.svg)

The next thing we need is a `Model` that will take a batch of `Instances`, predict the outputs from the inputs, and compute a loss.

Remember that our `Instances` have this input/output spec:

```
# Inputs
text: TextField

# Outputs
label: LabelField
```
Also, remember that we used these names (`text` and `label`) for the fields in the `DatasetReader`. AllenNLP passes those fields by name to the model code, so we need to use the same names in our model.


## What should our model do?

![designing-a-model-1.svg](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/designing-a-model-1.svg)

Conceptually, a generic model for classifying text does the following:

- Get some features corresponding to each word in your input
- Combine those word-level features into a document-level feature vector
- Classify that document-level feature vector into one of your labels.

In AllenNLP, we make each of these conceptual steps into a generic abstraction that you can use in your code, so that you can have a very flexible model that can use different concrete components for each step.

## Representing text with token IDs

![designing-a-model-2.svg](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/designing-a-model-2.svg)

The first step is changing the strings in the input text into token ids. This is handled by the `SingleIdTokenIndexer` that we used previously, as part of the data processing pipeline, so you don’t have to write code for it.
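
To give a feel for what this indexing step produces, here is a small plain-Python sketch. It is an illustration under assumptions, not AllenNLP's implementation: the toy `token_to_id` mapping, the placeholder strings, and the helper names are all made up for this example.

```python
from typing import Dict, List

PAD, UNK = "@@PADDING@@", "@@UNKNOWN@@"  # placeholder strings for this sketch

# A toy token-to-ID mapping; ID 0 is reserved for padding, ID 1 for unknown tokens.
token_to_id: Dict[str, int] = {PAD: 0, UNK: 1, "I": 2, "like": 3, "this": 4, "movie": 5}


def index_tokens(tokens: List[str]) -> List[int]:
    """Map each token to its ID, falling back to the unknown-token ID."""
    return [token_to_id.get(token, token_to_id[UNK]) for token in tokens]


def pad_batch(batch: List[List[int]]) -> List[List[int]]:
    """Right-pad every ID sequence with the padding ID to the longest length."""
    longest = max(len(ids) for ids in batch)
    return [ids + [token_to_id[PAD]] * (longest - len(ids)) for ids in batch]


batch = pad_batch([
    index_tokens("I like this movie".split()),
    index_tokens("I like cats".split()),
])
print(batch)  # [[2, 3, 4, 5], [2, 3, 1, 0]]
```

The word `cats` is not in the toy mapping, so it becomes the unknown ID, and the shorter sequence is padded so that the batch forms a rectangular grid of IDs.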

## Embedding tokens

![designing-a-model-3.svg](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/designing-a-model-3.svg)

The first thing our `Model` does is apply an `Embedding` function that converts each token ID that we got as input into a vector. This gives us a vector for each input token, so we have a large tensor here.
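
Conceptually, an embedding is a lookup table with one row per token ID. The sketch below uses nested Python lists instead of tensors, and the table values are arbitrary numbers chosen for illustration; in a real model they are learned parameters.

```python
from typing import List

EMBEDDING_DIM = 3

# One row per token ID; in a trained model these values are learned parameters.
embedding_table: List[List[float]] = [
    [0.0, 0.0, 0.0],   # ID 0 (padding)
    [0.1, 0.1, 0.1],   # ID 1 (unknown)
    [0.5, -0.2, 0.3],  # ID 2
    [0.9, 0.4, -0.7],  # ID 3
]


def embed(token_ids: List[List[int]]) -> List[List[List[float]]]:
    """Replace each ID with its row: (batch, tokens) -> (batch, tokens, dim)."""
    return [[embedding_table[i] for i in ids] for ids in token_ids]


embedded = embed([[2, 3, 1], [3, 0, 0]])
print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 2 3 3
```

The printed sizes correspond to `(batch_size, num_tokens, embedding_dim)`.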

## Apply Seq2Vec encoder

![designing-a-model-4.svg](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/designing-a-model-4.svg)

Next we apply some function that takes the sequence of vectors for each input token and squashes it into a single vector. Before the days of pretrained language models like BERT, this was typically an LSTM or convolutional encoder. With BERT we might just take the embedding of the `[CLS]` token (more on how to do that [later](https://guide.allennlp.org/next-steps)).


## Computing distribution over labels

![designing-a-model-5.svg](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/designing-a-model-5.svg)

Finally, we take that single feature vector (for each `Instance` in the batch), and classify it as a label, which will give us a categorical probability distribution over our label space.

# Implementing the model - the constructor

![allennlp-model](https://raw.githubusercontent.com/allenai/allennlp-guide/master/static/part1/your-first-model/allennlp-model.svg)

Now that we know what our model is going to do, we need to implement it. First, we’ll say a few words about how `Models` work in AllenNLP:

- An AllenNLP `Model` is just a PyTorch `Module`
- It implements a `forward()` method, and requires the output to be a dictionary
- Its output contains a `loss` key during training, which is used to optimize the model.

Our training loop takes a batch of `Instances`, passes it through `Model.forward()`, grabs the `loss` key from the resulting dictionary, and uses backprop to compute gradients and update the model’s parameters. You don’t have to implement the training loop—all this will be taken care of by AllenNLP (though you can if you want to).
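
The loop described above can be sketched in a few lines of plain Python. This is a deliberately tiny toy, not AllenNLP's trainer: `ToyModel` has a single parameter, and because there is no autograd here, `forward()` also returns a hand-derived gradient under a hypothetical `grad` key.

```python
from typing import Dict, List, Tuple


class ToyModel:
    """Fits y = weight * x; forward() returns a dict containing a "loss" key."""

    def __init__(self) -> None:
        self.weight = 0.0

    def forward(self, x: float, y: float) -> Dict[str, float]:
        error = self.weight * x - y
        # The gradient is written out by hand here; PyTorch autograd would
        # derive it from the loss automatically.
        return {"loss": error ** 2, "grad": 2 * error * x}


def train(model: ToyModel, data: List[Tuple[float, float]],
          lr: float = 0.05, epochs: int = 200) -> float:
    """Run plain gradient descent and return the mean loss of the last epoch."""
    last_loss = 0.0
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            output = model.forward(x, y)         # 1. forward pass
            total += output["loss"]              # 2. read the "loss" key
            model.weight -= lr * output["grad"]  # 3. update the parameter
        last_loss = total / len(data)
    return last_loss


model = ToyModel()
final_loss = train(model, [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
print(round(model.weight, 3), final_loss < 1e-6)
```

The three numbered steps are the same ones the prebuilt training loop performs for you, just with tensors, batches, and an optimizer in place of the hand-written update.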

## Constructing the Model

In the `Model` constructor, we need to instantiate all of the parameters that we will want to train. In AllenNLP, [we recommend](https://guide.allennlp.org/using-config-files#1) taking most of these parameters as constructor arguments, so that we can configure the behavior of our model without changing the model code itself, and so that we can think at a higher level about what our model is doing. The constructor for our text classification model looks like this:

```python
@Model.register('simple_classifier')
class SimpleClassifier(Model):
    def __init__(self,
                 vocab: Vocabulary,
                 embedder: TextFieldEmbedder,
                 encoder: Seq2VecEncoder):
        super().__init__(vocab)
        self.embedder = embedder
        self.encoder = encoder
        num_labels = vocab.get_vocab_size("labels")
        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)
```

You’ll notice that we use type annotations a lot in AllenNLP code - this is both for code readability (it’s way easier to understand what a method does if you know the types of its arguments, instead of just their names), and because we use these annotations to do some magic for you in some cases.

One of those cases is constructor parameters, where we can automatically construct the embedder and encoder from a configuration file using these type annotations. See the chapter on [configuration files](https://guide.allennlp.org/using-config-files) for more information. That chapter will also tell you about the call to `@Model.register()`.

The upshot is that if you’re using the `allennlp train` command with a configuration file (which we show how to do below), you won’t ever have to call this constructor; it all gets taken care of for you.

### Passing the vocabulary

<pre class="brush: python"><code class="language-python">
@Model.register('simple_classifier')
class SimpleClassifier(Model):
    def __init__(self,
                 <strong>vocab: Vocabulary,</strong>
                 embedder: TextFieldEmbedder,
                 encoder: Seq2VecEncoder):
        super().__init__(vocab)
        self.embedder = embedder
        self.encoder = encoder
        <strong>num_labels = vocab.get_vocab_size("labels")</strong>
        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)
</code></pre>

`Vocabulary` manages mappings between vocabulary items (such as words and labels) and their integer IDs. In our prebuilt training loop, the vocabulary gets created by AllenNLP after reading your training data, then passed to the `Model` when it gets constructed. We’ll find all tokens and labels that you use and assign them all integer IDs in separate namespaces. The way that this happens is fully configurable; see the [Vocabulary section of this guide](https://guide.allennlp.org/reading-data#3) for more information.

What we did in the `DatasetReader` will put the labels in the default “labels” namespace, and we grab the number of labels from the vocabulary on line 10.
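
The idea of separate namespaces can be sketched with a dictionary of dictionaries. This is a simplified stand-in for `Vocabulary`, assuming nothing beyond plain Python: it leaves out padding and unknown-token entries, and `add_item` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict

examples = [
    {"text": ["I", "like", "this"], "label": "positive"},
    {"text": ["waste", "of", "time"], "label": "negative"},
    {"text": ["I", "like", "AllenNLP"], "label": "positive"},
]

# One independent item -> ID mapping per namespace.
vocab: Dict[str, Dict[str, int]] = defaultdict(dict)


def add_item(namespace: str, item: str) -> int:
    """Return the item's ID in the namespace, assigning the next free ID if new."""
    mapping = vocab[namespace]
    if item not in mapping:
        mapping[item] = len(mapping)
    return mapping[item]


for example in examples:
    for token in example["text"]:
        add_item("tokens", token)
    add_item("labels", example["label"])

num_labels = len(vocab["labels"])
print(num_labels, vocab["labels"])  # 2 {'positive': 0, 'negative': 1}
```

Counting the entries in the `labels` namespace plays the role of `vocab.get_vocab_size("labels")` in the constructor: it tells the final linear layer how many outputs it needs.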


### Embedding words

<pre class="brush: python"><code class="language-python">
@Model.register('simple_classifier')
class SimpleClassifier(Model):
    def __init__(self,
                 vocab: Vocabulary,
                 <strong>embedder: TextFieldEmbedder,</strong>
                 encoder: Seq2VecEncoder):
        super().__init__(vocab)
        <strong>self.embedder = embedder</strong>
        self.encoder = encoder
        num_labels = vocab.get_vocab_size("labels")
        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)
</code></pre>

To get an initial word embedding, we’ll use AllenNLP’s `TextFieldEmbedder`. This abstraction takes the tensors created by a `TextField` and embeds each one. This is our most complex abstraction, because there are a lot of ways to do this particular operation in NLP, and we want to be able to switch between these without changing our code. We won’t go into the details here; we have a whole [chapter of this guide](https://guide.allennlp.org/representing-text-as-features) dedicated to diving deep into how this abstraction works and how to use it. All you need to know for now is that you apply this to the `text` parameter you get in `forward()`, and you get out a tensor that has a single embedding vector for each input token, with shape `(batch_size, num_tokens, embedding_dim)`.

### Applying a Seq2VecEncoder

<pre class="brush: python"><code class="language-python">
@Model.register('simple_classifier')
class SimpleClassifier(Model):
    def __init__(self,
                 vocab: Vocabulary,
                 embedder: TextFieldEmbedder,
                 <strong>encoder: Seq2VecEncoder</strong>):
        super().__init__(vocab)
        self.embedder = embedder
        <strong>self.encoder = encoder</strong>
        num_labels = vocab.get_vocab_size("labels")
        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)
</code></pre>

To squash our sequence of token vectors into a single vector, we use AllenNLP’s `Seq2VecEncoder` abstraction. As the name implies, this encapsulates an operation that takes a sequence of vectors and returns a single vector. Because all of our modules operate on batched input, this will take a tensor shaped like `(batch_size, num_tokens, embedding_dim)` and return a tensor shaped like `(batch_size, encoding_dim)`.

In [None]:
class SimpleClassifier(Model):
    def __init__(
        self, 
        vocab: Vocabulary, 
        embedder: TextFieldEmbedder, 
        encoder: Seq2VecEncoder
    ):
        super().__init__(vocab)
        self.embedder = embedder
        self.encoder = encoder
        num_labels = vocab.get_vocab_size("labels")
        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)

# Implementing the model — the forward method

Next, we need to implement the `forward()` method of your model, which takes the input, produces the prediction, and computes the loss. Remember, our constructor and input/output spec look like:

```python
@Model.register('simple_classifier')
class SimpleClassifier(Model):
    def __init__(self,
                 vocab: Vocabulary,
                 embedder: TextFieldEmbedder,
                 encoder: Seq2VecEncoder):
        super().__init__(vocab)
        self.embedder = embedder
        self.encoder = encoder
        num_labels = vocab.get_vocab_size("labels")
        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)
```

```
# Inputs:
text: TextField

# Outputs:
label: LabelField
```

Here we’ll show how to use these parameters inside of `Model.forward()`, which will get arguments that match our input/output spec (because that’s how we coded the [DatasetReader](https://colab.research.google.com/drive/1Fxl4PEW-U-x7MjIrLfPyqw2Sgs1Z2Fcw?authuser=1#scrollTo=5o2gOAXZnW_O&line=1&uniqifier=1)).

## Model.forward()

In `forward`, we use the parameters that we created in our constructor to transform the inputs into outputs. After we’ve predicted the outputs, we compute some loss function based on how close we got to the true outputs, and then return that loss (along with whatever else we want) so that we can use it to train the parameters.

```python
class SimpleClassifier(Model):
    def forward(self,
                text: TextFieldTensors,
                label: torch.Tensor) -> Dict[str, torch.Tensor]:
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)
        # Shape: (batch_size, num_tokens)
        mask = util.get_text_field_mask(text)
        # Shape: (batch_size, encoding_dim)
        encoded_text = self.encoder(embedded_text, mask)
        # Shape: (batch_size, num_labels)
        logits = self.classifier(encoded_text)
        # Shape: (batch_size, num_labels)
        probs = torch.nn.functional.softmax(logits, dim=-1)
        # Shape: () (a scalar: the mean loss over the batch)
        loss = torch.nn.functional.cross_entropy(logits, label)
        return {'loss': loss, 'probs': probs}
```


### Inputs to forward()

<pre class="brush: python"><code class="language-python">
class SimpleClassifier(Model):
    def forward(self,
                <strong>text: TextFieldTensors,
                label: torch.Tensor) -> Dict[str, torch.Tensor]:</strong>
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)
        # Shape: (batch_size, num_tokens)
        mask = util.get_text_field_mask(text)
        # Shape: (batch_size, encoding_dim)
        encoded_text = self.encoder(embedded_text, mask)
        # Shape: (batch_size, num_labels)
        logits = self.classifier(encoded_text)
        # Shape: (batch_size, num_labels)
        probs = torch.nn.functional.softmax(logits, dim=-1)
        # Shape: () (a scalar: the mean loss over the batch)
        loss = torch.nn.functional.cross_entropy(logits, label)
        return {'loss': loss, 'probs': probs}
</code></pre>

The first thing to notice is the inputs to this function. The way the AllenNLP training loop works is that we will take the field names that you used in your `DatasetReader` and give you a batch of instances _with those same field names_ in `forward`. So, because we used `text` and `label` as our field names, we need to name our arguments to forward the same way.

Second, notice the types of these arguments. Each type of `Field` knows how to convert itself into a `torch.Tensor`, then create a batched `torch.Tensor` from all of the `Fields` with the same name from a batch of `Instances`. The types you see for `text` and `label` are the tensors produced by `TextField` and `LabelField` (again, see our [chapter on using TextFields](https://guide.allennlp.org/representing-text-as-features) for more information about `TextFieldTensors`). The important part to know is that our `TextFieldEmbedder`, which we created in the constructor, expects this type of object as input and will return an embedded tensor as output.
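
The name matching described above boils down to keyword-argument unpacking. Here is a minimal sketch with a stand-in `forward` function and nested lists in place of tensors; the returned keys are made up purely to show that the call succeeded.

```python
from typing import Dict, List


def forward(text: List[List[int]], label: List[int]) -> Dict[str, int]:
    """Stand-in for Model.forward(); parameter names match the field names."""
    return {"batch_size": len(text), "num_labels_seen": len(set(label))}


# A batch keyed by the field names used when the instances were created.
batch = {"text": [[2, 3, 4], [5, 6, 0]], "label": [0, 1]}

# Unpacking the batch routes each entry to the parameter of the same name.
output = forward(**batch)
print(output)  # {'batch_size': 2, 'num_labels_seen': 2}

# A mismatched field name fails immediately with a TypeError.
try:
    forward(**{"tokens": batch["text"], "label": batch["label"]})
except TypeError as error:
    print("TypeError:", error)
```

This is why renaming a field in the `DatasetReader` without renaming the matching `forward()` argument leads to an error at the first batch.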

### Embedding the text

<pre class="brush: python"><code class="language-python">
class SimpleClassifier(Model):
    def forward(self,
                text: TextFieldTensors,
                label: torch.Tensor) -> Dict[str, torch.Tensor]:
        <strong># Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)</strong>
        # Shape: (batch_size, num_tokens)
        mask = util.get_text_field_mask(text)
        # Shape: (batch_size, encoding_dim)
        encoded_text = self.encoder(embedded_text, mask)
        # Shape: (batch_size, num_labels)
        logits = self.classifier(encoded_text)
        # Shape: (batch_size, num_labels)
        probs = torch.nn.functional.softmax(logits, dim=-1)
        # Shape: () (a scalar: the mean loss over the batch)
        loss = torch.nn.functional.cross_entropy(logits, label)
        return {'loss': loss, 'probs': probs}
</code></pre>

The first actual modeling operation that we do is embed the text, getting a vector for each input token. Notice here that we’re not specifying anything about how that operation is done, just that a `TextFieldEmbedder` that we got in our constructor is going to do it. This lets us be very flexible later, changing between various kinds of embedding methods or pretrained representations (including ELMo and BERT) without changing our model code.

### Applying a Seq2VecEncoder

<pre class="brush: python"><code class="language-python">
class SimpleClassifier(Model):
    def forward(self,
                text: TextFieldTensors,
                label: torch.Tensor) -> Dict[str, torch.Tensor]:
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)
        <strong># Shape: (batch_size, num_tokens)
        mask = util.get_text_field_mask(text)
        # Shape: (batch_size, encoding_dim)
        encoded_text = self.encoder(embedded_text, mask)</strong>
        # Shape: (batch_size, num_labels)
        logits = self.classifier(encoded_text)
        # Shape: (batch_size, num_labels)
        probs = torch.nn.functional.softmax(logits, dim=-1)
        # Shape: () (a scalar: the mean loss over the batch)
        loss = torch.nn.functional.cross_entropy(logits, label)
        return {'loss': loss, 'probs': probs}
</code></pre>

After we have embedded our text, we next have to squash the sequence of vectors (one per token) into a single vector for the whole text. We do that using the `Seq2VecEncoder` that we got as a constructor argument. In order to behave properly when we’re batching pieces of text together that could have different lengths, we need to mask elements in the `embedded_text` tensor that are only there due to padding. We use a utility function to get a mask from the `TextField` output, then pass that mask into the encoder.

At the end of these lines, we have a single vector for each instance in the batch.
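
To see why the mask matters, consider masked averaging, one simple way to turn a sequence of vectors into a single vector. The sketch below uses nested lists rather than tensors and is only an illustration of the idea; it assumes every example has at least one real (unmasked) token, and `masked_mean` is a hypothetical helper, not an AllenNLP function.

```python
from typing import List

Vector = List[float]


def masked_mean(embedded: List[List[Vector]], mask: List[List[bool]]) -> List[Vector]:
    """Average token vectors per example, skipping positions where mask is False."""
    encoded = []
    for vectors, flags in zip(embedded, mask):
        kept = [vec for vec, keep in zip(vectors, flags) if keep]
        dim = len(vectors[0])
        encoded.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return encoded


# Two examples padded to three tokens; the second has only one real token.
embedded = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 2.0], [0.0, 0.0], [0.0, 0.0]],
]
mask = [[True, True, True], [True, False, False]]

print(masked_mean(embedded, mask))  # [[3.0, 4.0], [2.0, 2.0]]
```

Without the mask, the second example would be averaged over three positions and its vector would shrink toward zero simply because it was padded.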

### Making predictions

<pre class="brush: python"><code class="language-python">
class SimpleClassifier(Model):
    def forward(self,
                text: TextFieldTensors,
                label: torch.Tensor) -> Dict[str, torch.Tensor]:
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)
        # Shape: (batch_size, num_tokens)
        mask = util.get_text_field_mask(text)
        # Shape: (batch_size, encoding_dim)
        encoded_text = self.encoder(embedded_text, mask)
        <strong># Shape: (batch_size, num_labels)
        logits = self.classifier(encoded_text)
        # Shape: (batch_size, num_labels)
        probs = torch.nn.functional.softmax(logits, dim=-1)
        # Shape: () (a scalar: the mean loss over the batch)
        loss = torch.nn.functional.cross_entropy(logits, label)</strong>
        return {'loss': loss, 'probs': probs}
</code></pre>

The last step of our model is to take the vector for each instance in the batch and predict a label for it. Our `classifier` is a `torch.nn.Linear` layer that gives a score (commonly called a `logit`) for each possible label. We normalize those scores using a `softmax` operation to get a probability distribution over labels that we can return to a consumer of this model. For computing the loss, PyTorch has a built-in function that computes the cross entropy between the logits that we predict and the true labels, and we use that as our loss function.
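
For readers who want to see the arithmetic behind these two operations, here is a plain-Python sketch of softmax and of cross entropy averaged over a batch. It is meant as an illustration of the math, not as a replacement for the PyTorch functions; the logits and labels are made-up numbers.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn scores into probabilities; subtracting the max avoids overflow."""
    highest = max(logits)
    exps = [math.exp(score - highest) for score in logits]
    total = sum(exps)
    return [value / total for value in exps]


def cross_entropy(batch_logits: List[List[float]], labels: List[int]) -> float:
    """Mean negative log-probability of the true label across the batch."""
    losses = [-math.log(softmax(logits)[label])
              for logits, label in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


batch_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
labels = [0, 2]

probs = [softmax(logits) for logits in batch_logits]
loss = cross_entropy(batch_logits, labels)
print([round(p, 3) for p in probs[1]])  # [0.333, 0.333, 0.333]
print(round(loss, 3))
```

Note that the loss is computed from the logits and the integer label IDs, which is why `forward()` passes `logits` rather than `probs` to the loss function.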


# class SimpleClassifier(Model)

In [None]:
class SimpleClassifier(Model):
    def __init__(
        self, 
        vocab: Vocabulary, 
        embedder: TextFieldEmbedder, 
        encoder: Seq2VecEncoder
    ):
        super().__init__(vocab)
        self.embedder = embedder
        self.encoder = encoder
        num_labels = vocab.get_vocab_size("labels")
        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)

    def forward(
        self, text: TextFieldTensors, label: torch.Tensor
    ) -> Dict[str, torch.Tensor]:
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)
        # Shape: (batch_size, num_tokens)
        mask = util.get_text_field_mask(text)
        # Shape: (batch_size, encoding_dim)
        encoded_text = self.encoder(embedded_text, mask)
        # Shape: (batch_size, num_labels)
        logits = self.classifier(encoded_text)
        # Shape: (batch_size, num_labels)
        probs = torch.nn.functional.softmax(logits, dim=-1)
        # Shape: () (a scalar: the mean loss over the batch)
        loss = torch.nn.functional.cross_entropy(logits, label)
        return {"loss": loss, "probs": probs}
        

# Conclusion

And that’s it! This is all you need for a simple classifier. After you’ve written a `DatasetReader` and `Model`, AllenNLP takes care of the rest: connecting your input files to the dataset reader, intelligently batching together your instances and feeding them to the model, and optimizing the model’s parameters by using backprop on the loss. We go over this part in the next chapter.