This is a whirlwind tour of most of the fundamental concepts underlying AllenNLP. There is nothing for you to do here. If you like, you can just Ctrl-Enter all the way to the bottom, but feel free to poke at the results as you go if you're curious about them.

You won't have to worry much about many of these concepts, but it's good to have a rough understanding of what's going on under the hood.

# Tokenization

The default tokenizer in AllenNLP is the spaCy tokenizer. You can specify others if you need them. (For instance, if you're using BERT, you want to use the same tokenizer that the BERT model expects.)

In [None]:
from allennlp.data.tokenizers import WordTokenizer

In [None]:
text = "I don't hate notebooks, I just don't like notebooks!"

In [None]:
tokenizer = WordTokenizer()  

In [None]:
tokens = tokenizer.tokenize(text)
tokens

# Token Indexers

A `TokenIndexer` turns tokens into indices or lists of indices. We won't be able to see how they operate until slightly later.

In [None]:
from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenCharactersIndexer

In [None]:
token_indexer = SingleIdTokenIndexer()  # maps tokens to word_ids

# Fields

Your training examples will be represented as `Instance`s, each consisting of typed `Field`s.

In [None]:
from allennlp.data.fields import TextField, LabelField

A `TextField` is for storing text, and also needs one or more `TokenIndexer`s that will be used to convert the text into indices.

In [None]:
text_field = TextField(tokens, {"tokens": token_indexer})

In [None]:
text_field._indexed_tokens  # not yet

A `LabelField` is for storing a discrete label.

In [None]:
label_field = LabelField("technology")

# Instances

Each `Instance` is just a collection of named `Field`s.

In [None]:
from allennlp.data.instance import Instance

In [None]:
instance = Instance({"text": text_field, "category": label_field})

# Vocabulary

Based on our instances we construct a `Vocabulary` which contains the various mappings token <-> index, label <-> index, and so on.

In [None]:
from allennlp.data.vocabulary import Vocabulary

In [None]:
vocab = Vocabulary.from_instances([instance])

Here you can see that our vocabulary has two mappings, a `tokens` mapping (for the tokens) and a `labels` mapping (for the labels).

In [None]:
vocab._token_to_index

In [None]:
text_field._indexed_tokens

In [None]:
label_field._label_id

Although we have constructed the mappings, we haven't yet used them to index the fields in our instance. We have to do that manually (although when you use the AllenNLP trainer, all of this will be taken care of).

In [None]:
instance.index_fields(vocab)

In [None]:
text_field._indexed_tokens

In [None]:
label_field._label_id
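
To make the token <-> index mapping concrete, here is a library-free sketch of building a padded token namespace and then using it to index tokens. It is not AllenNLP's implementation: `build_token_namespace`, `index_tokens` and the hand-written `toy_tokens` list are hypothetical names for illustration, and the ordering rule (most frequent first) and the use of index 1 for unknown tokens are assumptions of this sketch.

```python
from collections import Counter

PADDING, UNKNOWN = "@@PADDING@@", "@@UNKNOWN@@"


def build_token_namespace(token_lists):
    """Build token -> index and index -> token maps (illustrative only)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Reserve 0 for padding and 1 for out-of-vocabulary tokens (sketch assumption).
    token_to_index = {PADDING: 0, UNKNOWN: 1}
    # most_common() keeps first-seen order among equally frequent tokens.
    for tok, _ in counts.most_common():
        token_to_index[tok] = len(token_to_index)
    index_to_token = {i: t for t, i in token_to_index.items()}
    return token_to_index, index_to_token


def index_tokens(tokens, token_to_index):
    """Replace each token with its id, falling back to the unknown id."""
    return [token_to_index.get(t, token_to_index[UNKNOWN]) for t in tokens]


# Hand-written stand-in for tokenizer output (an assumption, not real spaCy output).
toy_tokens = ["I", "do", "n't", "hate", "notebooks", ",",
              "I", "just", "do", "n't", "like", "notebooks", "!"]
token_to_index, index_to_token = build_token_namespace([toy_tokens])
ids = index_tokens(toy_tokens + ["Jupyter"], token_to_index)
```

Building the mapping and applying it are two separate steps here, which mirrors why the fields above stay unindexed until `index_fields` is called.
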

Once the `Instance` has been indexed, it then knows how to convert itself to a tensor dict.

In [None]:
instance.as_tensor_dict()

It also reports its own padding lengths, which determine how far other instances would need to be padded if we batch them together. (More on this below!)

In [None]:
instance.get_padding_lengths()

# Batching and Padding

When you're doing NLP, you have sequences with different lengths, which means that padding and masking are very important. They're tricky to get right! Luckily, AllenNLP handles most of the details for you.

In [None]:
text1 = "I just don't like notebooks."
tokens1 = tokenizer.tokenize(text1)
text_field1 = TextField(tokens1, {"tokens": token_indexer})
label_field1 = LabelField("Joel")
instance1 = Instance({"text": text_field1, "speaker": label_field1})
text2 = "I do like notebooks."
tokens2 = tokenizer.tokenize(text2)
text_field2 = TextField(tokens2, {"tokens": token_indexer})
label_field2 = LabelField("Tim")
instance2 = Instance({"text": text_field2, "speaker": label_field2})


In [None]:
from allennlp.data.dataset import Batch

In [None]:
vocab = Vocabulary.from_instances([instance1, instance2])

In [None]:
batch = Batch([instance1, instance2])

In [None]:
batch.index_instances(vocab)

Notice that

1. the batching is already taken care of for you, and
2. the shorter text field is appropriately padded with 0's (the `@@PADDING@@` id)

In [None]:
batch.as_tensor_dict()
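
The padding step can be sketched without any library. The helper below pads every id sequence to the longest one in the batch using 0 and also builds the matching mask. `pad_batch` and the `toy_ids` values are made up for illustration. AllenNLP does this for you inside `as_tensor_dict`, and its internals may differ.

```python
def pad_batch(sequences, padding_value=0):
    """Pad each id list to the batch's max length and return (padded, mask)."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]
    # The mask is 1 for real tokens and 0 for padding positions.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


# Made-up token ids for a 7-token and a 5-token sentence.
toy_ids = [[2, 7, 8, 9, 10, 3, 4], [2, 8, 5, 3, 4]]
padded, mask = pad_batch(toy_ids)
```

The mask matters because a padded id is otherwise just another index: anything that later averages or sums over the sequence needs to know which positions to ignore.
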

# Using Multiple Indexers

In some circumstances you might want to use multiple token indexers. For instance, you might want to index a token using its token_id, but also as a sequence of character_ids. This is as simple as adding extra token indexers to our text fields.

In [None]:
from allennlp.data.token_indexers import TokenCharactersIndexer

In [None]:
token_characters_indexer = TokenCharactersIndexer(min_padding_length=3)

In [None]:
text_field = TextField(tokens, {"tokens": token_indexer, "token_characters": token_characters_indexer})
label_field = LabelField("technology")

In [None]:
instance = Instance({"text": text_field, "label": label_field})

In [None]:
vocab = Vocabulary.from_instances([instance])

You can see that we now have an additional vocabulary namespace for the character ids:

In [None]:
vocab._token_to_index

In [None]:
instance.index_fields(vocab)

And now when we call `instance.as_tensor_dict` we'll get an additional (padded) tensor with the character_ids.

In [None]:
instance.as_tensor_dict()
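
As a rough illustration of what a character-level indexer produces, the sketch below turns each token into a list of character ids and pads every list to a common width that is never smaller than `min_padding_length`. The function `index_characters` is hypothetical and only mimics the shape of the result. The real `TokenCharactersIndexer` takes its character ids from the `Vocabulary` rather than building them on the fly.

```python
def index_characters(tokens, min_padding_length=3):
    """Map each token to padded character ids (illustrative only)."""
    char_to_index = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1}
    for tok in tokens:
        for ch in tok:
            # Assign the next free id the first time a character is seen.
            char_to_index.setdefault(ch, len(char_to_index))
    # Every row is at least min_padding_length wide, even for short tokens.
    width = max(min_padding_length, max(len(tok) for tok in tokens))
    rows = [[char_to_index[ch] for ch in tok] + [0] * (width - len(tok))
            for tok in tokens]
    return char_to_index, rows


_, rows_short = index_characters(["I", "do"], min_padding_length=3)
_, rows_long = index_characters(["I", "do", "n't", "like"], min_padding_length=3)
```

Each token now carries a whole row of ids instead of a single id, which is why the tensor for this indexer has one more dimension than the `tokens` one.
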

# TokenEmbedders

Once we have our text represented as ids, we use a `TokenEmbedder` to create tensor embeddings.

In [None]:
text1 = "I just don't like notebooks."
tokens1 = tokenizer.tokenize(text1)
text_field1 = TextField(tokens1, {"tokens": token_indexer})
label_field1 = LabelField("Joel")
instance1 = Instance({"text": text_field1, "speaker": label_field1})
text2 = "I do like notebooks."
tokens2 = tokenizer.tokenize(text2)
text_field2 = TextField(tokens2, {"tokens": token_indexer})
label_field2 = LabelField("Tim")
instance2 = Instance({"text": text_field2, "speaker": label_field2})
vocab = Vocabulary.from_instances([instance1, instance2])
batch = Batch([instance1, instance2])
batch.index_instances(vocab)

In [None]:
tensor_dict = batch.as_tensor_dict()
tensor_dict

In [None]:
from allennlp.modules.token_embedders import Embedding

Here we define an embedding layer that has a number of embeddings equal to the corresponding vocabulary size, and that consists of 5-dimensional vectors. In this case the embeddings will just be randomly initialized.

In [None]:
embedding = Embedding(num_embeddings=vocab.get_vocab_size("tokens"), embedding_dim=5)

Accordingly, we can apply those embeddings to the indexed tokens.

In [None]:
embedding(tensor_dict['text']['tokens'])
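
Conceptually, an embedding layer is a lookup table with one row per vocabulary entry. The plain-Python sketch below shows the lookup and the resulting (batch, sequence, dimension) nesting. `make_embedding_table` and `embed` are hypothetical helpers, and the uniform random initialisation is an assumption of this sketch rather than what `Embedding` does internally.

```python
import random


def make_embedding_table(num_embeddings, embedding_dim, seed=0):
    """Create a randomly initialised table with one vector per id."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
            for _ in range(num_embeddings)]


def embed(batch_ids, table):
    """Replace every id in a padded batch with its row of the table."""
    return [[table[i] for i in row] for row in batch_ids]


table = make_embedding_table(num_embeddings=11, embedding_dim=5)
# One padded sentence of made-up ids; the trailing zeros are padding.
embedded = embed([[2, 8, 5, 3, 4, 0, 0]], table)
```

Note that padding positions are looked up like any other id, so they receive the vector stored in row 0.
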

# TextFieldEmbedders

A text field may have multiple indexed representations of its tokens, in which case it needs multiple corresponding `TokenEmbedder`s. Because of this we typically wrap the token embedders in a `TextFieldEmbedder`, which runs the appropriate token embedder for each representation and then concatenates the results.

In [None]:
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder

In [None]:
text_field_embedder = BasicTextFieldEmbedder({"tokens": embedding})

Notice that we now apply it to the full tensor dict for the text field.

In [None]:
text_field_embedder(tensor_dict['text'])
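
With only one indexer there is nothing to concatenate, so here is a small sketch of the two-indexer case. It assumes each token embedder has already produced one vector per token, and it joins those vectors along the last dimension. `concat_embeddings` is a hypothetical helper, and iterating over the keys in sorted order is a choice made here for determinism.

```python
def concat_embeddings(per_indexer_embeddings):
    """Concatenate [batch][seq][dim] embeddings from several indexers."""
    names = sorted(per_indexer_embeddings)
    first = per_indexer_embeddings[names[0]]
    return [
        [
            # Join this token's vectors from every indexer into one vector.
            [x for name in names for x in per_indexer_embeddings[name][b][t]]
            for t in range(len(first[b]))
        ]
        for b in range(len(first))
    ]


# Made-up values: 1 sentence, 2 tokens, word vectors of size 2, character vectors of size 3.
per_indexer = {
    "tokens": [[[1.0, 2.0], [3.0, 4.0]]],
    "token_characters": [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]],
}
combined = concat_embeddings(per_indexer)
```

The output dimension is the sum of the individual embedding dimensions (here 2 + 3 = 5), which is the kind of value `get_output_dim()` reports downstream.
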

# Seq2VecEncoders

At this point we've ended up with a sequence of tensors. Frequently we'll want to collapse that sequence into a single contextualized tensor representation, which we do with a `Seq2VecEncoder`. (If we wanted to produce a full sequence of contextualized representations, we'd instead use a `Seq2SeqEncoder`.)

In particular, here we'll use a `BagOfEmbeddingsEncoder`, which just sums up the vectors.

In [None]:
from allennlp.modules.seq2vec_encoders import BagOfEmbeddingsEncoder

In [None]:
encoder = BagOfEmbeddingsEncoder(embedding_dim=text_field_embedder.get_output_dim())

We can apply this to the output of our text field embedder to collapse each sequence down to a single element.

In [None]:
encoder(text_field_embedder(tensor_dict['text']))
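
Summing the vectors of a sequence is simple enough to write out by hand. The sketch below does so in plain Python with an optional mask. `bag_of_embeddings` is a hypothetical function that illustrates the idea and is not the library's code.

```python
def bag_of_embeddings(embedded, mask=None):
    """Sum the vectors of each sequence, skipping positions where mask is 0."""
    pooled = []
    for b, sequence in enumerate(embedded):
        total = [0.0] * len(sequence[0])
        for t, vector in enumerate(sequence):
            if mask is not None and not mask[b][t]:
                continue  # ignore padding positions
            for d, value in enumerate(vector):
                total[d] += value
        pooled.append(total)
    return pooled


# One made-up sequence of three 2-dimensional vectors; the last one stands for padding.
toy_embedded = [[[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]]
unmasked = bag_of_embeddings(toy_embedded)
masked = bag_of_embeddings(toy_embedded, mask=[[1, 1, 0]])
```

In this sketch, leaving out the mask means the padding vector is added to the sum as well, which is worth keeping in mind whenever sequences in a batch have different lengths.
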

# Using PyTorch directly

AllenNLP modules are just PyTorch modules, and we can mix and match them with native PyTorch features. Here we create a `torch.nn.Linear` module and apply it to the output of the `Seq2VecEncoder`.

In [None]:
import torch

In [None]:
linear = torch.nn.Linear(in_features=encoder.get_output_dim(), out_features=3)

In [None]:
linear(encoder(text_field_embedder(tensor_dict['text'])))

# Models

We typically encapsulate most of these steps in an AllenNLP `Model`.

In [None]:
from allennlp.models import Model
from allennlp.modules.text_field_embedders import TextFieldEmbedder
from allennlp.modules.seq2vec_encoders import Seq2VecEncoder
from typing import Dict

Here is a model that accepts the output of `batch.as_tensor_dict`, and applies the text field embedder, the seq2vec encoder, and the linear layer.

In [None]:
class MyModel(Model):
    def __init__(self, 
                 vocab: Vocabulary, 
                 embedder: TextFieldEmbedder, 
                 encoder: Seq2VecEncoder, 
                 output_dim: int) -> None:
        super().__init__(vocab)
        self.embedder = embedder
        self.encoder = encoder
        # The linear layer consumes the encoder's output, so size it from the encoder.
        self.linear = torch.nn.Linear(in_features=encoder.get_output_dim(), out_features=output_dim)
        
    def forward(self, text: Dict[str, torch.Tensor], speaker: torch.Tensor) -> Dict[str, torch.Tensor]:
        """
        Notice how the argument names correspond to the field names in our instance.
        """
        embedded = self.embedder(text)
        encoded = self.encoder(embedded)
        output = self.linear(encoded)
        
        return {"output": output}
        

In [None]:
model = MyModel(vocab, text_field_embedder, encoder, 3)

In [None]:
model(**tensor_dict)
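
The `**tensor_dict` call works because the dictionary keys are the field names, and Python matches them to the parameters of `forward` by name. The dependency-free sketch below shows that mechanism. `toy_forward` and `toy_tensor_dict` are hypothetical stand-ins, with nested lists in place of tensors.

```python
from typing import Dict, List, Optional


def toy_forward(text: Dict[str, List[List[int]]],
                speaker: Optional[List[int]] = None) -> Dict[str, object]:
    # Stand-in for Model.forward: it only reports what it received.
    return {"indexers": sorted(text), "has_labels": speaker is not None}


# Keys mirror the field names used when building the instances.
toy_tensor_dict = {"text": {"tokens": [[2, 8, 5, 3, 4]]}, "speaker": [0]}
result = toy_forward(**toy_tensor_dict)

# A field name with no matching parameter is rejected by Python itself.
try:
    toy_forward(**{"text": {"tokens": [[2]]}, "category": [0]})
    mismatch_error = None
except TypeError as error:
    mismatch_error = error
```

This is why renaming a field in an `Instance` (for example `speaker` to `category`) requires the same rename in the signature of `forward`.
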