
### Vocabularies in AllenNLP

This tutorial assumes you've already gone through the tutorial on `Datasets`, `Instances` and `Fields`. If you haven't, you might want to check that one out first, as we use some of these constructs to explain the `Vocabulary` functionality.

A `Vocabulary` maps strings to integers. Strings that are not in the vocabulary are mapped to a single out-of-vocabulary token.

Vocabularies can be fit to a particular dataset, which we use to decide which tokens are
 in-vocabulary, or alternatively, they can be loaded directly from a static vocabulary file.
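
To make the idea concrete before touching the library, here is a minimal sketch of a string-to-integer mapping with an out-of-vocabulary fallback. It uses plain Python dictionaries and is not how AllenNLP implements `Vocabulary`; the names `OOV_TOKEN`, `token_to_index`, `index_to_token` and `lookup` are made up for this illustration.

```python
# A toy string-to-integer mapping; NOT AllenNLP's implementation.
OOV_TOKEN = "<unk>"  # hypothetical placeholder name used only in this sketch

token_to_index = {OOV_TOKEN: 0, "Paul": 1, "Allen": 2}
# The reverse mapping lets us turn integers back into strings.
index_to_token = {index: token for token, index in token_to_index.items()}


def lookup(token: str) -> int:
    """Return the token's index, falling back to the OOV index if it is unseen."""
    return token_to_index.get(token, token_to_index[OOV_TOKEN])


print(lookup("Paul"))     # 1
print(lookup("Seattle"))  # 0, because "Seattle" is not in the mapping
```

Every unseen string collapses onto the same index, so the model always receives a valid integer.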


First, let's import the vocabulary class from `allennlp` and create a vocabulary.


In [2]:
# This cell just makes sure the library paths are correct. 
# You need to run this cell before you run the rest of this
# tutorial, but you can ignore the contents!
import os
import sys
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [3]:
from allennlp.data import Vocabulary


Let's create an empty `Vocabulary` so we can look at the arguments it takes.


In [4]:
vocab = Vocabulary(counter=None, min_count=1, max_vocab_size=100000, non_padded_namespaces=None)


The vocabulary takes 4 arguments: 

- A counter, which is a `Dict[str, Dict[str, int]]`: This is a nested dictionary because the AllenNLP `Vocabulary` class supports the idea of "namespaces". A namespace is a vocabulary associated with one part of your data. For instance, in a sequence tagging model, you would typically have two namespaces: a namespace of words for your textual input and a namespace of tags (e.g. "NP", "VP", etc.) for your labels. The counter is therefore a mapping from namespace names to dictionaries that map each token to its count.


- A minimum count: Tokens with smaller counts than this won't be included in your `Vocabulary`.


- A maximum vocab size: If the vocabulary would be larger than this, the lowest-frequency words are dropped until it fits.


- Non-padded namespaces: For some namespaces, such as words, we provide additional tokens commonly used in NLP applications, specifically "@@PADDING@@" and "@@UNKNOWN@@". Why did we use these weird tokens, we hear you ask? Well, if anything goes wrong in your model, it's going to be pretty obvious, because these tokens are pretty hard to miss. For other namespaces, such as tags, you _don't_ want these extra tokens: your model will produce a distribution over every entry in the namespace, so if we added extra tags, your model could predict them. Naturally, we don't want this to happen, so we provide a reasonable default: any vocabulary namespace ending with `tags` or `labels` won't have these extra tokens.
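
The way these four arguments interact can be sketched in plain Python. This is an illustration under stated assumptions, not AllenNLP's code: `build_index` and `is_padded` are hypothetical helpers, the suffix tuple is chosen to match the `tags` namespace used in this tutorial, ties in frequency are broken alphabetically, and the size cap is applied to counted tokens only (the library may handle these details differently).

```python
from typing import Dict, Sequence

# Token strings as written in the description above.
PADDING = "@@PADDING@@"
UNKNOWN = "@@UNKNOWN@@"


def is_padded(namespace: str, non_padded_suffixes: Sequence[str]) -> bool:
    """A namespace gets the extra tokens unless its name ends with a listed suffix."""
    return not any(namespace.endswith(suffix) for suffix in non_padded_suffixes)


def build_index(counter: Dict[str, Dict[str, int]],
                min_count: int = 1,
                max_vocab_size: int = 100000,
                non_padded_suffixes: Sequence[str] = ("tags", "labels"),  # assumption
                ) -> Dict[str, Dict[str, int]]:
    """Turn per-namespace token counts into per-namespace token -> index maps."""
    result = {}
    for namespace, counts in counter.items():
        tokens = [PADDING, UNKNOWN] if is_padded(namespace, non_padded_suffixes) else []
        # Most frequent first; ties broken alphabetically so the result is deterministic.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        kept = [token for token, count in ranked if count >= min_count]
        # Assumption: the cap applies to counted tokens, not the two extra tokens.
        tokens.extend(kept[:max_vocab_size])
        result[namespace] = {token: index for index, token in enumerate(tokens)}
    return result


toy_counter = {
    "tokens": {"Paul": 2, "Allen": 2, "guy": 1},
    "tags": {"PERSON": 2, "O": 5},
}
toy_vocab = build_index(toy_counter, min_count=2)
print(toy_vocab["tokens"])  # {'@@PADDING@@': 0, '@@UNKNOWN@@': 1, 'Allen': 2, 'Paul': 3}
print(toy_vocab["tags"])    # {'O': 0, 'PERSON': 1}
```

With `min_count=2`, "guy" is dropped from the `tokens` namespace, while the `tags` namespace receives neither of the two extra tokens.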



It's easy to interact with the vocabulary we just created. Let's add some words!


In [5]:
vocab.add_token_to_namespace("Paul", namespace="tokens")
vocab.add_token_to_namespace("Allen", namespace="tokens")

vocab.add_token_to_namespace("PERSON", namespace="tags")
vocab.add_token_to_namespace("PLACE", namespace="tags")
print(vocab.get_index_to_token_vocabulary("tokens"))
print(vocab.get_index_to_token_vocabulary("tags"))

{0: '@@PADDING@@', 1: '@@UNKOWN@@', 2: 'Paul', 3: 'Allen'}
{0: 'PERSON', 1: 'PLACE'}


Notice that when we print the namespace for `tags` we don't have any padding tokens or unknown tokens.

Above, we demonstrated the basic functionality of the namespaces in the Vocabulary. However, we'd ideally like to 
generate a full `Vocabulary` without having to individually add all the different words. Below, we'll generate a `Dataset` consisting of a single `Instance` and use it to automatically generate a `Vocabulary`. 


In [6]:
from allennlp.data.fields import TextField, TagField
from allennlp.data import Dataset, Instance
from allennlp.data.token_indexers import SingleIdTokenIndexer

# The sentence goes into a TextField; SingleIdTokenIndexer represents each token as a single integer id.
sentence = TextField(["Paul", "Allen", "is", "a", "great", "guy", "."], token_indexers=[SingleIdTokenIndexer()])
# One tag per token in the sentence, stored in the "tags" namespace.
tags = TagField(["PERSON", "PERSON", "O", "O", "O", "O", "O"], sentence, tag_namespace="tags")
# A Dataset holding a single Instance.
toy_dataset = Dataset([Instance({"sentence": sentence, "tags": tags})])


Now that we have this toy dataset with one training instance, we can build a `Vocabulary` from it using the `from_dataset` classmethod on `Vocabulary`.

In [9]:
vocab = Vocabulary.from_dataset(toy_dataset)
print(vocab.get_index_to_token_vocabulary("tokens"))
print(vocab.get_index_to_token_vocabulary("tags"))

100%|██████████| 1/1 [00:00<00:00, 5184.55it/s]

{0: '@@PADDING@@', 1: '@@UNKOWN@@', 2: 'a', 3: 'great', 4: 'Allen', 5: '.', 6: 'Paul', 7: 'guy', 8: 'is'}
{0: 'O', 1: 'PERSON'}
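
Building a vocabulary from data requires counting tokens per namespace first, which yields the nested `Dict[str, Dict[str, int]]` counter described earlier. The sketch below shows that counting step with the standard library only. It is an assumption about the general approach rather than the actual body of `from_dataset`; plain dictionaries stand in for `Instance` objects, and `toy_instances` and `namespace_counts` are names invented for this example.

```python
from collections import Counter, defaultdict

# Each toy instance maps a namespace name to the strings that belong to it.
# Plain dicts stand in for Instance objects here (an assumption of this sketch).
toy_instances = [
    {
        "tokens": ["Paul", "Allen", "is", "a", "great", "guy", "."],
        "tags": ["PERSON", "PERSON", "O", "O", "O", "O", "O"],
    },
]

# namespace name -> Counter of how often each string occurs in that namespace
namespace_counts = defaultdict(Counter)
for instance in toy_instances:
    for namespace, strings in instance.items():
        namespace_counts[namespace].update(strings)

print(dict(namespace_counts["tags"]))  # {'PERSON': 2, 'O': 5}
```

Once these counts exist, the minimum count and maximum size can be applied to decide which tokens receive an index.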



