
## Vocabularies in AllenNLP
A `Vocabulary` maps strings to integers. Strings that are not in the vocabulary
 are mapped to a single out-of-vocabulary token.

Vocabularies can be fit to a particular dataset, in which case the token counts in that dataset
 decide which tokens are in-vocabulary, or they can be loaded directly from a static vocabulary file.
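
To make the idea concrete, here is a minimal pure-Python sketch of fitting a string-to-integer mapping to a toy dataset and falling back to an out-of-vocabulary index at lookup time. It does not use `allennlp`; the names `fit_vocab`, `lookup` and the `<unk>` placeholder are made up for this illustration.

```python
from collections import Counter

OOV_TOKEN = "<unk>"  # hypothetical placeholder for strings we have not seen


def fit_vocab(sentences, min_count=1):
    """Build a token -> index mapping from whitespace-tokenized sentences."""
    counts = Counter(tok for sentence in sentences for tok in sentence.split())
    token_to_index = {OOV_TOKEN: 0}
    # Frequent tokens get the smallest indices; ties are broken alphabetically.
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_count:
            token_to_index[tok] = len(token_to_index)
    return token_to_index


def lookup(token_to_index, token):
    """Return the token's index, or the out-of-vocabulary index if it is unknown."""
    return token_to_index.get(token, token_to_index[OOV_TOKEN])


toy_vocab = fit_vocab(["the cat sat", "the dog sat", "the end"], min_count=2)
known = lookup(toy_vocab, "sat")    # "sat" appears twice, so it is in-vocabulary
unknown = lookup(toy_vocab, "cat")  # "cat" appears once, so it falls back to <unk>
```

Only "the" and "sat" reach the count threshold here, so every other string shares the index of the out-of-vocabulary token.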


First, let's import the vocabulary class from `allennlp` and create a vocabulary.


In [8]:
import os
import sys

# Make the parent directory importable (e.g. a local checkout of allennlp).
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)


In [None]:
from allennlp.data import Vocabulary
from allennlp.data import Dataset

Let's create an empty `Vocabulary` so we can look at the arguments it takes.

In [None]:
vocab = Vocabulary(counter=None, min_count=1, max_vocab_size=100000, non_padded_namespaces=None)

The vocabulary takes four arguments:

- A counter, which is a `Dict[str, Dict[str, int]]`: This is a nested dictionary because the AllenNLP `Vocabulary` class supports the idea of "namespaces". A namespace is a vocabulary associated with one part of your data. For instance, in a sequence tagging model, you would typically have two namespaces: a namespace of words for your textual input and a namespace of tags (e.g. "NP", "VP", etc.) for your labels. The counter is therefore a mapping from namespace strings to dictionaries that map tokens to counts.


- A minimum count: Tokens with counts lower than this won't be included in your `Vocabulary`.


- A maximum vocab size: If there are more tokens than this, the lowest-frequency tokens will be dropped to bring your vocabulary down to this size.


- Non-padded namespaces: For some namespaces, such as words, we provide additional tokens commonly used in NLP applications, specifically "@@PADDING@@" and "@@UNKNOWN@@". Why did we use these weird tokens, we hear you ask? Well, if anything goes wrong in your model, it's going to be pretty obvious, because these tokens are hard to miss. However, for other namespaces, such as tags, you _don't_ want these extra tokens: your model will produce a distribution over the size of this namespace, so any extra entries would become labels your model could predict. Naturally, we don't want this to happen, so we provide some reasonable defaults: any vocabulary namespace ending with `tags` or `labels` won't have these extra tokens by default.
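
The way these arguments interact can be sketched in plain Python. This is only an illustration of the behaviour described in the list, not the library's implementation: `build_index` is a hypothetical helper, the size limit is assumed to apply to each namespace separately, and the special-token strings are constants you can change.

```python
PADDING = "@@PADDING@@"  # assumed spelling of the padding token
UNKNOWN = "@@UNKNOWN@@"  # assumed spelling of the out-of-vocabulary token


def build_index(counter, min_count=1, max_vocab_size=None,
                non_padded_suffixes=("tags", "labels")):
    """Turn {namespace: {token: count}} into {namespace: {token: index}}."""
    vocab = {}
    for namespace, counts in counter.items():
        # Namespaces ending in one of the suffixes get no special tokens.
        padded = not namespace.endswith(tuple(non_padded_suffixes))
        token_to_index = {PADDING: 0, UNKNOWN: 1} if padded else {}
        # Rank by descending count so truncation drops the rarest tokens.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        if max_vocab_size is not None:
            ranked = ranked[:max_vocab_size]
        for token, count in ranked:
            if count >= min_count:
                token_to_index[token] = len(token_to_index)
        vocab[namespace] = token_to_index
    return vocab


toy_counter = {
    "tokens": {"the": 5, "cat": 2, "sat": 2, "zyzzyva": 1},
    "tags": {"NP": 4, "VP": 3},
}
toy_index = build_index(toy_counter, min_count=2, max_vocab_size=3)
```

In this toy run the rare word "zyzzyva" is dropped, the "tokens" namespace starts with the two special tokens, and the "tags" namespace contains only "NP" and "VP", so a tagger's output distribution would have exactly two entries.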


It's easy to interact with the vocabulary we just created. Let's add some words!

In [None]:
vocab.add_token_to_namespace("Paul", namespace="tokens")
vocab.add_token_to_namespace("Allen", namespace="tokens")

vocab.add_token_to_namespace("PERSON", namespace="tags")
vocab.add_token_to_namespace("PLACE", namespace="tags")
print(vocab.get_index_to_token_vocabulary("tokens"))
print(vocab.get_index_to_token_vocabulary("tags"))
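
If you want to reason about what incremental additions do without installing `allennlp`, the stand-in below mimics the two method names used in the cell above. It is a rough sketch under assumptions: `TinyNamespaces` is hypothetical, the special-token strings are assumed, and the real class may differ in details such as index order.

```python
PADDING = "@@PADDING@@"  # assumed spelling of the padding token
UNKNOWN = "@@UNKNOWN@@"  # assumed spelling of the out-of-vocabulary token


class TinyNamespaces:
    """Hypothetical stand-in that keeps one token -> index map per namespace."""

    def __init__(self, non_padded_suffixes=("tags", "labels")):
        self._suffixes = tuple(non_padded_suffixes)
        self._maps = {}

    def _token_to_index(self, namespace):
        # Create the namespace lazily, seeding padded ones with special tokens.
        if namespace not in self._maps:
            padded = not namespace.endswith(self._suffixes)
            self._maps[namespace] = {PADDING: 0, UNKNOWN: 1} if padded else {}
        return self._maps[namespace]

    def add_token_to_namespace(self, token, namespace="tokens"):
        """Add the token if it is new and return its index either way."""
        mapping = self._token_to_index(namespace)
        if token not in mapping:
            mapping[token] = len(mapping)
        return mapping[token]

    def get_index_to_token_vocabulary(self, namespace="tokens"):
        """Return the reverse mapping, index -> token."""
        return {i: tok for tok, i in self._token_to_index(namespace).items()}


tiny = TinyNamespaces()
first = tiny.add_token_to_namespace("Paul", namespace="tokens")
again = tiny.add_token_to_namespace("Paul", namespace="tokens")  # no new entry
tiny.add_token_to_namespace("Allen", namespace="tokens")
tiny.add_token_to_namespace("PERSON", namespace="tags")
tiny.add_token_to_namespace("PLACE", namespace="tags")
tokens_view = tiny.get_index_to_token_vocabulary("tokens")
tags_view = tiny.get_index_to_token_vocabulary("tags")
```

In this sketch, adding the same token twice returns the same index, and the first real word in a padded namespace lands at index 2 because indices 0 and 1 are taken by the special tokens.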