This notebook introduces how AllenNLP handles one of the key aspects of applying deep learning techniques to textual data: learning distributed representations of words and sentences.

Recently, there has been an explosion of techniques for representing words and sentences in NLP, including pre-trained word vectors, character-level CNN encodings and sub-word token representations (e.g. byte encodings). Going even further, learned representations of higher-level linguistic features, such as POS tags, named entities and dependency paths, have also proved successful.

In order to deal with this breadth of methods for representing words as vectors, AllenNLP introduces 3 key abstractions:

- `TokenIndexers`, which generate indexed tensors representing sentences in different ways. See the `data_pipeline` for more info. 

- `TokenEmbedders`, which transform indexed tensors into embedded representations. At its most basic, a `TokenEmbedder` is just a standard `Embedding` layer of the kind you'd find in any neural network library. However, they can be more complex: for instance, AllenNLP has a `token_characters_encoder` which applies a CNN to character-level representations.

- `TextFieldEmbedders`, which wrap a set of `TokenEmbedders`. At its most basic, a `TextFieldEmbedder` applies the `TokenEmbedders` it is given and concatenates their outputs.

Using this hierarchy allows you to easily compose different representations of a sentence in a modular way. For instance, in the BiDAF model, we use this to concatenate a character-level CNN encoding of each word in the sentence with its pretrained word embedding. You can also specify this completely from a JSON file, making experimentation with different representations extremely easy.
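
To make the three-level hierarchy concrete, here is a minimal, library-free sketch of the idea. It is not AllenNLP code: every function and variable name below is hypothetical, the embedding tables are random rather than learned, and a simple max-pool over character vectors stands in for the character-level CNN.

```python
# Toy sketch of the indexer -> embedder -> concatenation hierarchy.
# All names here are hypothetical; none of them are AllenNLP classes.
import random

random.seed(0)

def make_embedding_table(num_entries, dim):
    """Random lookup table standing in for a learned embedding matrix."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_entries)]

tokens = ["cool", "kids"]

# Step 1 ("token indexers"): two different indexed views of the same tokens.
word_vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
char_vocab = {ch: i for i, ch in enumerate(sorted(set("".join(tokens))))}
indexed = {
    "tokens": [word_vocab[tok] for tok in tokens],
    "characters": [[char_vocab[ch] for ch in tok] for tok in tokens],
}

# Step 2 ("token embedders"): one embedder per indexed view.
WORD_DIM, CHAR_DIM = 4, 3
word_table = make_embedding_table(len(word_vocab), WORD_DIM)
char_table = make_embedding_table(len(char_vocab), CHAR_DIM)

def embed_words(word_ids):
    # Plain embedding lookup: one vector per word id.
    return [word_table[i] for i in word_ids]

def embed_characters(char_ids_per_token):
    # Stand-in for a character CNN: max-pool the character vectors of each token.
    pooled = []
    for char_ids in char_ids_per_token:
        vectors = [char_table[i] for i in char_ids]
        pooled.append([max(column) for column in zip(*vectors)])
    return pooled

# Step 3 ("text field embedder"): apply every embedder, then concatenate per token.
def embed_text_field(indexed_views, embedders):
    outputs = [embedders[key](indexed_views[key]) for key in sorted(embedders)]
    return [sum(per_token, []) for per_token in zip(*outputs)]

embedded = embed_text_field(
    indexed, {"tokens": embed_words, "characters": embed_characters}
)
print(len(embedded), len(embedded[0]))
```

Each of the two tokens ends up with a single vector of size `CHAR_DIM + WORD_DIM`. Adding another representation in this toy setup only means adding one more key to both dictionaries, which is the kind of modularity the hierarchy is meant to provide.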
 

In [None]:
# This cell just makes sure the library paths are correct. 
# You need to run this cell before you run the rest of this
# tutorial, but you can ignore the contents!
import os
import sys
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [None]:
from allennlp.data.fields import TextField
from allennlp.data import Instance
from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenCharactersIndexer

sentence = TextField(["All", "the", "cool", "kids", "use", "character", "embeddings", "."],
                     token_indexers={"tokens": SingleIdTokenIndexer(),
                                     "characters": TokenCharactersIndexer()})
sentence_instance = Instance({"sentence": sentence})


Now we need to create a small vocabulary from our sentence. Note that because we have used both a `SingleIdTokenIndexer` and a `TokenCharactersIndexer`, the `Vocabulary` created by `Vocabulary.from_dataset` will have two namespaces, which correspond to the keys of the `token_indexers` dictionary in our `TextField`.
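
As a rough mental model of what a namespaced vocabulary holds, the following library-free sketch builds one index-to-token mapping per namespace from the same sentence. It is not how `Vocabulary` is implemented: the helper `build_index_to_token` is hypothetical, and reserving the first two indices for `<pad>` and `<unk>` placeholder entries is an assumption made for illustration.

```python
# Toy sketch of a vocabulary with one namespace per token indexer key.
# Hypothetical helper code, not AllenNLP's `Vocabulary` implementation.
from collections import Counter

tokens = ["All", "the", "cool", "kids", "use", "character", "embeddings", "."]

# Count what each indexer would see: whole tokens vs. individual characters.
counts = {
    "tokens": Counter(tokens),
    "characters": Counter(ch for tok in tokens for ch in tok),
}

def build_index_to_token(counter, reserved=("<pad>", "<unk>")):
    # Assumption: the first indices are reserved for padding and unknown entries,
    # followed by the observed items from most to least frequent.
    entries = list(reserved) + [item for item, _ in counter.most_common()]
    return dict(enumerate(entries))

toy_vocab = {
    namespace: build_index_to_token(counter) for namespace, counter in counts.items()
}

for namespace, index_to_token in toy_vocab.items():
    print(namespace, index_to_token)
```

The two namespaces are indexed independently, so the same integer can refer to a word in one namespace and to a character in the other.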

In [None]:
from allennlp.data import Vocabulary, Dataset

vocab = Vocabulary.from_dataset(Dataset([sentence_instance]))

print("This is the token vocabulary we created: \n")
print(vocab.get_index_to_token_vocabulary("tokens"))

print("This is the character vocabulary we created: \n")
print(vocab.get_index_to_token_vocabulary("characters"))