
AllenNLP uses 


At a high level, we use `DatasetReaders` to read a particular dataset into a `Dataset` of self-contained individual `Instances`, 
each of which is made up of a dictionary of named `Fields`. There are many types of `Fields`, each suited to a different type of data, such as `TextField` for sentences or `LabelField` for a categorical class label. Users who are familiar with the `torchtext` library from `PyTorch` will find a similar abstraction here. 



In [None]:
# This cell just makes sure the library paths are correct. 
# You need to run this cell before you run the rest of this
# tutorial, but you can ignore the contents!
import os
import sys
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)

Let's create two of the most common `Fields`, imagining we are preparing some data for a sentiment analysis model. 

In [None]:
from allennlp.data.fields import TextField, LabelField
from allennlp.data.token_indexers import SingleIdTokenIndexer

review = TextField(["This", "movie", "was", "awful", "!"], token_indexers={"tokens": SingleIdTokenIndexer()})
review_sentiment = LabelField("negative", label_namespace="tags")

# Access the original strings and labels using the methods on the Fields.
print(review.tokens)
print(review_sentiment.label)

Once we've made our `Fields`, we need to pair them together to form an `Instance`. 

In [None]:
from allennlp.data import Instance

instance1 = Instance({"review": review, "label": review_sentiment})

... and once we've made our `Instance`, we can group several of these into a `Dataset`.

In [None]:
from allennlp.data import Dataset
# Create another instance.
review2 = TextField(["This", "movie", "was", "quite", "slow", "but", "good", "."], token_indexers={"tokens": SingleIdTokenIndexer()})
review_sentiment2 = LabelField("positive", label_namespace="tags")
instance2 = Instance({"review": review2, "label": review_sentiment2})

review_dataset = Dataset([instance1, instance2])

To get our tiny sentiment analysis dataset ready for use in a model, we need to do a few things: 
- Create a vocabulary from the `Dataset` (using `Vocabulary.from_dataset`)
- Index the words and labels in the `Fields` to use the integer indices specified by the `Vocabulary`
- Pad the instances to the same length
- Convert them into arrays

The `Dataset`, `Instance` and `Fields` share some parts of their API. 
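
Before using the library for this, it may help to see the four steps listed above in plain Python. The block below is only a rough, library-independent sketch: the helpers `build_vocab`, `index_tokens` and `pad`, and the reserved tokens `<pad>` and `<unk>`, are hypothetical names chosen for illustration, not part of the AllenNLP API, and the library's own implementation may differ.

```python
from typing import Dict, List

# Hypothetical reserved tokens, chosen for this sketch only.
PAD, UNK = "<pad>", "<unk>"


def build_vocab(sentences: List[List[str]]) -> Dict[str, int]:
    """Step 1: map every distinct token to an integer index."""
    vocab = {PAD: 0, UNK: 1}
    for sentence in sentences:
        for token in sentence:
            # Assign the next free index the first time a token is seen.
            vocab.setdefault(token, len(vocab))
    return vocab


def index_tokens(sentence: List[str], vocab: Dict[str, int]) -> List[int]:
    """Step 2: replace each token with its index (unseen tokens map to UNK)."""
    return [vocab.get(token, vocab[UNK]) for token in sentence]


def pad(indices: List[int], length: int, pad_index: int = 0) -> List[int]:
    """Step 3: right-pad a list of indices to a fixed length."""
    return indices + [pad_index] * (length - len(indices))


sentences = [
    ["This", "movie", "was", "awful", "!"],
    ["This", "movie", "was", "quite", "slow", "but", "good", "."],
]

vocab = build_vocab(sentences)
indexed = [index_tokens(sentence, vocab) for sentence in sentences]
max_length = max(len(ids) for ids in indexed)

# Step 4: a rectangular list of lists, ready to be converted into an array.
batch = [pad(ids, max_length) for ids in indexed]
print(batch)
```

Every row of `batch` has the same length, so it can be handed directly to an array constructor. Reserving index 0 for padding is a design choice of this sketch; it keeps padded positions easy to mask out later.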

In [None]:
from allennlp.data import Vocabulary 

# This will automatically create a vocab for our dataset.
vocab = Vocabulary.from_dataset(review_dataset)
