# Natural Language Processing

## AllenNLP

AllenNLP is an open-source library for building deep learning models for natural language processing, developed by the Allen Institute for Artificial Intelligence. It is built on top of PyTorch and is designed to support researchers, engineers, and students who wish to build high-quality deep NLP models with ease. It provides high-level abstractions and APIs for common components and models in modern NLP, along with an extensible framework that makes it easy to run and manage NLP experiments.

In a nutshell, AllenNLP is

- a library with well-thought-out abstractions encapsulating the common data and model operations that are done in NLP research
- a command-line tool for training PyTorch models
- a collection of pre-trained models that you can use to make predictions
- a collection of readable reference implementations of common / recent NLP models
- an experiment framework for doing replicable science
- a way to demo your research
- open source and community driven

Part 1 is geared towards someone who is brand new to the library. It gives a quick walk-through of the main AllenNLP concepts and features, and we'll build a complete, working NLP model (a text classifier) along the way.

## Text Classification


### Fields

The first step for building an NLP model is to define its input and output. In AllenNLP, each training example is represented by an `Instance` object. An `Instance` consists of one or more `Fields`, where each `Field` represents one piece of data used by your model, either as an input or an output. `Fields` will get converted to tensors and fed to your model.

For text classification, the input and the output are very simple. The model takes a `TextField` that represents the input text and predicts its label, which is represented by a `LabelField`.

Note that AllenNLP uses Python 3 **type hints**, which annotate a name with its expected type after a colon.


In [1]:
#a bit about type hint

def some_func(text: str):
    print(text)
    
some_func("hello world")
some_func(3)  #won't error, because this is type hinting.  Mostly used by editors to check errors before running.

hello world
3
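
To see that hints are plain metadata rather than runtime checks, the short sketch below inspects a function's annotations with the standard `typing.get_type_hints` helper. The function name `shout` is made up for illustration.

```python
from typing import get_type_hints


def shout(text: str) -> str:
    """Return the argument as an upper-cased string."""
    return str(text).upper()


# Hints are stored on the function object; Python itself does not enforce them.
hints = get_type_hints(shout)
print(hints)

# Passing an int still works at runtime because nothing checks the hint.
result = shout(3)
print(result)
```

A static checker such as mypy, or an editor, would flag `shout(3)` before the code runs, which is where hints earn their keep.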


In [2]:
from allennlp.data.fields import LabelField,  TextField

# Inputs
text: TextField

# Outputs
label: LabelField

### Reading data

With the input and output defined, the next step is to read the dataset and represent it with some internal data structure.

AllenNLP reads data with `DatasetReaders`, whose job is to transform raw data files into `Instances` that match the input / output spec.

Here we assume the dataset uses a simple file format with one example per line: `[text] [TAB] [label]`. For example:

- I like this movie a lot! [TAB] positive

- This was a monstrous waste of time [TAB] negative

- AllenNLP is amazing [TAB] positive

- Why does this have to be so complicated? [TAB] negative

- This sentence expresses no sentiment [TAB] neutral
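
Before bringing in AllenNLP, it may help to see the file format handled in plain Python. The sketch below uses only the standard library; `parse_tsv_lines` and `sample_lines` are hypothetical names, and the sketch assumes every non-empty line contains exactly one tab.

```python
from typing import Iterable, Iterator, Tuple

# A few lines in the `[text] [TAB] [label]` format described above.
sample_lines = [
    "I like this movie a lot!\tpositive\n",
    "This was a monstrous waste of time\tnegative\n",
    "AllenNLP is amazing\tpositive\n",
]


def parse_tsv_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (text, label) pairs from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Assumption: exactly one tab per line, so this unpacks into two parts.
        text, label = line.split("\t")
        yield text, label


pairs = list(parse_tsv_lines(sample_lines))
for text, label in pairs:
    print(f"{label}: {text}")
```

A `DatasetReader` does the same splitting, but wraps each pair in an `Instance` instead of returning raw strings.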

You can implement your own `DatasetReader` by inheriting from the `DatasetReader` class. At minimum, you need to override the `_read()` method, which reads the input dataset and yields `Instances`.

In [None]:
from typing import Dict, Iterable, List

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import LabelField, TextField
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token, Tokenizer, SpacyTokenizer


@DatasetReader.register("classification-tsv")
class ClassificationTsvReader(DatasetReader):
    def __init__(self, max_tokens: int = None, **kwargs):
        super().__init__(**kwargs)
        self.tokenizer = SpacyTokenizer()
        self.token_indexers = {"tokens": SingleIdTokenIndexer()}
        self.max_tokens = max_tokens

    def text_to_instance(self, text: str, label: str = None) -> Instance:
        tokens = self.tokenizer.tokenize(text)
        if self.max_tokens:
            tokens = tokens[: self.max_tokens]
        text_field = TextField(tokens, self.token_indexers)
        fields = {"text": text_field}
        if label:
            fields["label"] = LabelField(label)
        return Instance(fields)

    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path, "r") as lines:
            for line in lines:
                text, sentiment = line.strip().split("\t")
                yield self.text_to_instance(text, sentiment)

This is a minimal `DatasetReader` that produces classification `Instances` when you call `reader.read(file)`. The reader takes each line in the input file, splits the text into words using a tokenizer (the `SpacyTokenizer` shown here relies on spaCy), and sets those words up to be represented as tensors of word ids, looked up in a vocabulary that AllenNLP constructs for you.
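
The idea of "a word id in a vocabulary" can be illustrated without AllenNLP. The following is a rough pure-Python stand-in, not the library's implementation: whitespace splitting replaces the spaCy tokenizer, and the `<pad>` / `<unk>` entries and the helper names `build_vocab` and `index_tokens` are assumptions made for this sketch.

```python
from typing import Dict, List

# Hypothetical special tokens for padding and out-of-vocabulary words.
PAD, UNK = "<pad>", "<unk>"


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign each distinct token the next free integer id."""
    vocab = {PAD: 0, UNK: 1}
    for text in texts:
        for token in text.split():  # whitespace split stands in for a real tokenizer
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def index_tokens(text: str, vocab: Dict[str, int]) -> List[int]:
    """Map tokens to ids, falling back to the unknown id for unseen tokens."""
    return [vocab.get(token, vocab[UNK]) for token in text.split()]


vocab = build_vocab(["AllenNLP is amazing", "This is fine"])
ids = index_tokens("AllenNLP is great", vocab)
print(vocab)
print(ids)  # "great" was never seen, so it maps to the unknown id
```

In AllenNLP this work is split between the token indexer and the vocabulary, and the resulting ids are what end up in the tensors passed to the model.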

Pay special attention to the `text` and `label` keys used in the fields dictionary passed to the `Instance`: these keys will be used as parameter names when passing tensors into your `Model` later.

Ideally, the output label would be **optional** when we create the `Instances`, so that we can use the same code to make **predictions on unlabeled data (say, in a demo)**.
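
Both points, keys acting as parameter names and the label being optional, can be mimicked with ordinary keyword-argument unpacking. This is only an analogy under stated assumptions: the `forward` function below is a hypothetical stand-in for a model, and the lists of integers stand in for tensors.

```python
from typing import Dict, List, Optional


def forward(text: List[int], label: Optional[int] = None) -> Dict[str, object]:
    """Hypothetical stand-in for a model; parameter names match the field keys."""
    output: Dict[str, object] = {"received_text": text}
    if label is not None:
        # Only labeled examples carry a label, e.g. during training.
        output["received_label"] = label
    return output


# Keys of the dictionary line up with the parameter names of forward().
training_fields = {"text": [2, 3, 4], "label": 1}
prediction_fields = {"text": [2, 3, 4]}  # no label at prediction time

train_out = forward(**training_fields)
pred_out = forward(**prediction_fields)
print(train_out)
print(pred_out)
```

Renaming a key in the fields dictionary without renaming the matching parameter would raise a `TypeError` here, which is why the key names deserve attention.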

There are many places where this reader could be made more flexible and fully featured, but let's keep it simple for now.