# Natural Language Processing

## NER using AllenNLP

The problem statement of NER is as follows:

Given 

    Bill Gates and Paul Allen, founders of Microsoft, started selling software in 1975 in New Mexico.

We want to get

    [Bill Gates PER] and [Paul Allen PER], founders of [Microsoft ORG], started selling software in [1975 DATE] in [New Mexico LOC].

There are two ways to frame this problem:

- First, as a segmentation task, where we find and classify the segments that match entities and assign a NULL or O label to everything in between. The label space is then {PER, ORG, DATE, LOC, O}.
- Second, as a token-level tagging task. This one requires a bit more thought, because it is not obvious how to tell where one entity ends and the next begins when each token only carries a type label. With a slightly modified label space, though, the entities can be reconstructed from the per-token tags.

To do this, each entity type (e.g. PER, LOC) gets split into two labels: B-PER, denoting “this is a new person entity” and I-PER, denoting, “I’m continuing the previous person entity”. On the above sentence, every token would be tagged like so:

    [Bill B-PER] [Gates I-PER] and [Paul B-PER] [Allen I-PER], founders of [Microsoft B-ORG], started selling software in [1975 B-DATE] in [New B-LOC] [Mexico I-LOC].

For brevity’s sake, I left out all the [and O] tags, but you can imagine that all the rest of the words in the sentence are assigned that null tag.
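
To make the "reconstruct the entities" step concrete, here is a minimal sketch that turns a sequence of B-/I-/O tags back into entity spans. The function name `bio_to_spans` and the shortened example sentence are assumptions made for illustration; they do not come from AllenNLP.

```python
from typing import List, Tuple

def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collect (label, start, end) spans from BIO tags; `end` is inclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an I- tag of the same type.
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i - 1))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # A stray I- tag with no matching open span is ignored in this sketch.
    if label is not None:
        spans.append((label, start, len(tags) - 1))
    return spans

tokens = ["Bill", "Gates", "and", "Paul", "Allen", "founded", "Microsoft"]
tags = ["B-PER", "I-PER", "O", "B-PER", "I-PER", "O", "B-ORG"]
spans = bio_to_spans(tags)
entities = [(" ".join(tokens[s:e + 1]), label) for label, s, e in spans]
print(entities)
```

Because every entity starts with its own B- tag, two adjacent entities of the same type are never merged by accident.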


### 1. Loading the CoNLL 2003 dataset

Let’s take a look at an example from the CoNLL’03 dataset and see whether it conforms to the specification we laid down above:

    Essex NNP I-NP I-ORG
    , , O O
    however RB I-ADVP O
    , , O O
    look VB I-VP O
    certain JJ I-ADJP O
    to TO I-VP O
    regain VB I-VP O
    their PRP$ I-NP O
    .
    .

The CoNLL-2003 shared task concerns language-independent named entity recognition. Its data files contain four columns separated by a single space. Each word is on a separate line, and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag, and the fourth the named entity tag. For NER, we only care about the first and last items.

The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other, the first word of the second phrase will have tag B-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase.
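
Note that this tagging convention differs slightly from the B-/I- scheme described earlier: here an entity usually starts with an I- tag, and B- only appears between adjacent entities of the same type. As a rough sketch (the helper name `iob1_to_bio` is made up for this example), the file's tags can be rewritten so that every entity starts with an explicit B- tag:

```python
from typing import List

def iob1_to_bio(tags: List[str]) -> List[str]:
    """Rewrite CoNLL-style tags so that every entity starts with a B- tag."""
    converted = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            # An I- tag opens a new entity unless the previous tag belongs
            # to an entity of the same type.
            same_type = previous != "O" and previous[2:] == tag[2:]
            converted.append(tag if same_type else "B-" + tag[2:])
        else:
            converted.append(tag)  # "O" and existing B- tags are kept as-is
        previous = tag
    return converted

iob1 = ["I-ORG", "O", "I-PER", "I-PER", "B-PER", "I-LOC"]
print(iob1_to_bio(iob1))
# ['B-ORG', 'O', 'B-PER', 'I-PER', 'B-PER', 'B-LOC']
```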

To download the dataset into a `data/` directory (the location that the training configuration in section 4 points to), run in your terminal:

    mkdir -p data
    curl -o data/train.txt https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.train
    curl -o data/validation.txt https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testa
    curl -o data/test.txt https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testb

### 2. Dataset Reader

The first thing we’re going to do is build a dataset reader that can consume the CoNLL’03 dataset. To get started, let’s create the directory structure for this project:

    mkdir -p conn_ner/readers
    touch conn_ner/__init__.py
    touch conn_ner/readers/__init__.py
    touch conn_ner/readers/conll_reader.py

Put the following code in `conll_reader.py`:

In [None]:
import itertools
from typing import Dict, List, Iterator
from allennlp.data.tokenizers import Token
from allennlp.data import DatasetReader, Instance
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.fields import Field, LabelField, TextField, SequenceLabelField

@DatasetReader.register("conll_03_reader")
class CoNLL03DatasetReader(DatasetReader):    
    def __init__(self,
                 token_indexers: Dict[str, TokenIndexer] = None,
                 **kwargs,
                 ) -> None:
        super().__init__(**kwargs)
        self._token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()}
    
    def _read(self, file_path: str) -> Iterator[Instance]:
        is_divider = lambda line: line.strip() == ''

        with open(file_path, 'r') as conll_file:
            # group consecutive lines into sentences, using blank lines as dividers
            for divider, lines in itertools.groupby(conll_file, is_divider):
                if not divider:
                    # one row per token, e.g. [['EU', 'NNP', 'I-NP', 'I-ORG'], ['rejects', 'VBZ', 'I-VP', 'O'], ...]
                    fields = [l.strip().split() for l in lines]
                    # transpose so that each field is a tuple with one entry per token,
                    # e.g. [('EU', 'rejects', ...), ('NNP', 'VBZ', ...), ...]
                    fields = [l for l in zip(*fields)]
                    # only keep the tokens and NER labels
                    tokens, _, _, ner_tags = fields

                    yield self.text_to_instance(tokens, ner_tags)
                    
    def text_to_instance(self,
                         words: List[str],
                         ner_tags: List[str]) -> Instance:
        fields: Dict[str, Field] = {}
        # wrap each token in the file with a token object
        tokens = TextField([Token(w) for w in words], self._token_indexers)

        # An Instance in AllenNLP is built from a Python dictionary
        # that maps each field name to a Field object
        fields["tokens"] = tokens
        fields["label"] = SequenceLabelField(ner_tags, tokens)

        return Instance(fields)

`@DatasetReader.register(...)` uses one of AllenNLP’s core features, `Registrable`. Registering a class under a name lets us configure our experiments with JSON even though all the code is written in Python.

When you write a new `Model`, a new `DatasetReader`, a new `Metric`, or pretty much anything else, you’ll want to register it so it’s visible to your configuration file. 
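
To get a feel for what registration buys us, here is a deliberately simplified registry. It is not AllenNLP’s implementation: `ToyRegistrable`, `ToyReader`, and the `toy_conll_reader` name are all invented for this sketch. It only shows the idea of looking up a class by the `type` string found in a configuration.

```python
from typing import Callable, Dict, Type

class ToyRegistrable:
    """Simplified stand-in for a registry base class (not AllenNLP's code)."""
    _registry: Dict[str, Type["ToyRegistrable"]] = {}

    @classmethod
    def register(cls, name: str) -> Callable:
        def decorator(subclass):
            # Remember which class belongs to this name.
            cls._registry[name] = subclass
            return subclass
        return decorator

    @classmethod
    def by_name(cls, name: str) -> Type["ToyRegistrable"]:
        return cls._registry[name]

@ToyRegistrable.register("toy_conll_reader")
class ToyReader(ToyRegistrable):
    def __init__(self, lowercase: bool = False) -> None:
        self.lowercase = lowercase

# A configuration boils down to a type name plus constructor arguments.
config = {"type": "toy_conll_reader", "lowercase": True}
params = dict(config)
reader_class = ToyRegistrable.by_name(params.pop("type"))
reader = reader_class(**params)
print(type(reader).__name__, reader.lowercase)
```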

Every class that inherits from DatasetReader **should override these 3 functions**:

    __init__(self, ...) -> None
    _read(self, file_path: str) -> Iterator[Instance]
    text_to_instance(self, ...) -> Instance

Any argument in `__init__()` will be visible to the JSON configuration later on, so if you have parameters in the dataset reader you want to change in between experiments, you’ll put them there. The `token_indexers` will help AllenNLP map tokens to integers to keep track of them in the future.
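
As a rough picture of what "mapping tokens to integers" means, the sketch below builds a tiny vocabulary by hand. It is an assumption-laden toy: in AllenNLP the vocabulary is built for you, and the helper names and the padding/unknown strings here are only illustrative.

```python
from typing import Dict, List

# Placeholder entries for padding and out-of-vocabulary tokens (illustrative).
PADDING, UNKNOWN = "@@PADDING@@", "@@UNKNOWN@@"

def build_vocab(sentences: List[List[str]]) -> Dict[str, int]:
    vocab = {PADDING: 0, UNKNOWN: 1}
    for sentence in sentences:
        for token in sentence:
            # Assign the next free id the first time a (lowercased) token is seen.
            vocab.setdefault(token.lower(), len(vocab))
    return vocab

def index_tokens(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    # Tokens that were never seen fall back to the UNKNOWN id.
    return [vocab.get(token.lower(), vocab[UNKNOWN]) for token in tokens]

vocab = build_vocab([["EU", "rejects", "German", "call"]])
print(index_tokens(["German", "call", "rejected"], vocab))
# [4, 5, 1]
```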

The next thing we need to define is the `_read()` function. In pretty much every case it takes only a `file_path: str` argument. Its purpose is to take a single file containing the dataset and turn it into a sequence of `Instance` objects, which our implementation yields one at a time.
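
The grouping logic inside `_read()` can be tried out without AllenNLP. The sketch below runs the same `itertools.groupby` idea on an in-memory sample; the sample rows are illustrative lines in the four-column format, and `io.StringIO` stands in for the opened file.

```python
import io
import itertools

# Two short sentences in the four-column format, separated by a blank line.
sample = (
    "EU NNP I-NP I-ORG\n"
    "rejects VBZ I-VP O\n"
    "German JJ I-NP I-MISC\n"
    "\n"
    "Peter NNP I-NP I-PER\n"
    "Blackburn NNP I-NP I-PER\n"
)

def is_divider(line: str) -> bool:
    return line.strip() == ""

sentences = []
for divider, lines in itertools.groupby(io.StringIO(sample), is_divider):
    if divider:
        continue  # skip the blank lines between sentences
    rows = [line.split() for line in lines]
    # Transpose rows into columns and keep only the words and NER tags.
    tokens, _pos, _chunk, ner_tags = zip(*rows)
    sentences.append((list(tokens), list(ner_tags)))

for tokens, ner_tags in sentences:
    print(tokens, ner_tags)
```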

The last step is to write `text_to_instance`, which is mostly self-explanatory. Note that AllenNLP provides `SequenceLabelField` for per-token labels, which are common in NLP tasks such as POS tagging, coreference resolution, and NER.

#### Testing the Dataset Reader

Let's test our dataset reader:

In [7]:
reader = CoNLL03DatasetReader()
text = ['ACL', '2024', 'is', 'in', 'Thailand']
label = ['I-ORG', 'O', 'O', 'O', 'I-LOC']
instance = reader.text_to_instance(text, label)
instance

<allennlp.data.instance.Instance at 0x1503a3e50>

In [10]:
#get the label
instance.fields['label'][:]

['I-ORG', 'O', 'O', 'O', 'I-LOC']

In [12]:
#get the token
instance.fields['tokens'][:]

[ACL, 2024, is, in, Thailand]

### 3. Model

A recurrent network (an RNN, LSTM, or GRU) is a reasonable baseline. At every timestep, the network takes in a token and outputs a prediction.
Conveniently, building a sequence-tagging LSTM in AllenNLP is reasonably straightforward. It’s also easy to swap between LSTMs, GRUs, RNNs, BiLSTMs, and so on without ever touching the model code. We’ll see how that’s done in this section.

To prep, do this:

    mkdir conn_ner/my_models
    touch conn_ner/my_models/__init__.py
    touch conn_ner/my_models/lstm.py

#### Making the model

To complete the model, we’ll have to fill in 3 functions of our own:

- `__init__(self, ...) -> None` - the initialization function, which takes all the configurable submodules as arguments
- `forward(self, ...) -> Dict[str, torch.Tensor]` - the forward function, which defines a single forward-pass for our model. Note that the output is a Python dict. We’ll get to why this is later in the section.
- `get_metrics(self, reset: bool = False) -> Dict[str, float]` - this method works with the way AllenNLP defines training metrics, and returns a dictionary where each key is the name of a metric you want to track. For instance, you could track recall, precision, f1 as different metrics.

Our `__init__` function will always take a `Vocabulary` object, because we need to call `super()` with it. However, the rest of the arguments are defined by (a) the model architecture, and (b) what we want to expose to the configuration file.

In this case, we want an `LSTM` or `GRU` to do the actual sequence tagging. AllenNLP has a `Seq2SeqEncoder` abstraction that generalizes over recurrent encoders like these, so our model takes a `Seq2SeqEncoder`.

We also want to use word embeddings, such as GloVe or word2vec. For this we can use AllenNLP’s `TextFieldEmbedder`.

To classify the LSTM outputs, we need some kind of linear transformation. There are two options: a configurable `FeedForward` module, or an `nn.Linear` module that we define ourselves. Because we don’t expect this to change, let’s use an `nn.Linear` for now.

Finally, we want a metric other than the loss to track training progress. The standard metric for this task is entity-level (span-based) F1, which is what the CoNLL evaluation reports. AllenNLP’s `SpanBasedF1Measure` reports per-class and overall precision, recall, and F1.
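
To see what a span-based F1 measures, here is a small hand-rolled version. It is only a sketch of the idea and not the code behind `SpanBasedF1Measure`. The `span_f1` helper and the example spans are made up. A predicted entity counts as correct only if its label and both boundaries match a gold entity exactly.

```python
from typing import Set, Tuple

Span = Tuple[str, int, int]  # (label, start, end)

def span_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exactly matching spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = {("PER", 0, 1), ("ORG", 6, 6), ("LOC", 9, 10)}
# The LOC prediction covers only half of the gold span, so it does not count.
predicted = {("PER", 0, 1), ("ORG", 6, 6), ("LOC", 9, 9)}
precision, recall, f1 = span_f1(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))
# 0.667 0.667 0.667
```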

In `lstm.py`, put this:

In [None]:
from allennlp.models import Model
from allennlp.nn.util import get_text_field_mask, sequence_cross_entropy_with_logits
from typing import Dict, Optional

import torch
import torch.nn as nn

from allennlp.data.vocabulary import Vocabulary
from allennlp.modules.text_field_embedders import TextFieldEmbedder
from allennlp.modules.seq2seq_encoders.seq2seq_encoder import Seq2SeqEncoder
from allennlp.training.metrics import SpanBasedF1Measure

@Model.register('ner_lstm')
class NerLSTM(Model):

    def __init__(self,
                 vocab: Vocabulary,
                 embedder: TextFieldEmbedder,
                 encoder: Seq2SeqEncoder) -> None:
        super().__init__(vocab)

        self._embedder = embedder
        self._encoder = encoder
        self._classifier = nn.Linear(in_features=encoder.get_output_dim(),
                                     out_features=vocab.get_vocab_size('labels'))

        self._f1 = SpanBasedF1Measure(vocab, 'labels')
    
    
    def forward(self,
                tokens: Dict[str, torch.Tensor],
                label: Optional[torch.Tensor] = None) -> Dict[str, torch.Tensor]:
        mask = get_text_field_mask(tokens)
        
        #the tokens input isn’t a tensor of token indexes, it’s a dict. 
        # That dict contains all the namespaces defined by the  token_indexers.
        
        #`get_text_field_mask` - this function takes the tokens dict and returns a binary mask over the tokens. 
        # The mask is passed into the encoder, the metrics, and the sequence loss function so we can ignore missing text.

        embedded = self._embedder(tokens) #embed the input tokens using our pretrained word embeddings
        encoded = self._encoder(embedded, mask) #encode them using our LSTM or GRU encoder
        classified = self._classifier(encoded) #classify each timestep to the target label space

        # always return the per-token scores so the model is usable at prediction time
        output: Dict[str, torch.Tensor] = {"logits": classified}

        if label is not None:
            # update the span-based F1 metric; labels are only available during training and evaluation
            self._f1(classified, label, mask)
            # `sequence_cross_entropy_with_logits` is the cross-entropy loss for sequence
            # tagging; the mask keeps padding positions out of the loss
            output["loss"] = sequence_cross_entropy_with_logits(classified, label, mask)

        return output
    
    # the trainer calls this automatically during training and validation
    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return self._f1.get_metric(reset)
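
The role of the mask in the loss is easier to see on plain Python lists. The sketch below is a hand-written illustration, not AllenNLP’s `sequence_cross_entropy_with_logits`. The function name and the averaging (per sequence first, then across the batch) are assumptions made for this example. Padding positions are simply skipped.

```python
import math
from typing import List

def masked_sequence_cross_entropy(logits: List[List[List[float]]],
                                  labels: List[List[int]],
                                  mask: List[List[int]]) -> float:
    sequence_losses = []
    for seq_logits, seq_labels, seq_mask in zip(logits, labels, mask):
        token_losses = []
        for scores, label, keep in zip(seq_logits, seq_labels, seq_mask):
            if not keep:
                continue  # padding position: contributes nothing to the loss
            # Negative log-softmax probability of the gold label.
            log_normalizer = math.log(sum(math.exp(s) for s in scores))
            token_losses.append(log_normalizer - scores[label])
        if token_losses:
            sequence_losses.append(sum(token_losses) / len(token_losses))
    if not sequence_losses:
        return 0.0
    return sum(sequence_losses) / len(sequence_losses)

logits = [
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],            # three real tokens
    [[0.0, 0.0], [100.0, -100.0], [100.0, -100.0]],  # one real token, two padding
]
labels = [[0, 1, 0], [1, 1, 1]]
mask = [[1, 1, 1], [1, 0, 0]]
loss = masked_sequence_cross_entropy(logits, labels, mask)
print(round(loss, 4))  # 0.6931, i.e. ln(2): every unmasked position is a uniform guess
```

Without the mask, the two padding positions in the second sequence would each add a loss of about 200 and swamp the real signal.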

### 4. Configuring experiments

Let's configure our experiment. Keeping the configuration in a separate file gives us a lot of flexibility.

    touch train_lstm.jsonnet

To be able to train a model, we’re going to need to fill in the 3 empty configuration keys:
- `data_loader`
- `model`
- `trainer`

1. The `data_loader` describes how to batch the data and iterate over it. AllenNLP has a number of built-in batching options that we’ll be using, though for more advanced projects you might be inclined to write your own.

There are two main batching strategies you will likely use: **basic batching** and **bucket batching**. Basic batching splits the data into batches of a fixed size and, with `shuffle: true`, reshuffles the data every epoch.

    data_loader: {
        batch_size: 10,
        shuffle: true
    },
 
AllenNLP also offers a `batch_sampler` option, which lets you specify how batches are constructed. For instance, you can use a **bucket sampling** strategy, which is slightly more advanced and reduces the memory footprint of the batches. When batching variable-length sequences, AllenNLP pads every sequence to the length of the longest sequence in the batch. If you sample randomly from the data, long and short sequences can end up in the same batch, and a lot of memory is wasted on padding. A bucket sampler instead groups examples of similar length into the same batch.
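
The saving from bucketing is easy to estimate with a few lines of plain Python. The sketch below counts padding tokens for randomly ordered versus length-sorted batches. The sentence lengths are randomly generated for illustration, and `padding_cost` is a made-up helper. A real bucket sampler may also add noise and shuffle the batches, which this toy ignores.

```python
import random
from typing import List

def padding_cost(lengths: List[int], batch_size: int) -> int:
    """Count the padding tokens needed when batching `lengths` in the given order."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every sequence is padded up to the longest one in its batch.
        total += sum(max(batch) - length for length in batch)
    return total

random.seed(0)
lengths = [random.randint(3, 40) for _ in range(1000)]  # synthetic sentence lengths

random_order_cost = padding_cost(lengths, batch_size=10)
bucketed_cost = padding_cost(sorted(lengths), batch_size=10)
print(random_order_cost, bucketed_cost)
```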


2. The `model` configures each of the sub-modules that we defined in the last section, that is, how big our LSTM should be, which embeddings we want to use, and so on.

AllenNLP relies on dependency injection: the constructor declares abstract types such as `Seq2SeqEncoder`, and the configuration file specifies the concrete implementation, for example `lstm`. This makes experimentation quick and easy.
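
Here is a toy version of the same pattern. The classes are hypothetical stand-ins that only track dimensions, and `num_labels=9` is an arbitrary value. The tagger never names a concrete encoder; it only asks the injected object for its output size, much like our model calls `encoder.get_output_dim()`.

```python
from dataclasses import dataclass

@dataclass
class ToyRecurrentEncoder:
    """Hypothetical stand-in for an encoder; it only tracks dimensions."""
    kind: str
    input_size: int
    hidden_size: int
    bidirectional: bool = False

    def get_output_dim(self) -> int:
        # A bidirectional encoder concatenates forward and backward states.
        return self.hidden_size * (2 if self.bidirectional else 1)

class ToyTagger:
    def __init__(self, encoder: ToyRecurrentEncoder, num_labels: int) -> None:
        # The encoder is injected; the classifier shape is derived from it.
        self.encoder = encoder
        self.classifier_shape = (encoder.get_output_dim(), num_labels)

lstm_tagger = ToyTagger(ToyRecurrentEncoder("lstm", 50, 25, bidirectional=True), num_labels=9)
gru_tagger = ToyTagger(ToyRecurrentEncoder("gru", 50, 100), num_labels=9)
print(lstm_tagger.classifier_shape, gru_tagger.classifier_shape)
# (50, 9) (100, 9)
```

Swapping the encoder changes nothing in `ToyTagger` itself, which is the point of injecting it.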

3. The `trainer` configures the AllenNLP `Trainer` object, which handles all the training and optimizing for us. We pass it an optimizer, a number of epochs, maybe some early-stopping parameters, etc.

In the trainer, `patience` and `validation_metric` are used for early stopping. `cuda_device` specifies whether or not you want to run on the GPU (I have left this set to -1 for CPU-only training).
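
The interaction between `patience` and a loss-based validation metric can be sketched as follows. This is a simplified illustration with invented loss values, and the `epochs_until_stop` helper is made up; AllenNLP’s own bookkeeping may differ in its details. Training stops once the validation loss has failed to improve for `patience` epochs in a row.

```python
from typing import List, Optional

def epochs_until_stop(validation_losses: List[float], patience: int) -> int:
    """Return how many epochs run before patience-based early stopping."""
    best: Optional[float] = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if best is None or loss < best:
            best = loss  # lower loss is better, as with a '-loss' metric
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return len(validation_losses)

losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.64, 0.66, 0.65, 0.68, 0.60, 0.59]
print(epochs_until_stop(losses, patience=3))  # 9
print(epochs_until_stop(losses, patience=2))  # 5
```

With a larger `patience`, training survives the short plateau after epoch 3 and goes on to find the better loss at epoch 6.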

In `train_lstm.jsonnet`, put this:


In [22]:
# {
#   dataset_reader: {
#     type: 'conll_03_reader',
#     token_indexers: {
#       tokens: {
#         type: 'single_id',
#         namespace: 'tokens',
#         lowercase_tokens: true
#       }
#     },
#   },
#   data_loader: {
#     batch_sampler: {
#       type: 'bucket',
#       batch_size: 10
#     }
#   },
#   train_data_path: 'data/train.txt',
#   validation_data_path: 'data/validation.txt',
#   model: {
#     type: 'ner_lstm',
#     embedder: {
#       token_embedders: {
#         tokens: {
#           type: 'embedding',
#           pretrained_file: "https://allennlp.s3.amazonaws.com/datasets/glove/glove.6B.50d.txt.gz",
#           embedding_dim: 50,
#           trainable: false
#         }
#       }
#     },
#     encoder: {
#       type: 'lstm',
#       input_size: 50,
#       hidden_size: 25,
#       bidirectional: true
#     }
#   },
#   trainer: {
#     num_epochs: 40,
#     patience: 10,
#     cuda_device: -1,
#     grad_clipping: 5.0,
#     validation_metric: '-loss',
#     optimizer: {
#       type: 'adam',
#       lr: 0.003
#     }
#   }
# }

Now, run this:
    
    allennlp train -f --include-package conn_ner -s models train_lstm.jsonnet
    
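`-f` overwrites the serialization directory if it already exists, `--include-package conn_ner` imports our package so that the registered reader and model are visible, and `-s models` sets the serialization directory to `models`.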
