# Natural Language Processing

## NER using AllenNLP

The problem statement of NER is as follows:

Given 

    Bill Gates and Paul Allen, founders of Microsoft, started selling software in 1975 in New Mexico.

We want to get

    [Bill Gates PER] and [Paul Allen PER], founders of [Microsoft ORG], started selling software in [1975 DATE] in [New Mexico LOC].

There are two ways to frame this problem:

- First, as a segmentation task, where we attempt to find and classify segments that match entities, and assign a NULL or O label to everything in between. Our label space would then be {PER, ORG, DATE, LOC, O}.
- Second, as a token-level tagging task. This one requires a bit more thought, since it’s not clear at first how tokens belonging to the same entity are associated with each other. But if you introduce a slightly modified label space, you can reconstruct the entities.

In this modified label space, each entity type (e.g. PER, LOC) is split into two labels: B-PER, denoting “this is a new person entity”, and I-PER, denoting “I’m continuing the previous person entity”. On the above sentence, the tokens would be tagged like so:

    [Bill B-PER] [Gates I-PER] and [Paul B-PER] [Allen I-PER], founders of [Microsoft B-ORG], started selling software in [1975 B-DATE] in [New B-LOC] [Mexico I-LOC].

For brevity’s sake, I left out all the O tags (such as [and O]), but you can imagine that all the remaining words in the sentence are assigned that null tag.
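
To see why the B-/I- labels are enough to reconstruct the entities, here is a minimal sketch of the decoding step. The function name `bio_to_entities` and the pre-tokenized sentence are made up for this illustration; a stray I- tag that does not continue an entity of the same type is treated as the start of a new entity, which is one possible convention rather than a fixed rule.

```python
from typing import List, Optional, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens back into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity being built, if there is one.
        if current_tokens and current_type is not None:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)  # continue the current entity
        else:  # "O": outside any entity
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


# Hypothetical tokenization of the example sentence (punctuation as separate tokens).
tokens = ("Bill Gates and Paul Allen , founders of Microsoft , "
          "started selling software in 1975 in New Mexico .").split()
tags = ["B-PER", "I-PER", "O", "B-PER", "I-PER", "O", "O", "O", "B-ORG", "O",
        "O", "O", "O", "O", "B-DATE", "O", "B-LOC", "I-LOC", "O"]

print(bio_to_entities(tokens, tags))
```

Because "Paul" carries a B-PER tag, the two adjacent-looking person names stay separate entities even if the "and" between them were removed.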


### 1. Loading the CoNLL 2003 dataset

Let’s take a look at an example from the CoNLL’03 dataset and see whether it conforms to the specification we laid down above:

    Essex NNP I-NP I-ORG
    , , O O
    however RB I-ADVP O
    , , O O
    look VB I-VP O
    certain JJ I-ADJP O
    to TO I-VP O
    regain VB I-VP O
    their PRP$ I-NP O
    .
    .

The shared task of CoNLL-2003 concerns language-independent named entity recognition. The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. For NER, we only care about the first and last items.

The chunk tags and the named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other will the first word of the second phrase have the tag B-TYPE, to show that it starts a new phrase. A word with tag O is not part of a phrase. Note that this differs slightly from the scheme described earlier, where every entity starts with a B- tag: here an entity usually starts with an I- tag.
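
If you prefer the tags in the B-/I- form from the first section, the conversion is mechanical. The following is a minimal sketch; the function name `iob1_to_bio` is made up for this example and is not part of AllenNLP or the dataset tooling.

```python
from typing import List


def iob1_to_bio(tags: List[str]) -> List[str]:
    """Rewrite CoNLL-style tags so that every entity starts with a B- tag."""
    converted: List[str] = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            # An I- tag continues an entity only if the previous tag
            # belongs to an entity of the same type.
            continues = previous != "O" and previous[2:] == tag[2:]
            converted.append(tag if continues else "B-" + tag[2:])
        else:
            # "O" and existing "B-" tags are already in the target form.
            converted.append(tag)
        previous = tag
    return converted


print(iob1_to_bio(["I-ORG", "O", "I-PER", "I-PER", "B-PER", "I-PER"]))
# ['B-ORG', 'O', 'B-PER', 'I-PER', 'B-PER', 'I-PER']
```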

To download the dataset, run in your terminal:

    curl -o train.txt https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.train
    curl -o validation.txt https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testa
    curl -o test.txt https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testb

### 2. Dataset Reader

The first thing we’re going to do is build a dataset reader that can consume the CoNLL’03 dataset. To get started, let’s create the directory structure for this project:

    mkdir -p conn_ner/readers
    touch conn_ner/__init__.py
    touch conn_ner/readers/__init__.py
    touch conn_ner/readers/conll_reader.py

Put the following code in `conll_reader.py`:

    from allennlp.data import DatasetReader

    @DatasetReader.register("conll_03_reader")
    class CoNLL03DatasetReader(DatasetReader):
        pass

`@DatasetReader.register(...)` relies on a core feature of AllenNLP, `Registrable`. Registering a class under a name lets us configure our experiments with JSON even though we write all the code in Python.

When you write a new `Model`, a new `DatasetReader`, a new `Metric`, or pretty much anything else, you’ll want to register it so it’s visible to your configuration file. 
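
To make the idea concrete, here is a toy version of the registration pattern in plain Python. This is only a sketch of the general mechanism, not AllenNLP's actual implementation; `ToyRegistrable`, `ToyReader`, and `from_config` are hypothetical names invented for this example.

```python
from typing import Callable, Dict


class ToyRegistrable:
    """Keeps a name -> class table so objects can be built from a config dict."""

    _registry: Dict[str, type] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[type], type]:
        def decorator(subclass: type) -> type:
            cls._registry[name] = subclass  # remember the class under its name
            return subclass
        return decorator

    @classmethod
    def from_config(cls, config: dict) -> "ToyRegistrable":
        params = dict(config)  # copy so the caller's dict is not modified
        subclass = cls._registry[params.pop("type")]
        # Remaining keys are passed to the constructor as keyword arguments.
        return subclass(**params)


@ToyRegistrable.register("toy_reader")
class ToyReader(ToyRegistrable):
    def __init__(self, lazy: bool = False) -> None:
        self.lazy = lazy


# The "type" key selects the class; the other keys become __init__ arguments.
reader = ToyRegistrable.from_config({"type": "toy_reader", "lazy": True})
print(type(reader).__name__, reader.lazy)
```

The config dictionary plays the role of the JSON file: the string under `type` picks the registered class, which is why a class has to be registered before a configuration can refer to it.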

Every class that inherits from DatasetReader **should override these 3 methods**:

    __init__(self, ...) -> None
    _read(self, file_path: str) -> Iterable[Instance]
    text_to_instance(self, ...) -> Instance

Any argument in `__init__()` will be visible to the JSON configuration later on, so if you have parameters in the dataset reader that you want to change between experiments, put them there. For our CoNLL’03 reader, `__init__()` takes two parameters: `token_indexers` and `lazy`. The `token_indexers` help AllenNLP map tokens to integers so it can keep track of them later. If `lazy=True`, AllenNLP won’t store the dataset in memory, but will load it from disk in batch-size chunks. This is desirable if your dataset is too large to fit in memory, but for our purposes we’ll leave it set to false.
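
As a rough picture of what "mapping tokens to integers" means, here is a toy vocabulary in plain Python. It is a simplified sketch and not how AllenNLP's `TokenIndexer` or vocabulary classes are implemented; `ToyVocabulary` and the `<pad>`/`<unk>` entries are assumptions made for this illustration.

```python
from typing import Dict, Iterable, List


class ToyVocabulary:
    """Assigns each distinct token an integer id, with a fallback for unseen tokens."""

    def __init__(self) -> None:
        # Ids 0 and 1 are reserved for padding and unknown tokens (an assumption).
        self.token_to_index: Dict[str, int] = {"<pad>": 0, "<unk>": 1}

    def add_tokens(self, tokens: Iterable[str]) -> None:
        for token in tokens:
            if token not in self.token_to_index:
                self.token_to_index[token] = len(self.token_to_index)

    def index(self, tokens: Iterable[str]) -> List[int]:
        unknown = self.token_to_index["<unk>"]
        return [self.token_to_index.get(token, unknown) for token in tokens]


vocab = ToyVocabulary()
vocab.add_tokens(["EU", "rejects", "German", "call"])
print(vocab.index(["EU", "call", "lamb"]))  # "lamb" was never added, so it maps to <unk>
```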

The next thing we need to define is the `_read()` function. In pretty much every case, `_read()` takes only a `file_path: str` argument. Its purpose is to take a single file that contains the dataset and convert it to a sequence of `Instance` objects, either as a list or by yielding them one at a time.

The last method to write is `text_to_instance()`, which turns the words and tags of one sentence into an `Instance`. Most of the code is straightforward. Note that AllenNLP provides `SequenceLabelField`, which holds one label per token, as is common in sequence tagging tasks such as POS tagging and NER.

    import itertools
    from typing import Dict, Iterator, List

    from allennlp.data import DatasetReader, Instance
    from allennlp.data.fields import Field, SequenceLabelField, TextField
    from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenIndexer
    from allennlp.data.tokenizers import Token

    @DatasetReader.register("conll_03_reader")
    class CoNLL03DatasetReader(DatasetReader):
        def __init__(self,
                     token_indexers: Dict[str, TokenIndexer] = None,
                     lazy: bool = False) -> None:
            super().__init__(lazy)
            self._token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()}

        def _read(self, file_path: str) -> Iterator[Instance]:
            is_divider = lambda line: line.strip() == ''

            with open(file_path, 'r') as conll_file:
                # group consecutive lines into sentences, separated by empty lines
                for divider, lines in itertools.groupby(conll_file, is_divider):
                    if not divider:
                        # one row per token, e.g.
                        # [['EU', 'NNP', 'I-NP', 'I-ORG'], ['rejects', 'VBZ', 'I-VP', 'O'], ...]
                        fields = [l.strip().split() for l in lines]
                        # transpose so that each field is a tuple of tokens/labels, e.g.
                        # [('EU', 'rejects', ...), ('NNP', 'VBZ', ...), ...]
                        fields = [l for l in zip(*fields)]
                        # only keep the tokens and NER labels
                        tokens, _, _, ner_tags = fields

                        yield self.text_to_instance(tokens, ner_tags)

        def text_to_instance(self,
                             words: List[str],
                             ner_tags: List[str]) -> Instance:
            fields: Dict[str, Field] = {}
            # wrap each token in the file with a token object
            tokens = TextField([Token(w) for w in words], self._token_indexers)

            # Instances in AllenNLP are created using Python dictionaries,
            # which map a field name to the Field object
            fields["tokens"] = tokens
            fields["label"] = SequenceLabelField(ner_tags, tokens)

            return Instance(fields)
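
The grouping and transposing inside `_read()` can be tried without AllenNLP on a small in-memory sample. The sketch below repeats the same `itertools.groupby` logic; the helper name `read_sentences` and the hand-written sample text are assumptions for illustration only.

```python
import io
import itertools
from typing import Iterable, Iterator, List, Tuple

# A tiny hand-written sample in the four-column layout described above.
sample = (
    "EU NNP I-NP I-ORG\n"
    "rejects VBZ I-VP O\n"
    "\n"
    "Peter NNP I-NP I-PER\n"
    "Blackburn NNP I-NP I-PER\n"
)


def read_sentences(lines: Iterable[str]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (tokens, ner_tags) for each block of non-empty lines."""
    is_divider = lambda line: line.strip() == ""
    for divider, group in itertools.groupby(lines, is_divider):
        if divider:
            continue  # skip the runs of empty lines between sentences
        rows = [line.strip().split() for line in group]
        # Transpose rows into columns: words, POS tags, chunk tags, NER tags.
        tokens, _pos, _chunk, ner_tags = zip(*rows)
        yield list(tokens), list(ner_tags)


sentences = list(read_sentences(io.StringIO(sample)))
for tokens, ner_tags in sentences:
    print(tokens, ner_tags)
```

`groupby` only groups consecutive lines with the same key, which is exactly what is needed here: each run of non-empty lines becomes one sentence.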

#### Testing the Dataset Reader

Now that we’ve written a dataset reader, we want to test that it can successfully load the CoNLL dataset, and perhaps see some corpus statistics. Create a configuration file:

    touch test_reader.jsonnet

Put the following configuration in it:

    {
      dataset_reader: {
        type: 'conll_03_reader',
        lazy: false
      },

      train_data_path: '../data/train.txt',
      validation_data_path: '../data/validation.txt',
      model: {},
      data_loader: {},
      trainer: {}
    }


Run in the terminal:
    
    allennlp train --dry-run --include-package conn_ner -s /tmp/tagging/tests/0 test_reader.jsonnet
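
For a quick look at corpus statistics independent of AllenNLP, the NER tag frequencies can also be counted directly from the file's lines. This is a minimal sketch: `count_ner_tags` is a hypothetical helper, the sample lines are hand-written, and it assumes that document separator lines start with `-DOCSTART-` and should be skipped.

```python
from collections import Counter
from typing import Iterable


def count_ner_tags(lines: Iterable[str]) -> Counter:
    """Count how often each NER tag (the last column) occurs."""
    counts: Counter = Counter()
    for line in lines:
        parts = line.split()
        # Skip empty lines and (assumed) document separator lines.
        if len(parts) == 4 and parts[0] != "-DOCSTART-":
            counts[parts[-1]] += 1
    return counts


# Hand-written sample lines in the four-column layout.
sample_lines = [
    "-DOCSTART- -X- O O",
    "",
    "EU NNP I-NP I-ORG",
    "rejects VBZ I-VP O",
    "German JJ I-NP I-MISC",
    "call NN I-NP O",
    "",
]

tag_counts = count_ner_tags(sample_lines)
print(tag_counts.most_common())
```

To run it on the real data, pass an open file instead of the sample list, for example `with open('train.txt') as f: print(count_ner_tags(f))`, assuming the file was downloaded as shown earlier.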
