# Natural Language Processing

## NER using AllenNLP

The problem statement of NER is as follows:

Given 

    Bill Gates and Paul Allen, founders of Microsoft, started selling software in 1975 in New Mexico.

We want to get

    [Bill Gates PER] and [Paul Allen PER], founders of [Microsoft ORG], started selling software in [1975 DATE] in [New Mexico LOC].

There are two ways to frame this task:

- First, as a segmentation task, where we try to find and classify the segments that match entities and assign a NULL or O label to everything in between. The label space would then be {PER, ORG, DATE, LOC, O}.
- Second, as a token-level tagging task. This one requires a bit more thought, since it’s not clear from the start how to group tagged tokens into entities. But if you introduce a slightly modified label space, you can reconstruct the entities.

In this modified label space, each entity type (e.g. PER, LOC) is split into two labels: B-PER, denoting “this is a new person entity”, and I-PER, denoting “I’m continuing the previous person entity”. For the sentence above, the tokens would be tagged like so:

    [Bill B-PER] [Gates I-PER] and [Paul B-PER] [Allen I-PER], founders of [Microsoft B-ORG], started selling software in [1975 B-DATE] in [New B-LOC] [Mexico I-LOC].

For brevity’s sake, I left out the O tags (such as [and O]), but you can imagine that every other word in the sentence is assigned that null tag.
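To make the "you can reconstruct the entities" claim concrete, here is a minimal sketch of how B-/I- tags could be grouped back into entities. The function name `bio_to_entities` is made up for illustration and is not part of AllenNLP; the sketch assumes one tag per token and simply ignores a stray I- tag that does not continue an open entity.

```python
from typing import List, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity text, entity type) pairs.

    Hypothetical helper for illustration only.
    """
    entities: List[Tuple[str, str]] = []
    current_type = None            # type of the entity being built, if any
    current_tokens: List[str] = []

    def close_current() -> None:
        # Store the entity collected so far (if any) and reset the state.
        nonlocal current_type, current_tokens
        if current_type is not None:
            entities.append((" ".join(current_tokens), current_type))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # B- always starts a new entity, closing any open one.
            close_current()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # I- of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # O, or an I- tag with nothing to continue: close the open entity.
            close_current()
    close_current()
    return entities


tokens = ["Bill", "Gates", "and", "Paul", "Allen", ",", "founders", "of",
          "Microsoft", ",", "started", "selling", "software", "in", "1975",
          "in", "New", "Mexico", "."]
tags = ["B-PER", "I-PER", "O", "B-PER", "I-PER", "O", "O", "O",
        "B-ORG", "O", "O", "O", "O", "O", "B-DATE",
        "O", "B-LOC", "I-LOC", "O"]

for text, entity_type in bio_to_entities(tokens, tags):
    print(f"[{text} {entity_type}]")
```

Because B- marks every entity start, two adjacent entities of the same type stay separate, which a plain {PER, ORG, DATE, LOC, O} token labeling could not express.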


### 1. Building the Dataset Reader

Let’s take a look at an example from the CoNLL’03 dataset and see whether it conforms to the specification we laid down above:

    Essex NNP I-NP I-ORG
    , , O O
    however RB I-ADVP O
    , , O O
    look VB I-VP O
    certain JJ I-ADJP O
    to TO I-VP O
    regain VB I-VP O
    their PRP$ I-NP O
    .
    .

The CoNLL-2003 shared task concerns language-independent named entity recognition. Its data files contain four columns separated by a single space. Each word is on its own line, and an empty line follows each sentence. The first item on each line is the word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag, and the fourth the named entity tag. For NER, we therefore only care about the first and last items.
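As a rough sketch of what reading this format involves, the plain-Python generator below keeps only the first and last columns and splits sentences on empty lines. The name `read_conll_sentences` is hypothetical, and this is not the AllenNLP dataset reader itself. It also skips `-DOCSTART-` document-marker lines, which is an assumption about the downloaded files that goes beyond the description above.

```python
from typing import Iterable, Iterator, List, Tuple


def read_conll_sentences(
    lines: Iterable[str],
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (tokens, ner_tags) for each sentence in four-column CoNLL data.

    Hypothetical helper for illustration only.
    """
    tokens: List[str] = []
    tags: List[str] = []
    for line in lines:
        line = line.strip()
        # An empty line ends a sentence; -DOCSTART- lines (assumed to be
        # document markers) are treated as boundaries and dropped.
        if not line or line.startswith("-DOCSTART-"):
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])   # first column: the word
        tags.append(columns[-1])    # last column: the named entity tag
    if tokens:
        # The last sentence may not be followed by an empty line.
        yield tokens, tags


# The rows shown in the example above, used as in-memory input.
sample_lines = [
    "Essex NNP I-NP I-ORG",
    ", , O O",
    "however RB I-ADVP O",
    ", , O O",
    "look VB I-VP O",
    "certain JJ I-ADJP O",
    "to TO I-VP O",
    "regain VB I-VP O",
    "their PRP$ I-NP O",
]

for sentence_tokens, sentence_tags in read_conll_sentences(sample_lines):
    print(list(zip(sentence_tokens, sentence_tags)))
```

To run it on a downloaded file, pass an open file handle, for example `read_conll_sentences(open("train.txt"))`, since a file object iterates over its lines.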

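The chunk tags and the named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE. Only when two phrases of the same type immediately follow each other does the first word of the second phrase get the tag B-TYPE, to show that it starts a new phrase. A word with tag O is not part of any phrase. Note that this differs slightly from the scheme described earlier, where every entity starts with a B- tag: in the example, Essex is tagged I-ORG even though it begins an entity.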
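Since the files mostly use I-TYPE even for the first word of an entity, while the tagging scheme described earlier starts every entity with B-, it can be handy to normalize the tags. The following is a minimal sketch of such a conversion, written from the rule stated above. `iob1_to_bio` is a hypothetical helper, not an AllenNLP function.

```python
from typing import List


def iob1_to_bio(tags: List[str]) -> List[str]:
    """Rewrite CoNLL-2003 style tags so that every entity starts with B-.

    Hypothetical helper for illustration only.
    """
    converted: List[str] = []
    previous = "O"  # original tag of the previous token
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # An I- tag opens a new entity when the previous token was
            # outside any entity or belonged to a different entity type.
            if previous == "O" or previous[2:] != entity_type:
                converted.append("B-" + entity_type)
            else:
                converted.append(tag)
        else:
            # O tags and explicit B- tags are already correct.
            converted.append(tag)
        previous = tag
    return converted


# Named entity column of the "Essex , however , ..." rows shown earlier.
essex_tags = ["I-ORG", "O", "O", "O", "O", "O", "O", "O", "O"]
print(iob1_to_bio(essex_tags))
# ['B-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
```

After this conversion, entities can be reconstructed by treating every B- tag as the start of a new entity, as in the Bill Gates example.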

To download the dataset, run the following in your terminal:

    curl -o train.txt https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.train
    curl -o validation.txt https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testa
    curl -o test.txt https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testb