# CoNLL-2003 (English) Named Entity Recognition (NER)

The **`CoNLL 2003`** shared task consists of data from the Reuters 1996 news corpus with annotations for 4 types of `Named Entities` (persons, locations, organizations, and miscellaneous entities). The data is in [IOB2](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) format. Each entity token has a `'B-'` or `'I-'` tag: `'B-'` marks the first token of an entity, and `'I-'` marks a token inside (continuing) the entity.

* **`Person`**: `'B-PER'` and  `'I-PER'`


* **`Organization`**: `'B-ORG'` and `'I-ORG'`


* **`Location`**: `'B-LOC'`  and `'I-LOC'`


* **`Miscellaneous`**: `'B-MISC'` and `'I-MISC'`


* **`Other (non-named entity)`**: `'O'`

See [website](https://www.clips.uantwerpen.be/conll2003/ner/) and [paper](https://www.clips.uantwerpen.be/conll2003/pdf/14247tjo.pdf) for more info.
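
To make the IOB2 scheme concrete, here is a minimal sketch that groups a tag sequence back into entity spans. The helper name `iob2_to_spans` is hypothetical (it is not part of any library used here), and treating a stray `'I-X'` as the start of a new span is an assumption of this sketch.

```python
def iob2_to_spans(tags):
    """Convert a list of IOB2 tags into (entity_type, start, end) spans.

    `end` is exclusive. A stray 'I-X' without a matching open span is
    treated as the start of a new span (an assumption of this sketch).
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if not continues:
            # close the span that was open, if any
            if current is not None:
                spans.append((current, start, i))
            # open a new span on 'B-X' (or a stray 'I-X'), otherwise reset
            if prefix in ("B", "I"):
                start, current = i, etype
            else:
                start, current = None, None
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


tokens = ["Dutch", "forward", "Reggie", "Blinker", "had", "his",
          "suspension", "lifted", "by", "FIFA"]
tags = ["B-MISC", "O", "B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG"]

for etype, start, end in iob2_to_spans(tags):
    print(etype, " ".join(tokens[start:end]))
# MISC Dutch
# PER Reggie Blinker
# ORG FIFA
```

Note how `'B-PER'` followed by `'I-PER'` yields a single two-token person span.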

The data is already tokenized and tagged with NER labels:

In [None]:
# token         POS   chunk  NER
#-------------------------------
# Despite       IN    B-PP   O
# winning       VBG   B-VP   O
# the           DT    B-NP   O
# Asian         JJ    I-NP   B-MISC
# Games         NNPS  I-NP   I-MISC
# title         NN    I-NP   O
# two           CD    B-NP   O
# years         NNS   I-NP   O
# ago           RB    B-ADVP O
# ,             ,     O      O
# Uzbekistan    NNP   B-NP   B-LOC
# are           VBP   B-VP   O
# in            IN    B-PP   O
# the           DT    B-NP   O
# finals        NNS   I-NP   O
# as            IN    B-SBAR O
# outsiders     NNS   B-NP   O
# .             .     O      O

The first column is the token, the second is the part-of-speech (POS) tag, the third is the syntactic chunk tag, and the fourth is the NER tag.

So for the named entity recognition (NER) task, the data consists of features `X` and labels `y`:


* **`X`**: a list of lists of tokens (one inner list per sentence)


* **`y`**: a list of lists of NER tags, aligned token-for-token with `X`
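
As a small illustration of this layout, the sketch below builds `X` and `y` from a four-line excerpt of the sample sentence above. The variable names are illustrative only; the full file reader used in this notebook follows in the next section.

```python
# Four lines excerpted from the sample sentence shown earlier
excerpt = """\
the DT B-NP O
Asian JJ I-NP B-MISC
Games NNPS I-NP I-MISC
title NN I-NP O"""

# Each line holds: token, POS tag, chunk tag, NER tag
rows = [line.split() for line in excerpt.splitlines()]

tokens = [row[0] for row in rows]    # first column
ner_tags = [row[3] for row in rows]  # fourth column

# One sentence, so both outer lists have a single element
X = [tokens]
y = [ner_tags]

print(X)  # [['the', 'Asian', 'Games', 'title']]
print(y)  # [['O', 'B-MISC', 'I-MISC', 'O']]
```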


## get data


In [1]:
%%bash
DATADIR="ner_english"
if test ! -d "$DATADIR";then
    echo "Creating $DATADIR dir"
    mkdir "$DATADIR"
    cd "$DATADIR"
    wget https://raw.githubusercontent.com/mxhofer/Named-Entity-Recognition-BidirectionalLSTM-CNN-CoNLL/master/data/train.txt
    wget https://raw.githubusercontent.com/mxhofer/Named-Entity-Recognition-BidirectionalLSTM-CNN-CoNLL/master/data/test.txt
    wget https://raw.githubusercontent.com/mxhofer/Named-Entity-Recognition-BidirectionalLSTM-CNN-CoNLL/master/data/dev.txt
fi

In [1]:
"""
Train data: 14987 sentences, 204567 tokens
Dev data: 3466 sentences, 51578 tokens
Test data: 3684 sentences, 46666 tokens
"""
import os
import sys

import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

sys.path.append("../") 
from bert_sklearn import BertTokenClassifier, load_model

def flatten(l):
    return [item for sublist in l for item in sublist]

def read_CoNLL2003_format(filename, idx=3):
    """Read file in CoNLL-2003 shared task format""" 
    
    # read the whole file; the context manager closes the file handle
    with open(filename) as f:
        lines = f.read().strip()
    
    # find sentence-like boundaries
    lines = lines.split("\n\n")  
    
    # throw out -DOCSTART- lines 
    #lines = [line for line in lines if not line.startswith("-DOCSTART-")]
    
    # split on newlines
    lines = [line.split("\n") for line in lines]
    
    # get tokens
    tokens = [[l.split()[0] for l in line] for line in lines]
    
    # get labels/tags
    labels = [[l.split()[idx] for l in line] for line in lines]
    
    data= {'tokens': tokens, 'labels': labels}
    df=pd.DataFrame(data=data)
    
    return df

DATADIR = "./ner_english/"

def get_conll2003_data(trainfile=DATADIR + "train.txt",
                  devfile=DATADIR + "dev.txt",
                  testfile=DATADIR + "test.txt"):

    train = read_CoNLL2003_format(trainfile)
    print("Train data: %d sentences, %d tokens"%(len(train),len(flatten(train.tokens))))

    dev = read_CoNLL2003_format(devfile)
    print("Dev data: %d sentences, %d tokens"%(len(dev),len(flatten(dev.tokens))))

    test = read_CoNLL2003_format(testfile)
    print("Test data: %d sentences, %d tokens"%(len(test),len(flatten(test.tokens))))
    
    return train, dev, test

train, dev, test = get_conll2003_data()

X_train, y_train = train.tokens, train.labels
X_dev, y_dev = dev.tokens, dev.labels
X_test, y_test = test.tokens, test.labels


label_list = np.unique(flatten(y_train))
label_list = list(label_list)
print("\nNER tags:",label_list)

Train data: 14987 sentences, 204567 tokens
Dev data: 3466 sentences, 51578 tokens
Test data: 3684 sentences, 46666 tokens

NER tags: ['B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O']


In [2]:
train.head()

                                              tokens                                             labels
0                                       [-DOCSTART-]                                                [O]
1  [EU, rejects, German, call, to, boycott, Briti...          [B-ORG, O, B-MISC, O, O, O, B-MISC, O, O]
2                                 [Peter, Blackburn]                                     [B-PER, I-PER]
3                             [BRUSSELS, 1996-08-22]                                         [B-LOC, O]
4  [The, European, Commission, said, on, Thursday...  [O, B-ORG, I-ORG, O, O, O, O, O, O, B-MISC, O,...


Let's look at a single observation, i.e. one (tokens, labels) pair from the test set:

In [3]:
i = 152
tokens = X_test[i]
labels = y_test[i]

data = {"token": tokens,"label": labels}
df=pd.DataFrame(data=data)
print(df)

         token   label
0        Dutch  B-MISC
1      forward       O
2       Reggie   B-PER
3      Blinker   I-PER
4          had       O
5          his       O
6   indefinite       O
7   suspension       O
8       lifted       O
9           by       O
10        FIFA   B-ORG
11          on       O
12      Friday       O
13         and       O
14         was       O
15         set       O
16          to       O
17        make       O
18         his       O
19   Sheffield   B-ORG
20   Wednesday   I-ORG
21    comeback       O
22     against       O
23   Liverpool   B-ORG
24          on       O
25    Saturday       O
26           .       O


## define model

Define our model using the **`BertTokenClassifier`** class

* We will use the **`ignore_label`** option to exclude the `'O'` (non-named-entity) label when calculating `f1`. Non-named-entity tokens make up the large majority of the labels, and `f1` is typically reported with this class excluded.


* We will also use the `'bert-base-cased'` model as casing provides an important signal for NER.
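
To see why excluding `'O'` matters, here is a minimal sketch of a token-level, micro-averaged F1 that ignores one label. The function `micro_f1` is a hypothetical helper written for illustration; it shows one common definition and is not taken from the `bert_sklearn` source, whose internal scoring may differ in detail.

```python
def micro_f1(y_true, y_pred, ignore_label="O"):
    """Token-level micro-averaged F1 over all labels except `ignore_label`."""
    tp = fp = fn = 0
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            if true != ignore_label:
                tp += 1  # correctly predicted entity token
        else:
            if pred != ignore_label:
                fp += 1  # predicted an entity label that is wrong
            if true != ignore_label:
                fn += 1  # missed (or mislabeled) a true entity token
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: 8 of 10 tokens are 'O'
y_true = ["O", "O", "O", "O", "O", "O", "O", "O", "B-PER", "I-PER"]
y_pred = ["O"] * 10  # a degenerate model that predicts 'O' everywhere

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                  # 0.8
print(micro_f1(y_true, y_pred))  # 0.0
```

The degenerate model reaches 80% accuracy on this toy data while finding no entities at all, which the F1 with `'O'` excluded makes visible.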

In [4]:
# define model
model = BertTokenClassifier(bert_model='bert-base-cased',
                            epochs=3,
                            learning_rate=5e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            validation_fraction=0.05,                            
                            label_list=label_list,
                            ignore_label=['O'])

Building sklearn token classifier...


One issue that we need to be mindful of is the max token length in the token lists. 
There are 2 complications:
    
    
* The **`max_seq_length`** parameter in the model will dictate how long a token sequence we can handle. All input token sequences longer than this will be truncated. The limit on this is 512, but we would like smaller sequences since they are much faster and consume less memory on the GPU. 
    
    
* Each token will be tokenized again by the BERT wordpiece tokenizer. This will result in longer token sequences than the input token lists. 
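
The second point can be illustrated with a toy greedy longest-match-first splitter in the style of WordPiece. This is only a sketch: `wordpiece` is a hypothetical helper and `toy_vocab` is a made-up vocabulary, not the real `bert-base-cased` vocabulary, so the actual splits produced by BERT will differ.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into wordpieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary for illustration only
toy_vocab = {"Reggie", "Blink", "##er", "had", "Uzbek", "##istan"}

tokens = ["Reggie", "Blinker", "had", "Uzbekistan"]
pieces = [wordpiece(t, toy_vocab) for t in tokens]

print(pieces)
# [['Reggie'], ['Blink', '##er'], ['had'], ['Uzbek', '##istan']]
print(len(tokens), "input tokens ->", sum(len(p) for p in pieces), "wordpieces")
# 4 input tokens -> 6 wordpieces
```

Four input tokens become six wordpieces here, which is why the wordpiece length, not the input token count, has to fit within `max_seq_length`.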
    
    
Let's check our bert token lengths by running the data through the BERT wordpiece tokenizer:

In [5]:
%%time
print("Bert wordpiece tokenizer max token length in train: %d tokens"% model.get_max_token_len(X_train))
print("Bert wordpiece tokenizer max token length in dev: %d tokens"% model.get_max_token_len(X_dev))
print("Bert wordpiece tokenizer max token length in test: %d tokens"% model.get_max_token_len(X_test))


Bert wordpiece tokenizer max token length in train: 171 tokens
Bert wordpiece tokenizer max token length in dev: 149 tokens
Bert wordpiece tokenizer max token length in test: 146 tokens
CPU times: user 4.24 s, sys: 12 ms, total: 4.25 s
Wall time: 4.92 s


So as long as we set **`max_seq_length`** to at least 173 = 171 + 2 (for the `'[CLS]'` and `'[SEP]'` delimiter tokens that BERT uses), none of the data will be truncated.

If we set **`max_seq_length`** to less than that, we can still finetune the model, but we will lose the training signal from the truncated tokens in the training data. Also, at prediction time, any token that has been truncated will be assigned the majority label, `'O'`.
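
When choosing a smaller limit, it helps to estimate how much data would be cut off. The sketch below is a hypothetical helper, not part of `bert_sklearn`: it assumes two special tokens per sequence (as described above) and that truncation simply drops the trailing wordpieces. The lengths in the example are made up.

```python
def truncation_report(wordpiece_lengths, max_seq_length, num_special_tokens=2):
    """Count sequences and wordpieces that would not fit in `max_seq_length`."""
    # room left for real wordpieces once '[CLS]' and '[SEP]' are added
    budget = max_seq_length - num_special_tokens
    truncated_sequences = sum(1 for n in wordpiece_lengths if n > budget)
    lost_wordpieces = sum(max(0, n - budget) for n in wordpiece_lengths)
    return truncated_sequences, lost_wordpieces


# Made-up wordpiece lengths for five sentences
lengths = [12, 40, 171, 64, 130]

print(truncation_report(lengths, 173))  # (0, 0)
print(truncation_report(lengths, 128))  # (2, 49)
```

With a limit of 128 the budget is 126 wordpieces, so the sentences of length 171 and 130 lose 45 and 4 wordpieces respectively.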

## finetune model on train and predict on test

In [4]:
%%time
model.max_seq_length = 173
model.gradient_accumulation_steps = 2
print(model)

# finetune model on train data
model.fit(X_train, y_train)

# score model on dev data
f1_dev = model.score(X_dev, y_dev)
print("Dev f1: %0.02f"%(f1_dev))

# score model on test data
f1_test = model.score(X_test, y_test)
print("Test f1: %0.02f"%(f1_test))

# get predictions on test data
y_preds = model.predict(X_test)

# calculate the probability of each class
y_probs = model.predict_proba(X_test)

# print report on classifier stats
print(classification_report(flatten(y_test), flatten(y_preds)))

BertTokenClassifier(bert_model='bert-base-cased', epochs=3,
          eval_batch_size=16, fp16=False, gradient_accumulation_steps=2,
          ignore_label=['O'],
          label_list=['B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O'],
          learning_rate=5e-05, local_rank=-1, logfile='bert_sklearn.log',
          loss_scale=0, max_seq_length=173, num_mlp_hiddens=500,
          num_mlp_layers=0, random_state=42, restore_file=None,
          train_batch_size=16, use_cuda=True, validation_fraction=0.05,
          warmup_proportion=0.1)
Loading bert-base-cased model...
Defaulting to linear classifier/regressor
train data size: 14238, validation data size: 749


Training: 100%|██████████| 1780/1780 [15:03<00:00,  1.92it/s, loss=0.0113]
                                                           

Epoch 1, Train loss: 0.0113, Val loss: 0.0034, Val accy: 98.89%, f1: 94.74


Training: 100%|██████████| 1780/1780 [15:51<00:00,  2.04it/s, loss=0.00174]
                                                           

Epoch 2, Train loss: 0.0017, Val loss: 0.0029, Val accy: 99.04%, f1: 95.38


Training: 100%|██████████| 1780/1780 [15:36<00:00,  2.00it/s, loss=0.000668]
                                                           

Epoch 3, Train loss: 0.0007, Val loss: 0.0028, Val accy: 99.28%, f1: 96.68


Dev f1: 96.04


Test f1: 91.97


                                                             

              precision    recall  f1-score   support

       B-LOC       0.93      0.93      0.93      1668
      B-MISC       0.83      0.84      0.84       702
       B-ORG       0.90      0.91      0.91      1661
       B-PER       0.96      0.97      0.96      1617
       I-LOC       0.82      0.93      0.87       257
      I-MISC       0.64      0.78      0.70       216
       I-ORG       0.87      0.91      0.89       835
       I-PER       0.99      0.99      0.99      1156
           O       1.00      0.99      0.99     38554

   micro avg       0.98      0.98      0.98     46666
   macro avg       0.88      0.92      0.90     46666
weighted avg       0.98      0.98      0.98     46666

CPU times: user 32min 24s, sys: 21min 2s, total: 53min 27s
Wall time: 53min 24s


If we want span-level stats, we can run the original [perl script](https://www.clips.uantwerpen.be/conll2003/ner/bin/conlleval) (saved locally as `conlleval.pl`) that was used to evaluate results for the `CoNLL-2000/2003 shared task`:

In [5]:
# write out predictions to file for conlleval.pl
iter_zip = zip(flatten(X_test),flatten(y_test),flatten(y_preds))
preds = [" ".join([token, y, y_pred]) for token, y, y_pred in iter_zip]
with open("preds.txt",'w') as f:
    for x in preds:
        f.write(str(x)+'\n') 

# run conlleval perl script 
!perl ./conlleval.pl < preds.txt
!rm preds.txt

processed 46666 tokens with 5648 phrases; found: 5740 phrases; correct: 5173.
accuracy:  98.15%; precision:  90.12%; recall:  91.59%; FB1:  90.85
              LOC: precision:  92.24%; recall:  92.69%; FB1:  92.46  1676
             MISC: precision:  78.07%; recall:  81.62%; FB1:  79.81  734
              ORG: precision:  87.64%; recall:  90.07%; FB1:  88.84  1707
              PER: precision:  96.00%; recall:  96.35%; FB1:  96.17  1623
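
The idea behind these span-level numbers can be sketched in a few lines of Python: a predicted entity counts as correct only if its type and both boundaries match exactly. The helpers `spans` and `span_prf` are hypothetical and only approximate the Perl script; edge cases such as ill-formed tag sequences may be handled differently there.

```python
def spans(tags):
    """Return the set of (type, start, end) entity spans in an IOB2 tag list."""
    found, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        inside = tag.startswith("I-") and tag[2:] == etype
        if not inside:
            if etype is not None:
                found.add((etype, start, i))
            start, etype = (i, tag[2:]) if tag != "O" else (None, None)
    return found


def span_prf(y_true, y_pred):
    """Exact-match span precision, recall and F1 over a list of sentences."""
    n_true = n_pred = n_correct = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        t, p = spans(true_tags), spans(pred_tags)
        n_true += len(t)
        n_pred += len(p)
        n_correct += len(t & p)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if n_correct else 0.0
    return precision, recall, f1


# Toy example: the first organization span is cut short by the prediction
y_true_toy = [["B-ORG", "I-ORG", "O", "B-ORG"]]
y_pred_toy = [["B-ORG", "O", "O", "B-ORG"]]

print(span_prf(y_true_toy, y_pred_toy))  # (0.5, 0.5, 0.5)
```

Three of the four toy tokens are tagged correctly, yet only one of two spans matches. The same relationship holds in the output above, where precision is correct/found = 5173/5740 and recall is correct/total = 5173/5648.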


Let's also take a look at the example from the test set we looked at before and compare the predicted tags with the actuals:

In [8]:
i = 152
tokens = X_test[i]
labels = y_test[i]
preds  = y_preds[i]
prob   = y_probs[i]

data = {"token": tokens,"label": labels,"predict": preds}
df=pd.DataFrame(data=data)
print(df)

         token   label predict
0        Dutch  B-MISC  B-MISC
1      forward       O       O
2       Reggie   B-PER   B-PER
3      Blinker   I-PER   I-PER
4          had       O       O
5          his       O       O
6   indefinite       O       O
7   suspension       O       O
8       lifted       O       O
9           by       O       O
10        FIFA   B-ORG   B-ORG
11          on       O       O
12      Friday       O       O
13         and       O       O
14         was       O       O
15         set       O       O
16          to       O       O
17        make       O       O
18         his       O       O
19   Sheffield   B-ORG   B-ORG
20   Wednesday   I-ORG   I-ORG
21    comeback       O       O
22     against       O       O
23   Liverpool   B-ORG   B-ORG
24          on       O       O
25    Saturday       O       O
26           .       O       O


Let's look at the predicted probability of each label for this observation:

In [9]:
# pretty-print the label probabilities for this observation
tokens_prob = model.tokens_proba(tokens, prob)

         token  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER    O
0        Dutch   0.00    1.00   0.00   0.00   0.00    0.00   0.00   0.00 0.00
1      forward   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
2       Reggie   0.00    0.00   0.00   1.00   0.00    0.00   0.00   0.00 0.00
3      Blinker   0.00    0.00   0.00   0.00   0.00    0.00   0.00   1.00 0.00
4          had   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
5          his   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
6   indefinite   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
7   suspension   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
8       lifted   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
9           by   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
10        FIFA   0.00    0.00   1.00   0.00   0.00    0.00   0.00   0.00 0.00
11          on   0.00    0.00   0.00   0.00   0.00    0.00   0.0
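
Per-token probabilities like these can also be used to flag uncertain predictions. The sketch below assumes one row of probabilities per token, with columns in the same order as `label_list`, as in the printed table. The helper `top_labels`, the 0.9 threshold, and the probability values are all made up for illustration.

```python
label_list = ["B-LOC", "B-MISC", "B-ORG", "B-PER",
              "I-LOC", "I-MISC", "I-ORG", "I-PER", "O"]


def top_labels(prob_rows, labels):
    """Return (label, probability) of the most likely label for each token."""
    result = []
    for row in prob_rows:
        best = max(range(len(row)), key=row.__getitem__)  # index of the max
        result.append((labels[best], row[best]))
    return result


# Made-up probabilities for two tokens (each row sums to 1)
toy_probs = [
    [0.00, 0.00, 0.98, 0.00, 0.00, 0.00, 0.02, 0.00, 0.00],
    [0.40, 0.00, 0.55, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05],
]

for token, (label, p) in zip(["FIFA", "Liverpool"], top_labels(toy_probs, label_list)):
    flag = "  <- low confidence" if p < 0.9 else ""
    print(f"{token:10s} {label:6s} {p:.2f}{flag}")
```

In this toy case both tokens get `'B-ORG'`, but the second row splits its mass between `'B-ORG'` and `'B-LOC'`, which is the kind of ambiguity a threshold can surface for review.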

Finally, predict the tags and tag probabilities on some new text:

In [10]:
text = "Jefferson wants to go to France."       

tag_predicts  = model.tag_text(text)       
prob_predicts = model.tag_text_proba(text)    

       token predicted tags
0  Jefferson          B-PER
1      wants              O
2         to              O
3         go              O
4         to              O
5     France          B-LOC
6          .              O


                                                         

       token  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER    O
0  Jefferson   0.00    0.00   0.00   1.00   0.00    0.00   0.00   0.00 0.00
1      wants   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
2         to   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
3         go   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
4         to   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
5     France   1.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 0.00
6          .   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00




## save/load

To save the finetuned model to disk and restore it later:
   

In [None]:
# save model
savepath = "/data/ner_english.bin"
model.save(savepath)

# restore model
model = load_model(savepath)