# Chinese Named Entity Recognition (NER)

The data is from https://github.com/ProHiryu/bert-chinese-ner

The data is in a format similar to the **`CoNLL 2003`** shared task, but with 3 types of `Named Entities` (persons, locations, and organizations) rather than the 4 used in CoNLL 2003; there is no miscellaneous type. The data uses the [IOB2](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) tagging scheme: the first token of each entity gets a `'B-'` tag, every following token inside the same entity gets an `'I-'` tag, and tokens outside any entity are tagged `'O'`.


### example 

In [None]:
token  tag
美     B-LOC
国     I-LOC
的     O
华     B-PER
莱     I-PER
士     I-PER
，     O
我     O
和     O
他     O
谈     O
笑     O
风     O
生     O
。     O

The first column is the `token`, the second column is the `NER` tag. 
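As a rough illustration of how IOB2 tags map back to entities, the sketch below groups tagged tokens into spans. The helper name `iob2_spans` is hypothetical and is not part of `bert_sklearn`; stray `'I-'` tags that do not continue an open entity are simply dropped here, which is a simplifying assumption.

```python
def iob2_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # a B- tag always opens a new entity, closing any open one
            if current_type is not None:
                spans.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # an I- tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # 'O' (or a stray I- tag) closes any open entity
            if current_type is not None:
                spans.append((current_type, "".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, "".join(current_tokens)))
    return spans


tokens = list("美国的华莱士")
tags = ["B-LOC", "I-LOC", "O", "B-PER", "I-PER", "I-PER"]
print(iob2_spans(tokens, tags))
```

Because a `'B-'` tag always starts a new entity, tagging all three characters of a name with `'B-PER'` would produce three one-character persons instead of one.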

So for the named entity recognition (NER) task our data consists of features, `X`, and labels, `y`:


* **`X`** :  a list of lists of tokens, one inner list per sentence


* **`y`** :  a list of lists of NER tags, one tag per token
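To make the shape of `X` and `y` concrete, here is a minimal sketch that builds both from a tiny in-memory sample. The `sample` string is made up for illustration and assumes two whitespace-separated columns with a blank line between sentences.

```python
# two short sentences in "token tag" format, separated by a blank line
sample = "美 B-LOC\n国 I-LOC\n的 O\n\n我 O\n和 O\n他 O"

X, y = [], []
for sentence in sample.strip().split("\n\n"):
    rows = [line.split() for line in sentence.split("\n")]
    X.append([row[0] for row in rows])  # first column: tokens
    y.append([row[1] for row in rows])  # second column: NER tags

# every sentence must have exactly one tag per token
assert all(len(toks) == len(tags) for toks, tags in zip(X, y))

print(X)
print(y)
```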

## get data


In [2]:
%%bash
DATADIR="ner_chinese"
if test ! -d "$DATADIR";then
    echo "Creating $DATADIR dir"
    mkdir "$DATADIR"
    cd "$DATADIR"
    wget https://raw.githubusercontent.com/ProHiryu/bert-chinese-ner/master/data/train.txt
    wget https://raw.githubusercontent.com/ProHiryu/bert-chinese-ner/master/data/dev.txt
    wget https://raw.githubusercontent.com/ProHiryu/bert-chinese-ner/master/data/test.txt
fi

In [3]:
"""
Train data: 50658 sentences, 2169879 tokens
Dev data: 4631 sentences, 172601 tokens
Test data: 69 sentences, 2294 tokens
"""
import os
import sys

import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

sys.path.append("../") 
from bert_sklearn import BertTokenClassifier, load_model

def flatten(l):
    return [item for sublist in l for item in sublist]

def read_CoNLL2003_format(filename, idx=3):
    """Read file in CoNLL-2003 shared task format"""
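    # read file; the encoding is set explicitly so the Chinese text
    # is decoded the same way on every platform (assumes UTF-8 data)
    with open(filename, encoding="utf-8") as f:
        lines = f.read().strip()
    
    # find sentence-like boundaries
    lines = lines.split("\n\n")  
    
    # split on newlines
    lines = [line.split("\n") for line in lines]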
    
    # get tokens
    tokens = [[l.split()[0] for l in line] for line in lines]
    
    # get labels/tags
    labels = [[l.split()[idx] for l in line] for line in lines]
    
    # convert to df
    data = {'tokens': tokens, 'labels': labels}
    df = pd.DataFrame(data=data)
    
    return df

DATADIR = "./ner_chinese/"

def get_data(trainfile=DATADIR + "train.txt",
             devfile=DATADIR + "dev.txt",
             testfile=DATADIR + "test.txt"):

    train = read_CoNLL2003_format(trainfile, 1)
    print("Train data: %d sentences, %d tokens"%(len(train),len(flatten(train.tokens))))

    dev = read_CoNLL2003_format(devfile, 1)
    print("Dev data: %d sentences, %d tokens"%(len(dev),len(flatten(dev.tokens))))

    test = read_CoNLL2003_format(testfile, 1)
    print("Test data: %d sentences, %d tokens"%(len(test),len(flatten(test.tokens))))
    
    return train, dev, test


train, dev, test = get_data()

X_train, y_train = train.tokens, train.labels
X_dev, y_dev = dev.tokens, dev.labels
X_test, y_test = test.tokens, test.labels

label_list = np.unique(flatten(y_train))
label_list = list(label_list)
print("\nNER tags/labels:\n", label_list)

Train data: 50658 sentences, 2169879 tokens
Dev data: 4631 sentences, 172601 tokens
Test data: 69 sentences, 2294 tokens

NER tags/labels:
 ['B-LOC', 'B-ORG', 'B-PER', 'I-LOC', 'I-ORG', 'I-PER', 'O']


In [4]:
train.head()

|   | tokens | labels |
|---|--------|--------|
| 0 | [当, 希, 望, 工, 程, 救, 助, 的, 百, 万, 儿, 童, 成, 长, 起, ... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... |
| 1 | [藏, 书, 本, 来, 就, 是, 所, 有, 传, 统, 收, 藏, 门, 类, 中, ... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... |
| 2 | [因, 有, 关, 日, 寇, 在, 京, 掠, 夺, 文, 物, 详, 情, ，, 藏, ... | [O, O, O, B-LOC, O, O, B-LOC, O, O, O, O, O, O... |
| 3 | [我, 们, 藏, 有, 一, 册, 1, 9, 4, 5, 年, 6, 月, 油, 印, ... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... |
| 4 | [以, 家, 乡, 的, 历, 史, 文, 献, 、, 特, 定, 历, 史, 时, 期, ... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... |


Let's look at one (tokens, labels) pair from the test set and make sure it makes sense:

In [4]:
i = 22
tokens = X_test[i]
labels = y_test[i]

print(" ".join(tokens))
print(" ".join(labels))

在 一 个 统 一 的 中 华 人 民 共 和 国 ， 可 以 实 行 社 会 主 义 和 资 本 主 义 两 种 制 度 ， 这 是 为 了 民 族 、 国 家 的 根 本 利 益 。
O O O O O O B-LOC I-LOC I-LOC I-LOC I-LOC I-LOC I-LOC O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O


Define our model using the **`BertTokenClassifier`** class

* We will set the **`ignore_label`** option to exclude the `'O'` (non-entity) label when calculating `f1`. Non-entity tokens make up the vast majority of the labels, and `f1` is typically reported with this class excluded.



* We will also use the `'bert-base-chinese'` model.
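To see why excluding `'O'` matters, the sketch below computes a token-level micro-averaged `f1` that skips an ignored label and compares it with plain accuracy. The function `micro_f1` and the toy labels are hypothetical; this is only an illustration and may differ from how `BertTokenClassifier` computes its score internally.

```python
def micro_f1(y_true, y_pred, ignore_label=("O",)):
    """Token-level micro-averaged F1 over all labels not in ignore_label."""
    tp = fp = fn = 0
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            if true not in ignore_label:
                tp += 1  # correct entity token
        else:
            if pred not in ignore_label:
                fp += 1  # predicted an entity tag that is wrong
            if true not in ignore_label:
                fn += 1  # missed (or mislabeled) a true entity tag
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# toy data: 8 non-entity tokens and one two-token person name
y_true = ["O"] * 8 + ["B-PER", "I-PER"]
y_pred = ["O"] * 8 + ["B-PER", "O"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy: %.2f" % accuracy)
print("f1 without 'O': %.2f" % micro_f1(y_true, y_pred))
```

Accuracy stays high because the many `'O'` tokens are easy to get right, while the `f1` over entity tags exposes that half of the name was missed.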

In [5]:
# define model
model = BertTokenClassifier(bert_model='bert-base-chinese',
                            epochs=3,
                            learning_rate=2e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            ignore_label=['O'])

Building sklearn token classifier...


One issue that we need to be mindful of is the max token length in the token lists. 
There are 2 complications:
    
* BERT has a **`max_seq_length`** parameter that dictates how long a token sequence it can handle. Input sequences longer than this are truncated. The upper limit is 512, but we would like shorter sequences since they are much faster and consume less GPU memory. 
    
    
* Each token will be tokenized again by the BERT wordpiece tokenizer. This can result in token sequences that are longer than the input token lists.
    
Let's check our bert token lengths by running the data through the BERT wordpiece tokenizer:

In [6]:
%%time
print("Bert wordpiece tokenizer max token length in train: %d tokens"% model.get_max_token_len(X_train))
print("Bert wordpiece tokenizer max token length in dev: %d tokens"% model.get_max_token_len(X_dev))
print("Bert wordpiece tokenizer max token length in test: %d tokens"% model.get_max_token_len(X_test))

Bert wordpiece tokenizer max token length in train: 100 tokens
Bert wordpiece tokenizer max token length in dev: 100 tokens
Bert wordpiece tokenizer max token length in test: 83 tokens
CPU times: user 17 s, sys: 12 ms, total: 17 s
Wall time: 17.4 s


So based on this we will set `max_seq_length` to 102 = 100 + 2 (for the `'[CLS]'` and `'[SEP]'` tokens that BERT uses).
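The arithmetic above can be written as a small helper. The name `choose_max_seq_length` and its parameters are hypothetical; the sketch only assumes two special tokens per sequence and the 512-token limit mentioned earlier.

```python
def choose_max_seq_length(max_wordpiece_len, num_special_tokens=2, model_limit=512):
    """Smallest max_seq_length that avoids truncation, capped at the model limit."""
    return min(max_wordpiece_len + num_special_tokens, model_limit)


# longest wordpiece sequences reported above for train, dev and test
longest = max(100, 100, 83)
print(choose_max_seq_length(longest))
```

If the longest sequence exceeded 510 wordpieces, the cap would apply and the longest inputs would be truncated.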

## finetune model on train and predict on test

In [7]:
%%time
# set max_seq_length
model.max_seq_length = 102
print(model)

# finetune model on train data
model.fit(X_train, y_train)

# score model on dev data
f1_dev = model.score(X_dev, y_dev)
print("Dev f1: %0.02f"%(f1_dev))

# score model on test data
f1_test = model.score(X_test, y_test)
print("Test f1: %0.02f"%(f1_test))

# get predictions on test data
y_preds = model.predict(X_test)

# calculate the probability of each class
y_probs = model.predict_proba(X_test)

# print report on classifier stats
print(classification_report(flatten(y_test), flatten(y_preds)))

BertTokenClassifier(bert_model='bert-base-chinese', epochs=3,
          eval_batch_size=16, fp16=False, gradient_accumulation_steps=1,
          ignore_label=['O'], label_list=None, learning_rate=2e-05,
          local_rank=-1, logfile='bert_sklearn.log', loss_scale=0,
          max_seq_length=102, num_mlp_hiddens=500, num_mlp_layers=0,
          random_state=42, restore_file=None, train_batch_size=16,
          use_cuda=True, validation_fraction=0.1, warmup_proportion=0.1)
Loading bert-base-chinese model...


100%|██████████| 382072689/382072689 [00:28<00:00, 13584391.81B/s]


Defaulting to linear classifier/regressor
train data size: 45593, validation data size: 5065


Training: 100%|██████████| 2850/2850 [30:49<00:00,  1.64it/s, loss=0.0297]
                                                             

Epoch 1, Train loss: 0.0297, Val loss: 0.0089, Val accy: 99.28%, f1: 96.06


Training: 100%|██████████| 2850/2850 [33:36<00:00,  1.54it/s, loss=0.00517]
                                                             

Epoch 2, Train loss: 0.0052, Val loss: 0.0071, Val accy: 99.49%, f1: 97.07


Training: 100%|██████████| 2850/2850 [32:43<00:00,  1.27it/s, loss=0.00224]
                                                             

Epoch 3, Train loss: 0.0022, Val loss: 0.0074, Val accy: 99.53%, f1: 97.32


Predicting:   0%|          | 0/5 [00:00<?, ?it/s]            

Dev f1: 96.60


Predicting:   0%|          | 0/5 [00:00<?, ?it/s]        

Test f1: 96.26


                                                         

              precision    recall  f1-score   support

       B-LOC       1.00      0.98      0.99        45
       B-ORG       0.80      1.00      0.89         8
       B-PER       0.92      0.92      0.92        25
       I-LOC       0.99      1.00      0.99        72
       I-ORG       0.88      1.00      0.94        29
       I-PER       0.94      0.94      0.94        32
           O       1.00      1.00      1.00      2083

   micro avg       0.99      0.99      0.99      2294
   macro avg       0.93      0.98      0.95      2294
weighted avg       0.99      0.99      0.99      2294

CPU times: user 1h 1min 13s, sys: 41min 12s, total: 1h 42min 25s
Wall time: 1h 42min 35s




If we want span-level stats, we can run the original [perl script](https://www.clips.uantwerpen.be/conll2003/ner/bin/conlleval) that was used to evaluate results for the `CoNLL-2000/2003 shared task`:

In [5]:
# write out predictions to file for conlleval.pl
iter_zip = zip(flatten(X_test),flatten(y_test),flatten(y_preds))
preds = [" ".join([token, y, y_pred]) for token, y, y_pred in iter_zip]
with open("preds.txt",'w') as f:
    for x in preds:
        f.write(str(x)+'\n') 

# run conlleval perl script 
!perl ./conlleval.pl < preds.txt
!rm preds.txt

processed 2294 tokens with 78 phrases; found: 79 phrases; correct: 71.
accuracy:  99.43%; precision:  89.87%; recall:  91.03%; FB1:  90.45
              LOC: precision:  97.73%; recall:  95.56%; FB1:  96.63  44
              ORG: precision:  80.00%; recall: 100.00%; FB1:  88.89  10
              PER: precision:  80.00%; recall:  80.00%; FB1:  80.00  25
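For readers who want to check these numbers by hand, the sketch below shows span-level scoring in plain Python. It is not a replacement for `conlleval`: the helpers `spans` and `prf` are hypothetical, assume well-formed IOB2 tags, and count a predicted span as correct only if its type and boundaries match exactly.

```python
def spans(tags):
    """Return the set of (type, start, end) spans in an IOB2 tag list."""
    found, start = set(), None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = (start is not None and tag.startswith("I-")
                     and tag[2:] == tags[start][2:])
        if start is not None and not continues:
            found.add((tags[start][2:], start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return found


def prf(correct, found, gold):
    """Precision, recall and F1 from span counts."""
    precision = correct / found if found else 0.0
    recall = correct / gold if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# toy example: the person name is only half recognized
gold_tags = ["B-LOC", "I-LOC", "O", "B-PER", "I-PER"]
pred_tags = ["B-LOC", "I-LOC", "O", "B-PER", "O"]
gold_spans, pred_spans = spans(gold_tags), spans(pred_tags)
toy = prf(len(gold_spans & pred_spans), len(pred_spans), len(gold_spans))
print("toy precision/recall/f1:", toy)

# counts taken from the conlleval output above
p, r, f = prf(correct=71, found=79, gold=78)
print("precision: %.2f%%; recall: %.2f%%; FB1: %.2f" % (100 * p, 100 * r, 100 * f))
```

In the toy example 4 of 5 tokens are tagged correctly, yet only 1 of 2 spans matches, which is why span-level scores tend to be lower than token-level ones.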


Let's also take a look at the example from the test set we looked at before and compare the predicted tags with the actuals:

In [11]:
i = 22
tokens = X_test[i]
labels = y_test[i]
preds = y_preds[i]
probs   = y_probs[i]

data = {"token": tokens,"label": labels,"predict": preds}
df=pd.DataFrame(data=data)
print(df)

   token  label predict
0      在      O       O
1      一      O       O
2      个      O       O
3      统      O       O
4      一      O       O
5      的      O       O
6      中  B-LOC   B-LOC
7      华  I-LOC   I-LOC
8      人  I-LOC   I-LOC
9      民  I-LOC   I-LOC
10     共  I-LOC   I-LOC
11     和  I-LOC   I-LOC
12     国  I-LOC   I-LOC
13     ，      O       O
14     可      O       O
15     以      O       O
16     实      O       O
17     行      O       O
18     社      O       O
19     会      O       O
20     主      O       O
21     义      O       O
22     和      O       O
23     资      O       O
24     本      O       O
25     主      O       O
26     义      O       O
27     两      O       O
28     种      O       O
29     制      O       O
30     度      O       O
31     ，      O       O
32     这      O       O
33     是      O       O
34     为      O       O
35     了      O       O
36     民      O       O
37     族      O       O
38     、      O       O
39     国      O       O
40     家      O 

In [12]:
# pretty-print the probs for this observation
tokens_prob = model.tokens_proba(tokens, probs)

   token  B-LOC  B-ORG  B-PER  I-LOC  I-ORG  I-PER    O
0      在   0.00   0.00   0.00   0.00   0.00   0.00 1.00
1      一   0.00   0.00   0.00   0.00   0.00   0.00 1.00
2      个   0.00   0.00   0.00   0.00   0.00   0.00 1.00
3      统   0.00   0.00   0.00   0.00   0.00   0.00 1.00
4      一   0.00   0.00   0.00   0.00   0.00   0.00 1.00
5      的   0.00   0.00   0.00   0.00   0.00   0.00 1.00
6      中   1.00   0.00   0.00   0.00   0.00   0.00 0.00
7      华   0.00   0.00   0.00   1.00   0.00   0.00 0.00
8      人   0.00   0.00   0.00   1.00   0.00   0.00 0.00
9      民   0.00   0.00   0.00   1.00   0.00   0.00 0.00
10     共   0.00   0.00   0.00   1.00   0.00   0.00 0.00
11     和   0.00   0.00   0.00   1.00   0.00   0.00 0.00
12     国   0.00   0.00   0.00   1.00   0.00   0.00 0.00
13     ，   0.00   0.00   0.00   0.00   0.00   0.00 1.00
14     可   0.00   0.00   0.00   0.00   0.00   0.00 1.00
15     以   0.00   0.00   0.00   0.00   0.00   0.00 1.00
16     实   0.00   0.00   0.00   0.00   0.00   0.

Finally, let's predict the tags and tag probabilities on some new text:

In [13]:
text = "乔治华盛顿想访问法国。"     

tag_predicts  = model.tag_text(text)       
prob_predicts = model.tag_text_proba(text)    

Predicting:   0%|          | 0/1 [00:00<?, ?it/s]        

   token predicted tags
0      乔          B-PER
1      治          I-PER
2      华          B-PER
3      盛          I-PER
4      顿          I-PER
5      想              O
6      访              O
7      问              O
8      法          B-LOC
9      国          I-LOC
10     。              O


                                                         

   token  B-LOC  B-ORG  B-PER  I-LOC  I-ORG  I-PER    O
0      乔   0.00   0.00   1.00   0.00   0.00   0.00 0.00
1      治   0.00   0.00   0.00   0.00   0.00   1.00 0.00
2      华   0.18   0.01   0.61   0.00   0.01   0.19 0.00
3      盛   0.00   0.00   0.00   0.01   0.00   0.99 0.00
4      顿   0.00   0.00   0.00   0.02   0.00   0.97 0.00
5      想   0.00   0.00   0.00   0.00   0.00   0.00 1.00
6      访   0.00   0.00   0.00   0.00   0.00   0.00 1.00
7      问   0.00   0.00   0.00   0.00   0.00   0.00 1.00
8      法   1.00   0.00   0.00   0.00   0.00   0.00 0.00
9      国   0.00   0.00   0.00   1.00   0.00   0.00 0.00
10     。   0.00   0.00   0.00   0.00   0.00   0.00 1.00
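One possible use of the tag probabilities is to flag tokens where the model is unsure. The sketch below takes a few rows copied from the table above, picks the most likely tag for each, and lists the tokens whose top probability falls below a cut-off. The `threshold` value is a hypothetical choice, not something the library prescribes.

```python
labels = ["B-LOC", "B-ORG", "B-PER", "I-LOC", "I-ORG", "I-PER", "O"]

# (token, probabilities) rows copied from the table above
rows = [
    ("乔", [0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00]),
    ("治", [0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00]),
    ("华", [0.18, 0.01, 0.61, 0.00, 0.01, 0.19, 0.00]),
    ("盛", [0.00, 0.00, 0.00, 0.01, 0.00, 0.99, 0.00]),
]

threshold = 0.9  # hypothetical cut-off for "confident" predictions
uncertain = []
for token, probs in rows:
    # index of the most likely tag for this token
    best = max(range(len(labels)), key=probs.__getitem__)
    if probs[best] < threshold:
        uncertain.append((token, labels[best], probs[best]))

print(uncertain)
```

Here only `华` is flagged: the model splits its probability between `B-PER`, `I-PER` and `B-LOC` for that character.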


