We implemented a BiLSTM-CRF model for NER, which identifies three types of entities: PERSON, LOCATION and ORGANIZATION.
- Python: 3.6
- TensorFlow: 1.8
The dataset comes from the MSRA corpus.
The directory ./data contains:
- the preprocessed data files train_data and test_data
- a vocabulary file word2id.pkl
- a char embedding file vector.npy  # pre-trained char embeddings; the embedding is randomly initialized when this file does not exist (a loading sketch follows this list)
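A minimal sketch of how these files might be loaded. The function name, embedding dimension, and initialization range are illustrative assumptions, not the repo's actual code:

```python
import os
import pickle
import numpy as np

def load_vocab_and_embeddings(data_dir='./data', embedding_dim=300):
    # word2id.pkl maps each character to an integer id
    with open(os.path.join(data_dir, 'word2id.pkl'), 'rb') as f:
        word2id = pickle.load(f)

    vec_path = os.path.join(data_dir, 'vector.npy')
    if os.path.exists(vec_path):
        # pre-trained char embeddings, shape (vocab_size, embedding_dim)
        embeddings = np.load(vec_path)
    else:
        # fall back to random initialization when vector.npy is absent
        embeddings = np.random.uniform(
            -0.25, 0.25, (len(word2id), embedding_dim)).astype(np.float32)
    return word2id, embeddings
```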
The directory ./data/case contains:
- an input file 'input.txt'  # when you want to use our model for NER, put your sentences into 'input.txt'
- a 'result.json'  # all identified entities are saved here
The directory ./model is used to save the trained model files.
We use the BIO tagging scheme, with one character and its label per line. Sentences are separated by an empty line, as in the example below (a parsing sketch follows it).
我 O
们 O
是 O
受 O
到 O
郑 B-PER
振 I-PER
铎 I-PER
先 O
生 O
、 O
阿 B-PER
英 I-PER
先 O
生 O
著 O
作 O
的 O
启 O
示 O
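A minimal sketch of parsing this one-character-per-line BIO format into (chars, tags) sentence pairs; this helper is assumed for illustration and may differ from the repo's own data-reading code:

```python
def read_bio_corpus(path):
    sentences, chars, tags = [], [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line.strip():
                # an empty line marks the end of a sentence
                if chars:
                    sentences.append((chars, tags))
                    chars, tags = [], []
                continue
            char, tag = line.split()
            chars.append(char)
            tags.append(tag)
    if chars:  # handle a file that does not end with a blank line
        sentences.append((chars, tags))
    return sentences
```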
We use Dr. Xiao Han’s bert-as-service to obtain the BERT embeddings for NER, which is used as follows:
- 1. Environmental requirements: Python version >= 3.5, TensorFlow version >= 1.10
- 2. pip install bert-serving-server
- 3. pip install bert-serving-client
- 4. Download the pre-trained Chinese BERT model chinese_L-12_H-768_A-12.zip
- 5. Start up bert-as-service: bert-serving-start -model_dir /home/chinese_L-12_H-768_A-12 -num_worker=2
- 6. Get the word vectors using getBertVector.py (a client-side sketch follows this list)
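A minimal sketch of fetching vectors from the running service with bert-serving-client; the exact logic in getBertVector.py may differ:

```python
from bert_serving.client import BertClient

# connects to the bert-serving-start instance launched in step 5
bc = BertClient()

# with the default pooling, encode returns one 768-dim vector per input sentence
# for the chinese_L-12_H-768_A-12 model
vectors = bc.encode(['我们是受到郑振铎先生、阿英先生著作的启示'])
print(vectors.shape)  # (1, 768)
```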
We use a BiLSTM, which can efficiently use both past and future input information and extracts features automatically.
We use a CRF and a Softmax to label the tag for each character in a sentence (see the sketch below).
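A minimal TensorFlow 1.x-style sketch of the BiLSTM-CRF scoring and decoding layers; the function name, variable names, and hidden size are illustrative assumptions rather than the repo's exact code:

```python
import tensorflow as tf

def bilstm_crf_layers(char_embeddings, seq_lengths, labels, num_tags, hidden_dim=300):
    # BiLSTM encodes past (forward) and future (backward) context for each character
    cell_fw = tf.contrib.rnn.LSTMCell(hidden_dim)
    cell_bw = tf.contrib.rnn.LSTMCell(hidden_dim)
    (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        cell_fw, cell_bw, char_embeddings,
        sequence_length=seq_lengths, dtype=tf.float32)
    output = tf.concat([out_fw, out_bw], axis=-1)

    # project to per-character tag scores (emission scores)
    logits = tf.layers.dense(output, num_tags)

    # CRF layer: training loss and Viterbi decoding of the best tag sequence
    log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
        logits, labels, seq_lengths)
    loss = tf.reduce_mean(-log_likelihood)
    pred_tags, _ = tf.contrib.crf.crf_decode(logits, transition_params, seq_lengths)
    return loss, pred_tags
```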
python main.py --model=0
The --model argument can be 0, 1, or 2:
"--model=0" means train, "--model=1" means test, and "--model=2" means application.
If you use "--model=2", you can use the trained model to identify the entities in some original sentences.
First, put the original sentences into the ./data/case/input.txt file;
then run the command python main.py --model=2
and the recognized results will be saved to ./data/case/result.json
The best test performance of the model is:
accuracy: 98.64%; precision: 90.87%; recall: 87.13%; F1: 88.96
| Entity | Precision | Recall | F1    | Count |
|--------|-----------|--------|-------|-------|
| LOC    | 93.41%    | 90.16% | 91.76 | 2777  |
| ORG    | 85.64%    | 84.67% | 85.15 | 1316  |
| PER    | 90.78%    | 84.38% | 87.46 | 1844  |
[1] https://github.com/Determined22/zh-NER-TF