Named Entity Recognition (NER)

Train and use the model

There are two main types of models available: standard RNN-based and BERT-based. For details about the BERT-based models, see :doc:`here </features/models/bert>`. Any pre-trained model can be used for inference from both the Command Line Interface (CLI) and Python. Before using a model, make sure that all required packages are installed by running:

python -m deeppavlov install ner_ontonotes_bert

To use a pre-trained model from the CLI, run the following command:

python -m deeppavlov interact ner_ontonotes_bert [-d]

where ner_ontonotes_bert is the name of the config and -d is an optional download key. The -d key downloads the pre-trained model along with the embeddings and all other files needed to run the model. Other possible commands are train, evaluate, and download.

Here is the list of all available configs:

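====================================================================  ===============  ========  ===============  ==========  ========
Model                                                                 Dataset          Language  Embeddings Size  Model Size  F1 score
====================================================================  ===============  ========  ===============  ==========  ========
:config:`ner_rus_bert <ner/ner_rus_bert.json>`                        Collection3 [1]  Ru        700 MB           1.4 GB      98.1
:config:`ner_rus <ner/ner_rus.json>`                                  Collection3      Ru        1.0 GB           5.6 MB      95.1
:config:`ner_ontonotes_bert_mult <ner/ner_ontonotes_bert_mult.json>`  Ontonotes        Multi     700 MB           1.4 GB      88.8
:config:`ner_ontonotes_bert <ner/ner_ontonotes_bert.json>`            Ontonotes        En        400 MB           800 MB      88.6
:config:`ner_ontonotes <ner/ner_ontonotes.json>`                      Ontonotes        En        331 MB           7.8 MB      86.4
:config:`ner_conll2003_bert <ner/ner_conll2003_bert.json>`            CoNLL-2003       En        400 MB           850 MB      91.7
:config:`ner_conll2003 <ner/ner_conll2003.json>`                      CoNLL-2003       En        331 MB           3.1 MB      89.9
:config:`ner_dstc2 <ner/ner_dstc2.json>`                              DSTC2            En        ---              626 KB      97.1
====================================================================  ===============  ========  ===============  ==========  ========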

Models can be used from Python using the following code:

from deeppavlov import configs, build_model

ner_model = build_model(configs.ner.ner_ontonotes_bert, download=True)

ner_model(['Bob Ross lived in Florida'])
>>> [[['Bob', 'Ross', 'lived', 'in', 'Florida']], [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE']]]
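
The result is a batch of token lists followed by a parallel batch of tag lists. A minimal sketch of pairing them up for display is given here. It uses the output shown above as literal data instead of calling a live model, and the helper name pair_tokens_with_tags is hypothetical, not part of DeepPavlov:

```python
def pair_tokens_with_tags(tokens_batch, tags_batch):
    """Zip each sentence's tokens with the tags predicted for them."""
    return [list(zip(tokens, tags))
            for tokens, tags in zip(tokens_batch, tags_batch)]


# Data copied from the example output above (no model is called here).
tokens_batch = [['Bob', 'Ross', 'lived', 'in', 'Florida']]
tags_batch = [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE']]

pairs = pair_tokens_with_tags(tokens_batch, tags_batch)
for token, tag in pairs[0]:
    print(f'{token}\t{tag}')
```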

The model can also be trained from Python:

from deeppavlov import configs, train_model

ner_model = train_model(configs.ner.ner_ontonotes_bert)

The data for training should be placed in the folder provided in the config:

from deeppavlov import configs, train_model
from deeppavlov.core.commands.utils import parse_config

config_dict = parse_config(configs.ner.ner_ontonotes_bert)

print(config_dict['dataset_reader']['data_path'])
>>> '~/.deeppavlov/downloads/ontonotes'

The folder must contain three text files: train.txt, valid.txt, and test.txt. The data_path can also be changed from code. The format of the data is described in the Training data section.
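
Before starting a long training run, it can be worth checking that the folder really contains all three files. The following is a small standard-library sketch; the helper missing_dataset_files is hypothetical and not part of DeepPavlov, and the path is the one printed in the example above:

```python
from pathlib import Path

# File names required by the text above.
REQUIRED_FILES = ('train.txt', 'valid.txt', 'test.txt')


def missing_dataset_files(data_path):
    """Return the required file names that are absent from data_path."""
    folder = Path(data_path).expanduser()
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


missing = missing_dataset_files('~/.deeppavlov/downloads/ontonotes')
if missing:
    print('Missing files:', ', '.join(missing))
else:
    print('All dataset files are in place.')
```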

Multilingual BERT Zero-Shot Transfer

Multilingual BERT models allow zero-shot transfer from one language to another. The model :config:`ner_ontonotes_bert_mult <ner/ner_ontonotes_bert_mult.json>` was trained on the OntoNotes corpus, which has 18 entity types in its markup schema (listed below). Its performance was evaluated on the Russian corpus Collection3 [1]. The results of the transfer are presented in the table below.

TOTAL 79.39
PER 95.74
LOC 82.62
ORG 55.68

The following Python code can be used to run inference with the model:

from deeppavlov import configs, build_model

ner_model = build_model(configs.ner.ner_ontonotes_bert_mult, download=True)

ner_model(['Curling World Championship will be held in Antananarivo'])
>>> ([['Curling', 'World', 'Championship', 'will', 'be', 'held', 'in', 'Antananarivo']],
[['B-EVENT', 'I-EVENT', 'I-EVENT', 'O', 'O', 'O', 'O', 'B-GPE']])

ner_model(['Mistrzostwa Świata w Curlingu odbędą się w Antananarivo'])
>>> ([['Mistrzostwa', 'Świata', 'w', 'Curlingu', 'odbędą', 'się', 'w', 'Antananarivo']],
[['B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'O', 'O', 'O', 'B-GPE']])

ner_model(['Чемпионат мира по кёрлингу пройдёт в Антананариву'])
>>> ([['Чемпионат', 'мира', 'по', 'кёрлингу', 'пройдёт', 'в', 'Антананариву']],
[['B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'O', 'O', 'B-GPE']])

The list of available tags and their descriptions is presented below.

PERSON People including fictional
NORP Nationalities or religious or political groups
FACILITY Buildings, airports, highways, bridges, etc.
ORGANIZATION Companies, agencies, institutions, etc.
GPE Countries, cities, states
LOCATION Non-GPE locations, mountain ranges, bodies of water
PRODUCT Vehicles, weapons, foods, etc. (Not services)
EVENT Named hurricanes, battles, wars, sports events, etc.
WORK OF ART Titles of books, songs, etc.
LAW Named documents made into laws
LANGUAGE Any named language
DATE Absolute or relative dates or periods
TIME Times smaller than a day
PERCENT Percentage (including “%”)
MONEY Monetary values, including unit
QUANTITY Measurements, as of weight or distance
ORDINAL “first”, “second”
CARDINAL Numerals that do not fall under another type

NER task

Named Entity Recognition (NER) is one of the most common tasks in natural language processing. In most cases, the NER task can be formulated as follows:

Given a sequence of tokens (words, and maybe punctuation symbols) provide a tag from a predefined set of tags for each token in the sequence.

Some common types of entities used as tags in the NER task are:

  • persons
  • locations
  • organizations
  • expressions of time
  • quantities
  • monetary values

Furthermore, to distinguish adjacent entities with the same tag, many applications use the BIO tagging scheme. Here "B" denotes the beginning of an entity, "I" stands for "inside" and is used for all words of the entity except the first one, and "O" means the absence of an entity. An example with punctuation dropped:

Bernhard        B-PER
Riemann         I-PER
Carl            B-PER
Friedrich       I-PER
Gauss           I-PER
and             O
Leonhard        B-PER
Euler           I-PER

In the example above PER means person tag, and "B-" and "I-" are prefixes identifying beginnings and continuations of the entities. Without such prefixes, it is impossible to separate Bernhard Riemann from Carl Friedrich Gauss.
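
The role of the prefixes can be made concrete with a short decoder that groups BIO-tagged tokens back into entities. This is a minimal sketch, not DeepPavlov code. The function name bio_to_entities is hypothetical, and it leniently treats an "I-" tag with no matching open entity as the start of a new entity, which is an assumption and not a rule stated here:

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if there is one.
        if current_tokens:
            entities.append((current_type, ' '.join(current_tokens)))

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition('-')
        if prefix == 'I' and entity_type == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        elif prefix in ('B', 'I'):
            # "B-" (or a stray "I-") opens a new entity.
            flush()
            current_type, current_tokens = entity_type, [token]
        else:
            # "O" closes any open entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities


tokens = ['Bernhard', 'Riemann', 'Carl', 'Friedrich', 'Gauss',
          'and', 'Leonhard', 'Euler']
tags = ['B-PER', 'I-PER', 'B-PER', 'I-PER', 'I-PER',
        'O', 'B-PER', 'I-PER']
entities = bio_to_entities(tokens, tags)
print(entities)
# [('PER', 'Bernhard Riemann'), ('PER', 'Carl Friedrich Gauss'), ('PER', 'Leonhard Euler')]
```

If every tag were a plain PER, the first five tokens would merge into a single entity; the "B-PER" on "Carl" is what splits them.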

Training data

To train the neural network, you need to have a dataset in the following format:

EU B-ORG
rejects O
the O
call O
of O
Germany B-LOC
to O
boycott O
lamb O
from O
Great B-LOC
Britain I-LOC
. O

China B-LOC
says O
time O
right O
for O
Taiwan B-LOC
talks O
. O

...

The source text is tokenized and tagged. For each token, there is a tag with BIO markup. Tags are separated from tokens by whitespace. Sentences are separated by empty lines.
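
A reader for this format fits in a few lines. The sketch below is only an illustration, not the dataset reader DeepPavlov uses; the function name parse_bio_lines is hypothetical. It takes the first whitespace-separated field of a line as the token and the last one as the tag:

```python
def parse_bio_lines(lines):
    """Parse 'token tag' lines into a list of (tokens, tags) sentences."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        parts = line.split()
        if not parts:
            # An empty line ends the current sentence.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        # First field is the token, last field is the tag.
        tokens.append(parts[0])
        tags.append(parts[-1])
    if tokens:
        # The last sentence may not be followed by an empty line.
        sentences.append((tokens, tags))
    return sentences


sample = """\
EU B-ORG
rejects O

China B-LOC
says O
"""
sentences = parse_bio_lines(sample.splitlines())
print(sentences)
```

To read a real file, pass an open file object instead of the sample lines, for example `parse_bio_lines(open('train.txt', encoding='utf-8'))`.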

The dataset is a text file or a set of text files. It must be split into three parts: train, validation, and test. The train set is used for training the network, namely adjusting the weights with gradient descent. The validation set is used for monitoring learning progress and for early stopping. The test set is used for the final evaluation of model quality. A typical partition of a dataset into train, validation, and test sets is 80%, 10%, and 10%, respectively.
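
One way to produce such a partition is to shuffle the sentences once with a fixed seed and slice the result. This is a sketch under that assumption; the function name split_dataset, the default fractions, and the seed are illustrative choices, not values taken from DeepPavlov:

```python
import random


def split_dataset(sentences, train_frac=0.8, valid_frac=0.1, seed=42):
    """Shuffle sentences and split them into train, valid, and test lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # everything that is left
    return train, valid, test


train, valid, test = split_dataset(range(100))
print(len(train), len(valid), len(test))
```

Splitting at the sentence level, not the token level, keeps every sentence intact in exactly one of the three parts.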

Few-shot Language-Model based

It is possible to get a cold-start baseline from just a few samples of labeled data in a couple of seconds. The solution is based on a Language Model (LM) trained on an open-domain corpus. An SVM classification layer is placed on top of the LM. It is possible to start from as few as 10 sentences containing entities of interest.

The data for training this model should be collected in the following way. Given a collection of N sentences without markup, mark up the sentences sequentially until the total number of sentences containing an entity of interest becomes equal to K. During training, both sentences with and without markup are used.
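
The collection procedure can be simulated on data that is already tagged. The sketch below rests on one reading of the description: every sentence inspected up to and including the K-th hit is kept, including sentences that contain no entity of interest. The function name collect_few_shot_sample and the toy data are hypothetical:

```python
def collect_few_shot_sample(tagged_sentences, entity_type, k):
    """Take sentences in order until k of them contain entity_type."""
    selected, hits = [], 0
    for tokens, tags in tagged_sentences:
        if hits == k:
            break
        # Sentences without the entity are kept too (assumption, see above).
        selected.append((tokens, tags))
        if any(tag.endswith('-' + entity_type) for tag in tags):
            hits += 1
    return selected


corpus = [
    (['It', 'rains'], ['O', 'O']),
    (['Euler', 'wrote'], ['B-PER', 'O']),
    (['in', 'Basel'], ['O', 'B-LOC']),
    (['Gauss', 'replied'], ['B-PER', 'O']),
    (['Riemann', 'agreed'], ['B-PER', 'O']),
]
sample = collect_few_shot_sample(corpus, 'PER', k=2)
print(len(sample))  # 4 sentences are needed to reach 2 sentences with PER
```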

Mean chunk-wise F1 scores for the Russian language on 10 sentences with entities:

PER 84.85
LOC 68.41
ORG 32.63

(The total number of training sentences is larger; it is determined by the distribution of sentences with and without entities.)
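
This document does not say how the chunk-wise scores were computed. As an illustration only, the sketch below uses the common CoNLL-style definition, in which a predicted chunk counts as correct when its type and both boundaries match a gold chunk. The function names are hypothetical, and the result is a fraction for a single sentence, not a percentage over a corpus:

```python
def bio_chunks(tags):
    """Return the set of (entity_type, start, end) chunks in a BIO sequence."""
    chunks, start, current = set(), None, None
    # The trailing 'O' is a sentinel that closes a chunk ending at the last tag.
    for i, tag in enumerate(list(tags) + ['O']):
        prefix, _, entity_type = tag.partition('-')
        continues = prefix == 'I' and entity_type == current
        if current is not None and not continues:
            chunks.add((current, start, i))
            current = None
        if prefix in ('B', 'I') and not continues:
            start, current = i, entity_type
    return chunks


def chunk_f1(gold_tags, pred_tags):
    """Chunk-level F1 for one sentence; exact type and boundary match."""
    gold, pred = bio_chunks(gold_tags), bio_chunks(pred_tags)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = ['B-PER', 'I-PER', 'O', 'B-LOC']
pred = ['B-PER', 'I-PER', 'O', 'O']
f1 = chunk_f1(gold, pred)
print(round(f1, 4))  # precision 1.0, recall 0.5
```

For a whole corpus, the counts of correct, predicted, and gold chunks would normally be summed over all sentences before precision and recall are computed.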

The model can be trained using CLI:

python -m deeppavlov train ner_few_shot_ru

You have to provide the train.txt, valid.txt, and test.txt files in the format described in the Training data section. The files must be placed in the ner_few_shot_data folder, as described in the dataset_reader part of the config :config:`ner/ner_few_shot_ru.json <ner/ner_few_shot_ru.json>`.

To train and use the model from Python code, the following snippet can be used:

from deeppavlov import configs, train_model

ner_model = train_model(configs.ner.ner_few_shot_ru, download=True)

ner_model(['Example sentence'])

Warning! This model can take a lot of time and memory if the number of sentences is greater than 1000!

If a lot of data is available, the few-shot setting can be simulated with a special dataset_iterator. For this purpose, the config :config:`ner/ner_few_shot_ru_simulate.json <ner/ner_few_shot_ru_simulate.json>` can be used. The following code runs this simulation:

from deeppavlov import configs, train_model

ner_model = train_model(configs.ner.ner_few_shot_ru_simulate, download=True)

In this config the Collection dataset is used. However, if the files train.txt, valid.txt, and test.txt are present in the ner_few_shot_data folder, they will be used instead.

To use an existing few-shot model, the following Python interface can be used:

from deeppavlov import configs, build_model

ner_model = build_model(configs.ner.ner_few_shot_ru)

ner_model([['Example', 'sentence']])
ner_model(['Example sentence'])

Literature

[1] Mozharova V., Loukachevitch N. Two-stage approach in Russian named entity recognition // International FRUCT Conference on Intelligence, Social Media and Web, ISMW FRUCT 2016. Saint-Petersburg, Russian Federation. DOI 10.1109/FRUCT.2016.7584769