feat: add Ontonotes NER with Senna
* release 0.0.3 (#150)

* feat: tests can be run from project root (#86)

* refactor: instead of juggling global random states use instances of Random for datasets

* test(): add test for interacting with custom queries

After refactoring, it is possible to easily add a list of query-response
pairs for every model (config), which will be used to compare pretrained
model output with expected output. Initial lists were added for error_model
and ner. Also, a URL for downloading the pretrained ner_conll2003_model was added
IP-1344 #done

* Update docs from master (#96)

* fixed grammar and style

* Update README.md

* fix grammar & style

* fix grammar & style

* fix grammar&style in Intent classification README

* doc: add supported platform notes

* docs: correct paths to scripts and configs to be relative to repository root (#94)

* docs: correct paths to scripts and configs to be relative to repository root

fixes #93

* docs: set paths in basic examples to be relative to the project root

* docs: run deep.py as a python module in examples

* doc: add notes for python 3.5

* test(): change downloading to temp dir (#97)

* feat: assert python version is 3.6 or higher

* Rename dataset to dataset_iterator and other renames (#103)

* refactor: rename 'dataset' to 'dataset_iterator'

* refactor: rename dataset readers and iterators

* refactor: classification iterator and reader

* fix: dialog_iterator

* test: fix downloading procedure (#108)

* Feature/tf layers to core (#67)

* feat: layers moved to core

* feat: attention added

* fix: highway/skip connections for different dimensionality of units are fixed

* feat: NER now supports core layers

* fix: minor docstrings fixes

* feat: CuDNN GRU and LSTM added

* feat: Bidirectional CuDNN GRU and LSTM added

* feat: stacked bi-rnn refactored

* fix: fixed arguments order in rnn

* fix: remove duplicate mult_att

* chore: merge with dev

* fix: backward forward bug in cudnnrnn

* refactor: use single fasttext module, clean dependencies

* fix: add error when n_classes is zero

* feat: add fastText model usage instead of fasttext

* fix: emb_module default fastText

* chore: embedding fixed in configs

* chore: change new models names

* feat: change intent embeddings in gobot configs

* chore: fastText to fasttext, new model, change intents in gobot configs

* chore: new url on new fasttext embeddings

* fix: delete download all true

* fix: add url of old embedding file

* fix: delete comma

* fix: delete old embedding file from urls

* fix: delete pyfasttext from requirements, fasttext_embedder

* fix: change pyfasttext embeddings from gobot

* fix: delete from requirements

* fix: delete gensim from fasttext_embedder

* fix: simplify requirements

* fix: fix dim in gobot_all config

* refactor: remove redundant parameter 'emb_module'

* feat: use wiki.en.bin embeddings in gobot_all

* feat: check saved model params and fix lowercase for interact

* fix: lowercase text while interact

* feat: check saved model params

* fix: rm extra configs

* feat: add support for classification data in csv/json formats (#115)

* feat: add support for csv/json classification datasets

* feat: add tests for snips and samples

* fix: gobot_all config fix

* feat: add REST API for all models

* Moved telegram_utils -> utils; Refactored telegram_ui.py

* Moved telegram_utils -> utils: modified deeppavlov/deep.py

* Fixed getting model name with get_main_component() in telegram_ui.py

* chainer.py: minor fix in get_main_component()

* Added riseapi launch mode

* README.md: added riseapi mode reference

* Updated README.MD and fixed requirements.txt

* minor fixes in README.md

* Fixes in utils/server

* refactor: change endpoint names

* feat: add StreamSpacyTokenizer

* refactor: remove duplication from script naming

* refactor: outline detokenize() method in utils, because it should be used by all tokenizers and doesn't depend on tokenize()

* feat: add streaming spaCy tokenizer

* refactor: DELETE original spaCy tokenizer, rename stream_spacy to spacy

* refactor: rename tokenoizer scripts back

* fix: wrong grammar

* feat: include spacy_tokenizer import

* feat: replace old SpacyTokenizer with new StreamSpacyTokenizer

* feat: ability to manage lowercasing from class constructor, typing improvements

* fix: update go-bot configs, so they would work with StreamSpacyTokenizer the same as with the old tokenizer

* feat: add optional logging to the spacy tokenizer

* docs: update docstrings

* refactor: replace custom logger with deeppavlov's, pep8 style update

* refactor: outline ngramize() because it is independent of tokenizer classes

* refactor: return original JSON formatting

* fix: add **kwargs to __init__()

* chore: update .gitignore

* refactor: more stable and consistent code

* feat: add TravisCI integration

* build(): add TravisCI integration

* build(): add TravisCI integration

* feat: add ranking model

* feat: add ranking model to deeppavlov

* feat: add download of dataset and embedding_model

* feat: adapt to new deeppavlov interfaces

* refactor: use pathlib where available in the ranking model

* feat: add saving and loading of responses with np.save

* feat: add saving and loading of response embeddings with np.save; use response embeddings to calculate predictions in the __call__ function

* feat: add interact regime

* feat: add interact_pred_num parameter

* refactor: change parameter default value, change the check for whether the file with the embeddings model exists

* fix: fix non-string keys in EmbeddingDict class

* feat: add parameters dict for autotests

* feat: add tests support

* feat: add context embeddings vocabulary (it is used in interact regime
to predict the most similar contexts)

* chore: change shuffle parameter default value to True in batch_generator

* refactor: change config to chainer representation

* fix: bug fix in urls.py file

* refactor: remove emb_vocab_file saving, move build_tok2int_vocab and
make_ints funcs to InsuranceDict class, add set_embeddings and
reset_embeddings funcs in RankingModel

* feat: add initial documentation

* refactor: remove idx2int vocabulary, add vocabularies saving

* change config parameters default values, remove examples in tests

* feat: add table in documentation

* fix: fix bug in urls.py

* refactor: remove paths from config

* feat: add documentation

* feat: add True in tests

* feat: add documentation

* refactor: move init/load in the load function.

* refactor: change parameters in config

* feat: add logging

* feat: add more logging

* feat: add documentation, change parameters values in config

* fix: add gensim for ranking model

* fix: requirements installation order that caused setup.py error

* refactor: train script

* feat: add documentation

* feat: models parameters check for ner

* feat: parameters check added to ner

* feat: parameters check added to slotfill

* chore: minor clean-up

* fix: fix conll-2003 model file names and archive names

* refactor: remove blank line

* feat: allow to stop training after n batches (#127)

* fix: many minor fixes

* fix: fix mark_done data_path

* refactor: rename ranking_dataset to ranking_iterator.py and move it to the dataset_iterators folder

* fix: fix embedding matrix construction, change epochs num
default parameter value

* refactor: rename registered name and name of the class

* refactor: rename files and classes

* refactor: change dataset download

* feat: add insurance embeddings and datasets in urls.py

* refactor: change batch data representation (#131)

* feat: install tensorflow-gpu

* feat: add SQuAD model

* feat: add SQuAD dataset reader

* feat: add dataset, preprocessing, config

* feat: add VocabEmbedder for chars and tokens

* feat&fix: add model implementation

* feat: add training support, answer postprocessing

* fix: predicted answer extraction from context

* fix: dropout mask

* feat: true_answer is a list of answers now

* merge with dev

* docs: add some docstrings

* refactor: renaming variables

* docs: add README.md

* feat: add support of multiple inputs and outputs in interact mode

* docs: upd README.md

* fix: bugs after merge with dev

* fix: turn on training vocabs

* fix: remove keep_prob multiplier for dropout mask

* fix: add short contexts support

* docs: upd README.md

* feat: chainer returns batch of tuples instead of tuple of batches

* docs: upd squad README.md

* docs: upd squad README.md

* feat: add link to pretrained SQuAD model

* fix: SQuAD model url

* feat: add embeddings downloading and upd config

* feat: add variable scope for optimizer

* refactor: do not override __init__ method for squad_iterator

* fix: ensure that directory exists before saving SquadVocabEmbedder

* style: upd names in config and docs

* chore: remove main.py used for debugging

* docs: upd README.md

* fix: change batch_size to fix possible OOM

* test: add possibility to interact with several input queries

* chore: add max_batches to squad config

* docs: upd README.md

* fix(ranking_network): wrap y as np.array

* fix: fix training stop for pytest

* style: add license header

* fix: refactor training stop for pytest

* test: specify pytest_max_batches

* feat: use all pytest keys and not only max_batches (#134)

* fix: remove result stringification

* feat: add GPU_only and Slow marks for tests

* test: add a couple of marks for selecting tests

* test: make Travis run only fast tests without GPU

* fix: ranking config works in interactbot

* fix: add downloading nltk punkt for tokenization (#140)

* feat: bot start message for intents does not say anything about dstc2 (#142)

* feat: interactbot command works with pipes that require multiple inputs (#137)

* build: change TravisCI script (#143)

* feat: add Glove embedder (#138)

* feat: glove embedder added

* feat: embeddings added to NER network

* feat: dataset and embeddings are added to urls.py for downloading

* fix: char embeddings added to pretrained embeddings

* feat: embedder returns a list of embeddings instead of a zero-padded np array

* feat: capitalization added

* feat: config modified according to new features

* feat: double dense added to input parameters

* feat: config parameters updated

* chore: fix urls for conll NER, ontonotes model url added

* feat: pytest_max_batches added for faster train check

* feat: ontonotes tests added

* feat: test conll max batches added

* Update README.md

* feat: add seq2seq go bot

* fix: lowercase text while interact

* feat: check saved model params

* fix: rm extra configs

* feat: add kvret dataset_reader

* feat: add kvret_dataset_iterator

* fix: add configerror

* fix: dirty fix for dialog data to be lowercased

* feat: check np.int and int in Vocabulary

* feat: seq2seqbot works for train and infer

* feat: add bleu-metric

* feat: add simple seq2seq_go_bot config

* fix: fix inference and load()

* feat: add variable scope for optimizer

* feat: add support of multiple inputs and outputs in interact mode

* fix: fix padding

* feat: tokenizer argument in Vocabulary

* feat: chainer returns batch of tuples instead of tuple of batches

* fix: spacy_tokenizer returns [['']] for batch with empty string and add alpha_only argument

* feat: add per_item_bleu

* feat: train seq2seq_go_bot on utterance batches

* feat: tokenize y_true

* feat: fit kb_entries knowledge base

* feat: add split tokenizer

* feat: standardize tokenizers' output

* feat: normalize kb entities

* feat: db_columns, db_items in each sample

* fix: go_bot configs (for new vocab) and loading of network

* style: minor restyling

* feat: add config for infer

* feat: add config for infer

* feat: add seq2seq_go_bot pretrained model

* feat: update telegram start and help messages

* style: minor styling

* docs: add simple readme

* doc: remove red ... blocks

* doc: change Dataset to DatasetIterator

* doc: update list of configs

* doc: update package structure

* doc: add notes about dataset element in config

* feat: add squad model description to README.md

* doc: add config specification for seq2seq_go_bot

* docs: add seq2seq_go_bot in main readme

* docs: small fix

* docs: add config specification for seq2seq_go_bot

* chore: remove install.py (#151)

* feat: add support for batches in go-bot

* feat: batching v1

* feat: bow_encoder is optional

* fix: probs calculation for use_action_mask=true

* refactor: do not feed initial_state during train

* feat: feed sequence lengths in dynamic_rnn

* refactor: rename go_bot.py -> bot.py

* Update README.md

* feat: Ontonotes NER added

* chore: train part removed from config

* fix: readme dataset_iterator fixed, json removed from string

* feat: raw version of test added

* fix: test modes

* fix: folder name in ontonotes config and download path now consistent

* fix: skip tests

* feat: GPU check added to ner OntoNotes
mu-arkhipov authored and seliverstov committed Apr 3, 2018
1 parent 53b3d36 commit 833cc70
Showing 10 changed files with 479 additions and 98 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -10,7 +10,7 @@ DeepPavlov is an open-source conversational AI library built on TensorFlow and Keras
* development of production ready chat-bots and complex conversational systems
* NLP and dialog systems research

Our goal is to enable AI-application developers researchers with:
Our goal is to enable AI-application developers and researchers with:
* set of pre-trained NLP models, pre-defined dialog system components (ML/DL/Rule-based) and pipeline templates
* a framework for implementing and testing their own dialog models
* tools for application integration with adjacent infrastructure (messengers, helpdesk software etc.)
2 changes: 2 additions & 0 deletions deeppavlov/__init__.py
@@ -27,6 +27,8 @@
import deeppavlov.models.embedders.glove_embedder
import deeppavlov.models.encoders.bow
import deeppavlov.models.ner.slotfill
import deeppavlov.models.ner.ner
import deeppavlov.models.ner.ner_ontonotes
import deeppavlov.models.spellers.error_model.error_model
import deeppavlov.models.trackers.hcn_at
import deeppavlov.models.trackers.hcn_et
52 changes: 52 additions & 0 deletions deeppavlov/configs/ner/ner_ontonotes.json
@@ -0,0 +1,52 @@
{
  "dataset_reader": {
    "name": "conll2003_reader",
    "data_path": "ontonotes/"
  },
  "dataset_iterator": {
    "name": "basic_dataset_iterator"
  },
  "chainer": {
    "in": ["x"],
    "pipe": [
      {
        "id": "pos_vocab",
        "name": "default_vocab",
        "load_path": "ner_ontonotes_senna/pos.dict",
        "save_path": "ner_ontonotes_senna/pos.dict"
      },
      {
        "id": "tag_vocab",
        "name": "default_vocab",
        "load_path": "ner_ontonotes_senna/tag.dict",
        "save_path": "ner_ontonotes_senna/tag.dict"
      },
      {
        "id": "ner_vocab",
        "name": "default_vocab",
        "load_path": "ner_ontonotes_senna/ner.dict",
        "save_path": "ner_ontonotes_senna/ner.dict"
      },
      {
        "id": "glove_emb",
        "name": "glove",
        "load_path": "embeddings/glove.6B.100d.txt",
        "save_path": "embeddings/glove.6B.100d.txt"
      },
      {
        "in": ["x"],
        "out": ["y_predicted"],
        "name": "ner_ontonotes",
        "main": true,
        "save_path": "ner_ontonotes_senna/model.ckpt",
        "load_path": "ner_ontonotes_senna/model.ckpt",
        "ner_vocab": "#ner_vocab",
        "tag_vocab": "#tag_vocab",
        "pos_vocab": "#pos_vocab",
        "embedder": "#glove_emb"
      }
    ],
    "out": ["y_predicted"]
  }
}

5 changes: 0 additions & 5 deletions deeppavlov/configs/ner/ner_ontonotes_emb.json
@@ -87,11 +87,6 @@

"log_every_n_epochs": 1,
"show_examples": false
},
"metadata": {
"labels": {
"telegram_utils": "NERModel"
}
}
}

2 changes: 2 additions & 0 deletions deeppavlov/core/data/urls.py
@@ -30,6 +30,8 @@
'http://lnsigo.mipt.ru/export/deeppavlov_data/squad_model.tar.gz',
'http://lnsigo.mipt.ru/export/deeppavlov_data/seq2seq_go_bot.tar.gz',
'http://lnsigo.mipt.ru/export/deeppavlov_data/ner_ontonotes.tar.gz',
'http://lnsigo.mipt.ru/export/deeppavlov_data/ner_ontonotes_senna.tar.gz',
'http://lnsigo.mipt.ru/export/deeppavlov_data/senna.tar.gz'
}

OPT_URLS = {
121 changes: 90 additions & 31 deletions deeppavlov/models/ner/README_NER.md
@@ -78,7 +78,7 @@ Configuration of the model can be performed in code or in JSON configuration file. To train
the model you need to specify four groups of parameters:

- **`dataset_reader`**
- **`dataset`**
- **`dataset_iterator`**
- **`chainer`**
- **`train`**

@@ -89,7 +89,7 @@ In the subsequent text we show the parameter specification in config file. However
The dataset reader is a class which reads and parses the data. It returns a dictionary with
three fields: "train", "test", and "valid". The basic dataset reader is "ner_dataset_reader."
The dataset reader config part with "ner_dataset_reader" should look like:
```json
```
"dataset_reader": {
"name": "ner_dataset_reader",
"data_path": "/home/user/Data/conll2003/"
@@ -102,13 +102,13 @@ contain data in the format presented in *Training data* section. Each line in the file
may contain additional information such as POS tags. However, the token must be the first in
line and NER tag must be the last.
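
For illustration, a few lines in this format might look as follows (the tokens and tags are invented examples; the middle column is an optional POS tag, the last column is the NER tag):

```
Apple NNP B-ORG
is VBZ O
based VBN O
in IN O
Cupertino NNP B-LOC
. . O
```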

### Dataset
### Dataset Iterator

For simple batching and shuffling you can use "basic_dataset". The part of the
For simple batching and shuffling you can use "basic_dataset_iterator". The part of the
configuration file for the dataset looks like:
```json
"dataset": {
"name": "basic_dataset"
```
"dataset_iterator": {
"name": "basic_dataset_iterator"
}
```

Expand All @@ -119,7 +119,7 @@ There is no additional parameters in this part.
The chainer part of the configuration file contains the specification of the neural network
model and supplementary things such as vocabularies. Chainer should be defined as follows:

```json
```
"chainer": {
"in": ["x"],
"in_y": ["y"],
Expand All @@ -137,7 +137,7 @@ predictions.
The major part of "chainer" is "pipe". The "pipe" contains network and vocabularies. Firstly
we define vocabularies needed to build the neural network:

```json
```
"pipe": [
{
"id": "word_vocab",
@@ -255,7 +255,7 @@ works well in most of the cases

After the "chainer" part you should specify the "train" part:

```json
```
"train": {
"epochs": 100,
"batch_size": 64,
@@ -280,14 +280,14 @@ training parameters are:


And now all parts together:
```json
```
{
"dataset_reader": {
"name": "ner_dataset_reader",
"data_path": "conll2003/"
},
"dataset": {
"name": "basic_dataset"
"dataset_iterator": {
"name": "basic_dataset_iterator"
},
"chainer": {
"in": ["x"],
@@ -372,43 +372,102 @@ interact_model(PIPELINE_CONFIG_PATH)
This example assumes that the working directory is deeppavlov.


## OntoNotes NER

A pre-trained model for the OntoNotes NER task can be used as follows:
```python
from deeppavlov.core.commands.infer import interact_model
interact_model('deeppavlov/configs/ner/ner_ontonotes.json')
```
Or from the command line:

```bash
python deeppavlov/deep.py interact deeppavlov/configs/ner/ner_ontonotes.json
```

Since the model is built with the cuDNN version of LSTM, a GPU with the cuDNN library installed is needed to run this model.
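A quick way to verify this before running the model is to ask TensorFlow whether a GPU is visible (a minimal sketch, not part of this commit):

```python
import tensorflow as tf

# True only if TensorFlow was built with CUDA support and a
# CUDA-capable GPU (with the cuDNN libraries) is available.
print(tf.test.is_gpu_available())
```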
The F1 score of this model on the test part of OntoNotes is presented in the table below.

| Model                        |     F1 score     |
|------------------------------|:----------------:|
| DeepPavlov                   | **87.07** ± 0.21 |
| Strubell et al. (2017) [1]   |   86.84 ± 0.19   |
| Chiu and Nichols (2016) [2]  |   86.19 ± 0.25   |
| spaCy                        |       85.85      |
| Durrett and Klein (2014) [3] |       84.04      |
| Ratinov and Roth (2009) [4]  |       83.45      |

Scores by entity type are presented in the table below:

|Tag |F1 score|
|------ |:------:|
|TOTAL | 87.07 |
|CARDINAL |82.80|
|DATE |84.87|
|EVENT |68.39 |
|FAC |68.07|
|GPE |94.61|
|LANGUAGE |62.91|
|LAW |48.27|
|LOC |72.39|
|MONEY |87.79|
|NORP |94.27|
|ORDINAL |79.53|
|ORG |85.59|
|PERCENT |89.41|
|PERSON |91.67|
|PRODUCT |58.90|
|QUANTITY |77.93|
|TIME |62.50|
|WORK_OF_ART |53.17|


## Results

The NER network component reproduces the architecture from the paper "_Application of a Hybrid Bi-LSTM-CRF model to the task of Russian Named Entity Recognition_" https://arxiv.org/pdf/1709.09686.pdf, which is inspired by the LSTM+CRF architecture from https://arxiv.org/pdf/1603.01360.pdf.

The Bi-LSTM architecture of the NER network was tested on three datasets:
- Gareev corpus [1] (obtainable by request to authors)
- FactRuEval 2016 [2]
- Persons-1000 [3]
- Gareev corpus [5] (obtainable by request to authors)
- FactRuEval 2016 [6]
- Persons-1000 [7]

The F1 measure for our model, along with the results of other published solutions, is provided in the table below:

| Models | Gareev’s dataset | Persons-1000 | FactRuEval 2016 |
|---------------------- |:----------------:|:------------:|:---------------:|
| Gareev et al. [1] | 75.05 | | |
| Malykh et al. [4] | 62.49 | | |
| Trofimov [5] | | 95.57 | |
| Rubaylo et al. [6] | | | 78.13 |
| Sysoev et al. [7] | | | 74.67 |
| Ivanitsky et al. [7] | | | **87.88** |
| Mozharova et al. [8] | | 97.21 | |
| Gareev et al. [5] | 75.05 | | |
| Malykh et al. [8] | 62.49 | | |
| Trofimov [13] | | 95.57 | |
| Rubaylo et al. [9] | | | 78.13 |
| Sysoev et al. [10] | | | 74.67 |
| Ivanitsky et al. [11]| | | **87.88** |
| Mozharova et al. [12]| | 97.21 | |
| Our (Bi-LSTM+CRF)    | **87.17**        | **99.26**    | 82.10           |

## Literature
[1] - Strubell, Emma, et al. "Fast and accurate entity recognition with iterated dilated convolutions." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.

[2] - Jason PC Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional lstm-cnns. Transactions of the Association for Computational Linguistics, 4:357–370.

[3] - Greg Durrett and Dan Klein. 2014. A joint model for entity analysis: Coreference, typing and linking. Transactions of the Association for Computational Linguistics, 2:477–490.

[1] - Rinat Gareev, Maksim Tkachenko, Valery Solovyev, Andrey Simanovsky, Vladimir Ivanov: Introducing Baselines for Russian Named Entity Recognition. Computational Linguistics and Intelligent Text Processing, 329 -- 342 (2013).
[4] - Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147–155. Association for Computational Linguistics.

[2] - https://github.com/dialogue-evaluation/factRuEval-2016
[5] - Rinat Gareev, Maksim Tkachenko, Valery Solovyev, Andrey Simanovsky, Vladimir Ivanov: Introducing Baselines for Russian Named Entity Recognition. Computational Linguistics and Intelligent Text Processing, 329 -- 342 (2013).

[3] - http://ai-center.botik.ru/Airec/index.php/ru/collections/28-persons-1000
[6] - https://github.com/dialogue-evaluation/factRuEval-2016

[4] - Reproducing Russian NER Baseline Quality without Additional Data. In Proceedings of the 3rd International Workshop on Concept Discovery in Unstructured Data, Moscow, Russia, 54 – 59 (2016)
[7] - http://ai-center.botik.ru/Airec/index.php/ru/collections/28-persons-1000

[5] - Rubaylo A. V., Kosenko M. Y.: Software utilities for natural language information
[8] - Malykh, Valentin, and Alexey Ozerin. "Reproducing Russian NER Baseline Quality without Additional Data." CDUD@ CLA. 2016.

[9] - Rubaylo A. V., Kosenko M. Y.: Software utilities for natural language information
retrieval. Almanac of modern science and education, Volume 12 (114), 87 – 92. (2016)

[6] - Sysoev A. A., Andrianov I. A.: Named Entity Recognition in Russian: the Power of Wiki-Based Approach. dialog-21.ru
[10] - Sysoev A. A., Andrianov I. A.: Named Entity Recognition in Russian: the Power of Wiki-Based Approach. dialog-21.ru

[11] - Ivanitskiy Roman, Alexander Shipilo, Liubov Kovriguina: Russian Named Entities Recognition and Classification Using Distributed Word and Phrase Representations. In SIMBig, 150 – 156. (2016).

[7] - Ivanitskiy Roman, Alexander Shipilo, Liubov Kovriguina: Russian Named Entities Recognition and Classification Using Distributed Word and Phrase Representations. In SIMBig, 150 – 156. (2016).
[12] - Mozharova V., Loukachevitch N.: Two-stage approach in Russian named entity recognition. In Intelligence, Social Media and Web (ISMW FRUCT), 2016 International FRUCT Conference, 16 (2016)

[8] - Mozharova V., Loukachevitch N.: Two-stage approach in Russian named entity recognition. In Intelligence, Social Media and Web (ISMW FRUCT), 2016 International FRUCT Conference, 16 (2016)
[13] - Trofimov, I.V.: Person name recognition in news articles based on the persons-1000/1111-F collections. In: 16th All-Russian Scientific Conference Digital Libraries: Advanced Methods and Technologies, Digital Collections, RCDL 2014, pp. 217 – 221 (2014).
79 changes: 79 additions & 0 deletions deeppavlov/models/ner/ner_ontonotes.py
@@ -0,0 +1,79 @@
"""
Copyright 2017 Neural Networks and Deep Learning lab, MIPT
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
import tensorflow as tf
from overrides import overrides
from copy import deepcopy
import inspect
import json

from deeppavlov.core.common.registry import register
from deeppavlov.core.data.utils import tokenize_reg
from deeppavlov.models.ner.network_ontonotes import NerNetwork
from deeppavlov.core.models.tf_model import TFModel
from deeppavlov.core.common.log import get_logger

log = get_logger(__name__)


@register('ner_ontonotes')
class NER(TFModel):
    def __init__(self, **kwargs):
        self.opt = deepcopy(kwargs)
        vocabs = self.opt.pop('vocabs')
        self.opt.update(vocabs)

        # Find all input parameters of the network init
        network_parameter_names = list(inspect.signature(NerNetwork.__init__).parameters)
        # Fill all provided parameters from opt
        network_parameters = {par: self.opt[par] for par in network_parameter_names if par in self.opt}

        self.sess = tf.Session()
        network_parameters['sess'] = self.sess
        self._network_parameters = network_parameters
        self._net = NerNetwork(**network_parameters)

        # Try to load the model (if there are some model files the model will be loaded from them)
        super().__init__(**kwargs)
        if self.load_path is not None:
            self.load()

    def load(self, *args, **kwargs):
        super().load(*args, **kwargs)

    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)
        self.save_params()

    def save_params(self):
        # Dump the graph parameters and vocabularies next to the checkpoint as JSON.
        params_to_save = {param: self.opt.get(param, None) for param in self.GRAPH_PARAMS}
        for vocab in self.VOCABS:
            params_to_save[vocab] = [self.opt[vocab][i] for i in range(len(self.opt[vocab]))]
        path = str(self.save_path.with_suffix('.json').resolve())
        log.info('[saving parameters to {}]'.format(path))
        with open(path, 'w') as fp:
            json.dump(params_to_save, fp, indent=4)

    def train_on_batch(self, batch_x, batch_y):
        # The distributed model is inference-only (the train part was removed from the config).
        raise NotImplementedError

    @overrides
    def __call__(self, batch, *args, **kwargs):
        # Tokenize raw strings; pre-tokenized input is passed through unchanged.
        if isinstance(batch[0], str):
            batch = [tokenize_reg(utterance) for utterance in batch]
        return self._net.predict_on_batch(batch)

    def shutdown(self):
        self._net.shutdown()
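
For reference, a rough sketch of how this component could be invoked once the config above is assembled into a pipeline (this assumes the build_model_from_config helper from this codebase; the input sentence is a made-up example):

```python
import json
from deeppavlov.core.commands.infer import build_model_from_config

with open('deeppavlov/configs/ner/ner_ontonotes.json') as f:
    config = json.load(f)

# Build the whole chainer pipeline: vocabularies, the GloVe embedder and the NER component.
model = build_model_from_config(config)

# Raw strings are tokenized with tokenize_reg inside NER.__call__ before prediction.
print(model(['Barack Obama visited New York in 2012']))
```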
