release 0.0.3 (#150)
* feat: tests can be run from project root (#86)

* refactor: instead of juggling global random states use instances of Random for datasets

* test(): add test for interacting with custom queries

After refactoring, it is easy to add a list of query-response
pairs for every model (config); these pairs are used to compare the pretrained
model's output with the expected output. Initial lists were added for error_model
and ner. A URL for downloading the pretrained ner_conll2003_model was also added.
IP-1344 #done

* Update docs from master (#96)

* fixed grammar and style

* Update README.md

* fix grammar & style

* fix grammar & style

* fix grammar&style in Intent classification README

* doc: add supported platform notes

* docs: correct paths to scripts and configs to be relative to repository root (#94)

* docs: correct paths to scripts and configs to be relative to repository root

fixes #93

* docs: set paths in basic examples to be relative to the project root

* docs: run deep.py as a python module in examples

* doc: add notes for python 3.5

* test(): change downloading to temp dir (#97)

* feat: assert python version is 3.6 or higher

* Rename dataset to dataset_iterator and other renames (#103)

* refactor: rename 'dataset' to 'dataset_iterator'

* refactor: rename dataset readers and iterators

* refactor: classification iterator and reader

* fix: dialog_iterator

* test: fix downloading procedure (#108)

* Feature/tf layers to core (#67)

* feat: layers moved to core

* feat: attention added

* fix: highway/skip connections for different dimensionality of units are fixed

* feat: NER now supports core layers

* fix: minor docstrings fixes

* feat: CuDNN GRU and LSTM added

* feat: Bidirectional CuDNN GRU and LSTM added

* feat: stacked bi-rnn refactored

* fix: fixed arguments order in rnn

* fix: remove duplicate mult_att

* chore: merge with dev

* fix: backward forward bug in cudnnrnn

* refactor: use single fasttext module, clean dependencies

* fix: add error when n_classes is zero

* feat: add fastText model usage instead of fasttext

* fix: emb_module default fastText

* chore: embedding fixed in configs

* chore: change new models names

* feat: change intent embeddings in gobot configs

* chore: fastText to fasttext, new model, change intents in gobot configs

* chore: new url on new fasttext embeddings

* fix: delete download all true

* fix: add url of old embedding file

* fix: delete comma

* fix: delete old embedding file from urls

* fix: delete pyfasttext from requirements, fasttext_embedder

* fix: change pyfasttext embeddings from gobot

* fix: delete from requirements

* fix: delete gensim from fasttext_embedder

* fix: simplify requirements

* fix: fix dim in gobot_all config

* refactor: remove redundant parameter 'emb_module'

* feat: use wiki.en.bin embeddings in gobot_all

* feat: check saved model params and fix lowercase for interact

* fix: lowercase text while interact

* feat: check saved model params

* fix: rm extra configs

* feat: add support for classification data in csv/json formats (#115)

* feat: add support for csv/json classification datasets

* feat: add tests for snips and samples

* fix: gobot_all config fix

* feat: add REST API for all models

* Moved telegram_utils -> utils; Refactored telegram_ui.py

* Moved telegram_utils -> utils: modified deeppavlov/deep.py

* Fixed getting model name with get_main_component() in telegram_ui.py

* chaner.py: minor fix in get_main_component()

* Added riseapi launch mode

* README.md: added riseapi mode reference

* Updated README.MD and fixed requirements.txt

* minor fixes in README.md

* Fixes in utils/server

* refactor: change endpoint names

* feat: add StreamSpacyTokenizer

* refactor: remove duplicating from script naming

* refactor: outline detokenize() method in utils, because it should be used by all tokenizers and doesn't depend on tokenize()

* feat: add streaming spaCy tokenizer

* refactor: DELETE original spaCy tokenizer, rename stream_spacy to spacy

* refactor: rename tokenizer scripts back

* fix: wrong grammar

* feat: include spacy_tokenizer import

* feat: replace old SpacyTokenizer with new StreamSpacyTokenizer

* feat: ability to manage lowercasing from class constructor, typing improvements

* fix: update go-bot configs, so they would work with StreamSpacyTokenizer the same as with the old tokenizer

* feat: add optional logging to the spacy tokenizer

* docs: update docstrings

* refactor: replace custom logger with deeppavlov's, pep8 style update

* refactor: outline ngramize() because it is independent from tokenizer classes

* refactor: return original JSON formatting

* fix: add **kwargs to __init__()

* chore: update .gitignore

* refactor: more stable and consistent code

* feat: add TravisCI integration

* build(): add TravisCI integration

* build(): add TravisCI integration

* feat: add ranking model

* feat: add ranking model to deeppavlov

* feat: add download of dataset and embedding_model

* feat: adapt to new deeppavlov interfaces

* refactor: use pathlib where available in the ranking model

* feat: add saving and loading responses saving with np.save

* feat: add saving and loading response embeddings saving with np.save,
use response embeddings to calculate predictions in  __call__ function

* feat: add interact regime

* feat: add interact_pred_num parameter

* refactor: change parameter default value, change check if the file with
embeddings model exists

* fix: fix non-string keys in EmbeddingDict class

* feat: add parameters dict for autotests

* feat: add tests support

* feat: add context embeddings vocabulary (it is used in interact regime
to predict the most similar contexts)

* chore: change shuffle parameter default value to True in batch_generator

* refactor: change config to chainer representation

* fix: bug fix in urls.py file

* refactor: remove emb_vocab_file saving, move build_tok2int_vocab and
make_ints funcs to InsuranceDict class, add set_embeddings and
reset_embeddings funcs in RankingModel

* feat: add initial documentation

* refactor: remove idx2int vocabulary, add vocabularies saving

* change config parameters default values, remove examples in tests

* feat: add table in documentation

* fix: fix bug in urls.py

* refactor: remove paths from config

* feat: add documentation

* feat: add True in tests

* feat: add documentation

* refactor: move init/load in the load function.

* refactor: change parameters in config

* feat: add logging

* feat: add more logging

* feat: add documentation, change parameters values in config

* fix: add genesis for ranking model

* fix: requirements installation order that caused setup.py error

* refactor: train script

* feat: add documentation

* feat: models parameters check for ner

* feat: parameters check added to ner

* feat: parameters check added to slotfill

* chore: minor clean-up

* fix: fix conll-2003 model file names and archive names

* refactor: remove blank line

* feat: allow to stop training after n batches (#127)

* fix: many minor fixes

* fix: fix mark_done data_path

* refactor: rename ranking_dataset to ranking_iterator.py and move it to the dataset_iterators folder

* fix: fix embedding matrix construction, change epochs num
default parameter value

* refactor: rename registered name and name of the class

* refactor: rename files and classes

* refactor: change dataset download

* feat: add insurance embeddings and datasets in urls.py

* refactor: change batch data representation (#131)

* feat: install tensorflow-gpu

* feat: add SQUAD model

* feat: add SQuAD dataset reader

* feat: add dataset, preprocessing, config

* feat: add VocabEmbedder for chars and tokens

* feat&fix: add model realization

* feat: add training support, answer postprocessing

* fix: predicted answer extraction from context

* fix: dropout mask

* feat: true_answer is a list of answers now

* merge with dev

* docs: add some docstrings

* refactor: renaming variables

* docs: add README.md

* feat: add support of multiple inputs and outputs in interact mode

* docs: upd README.md

* fix: bugs after merge with dev

* fix: turn on training vocabs

* fix: remove keep_prob multiplier for dropout mask

* fix: add short contexts support

* docs: upd README.md

* feat: chainer returns batch of tuples instead of tuple of batches

* docs: upd squad README.md

* docs: upd squad README.md

* feat: add link to pretrained SQuAD model

* fix: SQuAD model url

* feat: add embeddings downloading and upd config

* feat: add variable scope for optimizer

* refactor: do not override __init__ method for squad_iterator

* fix: ensure that directory exists before saving SquadVocabEmbedder

* style: upd names in config and docs

* chore: remove main.py used for debugging

* docs: upd README.md

* fix: change batch_size to fix possible OOM

* test: add possibility to interact with several input query

* chore: add max_batches to squad config

* docs: upd README.md

* fix(ranking_network): wrap y as np.array

* fix: fix training stop for pytest

* style: add license header

* fix: refactor training stop for pytest

* test: specify pytest_max_batches

* feat: use all pytest keys and not only max_batches (#134)

* fix: remove result stringification

* feat: add GPU_only and Slow marks for tests

* feat: add SQuAD dataset reader

* feat: add dataset, preprocessing, config

* feat: add VocabEmbedder for chars and tokens

* feat&fix: add model realization

* feat: add training support, answer postprocessing

* fix: predicted answer extraction from context

* fix: dropout mask

* feat: true_answer is a list of answers now

* merge with dev

* docs: add some docstrings

* refactor: renaming variables

* docs: add README.md

* feat: add support of multiple inputs and outputs in interact mode

* docs: upd README.md

* fix: bugs after merge with dev

* fix: turn on training vocabs

* fix: remove keep_prob multiplier for dropout mask

* fix: add short contexts support

* docs: upd README.md

* feat: chainer returns batch of tuples instead of tuple of batches

* docs: upd squad README.md

* docs: upd squad README.md

* feat: add link to pretrained SQuAD model

* fix: SQuAD model url

* feat: add embeddings downloading and upd config

* feat: add variable scope for optimizer

* refactor: do not override __init__ method for squad_iterator

* fix: ensure that directory exists before saving SquadVocabEmbedder

* style: upd names in config and docs

* chore: remove main.py used for debugging

* docs: upd README.md

* fix: change batch_size to fix possible OOM

* test: add possibility to interact with several input query

* chore: add max_batches to squad config

* docs: upd README.md

* fix(ranking_network): wrap y as np.array

* fix: fix training stop for pytest

* style: add license header

* fix: refactor training stop for pytest

* test: specify pytest_max_batches

* test: add couple of marks for selecting tests

* test: make Travis running only fast tests without GPU

* fix: ranking config works in interactbot

* fix: add downloading nltk punkt for tokenization (#140)

* feat: bot start message for intents does not say anything about dstc2 (#142)

* feat: interactbot command works with pipes that require multiple inputs (#137)

* build: change TravisCI script (#143)

* feat: add Glove embedder (#138)

* feat: glove embedder added

* feat: embeddings added to NER network

* feat: dataset and embeddings are added to urls.py for downloading

* fix: char embeddings added to pretrained embeddings

* feat: embedder return list of embeddings instead zero padded np array

* feat: capitalization added

* feat: config modified according to new features

* feat: double dense added to input parameters

* feat:config parameters updated

* chore: fix urls for conll NER, ontonotes model url added

* feat: pytest_max_batches added for faster train check

* feat: ontonotes tests added

* feat: test conll max batches added

* Update README.md

* feat: add seq2seq go bot

* fix: lowercase text while interact

* feat: check saved model params

* fix: rm extra configs

* feat: add kvret dataset_reader

* feat: add kvret_dataset_iterator

* fix: add configerror

* fix: dirty fix for dialog data to be lowercased

* feat: check np.int and int in Vocabulary

* feat: seq2seqbot works for train and infer

* feat: add bleu-metric

* feat: add simple seq2seq_go_bot config

* fix: fix inference and load()

* feat: add variable scope for optimizer

* feat: add support of multiple inputs and outputs in interact mode

* fix: fix padding

* feat: tokenizer argument in Vocabulary

* feat: chainer returns batch of tuples instead of tuple of batches

* fix: spacy_tokenizer returns [['']] for batch with empty string and add alpha_only argument

* feat: add per_item_bleu

* feat: train seq2seq_go_bot on utterance batches

* feat: tokenize y_true

* feat: fit kb_entries knowledge base

* feat: add split tokenizer

* feat: standardize tokenizers output

* feat: normalize kb entities

* feat: db_columns, db_items in each sample

* fix: go_bot configs (for new vocab) and loading of network

* style: minor restyling

* feat: add config for infer

* feat: add config for infer

* feat: add seq2seq_go_bot pretrained model

* feat: update telegram start and help messages

* style: minor styling

* docs: add simple readme

* doc: remove red ... blocks

* doc: change Dataset to DatasetIterator

* doc: update list of configs

* doc: update package structure

* doc: add notes about dataset element in config

* feat: add squad model description to README.md

* doc: add config specification for seq2seq_go_bot

* fix: lowercase text while interact

* feat: check saved model params

* fix: rm extra configs

* feat: add kvret dataset_reader

* feat: add kvret_dataset_iterator

* fix: add configerror

* fix: dirty fix for dialog data to be lowercased

* feat: check np.int and int in Vocabulary

* feat: seq2seqbot works for train and infer

* feat: add bleu-metric

* feat: add simple seq2seq_go_bot config

* fix: fix inference and load()

* feat: add variable scope for optimizer

* feat: add support of multiple inputs and outputs in interact mode

* fix: fix padding

* feat: tokenizer argument in Vocabulary

* feat: chainer returns batch of tuples instead of tuple of batches

* fix: spacy_tokenizer returns [['']] for batch with empty string and add alpha_only argument

* feat: add per_item_bleu

* feat: train seq2seq_go_bot on utterance batches

* feat: tokenize y_true

* feat: fit kb_entries knowledge base

* feat: add split tokenizer

* feat: standardize tokenizers output

* feat: normalize kb entities

* feat: db_columns, db_items in each sample

* fix: go_bot configs (for new vocab) and loading of network

* style: minor restyling

* feat: add config for infer

* feat: add config for infer

* feat: add seq2seq_go_bot pretrained model

* feat: update telegram start and help messages

* style: minor styling

* docs: add simple readme

* docs: add seq2seq_go_bot in main readme

* docs: small fix

* docs: add config specification for seq2seq_go_bot

* chore: remove install.py (#151)

* feat: add support for batches in go-bot

* feat: batching v1

* feat: bow_encoder is optional

* fix: probs calculation for use_action_mask=true

* refactor: do not feed initial_state during train

* feat: feed sequence lengths in dynamic_rnn

* refactor: rename go_bot.py -> bot.py
seliverstov committed Mar 26, 2018
1 parent 5c52988 commit d1bccd1
Showing 100 changed files with 5,332 additions and 816 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -111,6 +111,7 @@ download/

#project test
/test/
.pytest_cache

# project data
/data/
17 changes: 17 additions & 0 deletions .travis.yml
@@ -0,0 +1,17 @@
language: python

python:
- '3.6'

cache: pip

git:
depth: false

install:
- pip3 install -r requirements-dev.txt
- python3 setup.py develop
- python3 -m spacy download en

script:
- pytest -v -m "not gpu_only and not slow"
78 changes: 54 additions & 24 deletions README.md
@@ -4,7 +4,7 @@
# <center>DeepPavlov</center>

### *We are in a really early Alpha release. You should be ready for hard adventures.*
### *If you have updated to version 0.0.2 - please re-download all pre-trained models*
### *If you have updated to version 0.0.2 or greater - please re-download all pre-trained models*

DeepPavlov is an open-source conversational AI library built on TensorFlow and Keras. It is designed for
* development of production ready chat-bots and complex conversational systems
@@ -24,8 +24,11 @@ Our goal is to enable AI-application developers researchers with:
| [Slot filling and NER components](deeppavlov/models/ner/README.md) | Based on neural Named Entity Recognition network and fuzzy Levenshtein search to extract normalized slot values from text. The NER component reproduces architecture from the paper [Application of a Hybrid Bi-LSTM-CRF model to the task of Russian Named Entity Recognition](https://arxiv.org/pdf/1709.09686.pdf) which is inspired by Bi-LSTM+CRF architecture from https://arxiv.org/pdf/1603.01360.pdf. |
| [Intent classification component](deeppavlov/models/classifiers/intents/README.md) | Based on shallow-and-wide Convolutional Neural Network architecture from [Kim Y. Convolutional neural networks for sentence classification – 2014](https://arxiv.org/pdf/1408.5882). The model allows multilabel classification of sentences. |
| [Automatic spelling correction component](deeppavlov/models/spellers/error_model/README.md) | Based on [An Improved Error Model for Noisy Channel Spelling Correction by Eric Brill and Robert C. Moore](http://www.aclweb.org/anthology/P00-1037) and uses statistics based error model, a static dictionary and an ARPA language model to correct spelling errors. |
| **Skill** | |
| [Goal-oriented bot](deeppavlov/skills/go_bot/README.md) | Based on Hybrid Code Networks (HCNs) architecture from [Jason D. Williams, Kavosh Asadi, Geoffrey Zweig, Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning – 2017](https://arxiv.org/abs/1702.03274). It allows to predict responses in goal-oriented dialog. The model is customizable: embeddings, slot filler and intent classifier can switched on and off on demand. |
| [Ranking component](deeppavlov/models/ranking/README.md) | Based on [LSTM-based deep learning models for non-factoid answer selection](https://arxiv.org/abs/1511.04108). The model performs ranking of responses or contexts from some database by their relevance for the given context. |
| [Question Answering component](deeppavlov/models/squad/README.md) | Based on [R-NET: Machine Reading Comprehension with Self-matching Networks](https://www.microsoft.com/en-us/research/publication/mrc/). The model solves the task of looking for an answer on a question in a given context ([SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) task format). |
| **Skills** | |
| [Goal-oriented bot](deeppavlov/skills/go_bot/README.md) | Based on Hybrid Code Networks (HCNs) architecture from [Jason D. Williams, Kavosh Asadi, Geoffrey Zweig, Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning – 2017](https://arxiv.org/abs/1702.03274). It predicts responses in goal-oriented dialog. The model is customizable: embeddings, slot filler and intent classifier can be switched on and off on demand. |
| [Seq2seq goal-oriented bot](deeppavlov/skills/seq2seq_go_bot/README.md) | Dialogue agent predicts responses in a goal-oriented dialog and is able to handle multiple domains (pretrained bot allows calendar scheduling, weather information retrieval, and point-of-interest navigation). The model is end-to-end differentiable and does not need to explicitly model dialogue state or belief trackers. |
| **Embeddings** | |
| [Pre-trained embeddings for the Russian language](pretrained-vectors.md) | Word vectors for the Russian language trained on joint [Russian Wikipedia](https://ru.wikipedia.org/wiki/%D0%97%D0%B0%D0%B3%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0) and [Lenta.ru](https://lenta.ru/) corpora. |

@@ -43,14 +46,22 @@ View video demo of deployment of a goal-oriented bot and a slot-filling model wi
```
python -m deeppavlov.deep interact deeppavlov/configs/go_bot/gobot_dstc2.json
```
* Run slot-filling model with Telegram interface:
* Run goal-oriented bot with REST API:
```
python -m deeppavlov.deep riseapi deeppavlov/configs/go_bot/gobot_dstc2.json
```
* Run slot-filling model with Telegram interface:
```
python -m deeppavlov.deep interactbot deeppavlov/configs/ner/slotfill_dstc2.json -t <TELEGRAM_TOKEN>
```
* Run slot-filling model with console interface:
```
python -m deeppavlov.deep interact deeppavlov/configs/ner/slotfill_dstc2.json
```
* Run slot-filling model with REST API:
```
python -m deeppavlov.deep riseapi deeppavlov/configs/ner/slotfill_dstc2.json
```
## Conceptual overview

### Principles
@@ -91,8 +102,8 @@ DeepPavlov is built on top of machine learning frameworks [TensorFlow](https://w
* [Train config](#train-config)
* [Train parameters](#train-parameters)
* [DatasetReader](#datasetreader)
* [Dataset](#dataset)
* [Inferring](#inferring)
* [DatasetIterator](#datasetiterator)
* [Inference](#inference)
* [License](#license)
* [Support and collaboration](#support-and-collaboration)
* [The Team](#the-team)
@@ -138,21 +149,28 @@ Then you can interact with the models or train them with the following command:
python -m deeppavlov.deep <mode> <path_to_config>
```

* `<mode>` can be 'train', 'interact' or 'interactbot'
* `<mode>` can be 'train', 'interact', 'interactbot' or 'riseapi'
* `<path_to_config>` should be a path to an NLP pipeline json config

For 'interactbot' mode you should specify Telegram bot token in `-t` parameter or in `TELEGRAM_TOKEN` environment variable.
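
The token lookup described above can be sketched as follows. This is only an illustration: the helper name `resolve_telegram_token` is hypothetical, and giving the `-t` value precedence over the `TELEGRAM_TOKEN` environment variable is an assumption that the text does not confirm.

```python
import os
from typing import Optional


def resolve_telegram_token(cli_token: Optional[str]) -> str:
    """Return the bot token from the -t value or the environment.

    Assumption: an explicit command-line value wins over the
    TELEGRAM_TOKEN environment variable.
    """
    token = cli_token or os.environ.get("TELEGRAM_TOKEN")
    if not token:
        raise ValueError("Pass a token with -t or set TELEGRAM_TOKEN")
    return token


# An explicit value is returned as-is, whatever the environment holds.
print(resolve_telegram_token("token-from-cli"))
```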

For 'riseapi' mode you should specify api settings (host, port, etc.) in [*utils/server_utils/server_config.json*](utils/server_utils/server_config.json) configuration file. If provided, values from *model_defaults* section override values for the same parameters from *common_defaults* section. Model names in *model_defaults* section should be similar to the class names of the models main component.
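
The override rule for *model_defaults* and *common_defaults* might look roughly like the sketch below. The section names come from the paragraph above; the function name, the `port` values and the `GoalOrientedBot` key are hypothetical and are not taken from the actual `server_config.json`.

```python
from typing import Any, Dict


def resolve_server_params(config: Dict[str, Any], model_name: str) -> Dict[str, Any]:
    """Start from common_defaults, then apply per-model overrides."""
    params = dict(config.get("common_defaults", {}))
    overrides = config.get("model_defaults", {}).get(model_name, {})
    params.update(overrides)
    return params


# Illustrative config contents (assumed, not the shipped file).
example_config = {
    "common_defaults": {"host": "0.0.0.0", "port": 5000},
    "model_defaults": {"GoalOrientedBot": {"port": 5001}},
}

print(resolve_server_params(example_config, "GoalOrientedBot"))
# {'host': '0.0.0.0', 'port': 5001}
```

A model without an entry in *model_defaults* simply gets the common values.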

Available model configs are:

*deeppavlov/configs/go_bot/gobot_dstc2.json*
- ```deeppavlov/configs/go_bot/*.json```

- ```deeppavlov/configs/seq2seq_go_bot/*.json```

*deeppavlov/configs/intents/intents_dstc2.json*
- ```deeppavlov/configs/squad/*.json```

*deeppavlov/configs/ner/slotfill_dstc2.json*
- ```deeppavlov/configs/intents/*.json```

*deeppavlov/configs/error_model/brillmoore_wikitypos_en.json*
- ```deeppavlov/configs/ner/*.json```

- ```deeppavlov/configs/ranking/*.json```

- ```deeppavlov/configs/error_model/*.json```

---

@@ -171,7 +189,11 @@ Available model configs are:
</tr>
<tr>
<td><b> deeppavlov.core.data </b></td>
<td> basic <b><i>Dataset</i></b>, <b><i>DatasetReader</i></b> and <b><i>Vocab</i></b> classes </td>
<td> basic <b><i>DatasetIterator</i></b>, <b><i>DatasetReader</i></b> and <b><i>Vocab</i></b> classes </td>
</tr>
<tr>
<td><b> deeppavlov.core.layers </b></td>
<td> collection of commonly used <b><i>Layers</i></b> for TF models </td>
</tr>
<tr>
<td><b> deeppavlov.core.models </b></td>
@@ -182,8 +204,12 @@ Available model configs are:
<td> concrete <b><i>DatasetReader</i></b> classes </td>
</tr>
<tr>
<td><b> deeppavlov.datasets </b></td>
<td> concrete <b><i>Dataset</i></b> classes </td>
<td><b> deeppavlov.dataset_iterators </b></td>
<td> concrete <b><i>DatasetIterators</i></b> classes </td>
</tr>
<tr>
<td><b> deeppavlov.metrics </b></td>
<td> different <b><i>Metric</i></b> functions </td>
</tr>
<tr>
<td><b> deeppavlov.models </b></td>
@@ -203,7 +229,7 @@ Available model configs are:

An NLP pipeline config is a JSON file that contains one required element `chainer`:

```json
```
{
"chainer": {
"in": ["x"],
@@ -288,15 +314,15 @@ An NNModel should have the `in_y` parameter which contains a list of ground trut
]
```

The config for training the pipeline should have three additional elements: `dataset_reader`, `dataset` and `train`:
The config for training the pipeline should have three additional elements: `dataset_reader`, `dataset_iterator` and `train`:

```json
```
{
"dataset_reader": {
"name": ...,
...
}
"dataset": {
"dataset_iterator": {
"name": ...,
...
},
@@ -309,6 +335,10 @@ The config for training the pipeline should have three additional elements: `dat
}
```

A simplified version of the training pipeline contains two elements: `dataset` and `train`. The `dataset` element can currently
be used to train on classification data in `csv` and `json` formats. You can find complete examples of how to use the simplified training pipeline in [intents_sample_csv.json](deeppavlov/configs/intents/intents_sample_csv.json) and [intents_sample_json.json](deeppavlov/configs/intents/intents_sample_json.json) config files.


### Train Parameters
* `epochs` — maximum number of epochs to train NNModel, defaults to `-1` (infinite)
* `batch_size`,
@@ -334,12 +364,12 @@ from deeppavlov.core.data.dataset_reader import DatasetReader
class DSTC2DatasetReader(DatasetReader):
```

### Dataset
### DatasetIterator

`Dataset` forms the sets of data ('train', 'valid', 'test') needed for training/inference and divides it into batches.
A concrete `Dataset` class should be registered and can be inherited from
`deeppavlov.data.dataset_reader.Dataset` class. `deeppavlov.data.dataset_reader.Dataset`
is not an abstract class and can be used as a `Dataset` as well.
`DatasetIterator` forms the sets of data ('train', 'valid', 'test') needed for training/inference and divides it into batches.
A concrete `DatasetIterator` class should be registered and can be inherited from
`deeppavlov.data.dataset_iterator.BasicDatasetIterator` class. `deeppavlov.data.dataset_iterator.BasicDatasetIterator`
is not an abstract class and can be used as a `DatasetIterator` as well.
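
To make the role of a `DatasetIterator` concrete, here is a minimal sketch of a class that holds 'train', 'valid' and 'test' splits and yields batches. `ToyDatasetIterator` and its `batch_generator` signature are hypothetical stand-ins written for this illustration; they are not the DeepPavlov base class and omit registration. The per-instance `random.Random` mirrors the refactoring mentioned in the commit message.

```python
import random
from typing import Dict, Iterator, List, Tuple

Sample = Tuple[str, str]  # (input, ground truth) pair


class ToyDatasetIterator:
    """Hypothetical stand-in: stores data splits and batches them."""

    def __init__(self, data: Dict[str, List[Sample]], seed: int = 42) -> None:
        # Missing splits become empty lists.
        self.data = {k: list(data.get(k, [])) for k in ("train", "valid", "test")}
        # Own RNG instance, so shuffling does not touch global random state.
        self.random = random.Random(seed)

    def batch_generator(
        self, batch_size: int, data_type: str = "train", shuffle: bool = True
    ) -> Iterator[Tuple[Tuple[str, ...], Tuple[str, ...]]]:
        samples = list(self.data[data_type])
        if shuffle:
            self.random.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            # Turn a list of (x, y) pairs into a tuple of xs and a tuple of ys.
            xs, ys = zip(*samples[start:start + batch_size])
            yield xs, ys


iterator = ToyDatasetIterator({"train": [("hi", "greet"), ("bye", "part"), ("thanks", "thank")]})
for x_batch, y_batch in iterator.batch_generator(batch_size=2, shuffle=False):
    print(x_batch, y_batch)
```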

### Inference

@@ -359,7 +389,7 @@ If you have any questions, bug reports or feature requests, please feel free to

## The Team

DeepPavlov is built and maintained by [Neural Networks and Deep Learning Lab](https://mipt.ru/english/research/labs/neural-networks-and-deep-learning-lab) at [MIPT](https://mipt.ru/english/).
DeepPavlov is built and maintained by [Neural Networks and Deep Learning Lab](https://mipt.ru/english/research/labs/neural-networks-and-deep-learning-lab) at [MIPT](https://mipt.ru/english/) within [iPavlov](http://ipavlov.ai/) project (part of [National Technology Initiative](https://asi.ru/eng/nti/)) and in partnership with [Sberbank](http://www.sberbank.com/).

<p align="center">
<img src="http://ipavlov.ai/img/ipavlov_footer.png" width="50%" height="50%"/>
48 changes: 35 additions & 13 deletions deeppavlov/__init__.py
@@ -1,34 +1,56 @@
# check version
import sys
assert sys.hexversion >= 0x3060000, 'Does not work in python3.5 or lower'


import deeppavlov.core.models.keras_model
import deeppavlov.core.data.dataset
import deeppavlov.core.data.dataset_iterator
import deeppavlov.core.data.vocab
import deeppavlov.dataset_readers.babi_dataset_reader
import deeppavlov.dataset_readers.dstc2_dataset_reader
import deeppavlov.dataset_readers.basic_ner_dataset_reader
import deeppavlov.dataset_readers.typos
import deeppavlov.dataset_readers.classification_dataset_reader
import deeppavlov.datasets.dialog_dataset
import deeppavlov.datasets.dstc2_datasets
import deeppavlov.datasets.hcn_dataset
import deeppavlov.datasets.intent_dataset
import deeppavlov.datasets.typos_dataset
import deeppavlov.datasets.classification_dataset
import deeppavlov.dataset_readers.babi_reader
import deeppavlov.dataset_readers.dstc2_reader
import deeppavlov.dataset_readers.kvret_reader
import deeppavlov.dataset_readers.conll2003_reader
import deeppavlov.dataset_readers.typos_reader
import deeppavlov.dataset_readers.basic_classification_reader
import deeppavlov.dataset_readers.squad_dataset_reader
import deeppavlov.dataset_iterators.dialog_iterator
import deeppavlov.dataset_iterators.kvret_dialog_iterator
import deeppavlov.dataset_iterators.dstc2_ner_iterator
import deeppavlov.dataset_iterators.dstc2_intents_iterator
import deeppavlov.dataset_iterators.typos_iterator
import deeppavlov.dataset_iterators.basic_classification_iterator
import deeppavlov.dataset_iterators.squad_iterator
import deeppavlov.models.classifiers.intents.intent_model
import deeppavlov.models.commutators.random_commutator
import deeppavlov.models.embedders.fasttext_embedder
import deeppavlov.models.embedders.dict_embedder
import deeppavlov.models.embedders.glove_embedder
import deeppavlov.models.encoders.bow
import deeppavlov.models.ner.slotfill
import deeppavlov.models.spellers.error_model.error_model
import deeppavlov.models.trackers.hcn_at
import deeppavlov.models.trackers.hcn_et
import deeppavlov.models.preprocessors.str_lower
import deeppavlov.models.preprocessors.squad_preprocessor
import deeppavlov.models.ner.ner
import deeppavlov.skills.go_bot.go_bot
import deeppavlov.models.tokenizers.spacy_tokenizer
import deeppavlov.models.tokenizers.split_tokenizer
import deeppavlov.models.squad.squad
import deeppavlov.skills.go_bot.bot
import deeppavlov.skills.go_bot.network
import deeppavlov.skills.go_bot.tracker
import deeppavlov.skills.seq2seq_go_bot.bot
import deeppavlov.skills.seq2seq_go_bot.network
import deeppavlov.skills.seq2seq_go_bot.kb
import deeppavlov.vocabs.typos
import deeppavlov.dataset_readers.insurance_reader
import deeppavlov.dataset_iterators.ranking_iterator
import deeppavlov.models.ranking.ranking_model
import deeppavlov.models.ranking.metrics

import deeppavlov.metrics.accuracy
import deeppavlov.metrics.fmeasure
import deeppavlov.metrics.bleu
import deeppavlov.metrics.squad_metrics

import deeppavlov.core.common.log
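
The version check at the top of `deeppavlov/__init__.py` compares `sys.hexversion` with `0x3060000`. The sketch below decodes that constant and shows an equivalent comparison on `sys.version_info`; the helper names `decode_hexversion` and `is_supported` are hypothetical and not part of the package.

```python
import sys
from typing import Sequence, Tuple

MIN_HEXVERSION = 0x3060000  # the constant used in the diff


def decode_hexversion(value: int) -> Tuple[int, int, int]:
    """Extract (major, minor, micro) from a sys.hexversion-style integer."""
    return (value >> 24) & 0xFF, (value >> 16) & 0xFF, (value >> 8) & 0xFF


def is_supported(version_info: Sequence[int] = sys.version_info) -> bool:
    """Same intent as the assert: Python 3.6 or newer."""
    return tuple(version_info[:2]) >= (3, 6)


print(decode_hexversion(MIN_HEXVERSION))  # (3, 6, 0)
print(is_supported((3, 5, 4)))            # False
```

Note that `assert` statements are skipped when Python runs with `-O`, so an explicit `if ... raise` would be needed if the check must always run.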
4 changes: 2 additions & 2 deletions deeppavlov/configs/error_model/brillmoore_kartaslov_ru.json
@@ -2,8 +2,8 @@
"dataset_reader": {
"name": "typos_kartaslov_reader"
},
"dataset": {
"name": "typos_dataset",
"dataset_iterator": {
"name": "typos_iterator",
"test_ratio": 0.02
},
"chainer":{
Original file line number Diff line number Diff line change
@@ -2,8 +2,8 @@
"dataset_reader": {
"name": "typos_kartaslov_reader"
},
"dataset": {
"name": "typos_dataset",
"dataset_iterator": {
"name": "typos_iterator",
"test_ratio": 0.02
},
"chainer":{
4 changes: 2 additions & 2 deletions deeppavlov/configs/error_model/brillmoore_wikitypos_en.json
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,8 @@
"dataset_reader": {
"name": "typos_wikipedia_reader"
},
"dataset": {
"name": "typos_dataset",
"dataset_iterator": {
"name": "typos_iterator",
"test_ratio": 0.05
},
"chainer":{
24 changes: 12 additions & 12 deletions deeppavlov/configs/go_bot/gobot_dstc2.json
@@ -1,10 +1,10 @@
{
"dataset_reader": {
"name": "dstc2_datasetreader",
"name": "dstc2_reader",
"data_path": "dstc2"
},
"dataset": {
"name": "dialog_dataset"
"dataset_iterator": {
"name": "dialog_iterator"
},
"chainer": {
"in": ["x"],
@@ -16,7 +16,7 @@
"fit_on": ["x"],
"name": "default_vocab",
"level": "token",
"tokenize": true,
"tokenizer": { "name": "split_tokenizer" },
"save_path": "vocabs/token.dict",
"load_path": "vocabs/token.dict"
},
@@ -64,7 +64,7 @@
"save_path": "go_bot/model",
"learning_rate": 0.002,
"dropout_rate": 0.8,
"hidden_dim": 128,
"hidden_size": 128,
"dense_size": 64,
"obs_size": 530,
"action_size": 45
@@ -89,8 +89,8 @@
},
"intent_classifier": {
"name": "intent_model",
"save_path": "intents/intent_cnn",
"load_path": "intents/intent_cnn",
"save_path": "intents/intent_cnn_v2",
"load_path": "intents/intent_cnn_v2",
"classes": "#classes_vocab.keys()",
"opt": {
"train_now": true,
@@ -123,8 +123,8 @@
},
"embedder": {
"name": "fasttext",
"save_path": "embeddings/dstc2_fasttext_model_100.bin",
"load_path": "embeddings/dstc2_fasttext_model_100.bin",
"save_path": "embeddings/dstc2_fastText_model.bin",
"load_path": "embeddings/dstc2_fastText_model.bin",
"emb_module": "fasttext",
"dim": 100
},
@@ -138,7 +138,8 @@
"name": "bow"
},
"tokenizer": {
"name": "spacy_tokenizer"
"name": "stream_spacy_tokenizer",
"lowercase": false
},
"tracker": {
"name": "featurized_tracker",
@@ -156,7 +157,7 @@
},
"train": {
"epochs": 200,
"batch_size": 1,
"batch_size": 2,

"metrics": ["per_item_dialog_accuracy"],
"validation_patience": 20,
@@ -167,4 +168,3 @@
"show_examples": false
}
}
