refactor: refactor seq2seq-gobot and speller

Commit 64a32c4 (parent 17ca0bf), showing 6 changed files with 343 additions and 5 deletions.
Hello bot!
==========

Simple example here.

Installation
============

Installation instructions.

Sequence-To-Sequence Dialogue Bot for goal-oriented task
========================================================

The dialogue agent is able to handle multi-domain data. The model is
end-to-end differentiable and does not need to explicitly model a
dialogue state or belief trackers.

We also used a new dataset of 3,031 dialogues released by the Stanford
NLP Group `[2] <#references>`__ that are grounded through underlying
knowledge bases and span three distinct tasks in the in-car personal
assistant space: calendar scheduling, weather information retrieval,
and point-of-interest navigation.

Here is a simple example of interaction with a trained dialogue bot
(it can be downloaded with ``deeppavlov/download.py``):

::

    x_text:: Where's the nearest parking garage?
    >> the nearest is dish parking at dish_parking_address. is that okay?
    x_text:: Yes, please set directions via a route that avoids all heavy traffic if possible.
    >> it looks like there is a road block being reported on the route but i will still find the quickest route to dish_parking_address.
    x_text:: Thanks so much for your help.
    >> you're welcome. have a good day. end_of_dialogue

    x_text:: Will it be warm in Camarillo over the next 2 days?
    >> no, it's not gon na be warm in camarillo over the next 2 days
    x_text:: Thank you car!
    >> you're welcome! end_of_dialogue

    x_text:: What day is my lab appointment, and who all will be there?
    >> your lab appointment is on lab_appointment_date at lab_appointment_time with lab_appointment_party.
    x_text:: Thanks!
    >> you're welcome.

Usage
-----

Requirements
^^^^^^^^^^^^

To use a seq2seq\_go\_bot model you should have a pretrained
goal-oriented bot model:

- config ``deeppavlov/configs/seq2seq_go_bot/bot_kvret_infer.json``
  is recommended for inference mode
- config ``deeppavlov/configs/seq2seq_go_bot/bot_kvret.json``
  is recommended for train mode

Config parameters:
^^^^^^^^^^^^^^^^^^

- ``name`` always equals ``"seq2seq_go_bot"``
- ``source_vocab`` — vocabulary of tokens from context (source)
  utterances
- ``target_vocab`` — vocabulary of tokens from response (target)
  utterances
- ``start_of_sequence_token`` — token corresponding to the start of a
  sequence during decoding
- ``end_of_sequence_token`` — token corresponding to the end of a
  sequence during decoding
- ``bow_encoder`` — one of the bag-of-words encoders from the
  ``deeppavlov.models.encoders.bow`` module

  - ``name`` — encoder name
  - other arguments specific to your encoder

- ``debug`` — whether to display debug output (defaults to ``false``)
  *(optional)*
- ``network`` — recurrent network that handles the encoder-decoder
  mechanism

  - ``name`` equals ``"seq2seq_go_bot_nn"``
  - ``learning_rate`` — learning rate during training
  - ``target_start_of_sequence_index`` — index of
    ``start_of_sequence_token`` in the decoder vocabulary
  - ``target_end_of_sequence_index`` — index of
    ``end_of_sequence_token`` in the decoder vocabulary
  - ``source_vocab_size`` — size of the encoder token vocabulary (the
    size of ``source_vocab`` is recommended)
  - ``target_vocab_size`` — size of the decoder token vocabulary (the
    size of ``target_vocab`` is recommended)
  - ``hidden_size`` — LSTM hidden state size, equal for encoder and
    decoder
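
As a rough sketch of how these parameters fit together, here is a
hypothetical component config fragment written as a Python dict. All
values below are illustrative placeholders, not taken from the shipped
``bot_kvret.json``:

```python
# Hypothetical seq2seq_go_bot component config fragment.
# All values are placeholders chosen for illustration; consult
# deeppavlov/configs/seq2seq_go_bot/bot_kvret.json for real settings.
seq2seq_go_bot = {
    "name": "seq2seq_go_bot",
    "source_vocab": "#token_vocab",      # reference to a vocabulary component
    "target_vocab": "#token_vocab",
    "start_of_sequence_token": "<SOS>",
    "end_of_sequence_token": "<EOS>",
    "bow_encoder": {"name": "bow"},      # a bag-of-words encoder
    "debug": False,
    "network": {
        "name": "seq2seq_go_bot_nn",
        "learning_rate": 0.002,
        "target_start_of_sequence_index": 0,
        "target_end_of_sequence_index": 1,
        "source_vocab_size": 10000,
        "target_vocab_size": 10000,
        "hidden_size": 128,
    },
}
```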

Usage example
^^^^^^^^^^^^^

- To infer from a pretrained model with config path equal to
  ``path/to/config.json``:

  .. code:: python

     from deeppavlov.core.commands.infer import build_model_from_config
     from deeppavlov.core.common.file import read_json

     CONFIG_PATH = 'path/to/config.json'
     model = build_model_from_config(read_json(CONFIG_PATH))

     utterance = input(':: ')
     while utterance != 'exit':
         print('>> ' + model([utterance])[0])
         utterance = input(':: ')

- To interact via the command line, use the ``deeppavlov/deep.py``
  script:

  .. code:: bash

     cd deeppavlov
     python3 deep.py interact path/to/config.json

References
----------

[1] `A New Multi-Turn, Multi-Domain, Task-Oriented Dialogue Dataset
<https://nlp.stanford.edu/blog/a-new-multi-turn-multi-domain-task-oriented-dialogue-dataset/>`__

[2] `Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher
D. Manning, Key-Value Retrieval Networks for Task-Oriented Dialogue,
2017 <https://arxiv.org/abs/1705.05414>`__

Automatic spelling correction pipelines
=======================================

We provide two types of pipelines for spelling correction:
`levenstein_corrector <#levenstein_corrector>`__ uses the simple
Damerau-Levenshtein distance to find correction candidates, while
`brillmoore <#brillmoore>`__ uses a statistics-based error model. In
both cases correction candidates are chosen based on context with the
help of a `kenlm language model <#language-model>`__. You can find `a
comparison <#comparison>`__ of these and other approaches near the end
of this readme.
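
To illustrate the context-based selection, here is a toy sketch (not
DeepPavlov code) of picking candidate corrections with a
language-model score; ``candidates_fn`` and ``lm_score`` are
hypothetical stand-ins for the dictionary search and the kenlm model:

```python
import itertools

def choose_corrections(tokens, candidates_fn, lm_score):
    """Pick the candidate sequence with the highest language-model score."""
    candidate_lists = [candidates_fn(token) for token in tokens]
    # Exhaustive search over all combinations; real pipelines would
    # use beam search to keep this tractable.
    best = max(itertools.product(*candidate_lists),
               key=lambda seq: lm_score(' '.join(seq)))
    return list(best)
```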

Quick start
-----------

You can run the following command to try out the provided pipelines:

::

    python -m deeppavlov interact <path_to_config> [-d]

where ``<path_to_config>`` is one of the provided config files in
``deeppavlov/configs/spelling_correction``. With the optional ``-d``
parameter, all the data required to run the selected pipeline will be
downloaded, including an appropriate language model.

After downloading the required files you can use these configs in your
Python code. For example, this code will read lines from stdin and
print corrected lines to stdout:

.. code:: python

   import json
   import sys

   from deeppavlov.core.commands.infer import build_model_from_config

   CONFIG_PATH = 'deeppavlov/configs/spelling_correction/brillmoore_kartaslov_ru.json'

   with open(CONFIG_PATH) as config_file:
       config = json.load(config_file)

   model = build_model_from_config(config)
   for line in sys.stdin:
       print(model([line])[0], flush=True)

levenstein_corrector
--------------------

The component ``levenstein/searcher_component.py`` finds all
candidates in a static dictionary within a set Damerau-Levenshtein
distance. It can split one token into two, but it will not work the
other way around.

Component config parameters:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ``in`` — list with one element: the name of this component's input
  in the chainer's shared memory
- ``out`` — list with one element: the name for this component's
  output in the chainer's shared memory
- ``name`` always equals ``"spelling_levenstein"``. Optional if the
  ``class`` attribute is present
- ``class`` always equals
  ``deeppavlov.models.spelling_correction.levenstein.searcher_component:LevensteinSearcherComponent``.
  Optional if the ``name`` attribute is present
- ``words`` — list of all correct words (should be a reference)
- ``max_distance`` — maximum allowed Damerau-Levenshtein distance
  between source words and candidates
- ``error_probability`` — probability assigned to every edit
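
For reference, the distance in question can be sketched as the
restricted Damerau-Levenshtein (optimal string alignment) distance.
This is an illustrative implementation, not the component's actual
code:

```python
def damerau_levenshtein(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```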

brillmoore
----------

The component ``brillmoore/error_model.py`` is based on `An Improved
Error Model for Noisy Channel Spelling Correction
<http://www.aclweb.org/anthology/P00-1037>`__ by Eric Brill and Robert
C. Moore and uses a statistics-based error model to find the best
candidates in a static dictionary.

Component config parameters:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ``in`` — list with one element: the name of this component's input
  in the chainer's shared memory
- ``out`` — list with one element: the name for this component's
  output in the chainer's shared memory
- ``name`` always equals ``"spelling_error_model"``. Optional if the
  ``class`` attribute is present
- ``class`` always equals
  ``deeppavlov.models.spelling_correction.brillmoore.error_model:ErrorModel``.
  Optional if the ``name`` attribute is present
- ``save_path`` — path where the model will be saved after a training
  session
- ``load_path`` — path to the pretrained model
- ``window`` — window size for the error model, from ``0`` to ``4``;
  defaults to ``1``
- ``candidates_count`` — maximum allowed number of candidates for
  every source token
- ``dictionary`` — description of a static dictionary model, an
  instance of (or inherited from)
  ``deeppavlov.vocabs.static_dictionary.StaticDictionary``

  - ``name`` — ``"static_dictionary"`` for a custom dictionary or one
    of the two provided:

    - ``"russian_words_vocab"`` to automatically download and use a
      list of Russian words from
      `https://github.com/danakt/russian-words/ <https://github.com/danakt/russian-words/>`__
    - ``"wikitionary_100K_vocab"`` to automatically download a list of
      the most common words from Project Gutenberg from
      `Wiktionary <https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists#Project_Gutenberg>`__

  - ``dictionary_name`` — name of a directory where a dictionary will
    be built and loaded from; defaults to ``"dictionary"`` for
    static\_dictionary
  - ``raw_dictionary_path`` — path to a file with a line-separated
    list of dictionary words; required for static\_dictionary
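
A hypothetical ``spelling_error_model`` fragment combining the
parameters above, written as a Python dict. Paths and values are
placeholders, not the shipped settings; see
``brillmoore_kartaslov_ru.json`` for the real config:

```python
# Hypothetical spelling_error_model component config fragment.
# Paths and counts are placeholders chosen for illustration only.
error_model = {
    "in": ["x"],
    "out": ["y_predicted"],
    "name": "spelling_error_model",
    "save_path": "spelling_error_model/error_model.tsv",
    "load_path": "spelling_error_model/error_model.tsv",
    "window": 1,
    "candidates_count": 4,
    "dictionary": {
        "name": "russian_words_vocab",
    },
}
```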

Training configuration
^^^^^^^^^^^^^^^^^^^^^^

For the training phase, the config file also needs to include these
parameters:

- ``dataset_iterator`` — it should always be set like
  ``"dataset_iterator": {"name": "typos_iterator"}``

  - ``name`` always equals ``typos_iterator``
  - ``test_ratio`` — ratio of test data to train data, from ``0.`` to
    ``1.``; defaults to ``0.``

- ``dataset_reader``

  - ``name`` — ``typos_custom_reader`` for a custom dataset or one of
    the two provided:

    - ``typos_kartaslov_reader`` to automatically download and process
      a misspellings dataset for the Russian language from
      https://github.com/dkulagin/kartaslov/tree/master/dataset/orfo_and_typos
    - ``typos_wikipedia_reader`` to automatically download and process
      a list of common misspellings from the English Wikipedia:
      https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines

  - ``data_path`` — required for typos\_custom\_reader as a path to a
    dataset file, where each line contains a misspelling and a correct
    spelling of a word separated by a tab symbol

The component's configuration for ``spelling_error_model`` also has to
include a ``fit_on`` parameter: a list of two elements, the names of
the component's input and true output in the chainer's shared memory.
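
The training-specific parts described above could be sketched like
this (again a hypothetical fragment with placeholder values):

```python
# Hypothetical training-related config fragments; values are placeholders.
dataset_reader = {"name": "typos_kartaslov_reader"}
dataset_iterator = {"name": "typos_iterator", "test_ratio": 0.1}

# The spelling_error_model component additionally carries a fit_on list:
fit_on = ["x", "y"]  # names of the input and true output in shared memory
```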

Language model
--------------

The provided pipelines use `KenLM <http://kheafield.com/code/kenlm/>`__
to process language models, so if you want to build your own, we
suggest you consult its website. We also provide our own language
models for the
`English <http://lnsigo.mipt.ru/export/lang_models/en_wiki_no_punkt.arpa.binary.gz>`__
(5.5GB) and
`Russian <http://lnsigo.mipt.ru/export/lang_models/ru_wiyalen_no_punkt.arpa.binary.gz>`__
(3.1GB) languages.

Comparison
----------

We compared our pipelines with
`Yandex.Speller <http://api.yandex.ru/speller/>`__,
`JamSpell <https://github.com/bakwc/JamSpell>`__ (trained on the
biggest part of our Russian text corpus that JamSpell could handle)
and `PyHunSpell <https://github.com/blatinier/pyhunspell>`__ on the
`test set <http://www.dialog-21.ru/media/3838/test_sample_testset.txt>`__
for the `SpellRuEval
competition <http://www.dialog-21.ru/en/evaluation/2016/spelling_correction/>`__
on Automatic Spelling Correction for Russian:

.. csv-table::
   :header: "Correction method", "Precision", "Recall", "F-measure", "Speed (sentences/s)"

   "Yandex.Speller", 83.09, 59.86, 69.59, 5.
   "Damerau Levenstein 1 + lm (deeppavlov/configs/spelling_correction/levenstein_corrector_ru.json)", 53.26, 53.74, 53.50, 29.3
   "Brill Moore top 4 + lm (deeppavlov/configs/spelling_correction/brillmoore_kartaslov_ru.json)", 51.92, 53.94, 52.91, 0.6
   "Hunspell + lm", 41.03, 48.89, 44.61, 2.1
   "JamSpell", 44.57, 35.69, 39.64, 136.2
   "Brill Moore top 1 (deeppavlov/configs/spelling_correction/brillmoore_kartaslov_ru_nolm.json)", 41.29, 37.26, 39.17, 2.4
   "Hunspell", 30.30, 34.02, 32.06, 20.3