refactor: refactor seq2seq-gobot and speller
nikolay-bushkov committed Jul 20, 2018
1 parent 17ca0bf commit 64a32c4
Showing 6 changed files with 343 additions and 5 deletions.
13 changes: 11 additions & 2 deletions docs/index.rst
@@ -9,14 +9,23 @@ Welcome to DeepPavlov's documentation!
.. toctree::
:glob:
:maxdepth: 1
:caption: User Documentation
:caption: Getting Started

Installation <userdocs/installation>
Getting Started <userdocs/getting_started>
Hello bot! <userdocs/hello_bot>


.. toctree::
:glob:
:maxdepth: 1
:caption: Components

Classification <userdocs/classifiers>
Slot filling <userdocs/slot_filling>
Goal-Oriented Dialogue Bot <userdocs/go_bot>
Sequence-To-Sequence Dialogue Bot <userdocs/seq2seq_go_bot>
Named Entity Recognition <userdocs/ner>
Spelling Correction <userdocs/spelling_correction>


.. toctree::
3 changes: 0 additions & 3 deletions docs/userdocs/getting_started.rst

This file was deleted.

4 changes: 4 additions & 0 deletions docs/userdocs/hello_bot.rst
@@ -0,0 +1,4 @@
Hello bot!
==========

Simple example here.
1 change: 1 addition & 0 deletions docs/userdocs/installation.rst
@@ -1,3 +1,4 @@
Installation
============

Installation instruction.
125 changes: 125 additions & 0 deletions docs/userdocs/seq2seq_go_bot.rst
@@ -0,0 +1,125 @@
Sequence-To-Sequence Dialogue Bot for goal-oriented tasks
==========================================================

The dialogue agent is able to handle multi-domain data.
The model is end-to-end differentiable and does not need to explicitly
model the dialogue state or belief trackers.

We also used a new dataset of 3,031 dialogues released by the Stanford
NLP Group `[2] <#references>`__ that are grounded through
underlying knowledge bases and span three distinct tasks in the in-car
personal assistant space:
calendar scheduling, weather information retrieval, and
point-of-interest navigation.

Here is a simple example of interaction with a trained dialogue bot
(which can be downloaded with ``deeppavlov/download.py``):

::

x_text:: Where's the nearest parking garage?
>> the nearest is dish parking at dish_parking_address. is that okay?
x_text:: Yes, please set directions via a route that avoids all heavy traffic if possible.
>> it looks like there is a road block being reported on the route but i will still find the quickest route to dish_parking_address.
x_text:: Thanks so much for your help.
>> you're welcome. have a good day. end_of_dialogue

x_text:: Will it be warm in Camarillo over the next 2 days?
>> no, it's not gon na be warm in camarillo over the next 2 days
x_text:: Thank you car!
>> you're welcome! end_of_dialogue

x_text:: What day is my lab appointment, and who all will be there?
>> your lab appointment is on lab_appointment_date at lab_appointment_time with lab_appointment_party.
x_text:: Thanks!
>> you're welcome.

Usage
-----

Requirements
^^^^^^^^^^^^

To use a seq2seq\_go\_bot model you should have a pretrained
goal-oriented bot model:

- config
  ``deeppavlov/configs/seq2seq_go_bot/bot_kvret_infer.json``
  is recommended for inference mode

- config
  ``deeppavlov/configs/seq2seq_go_bot/bot_kvret.json``
  is recommended for train mode
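
For example, training and then interacting with the bot from the command
line might look like this (a sketch assuming the ``train`` mode of the
``deep.py`` script shown in the usage example below):

::

    python3 deep.py train deeppavlov/configs/seq2seq_go_bot/bot_kvret.json
    python3 deep.py interact deeppavlov/configs/seq2seq_go_bot/bot_kvret_infer.json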

Config parameters:
^^^^^^^^^^^^^^^^^^

- ``name`` always equals ``"seq2seq_go_bot"``
- ``source_vocab`` — vocabulary of tokens from context (source)
utterances
- ``target_vocab`` — vocabulary of tokens from response (target)
utterances
- ``start_of_sequence_token`` — token corresponding to the start of
sequence during decoding
- ``end_of_sequence_token`` — token corresponding to the end of
sequence during decoding
- ``bow_encoder`` — one of the bag-of-words encoders from the
  ``deeppavlov.models.encoders.bow`` module

  - ``name`` — encoder name
  - other arguments specific to your encoder

- ``debug`` — whether to display debug output (defaults to ``false``)
  *(optional)*
- ``network`` — recurrent network that handles the encoder-decoder
  mechanism

  - ``name`` always equals ``"seq2seq_go_bot_nn"``
  - ``learning_rate`` — learning rate during training
  - ``target_start_of_sequence_index`` — index of
    ``start_of_sequence_token`` in the decoder vocabulary
  - ``target_end_of_sequence_index`` — index of ``end_of_sequence_token``
    in the decoder vocabulary
  - ``source_vocab_size`` — size of the encoder token vocabulary (the
    size of ``source_vocab`` is recommended)
  - ``target_vocab_size`` — size of the decoder token vocabulary (the
    size of ``target_vocab`` is recommended)
  - ``hidden_size`` — LSTM hidden state size, shared by the encoder and
    decoder
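
Put together, a component entry with these parameters might look like
the following sketch, written as the kind of Python dict that
``build_model_from_config`` consumes (all values, the ``"bow"`` encoder
name and the ``#...`` vocabulary references are illustrative
assumptions, not the shipped ``bot_kvret.json``):

.. code:: python

    # Illustrative sketch only: values, the encoder name and the "#..."
    # vocabulary references are assumptions, not the shipped config.
    seq2seq_go_bot_component = {
        "name": "seq2seq_go_bot",
        "source_vocab": "#src_vocab",        # vocabulary of context tokens
        "target_vocab": "#tgt_vocab",        # vocabulary of response tokens
        "start_of_sequence_token": "<SOS>",
        "end_of_sequence_token": "<EOS>",
        "bow_encoder": {"name": "bow"},      # a bag-of-words encoder
        "debug": False,
        "network": {
            "name": "seq2seq_go_bot_nn",
            "learning_rate": 0.0003,
            "target_start_of_sequence_index": 0,
            "target_end_of_sequence_index": 1,
            "source_vocab_size": 10000,      # size of source_vocab recommended
            "target_vocab_size": 10000,      # size of target_vocab recommended
            "hidden_size": 128,              # shared by encoder and decoder
        },
    }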

Usage example
^^^^^^^^^^^^^

- To infer from a pretrained model with config path equal to
  ``path/to/config.json``:

  .. code:: python

      from deeppavlov.core.commands.infer import build_model_from_config
      from deeppavlov.core.common.file import read_json

      CONFIG_PATH = 'path/to/config.json'
      model = build_model_from_config(read_json(CONFIG_PATH))

      # simple interactive loop: type 'exit' to stop
      utterance = input(':: ')
      while utterance != 'exit':
          print(">> " + model([utterance])[0])
          utterance = input(':: ')

- To interact via command line use the ``deeppavlov/deep.py`` script:

  .. code:: bash

      cd deeppavlov
      python3 deep.py interact path/to/config.json

References
----------

[1] `A New Multi-Turn, Multi-Domain, Task-Oriented Dialogue
Dataset <https://nlp.stanford.edu/blog/a-new-multi-turn-multi-domain-task-oriented-dialogue-dataset/>`__

[2] Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher
D. Manning, `Key-Value Retrieval Networks for Task-Oriented
Dialogue <https://arxiv.org/abs/1705.05414>`__, 2017

202 changes: 202 additions & 0 deletions docs/userdocs/spelling_correction.rst
@@ -0,0 +1,202 @@
Automatic spelling correction pipelines
=======================================

We provide two types of pipelines for spelling correction:
`levenstein_corrector <#levenstein_corrector>`__
uses the simple Damerau-Levenstein distance to find correction
candidates, while `brillmoore <#brillmoore>`__
uses a statistics-based error model. In both cases correction
candidates are chosen based on context
with the help of a `kenlm language model <#language-model>`__.
You can find `a comparison <#comparison>`__ of these and other
approaches near the end of this page.

Quick start
-----------

You can run the following command to try the provided pipelines:

::

python -m deeppavlov interact <path_to_config> [-d]

where ``<path_to_config>`` is one of the provided config files in
``deeppavlov/configs/spelling_correction``.
With the optional ``-d`` parameter, all the data required to run the
selected pipeline will be downloaded, including
an appropriate language model.
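
For example, to try the Russian Brill-Moore pipeline and download its
data (this is the same config file used in the code example below):

::

    python -m deeppavlov interact deeppavlov/configs/spelling_correction/brillmoore_kartaslov_ru.json -d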

After downloading the required files you can use these configs in your
Python code.
For example, the following code reads lines from stdin and prints
corrected lines to stdout:

.. code:: python

    import json
    import sys

    from deeppavlov.core.commands.infer import build_model_from_config

    CONFIG_PATH = 'deeppavlov/configs/spelling_correction/brillmoore_kartaslov_ru.json'

    with open(CONFIG_PATH) as config_file:
        config = json.load(config_file)
    model = build_model_from_config(config)

    # correct every line of stdin and print the result immediately
    for line in sys.stdin:
        print(model([line])[0], flush=True)

levenstein_corrector
---------------------

The component ``levenstein/searcher_component.py`` finds all the
candidates in a static dictionary
within a set Damerau-Levenstein distance.
It can split one token into two, but it will not work the other way
around.

Component config parameters:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ``in`` — list with one element: name of this component's input in
chainer's shared memory
- ``out`` — list with one element: name for this component's output in
chainer's shared memory
- ``name`` always equals ``"spelling_levenstein"``. Optional if the
  ``class`` attribute is present
- ``class`` always equals
  ``deeppavlov.models.spelling_correction.levenstein.searcher_component:LevensteinSearcherComponent``.
  Optional if the ``name`` attribute is present
- ``words`` — list of all correct words (should be a reference)
- ``max_distance`` — maximum allowed Damerau-Levenstein distance
between source words and candidates
- ``error_probability`` — assigned probability for every edit
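
Put together, a component entry with these parameters might look like
the following sketch (the values and the ``#words_dict.words`` reference
are illustrative assumptions, not the shipped
``levenstein_corrector_ru.json``):

.. code:: python

    # Illustrative sketch only: values and the "#..." reference are
    # assumptions, not the shipped config.
    levenstein_component = {
        "in": ["x"],                   # input name in chainer's shared memory
        "out": ["y"],                  # output name in chainer's shared memory
        "name": "spelling_levenstein",
        "words": "#words_dict.words",  # reference to a dictionary's word list
        "max_distance": 1,
        "error_probability": 1e-4,
    }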

brillmoore
----------

The component ``brillmoore/error_model.py`` is based on
`An Improved Error Model for Noisy Channel Spelling
Correction <http://www.aclweb.org/anthology/P00-1037>`__
by Eric Brill and Robert C. Moore and uses a statistics-based error
model to find the best candidates in a static dictionary.

Component config parameters:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ``in`` — list with one element: name of this component's input in
chainer's shared memory
- ``out`` — list with one element: name for this component's output in
chainer's shared memory
- ``name`` always equals ``"spelling_error_model"``. Optional if the
  ``class`` attribute is present
- ``class`` always equals
  ``deeppavlov.models.spelling_correction.brillmoore.error_model:ErrorModel``.
  Optional if the ``name`` attribute is present
- ``save_path`` — path where the model will be saved after a
  training session
- ``load_path`` — path to the pretrained model
- ``window`` — window size for the error model from ``0`` to ``4``,
defaults to ``1``
- ``candidates_count`` — maximum allowed count of candidates for every
source token
- ``dictionary`` — description of a static dictionary model, instance
of (or inherited from)
``deeppavlov.vocabs.static_dictionary.StaticDictionary``

- ``name`` — ``"static_dictionary"`` for a custom dictionary or one
of two provided:

- ``"russian_words_vocab"`` to automatically download and use a
list of russian words from
`https://github.com/danakt/russian-words/ <https://github.com/danakt/russian-words/>`__
- ``"wikitionary_100K_vocab"`` to automatically download a list
of most common words from Project Gutenberg from
`Wiktionary <https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists#Project_Gutenberg>`__

- ``dictionary_name`` — name of a directory where a dictionary will
be built to and loaded from, defaults to ``"dictionary"`` for
static\_dictionary
- ``raw_dictionary_path`` — path to a file with a line-separated
list of dictionary words, required for static\_dictionary
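
Put together, a ``spelling_error_model`` entry might look like the
following sketch (paths and values are illustrative assumptions, not the
shipped ``brillmoore_kartaslov_ru.json``):

.. code:: python

    # Illustrative sketch only: paths and values are assumptions.
    error_model_component = {
        "in": ["x"],
        "out": ["y"],
        "name": "spelling_error_model",
        "save_path": "spelling_correction/error_model.tsv",
        "load_path": "spelling_correction/error_model.tsv",
        "window": 1,              # error model window size, 0 to 4
        "candidates_count": 4,    # max candidates per source token
        "dictionary": {"name": "russian_words_vocab"},
    }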

Training configuration
^^^^^^^^^^^^^^^^^^^^^^

For the training phase, the config file also needs to include these
parameters:

- ``dataset_iterator`` — should always be set like
  ``"dataset_iterator": {"name": "typos_iterator"}``

  - ``name`` always equals ``typos_iterator``
  - ``test_ratio`` — ratio of test data to train data, from ``0.`` to
    ``1.``, defaults to ``0.``

- ``dataset_reader``

  - ``name`` — ``typos_custom_reader`` for a custom dataset or one of
    two provided:

    - ``typos_kartaslov_reader`` to automatically download and
      process a dataset of Russian misspellings from
      https://github.com/dkulagin/kartaslov/tree/master/dataset/orfo_and_typos
    - ``typos_wikipedia_reader`` to automatically download and
      process a list of common misspellings from English
      Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines

  - ``data_path`` — required for typos\_custom\_reader: a path to a
    dataset file,
    where each line contains a misspelling and the correct spelling
    of a word separated by a tab symbol

The configuration of the ``spelling_error_model`` component also has to
include a ``fit_on`` parameter: a list of two elements, the names of the
component's input and the true output in chainer's shared
memory.
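
A training config might therefore add something like the following on
top of the component entry above (a sketch; the ``test_ratio`` value and
the shared-memory names are illustrative assumptions):

.. code:: python

    # Illustrative sketch only: values and names are assumptions.
    training_extras = {
        "dataset_reader": {"name": "typos_kartaslov_reader"},
        "dataset_iterator": {"name": "typos_iterator", "test_ratio": 0.02},
    }
    # and inside the spelling_error_model component entry itself:
    # "fit_on": ["x", "y"]   # names of the input and the true output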

Language model
--------------

The provided pipelines use `KenLM <http://kheafield.com/code/kenlm/>`__
to process language models, so if you want to build your own,
we suggest consulting its website. We also provide our own
language models for
`English <http://lnsigo.mipt.ru/export/lang_models/en_wiki_no_punkt.arpa.binary.gz>`__
(5.5GB) and
`Russian <http://lnsigo.mipt.ru/export/lang_models/ru_wiyalen_no_punkt.arpa.binary.gz>`__
(3.1GB).
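
If you do build your own model, a typical KenLM invocation looks like
this (a sketch assuming the ``lmplz`` and ``build_binary`` tools have
been compiled from the KenLM sources):

::

    # estimate a 3-gram ARPA model from a plain-text corpus,
    # then convert it to KenLM's binary format for faster loading
    lmplz -o 3 < corpus.txt > lm.arpa
    build_binary lm.arpa lm.binary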

Comparison
----------

We compared our pipelines with
`Yandex.Speller <http://api.yandex.ru/speller/>`__,
`JamSpell <https://github.com/bakwc/JamSpell>`__ (trained on the
biggest part of our Russian text corpus that JamSpell could handle) and
`PyHunSpell <https://github.com/blatinier/pyhunspell>`__
on the `test
set <http://www.dialog-21.ru/media/3838/test_sample_testset.txt>`__
for the `SpellRuEval
competition <http://www.dialog-21.ru/en/evaluation/2016/spelling_correction/>`__
on Automatic Spelling Correction for Russian:

+--------------------------------------------------------------------------------------------------------+-------------+----------+-------------+-----------------------+
| Correction method | Precision | Recall | F-measure | Speed (sentences/s) |
+========================================================================================================+=============+==========+=============+=======================+
| Yandex.Speller | 83.09 | 59.86 | 69.59 | 5. |
+--------------------------------------------------------------------------------------------------------+-------------+----------+-------------+-----------------------+
| Damerau Levenstein 1 + lm (deeppavlov/configs/spelling_correction/levenstein_corrector_ru.json) | 53.26 | 53.74 | 53.50 | 29.3 |
+--------------------------------------------------------------------------------------------------------+-------------+----------+-------------+-----------------------+
| Brill Moore top 4 + lm (deeppavlov/configs/spelling_correction/brillmoore_kartaslov_ru.json) | 51.92 | 53.94 | 52.91 | 0.6 |
+--------------------------------------------------------------------------------------------------------+-------------+----------+-------------+-----------------------+
| Hunspell + lm | 41.03 | 48.89 | 44.61 | 2.1 |
+--------------------------------------------------------------------------------------------------------+-------------+----------+-------------+-----------------------+
| JamSpell | 44.57 | 35.69 | 39.64 | 136.2 |
+--------------------------------------------------------------------------------------------------------+-------------+----------+-------------+-----------------------+
| Brill Moore top 1 (deeppavlov/configs/spelling_correction/brillmoore_kartaslov_ru_nolm.json) | 41.29 | 37.26 | 39.17 | 2.4 |
+--------------------------------------------------------------------------------------------------------+-------------+----------+-------------+-----------------------+
| Hunspell | 30.30 | 34.02 | 32.06 | 20.3 |
+--------------------------------------------------------------------------------------------------------+-------------+----------+-------------+-----------------------+
