refactor: refactor seq2seq-gobot and speller
nikolay-bushkov committed Jul 20, 2018
1 parent 17ca0bf commit 64a32c4
Showing 6 changed files with 343 additions and 5 deletions.
13 changes: 11 additions & 2 deletions docs/index.rst
@@ -9,14 +9,23 @@ Welcome to DeepPavlov's documentation!
.. toctree::
:glob:
:maxdepth: 1
:caption: User Documentation
:caption: Getting Started

Installation <userdocs/installation>
Getting Started <userdocs/getting_started>
Hello bot! <userdocs/hello_bot>


.. toctree::
:glob:
:maxdepth: 1
:caption: Components

Classification <userdocs/classifiers>
Slot filling <userdocs/slot_filling>
Goal-Oriented Dialogue Bot <userdocs/go_bot>
Sequence-To-Sequence Dialogue Bot <userdocs/seq2seq_go_bot>
Named Entity Recognition <userdocs/ner>
Spelling Correction <userdocs/spelling_correction>


.. toctree::
3 changes: 0 additions & 3 deletions docs/userdocs/getting_started.rst

This file was deleted.

4 changes: 4 additions & 0 deletions docs/userdocs/hello_bot.rst
@@ -0,0 +1,4 @@
Hello bot!
==========

Simple example here.
1 change: 1 addition & 0 deletions docs/userdocs/installation.rst
@@ -1,3 +1,4 @@
Installation
============

Installation instruction.
125 changes: 125 additions & 0 deletions docs/userdocs/seq2seq_go_bot.rst
@@ -0,0 +1,125 @@
Sequence-To-Sequence Dialogue Bot for goal-oriented tasks
==========================================================

The dialogue agent is able to handle multi-domain data.
The model is end-to-end differentiable and does not need to explicitly
model the dialogue state or belief trackers.

We also used a new dataset of 3,031 dialogues released by the Stanford
NLP Group `[2] <#references>`__ that are grounded through
underlying knowledge bases and span three distinct tasks in the in-car
personal assistant space:
calendar scheduling, weather information retrieval, and
point-of-interest navigation.

Here is a simple example of interaction with a trained dialogue bot
(which can be downloaded with ``deeppavlov/download.py``):

::

x_text:: Where's the nearest parking garage?
>> the nearest is dish parking at dish_parking_address. is that okay?
x_text:: Yes, please set directions via a route that avoids all heavy traffic if possible.
>> it looks like there is a road block being reported on the route but i will still find the quickest route to dish_parking_address.
x_text:: Thanks so much for your help.
>> you're welcome. have a good day. end_of_dialogue

x_text:: Will it be warm in Camarillo over the next 2 days?
>> no, it's not gon na be warm in camarillo over the next 2 days
x_text:: Thank you car!
>> you're welcome! end_of_dialogue

x_text:: What day is my lab appointment, and who all will be there?
>> your lab appointment is on lab_appointment_date at lab_appointment_time with lab_appointment_party.
x_text:: Thanks!
>> you're welcome.

Usage
-----

Requirements
^^^^^^^^^^^^

To use a seq2seq\_go\_bot model you should have a pretrained
goal-oriented bot model:

- config
  ``deeppavlov/configs/seq2seq_go_bot/bot_kvret_infer.json``
  is recommended for inference mode

- config
  ``deeppavlov/configs/seq2seq_go_bot/bot_kvret.json``
  is recommended for train mode
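
For example, training and then interacting with the bot from the command
line might look like this (a sketch assuming the ``train`` mode of the
``deep.py`` script shown in the usage example below):

::

    python3 deep.py train deeppavlov/configs/seq2seq_go_bot/bot_kvret.json
    python3 deep.py interact deeppavlov/configs/seq2seq_go_bot/bot_kvret_infer.json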

Config parameters:
^^^^^^^^^^^^^^^^^^

- ``name`` always equals ``"seq2seq_go_bot"``
- ``source_vocab`` — vocabulary of tokens from context (source)
utterances
- ``target_vocab`` — vocabulary of tokens from response (target)
utterances
- ``start_of_sequence_token`` — token corresponding to the start of
sequence during decoding
- ``end_of_sequence_token`` — token corresponding to the end of
sequence during decoding
- ``bow_encoder`` — one of the bag-of-words encoders from the
  ``deeppavlov.models.encoders.bow`` module

  - ``name`` — encoder name
  - other arguments specific to your encoder

- ``debug`` — whether to display debug output (defaults to ``false``)
  *(optional)*
- ``network`` — recurrent network that handles the encoder-decoder
  mechanism

  - ``name`` always equals ``"seq2seq_go_bot_nn"``
  - ``learning_rate`` — learning rate during training
  - ``target_start_of_sequence_index`` — index of
    ``start_of_sequence_token`` in the decoder vocabulary
  - ``target_end_of_sequence_index`` — index of ``end_of_sequence_token``
    in the decoder vocabulary
  - ``source_vocab_size`` — size of the encoder token vocabulary (the
    size of ``source_vocab`` is recommended)
  - ``target_vocab_size`` — size of the decoder token vocabulary (the
    size of ``target_vocab`` is recommended)
  - ``hidden_size`` — LSTM hidden state size, shared by the encoder and
    decoder
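
Put together, a component entry with these parameters might look like
the following sketch, written as the kind of Python dict that
``build_model_from_config`` consumes (all values, the ``"bow"`` encoder
name and the ``#...`` vocabulary references are illustrative
assumptions, not the shipped ``bot_kvret.json``):

.. code:: python

    # Illustrative sketch only: values, the encoder name and the "#..."
    # vocabulary references are assumptions, not the shipped config.
    seq2seq_go_bot_component = {
        "name": "seq2seq_go_bot",
        "source_vocab": "#src_vocab",        # vocabulary of context tokens
        "target_vocab": "#tgt_vocab",        # vocabulary of response tokens
        "start_of_sequence_token": "<SOS>",
        "end_of_sequence_token": "<EOS>",
        "bow_encoder": {"name": "bow"},      # a bag-of-words encoder
        "debug": False,
        "network": {
            "name": "seq2seq_go_bot_nn",
            "learning_rate": 0.0003,
            "target_start_of_sequence_index": 0,
            "target_end_of_sequence_index": 1,
            "source_vocab_size": 10000,      # size of source_vocab recommended
            "target_vocab_size": 10000,      # size of target_vocab recommended
            "hidden_size": 128,              # shared by encoder and decoder
        },
    }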

Usage example
^^^^^^^^^^^^^

- To infer from a pretrained model with config path equal to
  ``path/to/config.json``:

  .. code:: python

      from deeppavlov.core.commands.infer import build_model_from_config
      from deeppavlov.core.common.file import read_json

      CONFIG_PATH = 'path/to/config.json'
      model = build_model_from_config(read_json(CONFIG_PATH))

      # simple interactive loop: type 'exit' to stop
      utterance = input(':: ')
      while utterance != 'exit':
          print(">> " + model([utterance])[0])
          utterance = input(':: ')

- To interact via command line use the ``deeppavlov/deep.py`` script:

  .. code:: bash

      cd deeppavlov
      python3 deep.py interact path/to/config.json

References
----------

[1] `A New Multi-Turn, Multi-Domain, Task-Oriented Dialogue
Dataset <https://nlp.stanford.edu/blog/a-new-multi-turn-multi-domain-task-oriented-dialogue-dataset/>`__

[2] Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher
D. Manning, `Key-Value Retrieval Networks for Task-Oriented
Dialogue <https://arxiv.org/abs/1705.05414>`__, 2017

202 changes: 202 additions & 0 deletions docs/userdocs/spelling_correction.rst
@@ -0,0 +1,202 @@
Automatic spelling correction pipelines
=======================================

We provide two types of pipelines for spelling correction:
`levenstein_corrector <#levenstein_corrector>`__
uses the simple Damerau-Levenstein distance to find correction
candidates, while `brillmoore <#brillmoore>`__
uses a statistics-based error model. In both cases correction
candidates are chosen based on context
with the help of a `kenlm language model <#language-model>`__.
You can find `a comparison <#comparison>`__ of these and other
approaches near the end of this page.

Quick start
-----------

You can run the following command to try the provided pipelines:

::

python -m deeppavlov interact <path_to_config> [-d]

where ``<path_to_config>`` is one of the provided config files in
``deeppavlov/configs/spelling_correction``.
With the optional ``-d`` parameter, all the data required to run the
selected pipeline will be downloaded, including
an appropriate language model.
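
For example, to try the Russian Brill-Moore pipeline and download its
data (this is the same config file used in the code example below):

::

    python -m deeppavlov interact deeppavlov/configs/spelling_correction/brillmoore_kartaslov_ru.json -d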

After downloading the required files you can use these configs in your
Python code.
For example, the following code reads lines from stdin and prints
corrected lines to stdout:

.. code:: python

    import json
    import sys

    from deeppavlov.core.commands.infer import build_model_from_config

    CONFIG_PATH = 'deeppavlov/configs/spelling_correction/brillmoore_kartaslov_ru.json'

    with open(CONFIG_PATH) as config_file:
        config = json.load(config_file)
    model = build_model_from_config(config)

    # correct every line of stdin and print the result immediately
    for line in sys.stdin:
        print(model([line])[0], flush=True)

levenstein_corrector
---------------------

The component ``levenstein/searcher_component.py`` finds all the
candidates in a static dictionary
within a set Damerau-Levenstein distance.
It can split one token into two, but it will not work the other way
around.

Component config parameters:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ``in`` — list with one element: name of this component's input in
chainer's shared memory
- ``out`` — list with one element: name for this component's output in
chainer's shared memory
- ``name`` always equals ``"spelling_levenstein"``. Optional if the
  ``class`` attribute is present
- ``class`` always equals
  ``deeppavlov.models.spelling_correction.levenstein.searcher_component:LevensteinSearcherComponent``.
  Optional if the ``name`` attribute is present
- ``words`` — list of all correct words (should be a reference)
- ``max_distance`` — maximum allowed Damerau-Levenstein distance
between source words and candidates
- ``error_probability`` — assigned probability for every edit
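
Put together, a component entry with these parameters might look like
the following sketch (the values and the ``#words_dict.words`` reference
are illustrative assumptions, not the shipped
``levenstein_corrector_ru.json``):

.. code:: python

    # Illustrative sketch only: values and the "#..." reference are
    # assumptions, not the shipped config.
    levenstein_component = {
        "in": ["x"],                   # input name in chainer's shared memory
        "out": ["y"],                  # output name in chainer's shared memory
        "name": "spelling_levenstein",
        "words": "#words_dict.words",  # reference to a dictionary's word list
        "max_distance": 1,
        "error_probability": 1e-4,
    }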

brillmoore
----------

The component ``brillmoore/error_model.py`` is based on
`An Improved Error Model for Noisy Channel Spelling
Correction <http://www.aclweb.org/anthology/P00-1037>`__
by Eric Brill and Robert C. Moore and uses a statistics-based error
model to find the best candidates in a static dictionary.

Component config parameters:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ``in`` — list with one element: name of this component's input in
chainer's shared memory
- ``out`` — list with one element: name for this component's output in
chainer's shared memory
- ``name`` always equals ``"spelling_error_model"``. Optional if the
  ``class`` attribute is present
- ``class`` always equals
  ``deeppavlov.models.spelling_correction.brillmoore.error_model:ErrorModel``.
  Optional if the ``name`` attribute is present
- ``save_path`` — path where the model will be saved after a
  training session
- ``load_path`` — path to the pretrained model
- ``window`` — window size for the error model from ``0`` to ``4``,
defaults to ``1``
- ``candidates_count`` — maximum allowed count of candidates for every
source token
- ``dictionary`` — description of a static dictionary model, instance
of (or inherited from)
``deeppavlov.vocabs.static_dictionary.StaticDictionary``

- ``name`` — ``"static_dictionary"`` for a custom dictionary or one
of two provided:

- ``"russian_words_vocab"`` to automatically download and use a
list of russian words from
`https://github.com/danakt/russian-words/ <https://github.com/danakt/russian-words/>`__
- ``"wikitionary_100K_vocab"`` to automatically download a list
of most common words from Project Gutenberg from
`Wiktionary <https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists#Project_Gutenberg>`__

- ``dictionary_name`` — name of a directory where a dictionary will
be built to and loaded from, defaults to ``"dictionary"`` for
static\_dictionary
- ``raw_dictionary_path`` — path to a file with a line-separated
list of dictionary words, required for static\_dictionary
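
Put together, a ``spelling_error_model`` entry might look like the
following sketch (paths and values are illustrative assumptions, not the
shipped ``brillmoore_kartaslov_ru.json``):

.. code:: python

    # Illustrative sketch only: paths and values are assumptions.
    error_model_component = {
        "in": ["x"],
        "out": ["y"],
        "name": "spelling_error_model",
        "save_path": "spelling_correction/error_model.tsv",
        "load_path": "spelling_correction/error_model.tsv",
        "window": 1,              # error model window size, 0 to 4
        "candidates_count": 4,    # max candidates per source token
        "dictionary": {"name": "russian_words_vocab"},
    }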

Training configuration
^^^^^^^^^^^^^^^^^^^^^^

For the training phase, the config file also needs to include these
parameters:

- ``dataset_iterator`` — should always be set like
  ``"dataset_iterator": {"name": "typos_iterator"}``

  - ``name`` always equals ``typos_iterator``
  - ``test_ratio`` — ratio of test data to train data, from ``0.`` to
    ``1.``, defaults to ``0.``

- ``dataset_reader``

  - ``name`` — ``typos_custom_reader`` for a custom dataset or one of
    two provided:

    - ``typos_kartaslov_reader`` to automatically download and
      process a dataset of Russian misspellings from
      https://github.com/dkulagin/kartaslov/tree/master/dataset/orfo_and_typos
    - ``typos_wikipedia_reader`` to automatically download and
      process a list of common misspellings from English
      Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines

  - ``data_path`` — required for typos\_custom\_reader: a path to a
    dataset file,
    where each line contains a misspelling and the correct spelling
    of a word separated by a tab symbol

The configuration of the ``spelling_error_model`` component also has to
include a ``fit_on`` parameter: a list of two elements, the names of the
component's input and the true output in chainer's shared
memory.
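
A training config might therefore add something like the following on
top of the component entry above (a sketch; the ``test_ratio`` value and
the shared-memory names are illustrative assumptions):

.. code:: python

    # Illustrative sketch only: values and names are assumptions.
    training_extras = {
        "dataset_reader": {"name": "typos_kartaslov_reader"},
        "dataset_iterator": {"name": "typos_iterator", "test_ratio": 0.02},
    }
    # and inside the spelling_error_model component entry itself:
    # "fit_on": ["x", "y"]   # names of the input and the true output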

Language model
--------------

The provided pipelines use `KenLM <http://kheafield.com/code/kenlm/>`__
to process language models, so if you want to build your own,
we suggest consulting its website. We also provide our own
language models for
`English <http://lnsigo.mipt.ru/export/lang_models/en_wiki_no_punkt.arpa.binary.gz>`__
(5.5GB) and
`Russian <http://lnsigo.mipt.ru/export/lang_models/ru_wiyalen_no_punkt.arpa.binary.gz>`__
(3.1GB).
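
If you do build your own model, a typical KenLM invocation looks like
this (a sketch assuming the ``lmplz`` and ``build_binary`` tools have
been compiled from the KenLM sources):

::

    # estimate a 3-gram ARPA model from a plain-text corpus,
    # then convert it to KenLM's binary format for faster loading
    lmplz -o 3 < corpus.txt > lm.arpa
    build_binary lm.arpa lm.binary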

Comparison
----------

We compared our pipelines with
`Yandex.Speller <http://api.yandex.ru/speller/>`__,
`JamSpell <https://github.com/bakwc/JamSpell>`__ (trained on the
biggest part of our Russian text corpus that JamSpell could handle) and
`PyHunSpell <https://github.com/blatinier/pyhunspell>`__
on the `test
set <http://www.dialog-21.ru/media/3838/test_sample_testset.txt>`__
for the `SpellRuEval
competition <http://www.dialog-21.ru/en/evaluation/2016/spelling_correction/>`__
on Automatic Spelling Correction for Russian:

+--------------------------------------------------------------------------------------------------------+-------------+----------+-------------+-----------------------+
| Correction method | Precision | Recall | F-measure | Speed (sentences/s) |
+========================================================================================================+=============+==========+=============+=======================+
| Yandex.Speller | 83.09 | 59.86 | 69.59 | 5. |
+--------------------------------------------------------------------------------------------------------+-------------+----------+-------------+-----------------------+
| Damerau Levenstein 1 + lm (deeppavlov/configs/spelling_correction/levenstein_corrector_ru.json) | 53.26 | 53.74 | 53.50 | 29.3 |
+--------------------------------------------------------------------------------------------------------+-------------+----------+-------------+-----------------------+
| Brill Moore top 4 + lm (deeppavlov/configs/spelling_correction/brillmoore_kartaslov_ru.json) | 51.92 | 53.94 | 52.91 | 0.6 |
+--------------------------------------------------------------------------------------------------------+-------------+----------+-------------+-----------------------+
| Hunspell + lm | 41.03 | 48.89 | 44.61 | 2.1 |
+--------------------------------------------------------------------------------------------------------+-------------+----------+-------------+-----------------------+
| JamSpell | 44.57 | 35.69 | 39.64 | 136.2 |
+--------------------------------------------------------------------------------------------------------+-------------+----------+-------------+-----------------------+
| Brill Moore top 1 (deeppavlov/configs/spelling_correction/brillmoore_kartaslov_ru_nolm.json) | 41.29 | 37.26 | 39.17 | 2.4 |
+--------------------------------------------------------------------------------------------------------+-------------+----------+-------------+-----------------------+
| Hunspell | 30.30 | 34.02 | 32.06 | 20.3 |
+--------------------------------------------------------------------------------------------------------+-------------+----------+-------------+-----------------------+
