Automatic spelling correction pipelines

We provide two types of pipelines for spelling correction: levenshtein_corrector uses plain Damerau-Levenshtein distance to find correction candidates, and brillmoore uses a statistics-based error model. In both cases, the final correction is chosen from the candidates based on context with the help of a KenLM language model. A comparison of these and other approaches is given near the end of this readme.

Note

About 4.4 GB of disk space is required for the Russian language model and about 7 GB for the English one.

Quick start

First, install the additional requirements:

python -m deeppavlov install <path_to_config>

where <path_to_config> is a path to one of the :config:`provided config files <spelling_correction>` or its name without an extension, for example :config:`levenshtein_corrector_ru <spelling_correction/levenshtein_corrector_ru.json>`.

You can run the following command to try out the provided pipelines:

python -m deeppavlov interact <path_to_config> [-d]

where <path_to_config> is one of the :config:`provided config files <spelling_correction>`. With the optional -d parameter, all the data required to run the selected pipeline will be downloaded, including an appropriate language model.

After downloading the required files, you can use these configs in your Python code. For example, this code reads lines from stdin and prints corrected lines to stdout:

import sys

from deeppavlov import build_model, configs

CONFIG_PATH = configs.spelling_correction.brillmoore_kartaslov_ru

model = build_model(CONFIG_PATH, download=True)
for line in sys.stdin:
    print(model([line])[0], flush=True)

levenshtein_corrector

:class:`This component <deeppavlov.models.spelling_correction.levenshtein.LevenshteinSearcherComponent>` finds all candidates in a static dictionary within a set Damerau-Levenshtein distance of the source word. It can split one token into two, but it cannot merge two tokens into one.

Component config parameters:

  • in — list with one element: name of this component's input in chainer's shared memory
  • out — list with one element: name for this component's output in chainer's shared memory
  • class_name is always "spelling_levenshtein" or deeppavlov.models.spelling_correction.levenshtein.searcher_component:LevenshteinSearcherComponent
  • words — list of all correct words (should be a reference)
  • max_distance — maximum allowed Damerau-Levenshtein distance between source words and candidates
  • error_probability — assigned probability for every edit
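As a rough illustration of what "candidates within max_distance" means, the sketch below computes the restricted Damerau-Levenshtein (optimal string alignment) distance and scans a word list linearly. It is not the component's implementation: the function names are hypothetical, token splitting is not covered, and error_probability is not modelled.

```python
from typing import Iterable, List, Tuple


def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance: insertions, deletions,
    substitutions and transpositions of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def find_candidates(token: str, words: Iterable[str], max_distance: int = 1) -> List[Tuple[int, str]]:
    """Return (distance, word) pairs for dictionary words within max_distance."""
    scored = ((damerau_levenshtein(token, word), word) for word in words)
    return sorted((dist, word) for dist, word in scored if dist <= max_distance)


if __name__ == "__main__":
    print(find_candidates("teh", ["the", "ten", "tea", "that"], max_distance=1))
```

A linear scan like this is quadratic per word pair and only practical for small word lists; a production searcher would typically index the dictionary (for example with a trie).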

brillmoore

:class:`This component <deeppavlov.models.spelling_correction.brillmoore.ErrorModel>` is based on An Improved Error Model for Noisy Channel Spelling Correction by Eric Brill and Robert C. Moore and uses a statistics-based error model to find the best candidates in a static dictionary.

Component config parameters:

  • in — list with one element: name of this component's input in chainer's shared memory
  • out — list with one element: name for this component's output in chainer's shared memory
  • class_name is always "spelling_error_model" or deeppavlov.models.spelling_correction.brillmoore.error_model:ErrorModel
  • save_path — path where the model will be saved after a training session
  • load_path — path to the pretrained model
  • window — window size for the error model from 0 to 4, defaults to 1
  • candidates_count — maximum allowed count of candidates for every source token
  • dictionary — description of a static dictionary model, instance of (or inherited from) deeppavlov.vocabs.static_dictionary.StaticDictionary
    • class_name — "static_dictionary" for a custom dictionary or one of two provided:
    • dictionary_name — name of a directory where a dictionary will be built to and loaded from, defaults to "dictionary" for static_dictionary
    • raw_dictionary_path — path to a file with a line-separated list of dictionary words, required for static_dictionary
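To show how the parameters above fit together, here is a sketch of a component entry written as a Python dict and printed as JSON. Only the key names come from the list above; every value (the shared-memory names, all paths, and the candidate count) is a placeholder chosen for illustration, not a default of the library.

```python
import json

# Sketch of a "spelling_error_model" component entry.
# All values are illustrative placeholders.
error_model_component = {
    "class_name": "spelling_error_model",
    "in": ["x"],                              # hypothetical input name in shared memory
    "out": ["y_candidates"],                  # hypothetical output name in shared memory
    "save_path": "path/to/error_model.tsv",   # placeholder path
    "load_path": "path/to/error_model.tsv",   # placeholder path
    "window": 1,                              # allowed range is 0 to 4
    "candidates_count": 4,                    # illustrative value
    "dictionary": {
        "class_name": "static_dictionary",
        "dictionary_name": "dictionary",
        "raw_dictionary_path": "path/to/words.txt",  # placeholder: one word per line
    },
}

if __name__ == "__main__":
    print(json.dumps(error_model_component, indent=2))
```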

Training configuration

For the training phase, the config file also needs to include these parameters:

  • dataset_iterator — should always be set as "dataset_iterator": {"class_name": "typos_iterator"}
    • class_name is always typos_iterator
    • test_ratio — ratio of test data to train, from 0. to 1., defaults to 0.
  • dataset_reader

The component's configuration for spelling_error_model also has to include a fit_on parameter: a list of two elements, the names of the component's input and true output in chainer's shared memory.
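The training-specific pieces might look like the following sketch, again written as Python dicts. The shared-memory names "x" and "y" and the test_ratio value are assumptions made for illustration; dataset_reader is left out because its settings are not described here.

```python
# Training-related fragments of a config, as plain dicts.
# "x" and "y" are hypothetical shared-memory names.
dataset_iterator = {
    "class_name": "typos_iterator",
    "test_ratio": 0.1,  # illustrative; must lie between 0. and 1., default is 0.
}

error_model_component = {
    "class_name": "spelling_error_model",
    "in": ["x"],
    "out": ["y_predicted"],
    # fit_on: [name of the component's input, name of the true output]
    "fit_on": ["x", "y"],
}

if __name__ == "__main__":
    print(dataset_iterator)
    print(error_model_component["fit_on"])
```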

Language model

The provided pipelines use KenLM to process language models, so if you want to build your own, we suggest you consult its website. We also provide our own language models for the English (5.5GB) and Russian (3.1GB) languages.
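The idea of choosing among candidates by context can be sketched as follows. This is not the pipelines' code: it enumerates every combination of candidates and adds each candidate's log-probability to a sentence score from a language model. The function names and the toy bigram scorer are hypothetical stand-ins, and both scores are simply assumed to be comparable log-probabilities. With the kenlm Python package installed, a loaded model's `score` method could be passed as `lm_score` instead.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Candidate = Tuple[float, str]  # (candidate log-probability, word)


def best_correction(candidates: Sequence[Sequence[Candidate]],
                    lm_score: Callable[[str], float]) -> str:
    """Pick the combination of candidates with the highest total score."""
    best_sentence, best_total = "", float("-inf")
    for combo in product(*candidates):
        sentence = " ".join(word for _, word in combo)
        total = sum(logp for logp, _ in combo) + lm_score(sentence)
        if total > best_total:
            best_sentence, best_total = sentence, total
    return best_sentence


# Toy stand-in for a language model: known bigrams score well, unknown ones poorly.
TOY_BIGRAMS = {("the", "cat"): -0.5, ("cat", "sat"): -0.7}


def toy_lm_score(sentence: str) -> float:
    tokens = sentence.split()
    return sum(TOY_BIGRAMS.get(pair, -5.0) for pair in zip(tokens, tokens[1:]))


if __name__ == "__main__":
    candidates = [
        [(-0.1, "teh"), (-1.0, "the")],  # the source token is the cheapest candidate
        [(0.0, "cat")],
        [(-0.1, "sat"), (-1.0, "set")],
    ]
    print(best_correction(candidates, toy_lm_score))
```

Exhaustive enumeration grows exponentially with sentence length, so it only works for short inputs; a beam search over partial sentences is the usual alternative.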

Comparison

We compared our pipelines with Yandex.Speller, JamSpell and PyHunSpell on the test set for the SpellRuEval competition on Automatic Spelling Correction for Russian:

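+-----------------------------------------------------------------------------------------+-----------+--------+-----------+---------------------+
| Correction method                                                                       | Precision | Recall | F-measure | Speed (sentences/s) |
+=========================================================================================+===========+========+===========+=====================+
| Yandex.Speller                                                                          | 83.09     | 59.86  | 69.59     |                     |
+-----------------------------------------------------------------------------------------+-----------+--------+-----------+---------------------+
| :config:`Damerau Levenshtein 1 + lm<spelling_correction/levenshtein_corrector_ru.json>` | 59.38     | 53.44  | 56.25     | 39.3                |
+-----------------------------------------------------------------------------------------+-----------+--------+-----------+---------------------+
| :config:`Brill Moore top 4 + lm<spelling_correction/brillmoore_kartaslov_ru.json>`      | 51.92     | 53.94  | 52.91     | 0.6                 |
+-----------------------------------------------------------------------------------------+-----------+--------+-----------+---------------------+
| Hunspell + lm                                                                           | 41.03     | 48.89  | 44.61     | 2.1                 |
+-----------------------------------------------------------------------------------------+-----------+--------+-----------+---------------------+
| JamSpell                                                                                | 44.57     | 35.69  | 39.64     | 136.2               |
+-----------------------------------------------------------------------------------------+-----------+--------+-----------+---------------------+
| :config:`Brill Moore top 1 <spelling_correction/brillmoore_kartaslov_ru_nolm.json>`     | 41.29     | 37.26  | 39.17     | 2.4                 |
+-----------------------------------------------------------------------------------------+-----------+--------+-----------+---------------------+
| Hunspell                                                                                | 30.30     | 34.02  | 32.06     | 20.3                |
+-----------------------------------------------------------------------------------------+-----------+--------+-----------+---------------------+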
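
The F-measure values in the comparison appear to be the harmonic mean of precision and recall (F1); recomputing it from the rounded precision and recall figures reproduces the reported numbers up to rounding. A minimal sketch, with a hypothetical helper name:

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same units, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall taken from the Yandex.Speller row
    print(f"{f_measure(83.09, 59.86):.2f}")  # 69.59
```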