catalpa-cl/ltl-spelling

DKPro Spelling

DKPro Spelling is a highly configurable spellchecking application.
It is language-independent: a minimal setup for any language requires only a tokenizer and a dictionary.
A named entity recognizer and (unigram) language model are likely to improve results.

Resources included in this repository are:

  • Corpora:
  • Dictionaries for English, German, Italian and Czech and a script to generate a .txt dictionary from hunspell .dic and .aff files.
  • Phonetic versions (*_phoneme_map.txt) of the above dictionaries and a script to create a phonetic dictionary from a given .txt dictionary.
  • Keyboard distance files for English, German, Italian and Czech.
  • Unigram language models for English, German and Italian (see here for an example on how to incorporate unigram frequencies to rerank correction candidates).
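
To illustrate the idea of unigram reranking mentioned in the list above, here is a minimal Python sketch. It is not the repository's implementation (the tool itself is Java); the function name `rerank_by_unigram` and the toy counts are hypothetical and only show the principle of preferring more frequent candidates.

```python
from typing import Dict, List


def rerank_by_unigram(candidates: List[str], unigram_counts: Dict[str, int]) -> List[str]:
    """Order correction candidates by descending unigram count.

    Words missing from the counts are treated as having count 0.
    The sort is stable, so candidates with equal counts keep their input order.
    """
    return sorted(candidates, key=lambda word: -unigram_counts.get(word, 0))


if __name__ == "__main__":
    # Toy counts, invented for illustration only.
    counts = {"there": 500, "three": 120, "tree": 80}
    print(rerank_by_unigram(["tree", "three", "there", "thee"], counts))
```

In a real setup the counts would come from one of the unigram language model files; how those files are formatted is not described here, so loading them is left out.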

See an example end-to-end pipeline here and one that employs the different candidate correction methods here. For easy access we also provide jars for the default configuration (see User Mode).

Before you use phonetic spellchecking on a new corpus, pre-generate phonetic representations of its misspellings/out-of-dictionary words as shown here. Then place copies of the *_phoneme_map.txt dictionaries, as well as any custom phonetic dictionaries, in the respective language folders here.

Setup

Please make sure to git clone this repository rather than downloading it as a .zip. This ensures that the jars and other large files are downloaded properly.

For Web 1T reranking to work, set the WEB1T environment variable to point to the location of Web 1T (export WEB1T="PATH_TO_WEB1T"). This folder needs the subfolders /en, /de, /it and /cz for the respective languages. Each of these needs subfolders /*gms as well as the files index-*gms and aggregated_counts.cnt. You can obtain Web 1T from the Linguistic Data Consortium.
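
Before running the tool, it can help to verify the folder layout described above. The following Python sketch is not part of this repository; the function name `missing_web1t_parts` is hypothetical, and it assumes that the `*gms` folders, the `index-*gms` files and `aggregated_counts.cnt` all sit directly inside each language folder.

```python
import os
from pathlib import Path
from typing import List


def missing_web1t_parts(web1t_root: Path, lang: str) -> List[str]:
    """Return a list of expected Web 1T parts that are missing for one language."""
    lang_dir = web1t_root / lang
    if not lang_dir.is_dir():
        return [f"{lang}/ (language folder)"]
    missing = []
    # n-gram folders such as 1gms, 2gms, ...
    if not any(p.is_dir() for p in lang_dir.glob("*gms")):
        missing.append("*gms subfolders")
    # index files such as index-1gms, index-2gms, ...
    if not any(p.is_file() for p in lang_dir.glob("index-*gms")):
        missing.append("index-*gms files")
    if not (lang_dir / "aggregated_counts.cnt").is_file():
        missing.append("aggregated_counts.cnt")
    return missing


if __name__ == "__main__":
    root = os.environ.get("WEB1T")
    if root is None:
        print("WEB1T is not set")
    else:
        for lang in ("en", "de", "it", "cz"):
            print(lang, missing_web1t_parts(Path(root), lang) or "ok")
```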

You may have to unzip some of the dictionaries.

LeSpell Error Annotation Format

<corpus name="EXAMPLE_NAME"><text id="EXAMPLE_ID" lang="en">
  We mark <error correct="misspellings" type="typo">mispellings</error> as shown in this example.
</text></corpus>
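
As a rough illustration of how this format can be consumed, the Python sketch below reads the annotated errors from the example above using the standard library. It is not the repository's reader (the tool itself is Java); the function name `read_errors` is hypothetical, and it assumes only the elements and attributes visible in the example (`corpus`, `text`, `error`, `id`, `correct`, `type`).

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

EXAMPLE = """<corpus name="EXAMPLE_NAME"><text id="EXAMPLE_ID" lang="en">
  We mark <error correct="misspellings" type="typo">mispellings</error> as shown in this example.
</text></corpus>"""


def read_errors(xml_text: str) -> List[Tuple[str, str, str, str]]:
    """Return (text id, misspelling, correction, error type) for every <error>."""
    root = ET.fromstring(xml_text)
    rows = []
    for text in root.iter("text"):
        for err in text.iter("error"):
            rows.append((
                text.get("id", ""),
                err.text or "",          # the misspelled surface form
                err.get("correct", ""),  # the annotated correction
                err.get("type", ""),     # the annotated error type
            ))
    return rows


if __name__ == "__main__":
    for row in read_errors(EXAMPLE):
        print(row)
```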

User Mode

Supports English (en), German (de), Italian (it) and Czech (cz).
Requires the WEB1T environment variable to be set.

Use the tool in its default configuration: Generate correction candidates based on Damerau-Levenshtein Distance, rerank them using Web 1T trigrams.
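
To make the candidate generation step more concrete, here is a small Python sketch of distance-based candidate selection. It implements the restricted variant of Damerau-Levenshtein distance (optimal string alignment), which may differ from the variant the tool uses; the function names, the toy dictionary and the distance threshold are all hypothetical.

```python
from typing import Iterable, List


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions and transpositions of adjacent characters, each at cost 1."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def candidates(word: str, dictionary: Iterable[str], max_distance: int = 1) -> List[str]:
    """Return dictionary entries within max_distance, closest first
    (ties broken alphabetically)."""
    scored = sorted((osa_distance(word, entry), entry) for entry in dictionary)
    return [entry for dist, entry in scored if dist <= max_distance]


if __name__ == "__main__":
    print(candidates("teh", ["the", "ten", "tea", "then"]))
```

In this toy example several dictionary words are equally close to the misspelling, which is why a reranking step such as the Web 1T trigram reranking is useful.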

Spellcheck a .txt file.
Outputs a .tsv file with a ranked list of corrections.
To run an example: java -jar DKPro_Spellcheck.jar de spelling/src/main/resources/corpora/test_de.txt

java -jar DKPro_Spellcheck.jar [LANGUAGE] [PATH_TO_TXT]

Evaluate error correction on a corpus annotated with errors (in the LeSpell XML format).
Outputs recall@k and lists of words that are corrected correctly/incorrectly.
To run an example: java -jar DKPro_Spellcheck_EvaluateCorrection.jar cz spelling/src/main/resources/corpora/merlin-CZ_spelling.xml

java -jar DKPro_Spellcheck_EvaluateCorrection.jar [LANGUAGE] [PATH_TO_XML]
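
For readers unfamiliar with the metric, the Python sketch below shows one common way to compute recall@k: the share of annotated errors whose gold correction appears among the top k ranked candidates. The exact definition used by the evaluation jar is not spelled out here, so treat this as an assumption; the function name and the toy data are hypothetical.

```python
from typing import List, Sequence


def recall_at_k(gold: Sequence[str], ranked: Sequence[List[str]], k: int) -> float:
    """Fraction of errors whose gold correction is among the top-k candidates.

    gold[i] is the annotated correction for error i,
    ranked[i] is the ranked candidate list produced for error i.
    """
    if not gold:
        return 0.0
    hits = sum(1 for target, cands in zip(gold, ranked) if target in cands[:k])
    return hits / len(gold)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    gold = ["misspellings", "the", "house"]
    ranked = [["misspellings"], ["then", "the"], ["horse", "hose"]]
    for k in (1, 2):
        print(k, recall_at_k(gold, ranked, k))
```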

As Web 1T is quite large, you may want to point -Djava.io.tmpdir="PATH_TO_DIR" to a folder with enough space (especially for English). The full command then becomes java -Djava.io.tmpdir="PATH_TO_DIR" -jar [JAR_NAME] [LANG] [PATH_TO_FILE]

Cite

@InProceedings{le-spell-2022,
  author    = {Bexte, Marie  and  Laarmann-Quante, Ronja  and  Horbach, Andrea  and  Zesch, Torsten},
  title     = {LeSpell - A Multi-Lingual Benchmark Corpus of Spelling Errors to Develop Spellchecking Methods for Learner Language},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {697--706},
  url       = {https://aclanthology.org/2022.lrec-1.73}
}
