OPENNLP-1832: Add SymSpell-based SpellChecker component #1057
Draft
rzo1 wants to merge 1 commit into
Add a native, language-agnostic spell-correction component backed by the
SymSpell (Symmetric Delete) algorithm as a new opennlp-spellcheck module
under opennlp-extensions.
- SpellChecker API: lookup with TOP/CLOSEST/ALL verbosity and a configurable
maximum edit distance, plus context-aware lookupCompound (handles split,
merged and misspelled words) following the SymSpell reference and backed by
a bigram dictionary
- SymSpell engine with a precomputed delete index and a pluggable edit
distance (Damerau-OSA default, Apache Commons Text Levenshtein alternative)
- Configurable Naive-Bayes corpus size N (SymSpellConfig.corpusWordCount,
derived from the dictionary by default), persisted in the model
- Serializable model (SerializableArtifact/ArtifactSerializer) with
model.properties for classpath model-resolver loading of
opennlp-models-spellcheck-{lang} artifacts; the resolver verifies model.sha256
- Frequency-dictionary loaders for the SymSpell text format (unigram and
bigram, whitespace-separated, tolerant of a UTF-8 byte-order mark)
- Pipeline integration: SpellCheckingCharSequenceNormalizer and
FilterObjectStream adapters (line- and token-level)
- Command-line tools: SpellCheckModelBuilder and CorrectText (with a -suggest
mode that lists candidates honoring -verbosity)
- ArgumentParser: support Long parameter return types
- Unit tests plus an opennlp-eval-tests benchmark reporting a correction
confusion matrix, an edit-distance sweep, a Damerau-OSA vs Levenshtein
comparison, lookup throughput, and bigram-backed compound accuracy
- Manual chapter, README/project-structure updates, and NOTICE/licensing
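For reviewers unfamiliar with the Symmetric Delete idea, the following is a minimal, self-contained sketch of a precomputed delete index combined with a Damerau-OSA distance check. It is not the code in this PR: the class `SymSpellSketch` and its methods `add`, `lookup`, `deleteVariants` and `osaDistance` are hypothetical names chosen for illustration, and the sketch returns only a single closest match (roughly what a TOP-style lookup would do), with ties broken by frequency.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: hypothetical names, not the opennlp-spellcheck API. */
public class SymSpellSketch {

  private final int maxEditDistance;
  // Delete variant -> dictionary words that produce this variant.
  private final Map<String, Set<String>> deletes = new HashMap<>();
  private final Map<String, Long> frequencies = new HashMap<>();

  public SymSpellSketch(int maxEditDistance) {
    this.maxEditDistance = maxEditDistance;
  }

  /** Adds a word and indexes every string reachable by up to maxEditDistance deletions. */
  public void add(String word, long count) {
    frequencies.merge(word, count, Long::sum);
    for (String variant : deleteVariants(word, maxEditDistance)) {
      deletes.computeIfAbsent(variant, k -> new HashSet<>()).add(word);
    }
  }

  /** Returns the closest dictionary word (ties broken by frequency), or null if none is in range. */
  public String lookup(String input) {
    String best = null;
    int bestDistance = Integer.MAX_VALUE;
    long bestCount = -1;
    // Only deletions of the input are generated; they are matched against the delete index.
    for (String variant : deleteVariants(input, maxEditDistance)) {
      for (String candidate : deletes.getOrDefault(variant, Set.of())) {
        int distance = osaDistance(input, candidate);
        if (distance > maxEditDistance) {
          continue; // shared delete variant, but the true distance is too large
        }
        long count = frequencies.get(candidate);
        if (distance < bestDistance || (distance == bestDistance && count > bestCount)) {
          best = candidate;
          bestDistance = distance;
          bestCount = count;
        }
      }
    }
    return best;
  }

  /** All strings obtained from word by deleting between 0 and depth characters. */
  static Set<String> deleteVariants(String word, int depth) {
    Set<String> result = new HashSet<>();
    result.add(word);
    List<String> frontier = List.of(word);
    for (int i = 0; i < depth; i++) {
      List<String> next = new ArrayList<>();
      for (String s : frontier) {
        for (int j = 0; j < s.length(); j++) {
          String shorter = s.substring(0, j) + s.substring(j + 1);
          if (result.add(shorter)) {
            next.add(shorter);
          }
        }
      }
      frontier = next;
    }
    return result;
  }

  /** Optimal string alignment distance: Levenshtein plus adjacent transpositions. */
  static int osaDistance(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
    for (int i = 1; i <= a.length(); i++) {
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
        if (i > 1 && j > 1
            && a.charAt(i - 1) == b.charAt(j - 2)
            && a.charAt(i - 2) == b.charAt(j - 1)) {
          d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
        }
      }
    }
    return d[a.length()][b.length()];
  }
}
```

The point of the design is that dictionary words are expanded once at build time, so a lookup generates only deletions of the input rather than all possible inserts, replacements and transpositions; the edit-distance function is then used only to verify the few candidates that share a delete variant.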
Thank you for contributing to Apache OpenNLP.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?
Note:
I have updated the data archive with the data needed to run some of the eval tests: https://nightlies.apache.org/opennlp/opennlp-data.zip
In addition, this PR does not ship pre-built dictionaries (1-gram, 2-gram). The idea would be to put together some default dictionaries for the major languages via https://github.com/apache/opennlp-models (or the like).
This is currently a draft to gather feedback; I am going to do a few more review passes.
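Since no dictionaries ship with this PR, anyone trying the draft has to supply their own frequency files. As a rough guide to the line layout, here is a small sketch that parses the SymSpell text format as described in the summary above: whitespace-separated term(s) followed by a count, with an optional UTF-8 byte-order mark at the start. The class `FrequencyDictionarySketch` and its `parse` method are hypothetical and are not the loaders added by this PR; skipping blank or malformed lines is an assumption of the sketch, and the actual loaders may behave differently.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical names, not the loaders from this PR. */
public class FrequencyDictionarySketch {

  /**
   * Parses "term count" lines (termFields = 1) or "word1 word2 count" lines (termFields = 2).
   * Blank lines and lines without a numeric count are skipped (an assumption of this sketch).
   */
  public static Map<String, Long> parse(List<String> lines, int termFields) {
    Map<String, Long> counts = new HashMap<>();
    boolean first = true;
    for (String raw : lines) {
      String line = raw;
      // Tolerate a byte-order mark that survived decoding on the first line.
      if (first && line.startsWith("\uFEFF")) {
        line = line.substring(1);
      }
      first = false;
      String[] fields = line.trim().split("\\s+");
      if (fields.length < termFields + 1) {
        continue; // blank line or too few fields
      }
      String key = String.join(" ", Arrays.copyOfRange(fields, 0, termFields));
      try {
        counts.merge(key, Long.parseLong(fields[termFields]), Long::sum);
      } catch (NumberFormatException e) {
        // Count field is not a number: skip the line.
      }
    }
    return counts;
  }
}
```

With this layout a unigram file contains lines such as `the 100` and a bigram file lines such as `new york 7`; the counts shown here are made up for illustration.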