
Add spellchecking #51

Draft · wants to merge 16 commits into main

Conversation

@bminixhofer (Owner) commented Mar 2, 2021

This PR implements spellchecking using the algorithm described in Error-tolerant Finite-state Recognition with Applications to Morphological Analysis and Spelling Correction. I tried multiple other methods:

  • SymSpell, which would probably be faster but needs a significant amount of memory at runtime and a precomputation step, thus either increasing the binary size a lot or taking a long time to load.
  • SimHash-based spelling correction, which in my experiments led to low-quality corrections and is encumbered by a patent.

I ended up with the error-tolerant FST retrieval algorithm because it plays well with the portability goals of nlprule by needing very little memory while still being quite fast.
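
To illustrate the general idea, a minimal sketch of error-tolerant lookup over an FST using the fst crate's Levenshtein automaton (requires the crate's levenshtein feature); this is not the exact algorithm from the paper nor the code in this PR:

use fst::{automaton::Levenshtein, IntoStreamer, Set};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Tiny in-memory word list; `Set::from_iter` requires keys in lexicographic order.
    let words = Set::from_iter(vec!["spellcheck", "spellchecking", "spelling"])?;

    // Build an automaton accepting everything within edit distance 1 of the query
    // and intersect it with the FST in a single traversal.
    let query = Levenshtein::new("spellcheking", 1)?;
    let candidates = words.search(query).into_stream().into_strs()?;
    println!("{:?}", candidates); // ["spellchecking"]

    Ok(())
}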

There are a couple of issues with the current implementation:

  • Edit distance is computed on byte sequences directly, so it is not Unicode-aware. This will only make a difference in a couple of edge cases for the currently supported languages, but it would be good to have Unicode support.
  • Edit distance only takes substitutions, insertions and deletions into account (i.e. Levenshtein distance), not transpositions (Damerau-Levenshtein / optimal string alignment distance); see the short sketch below.
  • Keyboard proximity of the expected keyboard for a language variant is not taken into account.
  • Multi-word spelling correction is not implemented.
  • Documentation and tests need improvement.
  • CI resources need to be updated.
  • Tokens with action=ignore_spelling in the Disambiguation rules need to be ignored.
  • Fix false positives for hyphenated words.

I still want to address the issues marked with a box in this PR. The others are out of scope.
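
On the edit-distance point above: the practical difference is that a single transposition counts as two edits under plain Levenshtein but as one under optimal string alignment. A tiny illustration, using the strsim crate purely for demonstration (it is not a dependency of this PR):

use strsim::{levenshtein, osa_distance};

fn main() {
    // "recieve" -> "receive" is a single transposition of 'i' and 'e':
    // plain Levenshtein needs two substitutions, OSA counts one transposition.
    assert_eq!(levenshtein("recieve", "receive"), 2);
    assert_eq!(osa_distance("recieve", "receive"), 1);
}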

It took quite some exploration of different solutions to come up with something I'm satisfied with: the spellchecking-relevant code is now isolated in one module (which makes #50 easier) and is not too heavy on binary size (adds ~2.1MB, ~900kB xz-compressed, for English).

@bminixhofer mentioned this pull request Mar 6, 2021
@sai-prasanna commented:

There is a symspell package in Rust: https://github.com/reneklacan/symspell. Can we benchmark it and check whether the memory requirements are satisfied? What are the hard bounds on memory you're looking to support for this library? And maybe we could optionally integrate that package if its speed improvement justifies it?

@bminixhofer (Owner, Author) commented Mar 12, 2021

Hi, I have to admit that I did not do a "proper" benchmark of SymSpell vs. the FST approach. One of the problems with SymSpell is that the precomputed data needs to either be stored in the binary (which blows up its size) or be recomputed every time the Rules are loaded (which takes multiple seconds on a good PC). Runtime memory is of course another issue.

What are the hard bounds on memory you're looking to support for this library?

There are no hard bounds on memory. Up until now I have decided trade-offs between speed and memory on a case-by-case basis, with a tendency to prefer speed. (Also, notably, there are actually not that many places where this is relevant; often the best solution is both speed- and memory-efficient.)

One issue with this library currently is that there are two very different use cases:

  1. Using it in an application. Portability, memory efficiency and small binaries are important (i.e. the typical Rust use case).
  2. Using it as a one-time pre-/postprocessing step, as a server, etc. (i.e. the typical Python use case).

So far I've tried to find middle ground between these two where it was relevant. For spellchecking, the FST-based approach is better for (1), while the SymSpell approach might be better for (2).

And maybe we could optionally integrate that package if its speed improvement justifies it?

I think this is the best solution. There is an open issue on modularizing the crate (#50) in a way in which integrating a second spellchecker would come quite naturally.

So, in short:

  • This PR and the initial implementation will keep the FST approach
  • I am very open to benchmarks between the FST approach and SymSpell
  • If it turns out that SymSpell is faster, it can be integrated as a second spellchecker once nlprule has better modularization.

@bminixhofer (Owner, Author) commented:

The API for spellchecking is now fully implemented; this is how it will look:

Rust

use nlprule::{Rules, Tokenizer};

let tokenizer = Tokenizer::new("path/to/tokenizer.bin").unwrap();
// `Rules` now take an `Arc<Tokenizer>` as second argument, like in the Python API 
// and do not require passing the tokenizer to the other methods anymore
let mut rules = Rules::new("path/to/rules.bin", tokenizer.into()).unwrap();

// by default, spellchecking is disabled
// the spellchecker can be accessed by `.spell()` and `.spell_mut()`
// and contains `SpellOptions` which can be accessed analogously
println!("{:#?}", rules.spell().options()); // prints the default options

// Enables spellchecking. `Variant` is a string newtype denoting valid variants
// the `.variant` method on the spellchecker gets the variant for a language code 
// or returns an error if none exists
rules.spell_mut().options_mut().variant = Some(rules.spell().variant("en_GB").unwrap());

// now `rules.correct` and `rules.suggest` also find spelling mistakes!

// there are some other options like a whitelist for words (the whitelist is actually a `HashSet<String>`)
rules
    .spell_mut()
    .options_mut()
    .whitelist
    .insert("nlprule".into());

// and that's it! there are also some new utility methods like `.variants()` on the spellchecker 
// which are not too important

Python

import pytest
from nlprule import Tokenizer, Rules

tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer)

# rules.spell is the spellchecker object
# does not do anything useful besides providing the ability to set and get options at the moment
print(rules.spell)

# the used variant is `None` by default, i.e. spellchecking is disabled
assert rules.spell.options.variant is None

# setting the variant to one which does not exist raises an error with a helpful message
with pytest.raises(ValueError):
    rules.spell.options.variant = "en_INVALID"

# setting the variant to a valid variant enables spellchecking
rules.spell.options.variant = "en_GB"

# any iterable works here, but it has to be assigned with =, the returned object is not mutable
rules.spell.options.whitelist = ["nlprule"]

assert rules.correct("nlprule has spelllchcking!") == "nlprule has spellchecking!"

I am quite happy with how this turned out. I'd appreciate some feedback regarding the API from people who were previously interested in this (@theelgirl, @drahnr) and of course from anyone else who is reading this. I might have missed something.

As usual, implementing this took more effort than anticipated (LT uses quite a bit more than just one wordlist for spellchecking), but I'm excited to add this functionality! There are some smaller things still left to do to get this over the finish line (see the checklist above). I hope to merge this PR & release next weekend.

Comment on lines +116 to +122
for ending in {
"S": ["s"],
"N": ["n"],
"E": ["e"],
"F": ["in"],
"A": ["e", "er", "es", "en", "em"],
"Ä": ["r", "s", "n", "m"],
@drahnr (Contributor) commented Mar 18, 2021

Use the language-specific affix file instead of hardcoding this; it will save you and a few other people a lot of pain :)

let rules = Rules::new(opts.rules).unwrap();
let tokenizer = Arc::new(Tokenizer::new(opts.tokenizer).unwrap());
let mut rules = Rules::new(opts.rules, tokenizer.clone()).unwrap();
rules.spell_mut().options_mut().variant = Some(rules.spell().variant("en_GB").unwrap());
@drahnr (Contributor) commented:

I find this a very, very un-ergonomic API.

A few particular points to avoid:

  • mutable options in public API
  • structs in public lib API
  • circular borrow fun (need rules.spell_mut() and rules.spell() at the same time)
  • combining artifacts that don't seem necessarily connected (spellcheck rules vs. grammar / tokenizer rules)

A more rustic API imho would be:

// explicit type annotations for clarity
let spellcheck: Spellcheck = Spellcheck::new(path_to_spellcheck_rules)?;
// alt:
let spellcheck: Spellcheck = Spellcheck::from_reader(fs::File::open(path_to_spellcheck_rules)?)?;
spellcheck.add_hunspell_dict

let previous: Option<Spellcheck> = rules.set_spell(spellcheck);
// and
rules.modify_spellcheck(move |spellcheck| Ok(spellcheck.extend_whitelist(words_iter)))?

Was there a particular reason not to split the spelling rules from the tokenization rules? A third file would not hurt and would avoid (cargo-spellcheck 🎩) issues with deployment size - 900kB extra is a lot when the upper limit is 10MB.

@bminixhofer (Owner, Author) commented:

I'll think about this. In my opinion, a mutable SpellOptions struct where all mutations are guaranteed to be valid and the spellchecker is updated when it is dropped is more ergonomic than something like rules.modify_spellcheck(move |spellcheck| Ok(spellcheck.extend_whitelist(words_iter)))?. E.g. it seamlessly allows all HashSet operations on the whitelist; otherwise those would have to be reimplemented as *_whitelist methods.
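
For reference, the guard pattern I have in mind looks roughly like this; a simplified sketch with illustrative names, not the actual nlprule internals:

use std::collections::HashSet;
use std::ops::{Deref, DerefMut};

#[derive(Debug, Default)]
pub struct SpellOptions {
    pub variant: Option<String>,
    pub whitelist: HashSet<String>,
}

pub struct Spell {
    options: SpellOptions,
    // ... FSTs, wordlists, derived lookup structures, etc.
}

// Returned by `Spell::options_mut`. All `SpellOptions` (and thus `HashSet`) operations
// are available through Deref/DerefMut; dropping the guard refreshes derived state.
pub struct SpellOptionsGuard<'a> {
    spell: &'a mut Spell,
}

impl Deref for SpellOptionsGuard<'_> {
    type Target = SpellOptions;
    fn deref(&self) -> &SpellOptions {
        &self.spell.options
    }
}

impl DerefMut for SpellOptionsGuard<'_> {
    fn deref_mut(&mut self) -> &mut SpellOptions {
        &mut self.spell.options
    }
}

impl Drop for SpellOptionsGuard<'_> {
    fn drop(&mut self) {
        // e.g. rebuild the whitelist lookup so subsequent checks see the new words
        self.spell.refresh();
    }
}

impl Spell {
    pub fn options(&self) -> &SpellOptions {
        &self.options
    }

    pub fn options_mut(&mut self) -> SpellOptionsGuard<'_> {
        SpellOptionsGuard { spell: self }
    }

    fn refresh(&mut self) { /* recompute derived structures */ }
}

This is what lets spell.options_mut().whitelist.insert("nlprule".into()) work without dedicated *_whitelist methods.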

mutable options in public API

Do you have some more information / source on this? I wouldn't have thought that this is bad.

circular borrow fun (need rules.spell_mut() and rules.spell() at the same time)

This wouldn't compile: you need the speller to get a Variant (the Variant is owned, it doesn't need the spellchecker), then you can use the Variant to set the options. But I'm also not perfectly happy with that part.

combining artifacts that don't seem necessarily connected (spellcheck rules vs. grammar / tokenizer rules)

This is a very valid point. The reason is that .correct() and .suggest() on the Rules should be easily configurable to do both grammar and spelling correction. The best fix would probably be some meta-struct to combine any number of suggestion providers (where Rules and Spellcheck are each one suggestion provider), which is strongly related to #50. Defining and implementing that would take significantly more work than I want to do before nlprule can do spellchecking 🙂

A third file would not hurt

A third binary and a function Rules.set_spell really do seem like a good idea. But I would argue that it decreases ergonomics, at least as long as the Spell structure is still part of the Rules.

when the upper limit is 10MB.

I'd just ask the crates.io team to increase the size limit so you don't have to worry about this (especially since (I think) you distribute some hunspell data as well). They do it on a case-by-case basis (rust-lang/crates.io#195). I understand the concern, but I'd argue that it is already quite impressive to store multiple hundred thousand valid words + lemmas + part-of-speech tags, rules, a noun chunker etc. in < 10MB. An extra 900kB for spellchecking should very rarely be an issue.

@bminixhofer (Owner, Author) commented:

There's one alternative API that moves some of the changes anticipated for #50 into this PR without going completely out of scope:

A new trait Suggest:

trait Suggest {
    fn suggest(&self, text: &str, tokenizer: &Tokenizer) -> Vec<Suggestion>; // or iterator
    fn correct(&self, text: &str, tokenizer: &Tokenizer) -> String {
        // default impl to just apply all the suggestions
    }
}

then Rules and a new, separate struct Spell implement Suggest. The Spell struct implements from_reader, new, etc. like the Rules but with an additional argument to denote the variant: Spell::new("path/to/bin", "en_GB").

There is one new struct Correcter with the fields:

struct Correcter {
   suggesters: Vec<Box<dyn Suggest>>
}

that can be constructed with

let correcter = Correcter::new(vec![
    Rules::from_reader(...).into(),
    Spell::from_reader(..., "en_GB").into()
])?; // implements `FromIterator` too

and implements Suggest as well. Getting the inner suggesters would then have to be done with downcast-rs and look like:

let rules = correcter.get::<Rules>(0)?;
// and mutably:
let spell = correcter.get_mut::<Spell>(1)?;
// I'm not 100% sure if this works with downcast-rs but I think it should
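
A rough sketch of how that downcasting could be wired up with downcast-rs; the Correcter::get / get_mut names come from the proposal above, and the empty types are stand-ins so the sketch is self-contained:

use downcast_rs::{impl_downcast, Downcast};

// Stand-ins for the real types.
pub struct Suggestion;
pub struct Tokenizer;
pub struct Rules;
pub struct Spell;

pub trait Suggest: Downcast {
    fn suggest(&self, text: &str, tokenizer: &Tokenizer) -> Vec<Suggestion>;
}
impl_downcast!(Suggest);

impl Suggest for Rules {
    fn suggest(&self, _text: &str, _tokenizer: &Tokenizer) -> Vec<Suggestion> {
        Vec::new() // grammar suggestions would go here
    }
}

impl Suggest for Spell {
    fn suggest(&self, _text: &str, _tokenizer: &Tokenizer) -> Vec<Suggestion> {
        Vec::new() // spelling suggestions would go here
    }
}

pub struct Correcter {
    suggesters: Vec<Box<dyn Suggest>>,
}

impl Correcter {
    // Returns the suggester at `index` if it has the concrete type `T`.
    pub fn get<T: Suggest>(&self, index: usize) -> Option<&T> {
        self.suggesters.get(index)?.downcast_ref::<T>()
    }

    pub fn get_mut<T: Suggest>(&mut self, index: usize) -> Option<&mut T> {
        self.suggesters.get_mut(index)?.downcast_mut::<T>()
    }
}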

Extending the whitelist etc. would stay the same as before:

spell.options_mut().whitelist.insert("nlprule".into());

But it would not need the guard anymore (so no Drop impl). I agree that mutable options behind some guard are problematic because they hide cost from the user; this way I really don't see an issue. Setting the variant is fallible and would thus not be part of the options:

spell.set_variant("en_US")?;

The Python API would of course be adjusted accordingly, keeping it in sync where possible.

The nice thing about all of this is that nothing is a breaking change since the Rules keeps the same methods, there are just additional Correcter and Spell structs. There are multiple problems though:

  • .correct and .suggest need the Tokenizer as a second argument. These will by far be the most commonly used methods, so I'd like them to be as simple as possible, e.g. fn correct(text: &str) -> String and fn suggest(text: &str) -> Vec<Suggestion> like in the current PR, without having to explicitly keep the tokenizer around if you don't need it.
  • The downcast magic happening in the Correcter is not ideal, but I believe it cannot be avoided.
  • A third binary hurts ergonomics, especially because including a binary is quite verbose at the moment.
  • Spellchecking suggestions should have lower priority than suggestions from the grammar rules. This is not trivial to enforce with the Correcter and might require some notion of priority on the Suggestion.

What do you think?

@@ -19,8 +21,8 @@ fn main() {
env_logger::init();
let opts = Opts::parse();

let tokenizer = Tokenizer::new(opts.tokenizer).unwrap();
let rules_container = Rules::new(opts.rules).unwrap();
let tokenizer = Arc::new(Tokenizer::new(opts.tokenizer).unwrap());
@drahnr (Contributor) commented:

Iff the above API changes are implemented, the Arc should be avoidable. Imho Spellcheck should parse the file and maybe use a Tokenizer reference only when extracting items.

Comment on lines +97 to +98
let prefix = parts.next().unwrap();
let suffix = parts.next().unwrap();
@drahnr (Contributor) commented:

I'd strongly suggest using debug_assert_eq!(parts.clone().count(), …) and .expect() instead.
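
A sketch of the suggested pattern; the separator and the expectation of exactly two parts are assumptions here, not taken from the PR:

fn split_line(line: &str) -> (&str, &str) {
    let mut parts = line.split('/');
    // cheap sanity check in debug builds only
    debug_assert_eq!(parts.clone().count(), 2, "malformed line: {:?}", line);
    let prefix = parts.next().expect("line has no prefix part");
    let suffix = parts.next().expect("line has no suffix part");
    (prefix, suffix)
}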

@@ -81,6 +89,8 @@ pub enum Error {
Unimplemented(String),
#[error("error parsing to integer: {0}")]
ParseError(#[from] ParseIntError),
#[error("nlprule error: {0}")]
@drahnr (Contributor) commented:

With #[from], adding an explicit {0} print would duplicate the error, since the backtrace will also contain the cause, which is implicit with #[from].

So #[error(transparent)] or #[error("nlprule error")].
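
Concretely, the difference looks roughly like this (a sketch with thiserror, not the actual nlprule error enum):

use std::num::ParseIntError;
use thiserror::Error;

#[derive(Debug, Error)]
pub enum Error {
    // Display prints the cause via {0} *and* source() exposes it (implied by #[from]),
    // so callers that render the whole error chain see the cause twice.
    #[error("error parsing to integer: {0}")]
    ParseError(#[from] ParseIntError),

    // Alternative 1: defer Display (and source) entirely to the inner error.
    #[error(transparent)]
    Io(#[from] std::io::Error),

    // Alternative 2: a short message of our own; the cause stays reachable via source().
    #[error("nlprule error")]
    Nlprule(#[source] std::io::Error),
}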

@drahnr (Contributor) commented Mar 18, 2021

Note that the review is a first glance and by no means complete; I'll have another go at the impl details tomorrow 🤞

Comment on lines 89 to +93
.rev()
.flatten()
.collect();
candidates.sort_by(|(_, a), (_, b)| a.cmp(b));
if candidates.is_empty() {
None
} else {
Some(candidates.remove(0).0)
}
candidates.sort_unstable();
candidates.dedup();
@drahnr (Contributor) commented Mar 18, 2021

Looks a bit like https://docs.rs/itertools/0.10.0/itertools/fn.kmerge.html could be applied 🙃
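
For reference, kmerge lazily merges several already-sorted iterators into one sorted iterator, which could replace the collect-sort-dedup sequence above; a toy example, not the PR's actual candidate types:

use itertools::Itertools;

fn main() {
    // Three pre-sorted candidate lists, merged into one sorted, deduplicated stream.
    let merged: Vec<u32> = vec![vec![1, 4, 7], vec![2, 4, 5], vec![3, 6]]
        .into_iter()
        .kmerge()
        .dedup()
        .collect();
    assert_eq!(merged, vec![1, 2, 3, 4, 5, 6, 7]);
}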

pub struct PosReplacer {
pub(crate) matcher: PosMatcher,
}

impl PosReplacer {
fn apply(&self, text: &str, tokenizer: &Tokenizer) -> Option<String> {
pub fn apply(&self, text: &str, tokenizer: &Tokenizer) -> Vec<String> {
@drahnr (Contributor) commented Mar 18, 2021

A lot of the API returns a collected Vec which could just as well be an impl Iterator at the cost of a few lifetime parameters. In the long run this can make a huge difference, since most collect calls could be avoided; each of those is an additive cost of N plus an additional allocation.
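
A sketch of the kind of signature change meant here, with stub types and a made-up filter condition just to keep it self-contained:

struct Tokenizer;

struct PosReplacer {
    replacements: Vec<String>,
}

impl PosReplacer {
    // Instead of eagerly collecting into a Vec<String>, return a lazy iterator;
    // callers that really need a Vec can still .collect(), everyone else avoids
    // the intermediate allocation.
    fn apply<'a>(
        &'a self,
        text: &'a str,
        _tokenizer: &'a Tokenizer,
    ) -> impl Iterator<Item = String> + 'a {
        self.replacements
            .iter()
            .filter(move |candidate| candidate.as_str() != text) // placeholder condition
            .cloned()
    }
}

fn main() {
    let replacer = PosReplacer {
        replacements: vec!["walks".into(), "walked".into()],
    };
    let out: Vec<String> = replacer.apply("walked", &Tokenizer).collect();
    println!("{:?}", out); // ["walks"]
}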

@bminixhofer (Owner, Author) commented:

Hi, thanks for taking a look!

I'll have another go at the impl details tomorrow

I was asking about the public API (thanks for your feedback on that!); the PR isn't fully ready for review.

@drahnr (Contributor) commented Mar 18, 2021

Ah, overlooked that. Let me know once you want a more introspective review!

@bminixhofer (Owner, Author) commented:

I will hold off on merging this for some time to explore better modularization first, see #50 (comment).

@bminixhofer mentioned this pull request Apr 15, 2021
@ssokolow commented Jul 21, 2022

let tokenizer = Tokenizer::new("path/to/tokenizer.bin").unwrap();
let mut rules = Rules::new("path/to/rules.bin", tokenizer.into()).unwrap();

Will there be a way to load from in-memory data rather than a path?

I find it both inconvenient and irritating that crates like ttspico don't let me bundle the dependencies into the binary with include_bytes! in situations where the primary goal is ease of deployment and maintainability (e.g. dropping a single binary into ~/bin/ or ~/.cargo/bin, possibly built against a -musl target for extra protection against dependency breakage).

...especially when Cargo's solutions for making sure that non-embedded dependencies wind up where the binary looks for them, across all the different ways of building and running artifacts, aren't the greatest.
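
For what it's worth, with the from_reader constructors discussed earlier in this thread, embedding the binaries would presumably look something like the sketch below. The exact signatures (in particular Rules::from_reader taking the tokenizer) are an assumption based on the path-based API shown above, not a confirmed part of this PR:

use std::io::Cursor;

use nlprule::{Rules, Tokenizer};

// Embed the data at compile time (paths are placeholders).
static TOKENIZER_BYTES: &[u8] = include_bytes!("../data/en_tokenizer.bin");
static RULES_BYTES: &[u8] = include_bytes!("../data/en_rules.bin");

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tokenizer = Tokenizer::from_reader(Cursor::new(TOKENIZER_BYTES))?;
    let rules = Rules::from_reader(Cursor::new(RULES_BYTES), tokenizer.into())?;

    println!("{}", rules.correct("She was not been here since Monday."));
    Ok(())
}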
