
Add spellchecking #51

Draft · wants to merge 16 commits into main

Conversation

@bminixhofer (Owner) commented Mar 2, 2021

This PR implements spellchecking using the algorithm described in Error-tolerant Finite-state Recognition with Applications to Morphological Analysis and Spelling Correction. I tried multiple other methods:

  • SymSpell, which would probably be faster but needs a significant amount of memory at runtime and a precomputation step, thus either increasing the binary size a lot or taking a long time to load.
  • SimHash-based spelling correction, which in my experiments led to low-quality corrections and is encumbered by a patent.

I ended up with the error-tolerant FST retrieval algorithm because it plays well with the portability goals of nlprule by needing very little memory while still being quite fast.
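
To illustrate the general idea, a minimal sketch of error-tolerant lookup over an FST using the fst crate's Levenshtein automaton (requires the crate's levenshtein feature); this is not the exact algorithm from the paper nor the code in this PR:

use fst::{automaton::Levenshtein, IntoStreamer, Set};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Tiny in-memory word list; `Set::from_iter` requires keys in lexicographic order.
    let words = Set::from_iter(vec!["spellcheck", "spellchecking", "spelling"])?;

    // Build an automaton accepting everything within edit distance 1 of the query
    // and intersect it with the FST in a single traversal.
    let query = Levenshtein::new("spellcheking", 1)?;
    let candidates = words.search(query).into_stream().into_strs()?;
    println!("{:?}", candidates); // ["spellchecking"]

    Ok(())
}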

There are a couple of issues with the current implementation:

  • Edit distance is computed on byte sequences directly, so it is not Unicode-aware. This will only make a difference in a couple of edge cases for the currently supported languages, but it would be good to have Unicode support.
  • Edit distance only takes substitutions, insertions and deletions into account (i.e. Levenshtein distance), not transpositions (Damerau-Levenshtein / optimal string alignment distance); see the short sketch below.
  • Keyboard proximity of the expected keyboard for a language variant is not taken into account.
  • Multi-word spelling correction is not implemented.
  • Documentation and tests need improvement.
  • CI resources need to be updated.
  • Tokens with action=ignore_spelling in the Disambiguation rules need to be ignored.
  • Fix false positives for hyphenated words.

I still want to address the issues marked with a box in this PR. The others are out of scope.
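
On the edit-distance point above: the practical difference is that a single transposition counts as two edits under plain Levenshtein but as one under optimal string alignment. A tiny illustration, using the strsim crate purely for demonstration (it is not a dependency of this PR):

use strsim::{levenshtein, osa_distance};

fn main() {
    // "recieve" -> "receive" is a single transposition of 'i' and 'e':
    // plain Levenshtein needs two substitutions, OSA counts one transposition.
    assert_eq!(levenshtein("recieve", "receive"), 2);
    assert_eq!(osa_distance("recieve", "receive"), 1);
}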

It took quite some exploration of different solutions to come up with something I'm satisfied with: the spellchecking-relevant code is now isolated in one module (which makes #50 easier) and is not too heavy on binary size (adds ~2.1MB, ~900kB xz-compressed, for English).

@bminixhofer mentioned this pull request Mar 6, 2021
@sai-prasanna commented:

There is a symspell package in Rust: https://github.com/reneklacan/symspell. Can we benchmark it and check whether the memory requirements are satisfied? What are the hard bounds on memory you're looking to support for this library? And maybe we could optionally integrate that package if its speed improvement justifies it?

@bminixhofer (Owner, Author) commented Mar 12, 2021

Hi, I have to admit that I did not do a "proper" benchmark of SymSpell vs. the FST approach. One of the problems with SymSpell is that the precomputed data needs to either be stored in the binary (which blows up its size) or be recomputed every time the Rules are loaded (which takes multiple seconds on a good PC). Runtime memory is of course another issue.

What are the hard bounds on memory you're looking to support for this library?

There are no hard bounds on memory. Up until now I have decided trade-offs between speed and memory on a case-by-case basis, with a tendency to prefer speed. (Also, notably, there are actually not that many places where this is relevant; often the best solution is both speed- and memory-efficient.)

One issue with this library currently is that there are two very different use cases:

  1. Using it in an application. Portability, memory efficiency and small binaries are important (i.e. the typical Rust use case).
  2. Using it as a one-time pre-/postprocessing step, as a server, etc. (i.e. the typical Python use case).

So far I've tried to find middle ground between these two where it was relevant. For spellchecking, the FST-based approach is better for (1), while the SymSpell approach might be better for (2).

And maybe we could optionally integrate that package if its speed improvement justifies it?

I think this is the best solution. There is an open issue on modularizing the crate (#50) in a way in which integrating a second spellchecker would come quite naturally.

So, in short:

  • This PR and the initial implementation will keep the FST approach
  • I am very open to benchmarks between the FST approach and SymSpell
  • If it turns out that SymSpell is faster, it can be integrated as a second spellchecker once nlprule has better modularization.

@bminixhofer (Owner, Author) commented:

The API for spellchecking is now fully implemented; this is how it will look:

Rust

use nlprule::{Rules, Tokenizer};

let tokenizer = Tokenizer::new("path/to/tokenizer.bin").unwrap();
// `Rules` now take an `Arc<Tokenizer>` as second argument, like in the Python API 
// and do not require passing the tokenizer to the other methods anymore
let mut rules = Rules::new("path/to/rules.bin", tokenizer.into()).unwrap();

// by default, spellchecking is disabled
// the spellchecker can be accessed by `.spell()` and `.spell_mut()`
// and contains `SpellOptions` which can be accessed analogously
println!("{:#?}", rules.spell().options()); // prints the default options

// Enables spellchecking. `Variant` is a string newtype denoting valid variants
// the `.variant` method on the spellchecker gets the variant for a language code 
// or returns an error if none exists
rules.spell_mut().options_mut().variant = Some(rules.spell().variant("en_GB").unwrap());

// now `rules.correct` and `rules.suggest` also find spelling mistakes!

// there are some other options like a whitelist for words (the whitelist is actually a `HashSet<String>`)
rules
    .spell_mut()
    .options_mut()
    .whitelist
    .insert("nlprule".into());

// and that's it! there are also some new utility methods like `.variants()` on the spellchecker 
// which are not too important

Python

import pytest
from nlprule import Tokenizer, Rules

tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer)

# rules.spell is the spellchecker object
# does not do anything useful besides providing the ability to set and get options at the moment
print(rules.spell)

# the used variant is `None` by default, i.e. spellchecking is disabled
assert rules.spell.options.variant is None

# setting the variant to one which does not exist raises an error with a helpful message
with pytest.raises(ValueError):
    rules.spell.options.variant = "en_INVALID"

# setting the variant to a valid variant enables spellchecking
rules.spell.options.variant = "en_GB"

# any iterable works here, but it has to be assigned with =, the returned object is not mutable
rules.spell.options.whitelist = ["nlprule"]

assert rules.correct("nlprule has spelllchcking!") == "nlprule has spellchecking!"

I am quite happy with how this turned out. I'd appreciate some feedback regarding the API from people who were previously interested in this (@theelgirl, @drahnr) and of course from anyone else who is reading this. I might have missed something.

As usual, implementing this took more effort than anticipated (LT uses quite a bit more than just one wordlist for spellchecking), but I'm excited to add this functionality! There are some smaller things still left to do to get this over the finish line (see the checklist above). I hope to merge this PR & release next weekend.

Comment on lines +116 to +122
for ending in {
"S": ["s"],
"N": ["n"],
"E": ["e"],
"F": ["in"],
"A": ["e", "er", "es", "en", "em"],
"Ä": ["r", "s", "n", "m"],
@drahnr (Contributor) commented Mar 18, 2021

Use the language-specific affix file instead of hardcoding this; it will save you and a few other people a lot of pain :)

let rules = Rules::new(opts.rules).unwrap();
let tokenizer = Arc::new(Tokenizer::new(opts.tokenizer).unwrap());
let mut rules = Rules::new(opts.rules, tokenizer.clone()).unwrap();
rules.spell_mut().options_mut().variant = Some(rules.spell().variant("en_GB").unwrap());
@drahnr (Contributor) commented:

I find this a very, very un-ergonomic API.

A few particular points to avoid:

  • mutable options in public API
  • structs in public lib API
  • circular borrow fun (need rules.spell_mut() and rules.spell() at the same time)
  • combining artifacts that don't seem necessarily connected (spellcheck rules vs. grammar / tokenizer rules)

A more rustic API imho would be:

// explicit type annotations for clarity
let spellcheck: Spellcheck = Spellcheck::new(path_to_spellcheck_rules)?;
// alt:
let spellcheck: Spellcheck = Spellcheck::from_reader(fs::File::open(path_to_spellcheck_rules)?)?;
spellcheck.add_hunspell_dict

let previous: Option<Spellcheck> = rules.set_spell(spellcheck);
// and
rules.modify_spellcheck(move |spellcheck| Ok(spellcheck.extend_whitelist(words_iter)))?

Was there a particular reason not to split the spelling rules from the tokenization rules? A third file would not hurt and would avoid (cargo-spellcheck 🎩) issues with deployment size - 900kB extra is a lot when the upper limit is 10MB.

@bminixhofer (Owner, Author) commented:

I'll think about this. In my opinion, a mutable SpellOptions struct where all mutations are guaranteed to be valid and the spellchecker is updated when it is dropped is more ergonomic than something like rules.modify_spellcheck(move |spellcheck| Ok(spellcheck.extend_whitelist(words_iter)))?. E.g. it seamlessly allows all HashSet operations on the whitelist; otherwise those would have to be reimplemented as *_whitelist methods.
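
For reference, the guard pattern I have in mind looks roughly like this; a simplified sketch with illustrative names, not the actual nlprule internals:

use std::collections::HashSet;
use std::ops::{Deref, DerefMut};

#[derive(Debug, Default)]
pub struct SpellOptions {
    pub variant: Option<String>,
    pub whitelist: HashSet<String>,
}

pub struct Spell {
    options: SpellOptions,
    // ... FSTs, wordlists, derived lookup structures, etc.
}

// Returned by `Spell::options_mut`. All `SpellOptions` (and thus `HashSet`) operations
// are available through Deref/DerefMut; dropping the guard refreshes derived state.
pub struct SpellOptionsGuard<'a> {
    spell: &'a mut Spell,
}

impl Deref for SpellOptionsGuard<'_> {
    type Target = SpellOptions;
    fn deref(&self) -> &SpellOptions {
        &self.spell.options
    }
}

impl DerefMut for SpellOptionsGuard<'_> {
    fn deref_mut(&mut self) -> &mut SpellOptions {
        &mut self.spell.options
    }
}

impl Drop for SpellOptionsGuard<'_> {
    fn drop(&mut self) {
        // e.g. rebuild the whitelist lookup so subsequent checks see the new words
        self.spell.refresh();
    }
}

impl Spell {
    pub fn options(&self) -> &SpellOptions {
        &self.options
    }

    pub fn options_mut(&mut self) -> SpellOptionsGuard<'_> {
        SpellOptionsGuard { spell: self }
    }

    fn refresh(&mut self) { /* recompute derived structures */ }
}

This is what lets spell.options_mut().whitelist.insert("nlprule".into()) work without dedicated *_whitelist methods.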

mutable options in public API

Do you have some more information / source on this? I wouldn't have thought that this is bad.

circular borrow fun (need rules.spell_mut() and rules.spell() at the same time)

This wouldn't compile: you need the speller to get a Variant (the Variant is owned, it doesn't need the spellchecker), then you can use the Variant to set the options. But I'm also not perfectly happy with that part.

combining artifacts that don't seem necessarily connected (spellcheck rules vs. grammar / tokenizer rules)

This is a very valid point. The reason is that .correct() and .suggest() on the Rules should be easily configurable to do both grammar and spelling correction. The best fix would probably be some meta-struct to combine any number of suggestion providers (where Rules and Spellcheck are each one suggestion provider), which is strongly related to #50. Defining and implementing that would take significantly more work than I want to do before nlprule can do spellchecking 🙂

A third file would not hurt

A third binary and a function Rules.set_spell really do seem like a good idea. But I would argue that it decreases ergonomics, at least as long as the Spell structure is still part of the Rules.

when the upper limit is 10MB.

I'd just ask the crates.io team to increase the size limit so you don't have to worry about this (especially since (I think) you distribute some hunspell data as well). They do it on a case-by-case basis (rust-lang/crates.io#195). I understand the concern, but I'd argue that it is already quite impressive to store multiple hundred thousand valid words + lemmas + part-of-speech tags, rules, a noun chunker etc. in < 10MB. An extra 900kB for spellchecking should very rarely be an issue.

@bminixhofer (Owner, Author) commented:

There's one alternative API that moves some of the changes anticipated for #50 into this PR without going completely out of scope:

A new trait Suggest:

trait Suggest {
    fn suggest(&self, text: &str, tokenizer: &Tokenizer) -> Vec<Suggestion>; // or iterator
    fn correct(&self, text: &str, tokenizer: &Tokenizer) -> String {
        // default impl to just apply all the suggestions
    }
}

then Rules and a new, separate struct Spell implement Suggest. The Spell struct implements from_reader, new, etc. like the Rules but with an additional argument to denote the variant: Spell::new("path/to/bin", "en_GB").

There is one new struct Correcter with the fields:

struct Correcter {
   suggesters: Vec<Box<dyn Suggest>>
}

that can be constructed with

let correcter = Correcter::new(vec![
    Rules::from_reader(...).into(),
    Spell::from_reader(..., "en_GB").into()
])?; // implements `FromIterator` too

and implements Suggest as well. Getting the inner suggesters would then have to be done with downcast-rs and look like:

let rules = correcter.get::<Rules>(0)?;
// and mutably:
let spell = correcter.get_mut::<Spell>(1)?;
// I'm not 100% sure if this works with downcast-rs but I think it should
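
A rough sketch of how that downcasting could be wired up with downcast-rs; the Correcter::get / get_mut names come from the proposal above, and the empty types are stand-ins so the sketch is self-contained:

use downcast_rs::{impl_downcast, Downcast};

// Stand-ins for the real types.
pub struct Suggestion;
pub struct Tokenizer;
pub struct Rules;
pub struct Spell;

pub trait Suggest: Downcast {
    fn suggest(&self, text: &str, tokenizer: &Tokenizer) -> Vec<Suggestion>;
}
impl_downcast!(Suggest);

impl Suggest for Rules {
    fn suggest(&self, _text: &str, _tokenizer: &Tokenizer) -> Vec<Suggestion> {
        Vec::new() // grammar suggestions would go here
    }
}

impl Suggest for Spell {
    fn suggest(&self, _text: &str, _tokenizer: &Tokenizer) -> Vec<Suggestion> {
        Vec::new() // spelling suggestions would go here
    }
}

pub struct Correcter {
    suggesters: Vec<Box<dyn Suggest>>,
}

impl Correcter {
    // Returns the suggester at `index` if it has the concrete type `T`.
    pub fn get<T: Suggest>(&self, index: usize) -> Option<&T> {
        self.suggesters.get(index)?.downcast_ref::<T>()
    }

    pub fn get_mut<T: Suggest>(&mut self, index: usize) -> Option<&mut T> {
        self.suggesters.get_mut(index)?.downcast_mut::<T>()
    }
}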

Extending the whitelist etc. would stay the same as before:

spell.options_mut().whitelist.insert("nlprule".into());

But it would not need the guard anymore (so no Drop impl). I agree that mutable options behind some guard are problematic because they hide cost from the user; this way I really don't see an issue. Setting the variant is fallible and would thus not be part of the options:

spell.set_variant("en_US")?;

The Python API would of course be adjusted accordingly, keeping it in sync where possible.

The nice thing about all of this is that nothing is a breaking change since the Rules keeps the same methods, there are just additional Correcter and Spell structs. There are multiple problems though:

  • .correct and .suggest need the Tokenizer as a second argument. These will by far be the most commonly used methods, so I'd like them to be as simple as possible, e.g. fn correct(text: &str) -> String and fn suggest(text: &str) -> Vec<Suggestion> like in the current PR, without having to explicitly keep the tokenizer around if you don't need it.
  • The downcast magic happening in the Correcter is not ideal, but I believe it cannot be avoided.
  • A third binary hurts ergonomics, especially because including a binary is quite verbose at the moment.
  • Spellchecking suggestions should have lower priority than suggestions from the grammar rules. This is not trivial to enforce with the Correcter and might require some notion of priority on the Suggestion.

What do you think?

@@ -19,8 +21,8 @@ fn main() {
env_logger::init();
let opts = Opts::parse();

let tokenizer = Tokenizer::new(opts.tokenizer).unwrap();
let rules_container = Rules::new(opts.rules).unwrap();
let tokenizer = Arc::new(Tokenizer::new(opts.tokenizer).unwrap());
@drahnr (Contributor) commented:

Iff the above API changes are implemented, the Arc should be avoidable. Imho Spellcheck should parse the file and maybe use a Tokenizer reference only when extracting items.

Comment on lines +97 to +98
let prefix = parts.next().unwrap();
let suffix = parts.next().unwrap();
@drahnr (Contributor) commented:

I'd strongly suggest using debug_assert_eq!(parts.clone().count(), …) and .expect() instead.
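
A sketch of the suggested pattern; the separator and the expectation of exactly two parts are assumptions here, not taken from the PR:

fn split_line(line: &str) -> (&str, &str) {
    let mut parts = line.split('/');
    // cheap sanity check in debug builds only
    debug_assert_eq!(parts.clone().count(), 2, "malformed line: {:?}", line);
    let prefix = parts.next().expect("line has no prefix part");
    let suffix = parts.next().expect("line has no suffix part");
    (prefix, suffix)
}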

@@ -81,6 +89,8 @@ pub enum Error {
Unimplemented(String),
#[error("error parsing to integer: {0}")]
ParseError(#[from] ParseIntError),
#[error("nlprule error: {0}")]
@drahnr (Contributor) commented:

With #[from], adding an explicit {0} print would duplicate the error, since the backtrace will also contain the cause, which is implicit with #[from].

So #[error(transparent)] or #[error("nlprule error")].
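
Concretely, the difference looks roughly like this (a sketch with thiserror, not the actual nlprule error enum):

use std::num::ParseIntError;
use thiserror::Error;

#[derive(Debug, Error)]
pub enum Error {
    // Display prints the cause via {0} *and* source() exposes it (implied by #[from]),
    // so callers that render the whole error chain see the cause twice.
    #[error("error parsing to integer: {0}")]
    ParseError(#[from] ParseIntError),

    // Alternative 1: defer Display (and source) entirely to the inner error.
    #[error(transparent)]
    Io(#[from] std::io::Error),

    // Alternative 2: a short message of our own; the cause stays reachable via source().
    #[error("nlprule error")]
    Nlprule(#[source] std::io::Error),
}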

@drahnr (Contributor) commented Mar 18, 2021

Note that the review is a first glance and by no means complete; I'll have another go at the impl details tomorrow 🤞

Comment on lines 89 to +93
.rev()
.flatten()
.collect();
candidates.sort_by(|(_, a), (_, b)| a.cmp(b));
if candidates.is_empty() {
None
} else {
Some(candidates.remove(0).0)
}
candidates.sort_unstable();
candidates.dedup();
@drahnr (Contributor) commented Mar 18, 2021

Looks a bit like https://docs.rs/itertools/0.10.0/itertools/fn.kmerge.html could be applied 🙃
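
For reference, kmerge lazily merges several already-sorted iterators into one sorted iterator, which could replace the collect-sort-dedup sequence above; a toy example, not the PR's actual candidate types:

use itertools::Itertools;

fn main() {
    // Three pre-sorted candidate lists, merged into one sorted, deduplicated stream.
    let merged: Vec<u32> = vec![vec![1, 4, 7], vec![2, 4, 5], vec![3, 6]]
        .into_iter()
        .kmerge()
        .dedup()
        .collect();
    assert_eq!(merged, vec![1, 2, 3, 4, 5, 6, 7]);
}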

pub struct PosReplacer {
pub(crate) matcher: PosMatcher,
}

impl PosReplacer {
fn apply(&self, text: &str, tokenizer: &Tokenizer) -> Option<String> {
pub fn apply(&self, text: &str, tokenizer: &Tokenizer) -> Vec<String> {
@drahnr (Contributor) commented Mar 18, 2021

A lot of the API returns a collected Vec which could just as well be an impl Iterator at the cost of a few lifetime parameters. In the long run this can make a huge difference, since most collect calls could be avoided; each of those is an additive cost of N plus an additional allocation.
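
A sketch of the kind of signature change meant here, with stub types and a made-up filter condition just to keep it self-contained:

struct Tokenizer;

struct PosReplacer {
    replacements: Vec<String>,
}

impl PosReplacer {
    // Instead of eagerly collecting into a Vec<String>, return a lazy iterator;
    // callers that really need a Vec can still .collect(), everyone else avoids
    // the intermediate allocation.
    fn apply<'a>(
        &'a self,
        text: &'a str,
        _tokenizer: &'a Tokenizer,
    ) -> impl Iterator<Item = String> + 'a {
        self.replacements
            .iter()
            .filter(move |candidate| candidate.as_str() != text) // placeholder condition
            .cloned()
    }
}

fn main() {
    let replacer = PosReplacer {
        replacements: vec!["walks".into(), "walked".into()],
    };
    let out: Vec<String> = replacer.apply("walked", &Tokenizer).collect();
    println!("{:?}", out); // ["walks"]
}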

@bminixhofer (Owner, Author) commented:

Hi, thanks for taking a look!

I'll have another go at the impl details tomorrow

I was asking about the public API (thanks for your feedback on that!); the PR isn't fully ready for review.

@drahnr (Contributor) commented Mar 18, 2021

Ah, overlooked that. Let me know once you want a more introspective review!

@bminixhofer (Owner, Author) commented:

I will hold off on merging this for some time to explore better modularization first, see #50 (comment).

@bminixhofer mentioned this pull request Apr 15, 2021
@ssokolow commented Jul 21, 2022

let tokenizer = Tokenizer::new("path/to/tokenizer.bin").unwrap();
let mut rules = Rules::new("path/to/rules.bin", tokenizer.into()).unwrap();

Will there be a way to load from in-memory data rather than a path?

I find it both inconvenient and irritating that crates like ttspico don't let me bundle the dependencies into the binary with include_bytes! in situations where the primary goal is ease of deployment and maintainability (e.g. dropping a single binary into ~/bin/ or ~/.cargo/bin, possibly built against a -musl target for extra protection against dependency breakage).

...especially when Cargo's solutions for making sure that non-embedded dependencies wind up where the binary looks for them, across all the different ways of building and running artifacts, aren't the greatest.
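
For what it's worth, with the from_reader constructors discussed earlier in this thread, embedding the binaries would presumably look something like the sketch below. The exact signatures (in particular Rules::from_reader taking the tokenizer) are an assumption based on the path-based API shown above, not a confirmed part of this PR:

use std::io::Cursor;

use nlprule::{Rules, Tokenizer};

// Embed the data at compile time (paths are placeholders).
static TOKENIZER_BYTES: &[u8] = include_bytes!("../data/en_tokenizer.bin");
static RULES_BYTES: &[u8] = include_bytes!("../data/en_rules.bin");

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tokenizer = Tokenizer::from_reader(Cursor::new(TOKENIZER_BYTES))?;
    let rules = Rules::from_reader(Cursor::new(RULES_BYTES), tokenizer.into())?;

    println!("{}", rules.correct("She was not been here since Monday."));
    Ok(())
}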
