extra quirks for large codebases #82

drahnr · 2020-07-29T15:44:41Z

Is your feature request related to a use-case? Please describe.
Large code bases often contain special cases that are not easy to cover; some are fairly generic, others very specific.

Issues faced: `2x`, `'gap'` and `some-thing` should all be fine, but they are flagged as possible spelling mistakes.

Describe the solution you'd like
A combination of regex quirks for the hunspell checker.
Possibly a configurable set of hardcoded quirks, `quirks=["single-quoted", "quoted", "multipicity-x-suffix", "dash-free-compound-words"]`, since that would give a speed gain compared to fancy-regex compilation and execution.
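To make the hardcoded variant concrete, here is a minimal sketch using only the standard library. The `Quirk` enum, `apply_quirk` and `strip_enclosing` are hypothetical names, not existing code, and treating the dash quirk as "split on dashes and check each part" is an assumption about the intended behaviour.

```rust
/// Hypothetical quirk kinds, loosely mirroring the proposed config values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Quirk {
    SingleQuoted,        // 'gap'      -> gap
    Quoted,              // "gap"      -> gap
    MultiplicityXSuffix, // 2x         -> nothing left to check
    DashedCompound,      // some-thing -> some, thing (assumed semantics)
}

/// Strip one pair of enclosing `quote` characters, if present.
fn strip_enclosing(token: &str, quote: char) -> Option<&str> {
    let inner = token.strip_prefix(quote)?.strip_suffix(quote)?;
    if inner.is_empty() {
        None
    } else {
        Some(inner)
    }
}

/// Apply one quirk. Returns the words that still need a dictionary lookup,
/// or `None` if the quirk does not apply to `token`.
fn apply_quirk(quirk: Quirk, token: &str) -> Option<Vec<&str>> {
    match quirk {
        Quirk::SingleQuoted => strip_enclosing(token, '\'').map(|w| vec![w]),
        Quirk::Quoted => strip_enclosing(token, '"').map(|w| vec![w]),
        Quirk::MultiplicityXSuffix => {
            let digits = token.strip_suffix('x')?;
            if !digits.is_empty() && digits.chars().all(|c| c.is_ascii_digit()) {
                Some(Vec::new())
            } else {
                None
            }
        }
        Quirk::DashedCompound => {
            if token.contains('-') {
                Some(token.split('-').filter(|p| !p.is_empty()).collect())
            } else {
                None
            }
        }
    }
}

fn main() {
    let quirks = [
        Quirk::SingleQuoted,
        Quirk::Quoted,
        Quirk::MultiplicityXSuffix,
        Quirk::DashedCompound,
    ];
    for token in ["2x", "'gap'", "some-thing", "plain"] {
        // The first quirk that applies wins; no recursion here.
        let words = quirks.iter().find_map(|q| apply_quirk(*q, token));
        println!("{:?} -> {:?}", token, words);
    }
}
```

Plain string operations like these avoid compiling and running a regex per token, which is where the speed gain would come from.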

Describe alternatives you've considered
Spelling is hard. The only alternative would be to learn all of them, but that is a long shot (#41) and would be very tedious for pre-existing code bases.

Additional context

Required for adoption in the rust-lang/rust code base 🎉 rust-lang/rust#74697 (comment)

drahnr · 2020-07-29T15:47:20Z

This should be quite self-contained within `checker/hunspell.rs`, `main.rs` and `config.rs`.

CC @laysauchoa

zhiburt · 2020-08-01T19:32:20Z

Hey @drahnr I've tried it out and as always didn't succeed 😞

I'd like to address a couple of questions here.

* Should the `quirks` be applied recursively, so that for `"'2x'"` the `checker` would essentially simplify it to `2`?
* Should sub-parts of words be simplified as well? Example: `2x-something`, so it would basically check `2` and `something`.

As I see it, there are 2 types of rules here: ones of the form `&str -> &str` and ones of the form `&str -> Vec<&str>`.

drahnr · 2020-08-03T06:36:31Z

These are very good points!

> I'd like to address a couple of questions here.
>
> * Should the `quirks` be applied recursively so essentially for `"'2x'"` `checker` would simplify it to `2`.

Technically that sounds like a good idea, but it requires additional work (using a `mut` deque such as `VecDeque`, feeding in generated suggestions and processing the deque element by element). So for now, I'd say get the recursion-free version done, and then, in step 2, use the existing patterns to refactor the suggestion processing logic.
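A rough sketch of what the deque-based step 2 could look like, assuming a hypothetical `simplify_once` that applies a single quirk and a hypothetical `simplify_fully` driver (neither exists in the code base). It works on plain tokens rather than suggestions to keep the example short.

```rust
use std::collections::VecDeque;

/// Hypothetical single-step quirk application. Returns the simpler parts,
/// or `None` if no quirk applies to `token`.
fn simplify_once(token: &str) -> Option<Vec<&str>> {
    // Strip one layer of enclosing single or double quotes.
    for quote in ['\'', '"'] {
        if let Some(inner) = token.strip_prefix(quote).and_then(|t| t.strip_suffix(quote)) {
            return Some(vec![inner]);
        }
    }
    // `2x` -> `2`
    if let Some(digits) = token.strip_suffix('x') {
        if !digits.is_empty() && digits.chars().all(|c| c.is_ascii_digit()) {
            return Some(vec![digits]);
        }
    }
    // `2x-something` -> `2x`, `something`
    if token.contains('-') {
        return Some(token.split('-').filter(|p| !p.is_empty()).collect());
    }
    None
}

/// Feed every result back into the queue until no quirk applies any more.
/// Each step yields strictly shorter tokens, so the loop terminates.
fn simplify_fully(token: &str) -> Vec<&str> {
    let mut queue = VecDeque::new();
    queue.push_back(token);
    let mut done = Vec::new();
    while let Some(current) = queue.pop_front() {
        match simplify_once(current) {
            Some(parts) => queue.extend(parts),
            None => done.push(current),
        }
    }
    done
}

fn main() {
    println!("{:?}", simplify_fully("'2x'"));
    println!("{:?}", simplify_fully("'2x-something'"));
}
```

The queue replaces explicit recursion: a token that a quirk splits into several parts simply enqueues all of them, and only tokens that no quirk can simplify further would be handed to the dictionary.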

> * Should sub-parts of words be simplified as well? Example `2x-something` so it would check basically `2` and `something`.

I think the `2x` quirk should be expanded a bit to something like `^[0-9]+(?:[,.e][0-9]+)?(?:-.+)?$` for the particular token, so here we would just expand the notion of the `2x` pattern.

> As I see there's 2 types of rules here basically ones which produce `&str -> &str` and `&str -> Vec<&str>`.

I understand it like this: `fn tokenize` splits the chunks into words, which are then checked against the dictionary. If that detects a mistake or yields a suggestion, we call something like `fn quirks(..) -> Vec<Suggestion<_>>`, which can internally handle all quirks described earlier (non-recursive for now) and returns n suggestions. Returning suggestions here has the advantage that not much context needs to be fed into the `fn`, and it can do more complex things than just reduction.
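Roughly sketched, with the dictionary replaced by a `HashSet` and a simplified, hypothetical `Suggestion` struct (the real type is generic and carries more context), that flow could look like this. `check` and the individual quirk rules are assumptions for illustration only.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the checker's suggestion type.
#[derive(Debug, PartialEq)]
struct Suggestion {
    word: String,
    note: String,
}

/// Called only for tokens the dictionary rejected. Non-recursive: every
/// quirk gets one chance to explain the token before it is reported.
fn quirks(token: &str, dict: &HashSet<&str>) -> Vec<Suggestion> {
    // `'gap'` is fine if `gap` is a known word.
    if let Some(inner) = token.strip_prefix('\'').and_then(|t| t.strip_suffix('\'')) {
        if dict.contains(inner) {
            return Vec::new();
        }
    }
    // `2x`, `10x`, ... are multiplicities, not words.
    if let Some(digits) = token.strip_suffix('x') {
        if !digits.is_empty() && digits.chars().all(|c| c.is_ascii_digit()) {
            return Vec::new();
        }
    }
    // For `some-thing`, report only the parts the dictionary does not know.
    if token.contains('-') {
        return token
            .split('-')
            .filter(|part| !part.is_empty() && !dict.contains(part))
            .map(|part| Suggestion {
                word: part.to_string(),
                note: format!("unknown part of `{}`", token),
            })
            .collect();
    }
    // No quirk applies: report the token itself.
    vec![Suggestion {
        word: token.to_string(),
        note: "not in dictionary".to_string(),
    }]
}

/// Tokenize, look each word up, and fall back to `quirks` on a miss.
fn check(text: &str, dict: &HashSet<&str>) -> Vec<Suggestion> {
    text.split_whitespace()
        .filter(|word| !dict.contains(word))
        .flat_map(|word| quirks(word, dict))
        .collect()
}

fn main() {
    let dict: HashSet<&str> = ["the", "gap", "some", "thing"].iter().copied().collect();
    for suggestion in check("the 'gap' 2x some-thing wrod", &dict) {
        println!("{:?}", suggestion);
    }
}
```

Because `quirks` returns a `Vec`, a rule can drop a token entirely, rewrite it, or report several parts, so both rule shapes from the question fit the same signature.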

What do you think?

drahnr · 2020-08-13T08:51:11Z

@zhiburt take a look at #90. It implements the first step (more aligned with your proposal); repeated matching should be implemented as step 2.

drahnr · 2020-08-14T06:07:33Z

0.4.0-alpha.1 just hit the road. It includes a hunspell-specific backend quirk: `regex_transform: [ "re1", ... ]` specifies a list of regexes that the checker attempts to apply to individual words to remove e.g. an enclosing `'`; the capture groups are then checked against the dictionary.
Note that this only solves half of the issues, i.e. the dashed suggestions for concatenated words cannot be resolved this way.
Example: `testcase` in a text would be suggested to be `test-case`; we would like an option to avoid that kind of meaningless suggestion.
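One possible shape for such an option, as a sketch only: drop every suggestion that differs from the original word solely by inserted dashes. The helper names and the `allow_concatenation` parameter are assumptions for illustration and may not match what actually lands.

```rust
/// True if `suggestion` is just `word` with dashes inserted,
/// e.g. `testcase` -> `test-case`.
fn is_dash_only_change(word: &str, suggestion: &str) -> bool {
    suggestion.contains('-')
        && suggestion.chars().filter(|c| *c != '-').eq(word.chars())
}

/// Keep all suggestions unless `allow_concatenation` is set, in which case
/// dash-only variants of `word` are dropped.
fn filter_suggestions<'a>(
    word: &str,
    suggestions: &[&'a str],
    allow_concatenation: bool,
) -> Vec<&'a str> {
    suggestions
        .iter()
        .copied()
        .filter(|s| !(allow_concatenation && is_dash_only_change(word, s)))
        .collect()
}

fn main() {
    let suggestions = ["test-case", "testcases"];
    println!("{:?}", filter_suggestions("testcase", &suggestions, true));
    println!("{:?}", filter_suggestions("testcase", &suggestions, false));
}
```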

Closes #82

drahnr · 2020-08-14T08:05:04Z

Not entirely closed.

Closes #82

drahnr added the enhancement 🦚 (New feature or request), good first issue 🔰 (Good for newcomers) and checker / hunspell (hunspell checker related topics) labels Jul 29, 2020

drahnr assigned KuabeM and drahnr Jul 29, 2020

drahnr added this to the v0.4.0 milestone Jul 29, 2020

drahnr unassigned drahnr and KuabeM Jul 29, 2020

drahnr mentioned this issue Aug 10, 2020

feat/hunspell: add transformation / whitelist support #90

Merged

drahnr self-assigned this Aug 13, 2020

drahnr closed this as completed in #90 Aug 13, 2020

drahnr added a commit that referenced this issue Aug 14, 2020

feat/quirks: extract quirks from hunspell, add allow_concatenation

daba5aa

Closes #82

drahnr mentioned this issue Aug 14, 2020

add explicit concat quirks #91

Merged

drahnr reopened this Aug 14, 2020

drahnr closed this as completed in #91 Aug 14, 2020

drahnr added a commit that referenced this issue Aug 14, 2020

feat/quirks: extract quirks from hunspell, add allow_concatenation

1b7f32f

Closes #82
