extra quirks for large codebases #82

Closed
drahnr opened this issue Jul 29, 2020 · 6 comments · Fixed by #90 or #91
Labels: checker / hunspell, enhancement 🦚, good first issue 🔰

Comments

@drahnr
Owner

drahnr commented Jul 29, 2020

Is your feature request related to a use-case? Please describe.
Large code bases often contain special cases that are not easy to cover. Some of them are fairly generic, others are very specific.

Issues encountered: 2x, 'gap' and some-thing should all be accepted, but they are flagged as possible spelling mistakes.

Describe the solution you'd like
A combination of regex quirks for the hunspell checker.
Possibly a configurable set of hardcoded quirks, e.g. quirks=["single-quoted", "quoted", "multiplicity-x-suffix", "dash-free-compound-words"], since that would give a speed gain compared to fancy-regex compilation and execution.
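
A minimal sketch of what such hardcoded quirks could look like, assuming a hypothetical `Quirk` enum that is selected by the config strings from the proposal. The type, its methods and the helper are illustrative only and not part of the project; the dash-free compound word quirk is left out because it acts on suggestions rather than on the word itself.

```rust
/// Hypothetical set of hardcoded quirks, named after the proposed config strings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Quirk {
    SingleQuoted,
    Quoted,
    MultiplicityXSuffix,
}

impl Quirk {
    /// Map a config string to a quirk; unknown names yield `None`.
    pub fn from_name(name: &str) -> Option<Quirk> {
        match name {
            "single-quoted" => Some(Quirk::SingleQuoted),
            "quoted" => Some(Quirk::Quoted),
            // "multiplicity" is spelled out correctly here.
            "multiplicity-x-suffix" => Some(Quirk::MultiplicityXSuffix),
            _ => None,
        }
    }

    /// Try to reduce `word`; `None` means the quirk does not apply.
    pub fn apply<'a>(self, word: &'a str) -> Option<&'a str> {
        match self {
            Quirk::SingleQuoted => strip_enclosing(word, '\''),
            Quirk::Quoted => strip_enclosing(word, '"'),
            Quirk::MultiplicityXSuffix => {
                // Accept only `<digits>x`, e.g. `2x`.
                let digits = word.strip_suffix('x')?;
                if !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit()) {
                    Some(digits)
                } else {
                    None
                }
            }
        }
    }
}

/// Remove one matching pair of enclosing quote characters, if present.
fn strip_enclosing(word: &str, quote: char) -> Option<&str> {
    let inner = word.strip_prefix(quote)?.strip_suffix(quote)?;
    if inner.is_empty() {
        None
    } else {
        Some(inner)
    }
}
```

Plain string operations like these avoid compiling and running a regex per word, which is the speed argument made above.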

Describe alternatives you've considered
Spelling is hard. The only alternative would be to learn all of these cases, but that is a long shot (#41) and would be very tedious for pre-existing code bases.

Additional context

Required for adoption in the rust-lang/rust code base 🎉 rust-lang/rust#74697 (comment)

@drahnr drahnr added enhancement 🦚 New feature or request good first issue 🔰 Good for newcomers checker / hunspell hunspell checker related topics labels Jul 29, 2020
@drahnr drahnr added this to the v0.4.0 milestone Jul 29, 2020
@drahnr drahnr unassigned drahnr and KuabeM Jul 29, 2020
@drahnr
Owner Author

drahnr commented Jul 29, 2020

This should be quite self-contained within checker/hunspell.rs, main.rs and config.rs.

CC @laysauchoa

@zhiburt
Contributor

zhiburt commented Aug 1, 2020

Hey @drahnr I've tried it out and as always didn't succeed 😞

I'd like to address a couple of questions here.

  • Should the quirks be applied recursively, so that for "'2x'" the checker would essentially simplify it to 2?
  • Should sub-parts of words be simplified as well? For example, 2x-something would essentially be checked as 2 and something.

As I see it, there are basically two types of rules here: ones of the form &str -> &str and ones of the form &str -> Vec<&str>.
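
The two rule shapes can be sketched as follows. This is only an illustration under assumptions: the `Rule` enum and both helper functions are hypothetical names, and both shapes are normalised to a list of words that would then be checked against the dictionary.

```rust
/// Two hypothetical rule shapes: a 1:1 reduction and a 1:n split.
pub enum Rule {
    /// `&str -> &str`, e.g. stripping enclosing quotes.
    Reduce(fn(&str) -> Option<&str>),
    /// `&str -> Vec<&str>`, e.g. splitting a dashed compound.
    Split(fn(&str) -> Option<Vec<&str>>),
}

impl Rule {
    /// Apply the rule; `None` means it did not match the word.
    pub fn apply<'a>(&self, word: &'a str) -> Option<Vec<&'a str>> {
        match self {
            Rule::Reduce(f) => f(word).map(|w| vec![w]),
            Rule::Split(f) => f(word),
        }
    }
}

/// Example reduction: `'gap'` becomes `gap`.
fn strip_single_quotes(word: &str) -> Option<&str> {
    word.strip_prefix('\'')?.strip_suffix('\'')
}

/// Example split: `some-thing` becomes `some` and `thing`.
fn split_on_dash(word: &str) -> Option<Vec<&str>> {
    if word.contains('-') {
        Some(word.split('-').filter(|p| !p.is_empty()).collect())
    } else {
        None
    }
}
```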

@drahnr
Owner Author

drahnr commented Aug 3, 2020

These are very good points!

> I'd like to address a couple of questions here.
>
> * Should the `quirks` be applied recursively, so that for `"'2x'"` the `checker` would essentially simplify it to `2`?

Technically that sounds like a good idea, but it requires additional work (using a mut Deque, feeding in generated suggestions and processing the Deque element by element). So for now, I'd say get the recursion-free version done, and then, in step 2, use the existing patterns to refactor the suggestion processing logic.
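
A rough sketch of the queue-based idea, assuming `std::collections::VecDeque` as the deque and working on plain words rather than on real suggestions. `expand` and `reduce_all` are hypothetical names, and the three reductions inside `expand` are examples only.

```rust
use std::collections::VecDeque;

/// Hypothetical single step: returns the reduced forms of `word`,
/// or an empty vector when no quirk applies.
fn expand(word: &str) -> Vec<&str> {
    // Strip one pair of enclosing quotes.
    for &quote in ['\'', '"'].iter() {
        if let Some(inner) = word.strip_prefix(quote).and_then(|w| w.strip_suffix(quote)) {
            return vec![inner];
        }
    }
    // Split dashed compounds into their non-empty parts.
    if word.contains('-') {
        return word.split('-').filter(|p| !p.is_empty()).collect();
    }
    // Reduce `<digits>x` to `<digits>`.
    if let Some(digits) = word.strip_suffix('x') {
        if !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit()) {
            return vec![digits];
        }
    }
    Vec::new()
}

/// Feed generated forms back into the queue and process it element by
/// element until nothing can be reduced any further.
pub fn reduce_all(word: &str) -> Vec<&str> {
    let mut queue = VecDeque::from(vec![word]);
    let mut done = Vec::new();
    while let Some(current) = queue.pop_front() {
        let next = expand(current);
        if next.is_empty() {
            done.push(current);
        } else {
            queue.extend(next);
        }
    }
    done
}
```

Every step yields strictly shorter strings, so the loop terminates. The fully reduced words are what would be checked against the dictionary.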

> * Should sub-parts of words be simplified as well? For example, `2x-something` would essentially be checked as `2` and `something`.

I think the 2x quirk should be expanded a bit to something like ^[0-9]+(?:[,.e][0-9]+)?(?:-.+)?$ for the particular token, so here we would just expand the notion of 2x-pattern.
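
For illustration, the pattern can also be written as a hand-rolled matcher without a regex engine. Note that the pattern as quoted contains no literal x. The sketch below assumes an optional x directly after the numeric part, so that 2x and 2x-something match. That x and both function names are assumptions, not something the thread confirms.

```rust
/// Consume one or more ASCII digits; return the rest, or `None` if there are none.
fn eat_digits(s: &str) -> Option<&str> {
    let end = s
        .bytes()
        .position(|b| !b.is_ascii_digit())
        .unwrap_or(s.len());
    if end == 0 {
        None
    } else {
        Some(&s[end..])
    }
}

/// Hand-written equivalent of `^[0-9]+(?:[,.e][0-9]+)?x?(?:-.+)?$`.
/// The `x?` part is an ASSUMPTION added for this sketch.
pub fn is_multiplicity_token(token: &str) -> bool {
    // ^[0-9]+
    let mut rest = match eat_digits(token) {
        Some(rest) => rest,
        None => return false,
    };
    // (?:[,.e][0-9]+)?  only taken when digits follow the separator
    if let Some(after_sep) = rest.strip_prefix(|c: char| c == ',' || c == '.' || c == 'e') {
        if let Some(after_digits) = eat_digits(after_sep) {
            rest = after_digits;
        }
    }
    // x?  (assumed, see above)
    rest = rest.strip_prefix('x').unwrap_or(rest);
    // (?:-.+)?$
    match rest.strip_prefix('-') {
        Some(tail) => !tail.is_empty(),
        None => rest.is_empty(),
    }
}
```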

> As I see it, there are basically two types of rules here: ones of the form &str -> &str and ones of the form &str -> Vec<&str>.

I understand it like this: fn tokenize splits the chunks into words, and those tokens are then checked against the dictionary. If that yields a suggestion (i.e. detects a mistake), we call something like fn quirks(..) -> Vec<Suggestion<_>>, which can internally handle all the quirks described earlier (non-recursive for now) and returns n suggestions. Returning suggestions here has the advantage that not much context needs to be fed into the fn, and it can do more complex things than just reduction.
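
The described flow could look roughly like the following sketch. Everything here is a simplified assumption: the dictionary is a plain `HashSet`, `Suggestion` is a stand-in struct instead of the project's generic type, `tokenize` just splits on whitespace, and `quirks` only knows the single-quote reduction.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the suggestion type mentioned above.
#[derive(Debug, PartialEq)]
pub struct Suggestion {
    pub word: String,
    pub replacements: Vec<String>,
}

/// Naive tokenizer for the sketch: split the chunk on whitespace.
fn tokenize(chunk: &str) -> std::str::SplitWhitespace<'_> {
    chunk.split_whitespace()
}

/// Called only for words the dictionary rejected. Reduces the word once
/// (non-recursive) and returns no suggestion if the reduced form is known.
fn quirks(word: &str, dict: &HashSet<&str>) -> Vec<Suggestion> {
    let reduced = word
        .strip_prefix('\'')
        .and_then(|w| w.strip_suffix('\''))
        .unwrap_or(word);
    if dict.contains(reduced) {
        Vec::new()
    } else {
        vec![Suggestion {
            word: word.to_owned(),
            replacements: Vec::new(),
        }]
    }
}

/// Tokenize, check each token, and fall back to `quirks` on a miss.
pub fn check(chunk: &str, dict: &HashSet<&str>) -> Vec<Suggestion> {
    tokenize(chunk)
        .filter(|word| !dict.contains(word))
        .flat_map(|word| quirks(word, dict))
        .collect()
}
```

Because `quirks` returns suggestions rather than reduced words, it only needs the word and the dictionary as input, which matches the advantage named above.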

What do you think?

@drahnr
Owner Author

drahnr commented Aug 13, 2020

@zhiburt take a look at #90. It implements the first step (more closely aligned with your proposal); repeated matching should be implemented as step 2.

@drahnr drahnr self-assigned this Aug 13, 2020
@drahnr
Owner Author

drahnr commented Aug 14, 2020

0.4.0-alpha.1 just hit the road. It includes a hunspell-specific backend quirk: regex_transform: [ "re1", ... ] specifies a list of regexes which the checker attempts to apply to individual words, e.g. to remove an enclosing '. The capture groups are then checked against the dictionary.
Note that this only solves half of the issues: the dashed suggestions for concatenated words cannot be resolved this way.
Example: testcase in a text would be suggested to be test-case. We would like an option to avoid that kind of meaningless suggestion.
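
One possible way to suppress such suggestions, sketched under the assumption that suggestions arrive as plain strings. The function name is hypothetical and this is not how the project implements it.

```rust
/// Drop suggestions that differ from `word` only by inserted dashes,
/// e.g. `test-case` for `testcase`. All other suggestions are kept.
pub fn drop_dash_only_suggestions(word: &str, suggestions: Vec<String>) -> Vec<String> {
    suggestions
        .into_iter()
        .filter(|s| !(s.contains('-') && s.replace('-', "") == word))
        .collect()
}
```

A caller would then treat a word as correct if this filter removes every suggestion for it.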

@drahnr
Owner Author

drahnr commented Aug 14, 2020

Not entirely closed.
