LUCENE-9684: Hunspell: support COMPOUNDRULE #2228
Reviewing and merging commits separately might be a good idea.
e.g. for English ordinal numbers
otherwise dictionaries with long/number flags with all-caps words are broken
When checking an upper- or title-cased word, consider its case variants like Stemmer does
import org.apache.lucene.util.IntsRef;

class CompoundRule {
  private final char[] data;
I wonder if keeping char[] instead of the string itself buys us anything here. Strings can use more efficient storage on newer JVMs (compact strings), and using charAt instead of array accesses is pretty much the same?
Memory shouldn't be of much concern, as AFAIK there are usually very few rules. Array access is a bit shorter in code, so I'd prefer to leave it.
ok.
This is much more complex logic to review. Looks ok to me but I didn't verify it against what hunspell actually does (I trust you did it :). The tests look reasonable so if there are no regressions I'm +1 for committing it in.
Sure, I did :)
How do we proceed here? Should anyone else review this?
No, I'll commit it in. I have an unrelated day job so expect some delays.
Description
Hunspell uses COMPOUNDRULE, e.g. for ordinal numbers in the en-US dictionary (1st, 42nd, etc.)
Solution
COMPOUNDRULE is a regexp-like pattern over the flags of word parts. I've reimplemented Hunspell's logic, which breaks the word into parts in different ways and checks whether any COMPOUNDRULE matches them (commit 1). To support uppercase, I also had to repeat Stemmer's case variations (commit 3). While doing this, I discovered a bug in the all-caps treatment where the HIDDEN flag didn't play well with non-single-character flag formats, which I fixed (commit 2).
Tests
All compoundrule* tests are taken from the Hunspell repository, plus a randomized test for flag serialization and deserialization.
Checklist
Please review the following and check all that apply:
master branch
./gradlew check
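
To illustrate the idea from the Solution section, here is a minimal sketch of matching a COMPOUNDRULE-style pattern against the flags of already-split word parts. It is not the code from this PR: the class name CompoundRuleSketch, the matches method, and the sample flags are hypothetical, and it assumes single-character flags with the `*` (zero or more) and `?` (zero or one) quantifiers only.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not the CompoundRule class from this PR.
// Assumes single-character flags; long/numeric flag formats are not handled.
public class CompoundRuleSketch {

  /** Returns true if the whole sequence of word parts matches the whole pattern. */
  static boolean matches(String pattern, List<Set<Character>> partFlags) {
    return matches(pattern, 0, partFlags, 0);
  }

  private static boolean matches(
      String pattern, int p, List<Set<Character>> partFlags, int w) {
    if (p == pattern.length()) {
      // The pattern is exhausted: succeed only if all parts were consumed too.
      return w == partFlags.size();
    }
    char flag = pattern.charAt(p);
    char quantifier = p + 1 < pattern.length() ? pattern.charAt(p + 1) : 0;
    boolean partHasFlag = w < partFlags.size() && partFlags.get(w).contains(flag);

    if (quantifier == '*') {
      // Either skip this pattern element, or consume one part and retry the same element.
      return matches(pattern, p + 2, partFlags, w)
          || (partHasFlag && matches(pattern, p, partFlags, w + 1));
    }
    if (quantifier == '?') {
      // Either skip this pattern element, or consume exactly one part.
      return matches(pattern, p + 2, partFlags, w)
          || (partHasFlag && matches(pattern, p + 2, partFlags, w + 1));
    }
    // No quantifier: the current part must carry the flag.
    return partHasFlag && matches(pattern, p + 1, partFlags, w + 1);
  }
}
```

For example, with a made-up rule "n*1t" and the word "21st" split into parts flagged {n, m}, {n, 1}, and {t}, the matcher backtracks so that `n*` consumes only the first part, `1` matches the second, and `t` matches the third. A real checker would also have to enumerate the possible splits of the word and look up each part's flags in the dictionary, which this sketch leaves out.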