Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9684: Hunspell: support COMPOUNDRULE #2228

Merged
merged 3 commits into from
Jan 22, 2021
Merged

LUCENE-9684: Hunspell: support COMPOUNDRULE #2228

merged 3 commits into from
Jan 22, 2021

Conversation

donnerpeter
Copy link
Contributor

@donnerpeter donnerpeter commented Jan 21, 2021

Description

Hunspell uses COMPOUNDRULE e.g. for ordinal numbers in en-US dictionary (1st, 42nd, etc)

Solution

COMPOUNDRULE is a regexp-like pattern over word parts flags. I've reimplemented Hunspell's logic, which breaks the word into parts in different ways and checks whether any COMPOUNDRULE matches them (commit 1). To support uppercase, I had also to repeat Stemmer's case variations (commit 3). While doing this, I discovered a bug in all-caps treatment where HIDDEN flag didn't play well with non-single-character flag formats, which I fixed (commit 2).

Tests

All compoundrule* tests taken from Hunspell repository, plus a randomized test for flag serialization and deserialization.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the master branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Ref Guide (for Solr changes only).

@donnerpeter
Copy link
Contributor Author

Reviewing and merging commits separately might be a good idea.

e.g. for English ordinal numbers
otherwise dictionaries with long/number flags with all-caps words are broken
When checking an upper- or title-cased word, consider its case variants like Stemmer does
import org.apache.lucene.util.IntsRef;

class CompoundRule {
private final char[] data;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if keeping char[] instead of the string itself buys us anything here. Strings can use more efficient storage than on newer JVMs (compact strings) and using charAt instead of array accesses is pretty much the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Memory shouldn't be of much concern, as AFAIK there are usually very few rules. Array access is a bit shorter in code, so I'd prefer to leave it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

Copy link
Contributor

@dweiss dweiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much more complex logic to review. Looks ok to me but I didn't verify it against what hunspell actually does (I trust you did it :). The tests look reasonable so if there are no regressions I'm +1 for committing it in.

@donnerpeter
Copy link
Contributor Author

I didn't verify it against what hunspell actually does (I trust you did it :).

Sure, I did :)

@donnerpeter
Copy link
Contributor Author

How do we proceed here? Should anyone else review this?

@dweiss
Copy link
Contributor

dweiss commented Jan 22, 2021

No, I'll commit it in. I have an unrelated day job so expect some delays.

@dweiss dweiss merged commit d796813 into apache:master Jan 22, 2021
@donnerpeter donnerpeter deleted the compoundRule branch January 22, 2021 11:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants