Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KeywordRepeatFilter + OpenNLPLemmatizer Early Exit #11771

Closed
kotman12 opened this issue Sep 14, 2022 · 5 comments
Closed

KeywordRepeatFilter + OpenNLPLemmatizer Early Exit #11771

kotman12 opened this issue Sep 14, 2022 · 5 comments
Assignees
Labels
Milestone

Comments

@kotman12
Copy link
Contributor

kotman12 commented Sep 14, 2022

Description

KeywordRepeatFilter + OpenNLPLemmatizer leads to arbitrarily early exit of token stream.

Steps to reproduce: run this test and notice how no text below this line from the test file gets analyzed.

The root cause appears to be an extraneous exit condition that doesn't play nicely with KeywordRepeatFilter.

This is related to the bug #11735 and is addressed by #11734

Version and environment details

latest version of lucene running jdk-17

@kotman12 kotman12 changed the title KeywordRepeatFilter + OpenNLPLLemmatizer Early Exit KeywordRepeatFilter + OpenNLPLemmatizer Early Exit Sep 14, 2022
@dweiss dweiss added this to the 9.5.0 milestone Sep 23, 2022
@dweiss dweiss closed this as completed Sep 23, 2022
@dweiss dweiss self-assigned this Sep 23, 2022
@dweiss
Copy link
Contributor

dweiss commented Sep 23, 2022

https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-9.x/3057/

Hmm... this patch applied to 9x fails the tests. Could you take a look at that, @kotman12 ?

@dweiss dweiss reopened this Sep 23, 2022
@dweiss
Copy link
Contributor

dweiss commented Sep 23, 2022

I can reproduce those failures with JDK11 but not with JDK17. I didn't look into this deeper.

@kotman12
Copy link
Contributor Author

kotman12 commented Sep 23, 2022

Very interesting .. will take a look. First thing that comes to mind is a bug with the output array creation in the actual test https://github.com/kotman12/lucene/blob/fix-sentence-iteration/lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPLemmatizerFilterFactory.java#L335

This is especially suspicious https://github.com/kotman12/lucene/blob/fix-sentence-iteration/lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPLemmatizerFilterFactory.java#L337 .. I wonder if jdk17 has some optimization with empty string that it doesn't create new ones and thus that != "works" by coincidence.

@kotman12
Copy link
Contributor Author

So this change seems to fix the test locally for me in branch 9x .. Created a PR for the upstream .. not sure how you want to handle the reversion in 9X branch

@dweiss
Copy link
Contributor

dweiss commented Sep 24, 2022

If this code went in the main branch then it's also a bug there. Comparing strings by reference is a no-no - I should have caught it earlier. I'll do the update on both branches later today.

@dweiss dweiss closed this as completed Sep 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants