KeywordRepeatFilter + OpenNLPLemmatizer Early Exit #11771

kotman12 · 2022-09-14T14:40:29Z

Description

KeywordRepeatFilter + OpenNLPLemmatizer leads to an arbitrarily early exit of the token stream.

Steps to reproduce: run this test and notice that no text below this line in the test file gets analyzed.

The root cause appears to be an extraneous exit condition that doesn't play nicely with KeywordRepeatFilter.

This is related to bug #11735 and is addressed by #11734.

Version and environment details

Latest version of Lucene, running on JDK 17.

dweiss · 2022-09-23T18:12:51Z

https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-9.x/3057/

Hmm... this patch applied to 9x fails the tests. Could you take a look at that, @kotman12 ?

dweiss · 2022-09-23T18:24:33Z

I can reproduce those failures with JDK11 but not with JDK17. I didn't look into this deeper.

kotman12 · 2022-09-23T19:38:45Z

Very interesting, will take a look. The first thing that comes to mind is a bug in the output array creation in the actual test: https://github.com/kotman12/lucene/blob/fix-sentence-iteration/lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPLemmatizerFilterFactory.java#L335

This line is especially suspicious: https://github.com/kotman12/lucene/blob/fix-sentence-iteration/lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPLemmatizerFilterFactory.java#L337. I wonder whether JDK 17 has some optimization for empty strings so that it doesn't create new instances, and thus the != comparison "works" by coincidence.

kotman12 · 2022-09-23T20:27:32Z

This change seems to fix the test locally for me on branch 9x. I created a PR against upstream; I'm not sure how you want to handle the reversion on the 9x branch.

dweiss · 2022-09-24T07:21:02Z

If this code went into the main branch, then it's a bug there too. Comparing strings by reference is a no-no; I should have caught it earlier. I'll update both branches later today.
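For readers following along, here is a minimal sketch of the pitfall discussed above. It is not the code from the test; the class and method names (StringComparisonSketch, sameReference, sameContent) are made up for illustration. It only shows that == on strings compares object identity, so two empty strings can be equal by content yet be different objects. Whether a runtime-produced empty string happens to share identity with the "" literal depends on the JDK implementation, which is why such a check can pass on one JDK and fail on another.

```java
public class StringComparisonSketch {

    // Reference comparison: true only if both variables point to the same object.
    public static boolean sameReference(String a, String b) {
        return a == b;
    }

    // Content comparison: true if both strings hold the same characters.
    public static boolean sameContent(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String literal = "";            // compile-time constant, interned
        String fresh = new String("");  // 'new' always allocates a distinct object

        // Two literals refer to the same interned object.
        System.out.println(sameReference(literal, ""));     // true
        // A freshly allocated empty string is a different object.
        System.out.println(sameReference(literal, fresh));  // false
        // Content comparison does not care about identity.
        System.out.println(sameContent(literal, fresh));    // true
    }
}
```

In a test that filters out empty strings, using isEmpty() or equals("") instead of != "" avoids depending on identity.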

kotman12 added the type:bug label Sep 14, 2022

kotman12 changed the title ~~KeywordRepeatFilter + OpenNLPLLemmatizer Early Exit~~ KeywordRepeatFilter + OpenNLPLemmatizer Early Exit Sep 14, 2022

kotman12 mentioned this issue Sep 21, 2022

fix sentence iteration in opennlp package #11802

Closed

dweiss added this to the 9.5.0 milestone Sep 23, 2022

dweiss closed this as completed Sep 23, 2022

dweiss self-assigned this Sep 23, 2022

dweiss reopened this Sep 23, 2022

dweiss closed this as completed Sep 24, 2022
