Incorrect sentence boundaries with repeating tokens in OpenNLP package #11735

kotman12 · 2022-09-01T18:36:35Z

Description

Initial issue: KeywordRepeatFilter + OpenNLPLemmatizer leads to an empty token list when the stream contains a single token.

Steps to reproduce: run TestOpenNLPLemmatizerFilterFactory.testNoBreakWithRepeatKeywordFilter and observe that 0 tokens are returned after processing the text “period”.

Underlying issue: the opennlp package mishandles sentence boundary detection in general when KeywordRepeatFilter is added. The issue flies under the radar because the tests don’t verify which tokens are processed together as one sentence. Below is a screenshot showing that the last token of the last sentence gets dropped. This is usually harmless because that token is punctuation most of the time, but it becomes a real problem when the text at the end of a stream has no punctuation.

For example, consider the text "This is some sentence". If you pass this on its own into an analysis chain identical to the one configured in TestOpenNLPLemmatizerFilterFactory.testNoBreakWithRepeatKeywordFilter, you will see this:

Suggested fix: Linking #11734 as the suggested fix. The gist is to use a one-step lookahead when processing the token stream, so that sentence transitions are detected correctly in the general case of repeating tokens. I have centralized the inner sentence token loop, which had been repeated across the different sentence-aware filters. The suggested fix also removes other seemingly unnecessary conditional branching and tidies up the different OpenNLP filters so they operate more similarly to one another (at least wherever possible).
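To illustrate the lookahead idea, here is a minimal standalone sketch. It is not the code from #11734 and does not use the Lucene API. The `Token` class and its `sentenceIndex` field are hypothetical stand-ins, assuming each token somehow carries the index of the sentence it belongs to. A sentence is closed only after peeking at the next token and seeing that it belongs to a different sentence (or that the stream is exhausted), so a repeated last token stays with its sentence.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class SentenceLookahead {

  /** Hypothetical token: its text plus the index of the sentence it belongs to. */
  public static final class Token {
    final String term;
    final int sentenceIndex;

    public Token(String term, int sentenceIndex) {
      this.term = term;
      this.sentenceIndex = sentenceIndex;
    }
  }

  /** Groups token terms into sentences using a one-step lookahead. */
  public static List<List<String>> groupBySentence(Iterator<Token> stream) {
    List<List<String>> sentences = new ArrayList<>();
    Token current = stream.hasNext() ? stream.next() : null;
    List<String> sentence = new ArrayList<>();
    while (current != null) {
      sentence.add(current.term);
      // Peek one token ahead before deciding whether the sentence is over.
      Token next = stream.hasNext() ? stream.next() : null;
      if (next == null || next.sentenceIndex != current.sentenceIndex) {
        sentences.add(sentence);
        sentence = new ArrayList<>();
      }
      current = next;
    }
    return sentences;
  }

  public static void main(String[] args) {
    // "This is some sentence" with every token emitted twice,
    // mimicking what a repeating filter would produce.
    List<Token> tokens = new ArrayList<>();
    for (String term : new String[] {"This", "is", "some", "sentence"}) {
      tokens.add(new Token(term, 0));
      tokens.add(new Token(term, 0));
    }
    // prints [[This, This, is, is, some, some, sentence, sentence]]
    System.out.println(groupBySentence(tokens.iterator()));
  }
}
```

With this shape, the single-token case from the failing test ("period" emitted twice) yields one sentence containing both copies rather than an empty list.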

Version and environment details

Latest version of Lucene running on JDK 17


kotman12 added the type:bug label Sep 1, 2022

kotman12 mentioned this issue Sep 1, 2022

Fix repeating token sentence boundary bug #11734

Merged

kotman12 mentioned this issue Sep 14, 2022

KeywordRepeatFilter + OpenNLPLemmatizer Early Exit #11771

Closed

kotman12 mentioned this issue Sep 21, 2022

fix sentence iteration in opennlp package #11802

Closed

dweiss added this to the 9.5.0 milestone Sep 23, 2022

dweiss self-assigned this Sep 23, 2022

dweiss closed this as completed Sep 23, 2022
