Incorrect sentence boundaries with repeating tokens in OpenNLP package #11735

Closed
kotman12 opened this issue Sep 1, 2022 · 0 comments
kotman12 commented Sep 1, 2022

Description

Initial issue: KeywordRepeatFilter + OpenNLPLemmatizerFilter produces an empty token list when the token stream contains a single token.

Steps to reproduce: run TestOpenNLPLemmatizerFilterFactory.testNoBreakWithRepeatKeywordFilter and observe that 0 tokens are returned after processing the text “period”.

Underlying issue: the opennlp package mishandles sentence boundary detection in general when KeywordRepeatFilter is added. The issue flies under the radar because the tests don’t verify which tokens are processed together as one sentence. The screenshot below shows that the last token of the last sentence gets dropped. This usually goes unnoticed because that token is punctuation most of the time, but it becomes a real problem when the text at the end of a stream has no closing punctuation.

For example, consider the text "This is some sentence". If you pass this on its own into an analysis chain identical to the one configured in TestOpenNLPLemmatizerFilterFactory.testNoBreakWithRepeatKeywordFilter, you will see this:

[image: screenshot of the resulting analysis output]

Suggested fix: #11734 is linked as the suggested fix. The gist is to use a one-step lookahead when processing the token stream, so that sentence transitions are detected correctly in the general case of repeating tokens. I have centralized the inner sentence token loop, which had been repeated across the different sentence-aware filters. The fix also removes other seemingly unnecessary conditional branching and tidies up the different OpenNLP filters so they operate more similarly to one another (at least wherever possible).
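
To illustrate the lookahead idea in isolation, here is a minimal, self-contained sketch. It is not the code from #11734 and does not use the Lucene API: `Token`, `repeatEach`, `groupNaive` and `groupWithLookahead` are hypothetical names, and the `endOfSentence` and `repeat` fields are assumed stand-ins for whatever the real filters inspect. The naive grouping is only meant to show how a repeated copy can end up outside its sentence; it does not reproduce the exact failure in the screenshot.

```java
import java.util.ArrayList;
import java.util.List;

class LookaheadSentenceSketch {

  /** Hypothetical stand-in for a token and the two properties that matter here. */
  static final class Token {
    final String term;
    final boolean endOfSentence; // assumed: set on the last token of a sentence
    final boolean repeat;        // assumed: true for the duplicated copy of the previous token

    Token(String term, boolean endOfSentence, boolean repeat) {
      this.term = term;
      this.endOfSentence = endOfSentence;
      this.repeat = repeat;
    }
  }

  /** Emits every word twice, as a keyword-repeat step would; the last word ends the sentence. */
  static List<Token> repeatEach(String... words) {
    List<Token> out = new ArrayList<>();
    for (int i = 0; i < words.length; i++) {
      boolean eos = i == words.length - 1;
      out.add(new Token(words[i], eos, false)); // original copy
      out.add(new Token(words[i], eos, true));  // repeated copy carries the same flag
    }
    return out;
  }

  /** Closes a sentence as soon as any end-of-sentence token is seen. */
  static List<List<String>> groupNaive(List<Token> tokens) {
    List<List<String>> sentences = new ArrayList<>();
    List<String> current = new ArrayList<>();
    for (Token t : tokens) {
      current.add(t.term);
      if (t.endOfSentence) {
        sentences.add(current);
        current = new ArrayList<>();
      }
    }
    if (!current.isEmpty()) {
      sentences.add(current);
    }
    return sentences;
  }

  /** Closes a sentence only if the next token is not a repeat of the current one. */
  static List<List<String>> groupWithLookahead(List<Token> tokens) {
    List<List<String>> sentences = new ArrayList<>();
    List<String> current = new ArrayList<>();
    for (int i = 0; i < tokens.size(); i++) {
      Token t = tokens.get(i);
      current.add(t.term);
      // One-step lookahead: peek at the following token before deciding.
      boolean nextIsRepeat = i + 1 < tokens.size() && tokens.get(i + 1).repeat;
      if (t.endOfSentence && !nextIsRepeat) {
        sentences.add(current);
        current = new ArrayList<>();
      }
    }
    if (!current.isEmpty()) {
      sentences.add(current);
    }
    return sentences;
  }

  public static void main(String[] args) {
    List<Token> stream = repeatEach("This", "is", "some", "sentence");
    // prints [[This, This, is, is, some, some, sentence], [sentence]]
    System.out.println(groupNaive(stream));
    // prints [[This, This, is, is, some, some, sentence, sentence]]
    System.out.println(groupWithLookahead(stream));
  }
}
```

In this sketch the naive loop strands the repeated copy of the final token in a sentence of its own, while the lookahead version keeps both copies in the sentence they belong to, including the single-token case ("period").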

Version and environment details

Latest version of Lucene, running on JDK 17
