Description
Initial issue: KeywordRepeatFilter + OpenNLPLemmatizer leads to an empty token list when the stream contains a single token.
Steps to reproduce: run TestOpenNLPLemmatizerFilterFactory.testNoBreakWithRepeatKeywordFilter and observe that 0 tokens are returned after processing the text “period”.
Underlying issue: the opennlp package mishandles sentence boundary detection in general when KeywordRepeatFilter is added to the analysis chain. The issue goes unnoticed because the tests don’t verify which tokens are processed together as one sentence. Below is a screenshot showing that the last token of the last sentence gets dropped. This is usually harmless, since that token is punctuation most of the time, but it becomes a real problem when the last bit of text in a stream has no trailing punctuation.
For example, consider the text "This is some sentence". If you pass it on its own into an analysis chain identical to the one configured in TestOpenNLPLemmatizerFilterFactory.testNoBreakWithRepeatKeywordFilter, you will see this:
Suggested fix: #11734. The gist is to use a one-step lookahead when processing the token stream, so that sentence transitions are detected correctly in the general case of repeated tokens. I have centralized the inner sentence token loop, which had been duplicated across the different sentence-aware filters. The fix also removes other seemingly unnecessary conditional branching and tidies up the OpenNLP filters so that they operate more similarly to one another, at least wherever possible.
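To illustrate the lookahead idea outside of Lucene, here is a minimal, self-contained sketch. It is not the code from the linked PR and does not use Lucene's attribute API. The names `Tok`, `repeat`, and `splitWithLookahead` are hypothetical, and the sketch assumes that a repeated token carries a position increment of 0 and that the end-of-sentence flag is set on both the original token and its repeat.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of sentence grouping over a token stream; not Lucene's real API. */
public class SentenceLookaheadSketch {

  /** Hypothetical token: term text, position increment, end-of-sentence flag. */
  public static final class Tok {
    public final String term;
    public final int posInc;   // assumption: 0 marks a repeat of the previous token
    public final boolean eos;  // assumption: set on a sentence's last token and on its repeat

    public Tok(String term, int posInc, boolean eos) {
      this.term = term;
      this.posInc = posInc;
      this.eos = eos;
    }
  }

  /** Emits every token twice, the second copy at the same position (posInc = 0). */
  public static List<Tok> repeat(List<Tok> in) {
    List<Tok> out = new ArrayList<>();
    for (Tok t : in) {
      out.add(t);
      out.add(new Tok(t.term, 0, t.eos));
    }
    return out;
  }

  /**
   * Groups terms into sentences. A sentence is closed only when the current
   * token is flagged as end-of-sentence AND the next token is not a repeat
   * at the same position (or there is no next token).
   */
  public static List<List<String>> splitWithLookahead(List<Tok> toks) {
    List<List<String>> sentences = new ArrayList<>();
    List<String> current = new ArrayList<>();
    for (int i = 0; i < toks.size(); i++) {
      Tok t = toks.get(i);
      current.add(t.term);
      // One-step lookahead: peek at the next token before deciding to close.
      boolean hasNext = i + 1 < toks.size();
      boolean nextIsRepeat = hasNext && toks.get(i + 1).posInc == 0;
      if (t.eos && !nextIsRepeat) {
        sentences.add(current);
        current = new ArrayList<>();
      }
    }
    // Trailing text without an end-of-sentence flag still forms a sentence.
    if (!current.isEmpty()) {
      sentences.add(current);
    }
    return sentences;
  }
}
```

In this model, a single flagged token "period" passed through `repeat` stays together with its copy in one sentence instead of being split at the first flagged token, and trailing text with no flag at all is still returned as a sentence.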
Version and environment details
Latest version of Lucene, running on JDK 17