New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-9088: JapaneseNumberFilter uses inaccurate PartOfSpeechAttribute #1073
base: master
Are you sure you want to change the base?
Conversation
Currently the JapaneseNumberFilter reads past a single or multiple numeric tokens and emits the new composed token with the attributes of the following token. This will often lead to e.g. wrong part-of-speech attributes on the numeric token, which in turn can lead to wrong filtering by subsequent filters. This change keeps track of the state of the last numeric token while iterating over a number group and restores the last seen state before emiting the composed numeric token, so we use the attributes of the last one.
@@ -218,6 +228,11 @@ public final boolean incrementToken() throws IOException { | |||
// capture the state of this token and emit it on our next incrementToken() | |||
state = captureState(); | |||
} | |||
// we restore state to when we read the last numeral token to get its attributes (e.g. part-of-speech) | |||
if (lastNumeralTokenState != null) { | |||
restoreState(lastNumeralTokenState); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: simply setting the PartOfSpeechAttribute to "noun-numeric" on the emited token wasn't as straight forward as I expected, since the implementation wraps a whole org.apache.lucene.analysis.ja.Token
. This is why I explored tracking and restoring the last "good" tokens state here.
if (isNumeral(term)) { | ||
// lastNumeralTokenState = captureState(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this block do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this shouldn't be left out, it's a left over from when I was doing some local debugging.
I'll push an update shortly. The classes javadocs say that the filter "will inherit the values of the last token used to compose the normalized number", and in order to do so, that state is saved here on every new numeral token encountered.
Currently the JapaneseNumberFilter reads past a single or multiple numeric
tokens and emits the new composed token with the attributes of the following
token. This will often lead to e.g. wrong part-of-speech attributes on the
numeric token, which in turn can lead to wrong filtering by subsequent filters.
This change keeps track of the state of the last numeric token while iterating
over a number group and restores the last seen state before emiting the composed
numeric token, so we use the attributes of the last one.