Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9088: JapaneseNumberFilter uses inaccurate PartOfSpeechAttribute #1073

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

cbuescher
Copy link
Contributor

Currently the JapaneseNumberFilter reads past a single or multiple numeric
tokens and emits the new composed token with the attributes of the following
token. This will often lead to e.g. wrong part-of-speech attributes on the
numeric token, which in turn can lead to wrong filtering by subsequent filters.

This change keeps track of the state of the last numeric token while iterating
over a number group and restores the last seen state before emiting the composed
numeric token, so we use the attributes of the last one.

Currently the JapaneseNumberFilter reads past a single or multiple numeric
tokens and emits the new composed token with the attributes of the following
token. This will often lead to e.g. wrong part-of-speech attributes on the
numeric token, which in turn can lead to wrong filtering by subsequent filters.

This change keeps track of the state of the last numeric token while iterating
over a number group and restores the last seen state before emiting the composed
numeric token, so we use the attributes of the last one.
@@ -218,6 +228,11 @@ public final boolean incrementToken() throws IOException {
// capture the state of this token and emit it on our next incrementToken()
state = captureState();
}
// we restore state to when we read the last numeral token to get its attributes (e.g. part-of-speech)
if (lastNumeralTokenState != null) {
restoreState(lastNumeralTokenState);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: simply setting the PartOfSpeechAttribute to "noun-numeric" on the emited token wasn't as straight forward as I expected, since the implementation wraps a whole org.apache.lucene.analysis.ja.Token. This is why I explored tracking and restoring the last "good" tokens state here.

Comment on lines 219 to 221
if (isNumeral(term)) {
// lastNumeralTokenState = captureState();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this block do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this shouldn't be left out, it's a left over from when I was doing some local debugging.
I'll push an update shortly. The classes javadocs say that the filter "will inherit the values of the last token used to compose the normalized number", and in order to do so, that state is saved here on every new numeral token encountered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants