LUCENE-9088: JapaneseNumberFilter uses inaccurate PartOfSpeechAttribute #1073

cbuescher · 2019-12-11T14:44:22Z

Currently the JapaneseNumberFilter reads past one or more numeric tokens and
emits the new composed token with the attributes of the token that follows the
number group. This often leads to, e.g., wrong part-of-speech attributes on the
numeric token, which in turn can lead to wrong filtering by subsequent filters.

This change keeps track of the state of the last numeric token while iterating
over a number group and restores that state before emitting the composed
numeric token, so the composed token uses the attributes of the last numeral.
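
The idea can be illustrated with a small, self-contained model. This is only a sketch and not the actual patch: `NumberGroupSketch`, `Tok`, and `composeNumbers` are hypothetical names, the "noun-suffix-counter" tag is a made-up label, a plain object stands in for Lucene's captured attribute state, and numerals are simply concatenated instead of being normalized the way the real filter does.

```java
import java.util.ArrayList;
import java.util.List;

public class NumberGroupSketch {

  /** Simplified stand-in for a token and its attributes (not a Lucene class). */
  public static final class Tok {
    public final String term;
    public final String partOfSpeech;

    public Tok(String term, String partOfSpeech) {
      this.term = term;
      this.partOfSpeech = partOfSpeech;
    }
  }

  /** Assumption for this sketch: a numeral token is a non-empty run of digits. */
  public static boolean isNumeral(String term) {
    if (term.isEmpty()) {
      return false;
    }
    for (int i = 0; i < term.length(); i++) {
      if (!Character.isDigit(term.charAt(i))) {
        return false;
      }
    }
    return true;
  }

  /** Merges consecutive numeral tokens, keeping the attributes of the last numeral. */
  public static List<Tok> composeNumbers(List<Tok> input) {
    List<Tok> out = new ArrayList<>();
    StringBuilder number = new StringBuilder();
    Tok lastNumeral = null; // plays the role of the saved state of the last numeral token

    for (Tok t : input) {
      if (isNumeral(t.term)) {
        number.append(t.term);
        lastNumeral = t; // remember the most recent numeral on every numeral token
        continue;
      }
      if (lastNumeral != null) {
        // We have read one token past the group. Emit the composed number with
        // the last numeral's attributes, not those of the current token.
        out.add(new Tok(number.toString(), lastNumeral.partOfSpeech));
        number.setLength(0);
        lastNumeral = null;
      }
      out.add(t);
    }
    if (lastNumeral != null) {
      // The input ended inside a number group.
      out.add(new Tok(number.toString(), lastNumeral.partOfSpeech));
    }
    return out;
  }
}
```

For the input tokens `1` and `00` (both tagged "noun-numeric") followed by `yen` (tagged "noun-suffix-counter"), the composed token `100` carries "noun-numeric" rather than the tag of `yen`.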

cbuescher · 2019-12-11T14:46:42Z

lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseNumberFilter.java

@@ -218,6 +228,11 @@ public final boolean incrementToken() throws IOException {
        // capture the state of this token and emit it on our next incrementToken()
        state = captureState();
      }
+      // we restore state to when we read the last numeral token to get its attributes (e.g. part-of-speech)
+      if (lastNumeralTokenState != null) {
+        restoreState(lastNumeralTokenState);


Note: simply setting the PartOfSpeechAttribute to "noun-numeric" on the emitted token wasn't as straightforward as I expected, since the implementation wraps a whole org.apache.lucene.analysis.ja.Token. This is why I explored tracking and restoring the last "good" token's state here.

madrob · 2019-12-18T16:40:48Z

lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseNumberFilter.java

+        if (isNumeral(term)) {
+          // lastNumeralTokenState = captureState();
+        }


What does this block do?

cbuescher · 2019-12-18

Sorry, this line shouldn't be commented out; that's a leftover from some local debugging.
I'll push an update shortly. The class's javadocs say that the filter "will inherit the values of the last token used to compose the normalized number", and in order to do so, that state is saved here on every new numeral token encountered.

Remove accidental comment

81c84be
