Skip to content

LUCENE-10102: Add JapaneseCompletionFilter for Input Method-aware auto-completion#297

Merged
mocobeta merged 22 commits into
apache:mainfrom
mocobeta:japanese-completion-filter
Sep 17, 2021
Merged

LUCENE-10102: Add JapaneseCompletionFilter for Input Method-aware auto-completion#297
mocobeta merged 22 commits into
apache:mainfrom
mocobeta:japanese-completion-filter

Conversation

@mocobeta
Copy link
Copy Markdown
Contributor

@mocobeta mocobeta commented Sep 14, 2021

This adds a new token filter - JapaneseCompletionFilter that is supposed to work with AnalyzingSuggester for performing Japanese auto-completion.

For more detailed descriptions, please see https://issues.apache.org/jira/browse/LUCENE-10102.

It's a draft PR because a correct offset calculation has to be implemented. Other than that, I hope this is ready to be reviewed.

mocobeta and others added 4 commits September 13, 2021 14:34
Tests would fail because not all state would be cleared in reset(). So,
reset() is implemented and it resets all state, and the tests seem happy
now?
@rmuir
Copy link
Copy Markdown
Member

rmuir commented Sep 14, 2021

@mocobeta I think i got the tests happy (at least for me), I pushed a commit.

@mocobeta
Copy link
Copy Markdown
Contributor Author

@rmuir Ah, thank you for fixing it!

@mocobeta mocobeta changed the title LUCENE-10102: Add JapaneseCompletionFilter for Input LUCENE-10102: Add JapaneseCompletionFilter for Input Method-aware auto-completion Sep 15, 2021
@mocobeta
Copy link
Copy Markdown
Contributor Author

mocobeta commented Sep 15, 2021

d6c014c seems to be buggy (when mode=QUERY). I'll try to debug the endOffset bug.

 ./gradlew -p lucene/analysis/kuromoji/ test -Ptests.seed=A0438AE4577EF19 \\
--tests "org.apache.lucene.analysis.ja.TestJapaneseCompletionFilter.testRandomStrings"

org.apache.lucene.analysis.ja.TestJapaneseCompletionFilter > test suite's output saved to /mnt/hdd/repo/lucene/lucene/analysis/kuromoji/build/test-results/test/outputs/OUTPUT-org.apache.lucene.analysis.ja.TestJapaneseCompletionFilter.txt, copied below:
  2> TEST FAIL: useCharFilter=true text='\u93e1 ill lsg wwbznzp \"'
   >     java.lang.AssertionError: inconsistent endOffset 2 pos=1 posLen=1 token=llsg expected:<5> but was:<9>
   >         at __randomizedtesting.SeedInfo.seed([A0438AE4577EF19:828D3810E673B82C]:0)

This was caused by a bad posLength modification... I fixed it.

@mocobeta mocobeta marked this pull request as ready for review September 15, 2021 11:30
@mocobeta
Copy link
Copy Markdown
Contributor Author

I think offsets are now corrected, and randomized tests are happy.
Javadocs updated so that it includes the usage and limitations of the new filter.

protected TokenStream normalize(String fieldName, TokenStream in) {
TokenStream result = new LowerCaseFilter(in);
return result;
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the LowerCaseFilter do here? Should we just return in, given that this filter is not really used in createComponents?

Copy link
Copy Markdown
Contributor Author

@mocobeta mocobeta Sep 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied this method from JapaneseAnalyzer without checking the usage... removed it.
Meanwhile, LowerCaseFilter is usually a desirable (but not strictly required I think) component so I included it within createComponents. mocobeta@7e2153f

@@ -0,0 +1,335 @@
ア,a
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a line of logic to KatakanaRomanizer to discard lines beginning with #? It would allow to put some comments in this file, which might help keep it tidy.

maybe something like:

# mapping of kana to list of acceptable romanizations
# longest-match is used to find entries in this list
# covers romanization systems: X, Y, Z

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it was also in my mind but I held it off; i added a minimum description: mocobeta@c1c6e86

import org.apache.lucene.util.CharsRef;

/** Utility functions for {@link org.apache.lucene.analysis.ja.JapaneseCompletionFilter} */
public class StringUtils {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some methods in this file take String, others take CharsRef. But for the most part, the methods here only need charAt and length. Should we change signatures to use CharSequence, which is implemented by both String and CharsRef?

P.S. I think it could also reduce some copying/conversion if we look into this elsewhere in the logic of this filter too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I changed the signatures; now we have CharSequenceUtils instead of StringUtils.

P.S. I think it could also reduce some copying/conversion if we look into this elsewhere in the logic of this filter too.

I looked at the romanizer and token filter, and found it was possible to reduce some array copies by replacing String/CharsRef concatenations with CharsRefBuilder.append(). I think this has improved a bit.

@rmuir
Copy link
Copy Markdown
Member

rmuir commented Sep 15, 2021

@mocobeta This is a great contribution! I added a couple suggestions.

ッシュ,ssyu,sshu
ッショ,ssyo,ssho
ッザ,zza
ッジ,zzi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add jji? I usually use jajji for ジャッジ. What do you think?

Copy link
Copy Markdown
Contributor Author

@mocobeta mocobeta Sep 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johtani yes, of course. Thanks for reviewing this. mocobeta@fc90e45

@mocobeta mocobeta merged commit 4e86df9 into apache:main Sep 17, 2021
@mocobeta mocobeta deleted the japanese-completion-filter branch September 17, 2021 13:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants