LUCENE-10102: Add JapaneseCompletionFilter for Input Method-aware auto-completion #297
The tests were failing because reset() did not clear all state. reset() is now implemented so that it resets all of it, and the tests seem happy now.
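To illustrate the reset() issue described above, here is a minimal sketch in plain Java without any Lucene dependency. The class and field names are hypothetical and are not taken from the PR; the point is only that every piece of per-stream state has to be cleared, otherwise a reused instance carries leftovers into the next stream.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a stateful token filter; not the PR's actual class. */
public class StatefulFilterSketch {
  // Per-stream state that must not leak into the next stream.
  private final Deque<String> pendingTokens = new ArrayDeque<>();
  private final StringBuilder pendingReading = new StringBuilder();
  private boolean inputFinished = false;

  /** Buffers a token, as a filter might do while looking ahead. */
  public void buffer(String token) {
    pendingTokens.add(token);
    pendingReading.append(token);
  }

  /** Marks the end of the current input. */
  public void finish() {
    inputFinished = true;
  }

  /**
   * Clears every piece of per-stream state so the instance can be reused.
   * In a real Lucene TokenFilter, an overriding reset() must also call super.reset().
   */
  public void reset() {
    pendingTokens.clear();
    pendingReading.setLength(0);
    inputFinished = false;
  }

  /** True if no state from a previous stream is left over. */
  public boolean isClean() {
    return pendingTokens.isEmpty() && pendingReading.length() == 0 && !inputFinished;
  }
}
```

Forgetting any one of the three fields in reset() is the kind of bug that randomized tests tend to surface, because they reuse analyzer instances across many inputs.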
@mocobeta I think I got the tests happy (at least for me); I pushed a commit.
@rmuir Ah, thank you for fixing it!
d6c014c seems to be buggy (when mode=QUERY). I'll try to debug the endOffset bug. This was caused by a bad posLength modification... I fixed it.
I think the offsets are now correct, and the randomized tests are happy.
protected TokenStream normalize(String fieldName, TokenStream in) {
  TokenStream result = new LowerCaseFilter(in);
  return result;
}
What does the LowerCaseFilter do here? Should we just return in unchanged, given that this filter is not really used in createComponents?
I copied this method from JapaneseAnalyzer without checking its usage... I removed it.
Meanwhile, LowerCaseFilter is usually a desirable component (though not strictly required, I think), so I included it in createComponents. mocobeta@7e2153f
@@ -0,0 +1,335 @@
ア,a
Should we add a line of logic to KatakanaRomanizer to discard lines beginning with #? It would allow us to put some comments in this file, which might help keep it tidy.
maybe something like:
# mapping of kana to list of acceptable romanizations
# longest-match is used to find entries in this list
# covers romanization systems: X, Y, Z
Yes, that was on my mind too, but I had held off on it; I added a minimal description: mocobeta@c1c6e86
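As a rough sketch of the comment-skipping idea discussed above, the loader below parses lines of the form kana,romaji1,romaji2,... and ignores blank lines and lines beginning with #. The class name and the parse method are hypothetical; the real KatakanaRomanizer may load its resource differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical loader sketch; not the actual KatakanaRomanizer code. */
public class MappingLoaderSketch {

  /** Parses "kana,romaji1,romaji2,..." lines, skipping blank lines and '#' comments. */
  public static Map<String, List<String>> parse(List<String> lines) {
    Map<String, List<String>> mapping = new LinkedHashMap<>();
    for (String line : lines) {
      if (line.isEmpty() || line.startsWith("#")) {
        continue; // comment or blank line
      }
      String[] cols = line.split(",");
      if (cols.length < 2) {
        throw new IllegalArgumentException("Malformed mapping line: " + line);
      }
      // First column is the kana key; the rest are its acceptable romanizations.
      List<String> romanizations = new ArrayList<>();
      for (int i = 1; i < cols.length; i++) {
        romanizations.add(cols[i]);
      }
      mapping.put(cols[0], romanizations);
    }
    return mapping;
  }
}
```

With this shape, a header comment such as the one proposed above can sit at the top of the mapping file without affecting the parsed entries.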
import org.apache.lucene.util.CharsRef;

/** Utility functions for {@link org.apache.lucene.analysis.ja.JapaneseCompletionFilter} */
public class StringUtils {
Some methods in this file take String, others take CharsRef. But for the most part, the methods here only need charAt and length. Should we change signatures to use CharSequence, which is implemented by both String and CharsRef?
P.S. I think it could also reduce some copying/conversion if we look into this elsewhere in the logic of this filter too.
Thanks, I changed the signatures; now we have CharSequenceUtils instead of StringUtils.
> P.S. I think it could also reduce some copying/conversion if we look into this elsewhere in the logic of this filter too.
I looked at the romanizer and token filter, and found it was possible to reduce some array copies by replacing String/CharsRef concatenations with CharsRefBuilder.append(). I think this has improved things a bit.
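The two changes described in this thread can be sketched in plain Java as follows. This is only an illustration under assumptions: the class and method names are hypothetical, and StringBuilder stands in for Lucene's CharsRefBuilder so that the example needs no Lucene dependency.

```java
/** Hypothetical helpers; not the PR's CharSequenceUtils. */
public class CharSequenceSketch {

  /**
   * True if the sequence is non-empty and every char is in the Katakana block
   * (U+30A0 to U+30FF). Only length() and charAt() are needed, so any
   * CharSequence works without converting it to a String first.
   */
  public static boolean isKatakana(CharSequence s) {
    if (s.length() == 0) {
      return false;
    }
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c < 0x30A0 || c > 0x30FF) {
        return false;
      }
    }
    return true;
  }

  /**
   * Appends all parts into one reusable buffer instead of creating an
   * intermediate String for each concatenation.
   */
  public static StringBuilder appendAll(StringBuilder out, CharSequence... parts) {
    for (CharSequence part : parts) {
      out.append(part);
    }
    return out;
  }
}
```

Because String and StringBuilder both implement CharSequence, callers can pass either one directly; per the review comment above, the same holds for CharsRef.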
@mocobeta This is a great contribution! I added a couple of suggestions.
ッシュ,ssyu,sshu
ッショ,ssyo,ssho
ッザ,zza
ッジ,zzi
Should we add jji? I usually use jajji for ジャッジ. What do you think?
@johtani yes, of course. Thanks for reviewing this. mocobeta@fc90e45
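To show why the extra jji entry matters, here is a small sketch of a romanizer that uses greedy longest-match lookup and expands every acceptable romanization. It assumes the longest-match behavior mentioned in the proposed file comment earlier in this conversation; the class, its methods, and the mapping entries other than ッジ are hypothetical and not taken from the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical longest-match romanizer sketch; not the actual KatakanaRomanizer. */
public class LongestMatchRomanizerSketch {
  private final Map<String, List<String>> mapping = new HashMap<>();
  private int maxKeyLength = 0;

  /** Registers one kana key with its acceptable romanizations. */
  public void add(String kana, String... romanizations) {
    mapping.put(kana, Arrays.asList(romanizations));
    maxKeyLength = Math.max(maxKeyLength, kana.length());
  }

  /** Returns every romanization produced by greedy longest-match segmentation. */
  public List<String> romanize(CharSequence input) {
    List<String> results = new ArrayList<>();
    results.add("");
    int pos = 0;
    while (pos < input.length()) {
      // Try the longest possible key first, then shrink.
      int len = Math.min(maxKeyLength, input.length() - pos);
      List<String> candidates = null;
      for (; len > 0; len--) {
        candidates = mapping.get(input.subSequence(pos, pos + len).toString());
        if (candidates != null) {
          break;
        }
      }
      if (candidates == null) {
        // No entry matched: pass the single character through unchanged.
        candidates = Collections.singletonList(String.valueOf(input.charAt(pos)));
        len = 1;
      }
      // Extend every partial result with every candidate for this segment.
      List<String> next = new ArrayList<>();
      for (String prefix : results) {
        for (String candidate : candidates) {
          next.add(prefix + candidate);
        }
      }
      results = next;
      pos += len;
    }
    return results;
  }
}
```

With the assumed entries ジャ mapped to zya and ja, and ッジ mapped to zzi and jji, the input ジャッジ yields zyazzi, zyajji, jazzi, and jajji. Without the jji entry, the jajji spelling mentioned above would not be produced.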
This adds a new token filter, JapaneseCompletionFilter, which is intended to work with AnalyzingSuggester for Japanese auto-completion.
For more detailed descriptions, please see https://issues.apache.org/jira/browse/LUCENE-10102.
This is a draft PR because correct offset calculation still has to be implemented. Other than that, I hope this is ready for review.