LUCENE-10102: Add JapaneseCompletionFilter for Input Method-aware auto-completion by mocobeta · Pull Request #297 · apache/lucene

mocobeta · 2021-09-14T07:44:26Z

This adds a new token filter, JapaneseCompletionFilter, which is intended to work with AnalyzingSuggester for Japanese auto-completion.

For more detailed descriptions, please see https://issues.apache.org/jira/browse/LUCENE-10102.

It's a draft PR because correct offset calculation still has to be implemented. Other than that, I hope this is ready for review.

Tests would fail because not all state would be cleared in reset(). So, reset() is implemented and it resets all state, and the tests seem happy now?
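
To illustrate the reset() remark, here is a minimal sketch in plain Java of why a reusable, stateful stage must clear all of its per-stream state. It does not use Lucene and is not the code from this PR: the class BufferingStageSketch, its fields, and its methods are hypothetical stand-ins for a token filter that buffers pending output between calls.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for a reusable, stateful token filter (not Lucene code). */
class BufferingStageSketch {
  private Iterator<String> input;
  private final Deque<String> pending = new ArrayDeque<>(); // buffered outputs
  private boolean exhausted; // true once the input has been fully consumed

  void setInput(List<String> tokens) {
    this.input = tokens.iterator();
  }

  /** Clears every piece of per-stream state so the instance can be reused. */
  void reset() {
    pending.clear();
    exhausted = false;
  }

  /** Returns the next output, or null at the end. Each input yields itself plus an upper-cased variant. */
  String next() {
    if (pending.isEmpty() && !exhausted) {
      if (input != null && input.hasNext()) {
        String token = input.next();
        pending.add(token);
        pending.add(token.toUpperCase(Locale.ROOT));
      } else {
        exhausted = true;
      }
    }
    return pending.poll();
  }

  public static void main(String[] args) {
    BufferingStageSketch stage = new BufferingStageSketch();
    stage.setInput(Arrays.asList("ab", "cd"));
    stage.reset();
    stage.next(); // consume only part of the first stream, leaving a buffered variant behind
    stage.setInput(Arrays.asList("x"));
    stage.reset(); // without clearing 'pending', the stale variant would leak into this stream
    System.out.println(stage.next());
  }
}
```

If reset() forgot to clear the pending buffer or the exhausted flag, the second stream would start with leftovers from the first, which is the kind of inconsistency randomized tests that reuse an analyzer tend to expose.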

rmuir · 2021-09-14T18:06:31Z

@mocobeta I think I got the tests passing (at least for me); I pushed a commit.

mocobeta · 2021-09-15T01:45:50Z

@rmuir Ah, thank you for fixing it!

mocobeta · 2021-09-15T05:11:01Z

d6c014c seems to be buggy (when mode=QUERY). I'll try to debug the endOffset bug.

./gradlew -p lucene/analysis/kuromoji/ test -Ptests.seed=A0438AE4577EF19 \
  --tests "org.apache.lucene.analysis.ja.TestJapaneseCompletionFilter.testRandomStrings"

org.apache.lucene.analysis.ja.TestJapaneseCompletionFilter > test suite's output saved to /mnt/hdd/repo/lucene/lucene/analysis/kuromoji/build/test-results/test/outputs/OUTPUT-org.apache.lucene.analysis.ja.TestJapaneseCompletionFilter.txt, copied below:
  2> TEST FAIL: useCharFilter=true text='\u93e1 ill lsg wwbznzp \"'
   >     java.lang.AssertionError: inconsistent endOffset 2 pos=1 posLen=1 token=llsg expected:<5> but was:<9>
   >         at __randomizedtesting.SeedInfo.seed([A0438AE4577EF19:828D3810E673B82C]:0)

This was caused by a bad posLength modification... I fixed it.

This reverts commit 5cb09be.

mocobeta · 2021-09-15T11:44:45Z

I think the offsets are now correct, and the randomized tests pass.
I also updated the Javadocs to cover the usage and limitations of the new filter.

rmuir · 2021-09-15T11:41:28Z

+  protected TokenStream normalize(String fieldName, TokenStream in) {
+    TokenStream result = new LowerCaseFilter(in);
+    return result;
+  }


What does the LowerCaseFilter do here? Should we just return in, given that this filter is not really used in createComponents?

mocobeta · 2021-09-16

I copied this method from JapaneseAnalyzer without checking its usage, so I removed it.
That said, LowerCaseFilter is usually a desirable (though, I think, not strictly required) component, so I included it in createComponents: mocobeta@7e2153f

rmuir · 2021-09-15T11:48:02Z

@@ -0,0 +1,335 @@
+ア,a


Should we add a line of logic to KatakanaRomanizer to discard lines beginning with #? That would let us put some comments in this file, which might help keep it tidy.

Maybe something like:

# mapping of kana to list of acceptable romanizations
# longest-match is used to find entries in this list
# covers romanization systems: X, Y, Z

mocobeta · 2021-09-16

Yes, I had that in mind too but held off on it; I added a minimal description: mocobeta@c1c6e86
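
As a rough illustration of the suggestion above, the following sketch parses a comma-separated mapping (kana first, then one or more romanizations) and skips blank lines and lines beginning with #. It is not the actual KatakanaRomanizer code: the class RomajiMapLoaderSketch and its parse method are hypothetical, and the kana are written as Unicode escapes only to keep the source ASCII-safe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical loader for a "kana,roma1,roma2,..." mapping (not the PR's code). */
class RomajiMapLoaderSketch {

  /** Parses the mapping text, discarding blank lines and lines that begin with '#'. */
  static Map<String, List<String>> parse(String text) {
    Map<String, List<String>> map = new LinkedHashMap<>();
    for (String line : text.split("\\R")) {
      if (line.trim().isEmpty() || line.startsWith("#")) {
        continue; // blank line or comment
      }
      String[] cols = line.split(",");
      if (cols.length < 2) {
        throw new IllegalArgumentException("malformed mapping line: " + line);
      }
      // first column is the kana key, the rest are its acceptable romanizations
      map.put(cols[0], new ArrayList<>(Arrays.asList(cols).subList(1, cols.length)));
    }
    return map;
  }

  public static void main(String[] args) {
    String text =
        "# mapping of kana to list of acceptable romanizations\n"
            + "\u30A2,a\n" // katakana A
            + "\u30C3\u30B8,zzi,jji\n"; // small TSU + ZI
    System.out.println(parse(text));
  }
}
```

Treating only lines that start with # as comments keeps the rule simple and leaves # free to appear inside a data line, should that ever be needed.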

rmuir · 2021-09-15T11:54:09Z

+import org.apache.lucene.util.CharsRef;
+
+/** Utility functions for {@link org.apache.lucene.analysis.ja.JapaneseCompletionFilter} */
+public class StringUtils {


Some methods in this file take String, others take CharsRef. But for the most part, the methods here only need charAt and length. Should we change signatures to use CharSequence, which is implemented by both String and CharsRef?

P.S. I think it could also reduce some copying/conversion if we look into this elsewhere in the logic of this filter too.

https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/CharSequence.html

mocobeta · 2021-09-17

Thanks, I changed the signatures; we now have CharSequenceUtils instead of StringUtils.

> P.S. I think it could also reduce some copying/conversion if we look into this elsewhere in the logic of this filter too.

I looked at the romanizer and the token filter, and found that some array copies could be avoided by replacing String/CharsRef concatenations with CharsRefBuilder.append(). I think this improves things a bit.
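
A small sketch of the idea discussed in this thread, using only the JDK: helpers that accept CharSequence work unchanged for String, StringBuilder, and any other implementation (the review notes that CharsRef is one), and appending into a builder avoids creating intermediate strings. The class and method names below are hypothetical and are not the PR's CharSequenceUtils; StringBuilder stands in for Lucene's CharsRefBuilder.

```java
import java.nio.CharBuffer;

/** Hypothetical helpers showing CharSequence-based signatures (not the PR's code). */
class CharSequenceUtilsSketch {

  /** True if s is non-empty and every char lies in the Katakana block (U+30A0 to U+30FF). */
  static boolean isKatakana(CharSequence s) {
    if (s.length() == 0) {
      return false;
    }
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c < '\u30A0' || c > '\u30FF') {
        return false;
      }
    }
    return true;
  }

  /** True if s is non-empty and consists only of ASCII lowercase letters. */
  static boolean isLowercaseAlphabets(CharSequence s) {
    if (s.length() == 0) {
      return false;
    }
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c < 'a' || c > 'z') {
        return false;
      }
    }
    return true;
  }

  /** Appends each part to 'out' in place instead of building intermediate Strings. */
  static StringBuilder appendAll(StringBuilder out, CharSequence... parts) {
    for (CharSequence part : parts) {
      out.append(part);
    }
    return out;
  }

  public static void main(String[] args) {
    String katakana = "\u30B5\u30C3\u30AB\u30FC"; // "sakka-" written in katakana
    System.out.println(isKatakana(katakana)); // String
    System.out.println(isKatakana(new StringBuilder(katakana))); // StringBuilder
    System.out.println(isKatakana(CharBuffer.wrap(katakana))); // CharBuffer
    System.out.println(appendAll(new StringBuilder("ja"), "jji"));
  }
}
```

Because the helpers only need charAt and length, callers never have to convert their buffers to String just to run a check.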

rmuir · 2021-09-15T12:02:16Z

@mocobeta This is a great contribution! I added a couple suggestions.

johtani · 2021-09-17T05:45:21Z

+ッシュ,ssyu,sshu
+ッショ,ssyo,ssho
+ッザ,zza
+ッジ,zzi


Should we add jji? I usually use jajji for ジャッジ. What do you think?

mocobeta · 2021-09-17

@johtani Yes, of course. Thanks for reviewing this. mocobeta@fc90e45
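
To show how an entry such as jji leads to jajji for ジャッジ, here is a minimal longest-match sketch in plain Java. It is an assumption-laden illustration, not the PR's KatakanaRomanizer: the class name, the tiny table, and the entries for ジャ and ジ are hypothetical, and only the ッジ entry (zzi, jji) comes from this thread. Kana are written as Unicode escapes to keep the source ASCII-safe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical longest-match romanizer over a tiny table (not the PR's code). */
class LongestMatchRomanizerSketch {
  private static final Map<String, List<String>> TABLE = new HashMap<>();
  private static final int MAX_KEY_LENGTH = 2; // longest kana key in TABLE

  static {
    TABLE.put("\u30B8\u30E3", Arrays.asList("zya", "ja")); // ZI + small YA (assumed entry)
    TABLE.put("\u30C3\u30B8", Arrays.asList("zzi", "jji")); // small TSU + ZI (from this thread)
    TABLE.put("\u30B8", Arrays.asList("zi", "ji")); // ZI (assumed entry)
  }

  /** Returns every romanization obtained by greedily matching the longest table key at each position. */
  static List<String> romanize(CharSequence kana) {
    List<String> results = new ArrayList<>();
    results.add("");
    int pos = 0;
    while (pos < kana.length()) {
      List<String> candidates = null;
      int matched = 0;
      // try the longest possible key first, then shorter ones
      for (int len = Math.min(MAX_KEY_LENGTH, kana.length() - pos); len >= 1; len--) {
        candidates = TABLE.get(kana.subSequence(pos, pos + len).toString());
        if (candidates != null) {
          matched = len;
          break;
        }
      }
      if (candidates == null) {
        // no entry: pass the character through unchanged
        candidates = Collections.singletonList(String.valueOf(kana.charAt(pos)));
        matched = 1;
      }
      // extend every partial result with every candidate for this segment
      List<String> next = new ArrayList<>();
      for (String prefix : results) {
        for (String candidate : candidates) {
          next.add(prefix + candidate);
        }
      }
      results = next;
      pos += matched;
    }
    return results;
  }

  public static void main(String[] args) {
    // ZI + small YA + small TSU + ZI, i.e. the word discussed above
    System.out.println(romanize("\u30B8\u30E3\u30C3\u30B8"));
  }
}
```

With this table the input is split into ジャ and ッジ, so the candidates are the four combinations of (zya, ja) with (zzi, jji), and jajji is among them. The real filter may combine candidates differently.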

mocobeta and others added 4 commits September 13, 2021 14:34

Add kantakana romenizer

dff3942

Merge branch 'main' into japanese-completion-filter

24b4972

add test cases; fix bugs.

18ec9a8

fix random tests to pass.

e36b3ab

spotless

9a1f86e

mocobeta changed the title ~~LUCENE-10102: Add JapaneseCompletionFilter for Input~~ LUCENE-10102: Add JapaneseCompletionFilter for Input Method-aware auto-completion Sep 15, 2021

mocobeta added 3 commits September 15, 2021 11:07

add tests

a3a92d1

correct start/end offset

d6c014c

correct position length

5cb09be

mocobeta added 4 commits September 15, 2021 19:04

Revert "correct position length"

d46062f

This reverts commit 5cb09be.

fix documentation.

163d933

remove unused import

321212b

remove unused import

f34dbc4

mocobeta marked this pull request as ready for review September 15, 2021 11:30

rmuir requested changes Sep 15, 2021

mocobeta added 7 commits September 16, 2021 14:20

wrap the token stream by LowerCaseFilter

7e2153f

add comments to romanization mapping file.

c1c6e86

Merge branch 'main' into japanese-completion-filter

1396b88

fix grammar err

e57df75

change StringUtils to CharSequenceUtils; reduce char sequence data co…

ea337a2

…pying

change component ordering

f6844bb

okay to use CharsRefBuilder.get() instead of toCharsRef()

49db396

johtani reviewed Sep 17, 2021

mocobeta added 2 commits September 17, 2021 18:23

add entry to romaji map

fc90e45

minor fix on javadoc

a44b2c2

rmuir approved these changes Sep 17, 2021

add changes entry

7c045b6

mocobeta merged commit 4e86df9 into apache:main Sep 17, 2021

mocobeta deleted the japanese-completion-filter branch September 17, 2021 13:37
