Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji #12885

kuramitsu · 2023-12-07T12:32:38Z

Description

I found a bug in JapaneseReadingFormFilter: some hiragana are not converted to romaji.
(For example, "ぐ" does not become "gu". I noticed this because searching for "ますきんぐ" did not return any hits for "マスキング".)
I believe the cause is that some hiragana have no reading explicitly defined in the kuromoji dictionary.

Draft

For an OOV term written in hiragana, how about treating its katakana conversion as the reading?


msfroh · 2023-12-07T17:53:53Z

This is really interesting. It looks like the filter logic is already trying to convert to katakana before converting to romaji.

Specifically in https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java#L49-L63, my take is:

The reading variable gets populated with (what I understand should be) the katakana reading (which gets returned if useRomaji is false).
Assuming that reading is populated and useRomaji is true, then we convert the katakana to romaji.

So, I'm wondering if maybe there's a bug in the JaMorphData.getReading() implementation? It looks like there's already supposed to be a hiragana -> katakana shift here: https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/TokenInfoMorphData.java#L115

(This is my first time reading this code, backed by my very limited, Duolingo-based knowledge of Japanese, so I might be wrong.)
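The two-step flow described above can be sketched in isolation. This is a minimal stand-in, not the Lucene code: `ReadingFlowSketch`, `romanize`, `render`, and the three-entry table are hypothetical. It assumes that a null reading makes the filter fall back to the raw term text, and that the romanizer copies through any character it does not know.

```java
import java.util.Map;

/** Hypothetical stand-in for the reading -> romaji flow; not the Lucene implementation. */
final class ReadingFlowSketch {
  // Toy table keyed by katakana only: GU (U+30B0), MA (U+30DE), SU (U+30B9).
  private static final Map<Character, String> ROMAJI =
      Map.of('\u30B0', "gu", '\u30DE', "ma", '\u30B9', "su");

  /** Romanizes known katakana; anything else (hiragana included) is copied as-is. */
  static String romanize(CharSequence text) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
      char ch = text.charAt(i);
      String romaji = ROMAJI.get(ch);
      if (romaji != null) {
        out.append(romaji);
      } else {
        out.append(ch);
      }
    }
    return out.toString();
  }

  /** Step 1: use the reading, or the term text if there is none. Step 2: optionally romanize. */
  static String render(String reading, String term, boolean useRomaji) {
    String source = (reading != null) ? reading : term; // assumed fallback for OOV terms
    return useRomaji ? romanize(source) : source;
  }
}
```

Under these assumptions, `render("グ", "ぐ", true)` yields "gu", while `render(null, "ぐ", true)` hands the hiragana straight to the katakana-only romanizer and gets "ぐ" back unchanged, which matches the reported symptom.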

kuramitsu · 2023-12-08T04:21:06Z

@msfroh
Thank you for your confirmation.
As you said, I think it would be better to look for the bug in JaMorphData.getReading() and fix it there.
I will investigate and try to fix it.

kuramitsu · 2023-12-08T06:09:50Z

I found that one of the implementations of getReading, UnknownMorphData.getReading(), always returns null.
Since the problem seems to be that termAttr is used as-is for OOV terms, I will try adding a step that converts hiragana to katakana.


kuramitsu · 2023-12-08T08:03:02Z

I dropped the modification inside the getRomanization function.
Instead, in the incrementToken function, I added a step: when a hiragana OOV term appears, it is converted to katakana and treated as the reading.
This allows correct handling even when the useRomaji option is false.
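A minimal sketch of that fallback, outside of Lucene, might look like the following. The class and method names (`OovReadingFallbackSketch`, `effectiveReading`, `toKatakana`) are hypothetical, and the all-hiragana check mirrors this first version of the change rather than the final merged code. It relies on the fact that the hiragana block U+3041..U+3096 sits exactly 0x60 below the corresponding katakana U+30A1..U+30F6.

```java
/** Hypothetical sketch of the OOV fallback; names are illustrative, not Lucene's. */
final class OovReadingFallbackSketch {
  // Hiragana U+3041..U+3096 maps to katakana U+30A1..U+30F6 by adding 0x60.
  private static final int KANA_SHIFT = 0x60;

  /** True only for non-empty text made up entirely of hiragana letters. */
  static boolean isAllHiragana(CharSequence s) {
    if (s.length() == 0) {
      return false;
    }
    for (int i = 0; i < s.length(); i++) {
      char ch = s.charAt(i);
      if (ch < 0x3041 || ch > 0x3096) {
        return false;
      }
    }
    return true;
  }

  /** Shifts every character up by 0x60; callers must check isAllHiragana first. */
  static String toKatakana(CharSequence hiragana) {
    StringBuilder out = new StringBuilder(hiragana.length());
    for (int i = 0; i < hiragana.length(); i++) {
      out.append((char) (hiragana.charAt(i) + KANA_SHIFT));
    }
    return out.toString();
  }

  /** Keeps a dictionary reading if present; otherwise derives one for all-hiragana terms. */
  static String effectiveReading(String reading, CharSequence term) {
    if (reading == null && isAllHiragana(term)) {
      return toKatakana(term);
    }
    return reading; // may still be null, e.g. for Latin-script OOV terms
  }
}
```

Because the reading is filled in before any romaji decision is made, the same derived katakana serves both output modes, which is why the change also helps when useRomaji is false.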

zhaih · 2023-12-08T20:42:24Z

lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      String reading = readingAttr.getReading();
+      if (reading == null & isAllHiragana(termAttr)) {


Maybe use && instead?

Thank you. I fixed it.
2967e38

zhaih · 2023-12-08T20:44:38Z

lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      String reading = readingAttr.getReading();
+      if (reading == null & isAllHiragana(termAttr)) {


What if we have mixed hiragana and katakana text? I feel like we would still need to convert the hiragana part to katakana.

Fixed to support terms that are only partly hiragana.
2967e38
However, it is difficult to write a test for OOV terms that mix hiragana and katakana, because the tokenizer basically seems to split OOV terms at character-type boundaries.
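Since such a term is hard to produce through the tokenizer, the per-character idea can at least be illustrated as a standalone sketch. The names here (`MixedKanaSketch`, `hiraganaToKatakana`, and the constants) are hypothetical and not taken from the commit; the only fact relied on is the 0x60 offset between the hiragana and katakana blocks.

```java
/** Hypothetical sketch: convert only the hiragana characters of a mixed term. */
final class MixedKanaSketch {
  private static final char HIRAGANA_START = 0x3041;
  private static final char HIRAGANA_END = 0x3096;
  private static final int HIRAGANA_TO_KATAKANA = 0x60;

  static boolean isHiragana(char ch) {
    return ch >= HIRAGANA_START && ch <= HIRAGANA_END;
  }

  /** True if at least one character would be changed by hiraganaToKatakana. */
  static boolean containsHiragana(CharSequence s) {
    for (int i = 0; i < s.length(); i++) {
      if (isHiragana(s.charAt(i))) {
        return true;
      }
    }
    return false;
  }

  /** Shifts hiragana to katakana and copies every other character unchanged. */
  static String hiraganaToKatakana(CharSequence s) {
    StringBuilder out = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char ch = s.charAt(i);
      out.append(isHiragana(ch) ? (char) (ch + HIRAGANA_TO_KATAKANA) : ch);
    }
    return out.toString();
  }
}
```

With this sketch, a mixed input such as "まスきんぐ" becomes "マスキング": the katakana ス is left alone and only the hiragana characters are shifted.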

dungba88 · 2023-12-12T03:37:02Z

lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java

@@ -43,10 +43,38 @@ public JapaneseReadingFormFilter(TokenStream input) {
    this(input, false);
  }

+  private boolean isHiragana(char ch) {
+    return ch >= 0x3041 && ch <= 0x3096;


I think these can be extracted to constants for readability, e.g. HIRAGANA_START and HIRAGANA_END.

Thank you. I fixed it.
9773668

dungba88 · 2023-12-12T03:40:12Z

lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java

+    int len = s.length();
+    for (int i = 0; i < len; i++) {


I think we can just use i < s.length(). The call is cheap, so we don't need an additional temporary variable.

Thank you. I fixed it.
9ad2cbe

dungba88 · 2023-12-12T03:57:52Z

.../analysis/kuromoji/src/test/org/apache/lucene/analysis/ja/TestJapaneseReadingFormFilter.java

@@ -88,6 +88,11 @@ protected TokenStreamComponents createComponents(String fieldName) {
    a.close();
  }

+  public void testKatakanaReadingsHiragana() throws IOException {
+    assertAnalyzesTo(
+        katakanaAnalyzer, "が ぎ ぐ げ ご ぁ ゔ", new String[] {"ガ", "ギ", "グ", "ゲ", "ゴ", "ァ", "ヴ"});


I'm wondering what would happen if these characters are in decomposed form. It seems like the か part would be converted to カ, but the dakuten would not be. Maybe we could add a test to see what the result looks like?

Looking at the getRomanization function in ToStringUtil.java, it does not seem to support a separated (decomposed) dakuten even for katakana. I am sorry, but please let this be handled as a separate issue.
https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/ToStringUtil.java#L246
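To make the decomposed case concrete, here is a small standalone sketch. It is not part of this PR: `DecomposedKanaSketch` and its methods are hypothetical, and the NFC pre-normalization step is only one possible way a follow-up could approach it. It uses the standard `java.text.Normalizer` API and the fact that が (U+304C) decomposes canonically to か (U+304B) followed by the combining dakuten (U+3099).

```java
import java.text.Normalizer;

/** Hypothetical sketch of how a plain code point shift treats decomposed kana. */
final class DecomposedKanaSketch {
  /** Shifts hiragana letters U+3041..U+3096 by 0x60; everything else is copied. */
  static String shiftHiragana(CharSequence s) {
    StringBuilder out = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char ch = s.charAt(i);
      boolean hiragana = ch >= 0x3041 && ch <= 0x3096;
      out.append(hiragana ? (char) (ch + 0x60) : ch);
    }
    return out.toString();
  }

  /** Composes base + combining mark pairs (NFC) before shifting. */
  static String composeThenShift(String s) {
    return shiftHiragana(Normalizer.normalize(s, Normalizer.Form.NFC));
  }
}
```

Applied to the decomposed pair か + U+3099, `shiftHiragana` produces カ followed by the untouched combining dakuten, as suspected in the review comment, while `composeThenShift` produces the single precomposed ガ. How the romanizer then treats either form is outside what this sketch shows.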

zhaih

LGTM. I'm not an expert on Japanese analysis, but this seems right to me.

github-actions · 2024-01-08T12:22:28Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

zhaih · 2024-01-08T18:12:59Z

Thank you, bot. I obviously forgot to merge this one. @kuramitsu could you please add a CHANGES.txt entry under Lucene 9.10?

kuramitsu · 2024-01-11T10:44:07Z

@zhaih

could you please add a CHANGES.txt entry under Lucene 9.10?

Thank you. I added it.


zhaih · 2024-01-11T22:24:48Z

Merged and backported, thanks for the contribution @kuramitsu !


kuramitsu added 2 commits December 7, 2023 21:06

Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji

f5fa85d

format code

a232f28

kuramitsu added 2 commits December 8, 2023 16:18

When a term is OOV and in hiragana, convert the term to katakana and treat it as reading.

509b257

format code

4d9e54f

kuramitsu changed the title ~~[Draft] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji~~ Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji Dec 8, 2023

zhaih reviewed Dec 8, 2023


kuramitsu added 2 commits December 9, 2023 09:38

Support for oov terms with mixed hiragana

2967e38

format code (Support for oov terms with mixed hiragana)

fa07672

dungba88 reviewed Dec 12, 2023


kuramitsu added 2 commits December 13, 2023 21:28

Fixed redundant code.

9ad2cbe

Extract hiragana Unicode range to constants for improved readability.

9773668

dungba88 approved these changes Dec 13, 2023


zhaih approved these changes Dec 14, 2023


github-actions bot added the Stale label Jan 8, 2024

zhaih added this to the 9.10.0 milestone Jan 8, 2024

github-actions bot removed the Stale label Jan 9, 2024

kuramitsu added 2 commits January 11, 2024 18:34

Merge branch 'main' into proposal_JapaneseReadingForm_Unknown_Hiragana

e15a348

add an CHANGES.txt entry

ba2ee5f

zhaih merged commit 8bee418 into apache:main Jan 11, 2024
4 checks passed

zhaih pushed a commit that referenced this pull request Jan 11, 2024

Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji (#12885)

7ad2507

slow-J pushed a commit to slow-J/lucene that referenced this pull request Jan 16, 2024

Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji (apache#12885)

54789f8
