Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji #12885

Merged

Conversation

kuramitsu
Copy link
Contributor

@kuramitsu kuramitsu commented Dec 7, 2023

Description

I found a bug in using JapaneseReadingFormFilter that some hiragana are not converted to romaji.
(For example, "ぐ" does not become "gu". I noticed this because "マスキング" did not get any hits when searching for "ますきんぐ".)
I believe this is due to the existence of hiragana whose readings are not explicitly defined in the kuromoji dictionary.

Draft

For hiragana, which is an OOV term, how about treating its conversion to katakana as a reading?

@msfroh
Copy link
Contributor

msfroh commented Dec 7, 2023

This is really interesting. It looks like the filter logic is already trying to convert to katakana before converting to romaji.

Specifically in https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java#L49-L63, my take is:

  1. The reading variable gets populated with (what I understand should be) the katakana reading (which gets returned if useRomaji is false).
  2. Assuming that reading is populated and useRomaji is true, then we convert the katakana to romaji.

So, I'm wondering if maybe there's a bug in JaMorphData.getReading() implementation? It looks like there's already supposed to be a hiragana -> katakana shift here: https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/TokenInfoMorphData.java#L115

(This is my first time reading this code, backed by my very limited, Duolingo-based knowledge of Japanese, so I might be wrong.)

@kuramitsu
Copy link
Contributor Author

@msfroh
Thank you for your confirmation.
As you said, I think it would be better to check and fix the bug in the JaMorphData.getReading() part.
I will try to check and fix it.

@kuramitsu
Copy link
Contributor Author

I found that one of the implementations of getReading, UnknownMorphData.getReading(), always returns null.
Since the problem seems to be that termAttr is used as it is for OOV Term, I will try to insert a process to convert hiragana into katakana.

@kuramitsu
Copy link
Contributor Author

kuramitsu commented Dec 8, 2023

The modification within the getRomanization function has been dropped.
Instead, in the incrementToken function, I added a process to treat the hiragana OOV term converted to kataka as the reading when it appears.
This processing allows for correct handling even when the useRomaji option is false.

@kuramitsu kuramitsu changed the title [Draft] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji Dec 8, 2023
@Override
public boolean incrementToken() throws IOException {
if (input.incrementToken()) {
String reading = readingAttr.getReading();
if (reading == null & isAllHiragana(termAttr)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use && instead?

Copy link
Contributor Author

@kuramitsu kuramitsu Dec 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I fixed it.
2967e38

@Override
public boolean incrementToken() throws IOException {
if (input.incrementToken()) {
String reading = readingAttr.getReading();
if (reading == null & isAllHiragana(termAttr)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we have hiragana and katagana mixed text? I feel like we still need to convert the hiragana part to katagana?

Copy link
Contributor Author

@kuramitsu kuramitsu Dec 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed to support partial hiragana.
2967e38
However, it is difficult to describe a test for OOV terms that contain a mixture of hiragana and katakana, because the system seems to basically split different character types for OOV terms.

@@ -43,10 +43,38 @@ public JapaneseReadingFormFilter(TokenStream input) {
this(input, false);
}

private boolean isHiragana(char ch) {
return ch >= 0x3041 && ch <= 0x3096;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these can be extracted to constants for readability, e.g HIRAGANA_START, HIRAGANA_END

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I fixed it.
9773668

Comment on lines 51 to 52
int len = s.length();
for (int i = 0; i < len; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just use i < s.length(). It's also computed only once, and don't need an additional temporary variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I fixed it.
9ad2cbe

@@ -88,6 +88,11 @@ protected TokenStreamComponents createComponents(String fieldName) {
a.close();
}

public void testKatakanaReadingsHiragana() throws IOException {
assertAnalyzesTo(
katakanaAnalyzer, "が ぎ ぐ げ ご ぁ ゔ", new String[] {"ガ", "ギ", "グ", "ゲ", "ゴ", "ァ", "ヴ"});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering what would happen if these characters are in decomposed form? It seems like the か part will be converted to カ but the dakuten is not. Maybe we could have test to see how it looks like?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking the getRomanization function in ToStringUtil.java, it does not seem to support split muddle even in katakana. I am sorry, but please let this matter be a separate issue.
https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/ToStringUtil.java#L246

Copy link
Contributor

@zhaih zhaih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I'm not professional on JP analyses but this seems right to me.

Copy link

github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 8, 2024
@zhaih zhaih added this to the 9.10.0 milestone Jan 8, 2024
@zhaih
Copy link
Contributor

zhaih commented Jan 8, 2024

Thank you bot I obviously forgot to merge this one. @kuramitsu could you please add an CHANGES.txt entry under Lucene 9.10?

@github-actions github-actions bot removed the Stale label Jan 9, 2024
@kuramitsu
Copy link
Contributor Author

@zhaih

could you please add an CHANGES.txt entry under Lucene 9.10?

Thank you. I added it.

@zhaih zhaih merged commit 8bee418 into apache:main Jan 11, 2024
4 checks passed
zhaih pushed a commit that referenced this pull request Jan 11, 2024
@zhaih
Copy link
Contributor

zhaih commented Jan 11, 2024

Merged and backported, thanks for the contribution @kuramitsu !

slow-J pushed a commit to slow-J/lucene that referenced this pull request Jan 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants