Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji #12885
This is really interesting. It looks like the filter logic is already trying to convert to katakana before converting to romaji. Specifically in https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java#L49-L63, my take is:
So, I'm wondering if maybe there's a bug in (This is my first time reading this code, backed by my very limited, Duolingo-based knowledge of Japanese, so I might be wrong.)
@msfroh
I found that one of the implementations of getReading, UnknownMorphData.getReading(), always returns null.
…treat it as reading.
The modification within the getRomanization function has been dropped.
  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      String reading = readingAttr.getReading();
      if (reading == null & isAllHiragana(termAttr)) {
Maybe use && instead?
Thank you. I fixed it.
2967e38
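For readers unfamiliar with the distinction raised in this thread: on boolean operands, Java's `&` evaluates both sides, while `&&` skips the right-hand side when the left is already false. The sketch below is only an illustration; `isAllHiraganaStub` is a hypothetical stand-in for the PR's `isAllHiragana` that counts how often it is called, and none of this is Lucene code.

```java
// Sketch only: isAllHiraganaStub is a hypothetical stand-in, not Lucene code.
final class ShortCircuitSketch {
  static int calls = 0;

  // Pretends to inspect the term; counts each invocation.
  static boolean isAllHiraganaStub(String term) {
    calls++;
    return true;
  }

  // Evaluates the condition once and returns how many times the stub ran.
  static int callsFor(String reading, String term, boolean shortCircuit) {
    calls = 0;
    boolean unused =
        shortCircuit
            ? (reading == null && isAllHiraganaStub(term))
            : (reading == null & isAllHiraganaStub(term));
    return calls;
  }

  public static void main(String[] args) {
    System.out.println(callsFor("グ", "ぐ", false)); // condition written with &
    System.out.println(callsFor("グ", "ぐ", true)); // condition written with &&
  }
}
```

When a reading is already present, the `&&` form avoids scanning the term at all, which is presumably why the reviewer suggested it.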
  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      String reading = readingAttr.getReading();
      if (reading == null & isAllHiragana(termAttr)) {
What if we have text that mixes hiragana and katakana? I feel like we would still need to convert the hiragana part to katakana.
Fixed to support partial hiragana.
2967e38
However, it is difficult to write a test for OOV terms that mix hiragana and katakana, because the tokenizer basically seems to split OOV terms wherever the character type changes.
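A minimal sketch of what "partial hiragana" support could look like, assuming the conversion is done character by character. This is not the code merged in the PR; the class and method names are hypothetical, and the constant names simply follow the review suggestion elsewhere in this conversation. It relies on the fact that the katakana block sits 0x60 code points above the hiragana block in Unicode.

```java
// Sketch only: names are hypothetical and this is not the PR's implementation.
final class PartialHiraganaSketch {
  private static final char HIRAGANA_START = 0x3041; // ぁ
  private static final char HIRAGANA_END = 0x3096; // ゖ
  private static final int KATAKANA_OFFSET = 0x30A1 - 0x3041; // ァ - ぁ = 0x60

  static boolean isHiragana(char ch) {
    return ch >= HIRAGANA_START && ch <= HIRAGANA_END;
  }

  // Shifts every hiragana character to katakana; all other characters pass through.
  static String toKatakana(CharSequence s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char ch = s.charAt(i);
      sb.append(isHiragana(ch) ? (char) (ch + KATAKANA_OFFSET) : ch);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(toKatakana("ますきんぐ"));
    System.out.println(toKatakana("マスきんぐ")); // mixed input
  }
}
```

Because non-hiragana characters are copied unchanged, a term that already contains katakana comes out fully in katakana without needing an all-hiragana check first.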
@@ -43,10 +43,38 @@ public JapaneseReadingFormFilter(TokenStream input) {
    this(input, false);
  }

  private boolean isHiragana(char ch) {
    return ch >= 0x3041 && ch <= 0x3096;
I think these can be extracted to constants for readability, e.g. HIRAGANA_START and HIRAGANA_END.
Thank you. I fixed it.
9773668
    int len = s.length();
    for (int i = 0; i < len; i++) {
I think we can just use i < s.length(). It's also computed only once, and it doesn't need an additional temporary variable.
Thank you. I fixed it.
9ad2cbe
@@ -88,6 +88,11 @@ protected TokenStreamComponents createComponents(String fieldName) {
    a.close();
  }

  public void testKatakanaReadingsHiragana() throws IOException {
    assertAnalyzesTo(
        katakanaAnalyzer, "が ぎ ぐ げ ご ぁ ゔ", new String[] {"ガ", "ギ", "グ", "ゲ", "ゴ", "ァ", "ヴ"});
I'm wondering what would happen if these characters are in decomposed form. It seems like the か part would be converted to カ but the dakuten would not. Maybe we could add a test to see what it looks like?
Checking the getRomanization function in ToStringUtil.java, it does not seem to support decomposed dakuten even in katakana. I am sorry, but could we handle this in a separate issue?
https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/ToStringUtil.java#L246
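To make the decomposed-form question concrete, here is a small sketch in plain Java with no Lucene classes. It shifts hiragana to katakana by the fixed Unicode offset and shows that a decomposed か + U+3099 (combining dakuten) keeps its combining mark as a separate character, whereas composing to NFC first with `java.text.Normalizer` yields a single precomposed character. Normalizing first is only one possible approach and is not something this PR does; the class and method names are hypothetical.

```java
import java.text.Normalizer;

// Sketch only: illustrates the decomposed-dakuten case; not part of the PR.
final class DecomposedKanaSketch {
  // Shifts hiragana (U+3041..U+3096) to katakana; other characters pass through.
  static String shiftHiragana(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char ch = s.charAt(i);
      sb.append(ch >= 0x3041 && ch <= 0x3096 ? (char) (ch + 0x60) : ch);
    }
    return sb.toString();
  }

  // Composes base kana and combining marks (NFC) before shifting.
  static String composeThenShift(String s) {
    return shiftHiragana(Normalizer.normalize(s, Normalizer.Form.NFC));
  }

  public static void main(String[] args) {
    String decomposed = "\u304B\u3099"; // か followed by combining dakuten
    System.out.println(shiftHiragana(decomposed).length()); // still two chars
    System.out.println(composeThenShift(decomposed).length()); // one char after NFC
  }
}
```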
LGTM. I'm not an expert on Japanese analysis, but this seems right to me.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
Thank you, bot. I obviously forgot to merge this one. @kuramitsu could you please add a CHANGES.txt entry under Lucene 9.10?
Thank you. I added it.
Merged and backported, thanks for the contribution @kuramitsu!
…iragana to romaji (apache#12885)
Description
I found a bug in JapaneseReadingFormFilter: some hiragana are not converted to romaji.
(For example, "ぐ" does not become "gu". I noticed this because "マスキング" did not get any hits when searching for "ますきんぐ".)
I believe this happens because some hiragana have no reading explicitly defined in the kuromoji dictionary.
Draft
For hiragana that is an OOV term, how about treating its katakana conversion as the reading?