
Add new token filters for Japanese sutegana (捨て仮名) #12915

Merged
merged 14 commits into from Mar 18, 2024

Conversation

daixque
Contributor

@daixque daixque commented Dec 11, 2023

Description

Sutegana (捨て仮名) are the small variants of hiragana and katakana letters in Japanese. In old Japanese text, sutegana were not used, unlike in modern text. For example:

  • "ストップウォッチ" is written as "ストツプウオツチ"
  • "ちょっとまって" is written as "ちよつとまつて"

So it is meaningful to normalize sutegana to their normal (uppercase) counterparts when searching against a corpus that includes old Japanese text, such as patents, legal documents, contract policies, etc.

This pull request introduces 2 token filters:

  • JapaneseHiraganaUppercaseFilter for hiragana
  • JapaneseKatakanaUppercaseFilter for katakana

so that users can use either one separately. Each filter converts all sutegana (small characters) into normal kana (uppercase characters) to normalize the token.

Why it is needed

This transformation must be done as a token filter. MappingCharFilter already exists, but applying a character filter to normalize sutegana would affect tokenization, which is not expected.
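As an illustration, the per-character mapping these filters perform can be sketched as a plain Java method. This is a minimal, self-contained sketch with only a handful of hypothetical mappings; the real filters cover all sutegana:

```java
import java.util.Map;

public class SuteganaSketch {
  // Illustrative partial mapping from small kana to their normal counterparts.
  private static final Map<Character, Character> LETTER_MAPPINGS =
      Map.of('ッ', 'ツ', 'ォ', 'オ', 'っ', 'つ', 'ょ', 'よ');

  static String normalize(String term) {
    char[] buf = term.toCharArray();
    for (int i = 0; i < buf.length; i++) {
      Character c = LETTER_MAPPINGS.get(buf[i]);
      if (c != null) {
        buf[i] = c;
      }
    }
    return new String(buf);
  }

  public static void main(String[] args) {
    System.out.println(normalize("ストップウォッチ")); // → ストツプウオツチ
    System.out.println(normalize("ちょっとまって"));   // → ちよつとまつて
  }
}
```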

Member

@mikemccand mikemccand left a comment

This looks great! Thanks @daixque. I left some small comments.

I do not know Japanese myself, so I'll leave this open for a few days to give others a chance to review. After that I'll merge (lazy consensus).

@@ -0,0 +1,65 @@
package org.apache.lucene.analysis.ja;
Member

Could you please add the standard Apache copyright header, if that's OK with you? Thanks! I think this will also make the GitHub actions checks (./gradlew check) happy.

Contributor Author

I'm happy to do, thanks!

String term = termAttr.toString();
// Small letter "ㇷ゚" is not a single character, so it should be converted to "プ" as a String
term = term.replace("ㇷ゚", "プ");
char[] src = term.toCharArray();
Member

You could instead call term.buffer() to access the source char[] and save creating a few temporary objects.

Contributor Author

Thanks, but that would affect the length of the resulting character array and break the tests, so let me keep the current implementation.

Here is an example of the test failure:

term 0 expected:<ちよつと[]> but was:<ちよつと[sTerm�������]>
Expected :ちよつと
Actual   :ちよつとsTerm�������

Contributor

buffer() returns the internal char[] of the CharTermAttribute, which might have more chars than the actual term length. You need to use termAttr.length() as well.

}
}
String resultTerm = String.copyValueOf(result);
termAttr.setEmpty().append(resultTerm);
Member

You can avoid making String here by appending the char[] result instead.

Contributor Author

I couldn't find an append method signature that accepts char[]. (There is one for CharSequence instead.)

Contributor

It seems you can modify the CharTermAttribute directly by accessing buffer(), which returns the internal buffer.

char[] buffer = termAttr.buffer();
buffer[i] = LETTER_MAPPINGS.get(buffer[i]);

This would eliminate all of the char copying. I don't know if we are supposed to do that (but the API allows it). Maybe @mikemccand has some thoughts here.

Member

This would eliminate all of the char copying. I don't know if we are supposed to do that (but the API allows it). Maybe @mikemccand has some thoughts here.

This is indeed the intended usage for high performance -- directly alter that underlying char[] buffer, asking the term att to grow if needed, and setting the length when you are done.
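The in-place pattern described here, together with the earlier caveat that the internal buffer may be longer than the term, can be sketched without Lucene. A plain char[] and a logical length stand in for CharTermAttribute.buffer() and CharTermAttribute.length(); class name and mappings are illustrative assumptions:

```java
import java.util.Map;

public class BufferSketch {
  // Illustrative partial sutegana mapping.
  private static final Map<Character, Character> LETTER_MAPPINGS =
      Map.of('ょ', 'よ', 'っ', 'つ');

  // Mutates the buffer in place, touching only the first `length` chars,
  // just as a TokenFilter would mutate termAttr.buffer() up to termAttr.length().
  static void normalizeInPlace(char[] buffer, int length) {
    for (int i = 0; i < length; i++) {
      Character c = LETTER_MAPPINGS.get(buffer[i]);
      if (c != null) {
        buffer[i] = c;
      }
    }
  }

  public static void main(String[] args) {
    // Oversized buffer with stale data past the logical length,
    // mimicking an internal term buffer that is larger than the term.
    char[] buffer = new char[16];
    "ちょっと".getChars(0, 4, buffer, 0);
    buffer[4] = 'x'; // stale char past length; must not leak into the result
    normalizeInPlace(buffer, 4);
    System.out.println(new String(buffer, 0, 4)); // → ちよつと
  }
}
```

Because only the region up to the logical length is read back, the stale contents past it never appear in the result — the bug the failing test above exposed.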

@kojisekig
Member

From a Japanese perspective, the necessity sounds reasonable. Thank you for the contribution!

@daixque
Contributor Author

daixque commented Dec 12, 2023

Hi @mikemccand and @kojisekig, thank you for your reviews.
I updated the code according to the comments and added lines to module-info and resources to make gradle check green.

Comment on lines 63 to 64
String term = termAttr.toString();
char[] src = term.toCharArray();
Contributor

@dungba88 dungba88 Dec 12, 2023

I think you can iterate through the term attribute directly. These methods require char copies, so they might be inefficient:

for (int i = 0; i < termAttr.length(); i++) {
    char c = termAttr.charAt(i);
    // ...
}

Contributor Author

Thanks, let me do that.

* legal, contract policies, etc.
*/
public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
private static final Map<Character, Character> s2l;
Contributor

I think the field should be all-uppercase, as it's a constant?

Contributor

Also, s2l is a bit cryptic; maybe we could use LETTER_MAPPINGS or something.

Contributor Author

Thanks, let me do that.

for (int i = 0; i < src.length; i++) {
Character c = s2l.get(src[i]);
if (c != null) {
result[i] = c;
Contributor

It seems all small characters are just 1 position ahead of the normal characters, so you can use result[i] = (char) (src[i] + 1);, and you can use a Set instead of a Map: https://en.wikipedia.org/wiki/Hiragana_(Unicode_block)

Contributor Author

@daixque daixque Dec 12, 2023

It seems all small characters are just 1 position ahead of the normal characters

That's not correct in all cases; see ゕ (U+3095) vs か (U+304B), for example.

Contributor

I see, that makes sense. Thank you

* <p>This filter is useful if you want to search against old style Japanese text such as patents,
* legal, contract policies, etc.
*/
public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {
Contributor

This seems to be mostly the same as the other filter, so maybe we can combine them?

E.g. you could pass the mapping as a constructor parameter and provide 2 constant mappings.

Contributor Author

@daixque daixque Dec 12, 2023

@dungba88 What should the constructor look like?

Like this?

public JapaneseKanaUppercaseFilter(TokenStream input, boolean hiragana, boolean katakana)

Note that katakana has an exceptional character, ㇷ゚, so the logic is slightly different from hiragana.

Contributor

You are right, maybe we can consolidate them with a base class as a follow-up. This LGTM.

@dungba88
Contributor

Besides the optimization of manipulating the internal char[] directly, I think this is good to go.

char[] result = new char[src.length];
for (int i = 0; i < src.length; i++) {
Character c = s2l.get(src[i]);
char[] result = new char[termAttr.length()];
Member

I think this could instead be something like:

    char[] termBuffer = termAttr.buffer();
    int termLength = termAttr.length();
    for(int i=0; i<termLength; i++) {
        Character c = LETTER_MAPPINGS.get(termBuffer[i]);
        if (c != null) {
          termBuffer[i] = c;
        }
    }

    return true;

I.e. you can just directly manipulate the underlying buffer.

But really this is all just optimizing -- not urgent for the first commit of this awesome contribution. It can be done in follow-on PRs.

Contributor Author

@mikemccand @dungba88 Yeah, thanks for your suggestions. I've applied them, so please check.

Contributor

@dungba88 dungba88 left a comment

Thank you for the change and optimization. LGTM!

@daixque
Contributor Author

daixque commented Dec 16, 2023

I refactored to apply the same kind of enhancement to the Katakana filter as well.

@mikemccand
Member

Looks great @daixque -- would you like to add a lucene/CHANGES.txt entry describing this awesome new capability? Be sure to put it under the 9.10.0 section since we can backport this change (it is not a 10.0.0-only feature).

@daixque
Contributor Author

daixque commented Dec 18, 2023

Looks great @daixque -- would you like to add a lucene/CHANGES.txt entry describing this awesome new capability? Be sure to put it under the 9.10.0 section since we can backport this change (it is not a 10.0.0-only feature).

@mikemccand This is done. Thanks!


github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 8, 2024
@daixque
Contributor Author

daixque commented Jan 9, 2024

@mikemccand @dungba88 Let me ping you. Is there anything left for me to do on this PR? If not, could you merge it, or let me know when it will be merged?

@dungba88
Contributor

dungba88 commented Jan 9, 2024

I think it's good to go, but I don't have merge permission. Mike should be able to help you; otherwise you can try notifying the dev mailing list as suggested by the bot.

@github-actions github-actions bot removed the Stale label Jan 10, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 24, 2024
Comment on lines 182 to 184
* GITHUB#12915: Add new token filters for Japanese sutegana (捨て仮名). This introduces JapaneseHiraganaUppercaseFilter
and JapaneseKatakanaUppercaseFilter. (Dai Sugimori)

Member

We have since released 9.10. Could you add your changes to 9.11 and remove from 9.10?

Contributor Author

Thanks @benwtrent, this is done.

@benwtrent benwtrent merged commit d393b9d into apache:main Mar 18, 2024
3 checks passed
benwtrent pushed a commit that referenced this pull request Mar 18, 2024
benwtrent added a commit to elastic/elasticsearch that referenced this pull request Mar 21, 2024
… in kuromoji analysis plugin (#106553)

This adds support for the `hiragana_uppercase` and `katakana_uppercase` token filters provided in the new Lucene release.
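For illustration, these filters could be wired into an index analyzer roughly like this (index and analyzer names are hypothetical; assumes the analysis-kuromoji plugin is installed):

```json
PUT /old_japanese_docs
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ja_old_text": {
          "type": "custom",
          "tokenizer": "kuromoji_tokenizer",
          "filter": ["hiragana_uppercase", "katakana_uppercase"]
        }
      }
    }
  }
}
```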


Related to: apache/lucene#12915