Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new token filters for Japanese sutegana (捨て仮名) #12915

Merged
merged 14 commits into from
Mar 18, 2024
Original file line number Diff line number Diff line change
Expand Up @@ -30,12 +30,12 @@
* legal, contract policies, etc.
*/
public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
private static final Map<Character, Character> s2l;
private static final Map<Character, Character> LETTER_MAPPINGS;

static {
// supported characters are:
// ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ
s2l =
LETTER_MAPPINGS =
Map.ofEntries(
Map.entry('ぁ', 'あ'),
Map.entry('ぃ', 'い'),
Expand All @@ -60,15 +60,13 @@ public JapaneseHiraganaUppercaseFilter(TokenStream input) {
@Override
public boolean incrementToken() throws IOException {
if (input.incrementToken()) {
String term = termAttr.toString();
char[] src = term.toCharArray();
char[] result = new char[src.length];
for (int i = 0; i < src.length; i++) {
Character c = s2l.get(src[i]);
char[] result = new char[termAttr.length()];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could instead be something like:

    char[] termBuffer = termAttr.buffer();
    int termLength = termAttr.length();
    for(int i=0; i<termLength; i++) {
        Character c = LETTER_MAPPINGS.get(termBuffer[i]);
        if (c != null) {
          termBuffer[i] = c;
        }
    }

    return true;

I.e. you can just directly manipulate the underlying buffer.

But really this is all just optimizing -- not urgent for the first commit of this awesome contribution. It can be done in follow-on PRs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikemccand @dungba88 Yeah, thanks for your suggestion. I reflected this, so please check it.

for (int i = 0; i < termAttr.length(); i++) {
Character c = LETTER_MAPPINGS.get(termAttr.charAt(i));
if (c != null) {
result[i] = c;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems all small characters are just 1 position ahead of the normal characters, so you can use result[i] = src[i] + 1;, and you can use a Set instead of Map: https://en.wikipedia.org/wiki/Hiragana_(Unicode_block)

Copy link
Contributor Author

@daixque daixque Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems all small characters are just 1 position ahead of the normal characters

It's not correct. See for example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that makes sense. Thank you

} else {
result[i] = src[i];
result[i] = termAttr.charAt(i);
}
}
String resultTerm = String.copyValueOf(result);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -30,12 +30,12 @@
* legal, contract policies, etc.
*/
public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be mostly the same as the other filter, so maybe we can combine them?

E.g you can either pass the mapping as a constructor parameter and provide 2 constants mapping

Copy link
Contributor Author

@daixque daixque Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dungba88 How should the constructor look like?

Like this?

public JapaneseKanaUppercaseFilter(TokenStream input, bool hiragana, bool katakana)

Note that Katakana has an exceptional character ㇷ゚, so logic is slightly different from hiragana.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, maybe we can consolidate them with a base class as a follow-up. This LGTM.

private static final Map<Character, Character> s2l;
private static final Map<Character, Character> LETTER_MAPPINGS;

static {
// supported characters are:
// ァ ィ ゥ ェ ォ ヵ ㇰ ヶ ㇱ ㇲ ッ ㇳ ㇴ ㇵ ㇶ ㇷ ㇷ゚ ㇸ ㇹ ㇺ ャ ュ ョ ㇻ ㇼ ㇽ ㇾ ㇿ ヮ
s2l =
LETTER_MAPPINGS =
Map.ofEntries(
Map.entry('ァ', 'ア'),
Map.entry('ィ', 'イ'),
Expand Down Expand Up @@ -82,7 +82,7 @@ public boolean incrementToken() throws IOException {
char[] src = term.toCharArray();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could instead call term.buffer() to access the source char[] and save creating a few temporary objects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, but it will affect length of result character array and break the tests. So let me keep current implementation.

Here is the example of test result.

term 0 expected:<ちよつと[]> but was:<ちよつと[sTerm�������]>
Expected :ちよつと
Actual   :ちよつとsTerm�������

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The buffer return the internal byte[] of the CharTermAttribute, which might has more bytes than the actual term length. You need to use term.length() as well.

char[] result = new char[src.length];
for (int i = 0; i < src.length; i++) {
Character c = s2l.get(src[i]);
Character c = LETTER_MAPPINGS.get(src[i]);
if (c != null) {
result[i] = c;
} else {
Expand Down