Add new token filters for Japanese sutegana (捨て仮名) #12915
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -30,12 +30,12 @@ | |
* legal, contract policies, etc. | ||
*/ | ||
public final class JapaneseHiraganaUppercaseFilter extends TokenFilter { | ||
- private static final Map<Character, Character> s2l; | ||
+ private static final Map<Character, Character> LETTER_MAPPINGS; | ||
|
||
static { | ||
// supported characters are: | ||
// ぁ ぃ ぅ ぇ ぉ っ ゃ ゅ ょ ゎ ゕ ゖ | ||
- s2l = | ||
+ LETTER_MAPPINGS = | ||
Map.ofEntries( | ||
Map.entry('ぁ', 'あ'), | ||
Map.entry('ぃ', 'い'), | ||
|
@@ -60,15 +60,13 @@ public JapaneseHiraganaUppercaseFilter(TokenStream input) { | |
@Override | ||
public boolean incrementToken() throws IOException { | ||
if (input.incrementToken()) { | ||
- String term = termAttr.toString(); | ||
- char[] src = term.toCharArray(); | ||
- char[] result = new char[src.length]; | ||
- for (int i = 0; i < src.length; i++) { | ||
- Character c = s2l.get(src[i]); | ||
+ char[] result = new char[termAttr.length()]; | ||
+ for (int i = 0; i < termAttr.length(); i++) { | ||
+ Character c = LETTER_MAPPINGS.get(termAttr.charAt(i)); | ||
if (c != null) { | ||
result[i] = c; | ||
It seems all small characters are just 1 position ahead of the normal characters, so you can use
It's not correct. See
I see, that makes sense. Thank you. |
||
} else { | ||
- result[i] = src[i]; | ||
+ result[i] = termAttr.charAt(i); | ||
} | ||
} | ||
String resultTerm = String.copyValueOf(result); | ||
|
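The exchange above about small characters sitting one position before their full-size counterparts can be checked with a small standalone program. This is only a sketch: the class name `SmallKanaOffsetCheck`, the helper `isPlusOne`, and the sample mappings other than ぁ → あ are assumptions chosen for illustration, and the exception it surfaces is not necessarily the one the author pointed to in the review.

```java
import java.util.Map;

public class SmallKanaOffsetCheck {
  // Hypothetical sample of small-to-full-size pairs (assumed for illustration;
  // the filter in the diff defines its own, longer table).
  static final Map<Character, Character> SAMPLE =
      Map.of('ぁ', 'あ', 'っ', 'つ', 'ゃ', 'や', 'ゕ', 'か', 'ゖ', 'け');

  // True when the full-size letter is exactly one code point after the small one.
  static boolean isPlusOne(char small) {
    char full = SAMPLE.get(small);
    return full - small == 1;
  }

  public static void main(String[] args) {
    // Fixed order so the output is stable (Map.of has no defined iteration order).
    char[] smalls = {'ぁ', 'っ', 'ゃ', 'ゕ', 'ゖ'};
    for (char small : smalls) {
      System.out.printf(
          "U+%04X -> U+%04X plusOne=%b%n",
          (int) small, (int) SAMPLE.get(small).charValue(), isPlusOne(small));
    }
  }
}
```

Under these assumed pairs, ゕ (U+3095) and ゖ (U+3096) are encoded after the main hiragana block, far from か (U+304B) and け (U+3051), so a fixed offset would not cover them and an explicit lookup table remains the simpler choice.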
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -30,12 +30,12 @@ | |
* legal, contract policies, etc. | ||
*/ | ||
public final class JapaneseKatakanaUppercaseFilter extends TokenFilter { | ||
This seems to be mostly the same as the other filter, so maybe we can combine them? E.g., you can either pass the mapping as a constructor parameter and provide 2 constants mapping
@dungba88 What should the constructor look like? Like this?
Note that Katakana has an exceptional character
You are right, maybe we can consolidate them with a base class as a follow-up. This LGTM. |
||
- private static final Map<Character, Character> s2l; | ||
+ private static final Map<Character, Character> LETTER_MAPPINGS; | ||
|
||
static { | ||
// supported characters are: | ||
// ァ ィ ゥ ェ ォ ヵ ㇰ ヶ ㇱ ㇲ ッ ㇳ ㇴ ㇵ ㇶ ㇷ ㇷ゚ ㇸ ㇹ ㇺ ャ ュ ョ ㇻ ㇼ ㇽ ㇾ ㇿ ヮ | ||
- s2l = | ||
+ LETTER_MAPPINGS = | ||
Map.ofEntries( | ||
Map.entry('ァ', 'ア'), | ||
Map.entry('ィ', 'イ'), | ||
|
@@ -82,7 +82,7 @@ public boolean incrementToken() throws IOException { | |
char[] src = term.toCharArray(); | ||
You could instead call
Thanks, but it will affect the length of the result character array and break the tests, so let me keep the current implementation. Here is an example of the test result.
The buffer returned is the internal char[] of the CharTermAttribute, which might be longer than the actual term length. You need to use term.length() as well. |
||
char[] result = new char[src.length]; | ||
for (int i = 0; i < src.length; i++) { | ||
- Character c = s2l.get(src[i]); | ||
+ Character c = LETTER_MAPPINGS.get(src[i]); | ||
if (c != null) { | ||
result[i] = c; | ||
} else { | ||
|
I think this could instead be something like:
I.e. you can just directly manipulate the underlying buffer.
But really this is all just optimizing -- not urgent for the first commit of this awesome contribution. It can be done in follow-on PRs.
@mikemccand @dungba88 Yeah, thanks for your suggestion. I've incorporated it, so please take a look.
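
The two points raised in this thread (the internal buffer can be longer than the term, and the mapping can be applied to that buffer directly) can be illustrated without Lucene on the classpath. The following is a minimal sketch, not the code that was merged: `InPlaceKanaMapping` and `upperCaseInPlace` are hypothetical names, the plain `char[]` plus `length` pair stands in for what `CharTermAttribute.buffer()` and `length()` would provide, and the table holds only the two entries visible in the diff.

```java
import java.util.Map;

public class InPlaceKanaMapping {
  // Subset of the mappings shown in the diff; the real table is longer.
  static final Map<Character, Character> LETTER_MAPPINGS =
      Map.of('ぁ', 'あ', 'ぃ', 'い');

  // Rewrites buffer[0..length) in place. Slots at index >= length are spare
  // capacity, not part of the term, so the loop never reads or writes them.
  static void upperCaseInPlace(char[] buffer, int length) {
    for (int i = 0; i < length; i++) {
      Character c = LETTER_MAPPINGS.get(buffer[i]);
      if (c != null) {
        buffer[i] = c;
      }
    }
  }

  public static void main(String[] args) {
    // Simulate an oversized internal buffer holding a 3-char term.
    char[] buffer = new char[16];
    String term = "ふぁい";
    term.getChars(0, term.length(), buffer, 0);
    int length = term.length();

    upperCaseInPlace(buffer, length);

    // Only the first `length` chars form the term.
    System.out.println(new String(buffer, 0, length));
    // The raw array is still 16 chars long, which is why sizing a result
    // array from buffer.length instead of the term length changes the output.
    System.out.println(buffer.length);
  }
}
```

Because the term length is unchanged by a one-to-one character mapping, an in-place rewrite of this kind needs no temporary array and no length update.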