Elasticsearch version (bin/elasticsearch --version): 5.3.2
Plugins installed: [analysis-hebrew, analysis-icu, analysis-smartcn, analysis-stconvert, analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, ltr-query]
JVM version (java -version): java version "1.7.0_151"
OpenJDK Runtime Environment (IcedTea 2.6.11) (7u151-2.6.11-1~deb8u1)
OpenJDK 64-Bit Server VM (build 24.151-b01, mixed mode)
OS version (uname -a if on a Unix-like system): Linux mediawiki-vagrant 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2 (2017-04-30) x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
The standard tokenizer claims to follow the Unicode Text Segmentation algorithm, but does not for hiragana.
For example, the Unicode Text Segmentation demo splits おおかみ as お | おかみ, while the standard analyzer tokenizes it as お | お | か | み, and generally breaks up hiragana character by character.
Steps to reproduce:
1. Set up an analyzer named text as { "type": "custom", "tokenizer": "standard" }
2. Run: curl -sk localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "text", "text" : "おおかみ" }'
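For completeness, a minimal end-to-end reproduction (a sketch: the report only shows the analyzer definition and the _analyze call, so wrapping the analyzer in index settings this way is an assumption):

# create an index whose "text" analyzer is the custom analyzer from the report
curl -sk -XPUT localhost:9200/wiki_content -d '{ "settings": { "analysis": { "analyzer": { "text": { "type": "custom", "tokenizer": "standard" } } } } }'
# analyze the hiragana string with it
curl -sk localhost:9200/wiki_content/_analyze?pretty -d '{ "analyzer": "text", "text": "おおかみ" }'

The second request returns four single-character tokens (お, お, か, み), whereas the Unicode Text Segmentation demo splits the same string as お | おかみ.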
Unfortunately, I don't know the details of Unicode Standard Annex #29. Lucene's comment is here; I think they expect each hiragana character to be a single token.
We could ask unicode.org whether the implementation behind "breaks.jsp" follows the original UNICODE TEXT SEGMENTATION algorithm or not.
In the specification, they mention:
This specification defines default mechanisms; more sophisticated implementations can and should tailor them for particular locales or environments. For example, reliable detection of word boundaries in languages such as Thai, Lao, Chinese, or Japanese requires the use of dictionary lookup, analogous to English hyphenation. An implementation therefore may need to provide means to override or subclass the default mechanisms described in this annex.
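As an aside (not from the original thread), the installed analysis-icu plugin provides an icu_tokenizer that uses ICU's dictionary-backed word breaking for CJK text, which is the kind of tailoring the annex describes. A quick comparison against the standard tokenizer might look like:

# tokenize the same string with the ICU tokenizer (ES 5.x accepts a body with "tokenizer" and "text")
curl -sk 'localhost:9200/_analyze?pretty' -d '{ "tokenizer": "icu_tokenizer", "text": "おおかみ" }'

If the dictionary covers the word, this should keep おおかみ together (or split it as お | おかみ) rather than emitting one token per character.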