Standard tokenizer incorrectly tokenizes hiragana #27291

Open
Trey314159 opened this issue Nov 6, 2017 · 3 comments
Labels: >bug, :Search/Analysis, Team:Search

@Trey314159

Elasticsearch version (bin/elasticsearch --version): 5.3.2

Plugins installed: [analysis-hebrew, analysis-icu, analysis-smartcn, analysis-stconvert, analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, ltr-query]

JVM version (java -version): java version "1.7.0_151"
OpenJDK Runtime Environment (IcedTea 2.6.11) (7u151-2.6.11-1~deb8u1)
OpenJDK 64-Bit Server VM (build 24.151-b01, mixed mode)

OS version (uname -a if on a Unix-like system): Linux mediawiki-vagrant 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2 (2017-04-30) x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

The standard tokenizer claims to follow the Unicode Text Segmentation algorithm, but does not for hiragana.

For example, the Unicode Text Segmentation demo splits おおかみ as お | おかみ, while the standard analyzer tokenizes it as お | お | か | み, and generally breaks up hiragana character by character.

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query, etc. The easier you make it
for us to reproduce, the more likely it is that somebody will take the time to look at it.

  1. Set up an analyzer named text as { "type": "custom", "tokenizer": "standard" } (a full index-creation sketch follows the output below)
  2. curl -sk localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "text", "text" : "おおかみ" }'
{
  "tokens" : [
    {
      "token" : "お",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<HIRAGANA>",
      "position" : 0
    },
    {
      "token" : "お",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<HIRAGANA>",
      "position" : 1
    },
    {
      "token" : "か",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<HIRAGANA>",
      "position" : 2
    },
    {
      "token" : "み",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<HIRAGANA>",
      "position" : 3
    }
  ]
}
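
For completeness, a minimal sketch of what step 1 can look like as a full index-creation call (it reuses the wiki_content index name and default port from the curl above; adjust as needed):

curl -sk -XPUT 'localhost:9200/wiki_content' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "text": { "type": "custom", "tokenizer": "standard" }
      }
    }
  }
}'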
@cbuescher added the :Search/Analysis label on Nov 6, 2017
@cbuescher (Member)

@johtani this looks like something you might be able to comment on?

@johtani (Contributor) commented Nov 7, 2017

Unfortunately, I don't know Unicode Standard Annex #29 in detail.
Lucene's comment is here; I think they expect each hiragana character to be treated as a single token.

We may want to ask unicode.org whether the implementation behind "breaks.jsp" is the original Unicode Text Segmentation algorithm or not.
The specification mentions:

This specification defines default mechanisms; more sophisticated implementations can and should tailor them for particular locales or environments. For example, reliable detection of word boundaries in languages such as Thai, Lao, Chinese, or Japanese requires the use of dictionary lookup, analogous to English hyphenation. An implementation therefore may need to provide means to override or subclass the default mechanisms described in this annex. 

I tried analyzing おおかみ with Kuromoji, which uses IPADic, and the result is お | おかみ.
I'm not sure, but if the demo is based on the ICU project, it looks like it uses this dictionary:
http://source.icu-project.org/repos/icu/icu/tags/cldr-29-beta1/source/data/brkitr/dictionaries/cjdict.txt
This is just my guess...
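
For reference, a sketch of how the same input can be run through dictionary-based tokenizers via _analyze (this assumes the analysis-kuromoji plugin is installed, which is not in the plugin list above; icu_tokenizer comes from the already-installed analysis-icu plugin, and the expected splits are my untested guess):

# Kuromoji tokenizer, backed by the IPADic dictionary
curl -sk localhost:9200/_analyze?pretty -d '{"tokenizer": "kuromoji_tokenizer", "text": "おおかみ"}'

# ICU tokenizer, which uses dictionary-based segmentation for CJK text
curl -sk localhost:9200/_analyze?pretty -d '{"tokenizer": "icu_tokenizer", "text": "おおかみ"}'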

@romseygeek (Contributor)

cc @elastic/es-search-aggs

@rjernst added the Team:Search label on May 4, 2020