CJK analyzer tokenization issues #34285

Closed
Trey314159 opened this issue Oct 3, 2018 · 11 comments
Labels
:Search/Analysis How text is split into tokens

Comments

@Trey314159

It makes sense to me to report these all together, but I can split these into separate bugs if that's better.

Elasticsearch version (curl -XGET 'localhost:9200'):

{
  "name" : "adOS8gy",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "GVS7gpVBQDGwtHl3xnJbLw",
  "version" : {
    "number" : "6.4.0",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "595516e",
    "build_date" : "2018-08-17T23:18:47.308994Z",
    "build_snapshot" : false,
    "lucene_version" : "7.4.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Plugins installed: [analysis-icu, analysis-nori]

JVM version (java -version):
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux vagrantes6 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

I've uncovered a number of oddities in tokenization in the CJK analyzer. All examples are from Korean Wikipedia or Korean Wiktionary (including non-CJK examples). In rough order of importance:

A. Mixed-script tokens (Korean mixed with non-CJK characters such as numbers or Latin letters) are treated as one long token rather than being broken up into bigrams. For example, 안녕은하철도999극장판2.1981년8월8일.일본개봉작1999년재더빙video판 is tokenized as one token.

B. Middle dots (·, U+00B7) can be used as list separators in Korean. When they are, the text is not broken up into bigrams. For example, 경승지·산악·협곡·해협·곶·심연·폭포·호수·급류 is tokenized as one token. I'm not sure whether this is a special case of (A) or not.

Workaround: use a character filter to convert middle dots to spaces before the rest of the CJK analysis chain.
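
For illustration, a rough sketch of that workaround using a pattern_replace character filter in front of the same building blocks the cjk analyzer is made of (standard tokenizer, cjk_width, lowercase, cjk_bigram); the text is a shortened version of the example in the recreation below:

curl -sk localhost:9200/_analyze?pretty -H 'Content-Type: application/json' -d '{
  "char_filter": [ { "type": "pattern_replace", "pattern": "·", "replacement": " " } ],
  "tokenizer": "standard",
  "filter": [ "cjk_width", "lowercase", "cjk_bigram" ],
  "text": "경승지·산악·협곡"
}'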

C. The CJK analyzer eats encircled numbers (①②③), "dingbat" circled numbers (➀➁➂), parenthesized numbers (⑴⑵⑶), fractions (¼ ⅓ ⅜ ½ ⅔ ¾), superscript numbers (¹²³), and subscript numbers (₁₂₃). They just disappear.

Workaround: apply the icu_normalizer character filter before the CJK analysis chain to convert these to ASCII digits.
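
A rough sketch of that workaround (it assumes the analysis-icu plugin is installed; the index name cjk_icu is only for illustration), with icu_normalizer as a character filter in front of the cjk building blocks:

curl -X PUT "localhost:9200/cjk_icu?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "text": {
            "type": "custom",
            "char_filter": [ "icu_normalizer" ],
            "tokenizer": "standard",
            "filter": [ "cjk_width", "lowercase", "cjk_bigram" ]
          }
        }
      }
    }
  }
}
'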

D. Soft hyphens (U+00AD), zero-width non-joiners (U+200C), and left-to-right and right-to-left marks (U+200E and U+200F) are left in tokens. They should be stripped out. Examples: hyphen­ation (soft hyphen), بازی‌های (zero-width non-joiner), and הארץ‎ (left-to-right mark).

Workaround: use a character filter to strip these characters before the CJK analysis chain.
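
A rough sketch of that workaround, again with a pattern_replace character filter; the character class and the sample text (with the soft hyphen written as a JSON escape so it stays visible) are only illustrative:

curl -sk localhost:9200/_analyze?pretty -H 'Content-Type: application/json' -d '{
  "char_filter": [ { "type": "pattern_replace", "pattern": "[\\u00AD\\u200C\\u200E\\u200F]", "replacement": "" } ],
  "tokenizer": "standard",
  "filter": [ "cjk_width", "lowercase", "cjk_bigram" ],
  "text": "hyphen\u00ADation"
}'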

Steps to reproduce:


  1. Set up CJK analyzer:
curl -X PUT "localhost:9200/cjk?pretty" -H 'Content-Type: application/json' -d'
{
  "settings" : {
    "index": {
      "analysis": {
        "analyzer": {
          "text": {
            "type": "cjk"
          }
        }
      }
    }
  }
}
'
  2. Analyze example tokens:

A. Mixed Korean and non-CJK characters

curl -sk localhost:9200/cjk/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer": "text", "text" : "안녕은하철도999극장판2.1981년8월8일.일본개봉작1999년재더빙video판"}'

{
  "tokens" : [
    {
      "token" : "안녕은하철도999극장판2.1981년8월8일.일본개봉작1999년재더빙video판",
      "start_offset" : 0,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

B. Middle dots as list separators

curl -sk localhost:9200/cjk/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer": "text", "text" : "경승지·산악·협곡·해협·곶·심연·폭포·호수·급류"}'

{
  "tokens" : [
    {
      "token" : "경승지·산악·협곡·해협·곶·심연·폭포·호수·급류",
      "start_offset" : 0,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

C. Unicode numerical characters disappear

curl -sk localhost:9200/cjk/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer": "text", "text" : "① ② ③ ➀ ➁ ➂ ⑴ ⑵ ⑶ ¼ ⅓ ⅜ ½ ⅔ ¾ ¹ ² ³ ₁ ₂ ₃"}'

{
  "tokens" : [ ]
}

D. Soft hyphens, zero-width non-joiners, and left-to-right/right-to-left marks (note that these are usually invisible)

curl -sk localhost:9200/cjk/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer": "text", "text" : "hyphen­ation"}'

{
  "tokens" : [
    {
      "token" : "hyphen­ation",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

curl -sk localhost:9200/cjk/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer": "text", "text" : "بازی‌های"}'

{
  "tokens" : [
    {
      "token" : "بازی‌های",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

curl -sk localhost:9200/cjk/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer": "text", "text" : "הארץ‎"}'

{
  "tokens" : [
    {
      "token" : "הארץ‎",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

@colings86 added the :Search/Analysis (How text is split into tokens) label on Oct 4, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@colings86
Contributor

@romseygeek Could you take a look at this one?

@romseygeek
Contributor

Thanks for the very detailed issue @Trey314159.

It seems that all of these issues are solved by using some combination of the ICU normalizers and tokenizers. CJKAnalyzer is built on the StandardTokenizer, which does not deal well with mixed-script text, and it does no character filtering beforehand.

  • A and B are tokenized correctly with icu_tokenizer
  • C and D are tokenized correctly with icu_normalizer and icu_tokenizer
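
For example (a sketch; the index and analyzer names are illustrative and assume the analysis-icu plugin is installed):

curl -X PUT "localhost:9200/icu_check?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "icu_text": {
            "type": "custom",
            "char_filter": [ "icu_normalizer" ],
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
'

curl -sk localhost:9200/icu_check/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer": "icu_text", "text" : "안녕은하철도999극장판2.1981년8월8일"}'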

The advantage of the CJKAnalyzer is that it's part of the server code and requires no plugins, but I wonder if we ought to think about deprecating it and pointing people to the analysis-icu plugin for CJK text instead? Providing an icu_analyzer out of the box would also be helpful.

cc @jimczi

@jimczi
Contributor

jimczi commented Oct 4, 2018

A and B look like bugs to me. The ICUTokenizer and the StandardTokenizer are supposed to implement the same rules, as specified in http://unicode.org/reports/tr29. However, Hangul characters are not recognized by the StandardTokenizer when they are mixed with Latin characters and not separated by spaces. I consider this a bug because that is what this tokenizer is supposed to do: separate characters that don't belong to the same script. We need to open an issue in Lucene.

For C and D, this is a normalization issue that can be solved by the ICUNormalizer, as Alan noted; I would not expect this normalization by default in the standard analyzer.

@Trey314159
Author

Thanks for the feedback. I agree that C and D can be fixed with other tokenizers and additional filters, but I wonder how many people even know how to "unpack"/rebuild a tokenizer (it's in the docs, but you have to know to look for it). The invisible characters are the worst, because you don't even know you have them in your document or your query if you cut and paste it from somewhere else.
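
For what it's worth, rebuilding the cjk analyzer as a custom analyzer so that a character filter can be slotted in looks roughly like this (a sketch; the empty char_filter list is where the workaround filters would go, and the stop filter's exact stopword list is glossed over here):

curl -X PUT "localhost:9200/cjk_custom?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "text": {
            "type": "custom",
            "char_filter": [],
            "tokenizer": "standard",
            "filter": [ "cjk_width", "lowercase", "cjk_bigram", "stop" ]
          }
        }
      }
    }
  }
}
'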

Not having the desired normalization for these relatively common invisible characters in the standard analyzer is a reasonable design choice, but I can't imagine anyone would expect that searching for hyphenation would fail to find hyphen­ation, because they look exactly the same.

Opening issues in Lucene for A and B would be great—thanks!

@jimczi
Contributor

jimczi commented Oct 5, 2018

I opened https://issues.apache.org/jira/browse/LUCENE-8526 for the StandardTokenizer issue.

@jimczi
Contributor

jimczi commented Oct 11, 2018

As explained by Steve in the Lucene issue, I was wrong about the StandardTokenizer behavior. The Unicode spec that this tokenizer implements does not have a rule to break between Hangul syllables and other scripts. The script-boundary break is not part of the specification; it is additional functionality of the ICUTokenizer. We'll add documentation about this behavior in Lucene and Elasticsearch.

> Thanks for the feedback. I agree that C and D can be fixed with other tokenizers and additional filters, but I wonder how many people even know how to "unpack"/rebuild a tokenizer (it's in the docs, but you have to know to look for it). The invisible characters are the worst, because you don't even know you have them in your document or your query if you cut and paste it from somewhere else.

What do you think of Alan's idea to provide an icu_analyzer in the ICU plugin? This could simplify the usage of the plugin.

> Not having the desired normalization for these relatively common invisible characters in the standard analyzer is a reasonable design choice, but I can't imagine anyone would expect that searching for hyphenation would fail to find hyphen­ation, because they look exactly the same.

It would be nice to have some documentation in the ICU plugin that explains these normalization issues. @Trey314159 is this something you would be interested in contributing?

@Trey314159
Author

> What do you think of Alan's idea to provide an icu_analyzer in the ICU plugin? This could simplify the usage of the plugin.

Paraphrasing @romseygeek, I agree that pointing people to analysis-icu for CJK text and providing an icu_analyzer out of the box could be a good thing.

> It would be nice to have some documentation in the ICU plugin that explains these normalization issues. @Trey314159 is this something you would be interested in contributing?

Maybe? I wouldn't want to commit to any really big project, but if you want some examples with explanation, I could probably put something together, covering at least the issues I regularly run into. Would it go here? Would it be best to issue a pull request there or just provide some descriptive text to someone else who better knows where to put it?

@jimczi
Contributor

jimczi commented Oct 15, 2018

> Would it go here? Would it be best to issue a pull request there or just provide some descriptive text to someone else who better knows where to put it?

Yes, a pull request there would be great.

romseygeek added a commit that referenced this issue Nov 21, 2018

The ICU plugin provides the building blocks of an analysis chain, but doesn't actually have a prebuilt analyzer. It would be better for users if there was a simple analyzer that they could use out of the box, and also something we can point to from the CJK Analyzer docs as a superior alternative.

Relates to #34285
@romseygeek
Contributor

The ICU analyzer will be in 6.6, and docs on CJKAnalyzer point to it as an alternative. Can we close this one out now, or is there more to do?
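
Usage should then be as simple as something like this (a sketch, assuming the analysis-icu plugin is installed on a 6.6+ node):

curl -sk localhost:9200/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer": "icu_analyzer", "text" : "안녕은하철도999극장판2.1981년8월8일"}'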

@romseygeek
Contributor

I'm going to close this now, as we have a working alternative in the icu_analyzer from 6.6. Please do re-open it if you think there are more problems that need to be fixed.
