
ICU Tokenizer: letter-space-number-letter tokenized inconsistently #27290

Closed
Trey314159 opened this issue Nov 6, 2017 · 6 comments
Labels: >bug, :Search Relevance/Analysis (How text is split into tokens), Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

Comments


Trey314159 commented Nov 6, 2017

Elasticsearch version (bin/elasticsearch --version): 5.3.2

Plugins installed: [analysis-hebrew, analysis-icu, analysis-smartcn, analysis-stconvert, analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, ltr-query]

JVM version (java -version): java version "1.7.0_151"
OpenJDK Runtime Environment (IcedTea 2.6.11) (7u151-2.6.11-1~deb8u1)
OpenJDK 64-Bit Server VM (build 24.151-b01, mixed mode)

OS version (uname -a if on a Unix-like system): Linux mediawiki-vagrant 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2 (2017-04-30) x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

2021 Update: This cropped up again recently, and after a bit more testing I realized that the relevant generalization is this: in a letter-space-number-letter sequence, if the writing system before the space is the same as the writing system after the number, you get two tokens; if the writing systems differ, you get three tokens. So both ァ 14th and th 14ァ are affected (and likewise for many other writing systems).

The tokenization of strings like 14th with the ICU tokenizer is affected by the character that comes before the preceding whitespace.

For example, x 14th is tokenized as x | 14th; ァ 14th is tokenized as ァ | 14 | th. This holds true for many other characters at U+0370 and above, including Japanese, Korean (가), Chinese (豈), Greek (Δ), Cyrillic (Д), Devanagari (ख), and Futhark (𐊃). It also applies to halfwidth Japanese (イ), but not to fullwidth Latin characters (D).

The expected behavior is that after one or more spaces, tabs, or newlines, tokenization would not be affected by the characters before the whitespace.

Underscores are also affected: _x and __x are tokenized differently after a Latin character and a space than after a character at U+0370 or above and a space.

The Unicode Segmentation Algorithm Demo agrees. Example.

Tokenizing the same string x 14th 가 14th 豈 14th Δ 14th Д 14th ख 14th 𐊃 14th イ 14th D 14th with the ICU tokenizer breaks up "14th" for all but the first and last instances.
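
A minimal Lucene-level sketch of the same check, covering both orders from the 2021 update (this assumes lucene-analysis-icu on the classpath; the class and method names are only illustrative). Per the generalization above, the first call should keep 14th whole and the other two should split the number from the letters:

import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class LetterSpaceNumberLetterDemo {

    static void tokenize(String text) throws Exception {
        Tokenizer t = new ICUTokenizer();
        t.setReader(new StringReader(text));
        t.reset();
        CharTermAttribute term = t.addAttribute(CharTermAttribute.class);
        TypeAttribute type = t.addAttribute(TypeAttribute.class);
        StringBuilder out = new StringBuilder(text).append("  ->");
        while (t.incrementToken()) {
            out.append("  ").append(term).append('/').append(type.type());
        }
        t.end();
        t.close();
        System.out.println(out);
    }

    public static void main(String[] args) throws Exception {
        tokenize("x 14th");    // same script before the space: "14th" stays whole
        tokenize("ァ 14th");   // different script before the space: "14th" splits
        tokenize("th 14ァ");   // reverse order, affected the same way
    }
}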

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including (e.g.) index creation, mappings, settings, query, etc. The easier you make it for us to reproduce, the more likely it is that somebody will take the time to look at it.

  1. Set up analyzer text as { "type": "custom", "tokenizer": "icu_tokenizer" }
  2. curl -sk localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "text", "text" : "x 14th" }'
{
  "tokens" : [
    {
      "token" : "x",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "14th",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
  3. curl -sk localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "text", "text" : "ァ 14th" }'
{
  "tokens" : [
    {
      "token" : "ァ",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "14",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "th",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
@romseygeek
Contributor

cc @elastic/es-search-aggs

@rjernst added the Team:Search (Meta label for search team) label on May 4, 2020
@Trey314159
Author

Any update on this?

I discovered a very annoying Heisenbug caused by an unexpected interaction between this bug and the 4k buffer used to read input for tokenization, which assumes that spaces are unambiguous token breaks. Since this bug reaches across spaces, the 4k chunking matters in rare cases. I had an example that happened to chunk between the equivalent of 가 and 14th in the example above, which accidentally corrected the tokenization. When I changed unrelated text thousands of characters away from 가 14th, it altered the chunking and the bug reemerged, which threw my regression testing into chaos until I figured out what was causing it.
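
A minimal sketch of how this interaction can be observed, assuming the tokenizer reads its input in 4096-character chunks as described above (the filler text and exact lengths here are only illustrative):

import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ChunkBoundaryDemo {
    public static void main(String[] args) throws Exception {
        // Build a document so that "가 14th" sits near the presumed 4096-char
        // read boundary; nudging the filler length by a character or two moves
        // the boundary across the space and changes the result.
        StringBuilder doc = new StringBuilder();
        while (doc.length() < 4093) {
            doc.append("filler text ");
        }
        doc.setLength(4093);          // adjust +/- 1 to move the chunk boundary
        doc.append(" 가 14th");

        Tokenizer t = new ICUTokenizer();
        t.setReader(new StringReader(doc.toString()));
        t.reset();
        CharTermAttribute term = t.addAttribute(CharTermAttribute.class);
        String prev = null, last = null;
        while (t.incrementToken()) {
            prev = last;
            last = term.toString();
        }
        t.end();
        t.close();
        // When the chunk boundary happens to fall between "가" and "14th" the
        // final tokens are [가, 14th]; otherwise they are [14, th].
        System.out.println(prev + " | " + last);
    }
}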

@Trey314159 changed the title from "ICU Tokenizer: U+0370 and above affect tokenization of characters after whitespace" to "ICU Tokenizer: letter-space-number-letter tokenized inconsistently" on Jan 22, 2021
@robbertbrak

I just discovered this issue. It looks exactly the same as the issue I reported yesterday against OpenSearch: opensearch-project/OpenSearch#10663.

@cbuescher
Member

This still shows on 8.10, which uses lucene-analysis-icu-9.8.0. I wrote a small test that confirms there's nothing in the Elasticsearch code causing this; the behaviour also appears in isolation when using a plain Lucene ICUTokenizer, which in turn I think uses the ICU library's CompositeBreakIterator. I assume that means the behaviour already manifests itself there. Here's the small unit test:

    public void testLetterSpaceNumberLetters() throws IOException {
        Reader input = new StringReader("x 14th 가 14th 豈 14th Δ 14th Д 14th ख 14th イ 14th D 14th");

        Tokenizer t = new ICUTokenizer();
        t.setReader(input);
        t.reset();
        CharTermAttribute termAtt = t.addAttribute(CharTermAttribute.class);
        // Print one token per line: "14th" is split into "14" and "th" for all
        // occurrences except the first (after Latin "x") and the last (after fullwidth "D").
        while (t.incrementToken()) {
            System.out.println(termAtt.toString());
        }
        t.end();
        t.close();
    }

This shows the same tokenization as reported above from our "_analyze" endpoint.
For this reason I propose closing this issue and opening a corresponding one with Lucene.

@Trey314159
Author

For this reason I propose closing this issue and opening a corresponding one with Lucene.

I should have linked to my Lucene tickets. I suggest keeping this one open so people can find it here—but I understand if the decision is to close it.

I opened a ticket on Jira in 2021 and they have since moved it to GitHub. The most recent comments are that it is behaving "as designed" even though it goes against the examples in the cited Unicode standard.

I think the solution is to reset "the current character set" at whitespace and maybe other boundaries, but they disagree. I'm actually going to start working soon on a post-ICU-tokenizer token filter to repair these cases, but I don't know how that will pan out.
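
A rough sketch of what such a repair filter might look like (the class name NumAlphaRejoinFilter is hypothetical, not an existing Lucene or Elasticsearch filter): it rejoins a <NUM> token with an immediately following <ALPHANUM> token when their offsets are contiguous, as in 14 + th.

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public final class NumAlphaRejoinFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

    private State pendingState;   // one-token lookahead that was peeked but not merged

    public NumAlphaRejoinFilter(TokenStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Take either the token peeked on the previous call or a fresh one.
        if (pendingState != null) {
            restoreState(pendingState);
            pendingState = null;
        } else if (!input.incrementToken()) {
            return false;
        }

        if (!"<NUM>".equals(typeAtt.type())) {
            return true;                       // nothing to repair
        }

        // Current token is a number: peek at the next token.
        String numTerm = termAtt.toString();
        int numStart = offsetAtt.startOffset();
        int numEnd = offsetAtt.endOffset();
        State numState = captureState();

        if (!input.incrementToken()) {
            restoreState(numState);            // stream ended; emit the number as-is
            return true;
        }

        if ("<ALPHANUM>".equals(typeAtt.type()) && offsetAtt.startOffset() == numEnd) {
            // Adjacent number + letters with nothing in between: merge them.
            String merged = numTerm + termAtt.toString();
            int mergedEnd = offsetAtt.endOffset();
            termAtt.setEmpty().append(merged);
            offsetAtt.setOffset(numStart, mergedEnd);
            typeAtt.setType("<ALPHANUM>");
            return true;
        }

        // Not mergeable: keep the peeked token for the next call, emit the number.
        pendingState = captureState();
        restoreState(numState);
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pendingState = null;
    }
}

This only covers the number-then-Latin-letters case shown above; whether rejoining is also the right repair for the ideographic cases is a separate question, and wiring it into an analyzer would need a corresponding factory.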

@cbuescher
Member

I suggest keeping this one open so people can find it here—but I understand if the decision is to close it.

The issue should still be discoverable when closed. We like to keep issues open only when actionable by us, so I'm going to close this one.

@javanna added the Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) label and removed the Team:Search (Meta label for search team) label on Jul 12, 2024