
ICU Tokenizer: letter-space-number-letter tokenized inconsistently #27290

Closed
Trey314159 opened this issue Nov 6, 2017 · 6 comments
Labels: >bug, :Search Relevance/Analysis (How text is split into tokens), Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

Comments


Trey314159 commented Nov 6, 2017

Elasticsearch version (bin/elasticsearch --version): 5.3.2

Plugins installed: [analysis-hebrew, analysis-icu, analysis-smartcn, analysis-stconvert, analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, ltr-query]

JVM version (java -version): java version "1.7.0_151"
OpenJDK Runtime Environment (IcedTea 2.6.11) (7u151-2.6.11-1~deb8u1)
OpenJDK 64-Bit Server VM (build 24.151-b01, mixed mode)

OS version (uname -a if on a Unix-like system): Linux mediawiki-vagrant 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2 (2017-04-30) x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

2021 Update: This cropped up again recently, and after a bit more testing I realized that the relevant generalization is this: in a letter-space-number-letter sequence, if the writing system before the space is the same as the writing system after the number, you get two tokens; if the writing systems differ, you get three tokens. So both ァ 14th and th 14ァ are affected (and likewise for many other writing systems).

The tokenization of strings like 14th with the ICU tokenizer is affected by the character that comes before the preceding whitespace.

For example, x 14th is tokenized as x | 14th; ァ 14th is tokenized as ァ | 14 | th. This holds true for many other characters at U+0370 and above, including Japanese, Korean (가), Chinese (豈), Greek (Δ), Cyrillic (Д), Devanagari (ख), and Futhark (𐊃). It also applies to halfwidth Japanese (イ), but not to fullwidth Latin characters (D).

The expected behavior is that after one or more spaces, tabs, or newlines, tokenization would not be affected by the characters before the whitespace.

Underscores are also affected: _x and __x are tokenized differently after a Latin character and a space than after a character at U+0370 or above and a space.

The Unicode Segmentation Algorithm Demo agrees. Example.

Tokenizing the same string x 14th 가 14th 豈 14th Δ 14th Д 14th ख 14th 𐊃 14th イ 14th D 14th with the ICU tokenizer breaks up "14th" for all but the first and last instances.
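
A minimal Lucene-level sketch of the same check, covering both orders from the 2021 update (this assumes lucene-analysis-icu on the classpath; the class and method names are only illustrative). Per the generalization above, the first call should keep 14th whole and the other two should split the number from the letters:

import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class LetterSpaceNumberLetterDemo {

    static void tokenize(String text) throws Exception {
        Tokenizer t = new ICUTokenizer();
        t.setReader(new StringReader(text));
        t.reset();
        CharTermAttribute term = t.addAttribute(CharTermAttribute.class);
        TypeAttribute type = t.addAttribute(TypeAttribute.class);
        StringBuilder out = new StringBuilder(text).append("  ->");
        while (t.incrementToken()) {
            out.append("  ").append(term).append('/').append(type.type());
        }
        t.end();
        t.close();
        System.out.println(out);
    }

    public static void main(String[] args) throws Exception {
        tokenize("x 14th");    // same script before the space: "14th" stays whole
        tokenize("ァ 14th");   // different script before the space: "14th" splits
        tokenize("th 14ァ");   // reverse order, affected the same way
    }
}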

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including (e.g.) index creation, mappings, settings, query, etc. The easier you make it for us to reproduce, the more likely it is that somebody will take the time to look at it.

  1. Set up analyzer text as { "type": "custom", "tokenizer": "icu_tokenizer" }
  2. curl -sk localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "text", "text" : "x 14th" }'
{
  "tokens" : [
    {
      "token" : "x",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "14th",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
  3. curl -sk localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "text", "text" : "ァ 14th" }'
{
  "tokens" : [
    {
      "token" : "ァ",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "14",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "th",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
@romseygeek
Contributor

cc @elastic/es-search-aggs

@rjernst added the Team:Search (Meta label for search team) label on May 4, 2020
@Trey314159
Author

Any update on this?

I discovered a very annoying Heisenbug caused by an unexpected interaction between this bug and the 4k buffer used to read input for tokenization, which assumes that spaces are unambiguous token breaks. Since this bug reaches across spaces, the 4k chunking matters in rare cases. I had an example that happened to chunk between the equivalent of 가 and 14th in the example above, which accidentally corrected the tokenization. When I changed unrelated text thousands of characters away from 가 14th, it altered the chunking and the bug reemerged, which threw my regression testing into chaos until I figured out what was causing it.
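
A minimal sketch of how this interaction can be observed, assuming the tokenizer reads its input in 4096-character chunks as described above (the filler text and exact lengths here are only illustrative):

import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ChunkBoundaryDemo {
    public static void main(String[] args) throws Exception {
        // Build a document so that "가 14th" sits near the presumed 4096-char
        // read boundary; nudging the filler length by a character or two moves
        // the boundary across the space and changes the result.
        StringBuilder doc = new StringBuilder();
        while (doc.length() < 4093) {
            doc.append("filler text ");
        }
        doc.setLength(4093);          // adjust +/- 1 to move the chunk boundary
        doc.append(" 가 14th");

        Tokenizer t = new ICUTokenizer();
        t.setReader(new StringReader(doc.toString()));
        t.reset();
        CharTermAttribute term = t.addAttribute(CharTermAttribute.class);
        String prev = null, last = null;
        while (t.incrementToken()) {
            prev = last;
            last = term.toString();
        }
        t.end();
        t.close();
        // When the chunk boundary happens to fall between "가" and "14th" the
        // final tokens are [가, 14th]; otherwise they are [14, th].
        System.out.println(prev + " | " + last);
    }
}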

@Trey314159 changed the title from "ICU Tokenizer: U+0370 and above affect tokenization of characters after whitespace" to "ICU Tokenizer: letter-space-number-letter tokenized inconsistently" on Jan 22, 2021
@robbertbrak

I just discovered this issue. It looks exactly the same as the issue I reported yesterday against OpenSearch: opensearch-project/OpenSearch#10663.

@cbuescher
Member

This still shows on 8.10, which uses lucene-analysis-icu-9.8.0. I wrote a small test that confirms there's nothing in the Elasticsearch code causing this; the behaviour also appears in isolation when using a plain Lucene ICUTokenizer, which in turn I think uses the ICU library's CompositeBreakIterator. I assume that means the behaviour already manifests itself there. Here's the small unit test:

    public void testLetterSpaceNumberLetters() throws IOException {
        Reader input = new StringReader("x 14th 가 14th 豈 14th Δ 14th Д 14th ख 14th イ 14th D 14th");

        Tokenizer t = new ICUTokenizer();
        t.setReader(input);
        t.reset();
        CharTermAttribute termAtt = t.addAttribute(CharTermAttribute.class);
        // Print one token per line: "14th" is split into "14" and "th" for all
        // occurrences except the first (after Latin "x") and the last (after fullwidth "D").
        while (t.incrementToken()) {
            System.out.println(termAtt.toString());
        }
        t.end();
        t.close();
    }

This shows the same tokenization as reported above from our "_analyze" endpoint.
For this reason I propose closing this issue and opening a corresponding one with Lucene.

@Trey314159
Author

For this reason I propose closing this issue and opening a corresponding one with Lucene.

I should have linked to my Lucene tickets. I suggest keeping this one open so people can find it here—but I understand if the decision is to close it.

I opened a ticket on Jira in 2021 and they have since moved it to GitHub. The most recent comments are that it is behaving "as designed" even though it goes against the examples in the cited Unicode standard.

I think the solution is to reset "the current character set" at whitespace and maybe other boundaries, but they disagree. I'm actually going to start working soon on a post-ICU-tokenizer token filter to repair these cases, but I don't know how that will pan out.
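
A rough sketch of what such a repair filter might look like (the class name NumAlphaRejoinFilter is hypothetical, not an existing Lucene or Elasticsearch filter): it rejoins a <NUM> token with an immediately following <ALPHANUM> token when their offsets are contiguous, as in 14 + th.

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public final class NumAlphaRejoinFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

    private State pendingState;   // one-token lookahead that was peeked but not merged

    public NumAlphaRejoinFilter(TokenStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Take either the token peeked on the previous call or a fresh one.
        if (pendingState != null) {
            restoreState(pendingState);
            pendingState = null;
        } else if (!input.incrementToken()) {
            return false;
        }

        if (!"<NUM>".equals(typeAtt.type())) {
            return true;                       // nothing to repair
        }

        // Current token is a number: peek at the next token.
        String numTerm = termAtt.toString();
        int numStart = offsetAtt.startOffset();
        int numEnd = offsetAtt.endOffset();
        State numState = captureState();

        if (!input.incrementToken()) {
            restoreState(numState);            // stream ended; emit the number as-is
            return true;
        }

        if ("<ALPHANUM>".equals(typeAtt.type()) && offsetAtt.startOffset() == numEnd) {
            // Adjacent number + letters with nothing in between: merge them.
            String merged = numTerm + termAtt.toString();
            int mergedEnd = offsetAtt.endOffset();
            termAtt.setEmpty().append(merged);
            offsetAtt.setOffset(numStart, mergedEnd);
            typeAtt.setType("<ALPHANUM>");
            return true;
        }

        // Not mergeable: keep the peeked token for the next call, emit the number.
        pendingState = captureState();
        restoreState(numState);
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pendingState = null;
    }
}

This only covers the number-then-Latin-letters case shown above; whether rejoining is also the right repair for the ideographic cases is a separate question, and wiring it into an analyzer would need a corresponding factory.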

@cbuescher
Member

I suggest keeping this one open so people can find it here—but I understand if the decision is to close it.

The issue should still be discoverable when closed. We like to keep issues open only when actionable by us, so I'm going to close this one.

@javanna added the Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) label and removed the Team:Search (Meta label for search team) label on Jul 12, 2024