ICU Tokenizer: letter-space-number-letter tokenized inconsistently #27290
cc @elastic/es-search-aggs
Any update on this? I discovered a very annoying Heisenbug caused by an unexpected interaction between this bug and the 4k buffer used to read input for tokenization, which assumes that spaces are unambiguous token breaks. Since this bug reaches across spaces, the 4k chunking matters in rare cases. I had an example that happened to be chunked between the equivalent of 가 and 14th in the example above, which masked the bug. When I changed unrelated text thousands of characters away from 가 14th, it altered the chunking and the bug reemerged, which threw my regression testing into chaos until I figured out the cause.
I just discovered this issue. It looks exactly the same as the issue I reported yesterday against OpenSearch: opensearch-project/OpenSearch#10663.
This still shows on 8.10, which uses lucene-analysis-icu-9.8.0. I wrote a small test confirming that nothing in the Elasticsearch code causes this; the behaviour also appears in isolation when using a plain Lucene ICUTokenizer, which in turn I think uses the ICU library's CompositeBreakIterator, so presumably the behaviour already manifests there. Here's the small unit test:
This shows the same tokenization as reported above from our "_analyze" endpoint.
I should have linked to my Lucene tickets. I suggest keeping this one open so people can find it here, but I understand if the decision is to close it. I opened a ticket on Jira in 2021, and it has since been moved to GitHub. The most recent comments there say the tokenizer is behaving "as designed", even though that goes against the examples in the cited Unicode standard. I think the solution is to reset "the current character set" at whitespace and possibly other boundaries, but they disagree. I'm planning to start work soon on a post-ICU-tokenizer token filter to repair these cases, but I don't know how that will pan out.
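As an illustration only (the real thing would be a Lucene TokenFilter in Java; the function name and `(text, start, end)` tuple format here are hypothetical, not from the original report), such a repair filter could re-join a split number-suffix pair by checking that an all-digit token is immediately followed, with no offset gap, by an alphabetic token:

```python
# Hypothetical sketch of a post-tokenizer repair step: re-join tokens such
# as "14" + "th" that were split only because a foreign-script character
# preceded the whitespace. Tokens are (text, start_offset, end_offset).
def repair_split_number_suffix(tokens):
    repaired = []
    for text, start, end in tokens:
        if (repaired
                and text.isalpha()                 # current token is letters
                and repaired[-1][0].isdigit()      # previous token is digits
                and repaired[-1][2] == start):     # directly adjacent, no gap
            prev_text, prev_start, _ = repaired.pop()
            repaired.append((prev_text + text, prev_start, end))
        else:
            repaired.append((text, start, end))
    return repaired

# "ァ 14th" mis-tokenized as three tokens is repaired back to two:
print(repair_split_number_suffix([("ァ", 0, 1), ("14", 2, 4), ("th", 4, 6)]))
# → [('ァ', 0, 1), ('14th', 2, 6)]
```

A correctly tokenized input such as `[("x", 0, 1), ("14th", 2, 6)]` passes through unchanged, since `"14th"` is neither all digits nor all letters.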
The issue should still be discoverable when closed. We like to keep issues open only when actionable by us, so I'm going to close this one. |
**Elasticsearch version** (`bin/elasticsearch --version`): 5.3.2

**Plugins installed**: [analysis-hebrew, analysis-icu, analysis-smartcn, analysis-stconvert, analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, ltr-query]

**JVM version** (`java -version`):
```
java version "1.7.0_151"
OpenJDK Runtime Environment (IcedTea 2.6.11) (7u151-2.6.11-1~deb8u1)
OpenJDK 64-Bit Server VM (build 24.151-b01, mixed mode)
```

**OS version** (`uname -a` if on a Unix-like system): Linux mediawiki-vagrant 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2 (2017-04-30) x86_64 GNU/Linux

**Description of the problem including expected versus actual behavior**:
2021 Update: This cropped up again recently, and after a bit more testing, I realize that the relevant generalization is that in a letter-space-number-letter sequence, if the writing system before the space is the same as the writing system after the number, then you get two tokens. If the writing systems differ, you get three tokens. So both ァ 14th and th 14ァ (and many other writing systems) are affected.
The tokenization of strings like 14th with the ICU tokenizer is affected by the character before the preceding whitespace.
For example, x 14th is tokenized as x | 14th, while ァ 14th is tokenized as ァ | 14 | th. This holds true for many other characters at U+0370 and above, including Japanese, Korean (가), Chinese (豈), Greek (Δ), Cyrillic (Д), Devanagari (ख), and Futhark (𐊃). It also applies to halfwidth Japanese (イ), but not to fullwidth Latin characters (D).
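A quick way to see the script boundaries involved, using only Python's standard-library `unicodedata` (an illustrative aside, not part of the original report): the Unicode character names show that fullwidth Latin letters are still Latin script while halfwidth kana are Katakana, which is consistent with fullwidth D not triggering the bug while halfwidth イ does:

```python
import unicodedata

# ICU selects break rules per script. The Unicode names hint at the script:
# fullwidth Latin is still LATIN; halfwidth kana is KATAKANA.
for cp in [0x0078, 0xFF24, 0xFF72, 0xAC00, 0x0394]:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
# U+0078 LATIN SMALL LETTER X
# U+FF24 FULLWIDTH LATIN CAPITAL LETTER D
# U+FF72 HALFWIDTH KATAKANA LETTER I
# U+AC00 HANGUL SYLLABLE GA
# U+0394 GREEK CAPITAL LETTER DELTA
```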
The expected behavior is that after one or more spaces, tabs, or newlines, tokenization would not be affected by the characters before the whitespace.
Underscores are also affected: _x and __x are treated differently after a Latin character and a space than after a U+0370+ character and a space.
The Unicode Segmentation Algorithm Demo agrees. Example.
Tokenizing the same string x 14th 가 14th 豈 14th Δ 14th Д 14th ख 14th 𐊃 14th イ 14th D 14th with the ICU tokenizer breaks up "14th" for all but the first and last instances.
Steps to reproduce:
Please include a minimal but complete recreation of the problem, including (e.g.) index creation, mappings, settings, query, etc. The easier you make it for us to reproduce, the more likely that somebody will take the time to look at it.
Create an analyzer `text` as:

```json
{ "type": "custom", "tokenizer": "icu_tokenizer" }
```

Then compare:

```shell
curl -sk localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "text", "text" : "x 14th" }'
curl -sk localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "text", "text" : "ァ 14th" }'
```