ngram #110084
Comments
Pinging @elastic/es-search (Team:Search)
@S-Dragon0302 Would you please let us know what the problem is you are encountering? I'm going to remove the "bug" label for now as I don't see what's missing. Also keep in mind that if this is a language-specific problem, the language-specific discuss forums (https://discuss.elastic.co/c/in-your-native-tongue/11) might be a good place to ask.
The segmentation result is incorrect. The tokens generated from segmentation have no value; actually, there should be a result.
The segmentation result should be the full token list shown in the Problem Description below.
The actual result is the empty token list.
For the given text:
是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ
the tokenizer results in single-character tokens: 是, 不, 是, 发, 现, 我, 的, 字, 冒, 烟, 了. None of those are longer than one character, as the sketch below confirms.
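A minimal way to confirm this, assuming the index from the Problem Description below has been created, is to run _analyze with the tokenizer alone (no ngram filter). The combining mark ྂ that follows each character is a Unicode nonspacing mark (category Mn), which is neither \p{L} nor \p{N}, so the pattern [^\p{L}\p{N}]+ treats it as a separator:

GET /my_index/_analyze
{
  "tokenizer": "letter_digit_tokenizer",
  "text": "是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ"
}

This should return one single-character token per CJK character (是, 不, 是, 发, 现, 我, 的, 字, 冒, 烟, 了), all of which fall below min_gram: 2 and are therefore discarded by my_ngram_filter.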
Closing as expected behavior. Filtering that requires 2-character ngrams when the input tokens are only 1 character long is expected.
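One possible workaround, sketched here as an untested suggestion (the index name and the strip_marks filter name are illustrative, not part of the original report): strip the combining marks with a pattern_replace character filter before tokenizing, so the CJK characters form one contiguous token that the ngram filter can split into 2-grams:

PUT /my_index_stripped
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_marks": {
          "type": "pattern_replace",
          "pattern": "\\p{M}",
          "replacement": ""
        }
      },
      "tokenizer": {
        "letter_digit_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\p{L}\\p{N}]+"
        }
      },
      "filter": {
        "my_ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2
        }
      },
      "analyzer": {
        "my_letter_digit_ngram_analyzer": {
          "type": "custom",
          "char_filter": ["strip_marks"],
          "tokenizer": "letter_digit_tokenizer",
          "filter": ["lowercase", "my_ngram_filter"]
        }
      }
    }
  }
}

With the marks removed, the sample text becomes 是不是发现我的字冒烟了, and the analyzer should then produce the 2-character tokens the reporter expected (是不, 不是, 是发, …).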
Elasticsearch Version
7.15.1
Installed Plugins
No response
Java Version
bundled
OS Version
mac
Problem Description
With the index settings below, analyzing the sample text returns an empty token list, while a list of 2-character ngram tokens is expected.
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "letter_digit_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\p{L}\\p{N}]+"
        }
      },
      "filter": {
        "my_ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2
        }
      },
      "analyzer": {
        "my_letter_digit_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "letter_digit_tokenizer",
          "filter": [
            "lowercase",
            "my_ngram_filter"
          ]
        }
      }
    }
  }
}
GET /my_index/_analyze
{
  "analyzer": "my_letter_digit_ngram_analyzer",
  "text": "是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ"
}
The actual result is empty:
{
  "tokens" : [ ]
}
The expected result is:
{
  "tokens": [
    { "token": "是不", "start_offset": 0,  "end_offset": 3,  "type": "word", "position": 0 },
    { "token": "不是", "start_offset": 2,  "end_offset": 5,  "type": "word", "position": 1 },
    { "token": "是发", "start_offset": 4,  "end_offset": 7,  "type": "word", "position": 2 },
    { "token": "发现", "start_offset": 6,  "end_offset": 9,  "type": "word", "position": 3 },
    { "token": "现我", "start_offset": 8,  "end_offset": 11, "type": "word", "position": 4 },
    { "token": "我的", "start_offset": 10, "end_offset": 13, "type": "word", "position": 5 },
    { "token": "的字", "start_offset": 12, "end_offset": 15, "type": "word", "position": 6 },
    { "token": "字冒", "start_offset": 14, "end_offset": 17, "type": "word", "position": 7 },
    { "token": "冒烟", "start_offset": 16, "end_offset": 19, "type": "word", "position": 8 },
    { "token": "烟了", "start_offset": 18, "end_offset": 21, "type": "word", "position": 9 }
  ]
}
Steps to Reproduce
1. Create the index with the settings shown in the Problem Description above.
2. Run the same _analyze request against the index.
3. Observe that the returned token list is empty instead of the expected 2-character ngram tokens.
Logs (if relevant)
No response