ngram #110084
Comments
Pinging @elastic/es-search (Team:Search)
@S-Dragon0302 Would you please let us know what the problem is you are encountering? I'm going to remove the "bug" label for now as I don't see what's missing. Also keep in mind that if this is a language-specific problem, the language-specific discuss forums (https://discuss.elastic.co/c/in-your-native-tongue/11) might be a good place to ask.
The segmentation result is incorrect. The tokens generated from segmentation have no value; actually, there should be a result.
The segmentation result should be the full token list shown in the Problem Description below.
The actual result is the empty token list.
For the given text:
是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ
the tokenizer results in single-character tokens: 是, 不, 是, 发, 现, 我, 的, 字, 冒, 烟, 了. None of those are longer than one character, as the sketch below confirms.
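A minimal way to confirm this, assuming the index from the Problem Description below has been created, is to run _analyze with the tokenizer alone (no ngram filter). The combining mark ྂ that follows each character is a Unicode nonspacing mark (category Mn), which is neither \p{L} nor \p{N}, so the pattern [^\p{L}\p{N}]+ treats it as a separator:

GET /my_index/_analyze
{
  "tokenizer": "letter_digit_tokenizer",
  "text": "是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ"
}

This should return one single-character token per CJK character (是, 不, 是, 发, 现, 我, 的, 字, 冒, 烟, 了), all of which fall below min_gram: 2 and are therefore discarded by my_ngram_filter.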
Closing as expected behavior. Filtering that requires 2-character ngrams when the input tokens are only 1 character long is expected.
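One possible workaround, sketched here as an untested suggestion (the index name and the strip_marks filter name are illustrative, not part of the original report): strip the combining marks with a pattern_replace character filter before tokenizing, so the CJK characters form one contiguous token that the ngram filter can split into 2-grams:

PUT /my_index_stripped
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_marks": {
          "type": "pattern_replace",
          "pattern": "\\p{M}",
          "replacement": ""
        }
      },
      "tokenizer": {
        "letter_digit_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\p{L}\\p{N}]+"
        }
      },
      "filter": {
        "my_ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2
        }
      },
      "analyzer": {
        "my_letter_digit_ngram_analyzer": {
          "type": "custom",
          "char_filter": ["strip_marks"],
          "tokenizer": "letter_digit_tokenizer",
          "filter": ["lowercase", "my_ngram_filter"]
        }
      }
    }
  }
}

With the marks removed, the sample text becomes 是不是发现我的字冒烟了, and the analyzer should then produce the 2-character tokens the reporter expected (是不, 不是, 是发, …).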
Elasticsearch Version
7.15.1
Installed Plugins
No response
Java Version
bundled
OS Version
mac
Problem Description
With the index settings below, analyzing the sample text returns an empty token list, while a list of 2-character ngram tokens is expected.
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "letter_digit_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\p{L}\\p{N}]+"
        }
      },
      "filter": {
        "my_ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2
        }
      },
      "analyzer": {
        "my_letter_digit_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "letter_digit_tokenizer",
          "filter": [
            "lowercase",
            "my_ngram_filter"
          ]
        }
      }
    }
  }
}
GET /my_index/_analyze
{
  "analyzer": "my_letter_digit_ngram_analyzer",
  "text": "是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ"
}
The actual result is empty:
{
  "tokens" : [ ]
}
The expected result is:
{
  "tokens": [
    { "token": "是不", "start_offset": 0,  "end_offset": 3,  "type": "word", "position": 0 },
    { "token": "不是", "start_offset": 2,  "end_offset": 5,  "type": "word", "position": 1 },
    { "token": "是发", "start_offset": 4,  "end_offset": 7,  "type": "word", "position": 2 },
    { "token": "发现", "start_offset": 6,  "end_offset": 9,  "type": "word", "position": 3 },
    { "token": "现我", "start_offset": 8,  "end_offset": 11, "type": "word", "position": 4 },
    { "token": "我的", "start_offset": 10, "end_offset": 13, "type": "word", "position": 5 },
    { "token": "的字", "start_offset": 12, "end_offset": 15, "type": "word", "position": 6 },
    { "token": "字冒", "start_offset": 14, "end_offset": 17, "type": "word", "position": 7 },
    { "token": "冒烟", "start_offset": 16, "end_offset": 19, "type": "word", "position": 8 },
    { "token": "烟了", "start_offset": 18, "end_offset": 21, "type": "word", "position": 9 }
  ]
}
Steps to Reproduce
1. Create the index with the settings shown in the Problem Description above.
2. Run the same _analyze request against the index.
3. Observe that the returned token list is empty instead of the expected 2-character ngram tokens.
Logs (if relevant)
No response