position inconsistency when using _analyze API with or without index name #29021

dadoonet · 2018-03-13T16:32:23Z

Elasticsearch version (bin/elasticsearch --version): 6.2.2
Steps to reproduce:

Consider the following:

DELETE temp
PUT temp 
GET temp/_analyze
{
  "analyzer": "standard", 
  "text": [ 
    "a", "b"
  ]
}

It gives:

{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "b",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 101
    }
  ]
}

The position of the first term of the second text is 101, which is correct: we don't want a phrase query such as "a b" to match across the two separate texts.
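The arithmetic behind 101 can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch code: the function name `analyze_values`, whitespace splitting as a stand-in for the standard analyzer, and the offset gap of 1 are assumptions made for illustration. The gap of 100 is the value implied by the output above.

```python
def analyze_values(values, position_increment_gap=100, offset_gap=1):
    """Simplified model of analyzing a multi-valued text (illustrative only)."""
    tokens = []
    last_position = -1  # so that the first token lands on position 0
    base_offset = 0     # offset of the current value within the whole input
    for value in values:
        cursor = 0
        for word in value.split():
            start = value.index(word, cursor)
            end = start + len(word)
            cursor = end
            last_position += 1
            tokens.append({
                "token": word,
                "start_offset": base_offset + start,
                "end_offset": base_offset + end,
                "position": last_position,
            })
        # Between two values, positions jump by the gap and offsets by offset_gap.
        last_position += position_increment_gap
        base_offset += len(value) + offset_gap
    return tokens


for token in analyze_values(["a", "b"]):
    print(token)
```

With the assumed gap of 100, "b" lands on position 0 + 100 + 1 = 101. Passing `position_increment_gap=0` gives position 1 instead.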

Now the same _analyze but with no index name:

GET _analyze
{
  "analyzer": "standard", 
  "text": [ 
    "a", "b"
  ]
}

It gives:

{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "b",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

The position of the first term of the second text is 1, which is incorrect.
It gives the impression that a match phrase query for "a b" could work.
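To see why the reported position matters, here is a minimal sketch of exact phrase matching over (token, position) pairs. The helper `phrase_matches` is hypothetical and only models the idea that phrase terms must sit at consecutive positions. It is not how Lucene implements phrase queries.

```python
def phrase_matches(tokens, phrase):
    """Return True if the phrase terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms:
        return False
    # Map each token to the set of positions where it occurs.
    positions = {}
    for token, position in tokens:
        positions.setdefault(token, set()).add(position)
    # Try every occurrence of the first term as a starting point.
    return any(
        all(start + i in positions.get(term, ()) for i, term in enumerate(terms))
        for start in positions.get(terms[0], ())
    )


with_index = [("a", 0), ("b", 101)]   # positions reported by GET temp/_analyze
without_index = [("a", 0), ("b", 1)]  # positions reported by GET _analyze

print(phrase_matches(with_index, "a b"))     # False
print(phrase_matches(without_index, "a b"))  # True
```

Under this model, only the positions returned without an index name suggest that the phrase would match.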

This is not critical, since indexing a document always happens within an index, but I feel that the _analyze API is giving inconsistent results. Maybe we should fix it?

cc @jpountz


elasticmachine · 2018-03-13T16:32:26Z

Pinging @elastic/es-search-aggs

umeshdangat · 2018-03-16T01:53:32Z

The difference in positionIncrementGap arises because the index-based case uses NamedAnalyzer, which overrides getPositionIncrementGap(), whereas the non-index case uses StandardAnalyzer directly from Lucene.

The code that makes this decision is here

Lucene defaults positionIncrementGap to 0 in all analyzers. If we want non-default Lucene behavior for all analyzers, the code for setupAnalyzers might need to change so that it returns not vanilla Lucene analyzers but ones that override getPositionIncrementGap().
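The wrapping idea can be illustrated with a small Python model. Everything here is hypothetical: `PlainAnalyzer`, `GapOverridingAnalyzer`, and `positions` are stand-ins for a vanilla Lucene analyzer, a NamedAnalyzer-style wrapper, and the position bookkeeping. The gap of 100 is taken from the output earlier in this issue. This is a sketch of the concept, not the actual Elasticsearch or Lucene classes.

```python
class PlainAnalyzer:
    """Stand-in for a vanilla analyzer (hypothetical, for illustration)."""

    def tokenize(self, text):
        return text.lower().split()

    def position_increment_gap(self):
        return 0  # the Lucene default described above


class GapOverridingAnalyzer:
    """Stand-in for a wrapper that delegates analysis but overrides the gap."""

    def __init__(self, delegate, gap=100):
        self._delegate = delegate
        self._gap = gap

    def tokenize(self, text):
        return self._delegate.tokenize(text)

    def position_increment_gap(self):
        return self._gap


def positions(analyzer, values):
    """Assign positions across several values, adding the gap between them."""
    result = []
    last = -1
    for value in values:
        for token in analyzer.tokenize(value):
            last += 1
            result.append((token, last))
        last += analyzer.position_increment_gap()
    return result


print(positions(PlainAnalyzer(), ["a", "b"]))                         # [('a', 0), ('b', 1)]
print(positions(GapOverridingAnalyzer(PlainAnalyzer()), ["a", "b"]))  # [('a', 0), ('b', 101)]
```

The wrapper leaves tokenization untouched and changes only the gap, which is the sole difference between the two outputs reported in this issue.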

I am not very familiar with this code, so I am not sure whether what I suggest is desired or even correct behavior.

cc: @jpountz @dadoonet

When no index is specified on an analyze request, the code that builds the analysis chain for that request goes directly to the analysis registry to check for pre-built analyzers. This can cause inconsistencies because the Elasticsearch defaults for various analysis settings (e.g. the position increment gap) differ from the Lucene defaults. To remove these inconsistencies, this commit builds a one-off IndexAnalyzers object when none is provided. This means that all analyzers accessed through an analysis request will use Elasticsearch defaults. Fixes elastic#29021

elasticsearchmachine · 2024-03-08T20:41:05Z

Pinging @elastic/es-search (Team:Search)

dadoonet added the >bug and :Search/Analysis labels Mar 13, 2018

ontology-rory mentioned this issue Jul 11, 2018

Position offsets incorrect on analysis - not getting results from a phrase query using query_string #31966

Closed

romseygeek mentioned this issue Apr 3, 2019

Always use IndexAnalyzers in analyze transport action #40769

Closed

romseygeek self-assigned this Apr 3, 2019

rjernst added the Team:Search label May 4, 2020

benwtrent unassigned romseygeek Mar 8, 2024
