position inconsistency when using _analyze API with or without index name #29021

Open
dadoonet opened this issue Mar 13, 2018 · 3 comments
Labels: >bug, :Search/Analysis, Team:Search

Comments

@dadoonet
Member

Elasticsearch version (bin/elasticsearch --version): 6.2.2
Steps to reproduce:

Consider the following requests:

DELETE temp
PUT temp 
GET temp/_analyze
{
  "analyzer": "standard", 
  "text": [ 
    "a", "b"
  ]
}

It gives:

{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "b",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 101
    }
  ]
}

The first term of the second text gets position 101, which is correct: the position increment gap keeps a phrase query such as "a b" from matching across the two values.
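The arithmetic behind 101 can be sketched in a few lines of Python. This is only a simplified model, not Elasticsearch or Lucene code: `assign_positions` is a hypothetical helper, whitespace splitting stands in for the standard analyzer, and the gap of 100 is Elasticsearch's default `position_increment_gap` for text fields.

```python
def assign_positions(values, position_increment_gap):
    """Assign token positions across the values of a multi-valued text field.

    Each token advances the position by 1; the gap is added once
    between consecutive values.
    """
    tokens = []
    position = -1  # so that the first token lands on position 0
    for index, value in enumerate(values):
        if index > 0:
            position += position_increment_gap
        for term in value.split():  # whitespace split stands in for the analyzer
            position += 1
            tokens.append((term, position))
    return tokens


print(assign_positions(["a", "b"], 100))  # [('a', 0), ('b', 101)]
print(assign_positions(["a", "b"], 0))    # [('a', 0), ('b', 1)]
```

With a gap of 100 the second value starts at 0 + 100 + 1 = 101; with a gap of 0 it starts at 1.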

Now the same _analyze but with no index name:

GET _analyze
{
  "analyzer": "standard", 
  "text": [ 
    "a", "b"
  ]
}

It gives:

{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "b",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

Here the first term of the second text gets position 1, which is incorrect.
It gives the impression that a match phrase query on "a b" would match.
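To make the concern concrete, here is a small Python sketch of an exact phrase check over token positions. It is an illustration under simplifying assumptions (no slop, a hypothetical `phrase_matches` helper), not how Lucene implements phrase queries. The positions are copied from the two `_analyze` responses above.

```python
def phrase_matches(positions, phrase):
    """Return True if the terms of `phrase` occur at consecutive positions.

    `positions` maps each term to the list of positions where it occurs.
    """
    first, *rest = phrase
    for start in positions.get(first, []):
        if all(start + offset in positions.get(term, [])
               for offset, term in enumerate(rest, start=1)):
            return True
    return False


with_index = {"a": [0], "b": [101]}   # positions from GET temp/_analyze
without_index = {"a": [0], "b": [1]}  # positions from GET _analyze

print(phrase_matches(with_index, ["a", "b"]))     # False
print(phrase_matches(without_index, ["a", "b"]))  # True
```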

This is not critical, since documents are always indexed into an index, but the _analyze API is giving inconsistent results. Maybe we should fix it?

cc @jpountz

@dadoonet added the >bug and :Search/Analysis labels on Mar 13, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@umeshdangat
Contributor

The positionIncrementGap differs because the index-based case uses NamedAnalyzer, which overrides getPositionIncrementGap(), whereas the non-index case uses Lucene's StandardAnalyzer directly.

The code that makes this decision is here

Lucene defaults positionIncrementGap to 0 in all analyzers, so if we want non-default behavior for every analyzer, the code for setupAnalyzers might need to change so that it returns analyzers that override getPositionIncrementGap rather than vanilla Lucene analyzers.
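The wrapping idea can be modelled roughly like this in Python. This is not the Elasticsearch or Lucene source: `PlainAnalyzer` and `GapOverridingAnalyzer` are hypothetical stand-ins for a stock Lucene analyzer and for a wrapper in the spirit of NamedAnalyzer, and the value 100 is assumed from the index-based output in the issue.

```python
class PlainAnalyzer:
    """Stands in for a stock Lucene analyzer: the gap defaults to 0."""

    def get_position_increment_gap(self, field_name):
        return 0


class GapOverridingAnalyzer:
    """Stands in for a wrapper in the spirit of NamedAnalyzer: it wraps
    another analyzer and reports its own position increment gap."""

    def __init__(self, delegate, position_increment_gap):
        self.delegate = delegate
        self.position_increment_gap = position_increment_gap

    def get_position_increment_gap(self, field_name):
        # Ignore the delegate's gap and return the configured one.
        return self.position_increment_gap


plain = PlainAnalyzer()
wrapped = GapOverridingAnalyzer(plain, position_increment_gap=100)

print(plain.get_position_increment_gap("text"))    # 0
print(wrapped.get_position_increment_gap("text"))  # 100
```

If both the index and non-index paths returned the wrapped form, the two `_analyze` calls would report the same positions.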

I am not super familiar with this code, so I am not sure whether what I suggest is desired or even correct behavior.

cc: @jpountz @dadoonet

romseygeek added a commit to romseygeek/elasticsearch that referenced this issue Apr 3, 2019
When no index is specified on an analyze request, the code that builds
the analysis chain for that request goes directly to the analysis registry
to check for pre-built analyzers. This can cause inconsistencies because
the Elasticsearch defaults for various analysis settings (e.g.
the position increment gap) differ from the Lucene defaults.

To remove these inconsistencies, this commit builds a one-off IndexAnalyzers
object when none is provided. This means that all analyzers accessed
through an analysis request will use Elasticsearch defaults.

Fixes elastic#29021
@romseygeek self-assigned this on Apr 3, 2019
@rjernst added the Team:Search label on May 4, 2020
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

6 participants