word_delimiter_graph combined with length filter produces array_index_out_of_bounds_exception during query building #46272

Open
xabbu42 opened this issue Sep 3, 2019 · 12 comments
Labels: >bug · priority:high · :Search Relevance/Analysis · Team:Search Relevance

@xabbu42 (Contributor) commented Sep 3, 2019

Elasticsearch version (bin/elasticsearch --version):
Version: 7.3.0, Build: default/tar/de777fa/2019-07-24T18:30:11.767338Z, JVM: 1.8.0_212

I also tested the reproduction below on 7.3.1.

Plugins installed: [analysis-icu]

JVM version (java -version):
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (IcedTea 3.12.0) (build 1.8.0_212-b4 suse-2.2-x86_64)
OpenJDK 64-Bit Server VM (build 25.212-b04, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux duktig 5.1.16-1-default #1 SMP Wed Jul 3 12:37:47 UTC 2019 (2af8a22) x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Building the query for the index settings and query below fails with array_index_out_of_bounds_exception. It should build a query as it did with Elasticsearch 6.8 (I did not test the exact steps below on 6.8, but the original error this script is based on did not occur on 6.8).

Steps to reproduce:

curl -XDELETE 'http://localhost:9200/testdb'
curl -XPUT 'http://localhost:9200/testdb' -H "Content-Type: application/json" -d '
{
  "settings": { 
    "analysis": {
      "filter": {
        "words" : {
          "type": "word_delimiter_graph",
          "catenate_all": true,
          "catenate_numbers": true,
          "catenate_words": true
        },
        "length": {"type": "length", "min": 2}
      },
	  "analyzer": {"default": {"type": "custom", "filter": ["words", "length"], "tokenizer": "whitespace"}}
    }
  }
}'

curl -XPOST 'http://localhost:9200/testdb/_doc' -H "Content-Type: application/json" -d '{"field": "value"}'

curl -XPOST 'http://localhost:9200/testdb/_refresh'

curl 'http://localhost:9200/testdb/_search?q=g2+foo'
@martijnvg added the :Search Relevance/Analysis label on Sep 3, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-search

@martijnvg added the >bug label on Sep 3, 2019
@martijnvg (Member)

Thanks for reporting this issue @xabbu42, this should not result in an error.

In the master branch this currently results in an assertion error being thrown:

java.lang.AssertionError: state=0 nextState=0
[elasticsearch]         at org.apache.lucene.util.automaton.Automaton.initTransition(Automaton.java:476) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]
[elasticsearch]         at org.apache.lucene.util.graph.GraphTokenStreamFiniteStrings.hasSidePath(GraphTokenStreamFiniteStrings.java:99) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]
[elasticsearch]         at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.analyzeGraphBoolean(MatchQuery.java:664) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.createFieldQuery(MatchQuery.java:415) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.createQuery(MatchQuery.java:456) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.createFieldQuery(MatchQuery.java:336) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.apache.lucene.util.QueryBuilder.createBooleanQuery(QueryBuilder.java:96) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]
[elasticsearch]         at org.elasticsearch.index.search.MatchQuery.parseInternal(MatchQuery.java:266) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.search.MatchQuery.parse(MatchQuery.java:259) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.search.MultiMatchQuery.buildFieldQueries(MultiMatchQuery.java:108) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.search.MultiMatchQuery.parse(MultiMatchQuery.java:76) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.search.QueryStringQueryParser.getFieldQuery(QueryStringQueryParser.java:337) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.apache.lucene.queryparser.classic.QueryParser.MultiTerm(QueryParser.java:637) ~[lucene-queryparser-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:06:42]
[elasticsearch]         at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:226) ~[lucene-queryparser-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:06:42]
[elasticsearch]         at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:215) ~[lucene-queryparser-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:06:42]
[elasticsearch]         at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:109) ~[lucene-queryparser-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:06:42]
[elasticsearch]         at org.elasticsearch.index.search.QueryStringQueryParser.parse(QueryStringQueryParser.java:791) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.query.QueryStringQueryBuilder.doToQuery(QueryStringQueryBuilder.java:907) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:99) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.query.QueryShardContext.lambda$toQuery$1(QueryShardContext.java:278) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:290) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:277) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.search.SearchService.parseSource(SearchService.java:738) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.search.SearchService.createContext(SearchService.java:586) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:545) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:348) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:340) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:146) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.search.SearchService.lambda$rewriteShardRequest$7(SearchService.java:1043) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:45) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:769) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
[elasticsearch]         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
[elasticsearch]         at java.lang.Thread.run(Thread.java:835) [?:?]

@null-pointer-byte commented Sep 14, 2019

@martijnvg I am facing the same issue. I have tested both v7.3.2 and the master branch: both produce java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0.

Please note that curl 'http://localhost:9200/testdb/_search?q=g2+fo' will result in an error, while curl 'http://localhost:9200/testdb/_search?q=g2+f' will not.

Logs (this looks like a Lucene bug):

org.elasticsearch.index.query.QueryShardException: failed to create query: {
  "query_string" : {
    "query" : "g2 fo",
    "fields" : [ ],
    "type" : "best_fields",
    "default_operator" : "or",
    "max_determinized_states" : 10000,
    "enable_position_increments" : true,
    "fuzziness" : "AUTO",
    "fuzzy_prefix_length" : 0,
    "fuzzy_max_expansions" : 50,
    "phrase_slop" : 0,
    "analyze_wildcard" : false,
    "escape" : false,
    "auto_generate_synonyms_phrase_query" : true,
    "fuzzy_transpositions" : true,
    "boost" : 1.0
  }
}
	at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:305) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:288) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.parseSource(SearchService.java:745) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.createContext(SearchService.java:586) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:545) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:348) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:340) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:146) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.lambda$rewriteShardRequest$7(SearchService.java:1055) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:45) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:769) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
	at org.apache.lucene.util.QueryBuilder.newSynonymQuery(QueryBuilder.java:653) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]
	at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.analyzeGraphBoolean(MatchQuery.java:694) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.createFieldQuery(MatchQuery.java:415) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.createQuery(MatchQuery.java:456) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.createFieldQuery(MatchQuery.java:336) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.apache.lucene.util.QueryBuilder.createBooleanQuery(QueryBuilder.java:96) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]
	at org.elasticsearch.index.search.MatchQuery.parseInternal(MatchQuery.java:266) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.search.MatchQuery.parse(MatchQuery.java:259) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.search.MultiMatchQuery.buildFieldQueries(MultiMatchQuery.java:108) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.search.MultiMatchQuery.parse(MultiMatchQuery.java:76) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.search.QueryStringQueryParser.getFieldQuery(QueryStringQueryParser.java:337) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.apache.lucene.queryparser.classic.QueryParser.MultiTerm(QueryParser.java:637) ~[lucene-queryparser-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:06:42]
	at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:226) ~[lucene-queryparser-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:06:42]
	at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:215) ~[lucene-queryparser-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:06:42]
	at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:109) ~[lucene-queryparser-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:06:42]
	at org.elasticsearch.index.search.QueryStringQueryParser.parse(QueryStringQueryParser.java:791) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.query.QueryStringQueryBuilder.doToQuery(QueryStringQueryBuilder.java:907) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:99) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.query.QueryShardContext.lambda$toQuery$1(QueryShardContext.java:289) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:301) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	... 17 more

@cbuescher (Member)

Just tried on 7.4.0 and narrowed this down a bit more: the exception reproduces with the above script, but only if "catenate_all": true is set in the word_delimiter_graph filter, and only in combination with the first search term being of length 2 (which relates to the length filter, of course) and ending in a number. That is to say, all of the below work (and return zero results):

GET /testdb/_search?q=foo
GET /testdb/_search?q=ga2+foo
GET /testdb/_search?q=gaaaa2+foo ...

Also, changing the order of the search terms doesn't trigger the exception:

GET /testdb/_search?q=foo+g2

The code path where the exception is triggered expects a Term[] of length > 0, which we even assert when assertions are enabled. This assumption seems to be violated in this particular case.
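
To illustrate why an empty Term[] blows up there (a minimal sketch, not the actual Lucene or Elasticsearch source): the stack traces show analyzeGraphBoolean handing the terms collected for a graph position to QueryBuilder.newSynonymQuery as a Term[], and any synonym-style query construction that reads the field name from the first entry fails once that array is empty:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SynonymQuery;

final class SynonymQueryIllustration {
    // Illustrative only: if a later filter deleted every token at this graph
    // position, `terms` is empty and terms[0] throws
    // java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
    static Query synonymQueryFrom(Term[] terms) {
        SynonymQuery.Builder builder = new SynonymQuery.Builder(terms[0].field());
        for (Term term : terms) {
            builder.addTerm(term);
        }
        return builder.build();
    }
}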

@cbuescher (Member) commented Mar 31, 2020

This came up again in #54434, so adding a summary of how this manifests in 7.6.

Stack trace on 7.6.1:

Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
	at org.apache.lucene.util.QueryBuilder.newSynonymQuery(QueryBuilder.java:653) ~[lucene-core-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:14]
	at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.analyzeGraphBoolean(MatchQuery.java:744) ~[elasticsearch-7.6.1.jar:7.6.1]
[...]

Basically this error can happen whenever a filter produces a token stream with a graph structure (here the word_delimiter_graph filter), e.g. by splitting something like "3d" on numerics while also keeping the original (here "3", "d" and "3d"), and a subsequent filter removes some of those tokens (here the length filter). The corresponding token stream is turned into an Automaton inside GraphTokenStreamFiniteStrings in MatchQueryBuilder#analyzeGraphBoolean, and depending on the incoming token stream this automaton is rewritten to something that produces no terms in https://github.com/elastic/elasticsearch/blob/7.6/server/src/main/java/org/elasticsearch/index/search/MatchQuery.java#L738. The assertion in the following line shows that this is not expected to happen, and this finally leads to the ArrayIndexOutOfBoundsException in QueryBuilder.newSynonymQuery.

Short reproduction on 7.6.1:

DELETE test

PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "length_min_2": {
          "type": "length",
          "min": 2
        },
        "word_split_product_number": {
          "type": "word_delimiter_graph",
          "catenate_all" :true
        }
      },
      "analyzer": {
        "word_split_product_number_analyzer": {
          "filter": [
            "word_split_product_number",
            "length_min_2"
          ],
          "tokenizer": "standard"          
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "productNumber": {
        "type": "text",
        "analyzer": "word_split_product_number_analyzer"
      }
    }
  }
}

GET test/_search?error_trace=true
{
  "query": {
    "match": {
      "productNumber": {
        "query": "3d printer"
      }
    }
  }
}

The token streams produced by the word delimiter filter and the length filter look like this:

"tokenfilters" : [
      {
        "name" : "word_split_product_number",
        "tokens" : [
          {
            "token" : "3d",
            "start_offset" : 0,
            "end_offset" : 2,
            "type" : "<ALPHANUM>",
            "position" : 0,
            "positionLength" : 2,
            "bytes" : "[33 64]",
            "keyword" : false,
            "positionLength" : 2,
            "termFrequency" : 1
          },
          {
            "token" : "3",
            "start_offset" : 0,
            "end_offset" : 1,
            "type" : "<ALPHANUM>",
            "position" : 0,
            "bytes" : "[33]",
            "keyword" : false,
            "positionLength" : 1,
            "termFrequency" : 1
          },
          {
            "token" : "d",
            "start_offset" : 1,
            "end_offset" : 2,
            "type" : "<ALPHANUM>",
            "position" : 1,
            "bytes" : "[64]",
            "keyword" : false,
            "positionLength" : 1,
            "termFrequency" : 1
          },
          {
            "token" : "printer",
            "start_offset" : 3,
            "end_offset" : 10,
            "type" : "<ALPHANUM>",
            "position" : 2,
            "bytes" : "[70 72 69 6e 74 65 72]",
            "keyword" : false,
            "positionLength" : 1,
            "termFrequency" : 1
          }
        ]
      },
      {
        "name" : "length_min_2",
        "tokens" : [
          {
            "token" : "3d",
            "start_offset" : 0,
            "end_offset" : 2,
            "type" : "<ALPHANUM>",
            "position" : 0,
            "positionLength" : 2,
            "bytes" : "[33 64]",
            "keyword" : false,
            "positionLength" : 2,
            "termFrequency" : 1
          },
          {
            "token" : "printer",
            "start_offset" : 3,
            "end_offset" : 10,
            "type" : "<ALPHANUM>",
            "position" : 2,
            "bytes" : "[70 72 69 6e 74 65 72]",
            "keyword" : false,
            "positionLength" : 1,
            "termFrequency" : 1
          }
        ]
      }
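
For reference, the detailed per-filter token output above is what the _analyze API returns with "explain" enabled; a request along these lines, using the test index and analyzer from the reproduction above, should produce it:

GET test/_analyze
{
  "analyzer": "word_split_product_number_analyzer",
  "text": "3d printer",
  "explain": true
}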

I will need to discuss this some more and understand how the underlying automaton gets constructed to see if we have a chance of detecting this early or even mitigating it.

@cbuescher (Member)

Some intermediate results: The way we build the Automaton doesn't always deal well with deleted tokens.

Without the length filter deleting the tokens, the Automaton we build in GraphTokenStreamFiniteStrings has the following states and transitions for the four tokens:


[0] ------ "3d -----------> [2]  --- "printer" ---> [3]
 \                          / 
  \-- "3" --> [1] -- "d" -/

When the two single-character tokens ("3" and "d") get deleted, we currently create four states with the following transitions:

[0] --- "3d ---> [2]

          [1] ---- "printer ----> [3]

Later, after removing dead states, the Automaton is empty, leading to the exception further on in our code.

Dealing with deletions in token streams for graphs in a general way seems challenging. There are some open issues like https://issues.apache.org/jira/browse/LUCENE-8717 that might be able to address this problem in the future.
For this particular case of both tokens being deleted from the side path, I tried a small change in GraphTokenStreamFiniteStrings that detects when a token with position length N is directly followed by another token with a position increment of N, which indicates that there must have been deletions. This fixes this particular scenario but might not help with others (cbuescher/lucene-solr@55b9c71 and related ES tests in cbuescher@5f35bfa).

We might be able to detect when the built Automaton is empty although the token stream isn't, and throw a different exception in that case, but I'm not sure how helpful that would be compared to the current behaviour. A rough sketch of that idea follows below.
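
A hypothetical sketch of such a guard (not actual Elasticsearch or Lucene code, just the shape of the check, assuming access to the built automaton and using Lucene's Operations.isEmpty): if the graph automaton accepts nothing although the token stream was not empty, fail with an explicit message instead of hitting the ArrayIndexOutOfBoundsException deep inside query building.

import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.Operations;

final class GraphSanityCheck {
    // Hypothetical guard: reject an empty token graph with a descriptive error
    // instead of letting query building fail on an empty Term[].
    static void ensureNotEmpty(Automaton graph, boolean tokenStreamWasEmpty) {
        if (tokenStreamWasEmpty == false && Operations.isEmpty(graph)) {
            throw new IllegalArgumentException(
                "the analyzed token graph became empty, most likely because a "
                    + "token filter (e.g. length) removed tokens produced by a "
                    + "graph filter such as word_delimiter_graph");
        }
    }
}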

@rjernst added the Team:Search label on May 4, 2020
@seanbirdsell

We're continuing to see this error in our production environment. Any word on when it will be addressed?

@gjobin commented Nov 23, 2021

Very interested in this issue as well, as we hit it frequently in production.

@akhgeek30 commented Oct 20, 2022

Is this related to apache/lucene#11864?

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search (Team:Search)

@xabbu42 (Contributor, Author) commented Sep 19, 2023

This is still an issue with Elasticsearch 8.6.2. I had to add a _refresh to the script to reproduce it, though.

@javanna added the Team:Search Relevance label and removed the Team:Search label on Jul 12, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@javanna added the priority:high label on Jul 18, 2024