word_delimiter_graph combined with length filter produces array_index_out_of_bounds_exception during query building #46272

Open
xabbu42 opened this issue Sep 3, 2019 · 12 comments
Labels: >bug · priority:high · :Search Relevance/Analysis · Team:Search Relevance

@xabbu42 (Contributor) commented Sep 3, 2019

Elasticsearch version (bin/elasticsearch --version):
Version: 7.3.0, Build: default/tar/de777fa/2019-07-24T18:30:11.767338Z, JVM: 1.8.0_212

I also tested the reproduction below on 7.3.1.

Plugins installed: [analysis-icu]

JVM version (java -version):
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (IcedTea 3.12.0) (build 1.8.0_212-b4 suse-2.2-x86_64)
OpenJDK 64-Bit Server VM (build 25.212-b04, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux duktig 5.1.16-1-default #1 SMP Wed Jul 3 12:37:47 UTC 2019 (2af8a22) x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Building the query for the index settings and query below fails with array_index_out_of_bounds_exception. It should build a query as it did with Elasticsearch 6.8 (I did not test the exact steps below on 6.8, but the original error this script is based on did not occur on 6.8).

Steps to reproduce:

curl -XDELETE 'http://localhost:9200/testdb'
curl -XPUT 'http://localhost:9200/testdb' -H "Content-Type: application/json" -d '
{
  "settings": { 
    "analysis": {
      "filter": {
        "words" : {
          "type": "word_delimiter_graph",
          "catenate_all": true,
          "catenate_numbers": true,
          "catenate_words": true
        },
        "length": {"type": "length", "min": 2}
      },
	  "analyzer": {"default": {"type": "custom", "filter": ["words", "length"], "tokenizer": "whitespace"}}
    }
  }
}'

curl -XPOST 'http://localhost:9200/testdb/_doc' -H "Content-Type: application/json" -d '{"field": "value"}'

curl -XPOST 'http://localhost:9200/testdb/_refresh'

curl 'http://localhost:9200/testdb/_search?q=g2+foo'
@martijnvg added the :Search Relevance/Analysis label on Sep 3, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-search

@martijnvg added the >bug label on Sep 3, 2019
@martijnvg (Member)

Thanks for reporting this issue @xabbu42, this should not result in an error.

In the master branch this currently results in an assertion error being thrown:

java.lang.AssertionError: state=0 nextState=0
[elasticsearch]         at org.apache.lucene.util.automaton.Automaton.initTransition(Automaton.java:476) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]
[elasticsearch]         at org.apache.lucene.util.graph.GraphTokenStreamFiniteStrings.hasSidePath(GraphTokenStreamFiniteStrings.java:99) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]
[elasticsearch]         at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.analyzeGraphBoolean(MatchQuery.java:664) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.createFieldQuery(MatchQuery.java:415) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.createQuery(MatchQuery.java:456) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.createFieldQuery(MatchQuery.java:336) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.apache.lucene.util.QueryBuilder.createBooleanQuery(QueryBuilder.java:96) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]
[elasticsearch]         at org.elasticsearch.index.search.MatchQuery.parseInternal(MatchQuery.java:266) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.search.MatchQuery.parse(MatchQuery.java:259) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.search.MultiMatchQuery.buildFieldQueries(MultiMatchQuery.java:108) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.search.MultiMatchQuery.parse(MultiMatchQuery.java:76) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.search.QueryStringQueryParser.getFieldQuery(QueryStringQueryParser.java:337) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.apache.lucene.queryparser.classic.QueryParser.MultiTerm(QueryParser.java:637) ~[lucene-queryparser-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:06:42]
[elasticsearch]         at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:226) ~[lucene-queryparser-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:06:42]
[elasticsearch]         at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:215) ~[lucene-queryparser-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:06:42]
[elasticsearch]         at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:109) ~[lucene-queryparser-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:06:42]
[elasticsearch]         at org.elasticsearch.index.search.QueryStringQueryParser.parse(QueryStringQueryParser.java:791) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.query.QueryStringQueryBuilder.doToQuery(QueryStringQueryBuilder.java:907) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:99) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.query.QueryShardContext.lambda$toQuery$1(QueryShardContext.java:278) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:290) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:277) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.search.SearchService.parseSource(SearchService.java:738) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.search.SearchService.createContext(SearchService.java:586) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:545) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:348) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:340) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:146) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.search.SearchService.lambda$rewriteShardRequest$7(SearchService.java:1043) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:45) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:769) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
[elasticsearch]         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
[elasticsearch]         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
[elasticsearch]         at java.lang.Thread.run(Thread.java:835) [?:?]

@null-pointer-byte commented Sep 14, 2019

@martijnvg I am facing the same issue. I have tested both v7.3.2 and the master branch: both produce java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0.

Please note that curl 'http://localhost:9200/testdb/_search?q=g2+fo' will result in an error, while curl 'http://localhost:9200/testdb/_search?q=g2+f' will not.

Logs (this looks like a Lucene bug):

org.elasticsearch.index.query.QueryShardException: failed to create query: {
  "query_string" : {
    "query" : "g2 fo",
    "fields" : [ ],
    "type" : "best_fields",
    "default_operator" : "or",
    "max_determinized_states" : 10000,
    "enable_position_increments" : true,
    "fuzziness" : "AUTO",
    "fuzzy_prefix_length" : 0,
    "fuzzy_max_expansions" : 50,
    "phrase_slop" : 0,
    "analyze_wildcard" : false,
    "escape" : false,
    "auto_generate_synonyms_phrase_query" : true,
    "fuzzy_transpositions" : true,
    "boost" : 1.0
  }
}
	at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:305) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:288) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.parseSource(SearchService.java:745) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.createContext(SearchService.java:586) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:545) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:348) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:340) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:146) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.lambda$rewriteShardRequest$7(SearchService.java:1055) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:45) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:769) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
	at org.apache.lucene.util.QueryBuilder.newSynonymQuery(QueryBuilder.java:653) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]
	at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.analyzeGraphBoolean(MatchQuery.java:694) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.createFieldQuery(MatchQuery.java:415) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.createQuery(MatchQuery.java:456) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.createFieldQuery(MatchQuery.java:336) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.apache.lucene.util.QueryBuilder.createBooleanQuery(QueryBuilder.java:96) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]
	at org.elasticsearch.index.search.MatchQuery.parseInternal(MatchQuery.java:266) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.search.MatchQuery.parse(MatchQuery.java:259) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.search.MultiMatchQuery.buildFieldQueries(MultiMatchQuery.java:108) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.search.MultiMatchQuery.parse(MultiMatchQuery.java:76) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.search.QueryStringQueryParser.getFieldQuery(QueryStringQueryParser.java:337) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.apache.lucene.queryparser.classic.QueryParser.MultiTerm(QueryParser.java:637) ~[lucene-queryparser-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:06:42]
	at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:226) ~[lucene-queryparser-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:06:42]
	at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:215) ~[lucene-queryparser-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:06:42]
	at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:109) ~[lucene-queryparser-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:06:42]
	at org.elasticsearch.index.search.QueryStringQueryParser.parse(QueryStringQueryParser.java:791) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.query.QueryStringQueryBuilder.doToQuery(QueryStringQueryBuilder.java:907) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:99) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.query.QueryShardContext.lambda$toQuery$1(QueryShardContext.java:289) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:301) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	... 17 more

@cbuescher (Member)

Just tried on 7.4.0 and narrowed this down a bit more: the exception reproduces with the above script, but only if "catenate_all": true is set in the word_delimiter_graph filter, and only in combination with the first search term being of length 2 (which relates to the length filter, of course) and ending in a number. That is to say, all of the below work (and return zero results):

GET /testdb/_search?q=foo
GET /testdb/_search?q=ga2+foo
GET /testdb/_search?q=gaaaa2+foo ...

Also, changing the order of the search terms doesn't trigger the exception:

GET /testdb/_search?q=foo+g2

The code path where the exception is triggered expects a Term[] of length > 0, which we even assert when assertions are enabled. This assumption seems to be violated in this particular case.
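
To illustrate why an empty Term[] blows up there (a minimal sketch, not the actual Lucene or Elasticsearch source): the stack traces show analyzeGraphBoolean handing the terms collected for a graph position to QueryBuilder.newSynonymQuery as a Term[], and any synonym-style query construction that reads the field name from the first entry fails once that array is empty:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SynonymQuery;

final class SynonymQueryIllustration {
    // Illustrative only: if a later filter deleted every token at this graph
    // position, `terms` is empty and terms[0] throws
    // java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
    static Query synonymQueryFrom(Term[] terms) {
        SynonymQuery.Builder builder = new SynonymQuery.Builder(terms[0].field());
        for (Term term : terms) {
            builder.addTerm(term);
        }
        return builder.build();
    }
}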

@cbuescher (Member) commented Mar 31, 2020

This came up again in #54434, so adding a summary of how this manifests in 7.6.

Stack trace on 7.6.1:

Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
	at org.apache.lucene.util.QueryBuilder.newSynonymQuery(QueryBuilder.java:653) ~[lucene-core-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:14]
	at org.elasticsearch.index.search.MatchQuery$MatchQueryBuilder.analyzeGraphBoolean(MatchQuery.java:744) ~[elasticsearch-7.6.1.jar:7.6.1]
[...]

Basically this error can happen whenever a filter produces a token stream with a graph structure (here the word_delimiter_graph filter), e.g. by splitting something like "3d" on numerics while also keeping the original (here "3", "d" and "3d"), and a subsequent filter removes some of those tokens (here the length filter). The corresponding token stream is turned into an Automaton inside GraphTokenStreamFiniteStrings in MatchQueryBuilder#analyzeGraphBoolean, and depending on the incoming token stream this automaton is rewritten to something that produces no terms in https://github.com/elastic/elasticsearch/blob/7.6/server/src/main/java/org/elasticsearch/index/search/MatchQuery.java#L738. The assertion in the following line shows that this is not expected to happen, and this finally leads to the ArrayIndexOutOfBoundsException in QueryBuilder.newSynonymQuery.

Short reproduction on 7.6.1:

DELETE test

PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "length_min_2": {
          "type": "length",
          "min": 2
        },
        "word_split_product_number": {
          "type": "word_delimiter_graph",
          "catenate_all" :true
        }
      },
      "analyzer": {
        "word_split_product_number_analyzer": {
          "filter": [
            "word_split_product_number",
            "length_min_2"
          ],
          "tokenizer": "standard"          
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "productNumber": {
        "type": "text",
        "analyzer": "word_split_product_number_analyzer"
      }
    }
  }
}

GET test/_search?error_trace=true
{
  "query": {
    "match": {
      "productNumber": {
        "query": "3d printer"
      }
    }
  }
}

The token streams produced by the word delimiter filter and the length filter look like this:

"tokenfilters" : [
      {
        "name" : "word_split_product_number",
        "tokens" : [
          {
            "token" : "3d",
            "start_offset" : 0,
            "end_offset" : 2,
            "type" : "<ALPHANUM>",
            "position" : 0,
            "positionLength" : 2,
            "bytes" : "[33 64]",
            "keyword" : false,
            "positionLength" : 2,
            "termFrequency" : 1
          },
          {
            "token" : "3",
            "start_offset" : 0,
            "end_offset" : 1,
            "type" : "<ALPHANUM>",
            "position" : 0,
            "bytes" : "[33]",
            "keyword" : false,
            "positionLength" : 1,
            "termFrequency" : 1
          },
          {
            "token" : "d",
            "start_offset" : 1,
            "end_offset" : 2,
            "type" : "<ALPHANUM>",
            "position" : 1,
            "bytes" : "[64]",
            "keyword" : false,
            "positionLength" : 1,
            "termFrequency" : 1
          },
          {
            "token" : "printer",
            "start_offset" : 3,
            "end_offset" : 10,
            "type" : "<ALPHANUM>",
            "position" : 2,
            "bytes" : "[70 72 69 6e 74 65 72]",
            "keyword" : false,
            "positionLength" : 1,
            "termFrequency" : 1
          }
        ]
      },
      {
        "name" : "length_min_2",
        "tokens" : [
          {
            "token" : "3d",
            "start_offset" : 0,
            "end_offset" : 2,
            "type" : "<ALPHANUM>",
            "position" : 0,
            "positionLength" : 2,
            "bytes" : "[33 64]",
            "keyword" : false,
            "positionLength" : 2,
            "termFrequency" : 1
          },
          {
            "token" : "printer",
            "start_offset" : 3,
            "end_offset" : 10,
            "type" : "<ALPHANUM>",
            "position" : 2,
            "bytes" : "[70 72 69 6e 74 65 72]",
            "keyword" : false,
            "positionLength" : 1,
            "termFrequency" : 1
          }
        ]
      }
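
For reference, the detailed per-filter token output above is what the _analyze API returns with "explain" enabled; a request along these lines, using the test index and analyzer from the reproduction above, should produce it:

GET test/_analyze
{
  "analyzer": "word_split_product_number_analyzer",
  "text": "3d printer",
  "explain": true
}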

I will need to discuss this some more and understand how the underlying automaton gets constructed to see if we have a chance of detecting this early or even mitigating it.

@cbuescher (Member)

Some intermediate results: The way we build the Automaton doesn't always deal well with deleted tokens.

Without the length filter deleting the tokens, the Automaton we build in GraphTokenStreamFiniteStrings has the following states and transitions for the four tokens:


[0] ------ "3d -----------> [2]  --- "printer" ---> [3]
 \                          / 
  \-- "3" --> [1] -- "d" -/

When the two single-character tokens ("3" and "d") get deleted, we currently create four states with the following transitions:

[0] --- "3d ---> [2]

          [1] ---- "printer ----> [3]

Later, after removing dead states, the Automaton is empty, leading to the exception further on in our code.

Dealing with deletions in token streams for graphs in a general way seems challenging. There are some open issues like https://issues.apache.org/jira/browse/LUCENE-8717 that might be able to address this problem in the future.
For this particular case of both tokens being deleted from the side path, I tried a small change in GraphTokenStreamFiniteStrings that detects when a token with position length N is directly followed by another token with a position increment of N, which indicates that there must have been deletions. This fixes this particular scenario but might not help with others (cbuescher/lucene-solr@55b9c71 and related ES tests in cbuescher@5f35bfa).

We might be able to detect when the built Automaton is empty although the token stream isn't, and throw a different exception in that case, but I'm not sure how helpful that would be compared to the current behaviour. A rough sketch of that idea follows below.
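
A hypothetical sketch of such a guard (not actual Elasticsearch or Lucene code, just the shape of the check, assuming access to the built automaton and using Lucene's Operations.isEmpty): if the graph automaton accepts nothing although the token stream was not empty, fail with an explicit message instead of hitting the ArrayIndexOutOfBoundsException deep inside query building.

import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.Operations;

final class GraphSanityCheck {
    // Hypothetical guard: reject an empty token graph with a descriptive error
    // instead of letting query building fail on an empty Term[].
    static void ensureNotEmpty(Automaton graph, boolean tokenStreamWasEmpty) {
        if (tokenStreamWasEmpty == false && Operations.isEmpty(graph)) {
            throw new IllegalArgumentException(
                "the analyzed token graph became empty, most likely because a "
                    + "token filter (e.g. length) removed tokens produced by a "
                    + "graph filter such as word_delimiter_graph");
        }
    }
}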

@rjernst added the Team:Search label on May 4, 2020
@seanbirdsell

We're continuing to see this error in our production environment. Any word on when it will be addressed?

@gjobin commented Nov 23, 2021

Very interested in this issue as well, as we hit it frequently in production.

@akhgeek30 commented Oct 20, 2022

Is this related to apache/lucene#11864?

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search (Team:Search)

@xabbu42 (Contributor, Author) commented Sep 19, 2023

This is still an issue with Elasticsearch 8.6.2. I had to add a _refresh to the script to reproduce it, though.

@javanna added the Team:Search Relevance label and removed the Team:Search label on Jul 12, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@javanna added the priority:high label on Jul 18, 2024