Don't serialize the entire query when query parsing fails #51843

Closed
jimczi opened this issue Feb 4, 2020 · 6 comments · Fixed by #51885
Labels
:Core/Infra/Logging Log management and logging utilities >enhancement :Search/Search Search-related issues that do not fall into other categories

Comments

@jimczi
Contributor

jimczi commented Feb 4, 2020

Today when the query parsing fails on a shard, we create a QueryShardException that serializes the entire QueryBuilder in the message. All these exceptions are kept on a per-shard basis by the coordinating node, so we expect them to be lightweight. For very large boolean queries (with lots of clauses that trip the max_clause_count limit, for instance) this can cause memory pressure on the coordinating node, since we don't de-duplicate exceptions coming from different shards.
The simplest fix would be to limit the size of the QueryShardException's message by truncating the QueryBuilder serialization if it is above a certain threshold (1 KB?).
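For illustration, here is a minimal sketch of that truncation idea; the helper name `buildShardExceptionMessage` and the threshold constant are hypothetical and not part of Elasticsearch's actual code:

```java
// Hypothetical illustration of the truncation idea described above; not the
// actual Elasticsearch implementation.
final class ShardExceptionMessages {

    // Assumed threshold, roughly the "1 KB?" suggested in the issue.
    private static final int MAX_QUERY_CHARS = 1024;

    static String buildShardExceptionMessage(String reason, String serializedQuery) {
        String query = serializedQuery;
        if (query.length() > MAX_QUERY_CHARS) {
            // Keep only the first MAX_QUERY_CHARS characters so that per-shard
            // exceptions stay small on the coordinating node.
            query = query.substring(0, MAX_QUERY_CHARS) + "... (truncated)";
        }
        return reason + ": " + query;
    }
}
```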

@jimczi jimczi added >enhancement :Search/Search Search-related issues that do not fall into other categories :Core/Infra/Logging Log management and logging utilities labels Feb 4, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Logging)

jimczi added a commit to jimczi/elasticsearch that referenced this issue Feb 4, 2020
QueryBuilders that throw exceptions on shards when building the Lucene query
return the full serialization of the query builder in the exception message.
For large queries that fail to execute due to the maximum boolean clause limit,
this means that we keep a reference to these big messages for every shard that
participates in the request. In order to limit the memory needed to hold these
query shard exceptions on the coordinating node, this change removes the query
builder serialization from the shard exception. The query is known to the user,
so there should be no need to repeat it in every shard exception. We could also
omit the entire stack trace for known bad-request exceptions, but that deserves
a separate issue/PR.

Closes elastic#51843
Closes elastic#48910
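Schematically, the gist of the change described in this commit message looks like the following; the class and method names here are purely illustrative, not the real QueryShardException code:

```java
// Illustrative sketch only: the per-shard failure message keeps a short reason
// but no longer embeds the serialized QueryBuilder.
final class ShardFailureMessages {

    // Before the change (schematically): the full query was repeated per shard.
    static String messageWithQuery(String reason, String serializedQueryBuilder) {
        return reason + ": " + serializedQueryBuilder;
    }

    // After the change (schematically): only the short reason is kept, since the
    // caller already knows which query it sent.
    static String messageWithoutQuery(String reason) {
        return reason;
    }
}
```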
@altinp

altinp commented Feb 5, 2020

Thanks @jimczi, this should sort out #48910! One question (and pardon my ignorance): is it not possible for the coordinating node to do some (basic/partial?) parsing just to ensure that global circuit breakers like too_many_clauses are not tripped, so as to not even send the query to all shards? If not, when putting together the response, it could check the first shard failure's reason.caused_by.caused_by.{type|reason} and, if it indicates something global like too_many_clauses, it could send back a much smaller response, omitting the rest of the shard failures.

jimczi added a commit that referenced this issue Feb 6, 2020
…ge (#51885)

QueryBuilders that throw exceptions on shards when building the Lucene query
return the full serialization of the query builder in the exception message.
For large queries that fail to execute due to the maximum boolean clause limit,
this means that we keep a reference to these big messages for every shard that
participates in the request. In order to limit the memory needed to hold these
query shard exceptions on the coordinating node, this change removes the query
builder serialization from the shard exception. The query is known to the user,
so there should be no need to repeat it in every shard exception. We could also
omit the entire stack trace for known bad-request exceptions, but that deserves
a separate issue/PR.

Closes #51843
Closes #48910
@jimczi
Contributor Author

jimczi commented Feb 6, 2020

is it not possible for the coordinating node to do some (basic/partial?) parsing just to ensure that global circuit breakers like too_many_clauses are not tripped, so as to not even send the query to all shards?

The coordinating node doesn't have the mappings, so it's not possible at the moment. The change I committed is just a small fix; there's a lot of follow-up work that we could try. One thing that I'd like to test is whether we can de-duplicate the shard failures at the coordinator level. We shouldn't have to save every shard failure if they are all the same (parsing failure, too many clauses, ...). Another possibility would be to terminate the query early if a shard failure is reported and allow_partial_search_results is set to false.
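As a rough, hypothetical sketch of that de-duplication idea (not the coordinator's actual code), identical shard failures could be grouped by their cause before being retained:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: collapse identical shard failures (for example, the same
// "too many clauses" error reported by every shard) into one entry per cause.
final class ShardFailureDeduplicator {

    /** Groups shard ids by their failure message, so identical causes are stored once. */
    static Map<String, List<String>> groupByCause(Map<String, String> failureByShard) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : failureByShard.entrySet()) {
            // Key on the cause message, not the shard, so N identical parsing
            // failures cost one message instead of N.
            grouped.computeIfAbsent(entry.getValue(), k -> new ArrayList<>())
                   .add(entry.getKey());
        }
        return grouped;
    }
}
```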

@altinp

altinp commented Feb 6, 2020

The coordinating node doesn't have the mapping so it's not possible at the moment.

Ah, true. Would this (and perhaps other) use cases benefit from an integrated "schema" (mapping) registry? This could be replicated on all nodes for fast access, since the overall size would be small.
Even before getting to that, though, would it not be possible to do some mapping-unaware, purely syntactic parsing of the user-supplied query as-is, like checking the number of terms in a terms query, or of single-term clauses, or of boolean OR clauses in a query_string? I.e., skim over the query just to rule out a few early-termination opportunities. (But yes, perhaps worth waiting for a comprehensive approach instead.)
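A toy sketch of that kind of mapping-unaware skim (purely hypothetical, and much cruder than Elasticsearch's real request parsing) could count leaf clauses in the already-parsed query JSON before fanning the request out:

```java
import java.util.List;
import java.util.Map;

// Toy illustration of a mapping-unaware pre-check: walk the parsed query JSON
// (maps, lists, scalars) and return a rough count of leaf query clauses, so a
// request could be rejected before it is sent to the shards.
final class ClauseCounter {

    static int countClauses(Object node) {
        if (node instanceof Map) {
            int count = 0;
            for (Object value : ((Map<?, ?>) node).values()) {
                count += countClauses(value);
            }
            // A map whose values are all scalars is counted as a single leaf clause.
            return count == 0 ? 1 : count;
        }
        if (node instanceof List) {
            int count = 0;
            for (Object value : (List<?>) node) {
                count += countClauses(value);
            }
            return count;
        }
        return 0;
    }
}
```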

I also do agree on deduping the shard failures as they stream in. Even without peeking too deeply into the error strings and relying on string similarity etc., you could perhaps check reason.caused_by.caused_by.type etc. for a global limit breach like too_many_clauses and collapse on that.

Early termination when allow_partial_search_results=false would be great for us! In all our current use cases, and probably many others', partial results are worse than none. This would be a solid last line of defense since it's not dependent on the type of query/failure. (As you noted, the case of a large query_string would need a different fix than #51885.)

Thanks again!

@altinp

altinp commented Feb 6, 2020

@jimczi just to reconfirm: the current behavior is for the coordinating node to hold on to all shard failures regardless of allow_partial_search_results, right? I.e., this param only governs what is sent to the client, so a code change is needed.

One remaining worry is that this is only a search param. We have users who go through a UI facade that we control tightly but also data scientists who have more freedom in writing queries directly to ES. In the latter case, we'd still be vulnerable if they omit this param in their searches.

Perhaps this could also be made available as a global setting?

Finally, would you still want/have to distinguish between shard failures that can be retried against another replica and those that should cause the query to be terminated right away?
