[8.16] ES|QL per-cluster took time is incorrectly calculated and causes fatal exceptions (#115017) #115124

quux00 · 2024-10-18T16:26:29Z

Backports the following commits to 8.16:

ES|QL per-cluster took time is incorrectly calculated and causes fatal exceptions (ES|QL per-cluster took time is incorrectly calculated and causes fatal exceptions #115017)

…l exceptions (elastic#115017) The model for calculating per-cluster `took` times from remote clusters in elastic#112595 was flawed. It attempted to use Java's System.nanoTime between the local and remote clusters, which is not safe. This results in per-cluster took times that have arbitrary (invalid) values including negative values which cause exceptions to be thrown by the `TimeValue` constructor. (Note: the overall took time calculation was done correctly, so it was the remote per-cluster took times that were flawed.) In this PR, I've done a redesign to address this. A key decision of this re-design was whether to always calculate took times only on the querying cluster (bypassing this whole problem) or to continue to allow the remote clusters to calculate their own took times for the remote processing and report that back to the querying cluster via the `ComputeResponse`. I decided in favor of having remote clusters compute their own took times for the remote processing and to additionally track "planning" time (encompassing field-caps and policy enrich remote calls), so that total per-cluster took time is a combination of the two. In _search, remote cluster took times are calculated entirely on the remote cluster, so network time is not included in the per-cluster took times. This has been helpful in diagnosing issues on user environments because if you see an overall took time that is significantly larger than the per cluster took times, that may indicate a network issue, which has happened in diagnosing cross-cluster issues in _search. I moved relative time tracking into `EsqlExecutionInfo`. The "planning time" marker is currently only used in cross-cluster searches, so it will conflict with the INLINESTATS 2 phase model (where planning can be done twice). We will improve this design to handle a 2 phase model in a later ticket, as part of the INLINESTATS work. I tested the current overall took time calculation model with local-only INLINESTATS queries and they work correctly. I also fixed another secondary bug in this PR. If the remote cluster is an older version that does not return took time (and shard info) in the ComputeResponse, the per-cluster took time is then calculated on the querying cluster as a fallback. Finally, I fixed some minor inconsistencies about whether the `_shards` info is shown in the response. The rule now is that `_shards` is always shown with 0 shards for SKIPPED clusters, with actual counts for SUCCESSFUL clusters and for remotes running an older version that doesn't report shard stats, the `_shards` field is left out of the XContent response. Fixes elastic#115022

quux00 · 2024-10-18T17:27:56Z

@elasticmachine run elasticsearch-ci/part-3

quux00 · 2024-10-18T18:42:09Z

buildkite test this

quux00 · 2024-10-18T19:43:54Z

@elasticmachine run elasticsearch-ci/bwc-snapshots

quux00 · 2024-10-18T21:00:35Z

@elasticmachine run elasticsearch-ci/8.16.0 / bwc-snapshots

quux00 · 2024-10-18T21:06:55Z

@elasticmachine run elasticsearch-ci/8.16.0/bwc-snapshots

quux00 · 2024-10-18T21:15:46Z

buildkite test this

quux00 · 2024-10-18T22:47:37Z

buildkite test this

quux00 · 2024-10-18T23:44:42Z

buildkite test this

quux00 · 2024-10-19T01:12:55Z

buildkite test this

quux00 · 2024-10-19T02:20:56Z

buildkite test this

quux00 added :Analytics/ES|QL AKA ESQL >non-issue auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) backport Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels Oct 18, 2024

elasticsearchmachine mentioned this pull request Oct 18, 2024

ES|QL per-cluster took time is incorrectly calculated and causes fatal exceptions #115017

Merged

elasticsearchmachine added the v8.16.1 label Oct 18, 2024

Added fix for elastic#115127 into this since can't get the build to pass

eecf127

elasticsearchmachine merged commit be2fcc7 into elastic:8.16 Oct 19, 2024
15 checks passed

quux00 deleted the backport/8.16/pr-115017 branch October 19, 2024 15:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[8.16] ES|QL per-cluster took time is incorrectly calculated and causes fatal exceptions (#115017) #115124

[8.16] ES|QL per-cluster took time is incorrectly calculated and causes fatal exceptions (#115017) #115124

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 19, 2024

Uh oh!

quux00 commented Oct 19, 2024

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[8.16] ES|QL per-cluster took time is incorrectly calculated and causes fatal exceptions (#115017) #115124

[8.16] ES|QL per-cluster took time is incorrectly calculated and causes fatal exceptions (#115017) #115124

Uh oh!

Conversation

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 18, 2024

Uh oh!

quux00 commented Oct 19, 2024

Uh oh!

quux00 commented Oct 19, 2024

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants