Resume driver when failing to fetch pages #106392

dnhatn · 2024-03-17T06:40:27Z

I investigated a heap attack test failure and found that an ESQL request was stuck. The sequence of events was:

1. The ExchangeSource on the coordinator was blocked on reading because no pages were available.
2. Meanwhile, the ExchangeSink on the data node had pages ready to be fetched.
3. When an exchange request tried to fetch pages, it failed with a CircuitBreakingException. Despite the failure, no cancellation was triggered because the status of the ExchangeSource on the coordinator remained unchanged.

To fix this issue, this PR introduces two changes:

- Resumes the ExchangeSourceOperator and Driver on the coordinator when fetching pages fails, so that the coordinator can eventually trigger cancellation of the request.
- Ensures that an exchange sink on the data nodes fails when a data node request is cancelled. This callback was inadvertently omitted when the node-level reduction was introduced in "Run empty reduction node level on data nodes" (#106204).

I plan to spend some time hardening the exchange and compute service.

Closes #106262

dnhatn · 2024-03-18T06:07:07Z

...compute/src/main/java/org/elasticsearch/compute/operator/exchange/ExchangeSourceHandler.java

@@ -203,6 +203,7 @@ void onSinkFailed(Exception originEx) {
                }
                return first;
            });
+            buffer.waitForReading().onResponse(null); // resume the Driver if it is being blocked on reading


We could notify the driver about the failure, but I think it's simpler to just resume it and let it handle the error, as if it had not been blocked before.
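
For readers unfamiliar with the exchange code, here is a minimal sketch of the "resume and let the driver handle the error" idea. It is not the Elasticsearch implementation: the `Buffer` and `Source` classes, the `checkFailure` method, and the use of `CompletableFuture` in place of the real listener type are all assumptions made for illustration. The sketch only shows that completing the "wait for reading" future after recording the failure is enough for a blocked reader to wake up and find the error.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical, simplified stand-in for an exchange source; not the Elasticsearch classes. */
public class ResumeOnFailureSketch {

    /** A buffer exposing a future that a reader waits on while no pages are available. */
    static final class Buffer {
        private final CompletableFuture<Void> readable = new CompletableFuture<>();

        CompletableFuture<Void> waitForReading() {
            return readable;
        }
    }

    static final class Source {
        final Buffer buffer = new Buffer();
        final AtomicReference<Exception> failure = new AtomicReference<>();

        /** Called when fetching pages from a sink fails. */
        void onSinkFailed(Exception e) {
            failure.compareAndSet(null, e);          // keep only the first failure
            buffer.waitForReading().complete(null);  // wake a reader blocked on an empty buffer
        }

        /** The reader calls this each time it runs, so a resumed reader sees the failure. */
        void checkFailure() throws Exception {
            Exception e = failure.get();
            if (e != null) {
                throw e;
            }
        }
    }

    public static void main(String[] args) {
        Source source = new Source();
        CompletableFuture<Void> blocked = source.buffer.waitForReading();
        System.out.println("blocked before failure: " + !blocked.isDone());

        source.onSinkFailed(new IllegalStateException("simulated circuit breaker trip"));
        System.out.println("resumed after failure: " + blocked.isDone());

        try {
            source.checkFailure();
        } catch (Exception e) {
            System.out.println("driver sees: " + e.getMessage());
        }
    }
}
```

Without the `complete(null)` call in `onSinkFailed`, the failure would be recorded but the reader would keep waiting on the future, which mirrors the stuck request described in the PR description.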

elasticsearchmachine · 2024-03-18T06:09:07Z

Hi @dnhatn, I've created a changelog YAML for you.

elasticsearchmachine · 2024-03-18T06:09:07Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

nik9000 · 2024-03-18T12:38:41Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeService.java

@@ -690,6 +690,7 @@ private void runComputeOnDataNode(
            dataNodeRequestExecutor.start();
            // run the node-level reduction
            var externalSink = exchangeService.getSinkHandler(externalId);
+            task.addListener(() -> exchangeService.finishSinkHandler(externalId, new TaskCancelledException(task.getReasonCancelled())));


Could you stick @Nullable on the exception argument to finishSinkHandler and add a note that it'll fire any errors into the sink if it is still running? Or something like that.

Added in 3c018f7.
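
A rough sketch of the contract discussed in this thread, under stated assumptions: the `Sink` and `Task` classes below are hypothetical stand-ins, not the Elasticsearch `ExchangeService` or task classes, and a plain comment marks the nullable argument because the real `@Nullable` annotation is not on the classpath here. It illustrates a finish method that accepts a null failure for a normal finish, reports a non-null failure only if the sink is still running, and is wired to a cancellation listener.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of failing a sink when its task is cancelled. */
public class CancelFailsSinkSketch {

    static final class Sink {
        private boolean finished;
        private Exception failure;

        /**
         * Finishes the sink.
         *
         * @param failureOrNull null for a normal finish; otherwise the error to
         *                      report, which is recorded only if the sink is still running
         */
        synchronized void finish(Exception failureOrNull) {
            if (finished) {
                return; // already finished, so a late error is ignored
            }
            finished = true;
            failure = failureOrNull;
        }

        synchronized boolean isFinished() {
            return finished;
        }

        synchronized Exception failure() {
            return failure;
        }
    }

    /** Hypothetical task that runs its listeners when it is cancelled. */
    static final class Task {
        private final List<Runnable> listeners = new ArrayList<>();
        private String reason;

        void addListener(Runnable listener) {
            listeners.add(listener);
        }

        String reasonCancelled() {
            return reason;
        }

        void cancel(String reason) {
            this.reason = reason;
            listeners.forEach(Runnable::run);
        }
    }

    public static void main(String[] args) {
        Sink sink = new Sink();
        Task task = new Task();
        // On cancellation, fail the sink with the cancellation reason.
        task.addListener(() -> sink.finish(
            new RuntimeException("task cancelled [" + task.reasonCancelled() + "]")));

        task.cancel("simulated cancellation");
        System.out.println(sink.failure().getMessage());
    }
}
```

Ignoring a failure that arrives after the sink has already finished is a design choice of this sketch; the behaviour of the real `finishSinkHandler` is described by the javadoc added in 3c018f7.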

dnhatn · 2024-03-18T16:31:22Z

Thanks Nik!

elasticsearchmachine · 2024-03-18T16:33:42Z

Backport to 8.13 in #106436

elasticsearchmachine added the v8.14.0 label Mar 17, 2024

Resume Driver when fail to fetch pages

63fcb10

dnhatn force-pushed the exchange-deadlock branch from c33e763 to 63fcb10 Compare March 17, 2024 06:46

dnhatn changed the title ~~Resume Driver when fail to fetch pages~~ Resume driver when failing to fetch pages Mar 18, 2024

dnhatn requested review from nik9000 and ChrisHegarty March 18, 2024 06:08

dnhatn added v8.13.1 >bug :Analytics/ES|QL AKA ESQL labels Mar 18, 2024

dnhatn marked this pull request as ready for review March 18, 2024 06:08

elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Mar 18, 2024

Update docs/changelog/106392.yaml

435ee9b

nik9000 approved these changes Mar 18, 2024

javadoc

3c018f7

dnhatn added the auto-backport-and-merge Automatically create backport pull requests and merge when ready label Mar 18, 2024

dnhatn merged commit d66c7d4 into elastic:main Mar 18, 2024
14 checks passed

dnhatn deleted the exchange-deadlock branch March 18, 2024 16:32

elasticsearchmachine added the backport pending label Mar 18, 2024

dnhatn removed backport pending auto-backport-and-merge Automatically create backport pull requests and merge when ready labels Mar 18, 2024

This was referenced Mar 21, 2024

Add DownsampleMetrics #106637

Closed

Set index mode earlier for new downsample index #106728

Merged

Added initial metrics for synthetic source #106732

Open
