Resume driver when failing to fetch pages #106392
@@ -203,6 +203,7 @@ void onSinkFailed(Exception originEx) {
}
return first;
});
buffer.waitForReading().onResponse(null); // resume the Driver if it is being blocked on reading |
We can notify about the failure, but I think it's simpler just to resume and let the driver handle the error, as if it hadn't been blocked before.
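The idea in the comment above can be illustrated with a minimal sketch. It is not the Elasticsearch code: `BlockingSourceSketch`, `onFetchFailed`, and `checkFailure` are hypothetical names, and a plain `CompletableFuture` stands in for the listener returned by `buffer.waitForReading()`. The point is only that the failure is recorded separately and the blocked future is completed normally, so the reader wakes up and finds the error itself.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for an exchange source; not the Elasticsearch class.
class BlockingSourceSketch {
    // First failure seen while fetching pages, if any.
    private final AtomicReference<Exception> failure = new AtomicReference<>();
    // Completed when the reader should stop waiting (pages arrived or a fetch failed).
    private final CompletableFuture<Void> notBlocked = new CompletableFuture<>();

    /** Future the reader waits on while no pages are available. */
    CompletableFuture<Void> waitForReading() {
        return notBlocked;
    }

    /** Records the first failure, then resumes the reader without failing the future. */
    void onFetchFailed(Exception e) {
        failure.compareAndSet(null, e);
        notBlocked.complete(null); // resume only; the reader picks up the error afterwards
    }

    /** Called by the reader once it has been resumed; surfaces the recorded failure. */
    void checkFailure() throws Exception {
        Exception e = failure.get();
        if (e != null) {
            throw e;
        }
    }
}
```

In this sketch a reader that was never blocked and a reader that was blocked and then resumed follow the same path: both call `checkFailure()` and handle the exception there, which is the simplification the comment argues for.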
Hi @dnhatn, I've created a changelog YAML for you.
Pinging @elastic/es-analytical-engine (Team:Analytics)
@@ -690,6 +690,7 @@ private void runComputeOnDataNode(
dataNodeRequestExecutor.start();
// run the node-level reduction
var externalSink = exchangeService.getSinkHandler(externalId);
task.addListener(() -> exchangeService.finishSinkHandler(externalId, new TaskCancelledException(task.getReasonCancelled()))); |
Could you stick @Nullable on the exception argument to finishSinkHandler and add a note that it'll fire any errors into the sink if it is still running? Or something like that.
Added in 3c018f7.
Thanks Nik!
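For readers following the thread, here is a rough sketch of the contract discussed above, under stated assumptions. `SinkRegistrySketch` and its nested `Sink` are hypothetical and much simpler than the real exchange service. Only the method name `finishSinkHandler` and the idea of a nullable failure argument come from the diff and the review comment. The "still running" check is modelled here by whether the sink is still registered.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified registry; not the Elasticsearch implementation.
class SinkRegistrySketch {
    static final class Sink {
        private final AtomicReference<Exception> failure = new AtomicReference<>();
        private volatile boolean finished;

        boolean isFinished() {
            return finished;
        }

        Exception failure() {
            return failure.get();
        }
    }

    private final Map<String, Sink> sinks = new ConcurrentHashMap<>();

    Sink createSink(String id) {
        return sinks.computeIfAbsent(id, k -> new Sink());
    }

    /**
     * Removes and finishes the sink. The failure may be null for a normal finish;
     * a non-null failure is recorded only if the sink is still registered (running).
     */
    void finishSinkHandler(String id, Exception failure) {
        Sink sink = sinks.remove(id);
        if (sink == null) {
            return; // already finished, nothing to fail
        }
        if (failure != null) {
            sink.failure.compareAndSet(null, failure);
        }
        sink.finished = true;
    }
}
```

A cancellation listener in the spirit of the added line would then be a `Runnable` such as `() -> registry.finishSinkHandler(id, new RuntimeException("task cancelled"))`. The exception type here is a placeholder for the `TaskCancelledException` used in the diff.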
Backport to 8.13 in #106436
I investigated a heap attack test failure and found that an ESQL request was stuck. The sequence of events was:

1. The ExchangeSource on the coordinator was blocked on reading because there were no available pages.
2. Meanwhile, the ExchangeSink on the data node had pages ready for fetching.
3. When an exchange request tried to fetch pages, it failed due to a CircuitBreakingException.

Despite the failure, no cancellation was triggered because the status of the ExchangeSource on the coordinator remained unchanged.

To fix this issue, this PR introduces two changes:

- Resumes the ExchangeSourceOperator and Driver on the coordinator when fetching pages fails, eventually allowing the coordinator to trigger cancellation of the request.
- Ensures that an exchange sink on the data nodes fails when a data node request is cancelled. This callback was inadvertently omitted when introducing the node-level reduction in "Run empty reduction node level on data nodes" (#106204).

I plan to spend some time hardening the exchange and compute service.

Closes #106262