Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resume driver when failing to fetch pages #106392

Merged
merged 3 commits into from Mar 18, 2024
Merged

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Mar 17, 2024

I investigated a heap attack test failure and found that an ESQL request was stuck. This occurred in the following:

  1. The ExchangeSource on the coordinator was blocked on reading because there were no available pages.
  2. Meanwhile, the ExchangeSink on the data node had pages ready for fetching.
  3. When an exchange request tried to fetch pages, it failed due to a CircuitBreakingException. Despite the failure, no cancellation was triggered because the status of the ExchangeSource on the coordinator remained unchanged.

To fix this issue, this PR introduces two changes:

  • Resumes the ExchangeSourceOperator and Driver on the coordinator, eventually allowing the coordinator to trigger cancellation of the request when failing to fetch pages.

  • Ensures that an exchange sink on the data nodes fails when a data node request is cancelled. This callback was inadvertently omitted when introducing the node-level reduction in Run empty reduction node level on data nodes #106204.

I plan to spend some time to harden the exchange and compute service.

Closes #106262

@dnhatn dnhatn changed the title Resume Driver when fail to fetch pages Resume driver when failing to fetch pages Mar 18, 2024
@@ -203,6 +203,7 @@ void onSinkFailed(Exception originEx) {
}
return first;
});
buffer.waitForReading().onResponse(null); // resume the Driver if it is being blocked on reading
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can notify about the failure, but I think it's simpler just to resume and let the driver handle the error, as if it hadn't been blocked before.

@dnhatn dnhatn marked this pull request as ready for review March 18, 2024 06:08
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Mar 18, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @dnhatn, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@@ -690,6 +690,7 @@ private void runComputeOnDataNode(
dataNodeRequestExecutor.start();
// run the node-level reduction
var externalSink = exchangeService.getSinkHandler(externalId);
task.addListener(() -> exchangeService.finishSinkHandler(externalId, new TaskCancelledException(task.getReasonCancelled())));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you stick @Nullable on the exception argument to finishSinkHandler and add a note that it'll fire any errors into the sync - if it is still running. Or something like that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 3c018f7.

@dnhatn
Copy link
Member Author

dnhatn commented Mar 18, 2024

Thanks Nik!

@dnhatn dnhatn added the auto-backport-and-merge Automatically create backport pull requests and merge when ready label Mar 18, 2024
@dnhatn dnhatn merged commit d66c7d4 into elastic:main Mar 18, 2024
14 checks passed
@dnhatn dnhatn deleted the exchange-deadlock branch March 18, 2024 16:32
@elasticsearchmachine
Copy link
Collaborator

elasticsearchmachine commented Mar 18, 2024

Backport to 8.13 in #106436

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Mar 18, 2024
I investigated a heap attack test failure and found that an ESQL request
was stuck. This occurred in the following:

1. The ExchangeSource on the coordinator was blocked on reading because
there were no available pages.

2. Meanwhile, the ExchangeSink on the data node had pages ready for
fetching.

3. When an exchange request tried to fetch pages, it failed due to a
CircuitBreakingException. Despite the failure, no cancellation was
triggered because the status of the ExchangeSource on the coordinator
remained unchanged.  To fix this issue, this PR introduces two changes:

Resumes the ExchangeSourceOperator and Driver on the coordinator,
eventually allowing the coordinator to trigger cancellation of the
request when failing to fetch pages.

Ensures that an exchange sink on the data nodes fails when a data node
request is cancelled. This callback was inadvertently omitted when
introducing the node-level reduction in Run empty reduction node level
on data nodes elastic#106204.

I plan to spend some time to harden the exchange and compute service.

Closes elastic#106262
dnhatn added a commit that referenced this pull request Mar 18, 2024
I investigated a heap attack test failure and found that an ESQL request
was stuck. This occurred in the following:

1. The ExchangeSource on the coordinator was blocked on reading because
there were no available pages.

2. Meanwhile, the ExchangeSink on the data node had pages ready for
fetching.

3. When an exchange request tried to fetch pages, it failed due to a
CircuitBreakingException. Despite the failure, no cancellation was
triggered because the status of the ExchangeSource on the coordinator
remained unchanged.  To fix this issue, this PR introduces two changes:

Resumes the ExchangeSourceOperator and Driver on the coordinator,
eventually allowing the coordinator to trigger cancellation of the
request when failing to fetch pages.

Ensures that an exchange sink on the data nodes fails when a data node
request is cancelled. This callback was inadvertently omitted when
introducing the node-level reduction in Run empty reduction node level
on data nodes #106204.

I plan to spend some time to harden the exchange and compute service.

Closes #106262
@dnhatn dnhatn removed backport pending auto-backport-and-merge Automatically create backport pull requests and merge when ready labels Mar 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.13.1 v8.14.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] HeapAttackIT testFetchTooManyBigFields failing
3 participants