Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Transform] Retry Destination IndexNotFoundException #108394

Merged
merged 3 commits into from
May 9, 2024

Conversation

prwhelan
Copy link
Member

@prwhelan prwhelan commented May 7, 2024

An Index can be removed from its previous shard in the middle of a Transform run. Ideally, this happens as part of the Delete API, and the Transform has already been stopped, but in the case that it isn't, we want to retry the checkpoint.

If the Transform had been stopped, the retry will move the Indexer into a graceful shutdown.

If the Transform had not been stopped, the retry will check if the Index exists or recreate the Index if it does not exist.

This is currently how unattended Transforms work, and this change will make it so regular Transforms can also auto-recover from this error.

Fix #107263

An Index can be removed from its previous shard in the middle of a
Transform run.  Ideally, this happens as part of the Delete API,
and the Transform has already been stopped, but in the case that
it isn't, we want to retry the checkpoint.

If the Transform had been stopped, the retry will move the Indexer into
a graceful shutdown.

If the Transform had not been stopped, the retry will check if the Index
exists or recreate the Index if it does not exist.

This is currently how unattended Transforms work, and this change will
make it so regular Transforms can also auto-recover from this error.

Fix elastic#107263
@prwhelan prwhelan added >bug :ml/Transform Transform Team:ML Meta label for the ML team v8.15.0 labels May 7, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @prwhelan, I've created a changelog YAML for you.

@prwhelan prwhelan marked this pull request as ready for review May 8, 2024 12:20
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@przemekwitek przemekwitek self-requested a review May 9, 2024 11:23
@@ -63,7 +64,7 @@ public static Throwable getFirstIrrecoverableExceptionFromBulkResponses(Collecti
}

if (unwrappedThrowable instanceof ElasticsearchException elasticsearchException) {
if (isExceptionIrrecoverable(elasticsearchException)) {
if (isExceptionIrrecoverable(elasticsearchException) && isNotIndexNotFoundException(elasticsearchException)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we no longer consider IndexNotFoundException irrecoverable, could we modify the isExceptionIrrecoverable method to reflect that instead of adding the isNotIndexNotFoundException method?

Or the distinction is still needed for cases other than bulk responses?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good question - at least for now I was thinking of just handling it for the destination index? The source index can still fail with IndexNotFoundException (if PIT == null): https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/transform/src/main/java/org/elasticsearch/xpack/transform/transforms/ClientTransformIndexer.java#L602

I haven't seen this same type of intermittent failure scenario though, where the source index gets relocated to a new shard and the request fails on the old shard, so I'm not sure if we just haven't seen it yet or if search behaves differently where the request will automatically reroute.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we just haven't seen it yet or if search behaves differently where the request will automatically reroute.

I don't know either unfortunately :(
But since we are going in a direction to make more errors recoverable (and more transforms unattended) it could make sense to make IndexNotFoundException recoverable.
But maybe let's not do it in this PR if you don't feel comfortable.
We can leave this PR for dealing with destination index only.
Please make it clear in the PR title/description.

Copy link
Contributor

@przemekwitek przemekwitek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -63,7 +64,7 @@ public static Throwable getFirstIrrecoverableExceptionFromBulkResponses(Collecti
}

if (unwrappedThrowable instanceof ElasticsearchException elasticsearchException) {
if (isExceptionIrrecoverable(elasticsearchException)) {
if (isExceptionIrrecoverable(elasticsearchException) && isNotIndexNotFoundException(elasticsearchException)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we just haven't seen it yet or if search behaves differently where the request will automatically reroute.

I don't know either unfortunately :(
But since we are going in a direction to make more errors recoverable (and more transforms unattended) it could make sense to make IndexNotFoundException recoverable.
But maybe let's not do it in this PR if you don't feel comfortable.
We can leave this PR for dealing with destination index only.
Please make it clear in the PR title/description.

@prwhelan prwhelan changed the title [Transform] Handle IndexNotFoundException [Transform] Retry Destination IndexNotFoundException May 9, 2024
@prwhelan prwhelan merged commit 0b71746 into elastic:main May 9, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :ml/Transform Transform Team:ML Meta label for the ML team v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Transform] IndexNotFoundException during cluster upgrade
3 participants