[Transform] IndexNotFoundException during cluster upgrade #107263
Comments
Pinging @elastic/ml-core (Team:ML)
Similar to #107266 (comment): a stopped transform still has a running thread that eventually fails at whatever it is doing (in this case, saving to the index).
I cannot think of a good way to approach this. The current option is to explicitly check for this here:

```java
else if (irrecoverableException instanceof IndexNotFoundException && IndexerState.ABORTING == getState()) {
    logger.debug(
        "[{}] Bulk index experienced IndexNotFoundException failure while Transform is aborting. This is likely "
            + "due to the Transform delete API called with delete_dest_index=true. Aborting indexer. ",
        getJobId()
    );
    onAbort();
    // do not call listener
}
```

This is similar to what happens before and after this code is invoked, where the Transform exits gracefully if it was moved into the
The alternative is to have fail check for the
The above isn't quite true: it is correlated, but I cannot repro it. This is actually the line that is throwing the exception: https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java#L832
I haven't digested what this code is doing well enough to repro it. It feels vaguely like the node running the transform has an earlier cluster state version than the node processing the bulk request, so we fail out instead of retrying?
This can be reproduced fairly consistently by getting the shards to relocate. When this happens, we want the Transform to retry the checkpoint and check whether the Index exists. A Delete API call might still trigger this issue, in which case we want the Indexer to shut down gracefully. The user might also delete the Index using the Index API without stopping the Transform; in that case the Transform seems to continue running and the Bulk API recreates the Index as per its spec. Ideally we'd either stop the Transform or have the Transform create/update the Index as per the Transport API spec, but I don't think that is within the scope of this issue.
An Index can be removed from its previous shard in the middle of a Transform run. Ideally, this happens as part of the Delete API after the Transform has already been stopped, but if it has not been stopped, we want to retry the checkpoint. If the Transform had been stopped, the retry will move the Indexer into a graceful shutdown. If the Transform had not been stopped, the retry will check whether the Index exists and recreate it if it does not. This is currently how unattended Transforms work, and this change lets regular Transforms auto-recover from this error as well.
Fix elastic#107263
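The branching described in this change can be sketched roughly as follows. This is a simplified, hypothetical model for discussion only: `CheckpointRetrySketch`, `Outcome`, and `retryCheckpoint` are made-up names, and the real Transform indexer tracks far more state than two booleans.

```java
/** Hypothetical, simplified model of the retry decision; not the Elasticsearch implementation. */
final class CheckpointRetrySketch {

    /** Illustrative outcomes of retrying a checkpoint after an IndexNotFoundException. */
    enum Outcome {
        SHUT_DOWN_GRACEFULLY,   // the Transform was stopped, so the retry ends the indexer
        CONTINUE,               // the destination index is reachable again
        RECREATE_AND_CONTINUE   // the destination index is gone, so it is recreated first
    }

    private CheckpointRetrySketch() {}

    /**
     * Decide what the retried checkpoint does.
     *
     * @param transformStopped whether the Transform had been stopped (assumed input)
     * @param destIndexExists  whether the destination index currently exists (assumed input)
     */
    static Outcome retryCheckpoint(boolean transformStopped, boolean destIndexExists) {
        if (transformStopped) {
            // A stopped Transform never recreates the index; it just exits.
            return Outcome.SHUT_DOWN_GRACEFULLY;
        }
        return destIndexExists ? Outcome.CONTINUE : Outcome.RECREATE_AND_CONTINUE;
    }
}
```

The stopped check comes first on purpose: a Transform deleted with delete_dest_index=true should not bring the destination index back.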
Description
From: #107251
While a cluster is upgrading, we see IndexNotFoundException for some Transforms.
Example:
This is likely an eventual consistency issue, and the transform needs to give the index time to become consistently reachable.
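One generic way to "give the index time" is a bounded retry with exponential backoff, sketched below. This is only an illustration of the idea, not how the Transform indexer is implemented: `RetryOnMissingIndex`, `MissingIndexException` (a stand-in for IndexNotFoundException), and `callWithRetry` are all hypothetical names.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of a bounded retry with exponential backoff; not Elasticsearch code. */
final class RetryOnMissingIndex {

    /** Stand-in for Elasticsearch's IndexNotFoundException, so the sketch is self-contained. */
    static final class MissingIndexException extends RuntimeException {
        MissingIndexException(String index) {
            super("no such index [" + index + "]");
        }
    }

    private RetryOnMissingIndex() {}

    /**
     * Run the action, retrying only when the index is reported missing.
     * The delay doubles after every failed attempt; any other exception propagates immediately.
     */
    static <T> T callWithRetry(Callable<T> action, int maxAttempts, long initialDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (MissingIndexException e) {
                if (attempt >= maxAttempts) {
                    throw e; // retry budget exhausted: surface the original failure
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }
}
```

Capping the number of attempts matters here: if the index was really deleted rather than temporarily unreachable, the failure should surface instead of being retried forever.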