[Transform] IndexNotFoundException during cluster upgrade #107263

Closed
Tracked by #107251
prwhelan opened this issue Apr 9, 2024 · 5 comments · Fixed by #108394
Labels: >bug, :ml/Transform, Team:ML, v8.15.0

@prwhelan
Member

prwhelan commented Apr 9, 2024

Description

From: #107251

While a cluster is upgrading, we see IndexNotFoundException for some Transforms.
Example:

[endpoint.metadata_united-default-8.13.0] transform has failed; experienced: [Failed to index documents into destination index due to permanent error: [org.elasticsearch.xpack.transform.transforms.BulkIndexingException: Bulk index experienced [36] failures and at least 1 irrecoverable [org.elasticsearch.index.IndexNotFoundException: no such index [.metrics-endpoint.metadata_united_default]]. Other failures: [RemoteTransportException] message [org.elasticsearch.transport.RemoteTransportException: [<>][<>][indices:data/write/bulk[s]]]; org.elasticsearch.index.IndexNotFoundException: no such index [.metrics-endpoint.metadata_united_default]]].

This is likely an eventual consistency issue, and the transform needs to give the index time to become consistently reachable.

@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@prwhelan
Member Author

Similar to #107266 (comment)

A stopped transform still has a running thread, which eventually fails at whatever it is doing (in this case, saving to the index).

@prwhelan
Member Author

I cannot think of a good way to approach this. The current option is to explicitly check for it here:

 else if (irrecoverableException instanceof IndexNotFoundException && IndexerState.ABORTING == getState()) {
    logger.debug(
        "[{}] Bulk index experienced IndexNotFoundException failure while Transform is aborting. This is likely "
            + "due to the Transform delete API called with delete_dest_index=true.  Aborting indexer. ",
        getJobId()
    );
    onAbort();
    // do not call listener
}

This is similar to what happens before and after this code is invoked, where the Transform exits gracefully if it was moved into the ABORTING state (via the DELETE API).

The alternative is to have the fail method check for the ABORTING state and exit gracefully there, but that is a more invasive change with broader impact.
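
For readers without the Elasticsearch source at hand, the explicit check can be sketched in isolation as a small decision function. This is only an illustration, not the code in the indexer: `BulkFailureTriage`, `Action`, and `classify` are hypothetical names, and the exception and state types are local stand-ins for the real Elasticsearch classes.

```java
import java.util.Objects;

/** Hypothetical, self-contained sketch; none of these types are the real Elasticsearch classes. */
final class BulkFailureTriage {

    /** Simplified stand-in for the indexer states discussed in this thread (assumption). */
    enum IndexerState { STARTED, INDEXING, STOPPING, STOPPED, ABORTING }

    /** Hypothetical outcome of triaging an irrecoverable bulk failure. */
    enum Action { ABORT_QUIETLY, FAIL_TRANSFORM }

    /** Local placeholder for org.elasticsearch.index.IndexNotFoundException. */
    static final class IndexNotFoundException extends RuntimeException {
        IndexNotFoundException(String index) {
            super("no such index [" + index + "]");
        }
    }

    /**
     * A missing destination index while the indexer is already ABORTING is treated
     * as an expected side effect of deletion; every other case still fails the transform.
     */
    static Action classify(Throwable irrecoverable, IndexerState state) {
        Objects.requireNonNull(state, "state");
        if (irrecoverable instanceof IndexNotFoundException && state == IndexerState.ABORTING) {
            return Action.ABORT_QUIETLY;
        }
        return Action.FAIL_TRANSFORM;
    }

    public static void main(String[] args) {
        Throwable missing = new IndexNotFoundException(".metrics-endpoint.metadata_united_default");
        System.out.println(classify(missing, IndexerState.ABORTING));
        System.out.println(classify(missing, IndexerState.INDEXING));
    }
}
```

Keeping the decision in one place like this is what makes the explicit check the less invasive of the two options: only the bulk-failure path changes, and every other failure still goes through the normal fail handling.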

@prwhelan
Member Author

prwhelan commented May 2, 2024

The above isn't quite true: it is correlated, but I cannot repro it.

This is actually the line that is throwing the exception: https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java#L832

I haven't fully digested what this code is doing, so I can't repro it yet. It feels vaguely like the node running the transform has an earlier cluster state version than the node processing the bulk request, so we fail out instead of retrying?

@prwhelan
Member Author

prwhelan commented May 7, 2024

This can be reproduced fairly consistently by getting the shards to relocate. When this happens, we want the Transform to retry the checkpoint and check whether the Index exists.

A Delete API call might still trigger this issue, in which case we will want the Indexer to shut down gracefully.

It also might be possible for the user to delete the Index using the Index API without stopping the Transform. In that case the Transform seems to continue running, and the Bulk API will recreate the Index as per its spec. Ideally we'd either stop the Transform or have the Transform create/update the Index as per the Transport API spec, but I don't think that is within the scope of this issue.

prwhelan added a commit to prwhelan/elasticsearch that referenced this issue May 7, 2024
An Index can be removed from its previous shard in the middle of a
Transform run.  Ideally, this happens as part of the Delete API,
and the Transform has already been stopped, but in the case that
it isn't, we want to retry the checkpoint.

If the Transform had been stopped, the retry will move the Indexer into
a graceful shutdown.

If the Transform had not been stopped, the retry will check if the Index
exists or recreate the Index if it does not exist.

This is currently how unattended Transforms work, and this change will
make it so regular Transforms can also auto-recover from this error.

Fix elastic#107263
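
The retry behaviour described in the commit message can be illustrated with a rough, self-contained sketch. It is not the code from the fix: `CheckpointRetrySketch`, `Destination`, `Checkpoint`, `Outcome`, `runWithRetry`, and the `maxRetries` limit are all hypothetical, and the exception is a local stand-in for the Elasticsearch class.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of "retry the checkpoint"; not the actual Transform indexer. */
final class CheckpointRetrySketch {

    /** Local placeholder for org.elasticsearch.index.IndexNotFoundException. */
    static final class IndexNotFoundException extends RuntimeException {
        IndexNotFoundException(String index) {
            super("no such index [" + index + "]");
        }
    }

    /** Hypothetical view of the destination index. */
    interface Destination {
        boolean exists();
        void create();
    }

    /** Hypothetical unit of work that writes one checkpoint and may hit a missing index. */
    interface Checkpoint {
        void run();
    }

    enum Outcome { COMPLETED, STOPPED, FAILED }

    static Outcome runWithRetry(Checkpoint checkpoint, Destination dest,
                                BooleanSupplier stopRequested, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            // A stopped transform shuts down gracefully instead of failing.
            if (stopRequested.getAsBoolean()) {
                return Outcome.STOPPED;
            }
            // Otherwise make sure the destination exists before (re)running the checkpoint.
            if (!dest.exists()) {
                dest.create();
            }
            try {
                checkpoint.run();
                return Outcome.COMPLETED;
            } catch (IndexNotFoundException e) {
                // The index vanished mid-run; loop around and retry the checkpoint.
            }
        }
        return Outcome.FAILED;
    }

    public static void main(String[] args) {
        boolean[] present = {true};
        int[] attempts = {0};
        Destination dest = new Destination() {
            @Override public boolean exists() { return present[0]; }
            @Override public void create() { present[0] = true; }
        };
        // The first attempt simulates the index disappearing mid-run; later attempts succeed.
        Checkpoint flaky = () -> {
            if (attempts[0]++ == 0) {
                present[0] = false;
                throw new IndexNotFoundException("dest");
            }
        };
        System.out.println(runWithRetry(flaky, dest, () -> false, 3));
    }
}
```

The bounded retry count is an assumption made to keep the sketch terminating; the thread does not say how many times a regular Transform retries.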