[Transform] IndexNotFoundException during cluster upgrade #107263

prwhelan · 2024-04-09T12:00:30Z

Description

While a cluster is upgrading, we see IndexNotFoundException for some Transforms.
Example:

[endpoint.metadata_united-default-8.13.0] transform has failed; experienced: [Failed to index documents into destination index due to permanent error: [org.elasticsearch.xpack.transform.transforms.BulkIndexingException: Bulk index experienced [36] failures and at least 1 irrecoverable [org.elasticsearch.index.IndexNotFoundException: no such index [.metrics-endpoint.metadata_united_default]]. Other failures: [RemoteTransportException] message [org.elasticsearch.transport.RemoteTransportException: [<>][<>][indices:data/write/bulk[s]]]; org.elasticsearch.index.IndexNotFoundException: no such index [.metrics-endpoint.metadata_united_default]]].

This is likely an eventual-consistency issue: the transform needs to give the index time to become consistently reachable.


elasticsearchmachine · 2024-04-09T12:00:58Z

Pinging @elastic/ml-core (Team:ML)

prwhelan · 2024-04-22T18:57:01Z

Similar to #107266 (comment)

A stopped transform still has a running thread, which eventually fails at whatever it was doing (in this case, saving to the destination index).

prwhelan · 2024-04-25T20:04:32Z

I cannot think of a good way to approach this. The current option is to explicitly check for this here:

 else if (irrecoverableException instanceof IndexNotFoundException && IndexerState.ABORTING == getState()) {
    logger.debug(
        "[{}] Bulk index experienced IndexNotFoundException failure while Transform is aborting. This is likely "
            + "due to the Transform delete API called with delete_dest_index=true.  Aborting indexer. ",
        getJobId()
    );
    onAbort();
    // do not call listener
}

This is similar to what happens before and after this code is invoked, where the Transform exits gracefully if it was moved into the ABORTING state (via the DELETE API).

The alternative is to have the fail method check for the ABORTING state and exit gracefully there, but that is a more invasive change with broader impact.
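To make the proposed check easier to reason about in isolation, here is a minimal sketch of the decision it encodes. Everything in it is a hypothetical stand-in: BulkFailureSketch, its nested State enum, MissingIndexException, and Action are invented for illustration and are not the Elasticsearch IndexerState or IndexNotFoundException classes. It only assumes what the snippet above shows, namely that an index-not-found failure seen while ABORTING should end the indexer quietly instead of failing the Transform.

```java
// Sketch only: all names here are hypothetical stand-ins, not Elasticsearch classes.
final class BulkFailureSketch {
    private BulkFailureSketch() {}

    // Stand-in for the indexer states mentioned in this thread.
    enum State { STARTED, INDEXING, STOPPING, STOPPED, ABORTING }

    // What the bulk-failure handler decides to do.
    enum Action { ABORT_QUIETLY, FAIL_TRANSFORM }

    // Stand-in for an "index not found" failure on the destination index.
    static final class MissingIndexException extends RuntimeException {
        MissingIndexException(String index) {
            super("no such index [" + index + "]");
        }
    }

    // Mirrors the else-if branch above: a missing destination index while the
    // indexer is ABORTING is treated as expected, so the indexer aborts without
    // reporting a failure. Any other combination still fails the transform.
    static Action onIrrecoverable(Exception failure, State state) {
        boolean indexMissing = failure instanceof MissingIndexException;
        if (indexMissing && state == State.ABORTING) {
            return Action.ABORT_QUIETLY;
        }
        return Action.FAIL_TRANSFORM;
    }
}
```

In this sketch, the same exception seen in any state other than ABORTING still fails the Transform, which is why this check alone does not cover the shard-relocation case discussed later in the thread.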

prwhelan · 2024-05-02T20:45:23Z

The above isn't quite true: it is correlated, but I cannot repro it.

This is actually the line that is throwing the exception: https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java#L832

I haven't digested this code well enough to repro it. It feels vaguely like the node running the transform has an earlier cluster state version than the node processing the bulk request, so we fail out instead of retrying?

prwhelan · 2024-05-07T22:41:59Z

This can be reproduced fairly consistently by getting the shards to relocate. When this happens, we want the Transform to retry the checkpoint and check if the Index exists.

It might be possible that a Delete API can still trigger this issue, in which case we will want to have the Indexer gracefully shut down.

It also might be possible that the user can delete the Index using the Index API without stopping the Transform. In that case the Transform seems to continue running, and the Bulk API will recreate the Index as per its spec. Ideally we'd either stop the Transform or have the Transform create/update the Index as per the Transform API spec, but I don't think that is within the scope of this issue.

An Index can be removed from its previous shard in the middle of a Transform run. Ideally, this happens as part of the Delete API, and the Transform has already been stopped, but in the case that it isn't, we want to retry the checkpoint. If the Transform had been stopped, the retry will move the Indexer into a graceful shutdown. If the Transform had not been stopped, the retry will check if the Index exists or recreate the Index if it does not exist. This is currently how unattended Transforms work, and this change will make it so regular Transforms can also auto-recover from this error.

Fix elastic#107263
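The retry behavior described above has three outcomes, which the following sketch spells out. It is an illustration under stated assumptions, not the code from the fix: RetryCheckpointSketch, Outcome, DestinationIndex, and FakeIndex are hypothetical names, and the real Transform indexer decides these things through its own state and cluster calls.

```java
// Sketch only: hypothetical names, not the Elasticsearch Transform implementation.
final class RetryCheckpointSketch {
    private RetryCheckpointSketch() {}

    enum Outcome { SHUT_DOWN_GRACEFULLY, RETRIED_INDEX_PRESENT, RETRIED_INDEX_RECREATED }

    // Hypothetical view of the destination index, just enough for the sketch.
    interface DestinationIndex {
        boolean exists();
        void create();
    }

    // Called when a bulk request fails because the destination index was not found.
    static Outcome onDestinationIndexNotFound(boolean transformStopped, DestinationIndex dest) {
        if (transformStopped) {
            // The Transform was already stopped (for example by a delete request),
            // so the retry turns into a graceful shutdown.
            return Outcome.SHUT_DOWN_GRACEFULLY;
        }
        if (dest.exists()) {
            // The index is reachable again (for example after shards relocated),
            // so the checkpoint is simply retried.
            return Outcome.RETRIED_INDEX_PRESENT;
        }
        // The index is really gone: recreate it, then retry the checkpoint.
        dest.create();
        return Outcome.RETRIED_INDEX_RECREATED;
    }

    // In-memory fake used to exercise the sketch.
    static final class FakeIndex implements DestinationIndex {
        private boolean present;

        FakeIndex(boolean present) {
            this.present = present;
        }

        @Override
        public boolean exists() {
            return present;
        }

        @Override
        public void create() {
            present = true;
        }
    }
}
```

For example, calling onDestinationIndexNotFound(false, new FakeIndex(false)) recreates the fake index and reports RETRIED_INDEX_RECREATED, while passing true for transformStopped leaves the index untouched and reports SHUT_DOWN_GRACEFULLY.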

prwhelan added the >bug, :ml/Transform, Team:ML, and v8.14.0 labels Apr 9, 2024

prwhelan self-assigned this Apr 9, 2024

prwhelan mentioned this issue Apr 9, 2024

[Transform] Elasticsearch upgrades make transforms fail easily #107251

Closed

5 tasks

elasticsearchmachine added v8.15.0 and removed v8.14.0 labels Apr 17, 2024

prwhelan mentioned this issue May 7, 2024

[Transform] Retry Destination IndexNotFoundException #108394

Merged

prwhelan closed this as completed in #108394 May 9, 2024

prwhelan closed this as completed in 0b71746 May 9, 2024
