[CI] SnapshotStressTestsIT testRandomActivities failing #104939
Comments
Pinging @elastic/es-distributed (Team:Distributed)
I've seen this failure in another test in which we had more verbose logging: fail-2023-12-25T15:49:30.191Z-4dac44f91ae404e81db9861c43e9803243e15c25.log.gz. In that case the transition which moved the stuck snapshot to a completed state was an
Today when processing a batch of `ShardSnapshotUpdate` tasks each update's listener considers whether the corresponding snapshot has completed and, if so, it enqueues it for finalization. This is somewhat inefficient since we may be processing many shard snapshot updates for the same few snapshots, but there's no need to check each snapshot for completion more than once. It's also insufficient since the completion of a shard snapshot may cause the completion of subsequent snapshots too (e.g. they can go from state `QUEUED` straight to `MISSING`). This commit detaches the completion handling from the individual shard snapshot updates and instead makes sure that any snapshot that reaches a completed state is enqueued for finalization. Closes elastic#104939
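The change described above can be illustrated with a minimal, self-contained sketch. This is not the actual Elasticsearch implementation: `BatchedCompletionSketch`, `ShardUpdate`, `applyBatch`, the simplified `ShardState` enum and the `unassignedShards` set are all hypothetical names invented for illustration. It assumes a shard completing in one snapshot hands that shard to the next snapshot that has it `QUEUED`, and that a queued shard with no assigned copy moves straight to `MISSING`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model; none of these names come from Elasticsearch itself.
class BatchedCompletionSketch {

    // Simplified shard snapshot states; "completed" means no further work is expected.
    enum ShardState {
        QUEUED(false), INIT(false), SUCCESS(true), FAILED(true), MISSING(true);

        final boolean completed;

        ShardState(boolean completed) {
            this.completed = completed;
        }
    }

    // Stand-in for a single shard snapshot update task in a batch.
    static final class ShardUpdate {
        final String snapshot;
        final String shard;
        final ShardState newState;

        ShardUpdate(String snapshot, String shard, ShardState newState) {
            this.snapshot = snapshot;
            this.shard = shard;
            this.newState = newState;
        }
    }

    // A snapshot is completed once every one of its shards is in a completed state.
    static boolean isCompleted(Map<String, ShardState> shards) {
        for (ShardState state : shards.values()) {
            if (!state.completed) {
                return false;
            }
        }
        return true;
    }

    // Applies the whole batch in place, then returns every snapshot that went from
    // incomplete to completed, each exactly once, in queue order.
    static List<String> applyBatch(
        LinkedHashMap<String, Map<String, ShardState>> snapshots,
        Set<String> unassignedShards,
        List<ShardUpdate> batch
    ) {
        List<String> incompleteBefore = new ArrayList<>();
        for (Map.Entry<String, Map<String, ShardState>> entry : snapshots.entrySet()) {
            if (!isCompleted(entry.getValue())) {
                incompleteBefore.add(entry.getKey());
            }
        }

        for (ShardUpdate update : batch) {
            snapshots.get(update.snapshot).put(update.shard, update.newState);
            if (update.newState.completed) {
                startNextQueued(snapshots, unassignedShards, update.snapshot, update.shard);
            }
        }

        // Completion is checked once per snapshot, not once per shard update, so
        // snapshots completed only by a knock-on transition are picked up too.
        List<String> toFinalize = new ArrayList<>();
        for (String name : incompleteBefore) {
            if (isCompleted(snapshots.get(name))) {
                toFinalize.add(name);
            }
        }
        return toFinalize;
    }

    // Hands the shard to later snapshots that have it QUEUED. An unassigned shard
    // goes straight to MISSING, which is itself completed, so the scan continues.
    private static void startNextQueued(
        LinkedHashMap<String, Map<String, ShardState>> snapshots,
        Set<String> unassignedShards,
        String completedSnapshot,
        String shard
    ) {
        boolean after = false;
        for (Map.Entry<String, Map<String, ShardState>> entry : snapshots.entrySet()) {
            if (!after) {
                after = entry.getKey().equals(completedSnapshot);
                continue;
            }
            if (entry.getValue().get(shard) != ShardState.QUEUED) {
                continue;
            }
            if (unassignedShards.contains(shard)) {
                entry.getValue().put(shard, ShardState.MISSING);
            } else {
                entry.getValue().put(shard, ShardState.INIT);
                return;
            }
        }
    }
}
```

In this model, a batch containing a single update for the first snapshot can also complete a second snapshot whose only shard was `QUEUED` and unassigned. A per-update listener that checks only the snapshot named in the update would never enqueue that second snapshot, which is the kind of dropped finalization the issue describes.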
Not the same as #104072. The cluster state at failure suggests we've dropped a finalization task somehow (maybe related to a `node-left` event):
Build scan:
https://gradle-enterprise.elastic.co/s/ed4zqzurtbpn6/tests/:server:internalClusterTest/org.elasticsearch.snapshots.SnapshotStressTestsIT/testRandomActivities
Reproduction line:
Applicable branches:
main, 8.12
Reproduces locally?:
Didn't try
Failure history:
Failure dashboard for `org.elasticsearch.snapshots.SnapshotStressTestsIT#testRandomActivities`
Failure excerpt: