[CI] Multiple Failures in org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT #38447
Pinging @elastic/es-distributed |
Thanks @original-brownbear. FYI, my PR build where #38368 is not merged also failed in testMasterAndDataShutdownDuringSnapshot: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/7454/console Looks like it ran for an extended duration:
|
I was able to reproduce the same failure locally by running the single test method on repeat until failure from within IntelliJ. |
Thanks for checking @henningandersen, same here. Looking into fixing it now :) |
It seems that with the changes from #38368 we're missing a state update when nodes leave, so snapshots get stuck in a state like:
{
"snapshots": [
{
"repository": "test-repo",
"snapshot": "test-snap",
"uuid": "KuLcFik0T86h21Y4bMVZ4Q",
"include_global_state": true,
"partial": false,
"state": "STARTED",
"indices": [
{
"name": "test-idx",
"id": "shkWw-veQFaFvBHAGegcjQ"
}
],
"start_time_millis": 1549394078202,
"repository_state_id": -1,
"shards": [
{
"index": {
"index_name": "test-idx",
"index_uuid": "Lhug5HNuQ-i3ykpz0gkz-A"
},
"shard": 1,
"state": "SUCCESS",
"node": "6JO5slm7SzynyqKZ-C2sGg"
},
{
"index": {
"index_name": "test-idx",
"index_uuid": "Lhug5HNuQ-i3ykpz0gkz-A"
},
"shard": 0,
"state": "INIT",
"node": "Ovz6FD8bSEOdCzohp0_DKw"
}
]
}
]
}
When node |
This fixes it: we were missing the node removal event when it coincided with a master failover. Opening a PR in a bit (trying to create a deterministic test that reproduces the issue):
diff --git a/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java b/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java
index 17306e1585f..e2628fda991 100644
--- a/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java
+++ b/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java
@@ -687,8 +687,9 @@ public class SnapshotsService extends AbstractLifecycleComponent implements Clus
if (event.localNodeMaster()) {
// We don't remove old master when master flips anymore. So, we need to check for change in master
final SnapshotsInProgress snapshotsInProgress = event.state().custom(SnapshotsInProgress.TYPE);
+ final boolean newMaster = event.previousState().nodes().isLocalNodeElectedMaster() == false;
if (snapshotsInProgress != null) {
- if (removedNodesCleanupNeeded(snapshotsInProgress, event.nodesDelta().removedNodes())) {
+ if (newMaster || removedNodesCleanupNeeded(snapshotsInProgress, event.nodesDelta().removedNodes())) {
processSnapshotsOnRemovedNodes();
}
if (event.routingTableChanged() && waitingShardsStartedOrUnassigned(snapshotsInProgress, event)) {
@@ -704,7 +705,7 @@ public class SnapshotsService extends AbstractLifecycleComponent implements Clus
|| entry.state() != State.INIT && completed(entry.shards().values())
).forEach(this::endSnapshot);
}
- if (event.previousState().nodes().isLocalNodeElectedMaster() == false) {
+ if (newMaster) {
finalizeSnapshotDeletionFromPreviousMaster(event);
}
} |
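For readers skimming the diff, here is a minimal standalone sketch of why the extra `newMaster` check matters. It is not the actual `SnapshotsService` code: `ShardStatus`, `cleanupBeforeFix`, and `cleanupAfterFix` are hypothetical names, the shard states are reduced to the ones in the JSON above plus `FAILED`, and `removedNodesCleanupNeeded` is a simplified stand-in for the real method. It also assumes the node holding the `INIT` shard is the one that left, and that it left before the new master saw its first cluster state, so the removed-nodes delta of that event is empty.

```java
import java.util.List;
import java.util.Set;

public class SnapshotCleanupSketch {

    // Hypothetical, simplified stand-in for one shard entry of an in-progress snapshot.
    static final class ShardStatus {
        final int shard;
        final String state;
        final String nodeId;

        ShardStatus(int shard, String state, String nodeId) {
            this.shard = shard;
            this.state = state;
            this.nodeId = nodeId;
        }

        // Assumption: only SUCCESS and FAILED count as completed in this sketch.
        boolean completed() {
            return state.equals("SUCCESS") || state.equals("FAILED");
        }
    }

    // Simplified check: is any unfinished shard assigned to a node removed in this event?
    static boolean removedNodesCleanupNeeded(List<ShardStatus> shards, Set<String> removedNodeIds) {
        for (ShardStatus s : shards) {
            if (!s.completed() && removedNodeIds.contains(s.nodeId)) {
                return true;
            }
        }
        return false;
    }

    // Condition before the diff: only the removed-nodes delta of the current event is consulted.
    static boolean cleanupBeforeFix(boolean newMaster, List<ShardStatus> shards, Set<String> removed) {
        return removedNodesCleanupNeeded(shards, removed);
    }

    // Condition after the diff: a freshly elected master always runs the cleanup once.
    static boolean cleanupAfterFix(boolean newMaster, List<ShardStatus> shards, Set<String> removed) {
        return newMaster || removedNodesCleanupNeeded(shards, removed);
    }

    public static void main(String[] args) {
        // Shard entries copied from the stuck snapshot shown earlier in this thread.
        List<ShardStatus> shards = List.of(
            new ShardStatus(1, "SUCCESS", "6JO5slm7SzynyqKZ-C2sGg"),
            new ShardStatus(0, "INIT", "Ovz6FD8bSEOdCzohp0_DKw"));

        // The removal coincided with the failover, so this event's delta is empty.
        Set<String> removedInThisEvent = Set.of();

        System.out.println("before fix: " + cleanupBeforeFix(true, shards, removedInThisEvent));
        System.out.println("after fix:  " + cleanupAfterFix(true, shards, removedInThisEvent));
    }
}
```

With the old condition, the new master never sees the removal in any delta, so the `INIT` shard is never cleaned up and the snapshot stays in `STARTED`. With the new condition, the first event on the new master triggers the cleanup regardless of the delta.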
Seems #38368 did destabilize org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT on CI (so far I can't reproduce this locally, though). I'm on it.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=java11,nodes=immutable&&linux&&docker/224/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1846/console