[CI] SnapshotStressTestsIT testRandomActivities failing #104939

Closed

DaveCTurner opened this issue Jan 30, 2024 · 2 comments · Fixed by #105245
Labels
:Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs)
medium-risk (An open issue or test failure that is a medium risk to future releases)
Team:Distributed (Meta label for distributed team)
>test-failure (Triaged test failures from CI)

Comments

@DaveCTurner (Contributor)

Not the same as #104072. The cluster state at failure suggests we've dropped a finalization task somehow (maybe related to a node-left event):

   1 ▽ (14) {cluster_uuid: "3iF3FzgnQbiZ4MEVUf_52w", version: 455, state_uuid: "XV_EtlP7Tp-3Yf4S5wFyYg", master_node: "SyU09oHfQdq6_ADzHuF9sw", blocks: {}, nodes: {…}, nodes_versions: […], nodes_features: […], metadata: {…}, routing_table: {…}, routing_nodes: {…}, snapshot_deletions: {…}, health: {…}, snapshots: {…}}
   2     cluster_uuid: "3iF3FzgnQbiZ4MEVUf_52w"
   3     version: 455
   4     state_uuid: "XV_EtlP7Tp-3Yf4S5wFyYg"
   5     master_node: "SyU09oHfQdq6_ADzHuF9sw"
   6     blocks: {}
   7   ▷ nodes: (4) {SyU09oHfQdq6_ADzHuF9sw: {…}, TBO2zztJRu2lpLoSDchIUQ: {…}, "o--_8jfeSTmH5ATd8cis_Q": {…}, n7WWMC0kTFOz6K6yuVroOw: {…}}
  82   ▷ nodes_versions: (4) [{…}, {…}, {…}, {…}]
 140   ▷ nodes_features: (4) [{…}, {…}, {…}, {…}]
 166   ▷ metadata: (9) {cluster_uuid: "3iF3FzgnQbiZ4MEVUf_52w", cluster_uuid_committed: true, cluster_coordination: {…}, templates: {…}, indices: {…}, repositories: {…}, "index-graveyard": {…}, persistent_tasks: {…}, reserved_state: {}}
1214   ▷ routing_table: (1) {indices: {"index-6": {…}, "index-0": {…}, "index-1": {…}, "index-3": {…}, "index-2": {…}, "index-4": {…}, "index-7": {…}, "index-5": {…}}}
1554   ▷ routing_nodes: (2) {unassigned: [], nodes: {…}}
1827   ▽ snapshot_deletions: (1) {snapshot_deletions: [{repository: "repo-1", snapshots: […], start_time: "2024-01-26T21:55:05.463Z", start_time_millis: 1706306105463, repository_state_id: 24, state: "WAITING"}]}
1828     ▽ snapshot_deletions: (1) [{repository: "repo-1", snapshots: […], start_time: "2024-01-26T21:55:05.463Z", start_time_millis: 1706306105463, repository_state_id: 24, state: "WAITING"}]
1829       ▽ [0]: (6) {repository: "repo-1", snapshots: […], start_time: "2024-01-26T21:55:05.463Z", start_time_millis: 1706306105463, repository_state_id: 24, state: "WAITING"}
1830           repository: "repo-1"
1831         ▽ snapshots: (4) ["snapshot-partial-49", "snapshot-partial-8", "snapshot-partial-10", "snapshot-partial-5"]
1832             [0]: "snapshot-partial-49"
1833             [1]: "snapshot-partial-8"
1834             [2]: "snapshot-partial-10"
1835             [3]: "snapshot-partial-5"
1837           start_time: "2024-01-26T21:55:05.463Z"
1838           start_time_millis: 1706306105463
1839           repository_state_id: 24
1840           state: "WAITING"
1844   ▷ health: (2) {disk: {…}, shard_limits: {…}}
1858   ▼ snapshots: (1) {snapshots: [{repository: "repo-1", snapshot: "snapshot-partial-34", uuid: "e2rqGJv-RTapmYZHofl-fw", include_global_state: true, partial: true, state: "SUCCESS", indices: […], start_time: "2024-01-26T21:55:14.716Z", start_time_millis: 1706306114716, repository_state_id: 24, shards: […], feature_states: [], data_streams: []}]}
1859     ▽ snapshots: (1) [{repository: "repo-1", snapshot: "snapshot-partial-34", uuid: "e2rqGJv-RTapmYZHofl-fw", include_global_state: true, partial: true, state: "SUCCESS", indices: […], start_time: "2024-01-26T21:55:14.716Z", start_time_millis: 1706306114716, repository_state_id: 24, shards: […], feature_states: [], data_streams: []}]
1860       ▽ [0]: (13) {repository: "repo-1", snapshot: "snapshot-partial-34", uuid: "e2rqGJv-RTapmYZHofl-fw", include_global_state: true, partial: true, state: "SUCCESS", indices: […], start_time: "2024-01-26T21:55:14.716Z", start_time_millis: 1706306114716, repository_state_id: 24, shards: […], feature_states: [], data_streams: []}
1861           repository: "repo-1"
1862           snapshot: "snapshot-partial-34"
1863           uuid: "e2rqGJv-RTapmYZHofl-fw"
1864           include_global_state: true
1865           partial: true
1866           state: "SUCCESS"
1867         ▷ indices: (8) [{…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}]
1901           start_time: "2024-01-26T21:55:14.716Z"
1902           start_time_millis: 1706306114716
1903           repository_state_id: 24
1904         ▽ shards: (19) [{…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}]
1905           ▷ [0]: (6) {index: {…}, shard: 0, state: "FAILED", generation: "TOdX3wdgRCiISfDoBdxJSA", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
1916           ▷ [1]: (6) {index: {…}, shard: 0, state: "FAILED", generation: "GeXG7ci9SbuhRzVuoz2j0A", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
1927           ▷ [2]: (6) {index: {…}, shard: 2, state: "FAILED", generation: "UGF5HAe5RDmvkMpM2LpCkw", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
1938           ▷ [3]: (6) {index: {…}, shard: 0, state: "FAILED", generation: "C59sEw0WTNSObp24ZhRhdw", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
1949           ▷ [4]: (6) {index: {…}, shard: 4, state: "FAILED", generation: "56s0k5MTQimT2kt4MAOgVg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
1960           ▷ [5]: (6) {index: {…}, shard: 3, state: "FAILED", generation: "yZafCREsT5ColMIHdYvEeg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
1971           ▷ [6]: (6) {index: {…}, shard: 2, state: "FAILED", generation: "J1Y7sqi_R6WQYNB74JwiSg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
1982           ▷ [7]: (6) {index: {…}, shard: 1, state: "FAILED", generation: "jGyO3iITTaWQpe_kNoiw_Q", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
1993           ▷ [8]: (6) {index: {…}, shard: 1, state: "FAILED", generation: "aTmdCOpwSuSLcz35XqOKpg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2004           ▷ [9]: (6) {index: {…}, shard: 0, state: "FAILED", generation: "3pPnCXohQtu0F-l8t8_ibg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2015           ▷ [10]: (6) {index: {…}, shard: 0, state: "MISSING", generation: "C-FSij6JRLKp86r9KEInRQ", node: null, reason: "primary shard is not allocated"}
2026           ▷ [11]: (6) {index: {…}, shard: 0, state: "FAILED", generation: "7qj2AuiDRIORdetYE3Rtuw", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2037           ▷ [12]: (6) {index: {…}, shard: 3, state: "FAILED", generation: "m69FC_knQni0zOJWQCAKIg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2048           ▷ [13]: (6) {index: {…}, shard: 2, state: "FAILED", generation: "So1Gnk0YSkKGhgX7rRBcvw", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2059           ▷ [14]: (6) {index: {…}, shard: 1, state: "FAILED", generation: "_pNn0VxDR5Gj3ehR-d2kCg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2070           ▷ [15]: (6) {index: {…}, shard: 0, state: "FAILED", generation: "l4eh9ypOQniwYlP9ySOcYg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2081           ▷ [16]: (6) {index: {…}, shard: 2, state: "FAILED", generation: "X8WtbMerSwm7XRJh_E5Itg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2092           ▷ [17]: (6) {index: {…}, shard: 1, state: "FAILED", generation: "NmSyXaavRQKPywGYFhfUdw", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2103           ▷ [18]: (6) {index: {…}, shard: 0, state: "FAILED", generation: "Rf5BOCSgScSXBY3r3cxA0Q", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2115           feature_states: []
2116           data_streams: []

Build scan:
https://gradle-enterprise.elastic.co/s/ed4zqzurtbpn6/tests/:server:internalClusterTest/org.elasticsearch.snapshots.SnapshotStressTestsIT/testRandomActivities

Reproduction line:

gradlew ':server:internalClusterTest' --tests "org.elasticsearch.snapshots.SnapshotStressTestsIT.testRandomActivities" -Dtests.seed=25353760A7947592 -Dtests.locale=ms-MY -Dtests.timezone=Jamaica -Druntime.java=21

Applicable branches:
main, 8.12

Reproduces locally?:
Didn't try

Failure history:
Failure dashboard for org.elasticsearch.snapshots.SnapshotStressTestsIT#testRandomActivities

Failure excerpt:

java.lang.AssertionError: failed to acquire all permits: [repo-1, snapshot-partial-8, snapshot-partial-10, snapshot-partial-5, node_s3, node_s2, node_s1, node_s0]

  at org.junit.Assert.fail(Assert.java:88)
  at org.elasticsearch.snapshots.SnapshotStressTestsIT$TrackedCluster.run(SnapshotStressTestsIT.java:346)
  at org.elasticsearch.snapshots.SnapshotStressTestsIT.testRandomActivities(SnapshotStressTestsIT.java:87)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

@DaveCTurner added the :Distributed/Snapshot/Restore, >test-failure and medium-risk labels on Jan 30, 2024
@elasticsearchmachine added the Team:Distributed label on Jan 30, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner (Contributor, Author)

I've seen this failure in another test in which we had more verbose logging:

fail-2023-12-25T15:49:30.191Z-4dac44f91ae404e81db9861c43e9803243e15c25.log.gz

In that case the cluster state update that moved the stuck snapshot to a completed state was triggered by an UpdateSnapshotStatusAction for [index-0][1] in snapshot-clone-11, moving that shard to state SUCCESS. While that update was in flight the node left the cluster, so the other shard snapshots moved to state MISSING in the same update, which completed snapshot-partial-12 without ever starting its finalization.
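
To make that sequence concrete, here is a minimal, self-contained toy model of the race. All class, field and method names below are hypothetical and are not the real `SnapshotsService` code: within a single batch, the shard-SUCCESS update only triggers a completion check for its own snapshot, while the node-left fallout completes a second snapshot that nobody checks.

```java
import java.util.*;

public class DroppedFinalizationSketch {
    enum ShardState {
        INIT, QUEUED, SUCCESS, MISSING;

        boolean completed() {
            return this == SUCCESS || this == MISSING;
        }
    }

    record ShardUpdate(String snapshot, String shard, ShardState newState) {}

    static final Map<String, Map<String, ShardState>> snapshots = new HashMap<>();
    static final List<String> finalizationQueue = new ArrayList<>();

    public static void main(String[] args) {
        // Two concurrent snapshots; the shards of the second one live on the node that is about to leave.
        snapshots.put("snapshot-clone-11", new HashMap<>(Map.of("[index-0][1]", ShardState.INIT)));
        snapshots.put("snapshot-partial-12", new HashMap<>(Map.of(
                "[index-0][0]", ShardState.INIT,
                "[index-1][0]", ShardState.QUEUED)));

        // One batch: a shard-SUCCESS update arrives while the node holding the other shards leaves the cluster.
        applyBatch(List.of(new ShardUpdate("snapshot-clone-11", "[index-0][1]", ShardState.SUCCESS)), true);

        System.out.println("completed snapshots:       " + completedSnapshots());
        System.out.println("enqueued for finalization: " + finalizationQueue);
        // Both snapshots are now complete, but only snapshot-clone-11 was enqueued;
        // snapshot-partial-12 stays in the cluster state forever, so the test's permits are never released.
    }

    static void applyBatch(List<ShardUpdate> updates, boolean nodeLeft) {
        for (ShardUpdate update : updates) {
            snapshots.get(update.snapshot()).put(update.shard(), update.newState());
            // Pre-fix behaviour: completion is only checked for the snapshot named in this update.
            if (isComplete(update.snapshot())) {
                finalizationQueue.add(update.snapshot());
            }
        }
        if (nodeLeft) {
            // Shards on the departed node jump straight to MISSING. This can complete *other*
            // snapshots, but there is no per-update listener for them, so nothing enqueues them.
            snapshots.values().forEach(shards -> shards.replaceAll(
                    (shard, state) -> state.completed() ? state : ShardState.MISSING));
        }
    }

    static boolean isComplete(String snapshot) {
        return snapshots.get(snapshot).values().stream().allMatch(ShardState::completed);
    }

    static List<String> completedSnapshots() {
        return snapshots.keySet().stream().filter(DroppedFinalizationSketch::isComplete).sorted().toList();
    }
}
```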

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 7, 2024
Today when processing a batch of `ShardSnapshotUpdate` tasks each
update's listener considers whether the corresponding snapshot has
completed and, if so, it enqueues it for finalization. This is somewhat
inefficient since we may be processing many shard snapshot updates for
the same few snapshots, but there's no need to check each snapshot for
completion more than once. It's also insufficient since the completion
of a shard snapshot may cause the completion of subsequent snapshots too
(e.g. they can go from state `QUEUED` straight to `MISSING`).

This commit detaches the completion handling from the individual shard
snapshot updates and instead makes sure that any snapshot that reaches a
completed state is enqueued for finalization.

Closes elastic#104939
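
Read against the toy model in the previous comment, the change amounts to something like the following (again purely illustrative, not the actual `SnapshotsService` implementation): drop the per-update completion check and instead scan for newly completed snapshots once, after the whole batch, including any node-left fallout, has been applied.

```java
// Illustrative replacement for applyBatch(...) in the toy model above.
static void applyBatch(List<ShardUpdate> updates, boolean nodeLeft) {
    Set<String> completeBefore = new HashSet<>(completedSnapshots());

    for (ShardUpdate update : updates) {
        snapshots.get(update.snapshot()).put(update.shard(), update.newState());
    }
    if (nodeLeft) {
        snapshots.values().forEach(shards -> shards.replaceAll(
                (shard, state) -> state.completed() ? state : ShardState.MISSING));
    }

    // Single pass over the resulting state: every snapshot that became complete during this
    // batch is enqueued exactly once, regardless of which update (or node-left event) completed it.
    for (String snapshot : completedSnapshots()) {
        if (completeBefore.contains(snapshot) == false) {
            finalizationQueue.add(snapshot);
        }
    }
}
```

With this version, both snapshot-clone-11 and snapshot-partial-12 end up in the finalization queue after the batch from the first sketch.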
elasticsearchmachine pushed a commit that referenced this issue Feb 9, 2024
DaveCTurner added a commit that referenced this issue Feb 9, 2024