[CI] SnapshotStressTestsIT testRandomActivities failing #104939

Closed

DaveCTurner opened this issue Jan 30, 2024 · 2 comments · Fixed by #105245
Labels
:Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs)
medium-risk (An open issue or test failure that is a medium risk to future releases)
Team:Distributed (Meta label for distributed team)
>test-failure (Triaged test failures from CI)

Comments

@DaveCTurner (Contributor)

Not the same as #104072. The cluster state at failure suggests we've dropped a finalization task somehow (maybe related to a node-left event):

   1 ▽ (14) {cluster_uuid: "3iF3FzgnQbiZ4MEVUf_52w", version: 455, state_uuid: "XV_EtlP7Tp-3Yf4S5wFyYg", master_node: "SyU09oHfQdq6_ADzHuF9sw", blocks: {}, nodes: {…}, nodes_versions: […], nodes_features: […], metadata: {…}, routing_table: {…}, routing_nodes: {…}, snapshot_deletions: {…}, health: {…}, snapshots: {…}}
   2     cluster_uuid: "3iF3FzgnQbiZ4MEVUf_52w"
   3     version: 455
   4     state_uuid: "XV_EtlP7Tp-3Yf4S5wFyYg"
   5     master_node: "SyU09oHfQdq6_ADzHuF9sw"
   6     blocks: {}
   7   ▷ nodes: (4) {SyU09oHfQdq6_ADzHuF9sw: {…}, TBO2zztJRu2lpLoSDchIUQ: {…}, "o--_8jfeSTmH5ATd8cis_Q": {…}, n7WWMC0kTFOz6K6yuVroOw: {…}}
  82   ▷ nodes_versions: (4) [{…}, {…}, {…}, {…}]
 140   ▷ nodes_features: (4) [{…}, {…}, {…}, {…}]
 166   ▷ metadata: (9) {cluster_uuid: "3iF3FzgnQbiZ4MEVUf_52w", cluster_uuid_committed: true, cluster_coordination: {…}, templates: {…}, indices: {…}, repositories: {…}, "index-graveyard": {…}, persistent_tasks: {…}, reserved_state: {}}
1214   ▷ routing_table: (1) {indices: {"index-6": {…}, "index-0": {…}, "index-1": {…}, "index-3": {…}, "index-2": {…}, "index-4": {…}, "index-7": {…}, "index-5": {…}}}
1554   ▷ routing_nodes: (2) {unassigned: [], nodes: {…}}
1827   ▽ snapshot_deletions: (1) {snapshot_deletions: [{repository: "repo-1", snapshots: […], start_time: "2024-01-26T21:55:05.463Z", start_time_millis: 1706306105463, repository_state_id: 24, state: "WAITING"}]}
1828     ▽ snapshot_deletions: (1) [{repository: "repo-1", snapshots: […], start_time: "2024-01-26T21:55:05.463Z", start_time_millis: 1706306105463, repository_state_id: 24, state: "WAITING"}]
1829       ▽ [0]: (6) {repository: "repo-1", snapshots: […], start_time: "2024-01-26T21:55:05.463Z", start_time_millis: 1706306105463, repository_state_id: 24, state: "WAITING"}
1830           repository: "repo-1"
1831         ▽ snapshots: (4) ["snapshot-partial-49", "snapshot-partial-8", "snapshot-partial-10", "snapshot-partial-5"]
1832             [0]: "snapshot-partial-49"
1833             [1]: "snapshot-partial-8"
1834             [2]: "snapshot-partial-10"
1835             [3]: "snapshot-partial-5"
1837           start_time: "2024-01-26T21:55:05.463Z"
1838           start_time_millis: 1706306105463
1839           repository_state_id: 24
1840           state: "WAITING"
1844   ▷ health: (2) {disk: {…}, shard_limits: {…}}
1858   ▼ snapshots: (1) {snapshots: [{repository: "repo-1", snapshot: "snapshot-partial-34", uuid: "e2rqGJv-RTapmYZHofl-fw", include_global_state: true, partial: true, state: "SUCCESS", indices: […], start_time: "2024-01-26T21:55:14.716Z", start_time_millis: 1706306114716, repository_state_id: 24, shards: […], feature_states: [], data_streams: []}]}
1859     ▽ snapshots: (1) [{repository: "repo-1", snapshot: "snapshot-partial-34", uuid: "e2rqGJv-RTapmYZHofl-fw", include_global_state: true, partial: true, state: "SUCCESS", indices: […], start_time: "2024-01-26T21:55:14.716Z", start_time_millis: 1706306114716, repository_state_id: 24, shards: […], feature_states: [], data_streams: []}]
1860       ▽ [0]: (13) {repository: "repo-1", snapshot: "snapshot-partial-34", uuid: "e2rqGJv-RTapmYZHofl-fw", include_global_state: true, partial: true, state: "SUCCESS", indices: […], start_time: "2024-01-26T21:55:14.716Z", start_time_millis: 1706306114716, repository_state_id: 24, shards: […], feature_states: [], data_streams: []}
1861           repository: "repo-1"
1862           snapshot: "snapshot-partial-34"
1863           uuid: "e2rqGJv-RTapmYZHofl-fw"
1864           include_global_state: true
1865           partial: true
1866           state: "SUCCESS"
1867         ▷ indices: (8) [{…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}]
1901           start_time: "2024-01-26T21:55:14.716Z"
1902           start_time_millis: 1706306114716
1903           repository_state_id: 24
1904         ▽ shards: (19) [{…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}, {…}]
1905           ▷ [0]: (6) {index: {…}, shard: 0, state: "FAILED", generation: "TOdX3wdgRCiISfDoBdxJSA", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
1916           ▷ [1]: (6) {index: {…}, shard: 0, state: "FAILED", generation: "GeXG7ci9SbuhRzVuoz2j0A", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
1927           ▷ [2]: (6) {index: {…}, shard: 2, state: "FAILED", generation: "UGF5HAe5RDmvkMpM2LpCkw", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
1938           ▷ [3]: (6) {index: {…}, shard: 0, state: "FAILED", generation: "C59sEw0WTNSObp24ZhRhdw", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
1949           ▷ [4]: (6) {index: {…}, shard: 4, state: "FAILED", generation: "56s0k5MTQimT2kt4MAOgVg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
1960           ▷ [5]: (6) {index: {…}, shard: 3, state: "FAILED", generation: "yZafCREsT5ColMIHdYvEeg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
1971           ▷ [6]: (6) {index: {…}, shard: 2, state: "FAILED", generation: "J1Y7sqi_R6WQYNB74JwiSg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
1982           ▷ [7]: (6) {index: {…}, shard: 1, state: "FAILED", generation: "jGyO3iITTaWQpe_kNoiw_Q", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
1993           ▷ [8]: (6) {index: {…}, shard: 1, state: "FAILED", generation: "aTmdCOpwSuSLcz35XqOKpg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2004           ▷ [9]: (6) {index: {…}, shard: 0, state: "FAILED", generation: "3pPnCXohQtu0F-l8t8_ibg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2015           ▷ [10]: (6) {index: {…}, shard: 0, state: "MISSING", generation: "C-FSij6JRLKp86r9KEInRQ", node: null, reason: "primary shard is not allocated"}
2026           ▷ [11]: (6) {index: {…}, shard: 0, state: "FAILED", generation: "7qj2AuiDRIORdetYE3Rtuw", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2037           ▷ [12]: (6) {index: {…}, shard: 3, state: "FAILED", generation: "m69FC_knQni0zOJWQCAKIg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2048           ▷ [13]: (6) {index: {…}, shard: 2, state: "FAILED", generation: "So1Gnk0YSkKGhgX7rRBcvw", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2059           ▷ [14]: (6) {index: {…}, shard: 1, state: "FAILED", generation: "_pNn0VxDR5Gj3ehR-d2kCg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2070           ▷ [15]: (6) {index: {…}, shard: 0, state: "FAILED", generation: "l4eh9ypOQniwYlP9ySOcYg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2081           ▷ [16]: (6) {index: {…}, shard: 2, state: "FAILED", generation: "X8WtbMerSwm7XRJh_E5Itg", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2092           ▷ [17]: (6) {index: {…}, shard: 1, state: "FAILED", generation: "NmSyXaavRQKPywGYFhfUdw", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2103           ▷ [18]: (6) {index: {…}, shard: 0, state: "FAILED", generation: "Rf5BOCSgScSXBY3r3cxA0Q", node: "TBO2zztJRu2lpLoSDchIUQ", reason: "node left the cluster during snapshot"}
2115           feature_states: []
2116           data_streams: []

Build scan:
https://gradle-enterprise.elastic.co/s/ed4zqzurtbpn6/tests/:server:internalClusterTest/org.elasticsearch.snapshots.SnapshotStressTestsIT/testRandomActivities

Reproduction line:

gradlew ':server:internalClusterTest' --tests "org.elasticsearch.snapshots.SnapshotStressTestsIT.testRandomActivities" -Dtests.seed=25353760A7947592 -Dtests.locale=ms-MY -Dtests.timezone=Jamaica -Druntime.java=21

Applicable branches:
main, 8.12

Reproduces locally?:
Didn't try

Failure history:
Failure dashboard for org.elasticsearch.snapshots.SnapshotStressTestsIT#testRandomActivities

Failure excerpt:

java.lang.AssertionError: failed to acquire all permits: [repo-1, snapshot-partial-8, snapshot-partial-10, snapshot-partial-5, node_s3, node_s2, node_s1, node_s0]

  at org.junit.Assert.fail(Assert.java:88)
  at org.elasticsearch.snapshots.SnapshotStressTestsIT$TrackedCluster.run(SnapshotStressTestsIT.java:346)
  at org.elasticsearch.snapshots.SnapshotStressTestsIT.testRandomActivities(SnapshotStressTestsIT.java:87)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

@DaveCTurner added the :Distributed/Snapshot/Restore, >test-failure and medium-risk labels on Jan 30, 2024
@elasticsearchmachine added the Team:Distributed label on Jan 30, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner (Contributor, Author)

I've seen this failure in another test in which we had more verbose logging:

fail-2023-12-25T15:49:30.191Z-4dac44f91ae404e81db9861c43e9803243e15c25.log.gz

In that case the cluster state update that moved the stuck snapshot to a completed state was triggered by an UpdateSnapshotStatusAction for [index-0][1] in snapshot-clone-11, moving that shard to state SUCCESS. While that update was in flight the node left the cluster, so the other shard snapshots moved to state MISSING in the same update, which completed snapshot-partial-12 without ever starting its finalization.
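
To make that sequence concrete, here is a minimal, self-contained toy model of the race. All class, field and method names below are hypothetical and are not the real `SnapshotsService` code: within a single batch, the shard-SUCCESS update only triggers a completion check for its own snapshot, while the node-left fallout completes a second snapshot that nobody checks.

```java
import java.util.*;

public class DroppedFinalizationSketch {
    enum ShardState {
        INIT, QUEUED, SUCCESS, MISSING;

        boolean completed() {
            return this == SUCCESS || this == MISSING;
        }
    }

    record ShardUpdate(String snapshot, String shard, ShardState newState) {}

    static final Map<String, Map<String, ShardState>> snapshots = new HashMap<>();
    static final List<String> finalizationQueue = new ArrayList<>();

    public static void main(String[] args) {
        // Two concurrent snapshots; the shards of the second one live on the node that is about to leave.
        snapshots.put("snapshot-clone-11", new HashMap<>(Map.of("[index-0][1]", ShardState.INIT)));
        snapshots.put("snapshot-partial-12", new HashMap<>(Map.of(
                "[index-0][0]", ShardState.INIT,
                "[index-1][0]", ShardState.QUEUED)));

        // One batch: a shard-SUCCESS update arrives while the node holding the other shards leaves the cluster.
        applyBatch(List.of(new ShardUpdate("snapshot-clone-11", "[index-0][1]", ShardState.SUCCESS)), true);

        System.out.println("completed snapshots:       " + completedSnapshots());
        System.out.println("enqueued for finalization: " + finalizationQueue);
        // Both snapshots are now complete, but only snapshot-clone-11 was enqueued;
        // snapshot-partial-12 stays in the cluster state forever, so the test's permits are never released.
    }

    static void applyBatch(List<ShardUpdate> updates, boolean nodeLeft) {
        for (ShardUpdate update : updates) {
            snapshots.get(update.snapshot()).put(update.shard(), update.newState());
            // Pre-fix behaviour: completion is only checked for the snapshot named in this update.
            if (isComplete(update.snapshot())) {
                finalizationQueue.add(update.snapshot());
            }
        }
        if (nodeLeft) {
            // Shards on the departed node jump straight to MISSING. This can complete *other*
            // snapshots, but there is no per-update listener for them, so nothing enqueues them.
            snapshots.values().forEach(shards -> shards.replaceAll(
                    (shard, state) -> state.completed() ? state : ShardState.MISSING));
        }
    }

    static boolean isComplete(String snapshot) {
        return snapshots.get(snapshot).values().stream().allMatch(ShardState::completed);
    }

    static List<String> completedSnapshots() {
        return snapshots.keySet().stream().filter(DroppedFinalizationSketch::isComplete).sorted().toList();
    }
}
```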

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 7, 2024
Today when processing a batch of `ShardSnapshotUpdate` tasks each
update's listener considers whether the corresponding snapshot has
completed and, if so, it enqueues it for finalization. This is somewhat
inefficient since we may be processing many shard snapshot updates for
the same few snapshots, but there's no need to check each snapshot for
completion more than once. It's also insufficient since the completion
of a shard snapshot may cause the completion of subsequent snapshots too
(e.g. they can go from state `QUEUED` straight to `MISSING`).

This commit detaches the completion handling from the individual shard
snapshot updates and instead makes sure that any snapshot that reaches a
completed state is enqueued for finalization.

Closes elastic#104939
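
Read against the toy model in the previous comment, the change amounts to something like the following (again purely illustrative, not the actual `SnapshotsService` implementation): drop the per-update completion check and instead scan for newly completed snapshots once, after the whole batch, including any node-left fallout, has been applied.

```java
// Illustrative replacement for applyBatch(...) in the toy model above.
static void applyBatch(List<ShardUpdate> updates, boolean nodeLeft) {
    Set<String> completeBefore = new HashSet<>(completedSnapshots());

    for (ShardUpdate update : updates) {
        snapshots.get(update.snapshot()).put(update.shard(), update.newState());
    }
    if (nodeLeft) {
        snapshots.values().forEach(shards -> shards.replaceAll(
                (shard, state) -> state.completed() ? state : ShardState.MISSING));
    }

    // Single pass over the resulting state: every snapshot that became complete during this
    // batch is enqueued exactly once, regardless of which update (or node-left event) completed it.
    for (String snapshot : completedSnapshots()) {
        if (completeBefore.contains(snapshot) == false) {
            finalizationQueue.add(snapshot);
        }
    }
}
```

With this version, both snapshot-clone-11 and snapshot-partial-12 end up in the finalization queue after the batch from the first sketch.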
elasticsearchmachine pushed a commit that referenced this issue Feb 9, 2024
DaveCTurner added a commit that referenced this issue Feb 9, 2024