Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SnapshotLifecycleRestIT testFullPolicySnapshot failure #50358

Closed
davidkyle opened this issue Dec 19, 2019 · 9 comments · Fixed by #51055 or #51769
Closed

SnapshotLifecycleRestIT testFullPolicySnapshot failure #50358

davidkyle opened this issue Dec 19, 2019 · 9 comments · Fixed by #51055 or #51769
Assignees
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI v7.10.0 v8.0.0-alpha1

Comments

@davidkyle
Copy link
Member

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/os=windows-2016/331/console
https://gradle-enterprise.elastic.co/s/fmz3orwsf6ceu

org.elasticsearch.client.ResponseException: 
method [DELETE], host [http://127.0.0.1:60674], URI [/_snapshot/my-repo/snap-dgpzyeujtlil-pxstwwurg], status line [HTTP/1.1 503 Service Unavailable]
{"error":{"root_cause":[{"type":"concurrent_snapshot_execution_exception","reason":"[my-repo:snap-dgpzyeujtlil-pxstwwurg/5U_JWp2pRceMG2VKwqteqQ] another snapshot is currently running cannot delete"}],"type":"concurrent_snapshot_execution_exception","reason":"[my-repo:snap-dgpzyeujtlil-pxstwwurg/5U_JWp2pRceMG2VKwqteqQ] another snapshot is currently running cannot delete"},"status":503}

at __randomizedtesting.SeedInfo.seed([A77F6987797E068:3AE5B97574CB83D]:0)
at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:283)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:261)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
at org.elasticsearch.test.rest.ESRestTestCase.wipeSnapshots(ESRestTestCase.java:613)
at org.elasticsearch.test.rest.ESRestTestCase.lambda$wipeCluster$3(ESRestTestCase.java:504)
at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:854)
at org.elasticsearch.test.rest.ESRestTestCase.wipeCluster(ESRestTestCase.java:503)
at org.elasticsearch.test.rest.ESRestTestCase.cleanUpCluster(ESRestTestCase.java:287)
...

Does not reproduce:

gradlew ':x-pack:plugin:ilm:qa:multi-node:integTestRunner' --tests "org.elasticsearch.xpack.slm.SnapshotLifecycleRestIT.testFullPolicySnapshot" \
  -Dtests.seed=A77F6987797E068 \
  -Dtests.security.manager=true \
  -Dtests.locale=ca \
  -Dtests.timezone=Europe/Ulyanovsk \
  -Dcompiler.java=13
@davidkyle davidkyle added >test-failure Triaged test failures from CI :Data Management/ILM+SLM Index and Snapshot lifecycle management v8.0.0 labels Dec 19, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@davidkyle
Copy link
Member Author

davidkyle added a commit that referenced this issue Dec 19, 2019
@danielmitterdorfer
Copy link
Member

We have another failure in 7.x (CI link, build scan):

The symptoms look a bit different to me though:

java.lang.AssertionError: 
Expected: a string starting with "snap-"
     but: was "snapuyytvuakl-31toqfartrwfolzgmlvdra"
	at __randomizedtesting.SeedInfo.seed([D4BE8BA449B8A835:DD6726AB6963F060]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.xpack.slm.SnapshotLifecycleRestIT.lambda$testFullPolicySnapshot$1(SnapshotLifecycleRestIT.java:110)

reproduction line:

./gradlew ':x-pack:plugin:ilm:qa:multi-node:integTestRunner' --tests "org.elasticsearch.xpack.slm.SnapshotLifecycleRestIT.testFullPolicySnapshot" -Dtests.seed=D4BE8BA449B8A835 -Dtests.security.manager=true -Dtests.locale=ko-KR -Dtests.timezone=Canada/Central -Dcompiler.java=13 -Druntime.java=8

Does not reproduce locally.

dakrone added a commit to dakrone/elasticsearch that referenced this issue Jan 14, 2020
This test failed a couple of different ways, related to timing, as well
as concurrent snapshots, and also naming.

This commit splits the giant `assertBusy` into separate parts so that we don't
perform ~5 different requests and tests in the same loop. It also gives each
test a unique repository so that no other test can accidentally re-use
snapshots.

Resolves elastic#50358 (hopefully!)
@polyfractal polyfractal added v7.7.0 and removed v7.6.0 labels Jan 15, 2020
dakrone added a commit that referenced this issue Jan 15, 2020
…51013)

This test failed a couple of different ways, related to timing, as well
as concurrent snapshots, and also naming.

This commit splits the giant `assertBusy` into separate parts so that we don't
perform ~5 different requests and tests in the same loop. It also gives each
test a unique repository so that no other test can accidentally re-use
snapshots.

Resolves #50358 (hopefully!)
dakrone added a commit to dakrone/elasticsearch that referenced this issue Jan 15, 2020
…lastic#51013)

This test failed a couple of different ways, related to timing, as well
as concurrent snapshots, and also naming.

This commit splits the giant `assertBusy` into separate parts so that we don't
perform ~5 different requests and tests in the same loop. It also gives each
test a unique repository so that no other test can accidentally re-use
snapshots.

Resolves elastic#50358 (hopefully!)
dakrone added a commit that referenced this issue Jan 15, 2020
This test failed a couple of different ways, related to timing, as well
as concurrent snapshots, and also naming.

This commit splits the giant `assertBusy` into separate parts so that we don't
perform ~5 different requests and tests in the same loop. It also gives each
test a unique repository so that no other test can accidentally re-use
snapshots.

Resolves #50358 (hopefully!)
@hendrikmuhs
Copy link
Contributor

Sorry, I have to re-open:

14:07:10 org.elasticsearch.xpack.slm.SnapshotLifecycleRestIT > testFullPolicySnapshot FAILED
14:07:10     java.lang.AssertionError: 
14:07:10     Expected: a string starting with "snap-"
14:07:10          but: was "snapyuqttxlpu-zwgyb5vytjics8kuqznzpq"
14:07:10         at __randomizedtesting.SeedInfo.seed([C2B5BA3D5616DEE8:CB6C173276CD86BD]:0)
14:07:10         at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
14:07:10         at org.junit.Assert.assertThat(Assert.java:956)
14:07:10         at org.junit.Assert.assertThat(Assert.java:923)
14:07:10         at org.elasticsearch.xpack.slm.SnapshotLifecycleRestIT.lambda$testFullPolicySnapshot$1(SnapshotLifecycleRestIT.java:110)
14:07:10         at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:904)
14:07:10         at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:877)
14:07:10         at org.elasticsearch.xpack.slm.SnapshotLifecycleRestIT.testFullPolicySnapshot(SnapshotLifecycleRestIT.java:101)
14:07:26 REPRODUCE WITH: ./gradlew ':x-pack:plugin:ilm:qa:multi-node:integTestRunner' --tests "org.elasticsearch.xpack.slm.SnapshotLifecycleRestIT.testFullPolicySnapshot" -Dtests.seed=C2B5BA3D5616DEE8 -Dtests.security.manager=true -Dtests.locale=es-DO -Dtests.timezone=SystemV/MST7MDT -Dcompiler.java=13 -Druntime.java=8
14:07:26 
14:07:26 Suite: Test class org.elasticsearch.xpack.slm.SnapshotLifecycleRestIT
14:07:26   1> [2020-01-17T05:06:57,560][INFO ][o.e.x.s.SnapshotLifecycleRestIT] [testFullPolicySnapshot] before test
14:07:26   1> [2020-01-17T05:06:57,563][INFO ][o.e.x.s.SnapshotLifecycleRestIT] [testFullPolicySnapshot] initializing REST clients against [http://[::1]:39063, http://127.0.0.1:33047, http://[::1]:35411, http://127.0.0.1:37777, http://[::1]:42741, http://127.0.0.1:46725, http://[::1]:35935, http://127.0.0.1:44347]
14:07:26   1> [2020-01-17T05:07:09,867][INFO ][o.e.x.s.SnapshotLifecycleRestIT] [testFullPolicySnapshot] after test
14:07:26   2> REPRODUCE WITH: ./gradlew ':x-pack:plugin:ilm:qa:multi-node:integTestRunner' --tests "org.elasticsearch.xpack.slm.SnapshotLifecycleRestIT.testFullPolicySnapshot" -Dtests.seed=C2B5BA3D5616DEE8 -Dtests.security.manager=true -Dtests.locale=es-DO -Dtests.timezone=SystemV/MST7MDT -Dcompiler.java=13 -Druntime.java=8
14:07:26   2> java.lang.AssertionError: 
14:07:26     Expected: a string starting with "snap-"
14:07:26          but: was "snapyuqttxlpu-zwgyb5vytjics8kuqznzpq"
14:07:26         at __randomizedtesting.SeedInfo.seed([C2B5BA3D5616DEE8:CB6C173276CD86BD]:0)
14:07:26         at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
14:07:26         at org.junit.Assert.assertThat(Assert.java:956)
14:07:26         at org.junit.Assert.assertThat(Assert.java:923)
14:07:26         at org.elasticsearch.xpack.slm.SnapshotLifecycleRestIT.lambda$testFullPolicySnapshot$1(SnapshotLifecycleRestIT.java:110)
14:07:26         at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:904)
14:07:26         at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:877)
14:07:26         at org.elasticsearch.xpack.slm.SnapshotLifecycleRestIT.testFullPolicySnapshot(SnapshotLifecycleRestIT.java:101)

from

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-unix-compatibility/os=amazon/513/consoleFull

@hendrikmuhs hendrikmuhs reopened this Jan 17, 2020
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
…tic#51055)

This test failed a couple of different ways, related to timing, as well
as concurrent snapshots, and also naming.

This commit splits the giant `assertBusy` into separate parts so that we don't
perform ~5 different requests and tests in the same loop. It also gives each
test a unique repository so that no other test can accidentally re-use
snapshots.

Resolves elastic#50358 (hopefully!)
dakrone added a commit to dakrone/elasticsearch that referenced this issue Jan 23, 2020
Rather than check the first returned snapshot for a snapshot starting with `snap-` in
SnapshotLifecycleRestIT.testFullPolicy, this commit changes the test to find any snapshots starting
with `snap-`.

In the event that there are no snapshots (the failure case), this also exposes the full results map
so we can diagnose why a failure occurred.

Relates to elastic#50358
dakrone added a commit that referenced this issue Jan 24, 2020
* Check all snapshots in SnapshotLifecycleRestIT.testFullPolicy

Rather than check the first returned snapshot for a snapshot starting with `snap-` in
SnapshotLifecycleRestIT.testFullPolicy, this commit changes the test to find any snapshots starting
with `snap-`.

In the event that there are no snapshots (the failure case), this also exposes the full results map
so we can diagnose why a failure occurred.

Relates to #50358

* Use a more imperative style for checking
dakrone added a commit to dakrone/elasticsearch that referenced this issue Jan 24, 2020
…c#51386)

* Check all snapshots in SnapshotLifecycleRestIT.testFullPolicy

Rather than check the first returned snapshot for a snapshot starting with `snap-` in
SnapshotLifecycleRestIT.testFullPolicy, this commit changes the test to find any snapshots starting
with `snap-`.

In the event that there are no snapshots (the failure case), this also exposes the full results map
so we can diagnose why a failure occurred.

Relates to elastic#50358

* Use a more imperative style for checking
dakrone added a commit that referenced this issue Jan 24, 2020
* Check all snapshots in SnapshotLifecycleRestIT.testFullPolicy

Rather than check the first returned snapshot for a snapshot starting with `snap-` in
SnapshotLifecycleRestIT.testFullPolicy, this commit changes the test to find any snapshots starting
with `snap-`.

In the event that there are no snapshots (the failure case), this also exposes the full results map
so we can diagnose why a failure occurred.

Relates to #50358

* Use a more imperative style for checking
@droberts195
Copy link
Contributor

It's now failing in a different way, but immediately after the code that was changed in #51386 to try and fix this:

Expected: <[test]>
     but: was <[index-ypfbpkpzll]>

That is coming from line 123 of SnapshotLifecycleRestIT.java, which is immediately after the code change of #51386:

        at __randomizedtesting.SeedInfo.seed([D1426930D73DDCAE:D89BC43FF7E684FB]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.slm.SnapshotLifecycleRestIT.lambda$testFullPolicySnapshot$2(SnapshotLifecycleRestIT.java:123)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:905)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:878)
        at org.elasticsearch.xpack.slm.SnapshotLifecycleRestIT.testFullPolicySnapshot(SnapshotLifecycleRestIT.java:108)

This has happened a few times on the 7.x branch. A couple of examples are:

A repro command is:

./gradlew ':x-pack:plugin:ilm:qa:multi-node:integTestRunner' --tests "org.elasticsearch.xpack.slm.SnapshotLifecycleRestIT.testFullPolicySnapshot" \
  -Dtests.seed=D1426930D73DDCAE \
  -Dtests.security.manager=true \
  -Dtests.locale=vi-VN \
  -Dtests.timezone=Africa/Maputo \
  -Dcompiler.java=13

This didn't reproduce locally for me.

dakrone added a commit to dakrone/elasticsearch that referenced this issue Jan 31, 2020
This previously was missing some key information in the output of the failure. This captures that
information and adds logging at each step so we can determine the cause *if* it fails again.

Resolves elastic#50358
dakrone added a commit that referenced this issue Jan 31, 2020
* Fix SnapshotLifecycleRestIT.testFullPolicySnapshot

This previously was missing some key information in the output of the failure. This captures that
information and adds logging at each step so we can determine the cause *if* it fails again.

Resolves #50358
dakrone added a commit to dakrone/elasticsearch that referenced this issue Jan 31, 2020
* Fix SnapshotLifecycleRestIT.testFullPolicySnapshot

This previously was missing some key information in the output of the failure. This captures that
information and adds logging at each step so we can determine the cause *if* it fails again.

Resolves elastic#50358
dakrone added a commit that referenced this issue Jan 31, 2020
* Fix SnapshotLifecycleRestIT.testFullPolicySnapshot

This previously was missing some key information in the output of the failure. This captures that
information and adds logging at each step so we can determine the cause *if* it fails again.

Resolves #50358
@davidkyle
Copy link
Member Author

It came back 😱

org.elasticsearch.client.ResponseException: method [DELETE], host [http://127.0.0.1:38055], URI [/_snapshot/full-policy-repo/snap-fy0vwyhpsrmv9f1bgahmtg], status line [HTTP/1.1 503 Service Unavailable]
{"error":{"root_cause":[{"type":"concurrent_snapshot_execution_exception","reason":"[full-policy-repo:snap-fy0vwyhpsrmv9f1bgahmtg/N6ULZf9_Spiwz8IGF0jHdw] another snapshot is currently running cannot delete"}],"type":"concurrent_snapshot_execution_exception","reason":"[full-policy-repo:snap-fy0vwyhpsrmv9f1bgahmtg/N6ULZf9_Spiwz8IGF0jHdw] another snapshot is currently running cannot delete"},"status":503}
	at __randomizedtesting.SeedInfo.seed([8453CDD48058A5E9:8D8A60DBA083FDBC]:0)
	at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:283)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:261)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
	at org.elasticsearch.test.rest.ESRestTestCase.wipeSnapshots(ESRestTestCase.java:622)
	at org.elasticsearch.test.rest.ESRestTestCase.lambda$wipeCluster$3(ESRestTestCase.java:510)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:875)
	at org.elasticsearch.test.rest.ESRestTestCase.wipeCluster(ESRestTestCase.java:509)
	at org.elasticsearch.test.rest.ESRestTestCase.cleanUpCluster(ESRestTestCase.java:293)

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=sles-15&&immutable/636/console

https://gradle-enterprise.elastic.co/s/l3ntz5zhjvxry/tests/ao2nxxbijves6-3ndcchqutszas

 ./gradlew ':x-pack:plugin:ilm:qa:multi-node:integTestRunner' --tests "org.elasticsearch.xpack.slm.SnapshotLifecycleRestIT.testFullPolicySnapshot" \
  -Dtests.seed=8453CDD48058A5E9 \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-SA \
  -Dtests.timezone=SystemV/HST10 \
  -Dcompiler.java=13

@davidkyle
Copy link
Member Author

Muted on master in 340fcd1

@bpintea bpintea added v7.8.0 and removed v7.7.0 labels Mar 25, 2020
@cbuescher
Copy link
Member

This failure on 7.7 looks similar.

org.elasticsearch.client.ResponseException: method [DELETE], host [http://[::1]:41150], URI [/_snapshot/full-policy-repo/snap-0zx3ektksugws-extubuza], status line [HTTP/1.1 503 Service Unavailable]
{"error":{"root_cause":[{"type":"concurrent_snapshot_execution_exception","reason":"[full-policy-repo:snap-0zx3ektksugws-extubuza/6v6TuJAsTNuvGuzs24UwcQ] another snapshot is currently running cannot delete"}],"type":"concurrent_snapshot_execution_exception","reason":"[full-policy-repo:snap-0zx3ektksugws-extubuza/6v6TuJAsTNuvGuzs24UwcQ] another snapshot is currently running cannot delete"},"status":503}
	at __randomizedtesting.SeedInfo.seed([A5EB03C9FA047982:AC32AEC6DADF21D7]:0)
	at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:283)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:261)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:267)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
	at org.elasticsearch.test.rest.ESRestTestCase.wipeSnapshots(ESRestTestCase.java:644)
	at org.elasticsearch.test.rest.ESRestTestCase.lambda$wipeCluster$4(ESRestTestCase.java:541)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:905)
	at org.elasticsearch.test.rest.ESRestTestCase.wipeCluster(ESRestTestCase.java:540)
	at org.elasticsearch.test.rest.ESRestTestCase.cleanUpCluster(ESRestTestCase.java:321)

https://gradle-enterprise.elastic.co/s/dfi2xbqejo4fu

Looks like it still fails on 7.x and 7.7 around 3-4 times a month. @dakrone should we mute those branches as well or are you looking for more infos?

@rjernst rjernst added the Team:Data Management Meta label for data/management team label May 4, 2020
@dakrone
Copy link
Member

dakrone commented Jul 16, 2020

I believe this should be resolved now that #56911 has been merged, so I'm going to close it, we can re-open if it occurs again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI v7.10.0 v8.0.0-alpha1
Projects
None yet