
[TEST] RecoveryIT.testHistoryUUIDIsGenerated occasionally fails on 6.x #31291

Closed
dimitris-athanasiou opened this issue Jun 13, 2018 · 16 comments
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >test-failure Triaged test failures from CI v6.7.0

Comments

@dimitris-athanasiou
Contributor

Build: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-windows-compatibility/1119/console

Reproduce with:

gradlew :qa:rolling-upgrade:v5.6.10-SNAPSHOT#upgradedClusterTestRunner -Dtests.seed=CC9DDAFEF7C7362F -Dtests.class=org.elasticsearch.upgrades.RecoveryIT -Dtests.method="testHistoryUUIDIsGenerated" -Dtests.security.manager=true -Dtests.locale=en-IN -Dtests.timezone=Australia/Eucla

Failure:

RecoveryIT.testHistoryUUIDIsGenerated <<< FAILURES!
07:49:28    > Throwable #1: java.lang.AssertionError: different history uuid found for shard on BPdfH1zoREy4bR0ga9iksg
07:49:28    > Expected: "D9AAXTrJT7WOJsynoFB0LQ"
07:49:28    >      but: was "JsLgmYhvTEyjm_x9NWESGQ"
07:49:28    > 	at __randomizedtesting.SeedInfo.seed([CC9DDAFEF7C7362F:E43FE5891A4AA329]:0)
07:49:28    > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
07:49:28    > 	at org.elasticsearch.upgrades.RecoveryIT.testHistoryUUIDIsGenerated(RecoveryIT.java:79)
07:49:28    > 	at java.lang.Thread.run(Thread.java:748)

I could not reproduce locally.
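
For context, the assertion compares the history UUID reported for each copy of the shard. A minimal sketch of that kind of check, written against the low-level REST client and assuming shard-level index stats expose the commit's history_uuid tag (the actual RecoveryIT code may differ), looks roughly like this:

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class HistoryUuidCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder address; the real test talks to the upgraded rolling-upgrade test cluster.
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Shard-level stats report each copy's Lucene commit user data,
            // which on 6.x carries the history_uuid tag.
            Request request = new Request("GET", "/index_history_uuid/_stats");
            request.addParameter("level", "shards");
            String body = EntityUtils.toString(client.performRequest(request).getEntity());

            // Crude extraction for illustration only; the real test parses the JSON response.
            Set<String> historyUuids = new HashSet<>();
            Matcher m = Pattern.compile("\"history_uuid\"\\s*:\\s*\"([^\"]+)\"").matcher(body);
            while (m.find()) {
                historyUuids.add(m.group(1));
            }

            // The invariant the test asserts: every copy of the shard reports the same history UUID.
            if (historyUuids.size() != 1) {
                throw new AssertionError("different history uuids found: " + historyUuids);
            }
        }
    }
}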

@dimitris-athanasiou dimitris-athanasiou added >test-failure Triaged test failures from CI >upgrade :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Jun 13, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dimitris-athanasiou
Contributor Author

I've muted this test on 6.x.

@bleskes
Contributor

bleskes commented Jun 22, 2018

@dnhatn ping..

@dnhatn
Member

dnhatn commented Jun 22, 2018

@bleskes Looking...

@dnhatn
Member

dnhatn commented Jun 23, 2018

I opened #31542

@dnhatn
Member

dnhatn commented Jun 24, 2018

I've enabled debug logging and unmuted this test. I will keep an eye on it.

@dnhatn
Member

dnhatn commented Jul 10, 2018

Another instance https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+default-distro-bwc-tests/141/console. CI log: history_uuid.txt.

Since this test does not fail very often, I won't mute it. I am looking into this.

@dnhatn
Member

dnhatn commented Jul 10, 2018

I looked at this. Sadly, we still don't have enough information. I have increased the logging level and will wait for another failure.

@cbuescher
Member

@dnhatn another instance today.
Maybe this time the increased logging helps: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+default-distro/150/console

@dnhatn
Member

dnhatn commented Jul 13, 2018

This failed on 6.3, but I had not increased the logging level on 6.3. I have now increased logging for 6.3 as well and will wait for another failure. Thanks for the ping, @cbuescher!

@dnhatn
Member

dnhatn commented Jul 20, 2018

@dnhatn
Member

dnhatn commented Jul 23, 2018

We have two different history_uuids in this test because the index [index_history_uuid] was relocated from a 5.x node to a 6.x node. The test played out as follows (a toy model of this sequence is sketched after the log excerpts below):

  1. The primary (allocation_id z0Ph-alJRWWJr6LWONJv2Q) was allocated on node-0 (5.x)
[2018-07-20T14:54:15,945][DEBUG][o.e.i.IndexService       ] [node-0] [index_history_uuid] creating shard_id [index_history_uuid][0]
[2018-07-20T14:54:15,954][DEBUG][o.e.i.s.IndexShard       ] [node-0] [index_history_uuid][0] starting recovery from store ...

[2018-07-20T14:54:15,965][DEBUG][o.e.c.a.s.ShardStateAction] [node-0] [index_history_uuid][0] sending [internal:cluster/shard/started] to [65oBw7POR8GDa4P4qeXKbg] 
for shard entry [shard id [[index_history_uuid][0]], allocation id [z0Ph-alJRWWJr6LWONJv2Q], primary term [0], message [after new shard recovery]]
  2. The replica (allocation id XouyzbFCRI6gZXsQvwMEow) was allocated on node-2 (5.x) and recovered from node-0 (5.x)
[2018-07-20T14:54:15,988][DEBUG][o.e.i.s.IndexShard       ] [node-2] [index_history_uuid][0] state: [CREATED]->[RECOVERING], 
reason [from {node-0}{h95bxFeJRXmPIkTfmYqqWw}{p038wDwtTE-wig470uAJzQ}{127.0.0.1}{127.0.0.1:43341}{testattr=test, gen=old}]

[2018-07-20T14:54:16,039][DEBUG][o.e.c.a.s.ShardStateAction] [node-2] [index_history_uuid][0] sending [internal:cluster/shard/started] to [65oBw7POR8GDa4P4qeXKbg] 
for shard entry [shard id [[index_history_uuid][0]], allocation id [XouyzbFCRI6gZXsQvwMEow], primary term [0], message [after peer recovery]]
  3. node-0 was shut down to upgrade to 6.x
[2018-07-20T14:54:23,955][INFO ][o.e.n.Node               ] [upgraded-node-0] node name [upgraded-node-0], node ID [h95bxFeJRXmPIkTfmYqqWw]
  4. The replica (XouyzbFCRI6gZXsQvwMEow) on node-2 was promoted to primary

  5. The master allocated a new replica (allocation id 53gPTAmpR9W3BGA3bblDIQ) on node-1, which recovered from node-2

[2018-07-20T14:54:36,036][DEBUG][o.e.i.s.IndexShard       ] [node-1] [index_history_uuid][0] state: [CREATED]->[RECOVERING], 
reason [from {node-2}{65oBw7POR8GDa4P4qeXKbg}{2wI684M0T7eH2HDnvQbrew}{127.0.0.1}{127.0.0.1:44748}{testattr=test, gen=old}]

[2018-07-20T14:54:36,571][DEBUG][o.e.c.a.s.ShardStateAction] [node-1] [index_history_uuid][0] sending [internal:cluster/shard/started] to [65oBw7POR8GDa4P4qeXKbg] 
for shard entry [shard id [[index_history_uuid][0]], allocation id [53gPTAmpR9W3BGA3bblDIQ], primary term [0], message [after peer recovery]]
  6. node-1 was shut down to upgrade to 6.x
[2018-07-20T14:54:43,427][INFO ][o.e.n.Node               ] [upgraded-node-1] node name [upgraded-node-1], node ID [hr0M7tSkQj20Bz0cxGp1gg]
  7. The allocation was delayed (because of node_left), and a new replica was then re-allocated on node-1 (6.x). The new replica on node-1 bootstrapped a new history_uuid (_iuqpOaISOC6lU4pO4YLKw).
[2018-07-20T14:54:57,364][DEBUG][o.e.i.s.IndexShard       ] [upgraded-node-1] [index_history_uuid][0] state: [CREATED]->[RECOVERING], 
reason [from {node-2}{65oBw7POR8GDa4P4qeXKbg}{2wI684M0T7eH2HDnvQbrew}{127.0.0.1}{127.0.0.1:44748}{testattr=test, gen=old}]

[2018-07-20T14:54:57,896][DEBUG][o.e.i.s.Store            ] [upgraded-node-1] [index_history_uuid][0] bootstrap 6.x commit tags [{local_checkpoint=-1, history_uuid=_iuqpOaISOC6lU4pO4YLKw, max_seq_no=-1}], 
user_data [{max_unsafe_auto_id_timestamp=-1, translog_uuid=caEJxALmThuJDK-gYShYoA, translog_generation=1}]
  8. The rebalancer decided to relocate the primary from node-2 (5.x) to node-0 (6.x)
[2018-07-20T14:54:58,110][DEBUG][o.e.c.r.a.a.BalancedShardsAllocator] [node-2] Relocate shard [[index_history_uuid][0], node[65oBw7POR8GDa4P4qeXKbg], [P], s[STARTED], a[id=XouyzbFCRI6gZXsQvwMEow]] 
from node [65oBw7POR8GDa4P4qeXKbg] to node [h95bxFeJRXmPIkTfmYqqWw]
  9. The relocating shard recovered from node-2 and bootstrapped a new history_uuid (RQ6qJbquSCaTJlRUXXWcUA)
[2018-07-20T14:54:58,126][DEBUG][o.e.i.s.IndexShard       ] [upgraded-node-0] [index_history_uuid][0] state: [CREATED]->[RECOVERING], 
reason [from {node-2}{65oBw7POR8GDa4P4qeXKbg}{2wI684M0T7eH2HDnvQbrew}{127.0.0.1}{127.0.0.1:44748}{testattr=test, gen=old}]

[2018-07-20T14:54:58,661][DEBUG][o.e.i.s.Store            ] [upgraded-node-0] [index_history_uuid][0] bootstrap 6.x commit tags [{local_checkpoint=-1, history_uuid=RQ6qJbquSCaTJlRUXXWcUA, max_seq_no=-1}],
user_data [{max_unsafe_auto_id_timestamp=-1, translog_uuid=wsKtwrYERWmf_kwd5tLigw, translog_generation=1}]
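
To make the mechanism concrete, here is a toy model of the sequence above (my own sketch, not Elasticsearch code). It only assumes that a 6.x recovery which finds no history UUID in the recovered commit bootstraps a fresh random one, which is why the new replica on upgraded node-1 and the relocated primary on upgraded node-0 end up with different values.

import java.util.UUID;

// Toy model only: each 6.x recovery that finds no history UUID in the commit it
// recovered (because the data came from a 5.x copy) bootstraps a fresh random one.
public class HistoryUuidToyModel {

    static String bootstrapOn6xNode(String historyUuidFromRecoveredCommit) {
        return historyUuidFromRecoveredCommit != null
                ? historyUuidFromRecoveredCommit
                : UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // The 5.x primary on node-2 has no history UUID in its commit.
        String historyUuidOn5xPrimary = null;

        // Step 7: a new replica is allocated on upgraded node-1, recovers from the
        // 5.x primary, and bootstraps its own history UUID.
        String replicaUuid = bootstrapOn6xNode(historyUuidOn5xPrimary);

        // Steps 8-9: the primary relocates from 5.x node-2 to upgraded node-0 and
        // independently bootstraps a second, different history UUID.
        String relocatedPrimaryUuid = bootstrapOn6xNode(historyUuidOn5xPrimary);

        // The two 6.x copies now disagree, which is exactly what the assertion catches.
        System.out.println("replica on upgraded node-1: " + replicaUuid);
        System.out.println("primary on upgraded node-0: " + relocatedPrimaryUuid);
        System.out.println("equal? " + replicaUuid.equals(relocatedPrimaryUuid)); // false
    }
}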

@dnhatn
Member

dnhatn commented Jul 23, 2018

I opened #32270

dnhatn added a commit that referenced this issue Jul 23, 2018
dnhatn added a commit that referenced this issue Jul 23, 2018
dnhatn added a commit that referenced this issue Jul 23, 2018
@gwbrown gwbrown removed the >upgrade label Dec 14, 2018
@dnhatn dnhatn added the v6.7.0 label Feb 25, 2019
dnhatn added a commit that referenced this issue Feb 27, 2019
If the `index_history_uuid` can be allocated to any testing node, the
rebalance allocation decider may relocate that index. If the index is
relocated from 5.x to 6.x, we might have two different history_uuids
because history_uuid is bootstrapped twice by two different 6.x nodes:
a replica and a relocating primary.

This change prevents `index_history_uuid` from relocating by allocating
that index to two nodes only.

Closes #31291
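
The commit message above does not quote the settings change itself. One way to pin an index to two specific nodes is index-level shard allocation filtering; below is a minimal sketch of that approach (the node names are placeholders, and the actual change in #32270 may be implemented differently):

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class CreatePinnedIndex {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request createIndex = new Request("PUT", "/index_history_uuid");
            // index.routing.allocation.include._name restricts allocation to the named
            // nodes, so the rebalancer cannot relocate the shard onto a third node.
            // "node-0" and "node-1" are placeholder node names for this sketch.
            createIndex.setJsonEntity(
                "{\n" +
                "  \"settings\": {\n" +
                "    \"index.number_of_shards\": 1,\n" +
                "    \"index.number_of_replicas\": 1,\n" +
                "    \"index.routing.allocation.include._name\": \"node-0,node-1\"\n" +
                "  }\n" +
                "}");
            client.performRequest(createIndex);
        }
    }
}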
@dnhatn
Member

dnhatn commented Feb 27, 2019

Fixed in #32270.

@dnhatn dnhatn closed this as completed Feb 27, 2019
dnhatn added a commit that referenced this issue Feb 27, 2019
@DaveCTurner
Contributor

@dnhatn this test is still muted in master, 7.x and 7.0. Should it be unmuted there?

@dnhatn
Member

dnhatn commented Mar 4, 2019

@DaveCTurner Yes, we should unmute it. I'll do it. Thanks for the ping.

dnhatn added a commit that referenced this issue Mar 4, 2019
dnhatn added a commit that referenced this issue Mar 4, 2019
dnhatn added a commit that referenced this issue Mar 4, 2019