
[TEST] RecoveryIT.testHistoryUUIDIsGenerated occasionally fails on 6.x #31291

Closed
dimitris-athanasiou opened this issue Jun 13, 2018 · 16 comments
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >test-failure Triaged test failures from CI v6.7.0

Comments

@dimitris-athanasiou
Contributor

Build: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-windows-compatibility/1119/console

Reproduce with:

gradlew :qa:rolling-upgrade:v5.6.10-SNAPSHOT#upgradedClusterTestRunner -Dtests.seed=CC9DDAFEF7C7362F -Dtests.class=org.elasticsearch.upgrades.RecoveryIT -Dtests.method="testHistoryUUIDIsGenerated" -Dtests.security.manager=true -Dtests.locale=en-IN -Dtests.timezone=Australia/Eucla

Failure:

RecoveryIT.testHistoryUUIDIsGenerated <<< FAILURES!
07:49:28    > Throwable #1: java.lang.AssertionError: different history uuid found for shard on BPdfH1zoREy4bR0ga9iksg
07:49:28    > Expected: "D9AAXTrJT7WOJsynoFB0LQ"
07:49:28    >      but: was "JsLgmYhvTEyjm_x9NWESGQ"
07:49:28    > 	at __randomizedtesting.SeedInfo.seed([CC9DDAFEF7C7362F:E43FE5891A4AA329]:0)
07:49:28    > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
07:49:28    > 	at org.elasticsearch.upgrades.RecoveryIT.testHistoryUUIDIsGenerated(RecoveryIT.java:79)
07:49:28    > 	at java.lang.Thread.run(Thread.java:748)

I could not reproduce locally.
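
For context, the assertion compares the history UUID reported for each copy of the shard. A minimal sketch of that kind of check, written against the low-level REST client and assuming shard-level index stats expose the commit's history_uuid tag (the actual RecoveryIT code may differ), looks roughly like this:

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class HistoryUuidCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder address; the real test talks to the upgraded rolling-upgrade test cluster.
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Shard-level stats report each copy's Lucene commit user data,
            // which on 6.x carries the history_uuid tag.
            Request request = new Request("GET", "/index_history_uuid/_stats");
            request.addParameter("level", "shards");
            String body = EntityUtils.toString(client.performRequest(request).getEntity());

            // Crude extraction for illustration only; the real test parses the JSON response.
            Set<String> historyUuids = new HashSet<>();
            Matcher m = Pattern.compile("\"history_uuid\"\\s*:\\s*\"([^\"]+)\"").matcher(body);
            while (m.find()) {
                historyUuids.add(m.group(1));
            }

            // The invariant the test asserts: every copy of the shard reports the same history UUID.
            if (historyUuids.size() != 1) {
                throw new AssertionError("different history uuids found: " + historyUuids);
            }
        }
    }
}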

@dimitris-athanasiou dimitris-athanasiou added >test-failure Triaged test failures from CI >upgrade :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Jun 13, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dimitris-athanasiou
Contributor Author

I've muted this test on 6.x.

@bleskes
Contributor

bleskes commented Jun 22, 2018

@dnhatn ping..

@dnhatn
Member

dnhatn commented Jun 22, 2018

@bleskes Looking...

@dnhatn
Member

dnhatn commented Jun 23, 2018

I opened #31542

@dnhatn
Member

dnhatn commented Jun 24, 2018

I've enabled debug logging and unmuted this test. I will keep an eye on it.

@dnhatn
Member

dnhatn commented Jul 10, 2018

Another instance https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+default-distro-bwc-tests/141/console. CI log: history_uuid.txt.

Since this test does not fail very often, I won't mute it. I am looking into this.

@dnhatn
Member

dnhatn commented Jul 10, 2018

I looked at this. Sadly, we still don't have enough information. I have increased the logging level and will wait for another failure.

@cbuescher
Member

@dnhatn another instance today.
Maybe this time the increased logging helps: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+default-distro/150/console

@dnhatn
Member

dnhatn commented Jul 13, 2018

This failed on 6.3, but I had not increased the logging level on 6.3. I have now increased logging for 6.3 as well and will wait for another failure. Thanks for the ping, @cbuescher!

@dnhatn
Member

dnhatn commented Jul 20, 2018

@dnhatn
Member

dnhatn commented Jul 23, 2018

We have two different history_uuids in this test because the index [index_history_uuid] was relocated from a 5.x node to a 6.x node. The test played out as follows (a toy model of this sequence is sketched after the log excerpts below):

  1. The primary (allocation_id z0Ph-alJRWWJr6LWONJv2Q) was allocated on node-0 (5.x)
[2018-07-20T14:54:15,945][DEBUG][o.e.i.IndexService       ] [node-0] [index_history_uuid] creating shard_id [index_history_uuid][0]
[2018-07-20T14:54:15,954][DEBUG][o.e.i.s.IndexShard       ] [node-0] [index_history_uuid][0] starting recovery from store ...

[2018-07-20T14:54:15,965][DEBUG][o.e.c.a.s.ShardStateAction] [node-0] [index_history_uuid][0] sending [internal:cluster/shard/started] to [65oBw7POR8GDa4P4qeXKbg] 
for shard entry [shard id [[index_history_uuid][0]], allocation id [z0Ph-alJRWWJr6LWONJv2Q], primary term [0], message [after new shard recovery]]
  2. The replica (allocation id XouyzbFCRI6gZXsQvwMEow) was allocated on node-2 (5.x) and recovered from node-0 (5.x)
[2018-07-20T14:54:15,988][DEBUG][o.e.i.s.IndexShard       ] [node-2] [index_history_uuid][0] state: [CREATED]->[RECOVERING], 
reason [from {node-0}{h95bxFeJRXmPIkTfmYqqWw}{p038wDwtTE-wig470uAJzQ}{127.0.0.1}{127.0.0.1:43341}{testattr=test, gen=old}]

[2018-07-20T14:54:16,039][DEBUG][o.e.c.a.s.ShardStateAction] [node-2] [index_history_uuid][0] sending [internal:cluster/shard/started] to [65oBw7POR8GDa4P4qeXKbg] 
for shard entry [shard id [[index_history_uuid][0]], allocation id [XouyzbFCRI6gZXsQvwMEow], primary term [0], message [after peer recovery]]
  3. node-0 was shut down to upgrade to 6.x
[2018-07-20T14:54:23,955][INFO ][o.e.n.Node               ] [upgraded-node-0] node name [upgraded-node-0], node ID [h95bxFeJRXmPIkTfmYqqWw]
  4. The replica (XouyzbFCRI6gZXsQvwMEow) on node-2 was promoted to primary

  5. The master allocated a new replica (allocation id 53gPTAmpR9W3BGA3bblDIQ) on node-1, which recovered from node-2

[2018-07-20T14:54:36,036][DEBUG][o.e.i.s.IndexShard       ] [node-1] [index_history_uuid][0] state: [CREATED]->[RECOVERING], 
reason [from {node-2}{65oBw7POR8GDa4P4qeXKbg}{2wI684M0T7eH2HDnvQbrew}{127.0.0.1}{127.0.0.1:44748}{testattr=test, gen=old}]

[2018-07-20T14:54:36,571][DEBUG][o.e.c.a.s.ShardStateAction] [node-1] [index_history_uuid][0] sending [internal:cluster/shard/started] to [65oBw7POR8GDa4P4qeXKbg] 
for shard entry [shard id [[index_history_uuid][0]], allocation id [53gPTAmpR9W3BGA3bblDIQ], primary term [0], message [after peer recovery]]
  6. node-1 was shut down to upgrade to 6.x
[2018-07-20T14:54:43,427][INFO ][o.e.n.Node               ] [upgraded-node-1] node name [upgraded-node-1], node ID [hr0M7tSkQj20Bz0cxGp1gg]
  7. The allocation was delayed (because of node_left), and a new replica was then re-allocated on node-1 (6.x). The new replica on node-1 bootstrapped a new history_uuid (_iuqpOaISOC6lU4pO4YLKw).
[2018-07-20T14:54:57,364][DEBUG][o.e.i.s.IndexShard       ] [upgraded-node-1] [index_history_uuid][0] state: [CREATED]->[RECOVERING], 
reason [from {node-2}{65oBw7POR8GDa4P4qeXKbg}{2wI684M0T7eH2HDnvQbrew}{127.0.0.1}{127.0.0.1:44748}{testattr=test, gen=old}]

[2018-07-20T14:54:57,896][DEBUG][o.e.i.s.Store            ] [upgraded-node-1] [index_history_uuid][0] bootstrap 6.x commit tags [{local_checkpoint=-1, history_uuid=_iuqpOaISOC6lU4pO4YLKw, max_seq_no=-1}], 
user_data [{max_unsafe_auto_id_timestamp=-1, translog_uuid=caEJxALmThuJDK-gYShYoA, translog_generation=1}]
  8. The rebalancer decided to relocate the primary from node-2 (5.x) to node-0 (6.x)
[2018-07-20T14:54:58,110][DEBUG][o.e.c.r.a.a.BalancedShardsAllocator] [node-2] Relocate shard [[index_history_uuid][0], node[65oBw7POR8GDa4P4qeXKbg], [P], s[STARTED], a[id=XouyzbFCRI6gZXsQvwMEow]] 
from node [65oBw7POR8GDa4P4qeXKbg] to node [h95bxFeJRXmPIkTfmYqqWw]
  9. The relocating shard recovered from node-2 and bootstrapped a new history_uuid (RQ6qJbquSCaTJlRUXXWcUA)
[2018-07-20T14:54:58,126][DEBUG][o.e.i.s.IndexShard       ] [upgraded-node-0] [index_history_uuid][0] state: [CREATED]->[RECOVERING], 
reason [from {node-2}{65oBw7POR8GDa4P4qeXKbg}{2wI684M0T7eH2HDnvQbrew}{127.0.0.1}{127.0.0.1:44748}{testattr=test, gen=old}]

[2018-07-20T14:54:58,661][DEBUG][o.e.i.s.Store            ] [upgraded-node-0] [index_history_uuid][0] bootstrap 6.x commit tags [{local_checkpoint=-1, history_uuid=RQ6qJbquSCaTJlRUXXWcUA, max_seq_no=-1}],
user_data [{max_unsafe_auto_id_timestamp=-1, translog_uuid=wsKtwrYERWmf_kwd5tLigw, translog_generation=1}]
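
To make the mechanism concrete, here is a toy model of the sequence above (my own sketch, not Elasticsearch code). It only assumes that a 6.x recovery which finds no history UUID in the recovered commit bootstraps a fresh random one, which is why the new replica on upgraded node-1 and the relocated primary on upgraded node-0 end up with different values.

import java.util.UUID;

// Toy model only: each 6.x recovery that finds no history UUID in the commit it
// recovered (because the data came from a 5.x copy) bootstraps a fresh random one.
public class HistoryUuidToyModel {

    static String bootstrapOn6xNode(String historyUuidFromRecoveredCommit) {
        return historyUuidFromRecoveredCommit != null
                ? historyUuidFromRecoveredCommit
                : UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // The 5.x primary on node-2 has no history UUID in its commit.
        String historyUuidOn5xPrimary = null;

        // Step 7: a new replica is allocated on upgraded node-1, recovers from the
        // 5.x primary, and bootstraps its own history UUID.
        String replicaUuid = bootstrapOn6xNode(historyUuidOn5xPrimary);

        // Steps 8-9: the primary relocates from 5.x node-2 to upgraded node-0 and
        // independently bootstraps a second, different history UUID.
        String relocatedPrimaryUuid = bootstrapOn6xNode(historyUuidOn5xPrimary);

        // The two 6.x copies now disagree, which is exactly what the assertion catches.
        System.out.println("replica on upgraded node-1: " + replicaUuid);
        System.out.println("primary on upgraded node-0: " + relocatedPrimaryUuid);
        System.out.println("equal? " + replicaUuid.equals(relocatedPrimaryUuid)); // false
    }
}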

@dnhatn
Member

dnhatn commented Jul 23, 2018

I opened #32270

dnhatn added a commit that referenced this issue Jul 23, 2018
dnhatn added a commit that referenced this issue Jul 23, 2018
dnhatn added a commit that referenced this issue Jul 23, 2018
@gwbrown gwbrown removed the >upgrade label Dec 14, 2018
@dnhatn dnhatn added the v6.7.0 label Feb 25, 2019
dnhatn added a commit that referenced this issue Feb 27, 2019
If the `index_history_uuid` can be allocated to any testing node, the
rebalance allocation decider may relocate that index. If the index is
relocated from 5.x to 6.x, we might have two different history_uuids
because history_uuid is bootstrapped twice by two different 6.x nodes:
a replica and a relocating primary.

This change prevents `index_history_uuid` from relocating by allocating
that index to two nodes only.

Closes #31291
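
The commit message above does not quote the settings change itself. One way to pin an index to two specific nodes is index-level shard allocation filtering; below is a minimal sketch of that approach (the node names are placeholders, and the actual change in #32270 may be implemented differently):

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class CreatePinnedIndex {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request createIndex = new Request("PUT", "/index_history_uuid");
            // index.routing.allocation.include._name restricts allocation to the named
            // nodes, so the rebalancer cannot relocate the shard onto a third node.
            // "node-0" and "node-1" are placeholder node names for this sketch.
            createIndex.setJsonEntity(
                "{\n" +
                "  \"settings\": {\n" +
                "    \"index.number_of_shards\": 1,\n" +
                "    \"index.number_of_replicas\": 1,\n" +
                "    \"index.routing.allocation.include._name\": \"node-0,node-1\"\n" +
                "  }\n" +
                "}");
            client.performRequest(createIndex);
        }
    }
}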
@dnhatn
Member

dnhatn commented Feb 27, 2019

Fixed in #32270.

@dnhatn dnhatn closed this as completed Feb 27, 2019
dnhatn added a commit that referenced this issue Feb 27, 2019
@DaveCTurner
Contributor

@dnhatn this test is still muted in master, 7.x and 7.0. Should it be unmuted there?

@dnhatn
Member

dnhatn commented Mar 4, 2019

@DaveCTurner Yes, we should unmute it. I'll do it. Thanks for the ping.

dnhatn added a commit that referenced this issue Mar 4, 2019
dnhatn added a commit that referenced this issue Mar 4, 2019
dnhatn added a commit that referenced this issue Mar 4, 2019