TEST: Capture replication targets when ReplicationGroup ready #34407

Merged
dnhatn merged 4 commits into elastic:master from dnhatn:replication-targets on Oct 17, 2018

3 participants
@dnhatn
Contributor

dnhatn commented Oct 11, 2018

Today, WriteReplicationAction reads the list of replication targets directly from the primary shard of the ReplicationGroup. This is fine unless a shard is added, removed, or promoted while a WriteReplicationAction is executing. We have encountered these two issues:

  1. Replicas are not found in the replication targets. This happens when a replica is removed while the WriteReplicationAction still uses the old replication targets, which include the removed replica.
ERROR   30.5s J3 | ShardFollowTaskReplicationTests.testFailLeaderReplicaShard <<< FAILURES!
  2>     ... 1 more
  2> Caused by: java.util.NoSuchElementException: No value present
  2>     at java.util.Optional.get(Optional.java:135)
  2>     at org.elasticsearch.index.replication.ESIndexLevelReplicationTestCase$ReplicationAction$ReplicasRef.performOn(ESIndexLevelReplicationTestCase.java:590)
  2>     at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplica(ReplicationOperation.java:166)
  2>     at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplicas(ReplicationOperation.java:153)
  2>     at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:124)
  2>     at org.elasticsearch.index.replication.ESIndexLevelReplicationTestCase$ReplicationAction.execute(ESIndexLevelReplicationTestCase.java:517)
  2>     at org.elasticsearch.index.replication.ESIndexLevelReplicationTestCase$ReplicationGroup.executeWriteRequest(ESIndexLevelReplicationTestCase.java:240)

CI: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+periodic/112/consoleText

  2. The ReplicationGroup is accessed from a primary shard that has not activated primary mode yet. This happens because a promoting shard does not activate primary mode immediately; activation follows the primary term bump, which is executed asynchronously.
  2> java.lang.AssertionError: shard [test][0], node[s2], [P], s[STARTED], a[id=MIjcMDyKRvWHNHaTbFi_-A] is not a primary shard in primary mode
  2>     at __randomizedtesting.SeedInfo.seed([9AD0BC1EEFEBA576]:0)
  2>     at org.elasticsearch.index.shard.IndexShard.assertPrimaryMode(IndexShard.java:1497)
  2>     at org.elasticsearch.index.shard.IndexShard.getReplicationGroup(IndexShard.java:1913)
  2>     at org.elasticsearch.index.replication.ESIndexLevelReplicationTestCase$ReplicationAction$PrimaryRef.getReplicationGroup(ESIndexLevelReplicationTestCase.java:577)
  1> [2018-10-11T03:58:46,300][INFO ][o.e.i.s.IndexShard       ] [org.elasticsearch.xpack.ccr.action.ShardFollowTaskReplicationTests] [test][0] detected new primary with primary term [72], global checkpoint [-1], max_seq_no [43]
  2>     at org.elasticsearch.action.support.replication.ReplicationOperation.checkActiveShardCount(ReplicationOperation.java:221)
  2>     at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:91)
  2>     at org.elasticsearch.index.replication.ESIndexLevelReplicationTestCase$ReplicationAction.execute(ESIndexLevelReplicationTestCase.java:519)
  2>     at org.elasticsearch.xpack.ccr.action.ShardFollowTaskReplicationTests$2.lambda$innerSendBulkShardOperationsRequest$0(ShardFollowTaskReplicationTests.java:358)

CI: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/159/consoleText
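
For illustration, the first failure can be reproduced in isolation. The following is only a minimal sketch, not the test framework's code: `StaleTargetsDemo`, `findReplica`, and the string shard ids are hypothetical stand-ins. It shows how a lookup that ends in `Optional.get()` fails once the replica named by a stale target list has been removed from the live group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Optional;

// Hypothetical sketch: names and types are assumptions, not Elasticsearch classes.
class StaleTargetsDemo {

    // Looks up a replica in the live list; throws when the replica is gone.
    static String findReplica(List<String> liveReplicas, String id) {
        Optional<String> match = liveReplicas.stream().filter(id::equals).findFirst();
        return match.get(); // NoSuchElementException if no replica matches
    }

    static List<String> run() {
        List<String> liveReplicas = new ArrayList<>(Arrays.asList("r1", "r2"));
        // Targets as seen by a write that is already in flight.
        List<String> staleTargets = new ArrayList<>(liveReplicas);
        // A replica is removed while the write is still executing.
        liveReplicas.remove("r2");

        List<String> results = new ArrayList<>();
        for (String target : staleTargets) {
            try {
                results.add("replicated to " + findReplica(liveReplicas, target));
            } catch (NoSuchElementException e) {
                results.add("lookup failed for " + target);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        for (String line : run()) {
            System.out.println(line);
        }
    }
}
```

The write still iterates over `r2`, but the lookup against the live group no longer finds it, which matches the shape of the `NoSuchElementException` in the first stack trace.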

This PR captures the replication targets when the replication group is ready and continues using those targets until we re-compute the new targets after the group has changed.
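
The idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `ReplicationGroupSketch`, `Targets`, `recomputeTargets`, and `captureTargets` are hypothetical names, and shards are represented by plain strings. The group publishes an immutable snapshot of its targets after every membership change, and a write works against the snapshot it captured at the start.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: names and types are assumptions, not the PR's classes.
class ReplicationGroupSketch {

    /** Immutable snapshot of the primary and its replicas at one point in time. */
    static final class Targets {
        final String primary;
        final List<String> replicas;

        Targets(String primary, List<String> replicas) {
            this.primary = primary;
            this.replicas = Collections.unmodifiableList(new ArrayList<>(replicas));
        }
    }

    private String primary;
    private final List<String> replicas = new ArrayList<>();
    // Published snapshot; write actions read this instead of the live fields.
    private volatile Targets targets;

    ReplicationGroupSketch(String primary) {
        this.primary = primary;
        recomputeTargets();
    }

    synchronized void addReplica(String replica) {
        replicas.add(replica);
        recomputeTargets();
    }

    synchronized void removeReplica(String replica) {
        replicas.remove(replica);
        recomputeTargets();
    }

    synchronized void promote(String replica) {
        if (!replicas.remove(replica)) {
            throw new IllegalArgumentException("unknown replica " + replica);
        }
        primary = replica;
        // Assumption: in a real group, the snapshot would only be published
        // once the promoted shard is ready to act as primary.
        recomputeTargets();
    }

    private void recomputeTargets() {
        targets = new Targets(primary, replicas);
    }

    /** Called once by a write action; later group changes do not affect the result. */
    Targets captureTargets() {
        return targets;
    }

    public static void main(String[] args) {
        ReplicationGroupSketch group = new ReplicationGroupSketch("p0");
        group.addReplica("r1");
        group.addReplica("r2");
        Targets inFlight = group.captureTargets();
        group.removeReplica("r2");
        System.out.println("in-flight replicas: " + inFlight.replicas);               // [r1, r2]
        System.out.println("current replicas: " + group.captureTargets().replicas);  // [r1]
    }
}
```

Because the snapshot holds both the primary and the replicas, an in-flight write never mixes a target list from one group state with lookups against another.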

Closes #33457

TEST: Capture replication targets when replication group ready
Today, WriteReplicationAction uses a set of replication targets directly
from the primary shard of ReplicationGroup. It should be fine except
when we add/remove or promote a shard while a write action is executing.
We have encountered these two issues:

1. Replicas are not found in the replication targets. This happens
because we remove replicas but the WriteReplicationAction still uses the
old replication targets which include the removed replicas.

2. The ReplicationGroup is accessed from a primary shard that has not
activated primary mode yet. This happens because a promoting shard does
not activate primary mode immediately; activation follows the primary
term bump, which is executed asynchronously.

This commit captures the replication targets when the replication group
is ready and continues using those targets until we re-compute the new
targets after the group has changed.

@martijnvg

LGTM

Contributor

dnhatn commented Oct 17, 2018

Thanks @martijnvg.

@dnhatn dnhatn merged commit eb36f10 into elastic:master Oct 17, 2018

4 checks passed

CLA: Commit author is a member of Elasticsearch
elasticsearch-ci: Build finished.
elasticsearch-ci/oss-distro-docs: Build finished.
elasticsearch-ci/packaging-sample: Build finished.

@dnhatn dnhatn deleted the dnhatn:replication-targets branch Oct 17, 2018
