
[RCI] Check blocks while having index shard permit in TransportReplicationAction #35332

Conversation

@tlrx (Member) commented Nov 7, 2018

Today, the TransportReplicationAction checks the global-level and index-level blocks before routing the operation to the primary, in the ReroutePhase, at the very beginning of the transport replication action execution. For the upcoming rework of the Close Index API, and in order to deal with primary relocation, we'll also need to check for blocks before executing the operation on the primary (while holding a permit) and before routing the action to the new primary.

This pull request changes the AsyncPrimaryAction so that it checks the replication action's blocks before executing the operation locally or routing the primary action to the newly promoted primary shard. The check is done while holding a PrimaryShardReference.

Related to #33888
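The idea described above — checking blocks on the primary while holding an operation permit — can be sketched roughly as follows. This is a minimal, hypothetical model (a `Semaphore` stands in for `IndexShardOperationPermits`, a boolean flag for the cluster block state); it is not the actual Elasticsearch implementation:

```java
import java.util.concurrent.Semaphore;

// Simplified, hypothetical model of the primary-side check: blocks are
// re-checked *after* an operation permit has been acquired, so a block that
// was installed while all permits were held is reliably observed.
public class PrimaryBlockCheckSketch {

    // Stand-in for a cluster-level write block.
    public static volatile boolean writeBlocked = false;

    // Stand-in for IndexShardOperationPermits (the real class acquires asynchronously).
    private static final Semaphore permits = new Semaphore(Integer.MAX_VALUE);

    /** Runs the operation under a permit, or throws if a block is present. */
    public static String executeOnPrimary(String operation) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("no operation permit available");
        }
        try {
            if (writeBlocked) { // the block check happens while holding the permit
                throw new IllegalStateException("blocked: cluster write block");
            }
            return "executed: " + operation;
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        System.out.println(executeOnPrimary("index doc"));
        writeBlocked = true;
        try {
            executeOnPrimary("index doc");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```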

@tlrx tlrx added >enhancement :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Nov 7, 2018
@tlrx tlrx requested review from bleskes and ywelsch November 7, 2018 09:37
@elasticmachine (Collaborator) commented:
Pinging @elastic/es-distributed

@tlrx tlrx changed the title [RCI] Check blocks while having indexing permit in TransportReplicationAction [RCI] Check blocks while having index shard permit in TransportReplicationAction Nov 7, 2018
@tlrx tlrx mentioned this pull request Nov 7, 2018
@tlrx tlrx requested a review from s1monw November 7, 2018 16:20
@bleskes (Contributor) left a comment:

I left some initial suggestions

@@ -310,6 +337,10 @@ protected void doRun() throws Exception {
@Override
public void onResponse(PrimaryShardReference primaryShardReference) {
try {
final ClusterState clusterState = clusterService.state();
if (handleBlockExceptions(clusterState, request, this::handleBlockException)) {
@bleskes (Contributor):

handleBlockException re-resolves the index. I think we should always use the PrimaryShardReference here to make sure we always use the right one (with uuid and all).

@tlrx (Member, Author):

Good catch
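The reviewer's point can be illustrated with a small hypothetical sketch: resolving an index by name alone silently picks up a different index if the original was deleted and recreated under the same name, whereas resolving through a reference that carries the uuid detects the mismatch. The names below (`Index`, `resolveExact`) are illustrative stand-ins, not the real API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch: why resolving an index by name alone is unsafe.
public class IndexResolutionSketch {

    // An index identity is its name *plus* its uuid.
    public record Index(String name, String uuid) {}

    private final Map<String, Index> byName = new HashMap<>();

    public void put(Index index) { byName.put(index.name(), index); }

    // Unsafe: returns whatever index currently bears that name.
    public Index resolveByName(String name) { return byName.get(name); }

    // Safe: only matches if both name *and* uuid agree with the reference we hold.
    public Index resolveExact(Index reference) {
        Index current = byName.get(reference.name());
        return Objects.equals(current, reference) ? current : null;
    }

    public static void main(String[] args) {
        IndexResolutionSketch state = new IndexResolutionSketch();
        Index original = new Index("logs", "uuid-1");
        state.put(original);
        // The index is deleted and recreated under the same name:
        state.put(new Index("logs", "uuid-2"));
        System.out.println(state.resolveByName("logs"));  // a different index, same name
        System.out.println(state.resolveExact(original)); // null: uuid no longer matches
    }
}
```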

@tlrx (Member, Author) commented Nov 8, 2018

@bleskes I updated the code, let me know what you think.

@bleskes (Contributor) left a comment:

looking great. Left more comments

@@ -356,6 +382,11 @@ public void handleException(TransportException exp) {
}
}

private void handleBlockException(final ClusterBlockException blockException) {
@bleskes (Contributor):

do we really need this method? can't we just inline it in the two places we call it?

}
return false;
}

private void handleBlockException(ClusterBlockException blockException) {
@bleskes (Contributor):

having two methods whose names differ by just an s is tricky. Can you maybe come up with something else?

@tlrx (Member, Author):

Let's do something very simple; I pushed 5cc18b2

@bleskes (Contributor):

++

@@ -249,6 +250,53 @@ protected ClusterBlockLevel globalBlockLevel() {
assertListenerThrows("should fail with an IndexNotFoundException when no blocks checked", listener, IndexNotFoundException.class);
}

public void testBlocksInPrimaryAction() {
@bleskes (Contributor):

It's important to test that there is a happens-before relationship between acquiring all permits and checking/adding the block, and to check the block on each individual operation. Maybe add a test that acquires all the permits, concurrently enables another operation, and then adds a block? The test then checks that the operation picks up on the block.

@tlrx (Member, Author):

This is a totally valid test, but I don't think I can write it for now because the "all permits acquisition" logic is not exposed yet in IndexShard and TransportReplicationAction, and also because the current tests in this class rely on a mocked IndexShard with no real IndexShardOperationPermits (just a simple atomic counter instead).

Once this PR is merged, I'd like to open a follow-up PR which exposes the asyncBlockOperations(ActionListener&lt;Releasable&gt;) added in #34902 in both IndexShard and TransportReplicationAction. It will help to write the test you're thinking of using a non-mocked IndexShard instance. WDYT?

@bleskes (Contributor):

sounds good!
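The happens-before property discussed above can be sketched with a plain `Semaphore` standing in for `IndexShardOperationPermits` (hypothetical names; the real implementation uses `asyncBlockOperations` and is asynchronous): a block installed while all permits are held is guaranteed to be seen by any operation that subsequently acquires a permit.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the permit/block ordering guarantee.
public class PermitBlockOrderingSketch {
    static final int MAX_PERMITS = 4;
    static final Semaphore permits = new Semaphore(MAX_PERMITS);
    static volatile boolean blocked = false;

    // Install a block while holding *all* permits: no operation can be in flight.
    public static void installBlock() {
        permits.acquireUninterruptibly(MAX_PERMITS);
        try {
            blocked = true;
        } finally {
            permits.release(MAX_PERMITS);
        }
    }

    // An operation acquires one permit, then checks the block.
    public static boolean tryOperation() {
        if (!permits.tryAcquire()) return false;
        try {
            return !blocked; // true = executed, false = rejected by the block
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryOperation()); // true: no block yet
        installBlock();
        System.out.println(tryOperation()); // false: block observed after the permit
    }
}
```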

@bleskes (Contributor) left a comment:

Getting close. I think I found another edge case (if I'm correct, we need to sharpen our testing).

final IndexMetaData indexMetaData = clusterState.metaData().getIndexSafe(primaryShardReference.routingEntry().index());

final ClusterBlockException blockException = blockExceptions(clusterState, indexMetaData.getIndex().getName());
if (blockException != null) {
@bleskes (Contributor):

I think this doesn't go well if the cluster block is retryable (i.e., we don't retry). How about dealing with this in retryPrimaryException? That way we can also remove the special handling in the reroute phase.

@tlrx (Member, Author):

I looked at this before submitting this pull request and it looked good to me...

The ClusterBlockException is thrown and caught by the surrounding try-catch block, which calls AsyncPrimaryAction.onFailure(), which sends the exception back to where the action was executed, i.e. ReroutePhase.performAction(), which already takes care in handleException() of retrying the request if the exception is retryable.

I agree that we could remove the special handling in the reroute phase and let only the AsyncPrimaryAction check for blocks.

@tlrx (Member, Author):

We talked via another channel and Boaz's suggestion makes perfect sense. In fact I thought that what was suggested was already implemented like this, but it wasn't.
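The flow agreed on above — letting a retryable block surface as a retryable primary exception so the reroute phase retries — can be sketched roughly like this. These are simplified stand-ins, not the actual retryPrimaryException or handleException signatures:

```java
// Hypothetical sketch of the retry decision: a retryable cluster-block failure
// on the primary is surfaced to the coordinating node, which retries; a
// non-retryable one fails the request immediately.
public class RetryDecisionSketch {

    public static class ClusterBlockException extends RuntimeException {
        final boolean retryable;
        public ClusterBlockException(boolean retryable) {
            super("blocked" + (retryable ? " (retryable)" : ""));
            this.retryable = retryable;
        }
    }

    // Simplified stand-in for retryPrimaryException(...)
    public static boolean retryPrimaryException(Throwable e) {
        return e instanceof ClusterBlockException cbe && cbe.retryable;
    }

    // Simplified stand-in for ReroutePhase.handleException(...)
    public static String handleException(Throwable e) {
        return retryPrimaryException(e) ? "retry" : "fail";
    }

    public static void main(String[] args) {
        System.out.println(handleException(new ClusterBlockException(true)));  // retry
        System.out.println(handleException(new ClusterBlockException(false))); // fail
    }
}
```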


tlrx added a commit to tlrx/elasticsearch that referenced this pull request Nov 9, 2018
@tlrx (Member, Author) commented Nov 12, 2018

@bleskes I updated the code and also adapted reroute phase tests, can you have another look please? Thanks

@@ -696,12 +735,9 @@ public void onFailure(Exception e) {
protected void doRun() {
setPhase(task, "routing");
final ClusterState state = observer.setAndGetObservedState();
if (handleBlockExceptions(state)) {
@bleskes (Contributor):

I think we still want to shortcut on blocks here rather than wait for the primary to acquire a permit? Otherwise we just end up sending all requests to the primary rather than terminating them on the coordinating node (and also making them acquire a permit).

@tlrx (Member, Author):

Of course. I've been too quick on this.

@tlrx (Member, Author) commented Nov 12, 2018

@bleskes I updated the code again.

@bleskes (Contributor) left a comment:

LGTM. I left one comment about testing. No need for another round.

}
{
setState(clusterService,
ClusterState.builder(clusterService.state()).blocks(ClusterBlocks.builder().addGlobalBlock(retryableBlock)));
@bleskes (Contributor):

I think we want to test index level blocks too here

@tlrx (Member, Author):

Sure, I randomized the test so that it checks global or index-level blocks.

@tlrx tlrx force-pushed the check-blocks-while-having-permit-in-transport-replication-action branch from ef4375d to bbaf292 Compare November 13, 2018 14:41
@tlrx tlrx changed the base branch from replicated-closed-indices to master November 13, 2018 15:56
@tlrx tlrx force-pushed the check-blocks-while-having-permit-in-transport-replication-action branch from bbaf292 to ed7591b Compare November 13, 2018 17:06
@tlrx tlrx merged commit 31567ce into elastic:master Nov 14, 2018
@tlrx tlrx deleted the check-blocks-while-having-permit-in-transport-replication-action branch November 14, 2018 08:44
@tlrx (Member, Author) commented Nov 14, 2018

Thanks @bleskes !

tlrx added a commit that referenced this pull request Nov 14, 2018
…ationAction (#35332)

Today, the TransportReplicationAction checks the global level blocks and
the index level blocks before routing the operation to the primary, in the
ReroutePhase, and it happens at the very beginning of the transport
replication action execution. For the upcoming rework of the Close Index
API and in order to deal with primary relocation, we'll need to also check
for blocks before executing the operation on the primary (while holding a
permit) but before routing to the new primary.

This pull request changes the AsyncPrimaryAction so that it checks the
replication action's blocks before executing the operation locally or
routing the primary action to the newly promoted primary shard. The check
is done while holding a PrimaryShardReference.

Related to #33888
tlrx added a commit that referenced this pull request Nov 16, 2018
tlrx added a commit that referenced this pull request Nov 16, 2018
@tlrx (Member, Author) commented Nov 16, 2018

This has been reverted from master in d3d7c01 and 6.x in c70b8ac

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Nov 17, 2018
* master: (59 commits)
  SQL: Move internals from Joda to java.time (elastic#35649)
  Add HLRC docs for Get Lifecycle Policy (elastic#35612)
  Align RolloverStep's name with other step names (elastic#35655)
  Watcher: Use joda method to get local TZ (elastic#35608)
  Fix line length for org.elasticsearch.action.* files (elastic#35607)
  Remove use of AbstractComponent in server (elastic#35444)
  Deprecate types in count and msearch. (elastic#35421)
  Refactor an ambigious TermVectorsRequest constructor. (elastic#35614)
  [Scripting] Use Number as a return value for BucketAggregationScript (elastic#35653)
  Removes AbstractComponent from several classes (elastic#35566)
  [DOCS] Add beta warning to ILM pages. (elastic#35571)
  Deprecate types in validate query requests. (elastic#35575)
  Unmute BuildExamplePluginsIT
  Revert "AwaitsFix the RecoveryIT suite - see elastic#35597"
  Revert "[RCI] Check blocks while having index shard permit in TransportReplicationAction (elastic#35332)"
  Remove remaining line length violations for o.e.action.admin.cluster (elastic#35156)
  ML: Adjusing BWC version post backport to 6.6 (elastic#35605)
  [TEST] Replace fields in response with actual values
  Remove usages of CharSequence in Sets (elastic#35501)
  AwaitsFix the RecoveryIT suite - see elastic#35597
  ...
tlrx added a commit that referenced this pull request Nov 22, 2018
After #35332 was merged, we noticed some test failures like #35597
in which one or more replica shards failed to be promoted as primaries
because the primary-replica re-synchronization never succeeded.

After some digging, it appeared that the execution of the resync action was
blocked because of the presence of a global cluster block in the cluster state
(in this case, the "no master" block), causing the resync action to fail when
executed on the primary.

Until #35332, such failures never happened because the
TransportResyncReplicationAction was skipping the reroute phase, the only
place where blocks were checked. Now with #35332, blocks are checked
during reroute and also during the execution of the transport replication
action on the primary. After some internal discussion, we decided that the
TransportResyncReplicationAction should never be blocked. This action is
part of the replica-to-primary promotion and makes sure that replicas are in
sync, so it should not be blocked when the cluster state has no master or
when the index is read-only.

This commit changes the TransportResyncReplicationAction to make it obvious
that it does not honor blocks. It also adds a simple test that fails if the resync
action is blocked during the primary action execution.

Closes #35597
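The design described in this commit message can be sketched as follows: an action opts out of block checking by declaring no block level, so a resync-like action still runs under a "no master" or read-only block. The names below (`BlockLevel`, `globalBlockLevel`, `blockedUnderNoMaster`) are simplified stand-ins, not the actual Elasticsearch API:

```java
import java.util.EnumSet;

// Hypothetical sketch: an action that returns no block level is never blocked.
public class ResyncBlockSketch {

    public enum BlockLevel { READ, WRITE, METADATA_WRITE }

    public static abstract class ReplicationAction {
        // null means: this action does not honor cluster blocks.
        public abstract BlockLevel globalBlockLevel();

        public final boolean isBlocked(EnumSet<BlockLevel> activeBlocks) {
            BlockLevel level = globalBlockLevel();
            return level != null && activeBlocks.contains(level);
        }
    }

    public static class WriteAction extends ReplicationAction {
        public BlockLevel globalBlockLevel() { return BlockLevel.WRITE; }
    }

    public static class ResyncAction extends ReplicationAction {
        public BlockLevel globalBlockLevel() { return null; } // never blocked
    }

    // Convenience check against a "no master"-like block set.
    public static boolean blockedUnderNoMaster(ReplicationAction action) {
        return action.isBlocked(EnumSet.of(BlockLevel.WRITE, BlockLevel.METADATA_WRITE));
    }

    public static void main(String[] args) {
        System.out.println(blockedUnderNoMaster(new WriteAction()));  // true
        System.out.println(blockedUnderNoMaster(new ResyncAction())); // false
    }
}
```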
tlrx added a commit that referenced this pull request Nov 22, 2018
tlrx added a commit that referenced this pull request Nov 22, 2018
… TransportReplicationAction (#35332)""

This reverts commit c70b8ac
tlrx added a commit that referenced this pull request Nov 22, 2018
… TransportReplicationAction (#35332)""

This reverts commit d3d7c01
original-brownbear pushed a commit that referenced this pull request Nov 23, 2018
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >enhancement v6.6.0 v7.0.0-beta1