Rolling-upgrade RecoveryIT tests are broken #35597

Closed
DaveCTurner opened this issue Nov 15, 2018 · 6 comments
Assignees: original-brownbear, tlrx
Labels: :Distributed/Recovery (Anything around constructing a new shard, either from a local or a remote source), >test-failure (Triaged test failures from CI)

@DaveCTurner (Contributor)

The 6.x intake builds have been broken for most of today, with clusters failing to pass their wait conditions properly:

[ant:echo] ==> [Thu Nov 15 05:56:30 UTC 2018] checking health: http://[::1]:35001/_cluster/health?wait_for_nodes=>=1&wait_for_status=yellow
     [echo] ==> [Thu Nov 15 05:56:30 UTC 2018] checking health: http://[::1]:35001/_cluster/health?wait_for_nodes=>=1&wait_for_status=yellow
[ant:get] Getting: http://[::1]:35001/_cluster/health?wait_for_nodes=>=1&wait_for_status=yellow
      [get] Getting: http://[::1]:35001/_cluster/health?wait_for_nodes=>=1&wait_for_status=yellow
[ant:get] To: /var/lib/jenkins/workspace/elastic+elasticsearch+6.x+intake/qa/rolling-upgrade/build/cluster/v5.6.14#upgradedClusterTestCluster node0/cwd/wait.success
      [get] To: /var/lib/jenkins/workspace/elastic+elasticsearch+6.x+intake/qa/rolling-upgrade/build/cluster/v5.6.14#upgradedClusterTestCluster node0/cwd/wait.success

[ant:get] Error opening connection java.io.IOException: Server returned HTTP response code: 408 for URL: http://[::1]:35001/_cluster/health?wait_for_nodes=>=1&wait_for_status=yellow
      [get] Error opening connection java.io.IOException: Server returned HTTP response code: 408 for URL: http://[::1]:35001/_cluster/health?wait_for_nodes=>=1&wait_for_status=yellow
[ant:get] Error opening connection java.io.IOException: Server returned HTTP response code: 408 for URL: http://[::1]:35001/_cluster/health?wait_for_nodes=>=1&wait_for_status=yellow

This is from https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/197/console although there are lots of other similar failures.

There's nothing obvious to me in the node logs themselves - the 3-node cluster looks to have formed fine. The failure does reproduce locally, and @original-brownbear and I have been looking at it for a few hours without much success. I will mute this test suite.
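
For reference, the wait condition above is just a GET to the cluster health API with wait_for_nodes and wait_for_status parameters, and Elasticsearch answers 408 when the condition is not met before the timeout. A minimal sketch using the 6.x low-level Java REST client (the host, port and class name are illustrative, not taken from the CI run):

    import org.apache.http.HttpHost;
    import org.elasticsearch.client.Request;
    import org.elasticsearch.client.Response;
    import org.elasticsearch.client.ResponseException;
    import org.elasticsearch.client.RestClient;

    public class ClusterHealthWait {
        public static void main(String[] args) throws Exception {
            // Illustrative endpoint; the rolling-upgrade build binds an ephemeral port such as 35001.
            try (RestClient client = RestClient.builder(new HttpHost("localhost", 35001, "http")).build()) {
                Request request = new Request("GET", "/_cluster/health");
                request.addParameter("wait_for_nodes", ">=1");
                request.addParameter("wait_for_status", "yellow");
                try {
                    Response response = client.performRequest(request);
                    System.out.println("wait condition met: " + response.getStatusLine());
                } catch (ResponseException e) {
                    // A 408 here means the wait condition was not met before the timeout,
                    // which is what the failing builds above are reporting.
                    System.out.println("wait condition failed: " + e.getResponse().getStatusLine());
                }
            }
        }
    }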

@DaveCTurner added the >test-failure and :Distributed/Recovery labels Nov 15, 2018
@original-brownbear self-assigned this Nov 15, 2018
@DaveCTurner (Contributor, Author)

It is perhaps notable that all of the failures I can see have been in the two-thirds upgraded cluster or the fully-upgraded cluster. We've also seen cases where the cluster does form but then times out in a subsequent health check:

ERROR   70.6s | RecoveryIT.testRecoveryWithConcurrentIndexing <<< FAILURES!
   > Throwable #1: org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:44918], URI [/_cluster/health/recovery_with_concurrent_indexing?wait_for_no_relocating_shards=true&level=shards&timeout=70s&wait_for_status=green], status line [HTTP/1.1 408 Request Timeout]
   > {"cluster_name":"rolling-upgrade","status":"yellow","timed_out":true,"number_of_nodes":3,"number_of_data_nodes":3,"active_primary_shards":1,"active_shards":2,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":1,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":98.68421052631578,"indices":{"recovery_with_concurrent_indexing":{"status":"yellow","number_of_shards":1,"number_of_replicas":2,"active_primary_shards":1,"active_shards":2,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":1,"shards":{"0":{"status":"yellow","primary_active":true,"active_shards":2,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":1}}}}}
   > 	at __randomizedtesting.SeedInfo.seed([DA5FC8F3EE50997:8DF438DCBC41D75D]:0)
   > 	at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:926)
   > 	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:229)
   > 	at org.elasticsearch.test.rest.ESRestTestCase.ensureGreen(ESRestTestCase.java:711)
   > 	at org.elasticsearch.upgrades.RecoveryIT.testRecoveryWithConcurrentIndexing(RecoveryIT.java:141)
   > 	at java.lang.Thread.run(Thread.java:748)
   > Caused by: org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:44918], URI [/_cluster/health/recovery_with_concurrent_indexing?wait_for_no_relocating_shards=true&level=shards&timeout=70s&wait_for_status=green], status line [HTTP/1.1 408 Request Timeout]
   > {"cluster_name":"rolling-upgrade","status":"yellow","timed_out":true,"number_of_nodes":3,"number_of_data_nodes":3,"active_primary_shards":1,"active_shards":2,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":1,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":98.68421052631578,"indices":{"recovery_with_concurrent_indexing":{"status":"yellow","number_of_shards":1,"number_of_replicas":2,"active_primary_shards":1,"active_shards":2,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":1,"shards":{"0":{"status":"yellow","primary_active":true,"active_shards":2,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":1}}}}}
   > 	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:546)
   > 	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:531)
   > 	at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
   > 	at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
   > 	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
   > 	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
   > 	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
   > 	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
   > 	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
   > 	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
   > 	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
   > 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
   > 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
   > 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
   > 	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
   > 	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
   > 	... 1 more

(from https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/186/console)

Unfortunately the first few failures in the sequence of broken builds on 6.x seem unrelated to this one, with more changes having been pushed on top of the red CI, so it's hard to narrow down the point at which things started to break. Looking at the changes, I wonder if perhaps #34902 or #35332 could have this sort of effect?

Anyhow, muted in 188de54.
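
For context, "muting" the suite means annotating it with the Lucene test framework's @AwaitsFix annotation pointing back at this issue (see the "AwaitsFix the RecoveryIT suite" commit referenced below). This is a rough sketch only, not the exact contents of 188de54; the superclass is shown as ESRestTestCase purely for illustration:

    import org.apache.lucene.util.LuceneTestCase.AwaitsFix;
    import org.elasticsearch.test.rest.ESRestTestCase;

    // Disables the whole suite until the tracking issue is resolved.
    @AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/35597")
    public class RecoveryIT extends ESRestTestCase {
        // existing test methods left unchanged
    }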

@DaveCTurner (Contributor, Author) commented Nov 15, 2018

OK, I reverted 0c5e87f (#35332) and that seemed to clear it up. Letting @tlrx know.

Edit: I do not mean I reverted this in 6.x, just on my local machine.

@tlrx (Member) commented Nov 16, 2018

@DaveCTurner Sorry to hear that; this is concerning. I'm looking into it.

@tlrx (Member) commented Nov 16, 2018

Also muted the test in master in f9aff7b.

tlrx added a commit that referenced this issue Nov 16, 2018
tlrx added a commit that referenced this issue Nov 16, 2018
@tlrx (Member) commented Nov 16, 2018

Thanks @DaveCTurner, your preliminary investigation helped a lot. I reverted #35332 from master in d3d7c01 and 6.x in c70b8ac and unmuted the test.

jasontedor added a commit to jasontedor/elasticsearch that referenced this issue Nov 17, 2018
* master: (59 commits)
  SQL: Move internals from Joda to java.time (elastic#35649)
  Add HLRC docs for Get Lifecycle Policy (elastic#35612)
  Align RolloverStep's name with other step names (elastic#35655)
  Watcher: Use joda method to get local TZ (elastic#35608)
  Fix line length for org.elasticsearch.action.* files (elastic#35607)
  Remove use of AbstractComponent in server (elastic#35444)
  Deprecate types in count and msearch. (elastic#35421)
  Refactor an ambigious TermVectorsRequest constructor. (elastic#35614)
  [Scripting] Use Number as a return value for BucketAggregationScript (elastic#35653)
  Removes AbstractComponent from several classes (elastic#35566)
  [DOCS] Add beta warning to ILM pages. (elastic#35571)
  Deprecate types in validate query requests. (elastic#35575)
  Unmute BuildExamplePluginsIT
  Revert "AwaitsFix the RecoveryIT suite - see elastic#35597"
  Revert "[RCI] Check blocks while having index shard permit in TransportReplicationAction (elastic#35332)"
  Remove remaining line length violations for o.e.action.admin.cluster (elastic#35156)
  ML: Adjusing BWC version post backport to 6.6 (elastic#35605)
  [TEST] Replace fields in response with actual values
  Remove usages of CharSequence in Sets (elastic#35501)
  AwaitsFix the RecoveryIT suite - see elastic#35597
  ...
@tlrx self-assigned this Nov 19, 2018
@tlrx (Member) commented Nov 20, 2018

For a better explanation of the issue, see #35695 (comment)

tlrx added a commit that referenced this issue Nov 22, 2018
After #35332 was merged, we noticed some test failures like #35597
in which one or more replica shards failed to be promoted as primaries
because the primary-replica re-synchronization never succeeded.

After some digging it appeared that the execution of the resync action was
blocked because of the presence of a global cluster block in the cluster state
(in this case, the "no master" block), causing the resync action to fail when
executed on the primary.

Until #35332 such failures never happened because the
TransportResyncReplicationAction skips the reroute phase, the only
place where blocks were checked. With #35332, blocks are checked
during reroute and also during the execution of the transport replication
action on the primary. After some internal discussion, we decided that the
TransportResyncReplicationAction should never be blocked: this action is
part of the replica-to-primary promotion, it makes sure that replicas are in
sync, and it should not be blocked when the cluster state has no master or
when the index is read-only.

This commit changes the TransportResyncReplicationAction to make it obvious
that it does not honor blocks. It also adds a simple test that fails if the resync
action is blocked during the primary action execution.

Closes #35597
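
A minimal sketch of the shape of that change, assuming the TransportReplicationAction hooks globalBlockLevel() and indexBlockLevel(), where returning null is taken to mean "do not check this kind of block". This is illustrative only, not the exact diff:

    import org.elasticsearch.cluster.block.ClusterBlockLevel;

    // Sketch only: the real class extends a TransportReplicationAction subclass
    // and has many more members.
    public class TransportResyncReplicationAction {

        // Resync is part of replica-to-primary promotion, so it must run even when
        // the cluster state carries a global block such as "no master".
        protected ClusterBlockLevel globalBlockLevel() {
            return null; // skip global cluster block checks
        }

        // Likewise, an index-level block (e.g. read-only) must not prevent the resync.
        protected ClusterBlockLevel indexBlockLevel() {
            return null; // skip index-level block checks
        }
    }
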
tlrx added a commit that referenced this issue Nov 22, 2018