Resiliency: Recovering replicas might get stuck in initializing state #6808

Closed
s1monw opened this Issue Jul 10, 2014 · 8 comments

Contributor

s1monw commented Jul 10, 2014

If a primary fails after a replica has started recovery but before the recovery process has been initialized, the replica will keep retrying until the primary is allocated on that node again. That never happens, so the replica gets stuck in the INITIALIZING state and is never cleaned up.

s1monw added the bug label Jul 10, 2014

s1monw self-assigned this Jul 10, 2014

s1monw closed this in 72e6150 Jul 10, 2014

s1monw added a commit that referenced this issue Jul 10, 2014

[STORE]: Make use of Lucene build-in checksums
Since Lucene version 4.8 each file has a checksum written as its
footer. We used to calculate the checksums for all files transparently
on the filesystem layer (Directory / Store), which is no longer
necessary. This commit makes use of the new checksums in a backwards
compatible way such that files written with the old checksum mechanism
are still compared against the corresponding Adler32 checksum while
newer files are compared against the Lucene built-in CRC32 checksum.

Since now every written file is checksummed by default this commit
also verifies the checksum for files during recovery and restore if
applicable.
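
The backwards-compatible comparison described above can be illustrated with a minimal Python sketch. It is not the code from the commit and does not read Lucene's footer format: the `legacy` flag and all function names are hypothetical, and the checksums come from Python's standard `zlib` module.

```python
import zlib


def adler32_of(data: bytes) -> int:
    """Adler32 of the data, as an unsigned 32-bit value."""
    return zlib.adler32(data) & 0xFFFFFFFF


def crc32_of(data: bytes) -> int:
    """CRC32 of the data, as an unsigned 32-bit value."""
    return zlib.crc32(data) & 0xFFFFFFFF


def verify(data: bytes, expected: int, legacy: bool) -> bool:
    """Compare file contents against a stored checksum.

    Hypothetical rule for this sketch: files written with the old
    mechanism (legacy=True) are checked with Adler32, newer files
    with CRC32.
    """
    actual = adler32_of(data) if legacy else crc32_of(data)
    return actual == expected
```

During recovery or restore, a copy for which `verify` returns `False` would be treated as corrupted rather than reused.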

Closes #5924

This commit also has a fix for #6808 since the added tests in
`CorruptedFileTest.java` exposed the issue.

Closes #6808

kimchy added the resiliency label Jul 10, 2014

kimchy added a commit to kimchy/elasticsearch that referenced this issue Jul 10, 2014

Improve handling of failed primary replica handling
Following elastic#6808, we improved the handling of a failing primary so that replicas that are still initializing are properly failed as well. On review, that change has two problems: first, if the same shard routing is failed again, nothing protects against applying the failure a second time (a protection we do have in the failed shard cases); second, we were already trying to handle this case (incorrectly) in the elect primary method.
This change fixes the handling so that it works correctly in the elect primary method, and adds unit tests to verify the behavior.
closes elastic#6816
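
The behavior this commit message describes can be sketched in Python under simplifying assumptions. This is not Elasticsearch's allocation code: `ShardCopy`, `fail_primary`, and the string states are hypothetical names, and the model covers the copies of a single shard only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ShardCopy:
    node: Optional[str]   # node holding this copy, None when unassigned
    primary: bool
    state: str            # "STARTED", "INITIALIZING", or "UNASSIGNED"


def fail_primary(copies: List[ShardCopy], failed_node: str) -> bool:
    """Apply the failure of the primary located on failed_node.

    Returns False without changing anything if that routing has already
    been failed, so applying the same failure twice is harmless.
    """
    failed = next(
        (c for c in copies
         if c.primary and c.node == failed_node and c.state != "UNASSIGNED"),
        None,
    )
    if failed is None:
        return False  # same failure applied again: ignore it

    failed.state = "UNASSIGNED"
    failed.node = None

    # Replicas that were still initializing were recovering from the failed
    # primary, so they are failed too instead of being left to retry forever.
    for c in copies:
        if not c.primary and c.state == "INITIALIZING":
            c.state = "UNASSIGNED"
            c.node = None

    # Elect a started replica as the new primary, if one exists.
    candidate = next(
        (c for c in copies if not c.primary and c.state == "STARTED"), None
    )
    if candidate is not None:
        candidate.primary = True
        failed.primary = False
    return True
```

Keying the check on the node of the failed routing is what makes a repeated failure a no-op here; without it, a second call would wrongly fail the newly elected primary.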

kimchy added a commit that referenced this issue Jul 10, 2014

Improve handling of failed primary replica handling
Following #6808, we improved the handling of a failing primary so that replicas that are still initializing are properly failed as well. On review, that change has two problems: first, if the same shard routing is failed again, nothing protects against applying the failure a second time (a protection we do have in the failed shard cases); second, we were already trying to handle this case (incorrectly) in the elect primary method.
This change fixes the handling so that it works correctly in the elect primary method, and adds unit tests to verify the behavior.
closes #6816

kimchy added a commit to kimchy/elasticsearch that referenced this issue Jul 11, 2014

Improve handling of failed primary replica handling
Following elastic#6808, we improved the handling of a failing primary so that replicas that are still initializing are properly failed as well. On review, that change has two problems: first, if the same shard routing is failed again, nothing protects against applying the failure a second time (a protection we do have in the failed shard cases); second, we were already trying to handle this case (incorrectly) in the elect primary method.
This change fixes the handling so that it works correctly in the elect primary method, and adds unit tests to verify the behavior.
The change also exposes a problem in how we handle replica shards that remain initializing while the primary fails and another replica shard is elected as primary: we need to cancel the ongoing recovery so that it restarts from the newly elected primary.
closes elastic#6825

kimchy added a commit that referenced this issue Jul 11, 2014

Improve handling of failed primary replica handling
Following #6808, we improved the handling of a failing primary so that replicas that are still initializing are properly failed as well. On review, that change has two problems: first, if the same shard routing is failed again, nothing protects against applying the failure a second time (a protection we do have in the failed shard cases); second, we were already trying to handle this case (incorrectly) in the elect primary method.
This change fixes the handling so that it works correctly in the elect primary method, and adds unit tests to verify the behavior.
The change also exposes a problem in how we handle replica shards that remain initializing while the primary fails and another replica shard is elected as primary: we need to cancel the ongoing recovery so that it restarts from the newly elected primary.
closes #6825

clintongormley changed the title from "[CLUSTER] Recovering replicas might get stuck in initializing state" to "Resiliency: Recovering replicas might get stuck in initializing state" Jul 16, 2014

OlegYch commented Aug 22, 2014

was this fixed?
i think i just experienced this after a rolling upgrade from 1.3.1 to 1.3.2
i disabled cluster.routing.allocation before upgrading, then installed the new version on each node and set cluster.routing.allocation=all
then i waited several hours for the cluster to become green (with dog-slow response times in the meantime)
then i restarted each node one by one again
then waited a bit more and noticed there were 4 primary shards stuck in initializing on one of them and killed it - and the cluster went back up green in no time
there were errors like this in the failed node log:

[2014-08-22 22:01:45,221][DEBUG][action.search.type       ] [thisnode] [myidx][1], node[oDw5wWJ-S6etTr0cGLNbGw], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@57228563] lastShard [true]
org.elasticsearch.transport.SendRequestTransportException: [anothernode][inet[/10.35.62.130:9300]][search/phase/query]
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [anothernode][inet[/10.35.62.130:9300]] Node not connected
        at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:874)
        at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:556)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:206)
        ... 40 more

OlegYch commented Aug 23, 2014

welp there are some shards stuck initializing on that node again (from newly created indexes)
the only exceptions in log are

[2014-08-23 00:02:12,622][ERROR][marvel.agent.exporter    ] [thisnode] remote target didn't respond with 200 OK response code [404 Not Found]. content: [:)
^E�errorrIndexMissingException[[.marvel-2014.08.23] missing]�status$^L��]

note the garbled strings
and on other nodes only stuff like

[2014-08-23 00:38:31,588][DEBUG][action.bulk              ] [othernode] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2014-08-23 00:38:31,589][ERROR][marvel.agent.exporter    ] [othernode] create failure (index:[.marvel-2014.08.23] type: [node_stats]): UnavailableShardsException[[.marvel-2014.08.23][0] [2] shardIt, [0] active : Timeout waiting for [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@4f0b6ef7]

Member

clintongormley commented Aug 24, 2014

Contributor

s1monw commented Aug 25, 2014

hey @OlegYch I personally can't really see any evidence that your issue is related to this one. Did you actually see a shard initialising that was supposed to recover from a shard that is not actually allocated? You also said:

then waited a bit more and noticed there were 4 primary shards stuck in initializing on one of them and killed it - and the cluster went back up green in no time

and in this issue the shards that got stuck were replicas. Can you provide more info than what you already added?

OlegYch commented Aug 25, 2014

oh sorry, didn't understand that this issue was specifically about replica shards as opposed to primary shards, perhaps this is better described in #6816 ?
i've removed that node from the cluster and stopped elastic, so i can upload files from it if that would help diagnosing

garyelephant commented Jan 7, 2015

is this problem resolved as of 1.4.0? My 1.4.0 cluster still has this problem.

curl es_host:9200/_cat/shards 2>&1 | grep '[UI]N'
test-2015.01.07      4 p INITIALIZING                   127.0.0.1 10.71.16.121 
test-2015.01.07      4 r UNASSIGNED                                            
test-2015.01.07      4 r UNASSIGNED 

ioc32 commented Jan 9, 2015

@garyelephant I just ran into this same issue after resetting the number of replicas using the API. On 1.4.0, too.

FWIW here's the status of the shards comprising one of the stuck indices:

index-2015.01.06 0 r INITIALIZING 10.0.0.162 es2.io.example.com
index-2015.01.06 0 r UNASSIGNED
index-2015.01.06 1 r UNASSIGNED
index-2015.01.06 1 r UNASSIGNED
index-2015.01.06 2 r UNASSIGNED
index-2015.01.06 2 r UNASSIGNED
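
For anyone collecting this kind of information, here is a minimal Python sketch that picks the non-started shards out of `_cat/shards` text output like the listings in this thread. The function name `stuck_shards` is hypothetical, and the sketch assumes the first four whitespace-separated columns are index, shard number, primary/replica flag, and state, as in the output pasted above.

```python
from typing import List, Tuple


def stuck_shards(cat_output: str) -> List[Tuple[str, int, str, str]]:
    """Return (index, shard, p/r flag, state) for shards that are
    INITIALIZING or UNASSIGNED in _cat/shards text output."""
    results = []
    for line in cat_output.splitlines():
        fields = line.split()
        # Skip blank lines, header rows, and anything not shaped like a shard row.
        if len(fields) < 4 or not fields[1].isdigit():
            continue
        index, shard, kind, state = fields[:4]
        if state in ("INITIALIZING", "UNASSIGNED"):
            results.append((index, int(shard), kind, state))
    return results
```

Grouping the result by index and shard number shows at a glance whether the stuck copies are primaries or replicas, which is the distinction that matters for this issue.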

Member

clintongormley commented Jan 13, 2015

@garyelephant @ioc32 this issue is closed. If you're still seeing problems in 1.4, please open a new issue with more information about the problem.
