Red Cluster State: failed to obtain in-memory shard lock #23199

Closed
speedplane opened this Issue Feb 16, 2017 · 28 comments


speedplane commented Feb 16, 2017

Elasticsearch version: 5.2.0

Plugins installed: [repository-gcs, repository-s3, x-pack, io.fabric8:elasticsearch-cloud-kubernetes]

JVM version: 1.8.0_121

OS version: ubuntu xenial, running in a container managed by kubernetes

Description of the problem including expected versus actual behavior: A shard left the cluster; it did not have a replica set up, so this resulted in data loss :(

Steps to reproduce:

  1. Had a 5-node cluster that had been mostly indexing for a full week (about 1B docs) across 5 different indexes.
  2. When it was almost done, I ramped up to 10 nodes.
  3. Things were working just fine for a while, then one shard on one of the nodes failed, and the cluster went into red state.

I looked through the logs and it appears there is a lock error. It may have resulted from a sporadic network failure, but I'm not sure. The error logs refer to a few indices, but the only one that went into red state and did not come back is da-prod8-other, probably because it did not have a replica.

Provide logs (if relevant):

[2017-02-15T00:17:46,359][WARN ][o.e.c.a.s.ShardStateAction] [node-2-data-pod] [da-prod8-ttab][0] unexpected failure while sending request [internal:cluster/shard/failure] to [{es-master-714112077-ae5jq}{KfcqAA57R02arAOj1kshuw}{8YGdiFciTWeiidJXI4uh3A}{10.0.3.58}{10.0.3.58:9300}] for shard entry [shard id [[da-prod8-ttab][0]], allocation id [-rtlh5w4QAqqVf_nLd8cVw], primary term [16], message [failed to perform indices:data/write/bulk[s] on replica [da-prod8-ttab][0], node[uviqqBXkR9a63SRtoW28Wg], [R], s[STARTED], a[id=-rtlh5w4QAqqVf_nLd8cVw]], failure [RemoteTransportException[[node-1-data-pod][10.0.25.4:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[active primary shard cannot be a replication target before  relocation hand off [da-prod8-ttab][0], node[uviqqBXkR9a63SRtoW28Wg], [P], s[STARTED], a[id=-rtlh5w4QAqqVf_nLd8cVw], state is [STARTED]]; ]]
org.elasticsearch.transport.RemoteTransportException: [es-master-714112077-ae5jq][10.0.3.58:9300][internal:cluster/shard/failure]
Caused by: org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [16] did not match current primary term [17]
        at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:291) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.executeTasks(ClusterService.java:674) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.calculateTaskOutputs(ClusterService.java:653) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:612) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1112) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) ~[elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
[2017-02-15T00:17:46,372][WARN ][o.e.d.z.ZenDiscovery     ] [node-2-data-pod] master left (reason = failed to ping, tried [12] times, each with  maximum [2s] timeout), current nodes: nodes:
   {es-master-714112077-ae5jq}{KfcqAA57R02arAOj1kshuw}{8YGdiFciTWeiidJXI4uh3A}{10.0.3.58}{10.0.3.58:9300}, master
   {es-master-714112077-19hsw}{B9ss9idVQN-5EITg9jhhtw}{Rzabp7WaS_SilvEBWHuI9A}{10.0.4.53}{10.0.4.53:9300}
   {node-7-data-pod}{Zs7Q_tpgTZmpnwBNFZYi6w}{pGi8rVIqTE6j5OIceuFxdg}{10.0.36.3}{10.0.36.3:9300}
   {node-4-data-pod}{AuYiFtGDTvqJrBeI2wU_sA}{2wPJrJUiR6GSCPd6ZnObfA}{10.0.35.3}{10.0.35.3:9300}
   {node-2-data-pod}{HyrtTYWWRVODlbTCKTxdzw}{Nkv_gLwwQnyuCe5tTXD8fg}{10.0.34.3}{10.0.34.3:9300}, local
   {node-6-data-pod}{S_GBUaOfRHS4XW65x9OIhw}{Egd5QuxvTMOmNOehbZTtQQ}{10.0.33.3}{10.0.33.3:9300}
   {node-5-data-pod}{0aaunpa-Qkab66Ti5mFoTw}{BC27Z_hiRcGgaDQcmHgEaA}{10.0.37.3}{10.0.37.3:9300}
   {node-9-data-pod}{tcTjGDFLRCWlQMND7vlL6A}{rgaljuXoRr2SpjWuXCTqaA}{10.0.26.4}{10.0.26.4:9300}
   {node-3-data-pod}{kzr2o00tSzyuY-ekWuiNng}{x3RMwZicQ46ljZS-muWy-g}{10.0.32.6}{10.0.32.6:9300}
   {node-1-data-pod}{uviqqBXkR9a63SRtoW28Wg}{K9PqwDXLSuO5XTPmQds0aw}{10.0.25.4}{10.0.25.4:9300}
   {node-8-data-pod}{ATEayeK_SZWydO1cFfsZfg}{QE63FwDJQY2HL0S1Nys2gg}{10.0.27.4}{10.0.27.4:9300}
   {es-master-714112077-kh7ur}{nKYzKbxWRv-kvQBZVZJuGA}{SfV1jqmWSiS1jbWqnl-TPQ}{10.0.1.50}{10.0.1.50:9300}
   {node-0-data-pod}{uUZCt9RrS2aY_gqkSmNV5A}{SUYJlFEGQzyrF1tyocLj3w}{10.0.28.4}{10.0.28.4:9300}

[2017-02-15T00:17:46,439][INFO ][i.f.e.d.k.KubernetesUnicastHostsProvider] [node-2-data-pod] adding endpoint /10.0.1.50, transport_address 10.0.1.50:9300
[2017-02-15T00:17:46,438][WARN ][o.e.i.c.IndicesClusterStateService] [node-2-data-pod] [[da-prod8-ttab][0]] marking and sending shard failed due to [shard failure, reason [primary shard [[da-prod8-ttab][0], node[HyrtTYWWRVODlbTCKTxdzw], [P], s[STARTED], a[id=kgFYdXusT6ObzYvLd74PTQ]] was demoted while failing replica shard]]
org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [16] did not match current primary term [17]
        at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:291) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.executeTasks(ClusterService.java:674) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.calculateTaskOutputs(ClusterService.java:653) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:612) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1112) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) ~[elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
[2017-02-15T00:17:46,440][INFO ][i.f.e.d.k.KubernetesUnicastHostsProvider] [node-2-data-pod] adding endpoint /10.0.25.4, transport_address 10.0.25.4:9300
[2017-02-15T00:17:46,440][INFO ][i.f.e.d.k.KubernetesUnicastHostsProvider] [node-2-data-pod] adding endpoint /10.0.26.4, transport_address 10.0.26.4:9300
[2017-02-15T00:17:46,440][WARN ][o.e.c.a.s.ShardStateAction] [node-2-data-pod] [da-prod8-ttab][0] no master known for action [internal:cluster/shard/failure] for shard entry [shard id [[da-prod8-ttab][0]], allocation id [kgFYdXusT6ObzYvLd74PTQ], primary term [0], message [shard failure, reason [primary shard [[da-prod8-ttab][0], node[HyrtTYWWRVODlbTCKTxdzw], [P], s[STARTED], a[id=kgFYdXusT6ObzYvLd74PTQ]] was demoted while failing replica shard]], failure [NoLongerPrimaryShardException[primary term [16] did not match current primary term [17]]]]
[2017-02-15T00:17:46,440][INFO ][i.f.e.d.k.KubernetesUnicastHostsProvider] [node-2-data-pod] adding endpoint /10.0.27.4, transport_address 10.0.27.4:9300
[2017-02-15T00:17:46,440][INFO ][i.f.e.d.k.KubernetesUnicastHostsProvider] [node-2-data-pod] adding endpoint /10.0.28.4, transport_address 10.0.28.4:9300
[2017-02-15T00:17:46,440][INFO ][i.f.e.d.k.KubernetesUnicastHostsProvider] [node-2-data-pod] adding endpoint /10.0.3.58, transport_address 10.0.3.58:9300
[2017-02-15T00:17:46,440][INFO ][i.f.e.d.k.KubernetesUnicastHostsProvider] [node-2-data-pod] adding endpoint /10.0.32.6, transport_address 10.0.32.6:9300
[2017-02-15T00:17:46,440][INFO ][i.f.e.d.k.KubernetesUnicastHostsProvider] [node-2-data-pod] adding endpoint /10.0.33.3, transport_address 10.0.33.3:9300
[2017-02-15T00:17:46,441][INFO ][i.f.e.d.k.KubernetesUnicastHostsProvider] [node-2-data-pod] adding endpoint /10.0.35.3, transport_address 10.0.35.3:9300
[2017-02-15T00:17:46,441][INFO ][i.f.e.d.k.KubernetesUnicastHostsProvider] [node-2-data-pod] adding endpoint /10.0.36.3, transport_address 10.0.36.3:9300
[2017-02-15T00:17:46,441][INFO ][i.f.e.d.k.KubernetesUnicastHostsProvider] [node-2-data-pod] adding endpoint /10.0.37.3, transport_address 10.0.37.3:9300
[2017-02-15T00:17:46,441][INFO ][i.f.e.d.k.KubernetesUnicastHostsProvider] [node-2-data-pod] adding endpoint /10.0.4.53, transport_address 10.0.4.53:9300
[2017-02-15T00:17:46,493][WARN ][o.e.i.c.IndicesClusterStateService] [node-2-data-pod] [[da-prod8-pacer][2]] marking and sending shard failed due to [shard failure, reason [primary shard [[da-prod8-pacer][2], node[HyrtTYWWRVODlbTCKTxdzw], [P], s[STARTED], a[id=VSyGabxyQPKQf9q9ow1F_Q]] was demoted while failing replica shard]]
org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [7] did not match current primary term [8]
        at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:291) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.executeTasks(ClusterService.java:674) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.calculateTaskOutputs(ClusterService.java:653) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:612) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1112) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) ~[elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
[2017-02-15T00:17:46,494][WARN ][o.e.c.a.s.ShardStateAction] [node-2-data-pod] [da-prod8-pacer][2] no master known for action [internal:cluster/shard/failure] for shard entry [shard id [[da-prod8-pacer][2]], allocation id [VSyGabxyQPKQf9q9ow1F_Q], primary term [0], message [shard failure, reason [primary shard [[da-prod8-pacer][2], node[HyrtTYWWRVODlbTCKTxdzw], [P], s[STARTED], a[id=VSyGabxyQPKQf9q9ow1F_Q]] was demoted while failing replica shard]], failure [NoLongerPrimaryShardException[primary term [7] did not match current primary term [8]]]]
[2017-02-15T00:17:49,472][INFO ][o.e.c.s.ClusterService   ] [node-2-data-pod] detected_master {es-master-714112077-ae5jq}{KfcqAA57R02arAOj1kshuw}{8YGdiFciTWeiidJXI4uh3A}{10.0.3.58}{10.0.3.58:9300}, reason: zen-disco-receive(from master [master {es-master-714112077-ae5jq}{KfcqAA57R02arAOj1kshuw}{8YGdiFciTWeiidJXI4uh3A}{10.0.3.58}{10.0.3.58:9300} committed version [1538]])
[2017-02-15T00:17:54,466][INFO ][o.e.i.s.TransportNodesListShardStoreMetaData] [node-2-data-pod] [da-prod8-ttab][0]: failed to obtain shard lock
org.elasticsearch.env.ShardLockObtainFailedException: [da-prod8-ttab][0]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.store.Store.readMetadataSnapshot(Store.java:383) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:153) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:112) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:64) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:145) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:270) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:266) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
[2017-02-15T00:17:54,467][INFO ][o.e.i.s.TransportNodesListShardStoreMetaData] [node-2-data-pod] [da-prod8-pacer][2]: failed to obtain shard lock
org.elasticsearch.env.ShardLockObtainFailedException: [da-prod8-pacer][2]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.store.Store.readMetadataSnapshot(Store.java:383) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:153) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:112) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:64) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:145) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:270) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:266) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
[2017-02-15T00:17:55,007][WARN ][o.e.i.c.IndicesClusterStateService] [node-2-data-pod] [[da-prod8-other][3]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:367) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:476) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:146) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:542) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:519) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:204) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:856) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:810) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:628) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1112) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [da-prod8-other][3]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:297) ~[elasticsearch-5.2.0.jar:5.2.0]
        ... 15 more
[2017-02-15T00:18:00,044][WARN ][o.e.i.c.IndicesClusterStateService] [node-2-data-pod] [[da-prod8-scotus][0]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:367) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:476) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:146) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:542) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:519) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:204) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:856) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:810) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:628) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1112) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [da-prod8-scotus][0]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:297) ~[elasticsearch-5.2.0.jar:5.2.0]
        ... 15 more
[2017-02-15T00:18:05,075][INFO ][o.e.i.s.TransportNodesListShardStoreMetaData] [node-2-data-pod] [da-prod8-scotus][0]: failed to obtain shard lock
org.elasticsearch.env.ShardLockObtainFailedException: [da-prod8-scotus][0]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.store.Store.readMetadataSnapshot(Store.java:383) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:153) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:112) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:64) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:145) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:270) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:266) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
[2017-02-15T00:18:10,120][WARN ][o.e.i.c.IndicesClusterStateService] [node-2-data-pod] [[da-prod8-other][3]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:367) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:476) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:146) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:542) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:519) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:204) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:856) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:810) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:628) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1112) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [da-prod8-other][3]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:297) ~[elasticsearch-5.2.0.jar:5.2.0]
        ... 15 more
[2017-02-15T00:18:20,175][WARN ][o.e.i.c.IndicesClusterStateService] [node-2-data-pod] [[da-prod8-other][3]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:367) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:476) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:146) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:542) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:519) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:204) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:856) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:810) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:628) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1112) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [da-prod8-other][3]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:297) ~[elasticsearch-5.2.0.jar:5.2.0]
        ... 15 more
[2017-02-15T00:18:30,250][WARN ][o.e.i.c.IndicesClusterStateService] [node-2-data-pod] [[da-prod8-other][3]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:367) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:476) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:146) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:542) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:519) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:204) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:856) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:810) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:628) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1112) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [da-prod8-other][3]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:297) ~[elasticsearch-5.2.0.jar:5.2.0]
        ... 15 more
[2017-02-15T00:18:38,070][WARN ][o.e.i.c.IndicesClusterStateService] [node-2-data-pod] [[da-prod8-pacer][2]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:367) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:476) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:146) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:542) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:519) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:204) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:856) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:810) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:628) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1112) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [da-prod8-pacer][2]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:297) ~[elasticsearch-5.2.0.jar:5.2.0]
        ... 15 more
[2017-02-15T00:18:43,071][WARN ][o.e.i.c.IndicesClusterStateService] [node-2-data-pod] [[da-prod8-ttab][0]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:367) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:476) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:146) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:542) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:519) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:204) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:856) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:810) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:628) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1112) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [da-prod8-ttab][0]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:297) ~[elasticsearch-5.2.0.jar:5.2.0]
        ... 15 more
[2017-02-15T00:18:48,107][INFO ][o.e.i.s.TransportNodesListShardStoreMetaData] [node-2-data-pod] [da-prod8-ttab][0]: failed to obtain shard lock
org.elasticsearch.env.ShardLockObtainFailedException: [da-prod8-ttab][0]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.store.Store.readMetadataSnapshot(Store.java:383) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:153) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:112) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:64) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:145) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:270) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:266) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
[2017-02-15T00:18:48,107][INFO ][o.e.i.s.TransportNodesListShardStoreMetaData] [node-2-data-pod] [da-prod8-pacer][2]: failed to obtain shard lock
org.elasticsearch.env.ShardLockObtainFailedException: [da-prod8-pacer][2]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.store.Store.readMetadataSnapshot(Store.java:383) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:153) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:112) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:64) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:145) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:270) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:266) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
[2017-02-15T00:18:48,115][WARN ][o.e.i.c.IndicesClusterStateService] [node-2-data-pod] [[da-prod8-other][3]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:367) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:476) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:146) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:542) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:519) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:204) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:856) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:810) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:628) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1112) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.2.0.jar:5.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [da-prod8-other][3]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.0.jar:5.2.0]
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:297) ~[elasticsearch-5.2.0.jar:5.2.0]
        ... 15 more

speedplane commented Feb 16, 2017

I was able to get my cluster working again by manually rerouting that shard:

curl -XPOST 'localhost:9200/_cluster/reroute?pretty' -d '{
    "commands" : [ {
        "allocate_stale_primary" :
            {
              "index" : "da-prod8-other", "shard" : 3,
              "node" : "node-2-data-pod",
              "accept_data_loss" : true
            }
        }
    ]
}'

ywelsch commented Feb 16, 2017

It looks like there was a primary with no replicas allocated to the node when it got disconnected from the master. When the node rejoined the cluster, the locally allocated shard copy had not yet freed its previously used resources by the time the master had already made 5 unsuccessful attempts to allocate the shard to the node again.

After 5 unsuccessful allocation attempts, the master gives up and needs to be manually triggered to make another allocation attempt (see https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html#_retry_failed_shards). By the time you ran the command above, the shard had finished freeing its resources. It would have been sufficient, though, to just run

curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'

which is a much safer command. Data loss did not occur in this specific case where you used the allocate_stale_primary command, but using it in the wrong situation can easily cause data loss.

Note that it's always good to first run the allocation explain API (_cluster/allocation/explain) when the cluster is red - it would also have provided the retry_failed suggestion.
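
For reference, using the same localhost:9200 endpoint as the commands above, a minimal call is:

curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'

With no request body it should explain the first unassigned shard it finds, including the deciders that are blocking allocation.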

Note: I made a short edit to the explanation in the first paragraph (I initially misread the log output above thinking that the primary was relocating).


ywelsch commented Feb 16, 2017

Also, do you by chance know whether there were any long-running search/scroll requests or snapshots in progress at the time of the failure? This could explain why the shard was not able to free its existing resources within a minute.


speedplane commented Feb 16, 2017

@ywelsch Thank you for the analysis and the workaround if it occurs again. Any suggestion on how to prevent this from occurring in the first place (e.g., increasing timeouts)?

I don't think there was a long-running search at the time. The index that went down is not small (~130GB, 6 shards, 429M docs), but the queries to it are generally straightforward. I have a few other indexes on the same cluster that take much more complex queries (e.g., multiple nested aggregations). Most are relatively fast (5-20s), but one in particular can take over a minute to run... that said, I don't think it was running at the time.


speedplane commented Feb 16, 2017

I'm going to try the following, which will hopefully reduce the likelihood of this error:

curl -XPUT 'localhost:9200/da-prod8-*/_settings?pretty' -d '{
  "index.allocation.max_retries" : 10
}'
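
If the setting is applied, it should show up when reading the settings back, e.g.:

curl -XGET 'localhost:9200/da-prod8-*/_settings?pretty'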

speedplane commented Feb 16, 2017

Also, a small suggestion based on your comments: perhaps the documentation for retrying failed shards should be moved above the manual override section. Or, at the least, there should be a warning suggesting that the user try the retry_failed option before resorting to a manual override.

Had I known about the retry_failed option, it would have been far less panic-inducing. I was reading that documentation page, but for some reason I didn't see it and focused on the manual shard allocation section instead.


speedplane commented Mar 7, 2017

I just hit this error again; below is the output of _cluster/allocation/explain?pretty. It looks like I'm getting a ShardLockObtainFailedException.

{
  "index" : "da-prod8-tmark",
  "shard" : 5,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2017-03-04T04:59:38.099Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[da-prod8-tmark][5]: obtaining shard lock timed out after 5000ms]; ",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "uviqqBXkR9a63SRtoW28Wg",
      "node_name" : "node-1-data-pod",
      "transport_address" : "10.0.25.4:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "OjjjGv5ZRIKPA1r3TRn3Cg",
        "store_exception" : {
          "type" : "shard_lock_obtain_failed_exception",
          "reason" : "[da-prod8-tmark][5]: obtaining shard lock timed out after 5000ms",
          "index_uuid" : "vS61_OAERrq0qdD7LqEuLA",
          "shard" : "5",
          "index" : "da-prod8-tmark"
        }
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2017-03-04T04:59:38.099Z], failed_attempts[5], delayed=false, details[failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[da-prod8-tmark][5]: obtaining shard lock timed out after 5000ms]; ], allocation_status[deciders_no]]]"
        }
      ]
    }
  ]
}

abeyad commented Mar 7, 2017

@speedplane it looks like the same issue as before, where resources on the shard were not cleared in time for the allocation to take place on the node after it rejoined the cluster. It can be remedied with the same command which @ywelsch gave earlier:

curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'

But the larger issue to address is why you repeatedly run into the ShardLockObtainFailedException. It likely signals some longer-running job on the shard that failed to abort when the node was temporarily removed from the cluster. Are you experiencing long GC cycles on the node? Or is it a network issue?

It would be helpful to know what was happening on that node prior to this issue manifesting. Are there logs from that node that you can share with us? If so, you can email them to my_first_name @ elastic dot co
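
One quick way to check the GC side is the nodes stats API, for example:

curl -XGET 'localhost:9200/_nodes/stats/jvm?pretty'

and looking at the gc collector counts and times for the node in question.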


speedplane commented Mar 9, 2017

It just happened again, this time within a few days of the last failure. Tests caught it in production during business hours, so I had to just fix it. Right now (due to completely separate issues) I can't SSH into the servers (ridiculous, I know), so I can't grab the logs. I suspect this will happen again, and I'll post whatever logs I can get, but if there are any workarounds you can think of, I'm all ears.


tmegow commented Mar 14, 2017

@abeyad I've also been struggling against ShardLockObtainFailedException occurring during times of heavy bulk indexing activity. The symptoms are: a data node (2.3) becoming unresponsive, GC times spike, and the affected data node is removed from the cluster. Immediately after rejoining the cluster (this is a kubernetes environment) the node receives ShardLockObtainFailedException repeatedly, and ES performance suffers. So far, I've been forced to reduce/re-raise the number of replicas and let the shards sync from primary again.
Do you have any insight into how I can avoid this symptom?
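For reference, a minimal sketch of that replica reset via the index settings API; the index name my-index and the replica count are placeholders:

# drop replicas, wait for the cluster to settle, then restore them so the shards resync from the primary
curl -XPUT 'localhost:9200/my-index/_settings' -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 0}}'
curl -XPUT 'localhost:9200/my-index/_settings' -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 1}}'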

@speedplane

Contributor

speedplane commented Mar 15, 2017

Just happened again, and this time I was able to collect some logs. It looks similar to the situation described by @tmegow: there's an issue during bulk indexing, and then the node drops out.

[2017-03-15T07:24:54,984][INFO ][o.e.c.s.ClusterService   ] [node-4-data-pod] removed {{node-3-data-pod}{kzr2o00tSzyuY-ekWuiNng}{x3RMwZicQ46ljZS-muWy-g}{10.0.32.6}{10.0.32.6:9300},}, reason: zen-disco-receive(from master [master {es-master-714112077-ae5jq}{KfcqAA57R02arAOj1kshuw}{8YGdiFciTWeiidJXI4uh3A}{10.0.3.58}{10.0.3.58:9300} committed version [2098]])
[2017-03-15T07:24:55,032][WARN ][o.e.a.b.TransportShardBulkAction] [node-4-data-pod] [[da-prod8-other][1]] failed to perform indices:data/write/bulk[s] on replica [da-prod8-other][1], node[kzr2o00tSzyuY-ekWuiNng], [R], s[STARTED], a[id=wXUXFzduRYifGmjfzR3dEw]
org.elasticsearch.transport.NodeDisconnectedException: [node-3-data-pod][10.0.32.6:9300][indices:data/write/bulk[s][r]] disconnected
[2017-03-15T07:24:55,032][WARN ][o.e.a.b.TransportShardBulkAction] [node-4-data-pod] [[da-prod8-other][1]] failed to perform indices:data/write/bulk[s] on replica [da-prod8-other][1], node[kzr2o00tSzyuY-ekWuiNng], [R], s[STARTED], a[id=wXUXFzduRYifGmjfzR3dEw]
org.elasticsearch.transport.NodeDisconnectedException: [node-3-data-pod][10.0.32.6:9300][indices:data/write/bulk[s][r]] disconnected
[2017-03-15T07:25:07,330][INFO ][o.e.c.s.ClusterService   ] [node-4-data-pod] added {{node-3-data-pod}{kzr2o00tSzyuY-ekWuiNng}{x3RMwZicQ46ljZS-muWy-g}{10.0.32.6}{10.0.32.6:9300},}, reason: zen-disco-receive(from master [master {es-master-714112077-ae5jq}{KfcqAA57R02arAOj1kshuw}{8YGdiFciTWeiidJXI4uh3A}{10.0.3.58}{10.0.3.58:9300} committed version [2102]])
[2017-03-15T13:41:28,863][INFO ][o.e.c.s.ClusterService   ] [node-4-data-pod] removed {{node-5-data-pod}{0aaunpa-Qkab66Ti5mFoTw}{BC27Z_hiRcGgaDQcmHgEaA}{10.0.37.3}{10.0.37.3:9300},}, reason: zen-disco-receive(from master [master {es-master-714112077-ae5jq}{KfcqAA57R02arAOj1kshuw}{8YGdiFciTWeiidJXI4uh3A}{10.0.3.58}{10.0.3.58:9300} committed version [2115]])
[2017-03-15T13:41:55,099][INFO ][o.e.c.s.ClusterService   ] [node-4-data-pod] added {{node-5-data-pod}{0aaunpa-Qkab66Ti5mFoTw}{BC27Z_hiRcGgaDQcmHgEaA}{10.0.37.3}{10.0.37.3:9300},}, reason: zen-disco-receive(from master [master {es-master-714112077-ae5jq}{KfcqAA57R02arAOj1kshuw}{8YGdiFciTWeiidJXI4uh3A}{10.0.3.58}{10.0.3.58:9300} committed version [2118]])
[2017-03-15T13:57:03,432][INFO ][o.e.c.s.ClusterService   ] [node-4-data-pod] removed {{node-0-data-pod}{uUZCt9RrS2aY_gqkSmNV5A}{SUYJlFEGQzyrF1tyocLj3w}{10.0.28.4}{10.0.28.4:9300},}, reason: zen-disco-receive(from master [master {es-master-714112077-ae5jq}{KfcqAA57R02arAOj1kshuw}{8YGdiFciTWeiidJXI4uh3A}{10.0.3.58}{10.0.3.58:9300} committed version [2140]])
[2017-03-15T13:57:03,465][WARN ][o.e.a.b.TransportShardBulkAction] [node-4-data-pod] [[da-prod8-ptab][1]] failed to perform indices:data/write/bulk[s] on replica [da-prod8-ptab][1], node[uUZCt9RrS2aY_gqkSmNV5A], [R], s[STARTED], a[id=3QiLMQsdS1GnzumWOm2SFw]
org.elasticsearch.transport.NodeDisconnectedException: [node-0-data-pod][10.0.28.4:9300][indices:data/write/bulk[s][r]] disconnected
[2017-03-15T13:57:11,946][INFO ][o.e.c.s.ClusterService   ] [node-4-data-pod] added {{node-0-data-pod}{uUZCt9RrS2aY_gqkSmNV5A}{SUYJlFEGQzyrF1tyocLj3w}{10.0.28.4}{10.0.28.4:9300},}, reason: zen-disco-receive(from master [master {es-master-714112077-ae5jq}{KfcqAA57R02arAOj1kshuw}{8YGdiFciTWeiidJXI4uh3A}{10.0.3.58}{10.0.3.58:9300} committed version [2144]])
[2017-03-15T13:58:39,531][INFO ][o.e.m.j.JvmGcMonitorService] [node-4-data-pod] [gc][2603293] overhead, spent [438ms] collecting in the last [1.3s]
[2017-03-15T13:58:41,773][INFO ][o.e.m.j.JvmGcMonitorService] [node-4-data-pod] [gc][2603295] overhead, spent [608ms] collecting in the last [1.2s]
[2017-03-15T13:58:47,774][INFO ][o.e.m.j.JvmGcMonitorService] [node-4-data-pod] [gc][2603301] overhead, spent [336ms] collecting in the last [1s]
@abeyad

Contributor

abeyad commented Mar 16, 2017

@speedplane @tmegow Thank you for the extra insights. Just to keep you posted: we are actively looking into this issue (it's a tricky one) and have plans to provide more insight into which types of threads are holding onto shard locks. I'll keep this issue updated as we find out more.

@tmegow

tmegow commented Mar 16, 2017

Thank you for the update, @abeyad !
A bit more info: I enabled the slowlog, and it seems that indexing the newly added items is taking longer than I anticipated. My current workaround is to have my syncing app query the node stats endpoint and hold off on sending more items until .nodes.*.indices.merges & .nodes.*.jvm.gc return to a baseline threshold.
@speedplane What do your merge time and garbage collection rate look like right before you lose the data node? My merge time was spiking to tens of hours immediately before losing the data node.
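A minimal sketch of that throttling check, assuming curl and jq are available; the poll interval is a placeholder, and a similar check could be applied to the .nodes.*.jvm.gc counters:

# block the bulk feeder until no merges are in flight on any data node
while true; do
  CURRENT_MERGES=$(curl -s 'localhost:9200/_nodes/stats/indices/merges' | jq '[.nodes[].indices.merges.current] | add')
  echo "merges in flight: ${CURRENT_MERGES}"
  [ "${CURRENT_MERGES}" -eq 0 ] && break
  sleep 30
done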

Would giving my data nodes more resources help here? Each of my data nodes currently has 2 vCPUs and 16GB of mem.

@abeyad

Contributor

abeyad commented Mar 16, 2017

Would giving my data nodes more resources help here? Each of my data nodes currently has 2 vCPUs and 16GB of mem.

Giving data nodes more resources can always help by reducing GC pressure and thereby potentially improving cluster stability. The ShardLockObtainFailedException manifests as a result of cluster instability, with nodes leaving and rejoining quickly. More resources on data nodes can, but are not guaranteed to, remedy such situations.
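For what it's worth, on a 16GB node the usual guidance is to give the JVM heap roughly half of RAM. A hedged example using the ES_JAVA_OPTS environment variable (the values are illustrative, not a recommendation for your specific workload):

ES_JAVA_OPTS="-Xms8g -Xmx8g" ./bin/elasticsearch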

@speedplane

Contributor

speedplane commented Mar 17, 2017

@abeyad @tmegow My nodes are not under much stress. It's possible some bursty activity happens which causes this, but I feel like my servers are already underutilized, and would prefer to not increase them further.

@sikelong123

sikelong123 commented Apr 6, 2017

I also encountered the same problem, but my log is like this:

[2017-04-03T10:04:48,503][WARN ][o.e.i.IndexService ] [mogu015052] [es_xp_item_mgj] [2] failed to close store on shard removal (reason: [initialization failed])
java.lang.NullPointerException
at org.elasticsearch.index.IndexService.closeShard(IndexService.java:409) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.index.IndexService.createShard(IndexService.java:361) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:449) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:137) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:534) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:511) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:200) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:708) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:894) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:444) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:237) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:200) [elasticsearch-5.0.0.jar:5.0.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_72]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_72]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_72]
[2017-04-03T10:04:48,504][WARN ][o.e.i.c.IndicesClusterStateService] [mogu015052] [[es_xp_item_mgj][2]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
at org.elasticsearch.index.IndexService.createShard(IndexService.java:355) ~[elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:449) ~[elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:137) ~[elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:534) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:511) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:200) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:708) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:894) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:444) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:237) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:200) [elasticsearch-5.0.0.jar:5.0.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_72]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_72]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_72]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [es_xp_item_mgj][2]: obtaining shard lock timed out after 5000ms
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:711) ~[elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:630) ~[elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.index.IndexService.createShard(IndexService.java:285) ~[elasticsearch-5.0.0.jar:5.0.0]
... 13 more

There is a null pointer exception; so far, I haven't found out why. This problem also started with a disconnect: after re-election, the node rejoined the cluster and then hit this error. @speedplane @abeyad

@sikelong123

sikelong123 commented Apr 6, 2017

Did you solve the problem? @speedplane

@speedplane

Contributor

speedplane commented Apr 12, 2017

@sikelong123 No I didn't, I just saw it pop up yesterday again. As a work-around, I added a cron job that calls /_cluster/reroute?retry_failed every few hours.
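For reference, a sketch of that cron entry; the schedule and host are placeholders:

# crontab: retry failed shard allocations every 4 hours
0 */4 * * * curl -s -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true' >/dev/null 2>&1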

@HenleyChiu

HenleyChiu commented Apr 14, 2017

@speedplane Are you running Kibana or X-Pack too by any chance?

@speedplane

Contributor

speedplane commented Apr 14, 2017

@HenleyChiu Yup. Kibana 5.2 with monitoring enabled. Not the full X-Pack. Here's the relevant section from my kibana.yaml if it helps:

server:
  basePath: /kibana

logging:
    dest: /data/log/kibana.log
    silent: false
    quiet: false
    verbose: false

xpack:
  monitoring:
    enabled: true
  security:
    enabled: false
  graph:
    enabled: false
  reporting:
    enabled: false
@HenleyChiu

HenleyChiu commented Apr 14, 2017

I saw your comment on issue 22551: #22551

They seem to have ignored the OP's second comment about the ShardLockException. They mentioned that disabling monitoring fixes the IllegalStateException, but I wonder whether disabling it fixes the ShardLockException as well?

@abeyad

Contributor

abeyad commented Apr 15, 2017

@HenleyChiu If you are encountering the ShardLockObtainFailedException frequently, it could be a cascading effect of that bug. The exception itself can have other causes too, so there is no guarantee; but given that #22551 causes frequent node disconnects, which can lead to ShardLockObtainFailedException if the cluster is unstable, it's probably a good idea to disable stats collection. Note that the fix (#22317) went into 5.2.0, and the OP's version on this issue is 5.2.0. If you are running 5.1.1 or earlier, then yes, you should disable stats collection.
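If you do go the route of disabling collection, a hedged sketch for 5.x x-pack monitoring (please confirm the setting name against the docs for your exact version):

# elasticsearch.yml on each node (and kibana.yml, if desired)
xpack.monitoring.enabled: false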

@clintongormley

Member

clintongormley commented May 26, 2017

/cc myself

@tomsommer

tomsommer commented Oct 12, 2017

I had a similar problem, with 4 shards jumping around with the error:

from primary shard with sync id but number of docs differ: 3232205 (XXX, primary) vs 3232204(YYYY)
It didn't increment the error counter on the index and it always tried to allocate to the same node that failed.

Setting cluster.routing.allocation.enable to none and then back to all fixed the problem; maybe it resets some internal logic?
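For reference, a sketch of that toggle via the cluster settings API (shown as a transient setting; a persistent setting would also work):

curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'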

@adichad

adichad commented Oct 19, 2017

Also seen multiple times on multiple clusters running ES 5.5.1

@mvar

mvar commented Dec 12, 2017

Experiencing similar issues on a cluster running ES 5.6.3. The status stays yellow with a single shard unassigned.

It happens on cluster upscaling, roughly every other time.

Few different errors today:

failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[XXX][13]: obtaining shard lock timed out after 5000ms];

failed recovery, failure RecoveryFailedException[[XXX][38]: Recovery failed from ... into ...]; nested: RemoteTransportException[ZZZ[internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [XXX][38] from primary shard with sync id but number of docs differ: 69887 (node1, primary) vs 69868(node2)];

@Klezer

Klezer commented Jan 28, 2018

Also experiencing the same behavior (ShardLockObtainFailedException when the cluster is under heavy load: heavy GC leads to blocked shard allocation). We are running ES 2.3.1. The only possible solution I've read here is to disable stats collection, right? Any other ideas for a solution? Note that we could also upgrade to version 2.4, but would that make a difference in solving this issue?

@bleskes

Member

bleskes commented Mar 20, 2018

I have opened #29160 containing the suggested doc change. I'm going to close this issue, as it has become a catch-all for people who run into the shard locking exception. The exception can have many causes; each of them, once sorted out, should be tracked in a dedicated issue.
