
Cluster Unable to Assign Shards, Marvel or otherwise. #16708

Closed
zukeru opened this issue Feb 17, 2016 · 4 comments
Labels: :Distributed/Allocation, feedback_needed

Comments

@zukeru commented Feb 17, 2016

Hello, I'm trying to start a new Elasticsearch cluster, but I can't get the shards to allocate correctly. I upgraded to the latest Marvel and Elasticsearch 2.2.0, and the cluster won't allocate the Marvel shards. I can't figure out why, and I can't even allocate them manually, because the command tells me the shard is disabled.

I then created a custom index with a few shards, and those shards remain unassigned as well.

curl -XPUT http://localhost:9200/test -d '
{
   "settings" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 1
   }
}

'
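
A quick way to confirm that those shards stay unassigned (a suggested check, not part of the original report) is the cluster health and cat shards APIs:

# "test" is the index created above
curl -XGET 'http://localhost:9200/_cluster/health/test?pretty'
curl -XGET 'http://localhost:9200/_cat/shards/test?v'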

In the logs I get the following error:

[2016-02-17 20:04:52,458][ERROR][marvel.agent             ] [i-11a6decb] background thread had an uncaught exception
ElasticsearchException[failed to flush exporter bulks]
    at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:104)
    at org.elasticsearch.marvel.agent.exporter.ExportBulk.close(ExportBulk.java:53)
    at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:201)
    at java.lang.Thread.run(Thread.java:745)
    Suppressed: ElasticsearchException[failed to flush [default_local] exporter bulk]; nested: ElasticsearchException[failure in bulk execution:
[0]: index [.marvel-es-2016.02.17], type [node_stats], id [AVLw1O4Ctq-FZ8CmFK_-], message [UnavailableShardsException[[.marvel-es-2016.02.17][0] primary shard is not active Timeout: [1m], request: [shard bulk {[.marvel-es-2016.02.17][0]}]]]];
        at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:106)
        ... 3 more
    Caused by: ElasticsearchException[failure in bulk execution:
[0]: index [.marvel-es-2016.02.17], type [node_stats], id [AVLw1O4Ctq-FZ8CmFK_-], message [UnavailableShardsException[[.marvel-es-2016.02.17][0] primary shard is not active Timeout: [1m], request: [shard bulk {[.marvel-es-2016.02.17][0]}]]]]
        at org.elasticsearch.marvel.agent.exporter.local.LocalBulk.flush(LocalBulk.java:114)
        at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:101)
        ... 3 more

Then when I try to run a reroute, I get:

Kenzans-MacBook-Pro-39:~ grantzukel$ curl -XPOST http://localhost:9200/_cluster/reroute?pretty -d '{ "commands" : [ { "allocate" : { "index" : ".marvel-es-data", "shard" : 0, "node" :"i-e098e03a" } } ] }' 
{
  "error" : {
    "root_cause" : [ {
      "type" : "remote_transport_exception",
      "reason" : "[i-169f4cce][10.194.35.20:9300][cluster:admin/reroute]"
    } ],
    "type" : "illegal_argument_exception",
    "reason" : "[allocate] trying to allocate a primary shard [.marvel-es-data][0], which is disabled"
  },
  "status" : 400
}
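
To narrow down why nothing is being allocated, a listing of the unassigned shards and of the nodes the master can see would help (a suggested diagnostic, not part of the original report):

# list every shard and keep only the unassigned ones
curl -XGET 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED
# confirm that all expected data nodes have actually joined the cluster
curl -XGET 'http://localhost:9200/_cat/nodes?v'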

Here is my elasticsearch config, where I enable rebalancing and rerouting and set allow_primary to true:

---

cluster.name: infra_elastic_cluster_3

index.number_of_shards: 3
index.store.throttle.type: none

action.auto_create_index: true

index.number_of_replicas: 1
index.requests.cache.enable: true
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms

index.refresh_interval: 1

cloud:
    aws:
      region: us-west-2
    node:
      auto_attributes: true

discovery:
    type: ec2
    ec2:
      groups: infra_elastic_cluster_3
      any_group: false
      ping_timeout: 60s
    zen:
      minimum_master_nodes: 2

node:
  data: true
  master: false
  name: i-29fd85f3

http:
  max_content_length: 1000mb
  cors.allow-origin: "/.*/"
  cors.enabled: true

bootstrap.mlockall: true

script.inline: on 
script.indexed: on 

tr.logging.maxlength: 500000

indices.memory.index_buffer_size: 30%
indices.store.throttle.max_bytes_per_sec: 1000mb
indices.store.throttle.type: Merge
indices.fielddata.cache.size:  40%

threadpool.bulk.type: fixed
threadpool.bulk.size: 100
threadpool.bulk.queue_size: 10000

network.host: _eth0_

query.bool.max_clause_count: 10240

cluster.routing.allocation.enable: all
cluster.routing.allocation.disable_new_allocation: false
cluster.routing.allocation.disable_allocation: false

cluster.routing.allocation.allow_primary: true
cluster.routing.allocation.allow_rebalance: always
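
Note that elasticsearch.yml is not the whole story: transient or persistent settings applied through the API take precedence. Whether any such overrides exist, and what settings each node actually loaded, can be checked like this (a suggested check, not part of the original report):

# settings applied via the cluster settings API (these override the config file)
curl -XGET 'http://localhost:9200/_cluster/settings?pretty'
# settings each node actually started with
curl -XGET 'http://localhost:9200/_nodes/settings?pretty'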

Trace log output:

[2016-02-17 21:57:03,686][TRACE][action.bulk              ] [i-29fd85f3] primary shard [[.marvel-es-2016.02.17][0]] is not yet active, scheduling a retry: action [indices:data/write/bulk[s]], request [shard bulk {[.marvel-es-2016.02.17][0]}], cluster state version [50]
[2016-02-17 21:57:03,686][TRACE][action.bulk              ] [i-29fd85f3] observer: sampled state rejected by predicate (version [50], status [APPLIED]). adding listener to ClusterService
[2016-02-17 21:57:03,686][TRACE][action.bulk              ] [i-29fd85f3] observer: postAdded - predicate rejected state (version [50], status [APPLIED])
[2016-02-17 21:57:43,104][DEBUG][org.apache.http.impl.conn.PoolingClientConnectionManager] Closing connections idle longer than 60 SECONDS
[2016-02-17 21:57:43,104][DEBUG][com.amazonaws.internal.SdkSSLSocket] shutting down output of ec2.us-west-2.amazonaws.com/205.251.235.5:443
[2016-02-17 21:57:43,105][DEBUG][com.amazonaws.internal.SdkSSLSocket] closing ec2.us-west-2.amazonaws.com/205.251.235.5:443
[2016-02-17 21:57:43,106][DEBUG][org.apache.http.impl.conn.DefaultClientConnection] Connection 0.0.0.0:38289<->205.251.235.5:443 closed
[2016-02-17 21:58:03,687][TRACE][action.bulk              ] [i-29fd85f3] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2016-02-17 21:58:03,687][TRACE][action.bulk              ] [i-29fd85f3] primary shard [[.marvel-es-2016.02.17][0]] is not yet active, scheduling a retry: action [indices:data/write/bulk[s]], request [shard bulk {[.marvel-es-2016.02.17][0]}], cluster state version [50]
[2016-02-17 21:58:03,687][TRACE][action.bulk              ] [i-29fd85f3] operation failed. action [indices:data/write/bulk[s]], request [shard bulk {[.marvel-es-2016.02.17][0]}]
UnavailableShardsException[[.marvel-es-2016.02.17][0] primary shard is not active Timeout: [1m], request: [shard bulk {[.marvel-es-2016.02.17][0]}]]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:555)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:431)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:520)
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:239)
    at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:794)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[2016-02-17 21:58:03,687][ERROR][marvel.agent             ] [i-29fd85f3] background thread had an uncaught exception
ElasticsearchException[failed to flush exporter bulks]
    at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:104)
    at org.elasticsearch.marvel.agent.exporter.ExportBulk.close(ExportBulk.java:53)
    at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:201)
    at java.lang.Thread.run(Thread.java:745)
    Suppressed: ElasticsearchException[failed to flush [default_local] exporter bulk]; nested: ElasticsearchException[failure in bulk execution:
[0]: index [.marvel-es-2016.02.17], type [node_stats], id [AVLxPI5GwpZSvKDdhqNh], message [UnavailableShardsException[[.marvel-es-2016.02.17][0] primary shard is not active Timeout: [1m], request: [shard bulk {[.marvel-es-2016.02.17][0]}]]]];
        at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:106)
        ... 3 more
    Caused by: ElasticsearchException[failure in bulk execution:
[0]: index [.marvel-es-2016.02.17], type [node_stats], id [AVLxPI5GwpZSvKDdhqNh], message [UnavailableShardsException[[.marvel-es-2016.02.17][0] primary shard is not active Timeout: [1m], request: [shard bulk {[.marvel-es-2016.02.17][0]}]]]]
        at org.elasticsearch.marvel.agent.exporter.local.LocalBulk.flush(LocalBulk.java:114)
        at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:101)
        ... 3 more
[2016-02-17 21:58:13,693][TRACE][action.bulk              ] [i-29fd85f3] primary shard [[.marvel-es-2016.02.17][0]] is not yet active, scheduling a retry: action [indices:data/write/bulk[s]], request [shard bulk {[.marvel-es-2016.02.17][0]}], cluster state version [50]
[2016-02-17 21:58:13,693][TRACE][action.bulk              ] [i-29fd85f3] observer: sampled state rejected by predicate (version [50], status [APPLIED]). adding listener to ClusterService
[2016-02-17 21:58:13,694][TRACE][action.bulk              ] [i-29fd85f3] observer: postAdded - predicate rejected state (version [50], status [APPLIED])


@zukeru changed the title from "Marvel ElasticsearchException[failed to flush exporter bulks]" to "Cluster Unable to Assign Shards, Marvel or otherwise." Feb 17, 2016
@ywelsch (Contributor) commented Feb 17, 2016

Can you try the reroute command again by setting allow_primary to true (see https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html)? This allows the allocate command to also allocate primary shards (Note that this loses all existing data for that shard):

curl -XPOST http://localhost:9200/_cluster/reroute?pretty -d '{ "commands" : [ { "allocate" : { "index" : ".marvel-es-data", "shard" : 0, "node" :"i-e098e03a", "allow_primary": "true" } } ] }' 
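
If the forced allocation goes through, the shard should subsequently show up as STARTED (a verification step added here for completeness, not part of the original comment):

curl -XGET 'http://localhost:9200/_cat/shards/.marvel-es-data?v'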

@clintongormley

No further feedback. Closing

@portante

@clintongormley, I encountered the same problem, and the above reroute fixed that instance. How do I fix this so that all new marvel indices don't have this problem? Do I need to add a template that addresses this?
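
For illustration only, such a template might look like the sketch below; the template name and the replica count are placeholders, not a confirmed fix for this issue:

# "marvel_defaults" and "number_of_replicas: 0" are illustrative placeholders
curl -XPUT http://localhost:9200/_template/marvel_defaults -d '
{
  "template": ".marvel-es-*",
  "settings": {
    "number_of_replicas": 0
  }
}
'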

@clintongormley

@portante the important thing to figure out is why the index is not being allocated - we never got to the bottom of the story here. Possibly to do with allocation settings? Feel free to open a new issue so we can delve into it

@lcawl added the :Distributed/Distributed label and removed the :Allocation label Feb 13, 2018
@clintongormley added the :Distributed/Allocation label and removed the :Distributed/Distributed label Feb 14, 2018