
5.6.10 to 6.3.0 rolling upgrade broken with 'commit doesn't contain history uuid' when a synced flush is performed #31482

Closed
praseodym opened this issue Jun 20, 2018 · 10 comments
Assignees
Labels
blocker >bug :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. >upgrade v6.3.0

Comments

@praseodym

A rolling upgrade of an Elasticsearch 5.6.10 cluster to version 6.3.0 fails with "java.lang.IllegalStateException: commit doesn't contain history uuid" when a synced flush (_flush/synced) is performed beforehand, as described in the rolling upgrade documentation.

Steps to reproduce:

  1. Start multi-node 5.6.10 cluster
  2. Index some data
  3. Disable shard allocation
  4. Perform a synced flush
  5. Shut down and upgrade one of the nodes
  6. Reenable shard allocation
  7. Node joins the cluster but never fully starts

I cannot reproduce the problem without performing the synced flush. I think this problem could have been introduced in #28245.
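One way to confirm that it is the synced flush that stamps the offending commit is to inspect the per-shard Lucene commit user data via the indices stats API before and after the flush. A sketch (the helper name and the trimmed sample response are hypothetical; a synced-flushed 5.6.x commit carries a sync_id but, on 5.6.x, no history_uuid yet, which is exactly the field the 6.3.0 engine demands):

```shell
# Print, for each shard copy, whether its last commit carries a
# sync_id (written by a synced flush) and a history_uuid.
check_commits() {
  python3 -c '
import json, sys
stats = json.load(sys.stdin)
for index, data in stats["indices"].items():
    for shard_id, copies in data["shards"].items():
        for copy in copies:
            ud = copy["commit"]["user_data"]
            print(index, shard_id,
                  "sync_id=%s" % ("sync_id" in ud),
                  "history_uuid=%s" % ("history_uuid" in ud))
'
}

# Hypothetical, trimmed sample of what a synced-flushed 5.6.x shard
# returns from: curl -s "127.0.0.1:9200/shakespeare/_stats?level=shards"
sample='{"indices":{"shakespeare":{"shards":{"0":[{"commit":{"user_data":{"sync_id":"AVvFY-071siAOuFGEO9P","translog_uuid":"abc"}}}]}}}}'
echo "$sample" | check_commits

# Against the live cluster from the reproduction script:
# curl -s "127.0.0.1:9200/shakespeare/_stats?level=shards" | check_commits
```

Running this against the 5.6.10 cluster right after the synced flush should show sync_id=True and history_uuid=False for every shard copy.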

Reproduction script (takes about a minute to reproduce the issue):
#!/bin/bash
set -ex

# Setup
docker rm -f es1 || true
docker rm -f es2 || true
docker network inspect es || docker network create es
rm -rf /tmp/esdata
mkdir -p /tmp/esdata/data1 /tmp/esdata/data2 /tmp/esdata/snapshot
sudo chown -R 1000:1000 /tmp/esdata
sudo sysctl -w vm.max_map_count=262144

# Start two-node Elasticsearch 5.6.10 cluster
docker run -d --name es1 --net es -v /tmp/esdata/data1:/usr/share/elasticsearch/data -v /tmp/esdata/snapshot:/snapshot -e path.repo=/snapshot -e xpack.security.enabled=false -e discovery.zen.ping.unicast.hosts=es2 -p 127.0.0.1:9200:9200 docker.elastic.co/elasticsearch/elasticsearch:5.6.10
docker run -d --name es2 --net es -v /tmp/esdata/data2:/usr/share/elasticsearch/data -v /tmp/esdata/snapshot:/snapshot -e path.repo=/snapshot -e xpack.security.enabled=false -e discovery.zen.ping.unicast.hosts=es1 -p 127.0.0.1:9201:9200 docker.elastic.co/elasticsearch/elasticsearch:5.6.10
while ! http 127.0.0.1:9200/_cluster/health?wait_for_status=green; do sleep 1; done

# Index some sample data
curl https://download.elastic.co/demos/kibana/gettingstarted/shakespeare_6.0.json | curl -H 'Content-Type: application/x-ndjson' -XPOST '127.0.0.1:9200/shakespeare/doc/_bulk?pretty' --data-binary @-

# Perform rolling upgrade to 6.3.0 according to docs at
# https://www.elastic.co/guide/en/elasticsearch/reference/current/rolling-upgrades.html

# Step 1: disable shard allocation
http PUT 127.0.0.1:9200/_cluster/settings persistent:='{"cluster.routing.allocation.enable": "none"}'

# Step 2: stop non-essential indexing and perform a synced flush
# Without this step, the upgrade succeeds!
http POST 127.0.0.1:9200/_flush/synced

# Step 4: shut down a single node
docker stop es2
docker rm es2

# Step 5, 7: upgrade and start that node
docker run -d --name es2 --net es -v /tmp/esdata/data2:/usr/share/elasticsearch/data -v /tmp/esdata/snapshot:/snapshot -e path.repo=/snapshot -e discovery.zen.ping.unicast.hosts=es1 -p 127.0.0.1:9201:9200 docker.elastic.co/elasticsearch/elasticsearch:6.3.0
while ! http 127.0.0.1:9201; do sleep 1; done

# Step 8: reenable shard allocation
http --check-status PUT 127.0.0.1:9200/_cluster/settings persistent:='{"cluster.routing.allocation.enable": null}'

# Watch mayhem ensue
docker logs -f es2
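Instead of tailing the logs, recovery progress can also be polled through the cluster health API; the failed shards never leave the initializing/unassigned counts. A sketch (the helper name and sample response are hypothetical; field names follow the cluster health API):

```shell
# Succeeds (exit 0) once no shards are initializing or unassigned,
# i.e. recovery either finished or has given up entirely.
all_shards_settled() {
  python3 -c '
import json, sys
health = json.load(sys.stdin)
settled = (health["initializing_shards"] == 0
           and health["unassigned_shards"] == 0)
sys.exit(0 if settled else 1)
'
}

# Hypothetical sample while recovery is still retrying:
sample='{"status":"yellow","initializing_shards":2,"unassigned_shards":3}'
echo "$sample" | all_shards_settled && echo settled || echo still-recovering

# Real polling loop against the cluster:
# until curl -s 127.0.0.1:9200/_cluster/health | all_shards_settled; do sleep 2; done
```

In the broken upgrade the loop never terminates, because the shards on the upgraded node keep failing recovery with the stack trace below.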
Log including stack traces from the upgraded node
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
[2018-06-20T21:38:02,917][INFO ][o.e.n.Node               ] [] initializing ...
[2018-06-20T21:38:02,958][INFO ][o.e.e.NodeEnvironment    ] [uLAJsY1] using [1] data paths, mounts [[/usr/share/elasticsearch/data (tmpfs)]], net usable_space [15.6gb], net total_space [15.7gb], types [tmpfs]
[2018-06-20T21:38:02,959][INFO ][o.e.e.NodeEnvironment    ] [uLAJsY1] heap size [989.8mb], compressed ordinary object pointers [true]
[2018-06-20T21:38:02,972][INFO ][o.e.n.Node               ] [uLAJsY1] node name derived from node ID [uLAJsY1xT5yhCUzAvNa8ag]; set [node.name] to override
[2018-06-20T21:38:02,972][INFO ][o.e.n.Node               ] [uLAJsY1] version[6.3.0], pid[1], build[default/tar/424e937/2018-06-11T23:38:03.357887Z], OS[Linux/4.17.2-1-ARCH/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/10.0.1/10.0.1+10]
[2018-06-20T21:38:02,972][INFO ][o.e.n.Node               ] [uLAJsY1] JVM arguments [-Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/tmp/elasticsearch.jX5EEUqv, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Djava.locale.providers=COMPAT, -Des.cgroups.hierarchy.override=/, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/usr/share/elasticsearch/config, -Des.distribution.flavor=default, -Des.distribution.type=tar]
[2018-06-20T21:38:04,206][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [aggs-matrix-stats]
[2018-06-20T21:38:04,206][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [analysis-common]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [ingest-common]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [lang-expression]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [lang-mustache]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [lang-painless]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [mapper-extras]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [parent-join]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [percolator]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [rank-eval]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [reindex]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [repository-url]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [transport-netty4]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [tribe]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-core]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-deprecation]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-graph]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-logstash]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-ml]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-monitoring]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-rollup]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-security]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-sql]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-upgrade]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-watcher]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded plugin [ingest-geoip]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded plugin [ingest-user-agent]
[2018-06-20T21:38:06,118][INFO ][o.e.x.s.a.s.FileRolesStore] [uLAJsY1] parsed [0] roles from file [/usr/share/elasticsearch/config/roles.yml]
[2018-06-20T21:38:06,428][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [controller/172] [Main.cc@109] controller (64 bit): Version 6.3.0 (Build 0f0a34c67965d7) Copyright (c) 2018 Elasticsearch BV
[2018-06-20T21:38:06,632][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,634][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,640][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,641][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,643][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,644][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,644][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,644][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,645][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,646][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,647][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,648][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,650][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,865][INFO ][o.e.d.DiscoveryModule    ] [uLAJsY1] using discovery type [zen]
[2018-06-20T21:38:07,373][INFO ][o.e.n.Node               ] [uLAJsY1] initialized
[2018-06-20T21:38:07,373][INFO ][o.e.n.Node               ] [uLAJsY1] starting ...
[2018-06-20T21:38:07,481][INFO ][o.e.t.TransportService   ] [uLAJsY1] publish_address {172.19.0.3:9300}, bound_addresses {0.0.0.0:9300}
[2018-06-20T21:38:07,497][INFO ][o.e.b.BootstrapChecks    ] [uLAJsY1] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2018-06-20T21:38:10,646][INFO ][o.e.c.s.ClusterApplierService] [uLAJsY1] detected_master {4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true}, added {{4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true},}, reason: apply cluster state (from master [master {4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true} committed version [36]])
[2018-06-20T21:38:10,651][INFO ][o.e.c.s.ClusterSettings  ] [uLAJsY1] updating [cluster.routing.allocation.enable] from [all] to [none]
[2018-06-20T21:38:10,827][WARN ][o.e.x.s.a.s.m.NativeRoleMappingStore] [uLAJsY1] Failed to clear cache for realms [[]]
[2018-06-20T21:38:10,837][INFO ][o.e.l.LicenseService     ] [uLAJsY1] license [3d2953c0-7b27-4738-861b-091c92a4fd31] mode [trial] - valid
[2018-06-20T21:38:10,865][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [uLAJsY1] publish_address {172.19.0.3:9200}, bound_addresses {0.0.0.0:9200}
[2018-06-20T21:38:10,865][INFO ][o.e.n.Node               ] [uLAJsY1] started
[2018-06-20T21:38:10,894][INFO ][o.e.x.m.e.l.LocalExporter] waiting for elected master node [{4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true}] to setup local exporter [default_local] (does it have x-pack installed?)
[2018-06-20T21:38:10,925][INFO ][o.e.x.m.e.l.LocalExporter] waiting for elected master node [{4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true}] to setup local exporter [default_local] (does it have x-pack installed?)
[2018-06-20T21:38:10,954][INFO ][o.e.x.m.e.l.LocalExporter] waiting for elected master node [{4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true}] to setup local exporter [default_local] (does it have x-pack installed?)
[2018-06-20T21:38:11,381][INFO ][o.e.c.s.ClusterSettings  ] [uLAJsY1] updating [cluster.routing.allocation.enable] from [none] to [all]
[2018-06-20T21:38:11,392][INFO ][o.e.x.m.e.l.LocalExporter] waiting for elected master node [{4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true}] to setup local exporter [default_local] (does it have x-pack installed?)
[2018-06-20T21:38:11,529][INFO ][o.e.x.m.e.l.LocalExporter] waiting for elected master node [{4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true}] to setup local exporter [default_local] (does it have x-pack installed?)
[2018-06-20T21:38:11,592][WARN ][o.e.i.c.IndicesClusterStateService] [uLAJsY1] [[shakespeare][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [shakespeare][0]: Recovery failed from {4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true} into {uLAJsY1}{uLAJsY1xT5yhCUzAvNa8ag}{J4vNZ9OETdeO8pxepzmRHw}{172.19.0.3}{172.19.0.3:9300}{ml.machine_memory=33728278528, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:623) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [4E_A_7z][172.19.0.2:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:140) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [0] files with total size of [0b]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:337) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [uLAJsY1][172.19.0.3:9300][internal:index/shard/recovery/prepare_translog]
Caused by: java.lang.IllegalStateException: commit doesn't contain history uuid
	at org.elasticsearch.index.engine.InternalEngine.loadHistoryUUID(InternalEngine.java:493) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:193) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:157) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:2152) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:2134) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1341) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.openEngineAndSkipTranslogRecovery(IndexShard.java:1305) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.RecoveryTarget.prepareForTranslogOperations(RecoveryTarget.java:366) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:403) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:397) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:246) ~[?:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:304) ~[?:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1592) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:844) ~[?:?]
[2018-06-20T21:38:11,602][WARN ][o.e.i.c.IndicesClusterStateService] [uLAJsY1] [[.monitoring-es-6-2018.06.20][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [.monitoring-es-6-2018.06.20][0]: Recovery failed from {4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true} into {uLAJsY1}{uLAJsY1xT5yhCUzAvNa8ag}{J4vNZ9OETdeO8pxepzmRHw}{172.19.0.3}{172.19.0.3:9300}{ml.machine_memory=33728278528, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:623) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [4E_A_7z][172.19.0.2:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:140) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [0] files with total size of [0b]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:337) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [uLAJsY1][172.19.0.3:9300][internal:index/shard/recovery/prepare_translog]
Caused by: java.lang.IllegalStateException: commit doesn't contain history uuid
	at org.elasticsearch.index.engine.InternalEngine.loadHistoryUUID(InternalEngine.java:493) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:193) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:157) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:2152) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:2134) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1341) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.openEngineAndSkipTranslogRecovery(IndexShard.java:1305) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.RecoveryTarget.prepareForTranslogOperations(RecoveryTarget.java:366) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:403) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:397) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:246) ~[?:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:304) ~[?:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1592) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:844) ~[?:?]
[2018-06-20T21:38:11,634][INFO ][o.e.x.m.e.l.LocalExporter] waiting for elected master node [{4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true}] to setup local exporter [default_local] (does it have x-pack installed?)
[2018-06-20T21:38:11,657][WARN ][o.e.i.c.IndicesClusterStateService] [uLAJsY1] [[shakespeare][3]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [shakespeare][3]: Recovery failed from {4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true} into {uLAJsY1}{uLAJsY1xT5yhCUzAvNa8ag}{J4vNZ9OETdeO8pxepzmRHw}{172.19.0.3}{172.19.0.3:9300}{ml.machine_memory=33728278528, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:623) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [4E_A_7z][172.19.0.2:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:140) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [0] files with total size of [0b]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:337) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [uLAJsY1][172.19.0.3:9300][internal:index/shard/recovery/prepare_translog]
Caused by: java.lang.IllegalStateException: commit doesn't contain history uuid
	at org.elasticsearch.index.engine.InternalEngine.loadHistoryUUID(InternalEngine.java:493) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:193) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:157) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:2152) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:2134) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1341) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.openEngineAndSkipTranslogRecovery(IndexShard.java:1305) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.RecoveryTarget.prepareForTranslogOperations(RecoveryTarget.java:366) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:403) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:397) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:246) ~[?:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:304) ~[?:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1592) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:844) ~[?:?]
[2018-06-20T21:38:11,669][WARN ][o.e.i.c.IndicesClusterStateService] [uLAJsY1] [[.watches][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [.watches][0]: Recovery failed from {4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true} into {uLAJsY1}{uLAJsY1xT5yhCUzAvNa8ag}{J4vNZ9OETdeO8pxepzmRHw}{172.19.0.3}{172.19.0.3:9300}{ml.machine_memory=33728278528, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:623) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [4E_A_7z][172.19.0.2:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:140) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [0] files with total size of [0b]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:337) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [uLAJsY1][172.19.0.3:9300][internal:index/shard/recovery/prepare_translog]
Caused by: java.lang.IllegalStateException: commit doesn't contain history uuid
	at org.elasticsearch.index.engine.InternalEngine.loadHistoryUUID(InternalEngine.java:493) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:193) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:157) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:2152) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:2134) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1341) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.openEngineAndSkipTranslogRecovery(IndexShard.java:1305) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.RecoveryTarget.prepareForTranslogOperations(RecoveryTarget.java:366) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:403) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:397) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:246) ~[?:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:304) ~[?:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1592) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:844) ~[?:?]
[2018-06-20T21:38:11,681][INFO ][o.e.x.m.e.l.LocalExporter] waiting for elected master node [{4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true}] to setup local exporter [default_local] (does it have x-pack installed?)

--- cut, Elasticsearch never seems to recover from this ---
@dnhatn dnhatn added the :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. label Jun 21, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

@dnhatn dnhatn self-assigned this Jun 21, 2018
@dnhatn dnhatn added the >bug label Jun 21, 2018
@dnhatn (Member) commented Jun 21, 2018

This bug can happen in the following scenario.

  1. Have a primary and replica in 5.6.10 with some docs
  2. Issue a synced-flush
  3. Shutdown the replica, then upgrade that node to 6.3.0
  4. Start the replica node
  5. The replica executes a file-based recovery, but it won't receive any files because the commit is sealed. The commit on the replica was created in v5 and therefore does not have a historyUUID. Unfortunately, we assume that a file-based recovery always delivers a new commit (with a historyUUID).
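For reference, the synced flush in step 2 is a single request; this sketch assumes the cluster is reachable on localhost:9200 and requires a running cluster to actually execute:

```shell
# Seal all indices: a synced flush writes a sync_id marker into each
# shard's Lucene commit so that identical shard copies can skip the
# file-copy phase during peer recovery.
curl -XPOST 'http://localhost:9200/_flush/synced?pretty'
```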

@dnhatn (Member) commented Jun 21, 2018

@praseodym Thanks for reporting this bug. We are working on the fix.

@dnhatn dnhatn added >upgrade :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. and removed :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. labels Jun 21, 2018
bleskes added a commit that referenced this issue Jun 21, 2018
bleskes added a commit that referenced this issue Jun 21, 2018
dnhatn added a commit that referenced this issue Jun 22, 2018
…6.3 (#31506)

Today we make sure that a 5.x index commit has all required commit tags
in the RecoveryTarget#cleanFiles method. We do this in
RecoveryTarget#cleanFiles because it is only needed in a file-based
recovery, and we assume that #cleanFiles is only called during a
file-based recovery. However, this assumption does not hold if the index
is sealed (i.e. synced-flushed). This incorrect assumption would prevent
users from rolling upgrades from 5.x to 6.3 if their indices were sealed.

Closes #31482
dnhatn added a commit that referenced this issue Jun 22, 2018
@dnhatn (Member) commented Jun 22, 2018

This is fixed by #31506. The fix will be included in 6.3.1.

@dnhatn dnhatn closed this as completed Jun 22, 2018
@praseodym (Author)

Thank you! Considering that this is a blocker for upgrades, when will 6.3.1 be released?

@bleskes (Contributor) commented Jun 22, 2018

> Thank you! Considering that this is a blocker for upgrades, when will 6.3.1 be released?

That's still unknown at this point. Obviously this is a serious issue. Working on it.

dnhatn added a commit that referenced this issue Jun 23, 2018
Although the master branch is not affected by #31482, it's helpful to
have BWC tests that verify peer recovery with a synced-flush index.
This commit adds the BWC tests from #31506 to the master branch.

Relates #31482
Relates #31506
colings86 pushed a commit that referenced this issue Jun 25, 2018
@JalehD commented Jun 25, 2018

@bleskes @dnhatn Is there a workaround to recover from this state? What's the recommended approach once an upgrade has been affected by this bug?

@gmoskovicz (Contributor)

Is removing the replica shards an option after upgrading? Or would that not work, making an upgrade to 6.3.1 the only option?

@bleskes (Contributor) commented Jun 25, 2018

@gmoskovicz A direct rolling upgrade from 5.x to 6.3 just won't work. You can do a rolling upgrade to a 6.x version before 6.3 and then to 6.3. You can also move to 6.3 with a full cluster restart, then reduce the number of replicas and bring them back up (forcing the stale data to be cleaned). PLEASE try this first - I think it should work, but by now it should be clear this is tricky, with many moving parts.
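The replica-reduction step can be sketched with curl; the index name `offending-index` and `localhost:9200` are assumptions, and the commands require a running cluster after the full restart:

```shell
# Drop the replicas so the stale pre-6.x shard copies are deleted.
curl -XPUT 'http://localhost:9200/offending-index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 0}}'

# Once the cluster is green again, restore the replicas; they will be
# rebuilt from the upgraded primaries.
curl -XPUT 'http://localhost:9200/offending-index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 1}}'
```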

@dnhatn (Member) commented Jun 25, 2018

A cleaner workaround is to force-flush the offending index and then retry the cluster allocation:

  1. Force flush the offending index POST /offending-index/_flush?force=true
  2. Retry the cluster allocation POST /_cluster/reroute?retry_failed
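The two steps above as curl commands, assuming the cluster is reachable on localhost:9200 (these need a live cluster to run):

```shell
# 1. Force a flush: this writes a fresh commit on each shard of the
#    index, which picks up the missing historyUUID metadata.
curl -XPOST 'http://localhost:9200/offending-index/_flush?force=true&pretty'

# 2. Ask the master to retry shard allocations that previously failed.
curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'
```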
