
5.6.10 to 6.3.0 rolling upgrade broken with 'commit doesn't contain history uuid' when a synced flush is performed #31482

Closed
praseodym opened this issue Jun 20, 2018 · 10 comments
Assignees
Labels
blocker >bug :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. >upgrade v6.3.0

Comments

@praseodym

A rolling upgrade of an Elasticsearch 5.6.10 cluster to version 6.3.0 fails with "java.lang.IllegalStateException: commit doesn't contain history uuid" when a synced flush (_flush/synced) is performed beforehand, as described in the rolling upgrade documentation.

Steps to reproduce:

  1. Start multi-node 5.6.10 cluster
  2. Index some data
  3. Disable shard allocation
  4. Perform a synced flush
  5. Shut down and upgrade one of the nodes
  6. Reenable shard allocation
  7. Node joins the cluster but never fully starts

I cannot reproduce the problem without performing the synced flush. I think this problem could have been introduced in #28245.
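One way to confirm that it is the synced flush that stamps the offending commit is to inspect the per-shard Lucene commit user data via the indices stats API before and after the flush. A sketch (the helper name and the trimmed sample response are hypothetical; a synced-flushed 5.6.x commit carries a sync_id but, on 5.6.x, no history_uuid yet, which is exactly the field the 6.3.0 engine demands):

```shell
# Print, for each shard copy, whether its last commit carries a
# sync_id (written by a synced flush) and a history_uuid.
check_commits() {
  python3 -c '
import json, sys
stats = json.load(sys.stdin)
for index, data in stats["indices"].items():
    for shard_id, copies in data["shards"].items():
        for copy in copies:
            ud = copy["commit"]["user_data"]
            print(index, shard_id,
                  "sync_id=%s" % ("sync_id" in ud),
                  "history_uuid=%s" % ("history_uuid" in ud))
'
}

# Hypothetical, trimmed sample of what a synced-flushed 5.6.x shard
# returns from: curl -s "127.0.0.1:9200/shakespeare/_stats?level=shards"
sample='{"indices":{"shakespeare":{"shards":{"0":[{"commit":{"user_data":{"sync_id":"AVvFY-071siAOuFGEO9P","translog_uuid":"abc"}}}]}}}}'
echo "$sample" | check_commits

# Against the live cluster from the reproduction script:
# curl -s "127.0.0.1:9200/shakespeare/_stats?level=shards" | check_commits
```

Running this against the 5.6.10 cluster right after the synced flush should show sync_id=True and history_uuid=False for every shard copy.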

Reproduction script (takes about a minute to reproduce the issue):
#!/bin/bash
set -ex

# Setup
docker rm -f es1 || true
docker rm -f es2 || true
docker network inspect es || docker network create es
rm -rf /tmp/esdata
mkdir -p /tmp/esdata/data1 /tmp/esdata/data2 /tmp/esdata/snapshot
sudo chown -R 1000:1000 /tmp/esdata
sudo sysctl -w vm.max_map_count=262144

# Start two-node Elasticsearch 5.6.10 cluster
docker run -d --name es1 --net es -v /tmp/esdata/data1:/usr/share/elasticsearch/data -v /tmp/esdata/snapshot:/snapshot -e path.repo=/snapshot -e xpack.security.enabled=false -e discovery.zen.ping.unicast.hosts=es2 -p 127.0.0.1:9200:9200 docker.elastic.co/elasticsearch/elasticsearch:5.6.10
docker run -d --name es2 --net es -v /tmp/esdata/data2:/usr/share/elasticsearch/data -v /tmp/esdata/snapshot:/snapshot -e path.repo=/snapshot -e xpack.security.enabled=false -e discovery.zen.ping.unicast.hosts=es1 -p 127.0.0.1:9201:9200 docker.elastic.co/elasticsearch/elasticsearch:5.6.10
while ! http 127.0.0.1:9200/_cluster/health?wait_for_status=green; do sleep 1; done

# Index some sample data
curl https://download.elastic.co/demos/kibana/gettingstarted/shakespeare_6.0.json | curl -H 'Content-Type: application/x-ndjson' -XPOST '127.0.0.1:9200/shakespeare/doc/_bulk?pretty' --data-binary @-

# Perform rolling upgrade to 6.3.0 according to docs at
# https://www.elastic.co/guide/en/elasticsearch/reference/current/rolling-upgrades.html

# Step 1: disable shard allocation
http PUT 127.0.0.1:9200/_cluster/settings persistent:='{"cluster.routing.allocation.enable": "none"}'

# Step 2: stop non-essential indexing and perform a synced flush
# Without this step, the upgrade succeeds!
http POST 127.0.0.1:9200/_flush/synced

# Step 4: shut down a single node
docker stop es2
docker rm es2

# Step 5, 7: upgrade and start that node
docker run -d --name es2 --net es -v /tmp/esdata/data2:/usr/share/elasticsearch/data -v /tmp/esdata/snapshot:/snapshot -e path.repo=/snapshot -e discovery.zen.ping.unicast.hosts=es1 -p 127.0.0.1:9201:9200 docker.elastic.co/elasticsearch/elasticsearch:6.3.0
while ! http 127.0.0.1:9201; do sleep 1; done

# Step 8: reenable shard allocation
http --check-status PUT 127.0.0.1:9200/_cluster/settings persistent:='{"cluster.routing.allocation.enable": null}'

# Watch mayhem ensue
docker logs -f es2
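Instead of tailing the logs, recovery progress can also be polled through the cluster health API; the failed shards never leave the initializing/unassigned counts. A sketch (the helper name and sample response are hypothetical; field names follow the cluster health API):

```shell
# Succeeds (exit 0) once no shards are initializing or unassigned,
# i.e. recovery either finished or has given up entirely.
all_shards_settled() {
  python3 -c '
import json, sys
health = json.load(sys.stdin)
settled = (health["initializing_shards"] == 0
           and health["unassigned_shards"] == 0)
sys.exit(0 if settled else 1)
'
}

# Hypothetical sample while recovery is still retrying:
sample='{"status":"yellow","initializing_shards":2,"unassigned_shards":3}'
echo "$sample" | all_shards_settled && echo settled || echo still-recovering

# Real polling loop against the cluster:
# until curl -s 127.0.0.1:9200/_cluster/health | all_shards_settled; do sleep 2; done
```

In the broken upgrade the loop never terminates, because the shards on the upgraded node keep failing recovery with the stack trace below.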
Log including stack traces from the upgraded node
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
[2018-06-20T21:38:02,917][INFO ][o.e.n.Node               ] [] initializing ...
[2018-06-20T21:38:02,958][INFO ][o.e.e.NodeEnvironment    ] [uLAJsY1] using [1] data paths, mounts [[/usr/share/elasticsearch/data (tmpfs)]], net usable_space [15.6gb], net total_space [15.7gb], types [tmpfs]
[2018-06-20T21:38:02,959][INFO ][o.e.e.NodeEnvironment    ] [uLAJsY1] heap size [989.8mb], compressed ordinary object pointers [true]
[2018-06-20T21:38:02,972][INFO ][o.e.n.Node               ] [uLAJsY1] node name derived from node ID [uLAJsY1xT5yhCUzAvNa8ag]; set [node.name] to override
[2018-06-20T21:38:02,972][INFO ][o.e.n.Node               ] [uLAJsY1] version[6.3.0], pid[1], build[default/tar/424e937/2018-06-11T23:38:03.357887Z], OS[Linux/4.17.2-1-ARCH/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/10.0.1/10.0.1+10]
[2018-06-20T21:38:02,972][INFO ][o.e.n.Node               ] [uLAJsY1] JVM arguments [-Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/tmp/elasticsearch.jX5EEUqv, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Djava.locale.providers=COMPAT, -Des.cgroups.hierarchy.override=/, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/usr/share/elasticsearch/config, -Des.distribution.flavor=default, -Des.distribution.type=tar]
[2018-06-20T21:38:04,206][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [aggs-matrix-stats]
[2018-06-20T21:38:04,206][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [analysis-common]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [ingest-common]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [lang-expression]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [lang-mustache]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [lang-painless]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [mapper-extras]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [parent-join]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [percolator]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [rank-eval]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [reindex]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [repository-url]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [transport-netty4]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [tribe]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-core]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-deprecation]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-graph]
[2018-06-20T21:38:04,207][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-logstash]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-ml]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-monitoring]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-rollup]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-security]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-sql]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-upgrade]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded module [x-pack-watcher]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded plugin [ingest-geoip]
[2018-06-20T21:38:04,208][INFO ][o.e.p.PluginsService     ] [uLAJsY1] loaded plugin [ingest-user-agent]
[2018-06-20T21:38:06,118][INFO ][o.e.x.s.a.s.FileRolesStore] [uLAJsY1] parsed [0] roles from file [/usr/share/elasticsearch/config/roles.yml]
[2018-06-20T21:38:06,428][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [controller/172] [Main.cc@109] controller (64 bit): Version 6.3.0 (Build 0f0a34c67965d7) Copyright (c) 2018 Elasticsearch BV
[2018-06-20T21:38:06,632][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,634][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,640][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,641][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,643][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,644][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,644][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,644][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,645][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,646][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,647][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,648][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,650][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2018-06-20T21:38:06,865][INFO ][o.e.d.DiscoveryModule    ] [uLAJsY1] using discovery type [zen]
[2018-06-20T21:38:07,373][INFO ][o.e.n.Node               ] [uLAJsY1] initialized
[2018-06-20T21:38:07,373][INFO ][o.e.n.Node               ] [uLAJsY1] starting ...
[2018-06-20T21:38:07,481][INFO ][o.e.t.TransportService   ] [uLAJsY1] publish_address {172.19.0.3:9300}, bound_addresses {0.0.0.0:9300}
[2018-06-20T21:38:07,497][INFO ][o.e.b.BootstrapChecks    ] [uLAJsY1] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2018-06-20T21:38:10,646][INFO ][o.e.c.s.ClusterApplierService] [uLAJsY1] detected_master {4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true}, added {{4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true},}, reason: apply cluster state (from master [master {4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true} committed version [36]])
[2018-06-20T21:38:10,651][INFO ][o.e.c.s.ClusterSettings  ] [uLAJsY1] updating [cluster.routing.allocation.enable] from [all] to [none]
[2018-06-20T21:38:10,827][WARN ][o.e.x.s.a.s.m.NativeRoleMappingStore] [uLAJsY1] Failed to clear cache for realms [[]]
[2018-06-20T21:38:10,837][INFO ][o.e.l.LicenseService     ] [uLAJsY1] license [3d2953c0-7b27-4738-861b-091c92a4fd31] mode [trial] - valid
[2018-06-20T21:38:10,865][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [uLAJsY1] publish_address {172.19.0.3:9200}, bound_addresses {0.0.0.0:9200}
[2018-06-20T21:38:10,865][INFO ][o.e.n.Node               ] [uLAJsY1] started
[2018-06-20T21:38:10,894][INFO ][o.e.x.m.e.l.LocalExporter] waiting for elected master node [{4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true}] to setup local exporter [default_local] (does it have x-pack installed?)
[2018-06-20T21:38:10,925][INFO ][o.e.x.m.e.l.LocalExporter] waiting for elected master node [{4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true}] to setup local exporter [default_local] (does it have x-pack installed?)
[2018-06-20T21:38:10,954][INFO ][o.e.x.m.e.l.LocalExporter] waiting for elected master node [{4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true}] to setup local exporter [default_local] (does it have x-pack installed?)
[2018-06-20T21:38:11,381][INFO ][o.e.c.s.ClusterSettings  ] [uLAJsY1] updating [cluster.routing.allocation.enable] from [none] to [all]
[2018-06-20T21:38:11,392][INFO ][o.e.x.m.e.l.LocalExporter] waiting for elected master node [{4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true}] to setup local exporter [default_local] (does it have x-pack installed?)
[2018-06-20T21:38:11,529][INFO ][o.e.x.m.e.l.LocalExporter] waiting for elected master node [{4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true}] to setup local exporter [default_local] (does it have x-pack installed?)
[2018-06-20T21:38:11,592][WARN ][o.e.i.c.IndicesClusterStateService] [uLAJsY1] [[shakespeare][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [shakespeare][0]: Recovery failed from {4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true} into {uLAJsY1}{uLAJsY1xT5yhCUzAvNa8ag}{J4vNZ9OETdeO8pxepzmRHw}{172.19.0.3}{172.19.0.3:9300}{ml.machine_memory=33728278528, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:623) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [4E_A_7z][172.19.0.2:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:140) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [0] files with total size of [0b]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:337) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [uLAJsY1][172.19.0.3:9300][internal:index/shard/recovery/prepare_translog]
Caused by: java.lang.IllegalStateException: commit doesn't contain history uuid
	at org.elasticsearch.index.engine.InternalEngine.loadHistoryUUID(InternalEngine.java:493) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:193) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:157) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:2152) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:2134) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1341) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.openEngineAndSkipTranslogRecovery(IndexShard.java:1305) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.RecoveryTarget.prepareForTranslogOperations(RecoveryTarget.java:366) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:403) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:397) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:246) ~[?:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:304) ~[?:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1592) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:844) ~[?:?]
[2018-06-20T21:38:11,602][WARN ][o.e.i.c.IndicesClusterStateService] [uLAJsY1] [[.monitoring-es-6-2018.06.20][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [.monitoring-es-6-2018.06.20][0]: Recovery failed from {4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true} into {uLAJsY1}{uLAJsY1xT5yhCUzAvNa8ag}{J4vNZ9OETdeO8pxepzmRHw}{172.19.0.3}{172.19.0.3:9300}{ml.machine_memory=33728278528, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:623) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [4E_A_7z][172.19.0.2:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:140) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [0] files with total size of [0b]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:337) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [uLAJsY1][172.19.0.3:9300][internal:index/shard/recovery/prepare_translog]
Caused by: java.lang.IllegalStateException: commit doesn't contain history uuid
	at org.elasticsearch.index.engine.InternalEngine.loadHistoryUUID(InternalEngine.java:493) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:193) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:157) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:2152) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:2134) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1341) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.openEngineAndSkipTranslogRecovery(IndexShard.java:1305) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.RecoveryTarget.prepareForTranslogOperations(RecoveryTarget.java:366) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:403) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:397) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:246) ~[?:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:304) ~[?:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1592) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:844) ~[?:?]
[2018-06-20T21:38:11,634][INFO ][o.e.x.m.e.l.LocalExporter] waiting for elected master node [{4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true}] to setup local exporter [default_local] (does it have x-pack installed?)
[2018-06-20T21:38:11,657][WARN ][o.e.i.c.IndicesClusterStateService] [uLAJsY1] [[shakespeare][3]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [shakespeare][3]: Recovery failed from {4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true} into {uLAJsY1}{uLAJsY1xT5yhCUzAvNa8ag}{J4vNZ9OETdeO8pxepzmRHw}{172.19.0.3}{172.19.0.3:9300}{ml.machine_memory=33728278528, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:623) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [4E_A_7z][172.19.0.2:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:140) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [0] files with total size of [0b]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:337) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [uLAJsY1][172.19.0.3:9300][internal:index/shard/recovery/prepare_translog]
Caused by: java.lang.IllegalStateException: commit doesn't contain history uuid
	at org.elasticsearch.index.engine.InternalEngine.loadHistoryUUID(InternalEngine.java:493) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:193) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:157) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:2152) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:2134) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1341) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.openEngineAndSkipTranslogRecovery(IndexShard.java:1305) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.RecoveryTarget.prepareForTranslogOperations(RecoveryTarget.java:366) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:403) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:397) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:246) ~[?:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:304) ~[?:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1592) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:844) ~[?:?]
[2018-06-20T21:38:11,669][WARN ][o.e.i.c.IndicesClusterStateService] [uLAJsY1] [[.watches][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [.watches][0]: Recovery failed from {4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true} into {uLAJsY1}{uLAJsY1xT5yhCUzAvNa8ag}{J4vNZ9OETdeO8pxepzmRHw}{172.19.0.3}{172.19.0.3:9300}{ml.machine_memory=33728278528, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:80) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:623) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [4E_A_7z][172.19.0.2:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:140) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [0] files with total size of [0b]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:337) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [uLAJsY1][172.19.0.3:9300][internal:index/shard/recovery/prepare_translog]
Caused by: java.lang.IllegalStateException: commit doesn't contain history uuid
	at org.elasticsearch.index.engine.InternalEngine.loadHistoryUUID(InternalEngine.java:493) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:193) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:157) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:2152) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:2134) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1341) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.index.shard.IndexShard.openEngineAndSkipTranslogRecovery(IndexShard.java:1305) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.RecoveryTarget.prepareForTranslogOperations(RecoveryTarget.java:366) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:403) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:397) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:246) ~[?:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:304) ~[?:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1592) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:844) ~[?:?]
[2018-06-20T21:38:11,681][INFO ][o.e.x.m.e.l.LocalExporter] waiting for elected master node [{4E_A_7z}{4E_A_7zATUu6ebxzJFhMrg}{JxDu4xcyTWKdshEZqUgKQw}{172.19.0.2}{172.19.0.2:9300}{ml.max_open_jobs=10, ml.enabled=true}] to setup local exporter [default_local] (does it have x-pack installed?)

--- cut, Elasticsearch never seems to recover from this ---
@dnhatn dnhatn added the :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. label Jun 21, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

@dnhatn dnhatn self-assigned this Jun 21, 2018
@dnhatn dnhatn added the >bug label Jun 21, 2018
@dnhatn (Member) commented Jun 21, 2018

This bug can happen in the following scenario.

  1. Have a primary and replica in 5.6.10 with some docs
  2. Issue a synced-flush
  3. Shutdown the replica, then upgrade that node to 6.3.0
  4. Start the replica node
  5. The replica executes a file-based recovery, but it won't receive any files because the commit is sealed. The commit on the replica was created in v5 and therefore does not have a historyUUID. Unfortunately, we assume that a file-based recovery always delivers a new commit (with a historyUUID).
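For reference, the synced flush in step 2 is a single request; this sketch assumes the cluster is reachable on localhost:9200 and requires a running cluster to actually execute:

```shell
# Seal all indices: a synced flush writes a sync_id marker into each
# shard's Lucene commit so that identical shard copies can skip the
# file-copy phase during peer recovery.
curl -XPOST 'http://localhost:9200/_flush/synced?pretty'
```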

@dnhatn (Member) commented Jun 21, 2018

@praseodym Thanks for reporting this bug. We are working on the fix.

@dnhatn dnhatn added >upgrade :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. and removed :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. labels Jun 21, 2018
bleskes added a commit that referenced this issue Jun 21, 2018
bleskes added a commit that referenced this issue Jun 21, 2018
dnhatn added a commit that referenced this issue Jun 22, 2018
…6.3 (#31506)

Today we make sure that a 5.x index commit has all required commit tags
in the RecoveryTarget#cleanFiles method. We do this in
RecoveryTarget#cleanFiles because it is only needed in a file-based
recovery, and we assume that #cleanFiles is only called during a
file-based recovery. However, this assumption does not hold if the index
is sealed (i.e. synced-flushed). This incorrect assumption would prevent
users from rolling upgrades from 5.x to 6.3 if their indices were sealed.

Closes #31482
dnhatn added a commit that referenced this issue Jun 22, 2018
@dnhatn (Member) commented Jun 22, 2018

This is fixed by #31506. The fix will be included in 6.3.1.

@dnhatn dnhatn closed this as completed Jun 22, 2018
@praseodym (Author)

Thank you! Considering that this is a blocker for upgrades, when will 6.3.1 be released?

@bleskes (Contributor) commented Jun 22, 2018

> Thank you! Considering that this is a blocker for upgrades, when will 6.3.1 be released?

That's still unknown at this point. Obviously this is a serious issue. Working on it.

dnhatn added a commit that referenced this issue Jun 23, 2018
Although the master branch is not affected by #31482, it's helpful to
have BWC tests that verify peer recovery with a synced-flush index.
This commit adds the BWC tests from #31506 to the master branch.

Relates #31482
Relates #31506
colings86 pushed a commit that referenced this issue Jun 25, 2018
@JalehD commented Jun 25, 2018

@bleskes @dnhatn Is there a workaround to recover from this state? What's the recommended approach once an upgrade has been affected by this bug?

@gmoskovicz (Contributor)

Is removing the replica shards an option after upgrading? Or would that not work, making an upgrade to 6.3.1 the only option?

@bleskes (Contributor) commented Jun 25, 2018

@gmoskovicz A direct rolling upgrade from 5.x to 6.3 just won't work. You can do a rolling upgrade to a 6.x version before 6.3 and then to 6.3. You can also move to 6.3 with a full cluster restart, then reduce the number of replicas and bring them back up (forcing the stale data to be cleaned). PLEASE try this first - I think it should work, but by now it should be clear this is tricky, with many moving parts.
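The replica-reduction step can be sketched with curl; the index name `offending-index` and `localhost:9200` are assumptions, and the commands require a running cluster after the full restart:

```shell
# Drop the replicas so the stale pre-6.x shard copies are deleted.
curl -XPUT 'http://localhost:9200/offending-index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 0}}'

# Once the cluster is green again, restore the replicas; they will be
# rebuilt from the upgraded primaries.
curl -XPUT 'http://localhost:9200/offending-index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 1}}'
```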

@dnhatn (Member) commented Jun 25, 2018

A cleaner workaround is to force-flush the offending index and then retry the cluster allocation:

  1. Force flush the offending index POST /offending-index/_flush?force=true
  2. Retry the cluster allocation POST /_cluster/reroute?retry_failed
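The two steps above as curl commands, assuming the cluster is reachable on localhost:9200 (these need a live cluster to run):

```shell
# 1. Force a flush: this writes a fresh commit on each shard of the
#    index, which picks up the missing historyUUID metadata.
curl -XPOST 'http://localhost:9200/offending-index/_flush?force=true&pretty'

# 2. Ask the master to retry shard allocations that previously failed.
curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'
```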
