
Investigate startup times #28650

Closed
jpountz opened this issue Feb 13, 2018 · 18 comments
Labels
>bug >test Issues or PRs that are addressing/adding tests

Comments

@jpountz
Contributor

jpountz commented Feb 13, 2018

Integration tests frequently take more than 20 seconds to start up an Elasticsearch node on an empty or small cluster state, which is a lot of time for a computer. Take the console output from any build and search for #wait (Thread[Task worker for ':',5,main]) completed; the time after Took on the same line is the time that the build had to wait for the node to be available. This is an immediate problem for testing, but it might also be a problem for users if it boils down to an issue that could make things even worse in some adversarial scenarios.
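For example, assuming the console output of a build has been saved to console.log (the file name is just for illustration), the relevant lines can be pulled out with:

    grep '#wait.*Took' console.log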

Relates #28640

@jpountz jpountz added the >bug, >test and :Internal labels Feb 13, 2018
@DaveCTurner
Contributor

Here is an excerpt from the output, including timestamps, of a test run that seems to have failed because it took more than 30 seconds to bring a node up:

13:18:33 :x-pack-elasticsearch:qa:rolling-upgrade:with-system-key:v6.3.0-SNAPSHOT#upgradedClusterTestCluster#wait
13:18:33 Task ':x-pack-elasticsearch:qa:rolling-upgrade:with-system-key:v6.3.0-SNAPSHOT#upgradedClusterTestCluster#wait' is not up-to-date because:
13:18:33   Task has not declared any outputs.
13:19:03 Node 0 output:
13:19:03 |-----------------------------------------
13:19:03 |  failure marker exists: false
13:19:03 |  pid file exists: true
13:19:03 |  http ports file exists: true
13:19:03 |  transport ports file exists: true
13:19:03 |
13:19:03 |  [ant output]
13:19:03 |
13:19:03 |  [log]
13:19:03 |    warning: ignoring JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8
13:19:03 |    [2018-02-12T13:18:49,206][INFO ][o.e.n.Node               ] [node-0] initializing ...
13:19:03 |    [2018-02-12T13:18:49,267][INFO ][o.e.e.NodeEnvironment    ] [node-0] using [1] data paths, mounts [[/ (/dev/xvda1)]], net usable_space [104.5gb], net total_space [992.6gb], types [ext4]
13:19:03 |    [2018-02-12T13:18:49,267][INFO ][o.e.e.NodeEnvironment    ] [node-0] heap size [494.9mb], compressed ordinary object pointers [true]
13:19:03 |    [2018-02-12T13:18:49,371][INFO ][o.e.n.Node               ] [node-0] node name [node-0], node ID [aD7I2WYuQSOXvdOri3TVcQ]
13:19:03 |    [2018-02-12T13:18:49,372][INFO ][o.e.n.Node               ] [node-0] version[7.0.0-alpha1-SNAPSHOT], pid[2130], build[37e938f/2018-02-12T11:11:56.823Z], OS[Linux/4.4.0-1032-aws/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_162/25.162-b12]
13:19:03 |    [2018-02-12T13:18:49,372][INFO ][o.e.n.Node               ] [node-0] JVM arguments [-Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/tmp/elasticsearch.bgRdl4sQ, -XX:+HeapDumpOnOutOfMemoryError, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -Xloggc:logs/gc.log, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=32, -XX:GCLogFileSize=64m, -Xms512m, -Xmx512m, -ea, -esa, -Des.path.home=/var/lib/jenkins/workspace/elastic+x-pack-elasticsearch+master+multijob-unix-compatibility/elasticsearch-extra/x-pack-elasticsearch/qa/rolling-upgrade/with-system-key/build/cluster/v6.3.0-SNAPSHOT#upgradedClusterTestCluster node0/elasticsearch-7.0.0-alpha1-SNAPSHOT, -Des.path.conf=/var/lib/jenkins/workspace/elastic+x-pack-elasticsearch+master+multijob-unix-compatibility/elasticsearch-extra/x-pack-elasticsearch/qa/rolling-upgrade/with-system-key/build/cluster/v6.3.0-SNAPSHOT#upgradedClusterTestCluster node0/elasticsearch-7.0.0-alpha1-SNAPSHOT/config]
13:19:03 |    [2018-02-12T13:18:49,372][WARN ][o.e.n.Node               ] [node-0] version [7.0.0-alpha1-SNAPSHOT] is a pre-release version of Elasticsearch and is not suitable for production
13:19:03 |    [2018-02-12T13:18:51,770][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [aggs-matrix-stats]
13:19:03 |    [2018-02-12T13:18:51,770][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [analysis-common]
13:19:03 |    [2018-02-12T13:18:51,770][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [ingest-common]
13:19:03 |    [2018-02-12T13:18:51,770][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [lang-expression]
13:19:03 |    [2018-02-12T13:18:51,770][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [lang-mustache]
13:19:03 |    [2018-02-12T13:18:51,770][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [lang-painless]
13:19:03 |    [2018-02-12T13:18:51,770][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [mapper-extras]
13:19:03 |    [2018-02-12T13:18:51,771][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [parent-join]
13:19:03 |    [2018-02-12T13:18:51,771][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [percolator]
13:19:03 |    [2018-02-12T13:18:51,771][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [rank-eval]
13:19:03 |    [2018-02-12T13:18:51,771][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [reindex]
13:19:03 |    [2018-02-12T13:18:51,771][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [repository-url]
13:19:03 |    [2018-02-12T13:18:51,771][INFO ][o.e.p.PluginsService     ] [node-0] loaded module [transport-netty4]
13:19:03 |    [2018-02-12T13:18:51,772][INFO ][o.e.p.PluginsService     ] [node-0] loaded plugin [x-pack-core]
13:19:03 |    [2018-02-12T13:18:51,772][INFO ][o.e.p.PluginsService     ] [node-0] loaded plugin [x-pack-deprecation]
13:19:03 |    [2018-02-12T13:18:51,772][INFO ][o.e.p.PluginsService     ] [node-0] loaded plugin [x-pack-graph]
13:19:03 |    [2018-02-12T13:18:51,772][INFO ][o.e.p.PluginsService     ] [node-0] loaded plugin [x-pack-logstash]
13:19:03 |    [2018-02-12T13:18:51,772][INFO ][o.e.p.PluginsService     ] [node-0] loaded plugin [x-pack-ml]
13:19:03 |    [2018-02-12T13:18:51,772][INFO ][o.e.p.PluginsService     ] [node-0] loaded plugin [x-pack-monitoring]
13:19:03 |    [2018-02-12T13:18:51,772][INFO ][o.e.p.PluginsService     ] [node-0] loaded plugin [x-pack-security]
13:19:03 |    [2018-02-12T13:18:51,773][INFO ][o.e.p.PluginsService     ] [node-0] loaded plugin [x-pack-sql]
13:19:03 |    [2018-02-12T13:18:51,773][INFO ][o.e.p.PluginsService     ] [node-0] loaded plugin [x-pack-upgrade]
13:19:03 |    [2018-02-12T13:18:51,773][INFO ][o.e.p.PluginsService     ] [node-0] loaded plugin [x-pack-watcher]
13:19:03 |    [2018-02-12T13:18:55,244][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [controller/2193] [Main.cc@129] controller (64 bit): Version 7.0.0-alpha1-SNAPSHOT (Build 91f906f0147062) Copyright (c) 2018 Elasticsearch BV
13:19:03 |    [2018-02-12T13:18:56,588][DEBUG][o.e.a.ActionModule       ] Using REST wrapper from plugin org.elasticsearch.xpack.security.Security
13:19:03 |    [2018-02-12T13:18:57,137][INFO ][o.e.d.DiscoveryModule    ] [node-0] using discovery type [zen]
13:19:03 |    [2018-02-12T13:18:57,914][INFO ][o.e.n.Node               ] [node-0] initialized
13:19:03 |    [2018-02-12T13:18:57,914][INFO ][o.e.n.Node               ] [node-0] starting ...
13:19:03 |    [2018-02-12T13:18:58,049][INFO ][o.e.t.TransportService   ] [node-0] publish_address {127.0.0.1:38864}, bound_addresses {[::1]:45891}, {127.0.0.1:38864}
13:19:03 |    [2018-02-12T13:18:58,084][WARN ][o.e.b.BootstrapChecks    ] [node-0] HTTPS is required in order to use the token service; please enable HTTPS using the [xpack.security.http.ssl.enabled] setting or disable the token service using the [xpack.security.authc.token.enabled] setting
13:19:03 |    [2018-02-12T13:19:02,757][INFO ][o.e.c.s.ClusterApplierService] [node-0] master node changed {previous [], current [{node-0}{Jhp_A43OR9KOW2UvUo7OnQ}{t9GoosVjStSO0Qu49OY70A}{127.0.0.1}{127.0.0.1:42114}{testattr=test, upgraded=first, ml.machine_memory=16825872384, ml.max_open_jobs=20, ml.enabled=true}]}, added {{node-0}{Jhp_A43OR9KOW2UvUo7OnQ}{t9GoosVjStSO0Qu49OY70A}{127.0.0.1}{127.0.0.1:42114}{testattr=test, upgraded=first, ml.machine_memory=16825872384, ml.max_open_jobs=20, ml.enabled=true},}, reason: apply cluster state (from master [master {node-0}{Jhp_A43OR9KOW2UvUo7OnQ}{t9GoosVjStSO0Qu49OY70A}{127.0.0.1}{127.0.0.1:42114}{testattr=test, upgraded=first, ml.machine_memory=16825872384, ml.max_open_jobs=20, ml.enabled=true} committed version [197]])
13:19:03 |    [2018-02-12T13:19:03,095][INFO ][o.e.x.s.a.TokenService   ] [node-0] refresh keys
13:19:03 |    [2018-02-12T13:19:03,492][INFO ][o.e.x.s.a.TokenService   ] [node-0] refreshed keys
13:19:03 |    [2018-02-12T13:19:03,514][INFO ][o.e.l.LicenseService     ] [node-0] license [d7c217c8-e7a6-4dfd-99f4-84f446ff9a6a] mode [trial] - valid
13:19:03 |    [2018-02-12T13:19:03,542][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [node-0] publish_address {127.0.0.1:42608}, bound_addresses {[::1]:39088}, {127.0.0.1:42608}
13:19:03 |    [2018-02-12T13:19:03,543][INFO ][o.e.n.Node               ] [node-0] started

In particular, more than half of the 30-second timeout elapsed before the timestamp on the first log line, [node-0] initializing ...: the task apparently started at 13:18:33, but the first timestamp in the log is 13:18:49, 16 seconds later.

@DaveCTurner
Contributor

To look at the variation in the durations of *ClusterTestCluster#wait tasks, I downloaded the logs from a selection of recent builds and extracted the times using the following.

cat log.txt | sed -e '/ClusterTestCluster#wait.*Took/!d; s/.* Took //; s/ .*//'

The distribution looks like this, broken down by OS:

[screenshot: distribution of ClusterTestCluster#wait task durations, broken down by OS (2018-02-13)]
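For a quick numeric summary without plotting, the same extraction can be fed into sort to get a crude frequency table (a sketch, reusing the sed filter above):

    sed -e '/ClusterTestCluster#wait.*Took/!d; s/.* Took //; s/ .*//' log.txt | sort -n | uniq -c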

@jpountz
Contributor Author

jpountz commented Feb 13, 2018

I just ran some x-pack rolling upgrade tests with a profiler to see where time is spent. About 70% of the ~51 seconds of CPU time spent by the Elasticsearch node was spent inside the JVM itself, mostly running compilation (~29 seconds).
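One way to sanity-check the compilation figure without a full profiler is to ask the running JVM for its accumulated JIT time (a sketch; <pid> stands for the node's process ID):

    jstat -compiler <pid>    # the "Time" column is the total JIT compilation time in seconds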

@jasontedor
Member

Compilation executes concurrently with application execution. While this is an indication that CPU time is being spent on compilation, it is not necessarily indicative of where the real time during startup is going.

@tvernum
Contributor

tvernum commented Feb 14, 2018

#28659 has added more logging to node startup, but that 16-second startup time is still there, and it occurs before the first log message:

03:27:18   Task has not declared any outputs.
03:27:48 Node 0 output:
03:27:48 |-----------------------------------------
03:27:48 |  failure marker exists: false
03:27:48 |  pid file exists: true
03:27:48 |  http ports file exists: true
03:27:48 |  transport ports file exists: true
03:27:48 |
03:27:48 |  [ant output]
03:27:48 |
03:27:48 |  [log]
03:27:48 |    warning: ignoring JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8
03:27:48 |    [2018-02-14T03:27:34,575][DEBUG][o.e.b.SystemCallFilter   ] Linux seccomp filter installation successful, threads: [all]

// snip

03:27:48 |    [2018-02-14T03:27:34,804][INFO ][o.e.n.Node               ] [node-0] initializing …
03:27:48 |    [2018-02-14T03:27:34,871][INFO ][o.e.e.NodeEnvironment    ] [node-0] using [1] data paths, mounts [[/ (/dev/xvda1)]], net usable_space [340.7gb], net total_space [992.6gb], types [ext4]
03:27:48 |    [2018-02-14T03:27:34,871][INFO ][o.e.e.NodeEnvironment    ] [node-0] heap size [494.9mb], compressed ordinary object pointers [true]
03:27:48 |    [2018-02-14T03:27:34,942][INFO ][o.e.n.Node               ] [node-0] node name [node-0], node ID [NcAMsKj8Tc-e55eiJciNNQ]

// snip

03:27:50 * What went wrong:
03:27:50 Execution failed for task ':x-pack-elasticsearch:qa:rolling-upgrade:with-system-key:v6.3.0-SNAPSHOT#upgradedClusterTestCluster#wait'.
03:27:50 > Failed to start elasticsearch: timed out after 30 seconds

@jpountz
Contributor Author

jpountz commented Feb 14, 2018

Similar behaviour to what @tvernum reported: 13 seconds of startup time before the first log message.

06:29:21 :x-pack-elasticsearch:qa:rolling-upgrade:with-system-key:v5.6.8-SNAPSHOT#upgradedClusterTestCluster#start
[...]
06:29:51 |  [log]
06:29:51 |    warning: ignoring JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8
06:29:51 |    [2018-02-14T06:29:34,804][DEBUG][o.e.b.SystemCallFilter   ] Linux seccomp filter installation successful, threads: [all]
06:29:51 |    [2018-02-14T06:29:34,811][DEBUG][o.e.b.JarHell            ] java.class.path:[...]

@DaveCTurner
Contributor

That last one was 15 seconds from task start to first log message:

22:37:55 Task ':x-pack-elasticsearch:qa:rolling-upgrade:without-system-key:v6.3.0-SNAPSHOT#upgradedClusterTestCluster#wait' is not up-to-date because:
22:37:55   Task has not declared any outputs.
22:38:25 Node 0 output:
22:38:25 |-----------------------------------------
22:38:25 |  failure marker exists: false
22:38:25 |  pid file exists: true
22:38:25 |  http ports file exists: true
22:38:25 |  transport ports file exists: true
22:38:25 |
22:38:25 |  [ant output]
22:38:25 |
22:38:25 |  [log]
22:38:25 |    warning: ignoring JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8
22:38:25 |    Java HotSpot(TM) 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
22:38:25 |    [2018-02-13T22:38:10,750][DEBUG][o.e.b.SystemCallFilter   ] Linux seccomp filter installation successful, threads: [all]

The extra logging in #28659 has helped to clarify that the bulk of the delay is occurring prior to the bootstrap phase. @jasontedor can you suggest what to do next?

@danielmitterdorfer
Member

The issue is only reproducible on our CI systems, so I think we need to find out what is characteristic about them. The first step is IMHO to get some system metrics (e.g. kernel activity). Knowing more about the machine's activity and state at the point of failure will hopefully help us reproduce this reliably.

Additional supporting data may be:

  • kernel version
  • JVM version
  • How long the build had been running when the failure occurred

Furthermore, we need to break down which parts of the startup take how long (see the sketch after this list):

  • JVM initialization (e.g. time from process start until the first log line in the GC log)
  • Elasticsearch initialization (time from first log line ([node-X] initializing ...) until [node-X] started).
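As a rough sketch of how the second of these could be computed from the node log (the file name node-0.log is illustrative; timestamps are in the format shown in the excerpts above, GNU date assumed):

    init_ts=$(grep -m1 'initializing \.\.\.' node-0.log | sed 's/^\[\([^]]*\)\].*/\1/')
    started_ts=$(grep -m1 '\] started$' node-0.log | sed 's/^\[\([^]]*\)\].*/\1/')
    # convert the comma-separated millisecond part to a dot so GNU date can parse it
    echo "initializing -> started: $(( $(date -d "${started_ts/,/.}" +%s) - $(date -d "${init_ts/,/.}" +%s) ))s"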

@jasontedor
Member

The extra logging in #28659 has helped to clarify that the bulk of the delay is occurring prior to the bootstrap phase.

That's an interesting data point. We do expect a delay here from -XX:+AlwaysPreTouch but we run these nodes with such small heaps (512m) that it seems unlikely to be the (sole) explanation.

@jasontedor can you suggest what to do next?

As you can see from the comment preceding mine, @danielmitterdorfer will take ownership of this one.

@danielmitterdorfer
Member

That's an interesting data point. We do expect a delay here from -XX:+AlwaysPreTouch but we run these nodes with such small heaps (512m) that it seems unlikely to be the (sole) explanation.

I agree here.

As a next step, we will repeatedly run one of the affected builds on a dedicated worker node in order to expose this issue as often as possible. We will also gather additional system data (like paging activity) and try to correlate system behaviour with the times at which builds fail.

We could also add more JVM logging to see what the JVM is actually doing during startup (I'd leverage JDK 9 unified logging for this).
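For illustration, with a JDK 9+ runtime the current GC logging flags could be replaced by unified logging along these lines (a sketch, not necessarily the exact flags we would end up with):

    -Xlog:gc*,safepoint:file=logs/gc.log:time,uptime,level,tags:filecount=32,filesize=64m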

@danielmitterdorfer
Member

Additional supporting data may be:

  • kernel version
  • JVM version
  • How long the build had been running when the failure occurred

An analysis of our build statistics has shown that all of these variables are irrelevant. Among other things, I also analysed the workers' uptime and the build duration. Uptime is irrelevant, but most of the builds failed between one and two hours into the build.

The only data point that stands out is that all affected builds ran on m4.xlarge worker instances (i.e. not bare metal).

@danielmitterdorfer
Member

danielmitterdorfer commented Feb 15, 2018

I did a more detailed investigation on one affected CI node, and it turns out that -XX:+AlwaysPreTouch seems to be the problem (or rather, it amplifies the underlying problem).

Test scenario

Start Elasticsearch 5.6.7 on an affected node, once with default settings (i.e. -XX:+AlwaysPreTouch) and once without pretouching (i.e. -XX:-AlwaysPreTouch). We measure the time from process start until the first log line:

AlwaysPreTouch    Time to First Log Message
enabled           80 seconds
disabled          3 seconds

Analysis

However, this is not the whole story. I also ran perf while starting Elasticsearch, and this is what shows up highest in perf report:

Overhead  Command        Shared Object       Symbol                                                                                                             
  67.45%  java           [kernel.kallsyms]   [k] isolate_freepages_block

This function is called when the kernel tries to free up a large enough contiguous block of memory (see also its source code), which leads me to the assumption that memory on our CI systems is fragmented due to the volume of builds and the kernel is compacting it. Also quoting Memory Compaction v8 (note: that source is almost 8 years old now, so the information may or may not be correct as of today):

Memory compaction can be triggered in one of three ways. It may be triggered
explicitly by writing any value to /proc/sys/vm/compact_memory and compacting
all of memory. It can be triggered on a per-node basis by writing any
value to /sys/devices/system/node/nodeN/compact where N is the node ID to
be compacted. When a process fails to allocate a high-order page, it may
compact memory in an attempt to satisfy the allocation instead of entering
direct reclaim.

Next steps

We will now record further data to actually back up that assumption.
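Roughly the kind of data collection meant here, as a sketch (file names, the sampling window and the start command are illustrative):

    # snapshot fragmentation and compaction counters before and after a node start
    cat /proc/buddyinfo /proc/pagetypeinfo > frag-before.txt
    grep -E '^compact|^pgscan|^pgsteal' /proc/vmstat > vmstat-before.txt
    perf record -a -g -o es-startup.data -- sleep 60 &    # sample system-wide while the node starts
    ./bin/elasticsearch -d -p es.pid
    wait
    cat /proc/buddyinfo /proc/pagetypeinfo > frag-after.txt
    grep -E '^compact|^pgscan|^pgsteal' /proc/vmstat > vmstat-after.txt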

@danielmitterdorfer
Member

Initial test with the same test setup as above on a machine where we have seen timeouts again:

AlwaysPreTouch    Time to First Log Message [s]    Time until started [s]
enabled           29                               37
disabled          3                                16

After explicitly compacting memory (echo 1 > /proc/sys/vm/compact_memory):

(I waited 60 seconds after compacting memory before running the next test.)

AlwaysPreTouch    Time to First Log Message [s]    Time until started [s]
enabled           29                               37
disabled          3                                16

The times were identical. Furthermore, there were no noticeable differences in /proc/pagetypeinfo or /proc/vmstat so I would consider the request for memory compaction ineffective.

After dropping the page cache and Slab objects (sync && echo 3 > /proc/sys/vm/drop_caches):

AlwaysPreTouch    Time to First Log Message [s]    Time until started [s]
enabled           3                                10
disabled          1                                8

This is a significant improvement so I suggest that we drop the page cache before each build as a first step.

@DaveCTurner
Contributor

Hurrah!

Just to check - you're proposing dropping the page cache at the start of the whole build and not at the start of each individual integration test? If so, this means that dirty pages will accumulate throughout the test run. This might well be fine: I'm just checking I understand.

@danielmitterdorfer
Member

Yes, your understanding of my proposal is correct. IMHO this is the most straightforward change that we can make as a first step. You are also right that the situation may get worse over the course of a single build. In that case we would probably need to modify kernel parameters to write back dirty pages more aggressively, but I'd rather stick to the stock configuration first and only tune if it turns out to be necessary.
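For completeness, the knobs meant here are the dirty-page writeback sysctls, e.g. (illustrative values, not a recommendation):

    sysctl -w vm.dirty_background_ratio=5     # start background writeback earlier (default 10)
    sysctl -w vm.dirty_ratio=10               # throttle writers at a lower dirty threshold (default 20)
    sysctl -w vm.dirty_expire_centisecs=1500  # age out dirty pages sooner (default 3000)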

@danielmitterdorfer
Member

danielmitterdorfer commented Feb 22, 2018

We implemented changes in CI yesterday to drop the page cache as well as request memory compaction (it turned out that dropping caches provided the most benefit, and requesting memory compaction afterwards improved the situation even more).
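For reference, the pre-build step boils down to something like this (a sketch; the actual CI script may differ, and it needs root):

    sync                                     # flush dirty pages first
    echo 3 > /proc/sys/vm/drop_caches        # drop page cache, dentries and inodes
    echo 1 > /proc/sys/vm/compact_memory     # then ask the kernel to compact memory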

Before this change we saw around 16 build failures per day caused by ES startup timeouts. In the last 24 hours there were only 2. So while this has improved the situation significantly, we are not quite there yet.

I'd still want to avoid fiddling with kernel parameters (it's not that I have no idea which knobs to turn, it's rather that I think this should really be our very last resort). I am currently analysing the memory requirements of our build and will try to reduce them (e.g. by reducing compiler memory, test memory, etc.). For example, I already found out that we (IMHO unnecessarily) store geoip data on-heap, thus increasing our heap requirements in the build as well (see #28782 for details).

@danielmitterdorfer
Member

Two weeks ago we had a problem in our CI infrastructure, so the script that dropped the page cache and requested memory compaction was not called (it was called initially, as I noted in my previous comment, but then we added another check which made it completely ineffective). After this was fixed, we did not see a single failure due to these timeouts within two weeks. Hence, closing.

@pauldraper

pauldraper commented Mar 19, 2022

Removing unnecessary modules (like the x-pack ones) can reduce startup time by a few seconds.

(I still can't get it below 8s, which is roughly one bajillion CPU cycles, which is really disappointing.)
