Investigate startup times #28650
Here is an excerpt from the output, including timestamps, of a test run that seems to have failed because it took more than 30 seconds to bring a node up:
In particular, more than half of the 30-second timeout elapsed before the timestamp on the first log line.
To look at the variation in these durations, here is the distribution broken down by OS:
I just ran some xpack rolling upgrade tests with a profiler to see where time is spent. About 70% of the ~51 seconds of CPU time spent on the Elasticsearch node went to the JVM itself, mostly running compilation (~29 seconds).
Compilation executes concurrently with application execution. While this is an indication that CPU time is being spent on compilation, it is not necessarily indicative of where the real time during startup is going.
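As a sketch of how one could check whether JIT compilation overlaps with the slow startup window (the flags are standard HotSpot options, but the exact invocation and file names here are assumptions, not what was actually run):

```bash
# -XX:+PrintCompilation prints each JIT compilation with a timestamp relative to JVM start,
# which can be correlated with the timestamp of the first Elasticsearch log line.
ES_JAVA_OPTS="-XX:+PrintCompilation" ./bin/elasticsearch > compilation.log

# A cheap counter-experiment: limit tiered compilation to C1. If startup wall time
# does not change, JIT compilation is unlikely to be the real bottleneck.
ES_JAVA_OPTS="-XX:TieredStopAtLevel=1" ./bin/elasticsearch
```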
#28659 has added more logging to the node startup, but that 16-second startup time is still there, and it occurs prior to the first log message:
// snip
// snip
Similar behaviour to what @tvernum reported: 13 seconds of startup time before the first log message.
That last one took 15 seconds from task start to the first log message:
The extra logging in #28659 has helped to clarify that the bulk of the delay is occurring prior to the bootstrap phase. @jasontedor can you suggest what to do next?
The issue is only reproducible on our CI systems so I think we need to find out what's characteristic about them. So the first step is IMHO to get some system metrics (e.g. kernel activity). Knowing more about the machine activity / state at the point of the failure will hopefully help us to reproduce this reliably. Additional supporting data may be:
Furthermore, we need to break down which parts of the startup take how long:
That's an interesting data point. We do expect a delay here from
As you can see from the comment preceding mine, @danielmitterdorfer will take ownership of this one.
I agree here. As a next step, we will repeatedly run one of the affected builds on a dedicated worker node in order to expose this issue as often as possible. We will also gather additional system data (like paging activity) and try to correlate system behaviour with the times at which builds fail. We could also add more JVM logging to see what the JVM is actually doing during startup (I'd leverage JDK 9 unified logging for this).
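As a sketch of the kind of data collection described above (the file names and the exact selection of log tags are assumptions, not the configuration that was actually deployed):

```bash
# Record paging and memory activity on the CI worker while builds run
# (sar -B reports paging statistics once per second).
vmstat 1 >> /var/log/ci-vmstat.log &
sar -B 1 >> /var/log/ci-paging.log &

# JDK 9+ unified logging during node startup: GC, class loading and safepoints,
# each entry decorated with a timestamp and JVM uptime so it can be lined up
# with the Elasticsearch log.
ES_JAVA_OPTS="-Xlog:gc*,class+load=info,safepoint:file=jvm-startup.log:time,uptime" \
  ./bin/elasticsearch
```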
An analysis of our build statistics has shown that all these variables are irrelevant. Among others, I also analysed the worker's uptime and the build duration. Uptime is irrelevant, but most of the builds failed between one and two hours into the build. The only data point that stands out is that all builds have been run on
I did a more detailed investigation on one affected CI node.

Test scenario

Start Elasticsearch 5.6.7 on an affected node, once with default settings (i.e.
Analysis

However, this is not the whole story. I also ran
This function is called when the kernel tries to free up a large enough contiguous block of memory (see also its source code), which leads me to the assumption that memory in our CI systems is fragmented due to the volume of builds and the kernel is compacting memory. Also quoting Memory Compaction v8 (note: that source is almost 8 years old now, so the information may or may not be correct as of today):
Next steps

We will now record further data to actually back up that assumption.
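As an illustration of the kind of data that could back up the fragmentation assumption (a sketch only; the exact commands and intervals used on the CI workers are assumptions):

```bash
# /proc/buddyinfo shows how many free blocks of each order exist per memory zone;
# few or no high-order blocks is a sign of external memory fragmentation.
cat /proc/buddyinfo

# Snapshot fragmentation- and reclaim-related counters periodically during a build.
while sleep 60; do
  date >> fragmentation.log
  cat /proc/buddyinfo >> fragmentation.log
  grep -E 'compact|pgsteal|pgscan' /proc/vmstat >> fragmentation.log
done
```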
Initial test with the same test setup as above on a machine where we have seen timeouts again:
After explicitly compacting memory, the times were identical, and there were no noticeable differences in the other measurements (I waited for 60 seconds before running the next test after compacting memory).

After dropping the page cache and Slab objects:
This is a significant improvement, so I suggest that we drop the page cache before each build as a first step.
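A minimal sketch of what such a pre-build step could look like (the script itself and where it is hooked into CI are assumptions; the /proc interfaces are standard Linux):

```bash
#!/usr/bin/env bash
# Run once before each build on the CI worker (requires root).

# Flush dirty pages first so that dropping the caches releases as much memory as possible.
sync

# 3 = free the page cache plus reclaimable Slab objects (dentries and inodes).
echo 3 > /proc/sys/vm/drop_caches

# Optionally ask the kernel to compact memory so that large contiguous blocks
# are available when the JVM starts.
echo 1 > /proc/sys/vm/compact_memory
```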
Hurrah! Just to check - you're proposing dropping the page cache at the start of the whole build and not at the start of each individual integration test? If so, this means that dirty pages will accumulate throughout the test run. This might well be fine: I'm just checking I understand.
Yes, your understanding of my proposal is correct. IMHO this is the most straightforward change that we can make as a first step. You are also right that the situation may get worse over the course of a single build. In that case we would probably need to modify kernel parameters to write back dirty pages more aggressively, but I'd rather stick to the stock configuration first and only tune if it turns out to be necessary.
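For reference, these are the kind of knobs that would be involved if that last resort ever became necessary (example values only, not settings that were actually applied):

```bash
# Start background writeback earlier and cap how much dirty memory a writer
# can accumulate before it is forced into synchronous writeback.
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=20

# Consider dirty pages old enough for writeback after 15 seconds (value is in centiseconds).
sysctl -w vm.dirty_expire_centisecs=1500
```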
We implemented changes in CI yesterday to drop the page cache as well as request memory compaction (it turned out that dropping caches provided the most benefit, and requesting memory compaction afterwards improved the situation even more). Before this change we saw around 16 build failures per day caused by ES startup timeouts; in the last 24 hours there were only 2. So while this has improved the situation significantly, we are not quite there yet. I'd still want to avoid fiddling with kernel parameters (it's not that I have no idea which knobs to turn, it's rather that I think this should really be our very last resort). I am currently analysing the memory requirements of our build and will try to reduce them (e.g. by reducing compiler memory, test memory, etc.). For example, I already found out that we (IMHO unnecessarily) store geoip data on-heap, thus increasing our heap requirements in the build as well (see #28782 for details).
Two weeks ago we had a problem in our CI infrastructure, so the script that dropped the page cache and requested memory compaction was not called (it was called initially, as I noted in my previous comment, but then we added another check which made it completely ineffective). After this was fixed, we did not see a single failure due to these timeouts within two weeks. Hence, closing.
Removing unnecessary modules (like the x-pack ones) can reduce startup time by a few seconds. (I still can't get it below 8s, which is roughly one bajillion CPU cycles, which is really disappointing.)
Integration tests frequently take more than 20 seconds to start up an Elasticsearch node on an empty or small cluster state, which is a lot of time for a computer. Take the console output from any build and search for

#wait (Thread[Task worker for ':',5,main]) completed

The time after "Took" on the same line is the time that the build had to wait for the node to be available. This is an immediate problem for testing, but it might also be a problem for users if it boils down to an issue that could make things even worse in some adversarial scenarios.

Relates #28640
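For example, a quick way to pull these wait times out of a build log (the log file name is an assumption):

```bash
# Print every line reporting how long the build waited for a node to come up;
# the duration is the value after "Took" on the matching line.
grep "#wait (Thread" console.log

# Keep only the durations themselves for a rough overview.
grep "#wait (Thread" console.log | grep -o "Took .*"
```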