ES 6.8.1 (also 7.2.1) fails to start (probably due to cgroups) #45396
Comments
Version 7.2.1 fails in the same way (I cloned the container and upgraded):
Pinging @elastic/es-core-infra
Hacky patch to work around this problem (as yet untested, because I need to build the package from source first):
From 674f9cbcc8d98e1371a664344d1c439d2756fb2d Mon Sep 17 00:00:00 2001
From: Kai Krakow <kk@netactive.de>
Date: Mon, 12 Aug 2019 12:58:37 +0200
Subject: [PATCH] HACK: Catch NullPointerException during cgroup detection
Do not crash if for whatever reason the pattern matching didn't match
the cgroups and resolves to a NULL pointer path component. In this case,
simply pretend we do not have cgroups.
In my case, it crashes because `controlGroup` is passed a NULL pointer
here:
    String readSysFsCgroupCpuAcctCpuAcctUsage(final String controlGroup)
            throws IOException {
        return readSingleLine(PathUtils.get("/sys/fs/cgroup/cpuacct",
                controlGroup, "cpuacct.usage"));
    }
This probably results from `OsProbe.getControlGroups()` returning a
`controllerMap` with NULL pointers in some cases.
Signed-off-by: Kai Krakow <kk@netactive.de>
---
server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java | 3 +++
1 file changed, 3 insertions(+)
diff --git a/server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java b/server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java
index 18173dd275a..44c639546f1 100644
--- a/server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java
+++ b/server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java
@@ -507,6 +507,9 @@ public class OsProbe {
             } catch (final IOException e) {
                 logger.debug("error reading control group stats", e);
                 return null;
+            } catch (final NullPointerException e) {
+                logger.debug("error resolving control groups", e);
+                return null;
             }
         }
--
2.21.0
Confirming that my own hacky patch works: I built a patched 6.8.1 using Gradle, then replaced the JAR file containing OsProbe.class in my container. ES 6.8.1 now starts successfully. If the patch is fine for you, feel free to apply it (just remove the "HACK:" prefix), or ask for a PR. I can port it to other branches.
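To illustrate the failure mode described in the patch message, here is a minimal sketch of how a null path component produces a `NullPointerException`. It uses the JDK's `java.nio.file.Paths.get` as a stand-in, on the assumption that Elasticsearch's `PathUtils.get` behaves the same way for null components; the class name, the helper names, and the sample group `/docker/abc` are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class NullPathComponentDemo {

    // Hypothetical stand-in for the reader method quoted in the patch message.
    // Paths.get is used here instead of Elasticsearch's PathUtils.get (assumption).
    static Path cpuAcctUsagePath(final String controlGroup) {
        return Paths.get("/sys/fs/cgroup/cpuacct", controlGroup, "cpuacct.usage");
    }

    // Returns true if building the path throws a NullPointerException.
    static boolean throwsNpe(final String controlGroup) {
        try {
            cpuAcctUsagePath(controlGroup);
            return false;
        } catch (final NullPointerException e) {
            return true;
        }
    }

    public static void main(final String[] args) {
        System.out.println("valid group -> NPE? " + throwsNpe("/docker/abc"));
        System.out.println("null group  -> NPE? " + throwsNpe(null));
    }
}
```

The exception is raised while the path is being assembled, before any file is opened, which is why a `catch (IOException e)` alone does not cover this case.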
`OsProbe` fetches cgroup data from the filesystem, and has asserts that check for missing values. This PR changes most of these asserts into runtime checks, since at least one user has reported an NPE where a piece of cgroup data was missing. Also update the testing guidelines to reflect current Gradle usage. Closes elastic#45396.
* Always check that cgroup data is present. `OsProbe` fetches cgroup data from the filesystem, and has asserts that check for missing values. This PR changes most of these asserts into runtime checks, since at least one user has reported an NPE where a piece of cgroup data was missing. Also update the testing guidelines to reflect current Gradle usage. Closes #45396.
* Shorten long line, add JavaDoc. A helper method in OsProbeTests breached the maximum line length. Break up the line, and add some JavaDoc to the method.
* Revert testing doc changes for separate submission.
* Use 'cgroup' abbreviation in log messages. When checking for particular items of cgroup data and finding them missing, use the common abbreviation 'cgroup' in the debug log messages.
@kakra thanks for raising this issue. We've fixed this in a slightly different way, by explicitly checking for missing cgroup data.
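For readers wondering why a runtime check differs from an assert here: Java `assert` statements only run when the JVM is started with assertions enabled (`-ea`), so in a default production JVM a missing value passes straight through. The following is a rough sketch of the explicit-check approach, not the code from the merged PR; the class, the method names, and the map layout are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class CgroupPresenceCheck {

    // Hypothetical helper: returns the cgroup path for a controller, or null
    // when the controller is absent from the map or has no usable path.
    static String controlGroupOrNull(final Map<String, String> controllerMap, final String controller) {
        final String group = controllerMap.get(controller);
        if (group == null || group.isEmpty()) {
            // An explicit check like this runs whether or not the JVM was started with -ea.
            return null;
        }
        return group;
    }

    // Hypothetical caller: reports whether cpuacct stats can be collected at all.
    static boolean canReadCpuAcct(final Map<String, String> controllerMap) {
        return controlGroupOrNull(controllerMap, "cpuacct") != null;
    }

    public static void main(final String[] args) {
        final Map<String, String> withCpuAcct = new HashMap<>();
        withCpuAcct.put("cpuacct", "/docker/abc");
        final Map<String, String> withoutCpuAcct = new HashMap<>();
        withoutCpuAcct.put("memory", "/docker/abc");

        System.out.println("cpuacct present: " + canReadCpuAcct(withCpuAcct));
        System.out.println("cpuacct missing: " + canReadCpuAcct(withoutCpuAcct));
    }
}
```

A caller can then skip cgroup stats when the helper returns null, rather than passing the null on to a path builder.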
@pugnascotia You're welcome. Is the 5.x branch still maintained, and will this be backported? I fear that when I upgrade kernels for my apps that still depend on 5.x, it will crash due to the same issue.
5.x is no longer maintained, I'm afraid. However, you could look into why |
I researched this a little, and it's probably a combination of factors: the kernel version, systemd using the unified cgroup hierarchy, the ES machines running in an nspawn container (an installation mirror of the production machines, so that normal system updates to my dev machine won't tamper with the dependencies), and my desktop/development kernels using the MuQSS scheduler (CK patchset). MuQSS doesn't support hierarchical CPU bandwidth allocation and does no CPU accounting at all; it's flat, and I'm pretty sure that's what causes the problem.
I'll look into the 5.x case myself then. Since those containers are mostly static (they only offer ES as a service visible to the host), I can simply backport the patch, rebuild the class file, and replace it in the installation. It's a bit dirty, but it should be good enough.
BTW: if it is the CK patchset, the server deployments would be safe at least, since they run the standard Linux scheduler. Still, it's good to know that the potential crash is caught now.
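As a rough illustration of how a missing controller can turn into a null lookup, the sketch below parses lines in the `/proc/self/cgroup` format (`hierarchy-ID:controller-list:cgroup-path`). It is not Elasticsearch's actual parsing code, and the sample lines are made up; whether the reporter's container produced exactly this kind of output is an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ProcCgroupParseSketch {

    // Maps each controller name to its cgroup path, given lines of the form
    // "hierarchy-ID:controller-list:cgroup-path". Malformed lines are skipped.
    static Map<String, String> parse(final List<String> lines) {
        final Map<String, String> controllers = new HashMap<>();
        for (final String line : lines) {
            final String[] fields = line.split(":", 3);
            if (fields.length != 3) {
                continue;
            }
            for (final String controller : fields[1].split(",")) {
                // A unified-hierarchy line ("0::/path") has an empty controller list.
                if (!controller.isEmpty()) {
                    controllers.put(controller, fields[2]);
                }
            }
        }
        return controllers;
    }

    public static void main(final String[] args) {
        // Made-up sample input: a cgroup v1 style listing and a unified-only listing.
        final List<String> v1 = Arrays.asList("4:cpu,cpuacct:/docker/abc", "3:memory:/docker/abc");
        final List<String> unified = Arrays.asList("0::/machine.slice/es6");

        System.out.println("v1 cpuacct      = " + parse(v1).get("cpuacct"));
        System.out.println("unified cpuacct = " + parse(unified).get("cpuacct"));
    }
}
```

With the unified-only input, no line names a `cpuacct` controller, so the map lookup yields null; that value needs to be checked before it is used as a path component.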
Elasticsearch version (`bin/elasticsearch --version`): 6.8.1 (Gentoo ebuild)
running in a systemd nspawn container
Plugins installed: []
JVM version (`java -version`):
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (IcedTea 3.12.0) (Gentoo icedtea-3.12.0)
OpenJDK 64-Bit Server VM (build 25.212-b04, mixed mode)
OS version (`uname -a` if on a Unix-like system):
Linux es6 5.2.7-gentoo #1 SMP Fri Aug 9 11:40:38 CEST 2019 x86_64 Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz GenuineIntel GNU/Linux
Description of the problem including expected versus actual behavior:
Since upgrading ES in my development container, it no longer starts. It crashes with a stack trace pointing to problems determining cgroup controllers. Researching other reports showed that the problem should be fixed, but it still shows for me. Logs below.
Last known good version:
Sat Sep 15 02:21:22 2018 >>> app-misc/elasticsearch-6.4.0
Steps to reproduce:
Provide logs (if relevant):