ES 6.8.1 (also 7.2.1) fails to start (probably due to cgroups) #45396

Closed
kakra opened this issue Aug 9, 2019 · 8 comments · Fixed by #45606

@kakra

kakra commented Aug 9, 2019

Elasticsearch version (bin/elasticsearch --version):
6.8.1 (Gentoo ebuild)
running in a systemd nspawn container

Plugins installed: []

JVM version (java -version):
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (IcedTea 3.12.0) (Gentoo icedtea-3.12.0)
OpenJDK 64-Bit Server VM (build 25.212-b04, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux es6 5.2.7-gentoo #1 SMP Fri Aug 9 11:40:38 CEST 2019 x86_64 Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz GenuineIntel GNU/Linux

Description of the problem including expected versus actual behavior:
Since upgrading ES in my development container, it no longer starts. It crashes with a stack trace pointing to problems determining cgroup controllers. Other reports suggest that the problem should already be fixed, but it still occurs for me. Logs below.

Last known good version:
Sat Sep 15 02:21:22 2018 >>> app-misc/elasticsearch-6.4.0

Steps to reproduce:

es6 ~ # ls -l /sys/fs/cgroup
total 0
drwxr-xr-x 2 nobody nobody 0  9. Aug 17:42 blkio
drwxr-xr-x 2 nobody nobody 0  9. Aug 17:42 cpu
lrwxrwxrwx 1 root   root   3  9. Aug 17:42 cpuacct -> cpu
drwxr-xr-x 2 nobody nobody 0  9. Aug 17:42 devices
drwxr-xr-x 2 nobody nobody 0  9. Aug 17:42 memory
dr-xr-xr-x 2 nobody nobody 0  9. Aug 11:57 net_cls
lrwxrwxrwx 1 root   root   7  9. Aug 17:42 net_prio -> net_cls
drwxr-xr-x 2 nobody nobody 0  9. Aug 17:42 pids
drwxr-xr-x 5 root   root   0  9. Aug 17:42 systemd
drwxr-xr-x 5 root   root   0  9. Aug 17:42 unified

es6 ~ # cat /proc/cgroups
#subsys_name    hierarchy       num_cgroups     enabled
cpu     2       88      1
blkio   4       108     1
memory  5       128     1
devices 3       86      1
net_cls 6       1       1
pids    7       92      1

es6 ~ # cat /proc/self/cgroup
7:pids:/
6:net_cls:/
5:memory:/
4:blkio:/
3:devices:/
2:cpu:/
1:name=systemd:/user.slice/user-0.slice/session-c1.scope
0::/user.slice/user-0.slice/session-c1.scope

es6 ~ # mount -t cgroup
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpu type cgroup (rw,nosuid,nodev,noexec,relatime,cpu)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)

es6 ~ # grep cgroup /proc/self/mountinfo
1067 1141 0:106 / /sys/fs/cgroup ro,nosuid,nodev,noexec - tmpfs tmpfs ro,mode=755,uid=65536,gid=65536
1068 1067 0:29 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory
1069 1067 0:30 / /sys/fs/cgroup/net_cls rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls
1070 1067 0:27 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices
1071 1067 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio
1073 1067 0:26 / /sys/fs/cgroup/cpu rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpu
1074 1067 0:31 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,pids
1075 1067 0:21 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup rw,nsdelegate
1076 1067 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,xattr,name=systemd

# FULL LOG AT THE END
es6 ~ # systemctl start elasticsearch.service
es6 ~ # systemctl status elasticsearch.service
● elasticsearch.service - Elasticsearch
   Loaded: loaded (/lib/systemd/system/elasticsearch.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/elasticsearch.service.d
           └─00gentoo.conf, override.conf
   Active: failed (Result: exit-code) since Fri 2019-08-09 18:18:28 CEST; 640ms ago
     Docs: https://www.elastic.co
  Process: 437 ExecStartPre=/usr/share/elasticsearch/bin/elasticsearch-systemd-pre-exec (code=exited, status=0/SUCCESS)
  Process: 438 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/elasticsearch.pid -Epath.logs=${LOG_DIR} -Epath.data=${DATA_DIR} (code=exited, status=1/FAILURE)
 Main PID: 438 (code=exited, status=1/FAILURE)

Aug 09 18:18:28 es6 elasticsearch[438]:         at org.elasticsearch.monitor.MonitorService.<init>(MonitorService.java:46) ~[elasticsearch-6.8.1.jar:6.8.1]
Aug 09 18:18:28 es6 elasticsearch[438]:         at org.elasticsearch.node.Node.<init>(Node.java:399) ~[elasticsearch-6.8.1.jar:6.8.1]
Aug 09 18:18:28 es6 elasticsearch[438]:         at org.elasticsearch.node.Node.<init>(Node.java:266) ~[elasticsearch-6.8.1.jar:6.8.1]
Aug 09 18:18:28 es6 elasticsearch[438]:         at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:212) ~[elasticsearch-6.8.1.jar:6.8.1]
Aug 09 18:18:28 es6 elasticsearch[438]:         at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:212) ~[elasticsearch-6.8.1.jar:6.8.1]
Aug 09 18:18:28 es6 elasticsearch[438]:         at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:333) ~[elasticsearch-6.8.1.jar:6.8.1]
Aug 09 18:18:28 es6 elasticsearch[438]:         at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-6.8.1.jar:6.8.1]
Aug 09 18:18:28 es6 elasticsearch[438]:         ... 6 more
Aug 09 18:18:28 es6 systemd[1]: elasticsearch.service: Main process exited, code=exited, status=1/FAILURE
Aug 09 18:18:28 es6 systemd[1]: elasticsearch.service: Failed with result 'exit-code'.

Provide logs (if relevant):

java.lang.NullPointerException: null
        at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:273) ~[?:?]
        at org.elasticsearch.common.io.PathUtils.get(PathUtils.java:60) ~[elasticsearch-core-6.8.1.jar:6.8.1]
        at org.elasticsearch.monitor.os.OsProbe.readSysFsCgroupCpuAcctCpuAcctUsage(OsProbe.java:277) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.monitor.os.OsProbe.getCgroupCpuAcctUsageNanos(OsProbe.java:264) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.monitor.os.OsProbe.getCgroup(OsProbe.java:483) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.monitor.os.OsProbe.osStats(OsProbe.java:603) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.monitor.os.OsService.<init>(OsService.java:49) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.monitor.MonitorService.<init>(MonitorService.java:46) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.node.Node.<init>(Node.java:399) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.node.Node.<init>(Node.java:266) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:212) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:212) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:333) [elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) [elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) [elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) [elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) [elasticsearch-cli-6.8.1.jar:6.8.1]
        at org.elasticsearch.cli.Command.main(Command.java:90) [elasticsearch-cli-6.8.1.jar:6.8.1]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:116) [elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:93) [elasticsearch-6.8.1.jar:6.8.1]
[2019-08-09T17:46:33,105][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [QoY4z88] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: java.lang.NullPointerException
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) ~[elasticsearch-cli-6.8.1.jar:6.8.1]
        at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-6.8.1.jar:6.8.1]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:116) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:93) ~[elasticsearch-6.8.1.jar:6.8.1]
Caused by: java.lang.NullPointerException
        at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:273) ~[?:?]
        at org.elasticsearch.common.io.PathUtils.get(PathUtils.java:60) ~[elasticsearch-core-6.8.1.jar:6.8.1]
        at org.elasticsearch.monitor.os.OsProbe.readSysFsCgroupCpuAcctCpuAcctUsage(OsProbe.java:277) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.monitor.os.OsProbe.getCgroupCpuAcctUsageNanos(OsProbe.java:264) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.monitor.os.OsProbe.getCgroup(OsProbe.java:483) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.monitor.os.OsProbe.osStats(OsProbe.java:603) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.monitor.os.OsService.<init>(OsService.java:49) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.monitor.MonitorService.<init>(MonitorService.java:46) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.node.Node.<init>(Node.java:399) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.node.Node.<init>(Node.java:266) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:212) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:212) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:333) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-6.8.1.jar:6.8.1]
        ... 6 more
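
The top two frames of the trace (`PathUtils.get` calling `UnixFileSystem.getPath`) are consistent with a null path segment being passed to the JDK. The following is a minimal sketch of that failure mode, not Elasticsearch code: it uses the JDK's `Paths.get` as a stand-in for `PathUtils.get`, the class and method names are hypothetical, and the path segments are taken from the method quoted later in this thread.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class; not part of Elasticsearch.
final class NullPathSegmentDemo {

    /** Builds a cpuacct usage path from a control group name (assumed layout). */
    static Path cpuAcctUsagePath(String controlGroup) {
        // Paths.get(first, more...) dereferences every element of "more",
        // so a null controlGroup fails before any file is touched.
        return Paths.get("/sys/fs/cgroup/cpuacct", controlGroup, "cpuacct.usage");
    }

    /** Returns true if building the path throws a NullPointerException. */
    static boolean throwsNpe(String controlGroup) {
        try {
            cpuAcctUsagePath(controlGroup);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("root cgroup builds a path: " + !throwsNpe("/"));
        System.out.println("null cgroup throws NPE:    " + throwsNpe(null));
    }
}
```

The point of the sketch is that the exception is raised while the path is being assembled, which would explain why the existing `IOException` handling around the file read never gets a chance to run.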
@kakra
Author

kakra commented Aug 9, 2019

Version 7.2.1 fails in the same way (I cloned the container and upgraded):

Aug 09 18:30:25 es7 elasticsearch[705]: [2019-08-09T18:30:25,287][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [es7] uncaught exception in thread [main]
Aug 09 18:30:25 es7 elasticsearch[705]: org.elasticsearch.bootstrap.StartupException: java.lang.NullPointerException
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) ~[elasticsearch-cli-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:115) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]: Caused by: java.lang.NullPointerException
Aug 09 18:30:25 es7 elasticsearch[705]:         at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:273) ~[?:?]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.common.io.PathUtils.get(PathUtils.java:60) ~[elasticsearch-core-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.monitor.os.OsProbe.readSysFsCgroupCpuAcctCpuAcctUsage(OsProbe.java:309) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.monitor.os.OsProbe.getCgroupCpuAcctUsageNanos(OsProbe.java:296) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.monitor.os.OsProbe.getCgroup(OsProbe.java:515) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.monitor.os.OsProbe.osStats(OsProbe.java:635) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.monitor.os.OsService.<init>(OsService.java:49) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.monitor.MonitorService.<init>(MonitorService.java:46) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.node.Node.<init>(Node.java:366) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.node.Node.<init>(Node.java:251) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.2.1.jar:7.2.1]
Aug 09 18:30:25 es7 elasticsearch[705]:         ... 6 more

@kakra kakra changed the title ES 6.8.1 fails to start (probably due to cgroups) ES 6.8.1 (also 7.2.1) fails to start (probably due to cgroups) Aug 9, 2019
@jtibshirani jtibshirani added the :Core/Infra/Core Core issues without another label label Aug 9, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@kakra
Author

kakra commented Aug 12, 2019

Hacky patch to work around this problem (as yet untested, because I first need to build the package from source):

From 674f9cbcc8d98e1371a664344d1c439d2756fb2d Mon Sep 17 00:00:00 2001
From: Kai Krakow <kk@netactive.de>
Date: Mon, 12 Aug 2019 12:58:37 +0200
Subject: [PATCH] HACK: Catch NullPointerException during cgroup detection

Do not crash if for whatever reason the pattern matching didn't match
the cgroups and resolves to a NULL pointer path component. In this case,
simply pretend we do not have cgroups.

In my case, it crashes because `controlGroup` is passed a NULL pointer
here:

    String readSysFsCgroupCpuAcctCpuAcctUsage(final String controlGroup)
            throws IOException {
        return readSingleLine(PathUtils.get("/sys/fs/cgroup/cpuacct",
                controlGroup, "cpuacct.usage"));
    }


This probably results from `OsProbe.getControlGroups()` returning a
`controllerMap` with NULL pointers in some cases.

Signed-off-by: Kai Krakow <kk@netactive.de>
---
 server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java b/server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java
index 18173dd275a..44c639546f1 100644
--- a/server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java
+++ b/server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java
@@ -507,6 +507,9 @@ public class OsProbe {
         } catch (final IOException e) {
             logger.debug("error reading control group stats", e);
             return null;
+        } catch (final NullPointerException e) {
+            logger.debug("error resolving control groups", e);
+            return null;
         }
     }

--
2.21.0
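
The patch description guesses that the controller lookup yields a null for `cpuacct`. A rough sketch of how that can happen with the `/proc/self/cgroup` content posted in the report is shown below. The parser is a simplified stand-in written for illustration, not the logic of `OsProbe.getControlGroups()`, and the class, method, and constant names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the parsing code used by Elasticsearch.
final class CgroupLookupSketch {

    /** The /proc/self/cgroup lines posted in the report. */
    static final List<String> REPORTED = Arrays.asList(
        "7:pids:/",
        "6:net_cls:/",
        "5:memory:/",
        "4:blkio:/",
        "3:devices:/",
        "2:cpu:/",
        "1:name=systemd:/user.slice/user-0.slice/session-c1.scope",
        "0::/user.slice/user-0.slice/session-c1.scope");

    /** Maps each controller name to its cgroup path. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> controllers = new LinkedHashMap<>();
        for (String line : lines) {
            // Line format: hierarchy-ID:controller-list:cgroup-path
            String[] fields = line.split(":", 3);
            if (fields.length != 3 || fields[1].isEmpty()) {
                continue; // skip malformed lines and the cgroup v2 entry ("0::/...")
            }
            for (String controller : fields[1].split(",")) {
                controllers.put(controller, fields[2]);
            }
        }
        return controllers;
    }

    public static void main(String[] args) {
        Map<String, String> controllers = parse(REPORTED);
        System.out.println("cpu     -> " + controllers.get("cpu"));
        // No line mentions cpuacct, so the lookup yields null.
        System.out.println("cpuacct -> " + controllers.get("cpuacct"));
    }
}
```

On a typical cgroup v1 system, `cpu` and `cpuacct` share one line (for example `cpu,cpuacct`), which the inner `split(",")` would handle. In the reported container only `cpu` is listed, so any caller that assumes a `cpuacct` entry exists ends up holding a null.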

@kakra
Author

kakra commented Aug 12, 2019

Confirming my own hacky patch works: I've built patched 6.8.1 using gradle, then replaced the JAR file containing OsProbe.class in my container. ES 6.8.1 now starts successfully.

If the patch is fine for you, feel free to apply (just remove the "HACK:" prefix), or ask for a PR. I can port it over to different branches.

@pugnascotia pugnascotia self-assigned this Aug 14, 2019
pugnascotia added a commit to pugnascotia/elasticsearch that referenced this issue Aug 15, 2019
`OsProbe` fetches cgroup data from the filesystem, and has asserts that
check for missing values. This PR changes most of these asserts into
runtime checks, since at least one user has reported an NPE where
a piece of cgroup data was missing.

Also update the testing guidelines to reflect current Gradle usage.

Closes elastic#45396.
pugnascotia added a commit that referenced this issue Aug 16, 2019
* Always check that cgroup data is present

`OsProbe` fetches cgroup data from the filesystem, and has asserts that
check for missing values. This PR changes most of these asserts into
runtime checks, since at least one user has reported an NPE where
a piece of cgroup data was missing.

Also update the testing guidelines to reflect current Gradle usage.

Closes #45396.

* Shorten long line, add JavaDoc.

A helper method in OsProbeTests breached the maximum line length. Break
up the line, and add some JavaDoc to the method.

* Revert testing doc changes for separate submission

* Use 'cgroup' abbreviation in log messages

When checking for particular items of cgroup data and finding them
missing, use the common abbreviation 'cgroup' in the debug log messages.
pugnascotia added a commit to pugnascotia/elasticsearch that referenced this issue Aug 16, 2019
@pugnascotia
Contributor

@kakra thanks for raising this issue. We've fixed this in a slightly different way, by explicitly checking for missing cgroup data.
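
As a rough illustration of the difference between an `assert` and an explicit runtime check, here is a minimal sketch. It is not the code from the linked PR; the class and method names and the use of `Optional` are assumptions made for the example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a runtime check; not the actual fix.
final class CgroupGuardSketch {

    /** Returns the cpuacct usage path, or empty if no cpuacct control group is known. */
    static Optional<Path> cpuAcctUsagePath(Map<String, String> controllers) {
        String controlGroup = controllers.get("cpuacct");
        // An "assert controlGroup != null" would only fire when the JVM runs
        // with -ea; assertions are disabled by default, so the null would
        // otherwise flow on into the path construction.
        if (controlGroup == null) {
            return Optional.empty();
        }
        return Optional.of(Paths.get("/sys/fs/cgroup/cpuacct", controlGroup, "cpuacct.usage"));
    }

    public static void main(String[] args) {
        Map<String, String> controllers = new HashMap<>();
        System.out.println("without cpuacct: " + cpuAcctUsagePath(controllers).isPresent());
        controllers.put("cpuacct", "/");
        System.out.println("with cpuacct:    " + cpuAcctUsagePath(controllers).isPresent());
    }
}
```

Compared with catching `NullPointerException`, a check like this reports the missing data at the point where it is detected and does not mask unrelated null dereferences elsewhere in the call chain.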

@kakra
Author

kakra commented Aug 16, 2019

@pugnascotia You're welcome.

Is the 5.x branch still maintained, and will this be backported? I fear that when I upgrade kernels for my apps that still depend on 5.x, it will crash due to the same issue.

@pugnascotia
Contributor

5.x is no longer maintained, I'm afraid. However, you could look into why cpuacct doesn't appear in /proc/self/cgroup on your system. I booted an Ubuntu VM with a 5.0 kernel (so not as new as yours, but at least the same major version), and I still saw a line with cpuacct in /proc/self/cgroup.
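
One way to start that investigation is to check whether the kernel offers the controller at all. The sketch below parses `/proc/cgroups`-style content and lists the enabled controllers; it falls back to the output posted at the top of this issue when the file is not readable. The class, method, and constant names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical diagnostic helper; not part of Elasticsearch.
final class KernelControllersSketch {

    /** The /proc/cgroups content posted in the report. */
    static final List<String> REPORTED = Arrays.asList(
        "#subsys_name\thierarchy\tnum_cgroups\tenabled",
        "cpu\t2\t88\t1",
        "blkio\t4\t108\t1",
        "memory\t5\t128\t1",
        "devices\t3\t86\t1",
        "net_cls\t6\t1\t1",
        "pids\t7\t92\t1");

    /** Returns the controllers whose "enabled" column is 1. */
    static Set<String> enabledControllers(List<String> lines) {
        Set<String> enabled = new LinkedHashSet<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and the header
            }
            // Columns: subsys_name, hierarchy, num_cgroups, enabled
            String[] cols = trimmed.split("\\s+");
            if (cols.length == 4 && "1".equals(cols[3])) {
                enabled.add(cols[0]);
            }
        }
        return enabled;
    }

    public static void main(String[] args) throws IOException {
        Path procCgroups = Paths.get("/proc/cgroups");
        List<String> lines = Files.isReadable(procCgroups)
            ? Files.readAllLines(procCgroups)
            : REPORTED;
        Set<String> enabled = enabledControllers(lines);
        System.out.println("enabled controllers: " + enabled);
        System.out.println("cpuacct available:   " + enabled.contains("cpuacct"));
    }
}
```

In the reported output there is no `cpuacct` row at all, which suggests the running kernel does not provide that controller, as opposed to the container merely hiding it.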

@kakra
Author

kakra commented Aug 16, 2019

I did a little research. It's probably a combination of factors: the kernel version, systemd using the unified hierarchy, the ES machines running in an nspawn container (an installation mirror of the production machines, so that normal system updates on my dev machine won't tamper with the dependencies), and my desktop/development kernels using the MuQSS scheduler (CK patchset). MuQSS doesn't support hierarchical CPU bandwidth allocation and has no CPU accounting at all; it's flat, and I'm pretty sure that's what causes the problem.

I'll look into the 5.x case myself then. Since those containers are mostly static (they only offer ES as a service visible to the host), I can simply backport the patch, rebuild the class file and replace it in the installation. It's a bit dirty but it should be good enough.

BTW: If it is the CK patchset, the server deployments should at least be safe: they run the standard Linux scheduler. Still, it's good to know that the potential crash is caught now.
