ES doesn't start when there are empty cgroup controller names in '/proc/self/cgroup' #23486

Closed
phile314-fh opened this issue Mar 4, 2017 · 15 comments

Comments

@phile314-fh

phile314-fh commented Mar 4, 2017

Elasticsearch version: 5.2.2

Plugins installed: []

JVM version: 8u122

OS version: Linux

Description of the problem including expected versus actual behavior:
The OS Probe Regex fails if there is a cgroup entry with no controller and crashes. Example /proc/self/cgroup (see last line):

8:net_cls:/
7:devices:/user.slice
6:pids:/user.slice/user-1000.slice/session-3.scope
5:blkio:/
4:freezer:/
3:memory:/
2:cpu,cpuacct:/
1:cpuset:/
0::/user.slice/user-1000.slice/session-3.scope

Related issue: #23218

Steps to reproduce:
1.
2.
3.

Provide logs (if relevant):

Mar 03 12:43:00 elasticsearch systemd[1]: elasticsearch.service: Main process exited, code=exited, status=1/FAILURE
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         ... 6 more
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:121) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:333) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:241) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.bootstrap.Bootstrap$6.<init>(Bootstrap.java:241) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.node.Node.<init>(Node.java:232) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.node.Node.<init>(Node.java:345) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.monitor.MonitorService.<init>(MonitorService.java:45) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.monitor.os.OsService.<init>(OsService.java:45) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.monitor.os.OsProbe.osStats(OsProbe.java:466) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.monitor.os.OsProbe.getCgroup(OsProbe.java:414) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.monitor.os.OsProbe.getControlGroups(OsProbe.java:216) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at java.util.regex.Matcher.group(Matcher.java:536) ~[?:1.8.0_121]
Mar 03 12:43:00 elasticsearch elasticsearch[641]: Caused by: java.lang.IllegalStateException: No match found
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:82) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:89) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.cli.Command.main(Command.java:88) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:122) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.cli.SettingCommand.execute(SettingCommand.java:54) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:112) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]:         at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:125) ~[elasticsearch-5.2.2.jar:5.2.2]
Mar 03 12:43:00 elasticsearch elasticsearch[641]: org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: No match found
Mar 03 12:43:00 elasticsearch elasticsearch[641]: [2017-03-03T12:43:00,027][WARN ][org.elasticsearch.bootstrap.ElasticsearchUncaughtExceptionHandler] uncaught exception in thread [main]
Mar 03 12:42:59 elasticsearch elasticsearch[641]: [2017-03-03T12:42:59,000][WARN ][org.elasticsearch.deprecation.script.groovy.GroovyScriptEngineService] [groovy] scripts are deprecated, use [painless] scripts instead
Mar 03 12:42:58 elasticsearch elasticsearch[641]: [2017-03-03T12:42:58,412][INFO ][org.elasticsearch.plugins.PluginsService] no plugins loaded

Describe the feature:
Make the regex more robust. Changing the + to a * in the part of the failing regex that matches the cgroup controller name should do the trick. I would make a PR, but I am not willing to sign the CLA.
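To illustrate the suggestion above, here is a minimal Python sketch of the difference between requiring at least one controller name (+) and allowing an empty controller field (*). The two patterns are hypothetical stand-ins: the actual Elasticsearch regex is not quoted in this thread, and the function name parse_cgroups is made up for this example.

```python
import re

# Hypothetical stand-ins for the pattern discussed above; the actual
# Elasticsearch regex is not quoted in this thread.
STRICT = re.compile(r"^\d+:([^:,]+(?:,[^:,]+)*):(/.*)$")  # needs >= 1 controller
LENIENT = re.compile(r"^\d+:([^:]*):(/.*)$")              # controller may be empty


def parse_cgroups(lines, pattern=LENIENT):
    """Map each controller name to its cgroup path.

    Lines that do not match the pattern raise ValueError. Entries with an
    empty controller field (such as "0::/...") are skipped.
    """
    groups = {}
    for line in lines:
        match = pattern.match(line.strip())
        if match is None:
            # Checking the match first avoids the kind of "No match found"
            # failure shown in the stack trace above.
            raise ValueError(f"unparseable cgroup line: {line!r}")
        controllers, path = match.groups()
        if not controllers:
            continue  # no controller names to record for this hierarchy
        for controller in controllers.split(","):
            groups[controller] = path
    return groups


SAMPLE = [
    "3:memory:/",
    "2:cpu,cpuacct:/",
    "0::/user.slice/user-1000.slice/session-3.scope",
]
```

With the lenient pattern, parse_cgroups(SAMPLE) records memory, cpu and cpuacct and ignores the last line. With the strict pattern, the same input raises ValueError on the "0::" entry.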

@jasontedor
Member

jasontedor commented Mar 4, 2017

Thanks for the report @phile314-fh and sorry for the issue. I'll put together a fix soon. What Linux distribution are you using (including version, and kernel version)? Would you share the output of cat /proc/cgroups and mount | grep cgroup?

@phile314-fh
Author

Kernel: Linux nixos 4.9.9 #1-NixOS SMP Thu Feb 9 07:08:40 UTC 2017 x86_64 GNU/Linux
systemd: 232
Distribution: NixOS 17.03

/proc/cgroups:

#subsys_name	hierarchy	num_cgroups	enabled
cpuset	1	1	1
cpu	2	1	1
cpuacct	2	1	1
blkio	5	1	1
memory	3	1	1
devices	7	38	1
freezer	4	1	1
net_cls	8	1	1
pids	6	43	1

mount | grep cgroup:

tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup2 (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)

Although you don't officially support NixOS, this ought to be fixed as the problem could also happen on other distributions. Furthermore, as the error (when it occurs) is quite severe, a more liberal parsing of the cgroups seems appropriate to me.

@jasontedor
Member

Okay, I wanted to ensure it was only because the cgroup version 2 hierarchy was mistakenly accounted for and that's exactly what is happening here. I opened #23493.
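As a rough way to check whether a system has a cgroup version 2 hierarchy mounted, the mount output shown earlier can be scanned for filesystems of type cgroup2. The following Python sketch assumes the line format printed by mount in this thread; the function name cgroup_mounts is hypothetical and this is not how Elasticsearch itself detects cgroups.

```python
import re

# Matches lines in the format printed by `mount` in this thread, e.g.
# "cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,memory)".
MOUNT_LINE = re.compile(r"^(\S+) on (\S+) type (\S+) \((.*)\)$")


def cgroup_mounts(mount_output):
    """Return (v1_mount_points, v2_mount_points) found in `mount` output."""
    v1, v2 = [], []
    for line in mount_output.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if match is None:
            continue  # not a line in the expected format
        _source, mount_point, fs_type, _options = match.groups()
        if fs_type == "cgroup":
            v1.append(mount_point)
        elif fs_type == "cgroup2":
            v2.append(mount_point)
    return v1, v2


SAMPLE_MOUNTS = """\
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup2 (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
"""
```

For the sample above, the second list contains only /sys/fs/cgroup/systemd, which matches the cgroup2 mount reported on the NixOS machine.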

@jay-dihenkar

jay-dihenkar commented Jul 15, 2017

Faced this on Fedora 26 as well, with ES v5.2.2.

$ cat /etc/redhat-release 
Fedora release 26 (Twenty Six)
$ uname -a
Linux jaypc 4.11.9-300.fc26.x86_64 #1 SMP Wed Jul 5 16:21:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

@jasontedor
Member

@jay-dihenkar You should upgrade, this is fixed in 5.3.1.

@aymone

aymone commented Nov 17, 2017

@jasontedor 5.3.1 is incompatible with my services already in production on AWS. Is there a workaround or fix for it?
I'm using Fedora 26 too... @jay-dihenkar did you find a fix for it?

@jasontedor
Member

@aymone You can disable the cgroup version 2 hierarchy on your system, otherwise you have to upgrade.

@aymone

aymone commented Nov 30, 2017

@jasontedor do you know how to do it?

kowalcj0 added a commit to uktrade/directory-api that referenced this issue Jan 29, 2018
this is to fix the issue with some linux distros see: @ED-3197
elastic/elasticsearch#23486
@KrzysztofMadejski

KrzysztofMadejski commented Feb 20, 2018

I can't update my ES, same as @aymone , because services depend on this version.

Any workarounds? How can one disable the cgroup hierarchy, and what does it mean / what side effects can it have?

I have on Ubuntu 17.10:

12:rdma:/
11:pids:/user.slice/user-0.slice/session-3.scope
10:blkio:/user.slice/user-0.slice/session-3.scope
9:hugetlb:/
8:net_cls,net_prio:/
7:perf_event:/
6:devices:/user.slice/user-0.slice/session-3.scope
5:memory:/user.slice/user-0.slice/session-3.scope
4:cpu,cpuacct:/user.slice/user-0.slice/session-3.scope
3:cpuset:/
2:freezer:/
1:name=systemd:/user.slice/user-0.slice/session-3.scope
0::/user.slice/user-0.slice/session-3.scope

@jasontedor
Member

jasontedor commented Feb 26, 2018

@aymone @KrzysztofMadejski Please poke around in documentation and the web for that; it is a general Linux issue, not an Elasticsearch issue.

@KrzysztofMadejski

KrzysztofMadejski commented Feb 28, 2018

@aymone I've cherry-picked ae6331f onto the 5.2.2 tag, resolved conflicts, and compiled from source, and it seems to work.

Run the Gradle build as gradle assemble -Dbuild.snapshot=false

@KrzysztofMadejski

@jasontedor it would be good to add "wontfix" label here.

@jasontedor
Member

What do you mean? It is fixed in #23493 released in 5.3.1.

@KrzysztofMadejski

The bug report is against version 5.2.2, so I see it as "won't fix" for the 5.2.x branch. Such a notion makes sense to me because minor versions may introduce backwards-incompatible changes (5.3 does), so an upgrade is not a straightforward operation if you have ES in production.

The other issue, which troubles me more, is why you introduce backwards-incompatible changes in minor versions, which is contrary to the declaration at https://www.elastic.co/support/eol. But for clarity let's put it into another issue.

@jasontedor
Member

I understand where you’re coming from, but our maintenance policy is very clear (when 5.3.0 is released, 5.2 sees no more releases) and all the information needed to determine what version this is fixed in is already available.

But for clarity let's put it into another issue.

Please do.

rmetzger added a commit to rmetzger/flink that referenced this issue Feb 4, 2020
The ElasticSearch connector tests are failing on some machines, due to an issue with a regex to parse cgroups: elastic/elasticsearch#23486.