Kubernetes liveness probes with docker exec fail randomly #2605

Open
hingstarne opened this issue Aug 3, 2019 · 3 comments

Comments

@hingstarne

commented Aug 3, 2019

Issue Report

Bug

Kubernetes liveness probes fail randomly on this version of CoreOS. There is a bug in the runc version it ships:

runc --version
runc version 1.0.0-rc5+dev.docker-18.06
commit: a592beb5bc4c4092b1b1bac971afed27687340c5
spec: 1.0.0

See here

user 5m 5m 1 user-sqsworker-55f4f9494f-glnm7.15b76be66de646eb Pod spec.containers{rails} Warning Unhealthy kubelet, ip-172-31-101-183.eu-west-1.compute.internal Readiness probe failed: OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "process_linux.go:90: adding pid 18580 to cgroups caused \"failed to write 18580 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/podb72c8c56-b538-11e9-9f9c-06d6b3e699b6/1075b19a94bb045fdf72bfb0133bbbc721f3e04c43a99ded2d0f5eba6f34e7ca/cgroup.procs: invalid argument\"": unknown

This error happens randomly and we cannot provoke it. Because it also affects our CNI pods, it is a big issue for us.

Container Linux Version

cat /etc/os-release 
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=2135.6.0
VERSION_ID=2135.6.0
BUILD_ID=2019-07-30-0722
PRETTY_NAME="Container Linux by CoreOS 2135.6.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

AWS EC2 instance, m5.xlarge

Expected Behavior

Liveness probes fail only when there is an error within the pod.

Actual Behavior

Liveness probes fail randomly when the probe process's PID is written to cgroup.procs.

Reproduction Steps

  1. Use Kubernetes 1.11.10 deployed by kops 1.11.1 on AWS
  2. Wait and watch the events in all namespaces until this error occurs
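
Since the failure cannot be provoked, step 2 amounts to scanning event text for the cgroup.procs write error. A minimal sketch of such a filter follows, assuming the message keeps the wording of the event quoted above; the function and class names are hypothetical and not part of any Kubernetes tooling.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Pattern derived from the single event quoted in this report. It is an
# assumption that other occurrences use the same wording.
CGROUP_WRITE_ERROR = re.compile(
    r"failed to write (?P<pid>\d+) to cgroup\.procs: "
    r"write (?P<path>\S+/cgroup\.procs): invalid argument"
)


class CgroupWriteFailure(NamedTuple):
    """One matched failure: the PID and the cgroup.procs file it was written to."""
    pid: int
    cgroup_procs_path: str


def find_cgroup_write_failures(lines: Iterable[str]) -> Iterator[CgroupWriteFailure]:
    """Yield a record for every line that contains the cgroup.procs write error."""
    for line in lines:
        match = CGROUP_WRITE_ERROR.search(line)
        if match:
            yield CgroupWriteFailure(int(match.group("pid")), match.group("path"))
```

The function accepts any iterable of lines, for example the saved output of `kubectl get events --all-namespaces`, and the extracted path shows which pod and container cgroup was affected.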

Other Information

@bgilbert

Member

commented Aug 3, 2019

Thanks for the report. Did this work properly in a previous version of Container Linux?

@hingstarne

Author

commented Aug 5, 2019

We use an immutable approach and disable the update-engine.

It started when we migrated to CoreOS-stable-2079.3.0-hvm and still occurs with CoreOS-stable-2135.5.0-hvm, which we are using now.
Is there a best practice for replacing runc on the system for testing?

@bgilbert

Member

commented Aug 5, 2019

It started when we migrated to CoreOS-stable-2079.3.0-hvm

Which version were you using before that?
