This repository has been archived by the owner. It is now read-only.

CFS scheduler bug throttles highly threaded I/O blocked applications in Kubernetes #2623

Closed
dharmab opened this issue Oct 30, 2019 · 8 comments


@dharmab dharmab commented Oct 30, 2019

Issue Report

Bug

Container Linux Version

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=2191.5.0
VERSION_ID=2191.5.0
BUILD_ID=2019-09-04-0357
PRETTY_NAME="Container Linux by CoreOS 2191.5.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

Azure, AWS, and VMware

Expected Behavior

Highly threaded, I/O blocked containers running on Kubernetes on CoreOS should be able to use their full configured CPU request.

Actual Behavior

Highly threaded, I/O blocked containers running on Kubernetes on CoreOS are heavily throttled well before they approach their CPU request. We are seeing a CPU performance impact of 50% in production for some Java web applications; i.e., if we request 6 cores, we are throttled at 3.
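
One way to quantify throttling like this is to read the container's `cpu.stat` file from the cgroup v1 `cpu` controller, which reports `nr_periods`, `nr_throttled`, and `throttled_time`. The following is a minimal sketch, not the method used in this report: the helper names are hypothetical, the sample values are invented for illustration, and the location of `cpu.stat` depends on the node's cgroup layout.

```python
def parse_cpu_stat(text):
    """Parse the "key value" lines of a cgroup v1 cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats):
    """Fraction of elapsed CFS enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Hypothetical sample contents, NOT measurements from this report.
SAMPLE = """nr_periods 1000
nr_throttled 500
throttled_time 25000000000
"""

if __name__ == "__main__":
    stats = parse_cpu_stat(SAMPLE)
    print(f"throttled in {throttled_ratio(stats):.0%} of periods")
```

A high ratio on a container whose measured CPU usage is well below its limit would be consistent with the behavior described here, although the ratio alone does not prove this particular bug.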

See https://lkml.org/lkml/2019/5/17/581 for the kernel patch to fix this, which is landing in 5.4.

Because this bug heavily impacts the main intended use case of CoreOS Container Linux, would it be possible to prioritize this patch for backport?


@jcrowthe jcrowthe commented Oct 30, 2019

Small clarification: the issue lies in CFS bandwidth control, the kernel's CPU bandwidth control mechanism. Due to this kernel bug, CFS may throttle a pod well before its requests are reached. Hence the issue is in how the kernel enforces Kubernetes pod limits, not pod requests.


@jhohertz jhohertz commented Nov 4, 2019

Just wanted to reference some work I've done on this, overlay commit w/ the patch here:

viafoura/flatcar-coreos-overlay@b32f07f

Edit: This ⬆️ is obsoleted by the PR below ⬇️.


@chiluk chiluk commented Nov 4, 2019

CPU ".request" maps to cpu.shares, which is a different mechanism from ".limits", which uses CFS bandwidth control. If you really are using .request, you are probably being hit by over-committed CPUs on your nodes.

That being said, yes, CoreOS should backport the following Linux commits onto their kernel if they haven't already:
512ac999
de53fd7ae
763a9ec06
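
The distinction between requests and limits can be sketched in code. The example below is an illustration under stated assumptions, not the kubelet's actual implementation: it assumes the commonly described mapping of 1024 shares per CPU for requests and a default CFS period of 100 ms for limits, and the function names and constants are hypothetical.

```python
# Assumed constants; verify against your Kubernetes version and node configuration.
CFS_PERIOD_US = 100_000   # assumed default CFS period (100 ms)
SHARES_PER_CPU = 1024     # assumed cpu.shares granted per full CPU
MIN_SHARES = 2            # assumed lower bound on cpu.shares
MIN_QUOTA_US = 1_000      # assumed lower bound on cpu.cfs_quota_us


def request_to_shares(milli_cpu):
    """Map a CPU request in millicores to a relative cpu.shares weight."""
    if milli_cpu == 0:
        return MIN_SHARES
    return max(MIN_SHARES, milli_cpu * SHARES_PER_CPU // 1000)


def limit_to_quota_us(milli_cpu, period_us=CFS_PERIOD_US):
    """Map a CPU limit in millicores to a hard quota (microseconds of CPU per period)."""
    if milli_cpu == 0:
        return None  # no limit set, so no bandwidth control is applied
    return max(MIN_QUOTA_US, milli_cpu * period_us // 1000)


if __name__ == "__main__":
    # A pod with both request and limit set to 6 cores (6000 millicores).
    print("cpu.shares:", request_to_shares(6000))
    print("cpu.cfs_quota_us:", limit_to_quota_us(6000))
```

Under these assumptions, shares only set a relative weight that matters when CPUs are contended, while the quota is a hard cap per period. Throttling on an otherwise idle node therefore points at the quota path, which is the mechanism the commits listed here address.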


@jhohertz jhohertz commented Nov 5, 2019

Update: I am reworking my PR w/ input from the developers now.

Author

@dharmab dharmab commented Nov 6, 2019

@chiluk good callout. We were setting both requests and limits to 6 in our test, and reproduced the throttling on a node with many more cores available, which we had tainted to run only that one application Pod.

@jhohertz thanks for submitting the patches! We've built a CoreOS image with those patches for some testing and are also awaiting an official release.

Author

@dharmab dharmab commented Nov 7, 2019

We confirmed this patch fixed our throttling issue! Thank you!

@dharmab dharmab closed this Nov 7, 2019
Member

@bgilbert bgilbert commented Nov 7, 2019

This should be fixed in alpha 2317.0.1, due shortly. We'll probably roll the fix into beta and stable more quickly than the normal promotion schedule would suggest. Reopening until that happens.

@bgilbert bgilbert reopened this Nov 7, 2019
Member

@bgilbert bgilbert commented Nov 19, 2019

This will also be fixed in beta 2303.2.0 and stable 2247.7.0, due shortly. Thanks for reporting.

@bgilbert bgilbert closed this Nov 19, 2019