Research Linux OOM killer behavior for cgroups #20

cirocosta · 2019-03-11T01:36:22Z

Hey,

We've been seeing some of our workers go away once they reach a certain memory profile, because the kernel's OOM killer kicks in and kills our container.

It seems like k8s does not act as an intermediary for the OOM killer when it comes to cgroups, making the whole process very ungraceful.

It'd be interesting to understand:

- how does k8s deal with containers that reach their memory limits (does it leave it to the kernel to terminate the parent process in the cgroup?), and
- what is the best metric to look at when it comes to understanding if an OOM eviction is going to take place: should we look at 1 - available? Should we consider cache?

This is very impactful for workloads like strabo that might generate a huge in-memory cache for the files they access.
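As a starting point for the metric question, here is a minimal sketch of one candidate: the "working set", i.e. cgroup memory usage minus inactive file-backed pages, which is roughly how cAdvisor derives `container_memory_working_set_bytes`. Whether this is the right signal for our case is exactly what needs researching. The function names, the default cgroup path, and the cgroup v1 file layout are assumptions for illustration, not something taken from our deployment.

```python
def parse_memory_stat(text):
    """Parse the `key value` lines of a cgroup memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(usage_bytes, stat_text):
    """Usage minus inactive file cache, floored at zero.

    Inactive file pages are the part of the page cache the kernel can
    usually reclaim cheaply, so they are subtracted from total usage.
    """
    stats = parse_memory_stat(stat_text)
    # cgroup v1 exposes total_inactive_file; cgroup v2 names it inactive_file.
    inactive_file = stats.get("total_inactive_file", stats.get("inactive_file", 0))
    return max(usage_bytes - inactive_file, 0)


def read_working_set(cgroup_dir="/sys/fs/cgroup/memory"):
    """Read the working set from a cgroup v1 memory directory.

    The default path is an assumption; inside a container it may differ.
    """
    with open(f"{cgroup_dir}/memory.usage_in_bytes") as f:
        usage = int(f.read().strip())
    with open(f"{cgroup_dir}/memory.stat") as f:
        return working_set_bytes(usage, f.read())
```

Comparing `read_working_set()` against the container's memory limit would tell us how much of the usage is cache that the kernel can still drop before the OOM killer gets involved, which matters for a cache-heavy workload like strabo.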

cirocosta mentioned this issue Jan 2, 2020

Investigate worker recovery for k8s ungracefully restarting pods out-of-band due to high memory consumption #94

Open
