
The agent should more clearly indicate when it or its sub-processes have been OOM killed on Kubernetes #3641

Open
Tracked by #3640
cmacknz opened this issue Oct 19, 2023 · 4 comments
Labels
Team:Elastic-Agent Label for the Agent team

Comments

@cmacknz
Member

cmacknz commented Oct 19, 2023

We need to make it easier to detect inadequate memory limits on Kubernetes, which are extremely common.

The agent should detect when its last state was OOMKilled and report its status as degraded. Detecting that an agent has been OOMKilled from diagnostics alone is not easy; it must be inferred from process restarts appearing in the agent diagnostics with no other plausible explanation.

Today the primary way for us to detect this is to instruct users to run kubectl describe pod and look for the following:

       Last State:   Terminated
       Reason:       OOMKilled
       Exit Code:    137

We should automate this process and have the agent read the last state and termination reason for itself, then report them in its status.
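
As a sketch of what that automation could look like (an illustration only, not the agent's actual implementation), the agent could query its own Pod through the Kubernetes API with client-go and inspect the last termination state of each container. The POD_NAME and POD_NAMESPACE environment variables are assumed to be injected via the downward API:

    package main

    import (
        "context"
        "fmt"
        "os"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    func main() {
        // Authenticate with the API server using the pod's service account.
        cfg, err := rest.InClusterConfig()
        if err != nil {
            panic(err)
        }
        clientset, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }

        // Assumed to be provided via the downward API.
        podName := os.Getenv("POD_NAME")
        namespace := os.Getenv("POD_NAMESPACE")

        pod, err := clientset.CoreV1().Pods(namespace).Get(context.Background(), podName, metav1.GetOptions{})
        if err != nil {
            panic(err)
        }

        // Check whether any container in the pod was previously OOM killed.
        for _, cs := range pod.Status.ContainerStatuses {
            last := cs.LastTerminationState.Terminated
            if last != nil && last.Reason == "OOMKilled" {
                // Printing stands in for marking the agent status as degraded.
                fmt.Printf("container %s was previously OOM killed (exit code %d)\n", cs.Name, last.ExitCode)
            }
        }
    }

Reading the pod this way requires RBAC permission to get pods in the agent's own namespace.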

We have also seen cases where the agent sub-processes are killed and restarted without the agent process itself being OOMKilled (because the sub-processes use more memory). We should double check that the OOMKilled reason appears on the pod when this happens.

The OOM kill event also appears in the node kernel logs if we end up needing to look there:

Mar 13 20:37:14 aks-default-32489819 kernel: [2442796.469054] Memory cgroup out of memory: Killed process 2532535 (filebeat) total-vm:2766604kB, anon-rss:1298484kB, file-rss:71456kB, shmem-rss:0kB, UID:0 pgtables:2992kB oom_score_adj:-997
Mar 13 20:37:14 aks-default-32489819 systemd[1]: cri-containerd-8a7c9177c7f2c619df882ecfebb3895c.scope: A process of this unit has been killed by the OOM killer.
@cmacknz added the Team:Elastic-Agent label on Oct 19, 2023
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz added the Team:Cloudnative-Monitoring label on Oct 19, 2023
@cmacknz changed the title from "Report the agent status as degraded when previously OOMKilled on Kubernetes" to "The agent should more clearly indicate when it or its sub-processes have been OOM killed on Kubernetes" on Mar 20, 2024
@jlind23 removed the Team:Cloudnative-Monitoring label on Mar 20, 2024
@cmacknz
Member Author

cmacknz commented Mar 20, 2024

I think we will need to experiment with a few different scenarios to test this properly:

  • The agent container going over its configured memory limit because one of the sub-processes (e.g. Filebeat) is using too much memory; see the manifest sketch after this list.
  • The agent container staying under its limit, but the node it is running on running out of memory. This can be triggered by keeping the individual containers under a large memory limit while the sum of their actual memory consumption exceeds the memory available on the node.
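
As a starting point for reproducing the first scenario, here is a sketch of the container resources section of the agent DaemonSet manifest with an intentionally low memory limit. The limit value and image tag are illustrative assumptions, not recommendations:

          containers:
            - name: elastic-agent
              image: docker.elastic.co/beats/elastic-agent:8.13.0
              resources:
                limits:
                  # Low enough that Filebeat's normal usage will exceed it.
                  memory: 300Mi
                requests:
                  memory: 300Mi

For the second scenario, the limits would instead be set well above the node's available memory so that the node-level OOM killer, rather than the container limit, terminates the processes.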

@leehinman
Contributor

Just so we don't forget: if the ExitCode is -1, that signals that the "process hasn't exited or was terminated by a signal". We currently just log the ExitCode when a sub-process exits. We could extend the error message so that when the exit code is -1 it notes that this is potentially an OOM kill, or at least that the process is being killed by an external mechanism.
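
As a rough illustration (assumed code, not what the agent does today), Go's os/exec returns -1 from ProcessState.ExitCode() when the process was terminated by a signal, so the exit log could be annotated along these lines; the sub-process path is hypothetical:

    package main

    import (
        "log"
        "os/exec"
    )

    func runSubprocess(path string, args ...string) {
        cmd := exec.Command(path, args...)
        err := cmd.Run()

        if cmd.ProcessState == nil {
            // The process never started (e.g. the binary was not found).
            log.Printf("subprocess %s failed to start: %v", path, err)
            return
        }

        code := cmd.ProcessState.ExitCode()
        if code == -1 {
            // ExitCode returns -1 when the process was terminated by a signal,
            // so surface that this may be the OOM killer or another external mechanism.
            log.Printf("subprocess %s exited with code -1 (killed by a signal, possibly OOM killed): %v", path, err)
            return
        }
        log.Printf("subprocess %s exited with code %d: %v", path, code, err)
    }

    func main() {
        // Hypothetical binary path, purely for illustration.
        runSubprocess("/usr/share/elastic-agent/components/filebeat", "-e")
    }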

@cmacknz
Member Author

cmacknz commented May 8, 2024

The reporting we get from k8s when a pod is OOMKilled differs based on the Kubernetes version.

Starting from Kubernetes 1.28, the memory.oom.group feature of cgroups v2 is turned on by default, so the whole container cgroup is killed as a unit if any process in it is OOM killed, and the pod reports OOMKilled as the last state reason.

On prior versions memory.oom.group is turned off, so the pod won't be annotated with the OOMKilled last state reason when only a sub-process is killed. Most of our memory consumption happens in sub-processes, so we hit this situation frequently.

Kubernetes change log for reference: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.28.md

If using cgroups v2, then the cgroup aware OOM killer will be enabled for container cgroups via memory.oom.group . This causes processes within the cgroup to be treated as a unit and killed simultaneously in the event of an OOM kill on any process in the cgroup. (#117793, @tzneal) [SIG Apps, Node and Testing]
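
As a small illustration (an assumption, not existing agent code), whether group OOM kill is in effect can be checked from inside the container by reading memory.oom.group from the cgroup v2 filesystem, assuming it is mounted at /sys/fs/cgroup:

    package main

    import (
        "fmt"
        "os"
        "strings"
    )

    func main() {
        data, err := os.ReadFile("/sys/fs/cgroup/memory.oom.group")
        if err != nil {
            // Either cgroups v1 is in use or the file is not visible in this mount.
            fmt.Println("memory.oom.group not readable:", err)
            return
        }
        if strings.TrimSpace(string(data)) == "1" {
            fmt.Println("group OOM kill enabled: an OOM kill of any process terminates the whole container")
        } else {
            fmt.Println("group OOM kill disabled: a sub-process can be OOM killed without the pod reporting OOMKilled")
        }
    }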
