
os.cgroup.cpuacct.usage_nanos is actually microseconds when Elasticsearch is run inside cgroup v2 #96089

Closed
b-deam opened this issue May 15, 2023 · 2 comments · Fixed by #96924
Labels: >bug, :Core/Infra/Core (Core issues without another label), Team:Core/Infra (Meta label for core/infra team)

b-deam (Member) commented May 15, 2023

Elasticsearch Version

master

Installed Plugins

No response

Java Version

bundled

OS Version

5.15.0-1036-azure

Problem Description

When Elasticsearch is run inside a cgroup v2, the node stats output for "https://elasticsearch:9200/_nodes/stats?filter_path=nodes.*.os.cgroup.cpuacct.usage_nanos" is actually in microseconds; cgroup v1 correctly reports this value in nanoseconds:

cgroupCpuAcctUsageNanos = cpuStatsMap.get("usage_usec");
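If so, a minimal sketch of the conversion that seems to be missing, assuming the map is parsed from /sys/fs/cgroup/cpu.stat as shown below (hypothetical helper names; the actual fix landed in #96924 and may be structured differently):

import java.util.Map;

class CgroupV2CpuStats {
    // cpu.stat reports usage_usec in microseconds, but the node-stats field is
    // usage_nanos, so the value has to be scaled by 1000 before being reported.
    static long usageNanos(Map<String, Long> cpuStatsMap) {
        return cpuStatsMap.get("usage_usec") * 1000L;
    }

    public static void main(String[] args) {
        // usage_usec value taken from the AKS output below
        System.out.println(usageNanos(Map.of("usage_usec", 104_036_485_036L)));
        // prints 104036485036000
    }
}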

We collect these stats in Rally's node-stats telemetry device, and it became clear that the formula we use to derive CPU usage from the available time is off by a factor of 1000 (i.e. the difference between nanoseconds and microseconds) for any container running inside a cgroup v2.
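To make the skew concrete, here is a hedged sketch of the derivation (Rally itself is Python; this is an illustration, not its actual code): utilization is the delta of reported CPU time divided by elapsed wall-clock time, so a microsecond value labelled as nanoseconds understates utilization by 1000x.

class CpuUtilizationSketch {
    // utilization = delta(CPU time used) / delta(wall-clock time elapsed)
    static double utilization(long usageDeltaNanos, long elapsedNanos) {
        return (double) usageDeltaNanos / elapsedNanos;
    }

    public static void main(String[] args) {
        long elapsedNanos = 10_000_000_000L;   // 10 s sampling window
        long trueDeltaNanos = 20_000_000_000L; // ~2 cores busy for 10 s
        System.out.println(utilization(trueDeltaNanos, elapsedNanos));        // 2.0
        // the same interval when usage_usec is passed through unconverted
        System.out.println(utilization(trueDeltaNanos / 1000, elapsedNanos)); // 0.002
    }
}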

The screenshots below show the difference between cgroup v1 running on Google Kubernetes Engine (GKE) and cgroup v2 running on Azure Kubernetes Service (AKS):

[screenshots: derived CPU usage charts, GKE (cgroup v1) vs AKS (cgroup v2)]

GKE output (cgroup v1)

$ uname -a
Linux es-es-search-7b66d98c5b-fs28n 5.15.89+ #1 SMP Sat Mar 18 09:27:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ mount -l | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)

# nanoseconds
$ cat /sys/fs/cgroup/cpu,cpuacct/cpuacct.usage
63277158346752 

AKS output (cgroup v2):

$ uname -a
Linux es-es-index-6f49648d8-jhm9s 5.15.0-1036-azure #43-Ubuntu SMP Wed Mar 29 16:11:05 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ mount -l | grep cgroup
cgroup on /sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime)

# microseconds
$ cat /sys/fs/cgroup/cpu.stat
usage_usec 104036485036
user_usec 98419994704
system_usec 5616490332
nr_periods 164357
nr_throttled 143842
throttled_usec 9516539086
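For scale: 104036485036 microseconds is roughly 104,000 seconds (about 29 hours) of accumulated CPU time; read as nanoseconds, the same number would be only ~104 seconds, which is exactly the factor-of-1000 discrepancy described above.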

Steps to Reproduce

I encountered the bug when running a cluster inside an Azure Kubernetes Service (AKS) cluster, but that's not exactly practical for reproductions.

We can repro this using ECK and Minikube with the Docker driver on macOS.

Note that on Linux, Minikube automatically detects whether cgroup v1 or v2 is in use on your workstation (i.e. where you invoke minikube start from), whereas Docker Desktop on macOS (which actually runs a Linux VM in the background) requires adjusting the cgroup version by modifying the engine's settings (more on this below).

Testing with cgroup v2:

# minikube > 1.23 defaults to cgroup v2
$ minikube version
minikube version: v1.26.1
commit: 62e108c3dfdec8029a890ad6d8ef96b6461426dc

# start minikube
$ minikube start

# create eck operator
$ minikube kubectl -- create -f https://download.elastic.co/downloads/eck/2.7.0/crds.yaml
$ minikube kubectl -- apply -f https://download.elastic.co/downloads/eck/2.7.0/operator.yaml
$ minikube kubectl -- -n elastic-system logs -f statefulset.apps/elastic-operator

# deploy elasticsearch
$ cat <<EOF | minikube kubectl -- apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.7.1
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
EOF

# check that es is using cgroup v2
$ minikube kubectl -- -n elastic-system exec -it local-es-default-0 -- /bin/sh

# inside es pod/container
sh-5.0$ mount -l | grep cgroup
cgroup on /sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime)

# get pass
$ PASSWORD=$(minikube kubectl -- get secret quickstart-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
# make the service available
$ minikube kubectl -- port-forward service/quickstart-es-http 9200
# check output
$ curl -s -u "elastic:$PASSWORD" -k "https://localhost:9200/_nodes/stats?filter_path=nodes.*.os.cgroup.cpuacct.usage_nanos" | jq .
{
  "nodes": {
    "WA7ANuASRiGF7xgO3dYp_w": {
      "os": {
        "cgroup": {
          "cpuacct": {
            "usage_nanos": 107115513
          }
        }
      }
    }
  }
}
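Note that a usage_nanos of 107115513 would be only ~0.1 s of CPU time if it really were nanoseconds, which is implausibly low for a node that has just bootstrapped; it is more plausibly the raw usage_usec counter (~107 s of CPU time) passed through unconverted.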

Testing with cgroup v1:

I'm on macOS Monterey 12.6 using the docker driver for minikube, which actually runs a Linux VM behind the scenes. To force it to use cgroup v1 I had to set "deprecatedCgroupv1": true in $HOME/Library/Group\ Containers/group.com.docker/settings.json and restart Docker Desktop before following these steps:

$ minikube kubectl -- -n elastic-system exec -it local-es-default-0 -- /bin/sh

# inside es pod/container
sh-5.0$ mount -l | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu type cgroup (ro,nosuid,nodev,noexec,relatime,cpu)
cgroup on /sys/fs/cgroup/cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpuacct)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_prio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
systemd on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,name=systemd)

# get pass
$ PASSWORD=$(minikube kubectl -- get secret quickstart-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
# make the service available
$ minikube kubectl -- port-forward service/quickstart-es-http 9200
# check output
$ curl -s -u "elastic:$PASSWORD" -k "https://localhost:9200/_nodes/stats?filter_path=nodes.*.os.cgroup.cpuacct.usage_nanos" | jq .
{
  "nodes": {
    "WA7ANuASRiGF7xgO3dYp_w": {
      "os": {
        "cgroup": {
          "cpuacct": {
            "usage_nanos": 35975828297
          }
        }
      }
    }
  }
}
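Here the value looks like genuine nanoseconds: 35975828297 ns is ~36 s of CPU time, a plausible figure, so the cgroup v1 code path appears to report the correct unit.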

Logs (if relevant)

No response

b-deam added the >bug, Team:Core/Infra, and needs:triage labels on May 15, 2023
elasticsearchmachine removed the Team:Core/Infra label on May 15, 2023
rjernst added the :Core/Infra/Core label and removed the needs:triage label on May 15, 2023
elasticsearchmachine added the Team:Core/Infra label on May 15, 2023
elasticsearchmachine (Collaborator) commented

Pinging @elastic/es-core-infra (Team:Core/Infra)

thecoop (Member) commented Jun 19, 2023

Looks like a basic error where it reads usecs and thinks it's nanos. Fixed by #96924
