
os.cgroup.cpuacct.usage_nanos is actually microseconds when Elasticsearch is run inside cgroup v2 #96089

Closed
b-deam opened this issue May 15, 2023 · 2 comments · Fixed by #96924
Labels: >bug, :Core/Infra/Core (Core issues without another label), Team:Core/Infra (Meta label for core/infra team)

b-deam (Member) commented May 15, 2023

Elasticsearch Version

master

Installed Plugins

No response

Java Version

bundled

OS Version

5.15.0-1036-azure

Problem Description

When Elasticsearch is run inside a cgroup v2, the node stats output for "https://elasticsearch:9200/_nodes/stats?filter_path=nodes.*.os.cgroup.cpuacct.usage_nanos" is actually in microseconds; cgroup v1 correctly reports this value in nanoseconds:

cgroupCpuAcctUsageNanos = cpuStatsMap.get("usage_usec");
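If so, a minimal sketch of the conversion that seems to be missing, assuming the map is parsed from /sys/fs/cgroup/cpu.stat as shown below (hypothetical helper names; the actual fix landed in #96924 and may be structured differently):

import java.util.Map;

class CgroupV2CpuStats {
    // cpu.stat reports usage_usec in microseconds, but the node-stats field is
    // usage_nanos, so the value has to be scaled by 1000 before being reported.
    static long usageNanos(Map<String, Long> cpuStatsMap) {
        return cpuStatsMap.get("usage_usec") * 1000L;
    }

    public static void main(String[] args) {
        // usage_usec value taken from the AKS output below
        System.out.println(usageNanos(Map.of("usage_usec", 104_036_485_036L)));
        // prints 104036485036000
    }
}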

We collect these stats in Rally's node-stats telemetry device, and it became clear that the formula we use to derive CPU usage from the available time is off by a factor of 1000 (i.e. the difference between nanoseconds and microseconds) for any container running inside a cgroup v2.
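To make the skew concrete, here is a hedged sketch of the derivation (Rally itself is Python; this is an illustration, not its actual code): utilization is the delta of reported CPU time divided by elapsed wall-clock time, so a microsecond value labelled as nanoseconds understates utilization by 1000x.

class CpuUtilizationSketch {
    // utilization = delta(CPU time used) / delta(wall-clock time elapsed)
    static double utilization(long usageDeltaNanos, long elapsedNanos) {
        return (double) usageDeltaNanos / elapsedNanos;
    }

    public static void main(String[] args) {
        long elapsedNanos = 10_000_000_000L;   // 10 s sampling window
        long trueDeltaNanos = 20_000_000_000L; // ~2 cores busy for 10 s
        System.out.println(utilization(trueDeltaNanos, elapsedNanos));        // 2.0
        // the same interval when usage_usec is passed through unconverted
        System.out.println(utilization(trueDeltaNanos / 1000, elapsedNanos)); // 0.002
    }
}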

The screenshots below show the difference between cgroup v1 running on Google Kubernetes Engine (GKE) and cgroup v2 running on Azure Kubernetes Service (AKS):

[screenshots: derived CPU usage charts, GKE (cgroup v1) vs AKS (cgroup v2)]

GKE output (cgroup v1)

$ uname -a
Linux es-es-search-7b66d98c5b-fs28n 5.15.89+ #1 SMP Sat Mar 18 09:27:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ mount -l | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)

# nanoseconds
$ cat /sys/fs/cgroup/cpu,cpuacct/cpuacct.usage
63277158346752 

AKS output (cgroup v2):

$ uname -a
Linux es-es-index-6f49648d8-jhm9s 5.15.0-1036-azure #43-Ubuntu SMP Wed Mar 29 16:11:05 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ mount -l | grep cgroup
cgroup on /sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime)

# microseconds
$ cat /sys/fs/cgroup/cpu.stat
usage_usec 104036485036
user_usec 98419994704
system_usec 5616490332
nr_periods 164357
nr_throttled 143842
throttled_usec 9516539086
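For scale: 104036485036 microseconds is roughly 104,000 seconds (about 29 hours) of accumulated CPU time; read as nanoseconds, the same number would be only ~104 seconds, which is exactly the factor-of-1000 discrepancy described above.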

Steps to Reproduce

I encountered the bug when running a cluster inside an Azure Kubernetes Service (AKS) cluster, but that's not exactly practical for reproductions.

We can repro this using ECK and Minikube with the Docker driver on macOS.

Note that on Linux, Minikube automatically detects whether cgroup v1 or v2 is in use on your workstation (i.e. where you invoke minikube start from), whereas Docker Desktop on macOS (which actually runs a Linux VM in the background) requires adjusting the cgroup version by modifying the engine's settings (more on this below).

Testing with cgroup v2:

# minikube > 1.23 defaults to cgroup v2
$ minikube version
minikube version: v1.26.1
commit: 62e108c3dfdec8029a890ad6d8ef96b6461426dc

# start minikube
$ minikube start

# create eck operator
$ minikube kubectl -- create -f https://download.elastic.co/downloads/eck/2.7.0/crds.yaml
$ minikube kubectl -- apply -f https://download.elastic.co/downloads/eck/2.7.0/operator.yaml
$ minikube kubectl -- -n elastic-system logs -f statefulset.apps/elastic-operator

# deploy elasticsearch
$ cat <<EOF | minikube kubectl -- apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.7.1
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
EOF

# check that es is using cgroup v2
$ minikube kubectl -- -n elastic-system exec -it local-es-default-0 -- /bin/sh

# inside es pod/container
sh-5.0$ mount -l | grep cgroup
cgroup on /sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime)

# get pass
$ PASSWORD=$(minikube kubectl -- get secret quickstart-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
# make the service available
$ minikube kubectl -- port-forward service/quickstart-es-http 9200
# check output
$ curl -s -u "elastic:$PASSWORD" -k "https://localhost:9200/_nodes/stats?filter_path=nodes.*.os.cgroup.cpuacct.usage_nanos" | jq .
{
  "nodes": {
    "WA7ANuASRiGF7xgO3dYp_w": {
      "os": {
        "cgroup": {
          "cpuacct": {
            "usage_nanos": 107115513
          }
        }
      }
    }
  }
}
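Note that a usage_nanos of 107115513 would be only ~0.1 s of CPU time if it really were nanoseconds, which is implausibly low for a node that has just bootstrapped; it is more plausibly the raw usage_usec counter (~107 s of CPU time) passed through unconverted.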

Testing with cgroup v1:

I'm on macOS Monterey 12.6 using the docker driver for minikube, which actually runs a Linux VM behind the scenes. To force it to use cgroup v1 I had to set "deprecatedCgroupv1": true in $HOME/Library/Group\ Containers/group.com.docker/settings.json and restart Docker Desktop before following these steps:

$ minikube kubectl -- -n elastic-system exec -it local-es-default-0 -- /bin/sh

# inside es pod/container
sh-5.0$ mount -l | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu type cgroup (ro,nosuid,nodev,noexec,relatime,cpu)
cgroup on /sys/fs/cgroup/cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpuacct)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_prio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
systemd on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,name=systemd)

# get pass
$ PASSWORD=$(minikube kubectl -- get secret quickstart-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
# make the service available
$ minikube kubectl -- port-forward service/quickstart-es-http 9200
# check output
$ curl -s -u "elastic:$PASSWORD" -k "https://localhost:9200/_nodes/stats?filter_path=nodes.*.os.cgroup.cpuacct.usage_nanos" | jq .
{
  "nodes": {
    "WA7ANuASRiGF7xgO3dYp_w": {
      "os": {
        "cgroup": {
          "cpuacct": {
            "usage_nanos": 35975828297
          }
        }
      }
    }
  }
}
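Here the value looks like genuine nanoseconds: 35975828297 ns is ~36 s of CPU time, a plausible figure, so the cgroup v1 code path appears to report the correct unit.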

Logs (if relevant)

No response

b-deam added the >bug, Team:Core/Infra, and needs:triage labels on May 15, 2023
elasticsearchmachine removed the Team:Core/Infra label on May 15, 2023
rjernst added the :Core/Infra/Core label and removed the needs:triage label on May 15, 2023
elasticsearchmachine added the Team:Core/Infra label on May 15, 2023
elasticsearchmachine (Collaborator) commented

Pinging @elastic/es-core-infra (Team:Core/Infra)

thecoop (Member) commented Jun 19, 2023

Looks like a basic error where it reads usecs and thinks it's nanos. Fixed by #96924
