Persistent memory escalation within Tetragon leads to Out-of-Memory errors #1485

Closed
Jay-JHe opened this issue Sep 18, 2023 · 14 comments
Labels: kind/bug (Something isn't working), release-blocker (This PR or issue is blocking the next release.)

Jay-JHe commented Sep 18, 2023

What happened?

Hey everyone, I'm currently using Tetragon v0.11.0. To keep resource consumption in check, I've limited Tetragon's resource allocation to 250m CPU and 400MB of memory.

However, when deploying Tetragon on a production cluster, I noticed that memory usage starts at 250MB and increases continuously until the container hits OOM (Out Of Memory), which usually takes only about half an hour.

I checked Tetragon's event generation rate and it's roughly 200 events per second. At this magnitude, the event traffic shouldn't cause such high memory usage. Analyzing the Tetragon events, I found that over 90% of them were either process execve or exit.

On a different test cluster, I tested Tetragon's performance by hooking tcp_connect and driving load with netperf's TCP short-connection test; it ran stably within a 400MB memory allocation at an event rate of 1000-2000 events per second. This is what puzzles me.

I've tried Tetragon versions v0.9, v0.10, and v0.11; they all seem to have this issue. My cluster hosts run kernel version 5.4.119.
Can anyone advise on how to tackle this issue?

Below are the pprof heap profile and memory usage attachments:
pprof-heap.zip
[screenshot: memory usage]
[screenshot: memory usage]
profile002.pdf

Tetragon Version

v0.9, v0.10, v0.11

Kernel Version

5.4.119

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:53:42Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.6-tke.21.tkex.1.43+f47562b7e0c24c-dirty", GitCommit:"f47562b7e0c24c2b70a8ee9b17b4b012bb61706f", GitTreeState:"dirty", BuildDate:"2022-09-30T07:39:44Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}

Bugtool

tetragon-bugtool.tar.gz

Relevant log output

time="2023-09-18T12:03:03Z" level=info msg="config settings" config="map[bpf-lib:/var/lib/tetragon/ btf: cilium-bpf: config-dir:/etc/runtime/runtime.conf.d/ config-file: cpuprofile: data-cache-size:1024 debug:false disable-kprobe-multi:false enable-cilium-api:false enable-export-aggregation:false enable-k8s-api:true enable-msg-handling-latency:false enable-pid-set-filter:false enable-policy-filter:false enable-policy-filter-debug:false enable-process-ancestors:true enable-process-cred:false enable-process-ns:false enable-report:true event-queue-size:10000 export-aggregation-buffer-size:10000 export-aggregation-window-size:15s export-allowlist:{\"event_set\":[\"PROCESS_EXEC\", \"PROCESS_EXIT\", \"PROCESS_KPROBE\", \"PROCESS_UPROBE\"]} export-denylist:{\"health_check\":true}\n{\"namespace\":[\"\", \"tcss\", \"kube-system\"]} export-file-compress:false export-file-max-backups:5 export-file-max-size-mb:10 export-file-rotation-interval:0s export-filename:/var/run/cilium/tetragon/tetragon.log export-rate-limit:-1 field-filters:{} force-large-progs:false force-small-progs:false gops-address:localhost:8118 k8s-kubeconfig-path: kernel: kmods:[] log-format:text log-level:info memprofile:/var/run/cilium/profile.heap metrics-server::2112 netns-dir:/var/run/docker/netns/ pprof-addr:localhost:6060 process-cache-size:65536 procfs:/proc rb-size:0 rb-size-total:0 release-pinned-bpf:true server-address:localhost:54321 tracing-policy: tracing-policy-dir:/etc/tetragon/tetragon.tp.d verbose:0]"
time="2023-09-18T12:03:03Z" level=warning msg="BPF filesystem is going to be mounted automatically in /run/cilium/bpffs. However, it probably means that Cilium is running inside container and BPFFS is not mounted on the host. for more information, see: https://cilium.link/err-bpf-mount"
time="2023-09-18T12:03:03Z" level=info msg="Starting mem profiling" file=/var/run/cilium/profile.heap
time="2023-09-18T12:03:03Z" level=info msg="BPF: successfully released pinned BPF programs and maps" bpf-dir=/run/cilium/bpffs/tetragon
time="2023-09-18T12:03:03Z" level=info msg="BTF discovery: default kernel btf file found" btf-file=/sys/kernel/btf/vmlinux
time="2023-09-18T12:03:03Z" level=info msg="sensor controller waiting on channel"
time="2023-09-18T12:03:03Z" level=info msg="Registering pod delete handler for metrics"
time="2023-09-18T12:03:03Z" level=info msg="Starting metrics server" addr=":2112"
time="2023-09-18T12:03:03Z" level=info msg="Cgroup mode detection succeeded" cgroup.fs=/sys/fs/cgroup cgroup.mode="Legacy mode (Cgroupv1)"
time="2023-09-18T12:03:03Z" level=info msg="Supported cgroup controller 'memory' is active on the system" cgroup.controller.hierarchyID=10 cgroup.controller.index=4 cgroup.controller.name=memory cgroup.fs=/sys/fs/cgroup
time="2023-09-18T12:03:03Z" level=info msg="Supported cgroup controller 'pids' is active on the system" cgroup.controller.hierarchyID=12 cgroup.controller.index=11 cgroup.controller.name=pids cgroup.fs=/sys/fs/cgroup
time="2023-09-18T12:03:03Z" level=info msg="Supported cgroup controller 'cpuset' is active on the system" cgroup.controller.hierarchyID=4 cgroup.controller.index=0 cgroup.controller.name=cpuset cgroup.fs=/sys/fs/cgroup
time="2023-09-18T12:03:03Z" level=info msg="Cgroupv1 controller 'memory' will be used" cgroup.controller.hierarchyID=10 cgroup.controller.index=4 cgroup.controller.name=memory cgroup.fs=/sys/fs/cgroup
time="2023-09-18T12:03:03Z" level=info msg="Cgroupv1 hierarchy validated successfully" cgroup.fs=/sys/fs/cgroup cgroup.path=/sys/fs/cgroup/memory/kubepods/burstable/pod4937d3d8-8012-49c9-b579-faf1ee9c1666/5841fd9b21b5d977ab55700c510cd478a7e63aee00db82044b60b67c5275a0ea
time="2023-09-18T12:03:03Z" level=info msg="Deployment mode detection succeeded" cgroup.fs=/sys/fs/cgroup deployment.mode=Kubernetes
time="2023-09-18T12:03:03Z" level=info msg="Updated TetragonConf map successfully" NSPID=3089669 cgroup.controller.hierarchyID=10 cgroup.controller.index=4 cgroup.controller.name=memory cgroup.fs.magic=Cgroupv1 confmap-update=tg_conf_map deployment.mode=Kubernetes log.level=info
time="2023-09-18T12:03:03Z" level=info msg="Enabling Kubernetes API"
time="2023-09-18T12:03:03Z" level=info msg="registering policyfilter pod handlers"
time="2023-09-18T12:03:04Z" level=info msg="Initialized pod cache" num_pods=39
time="2023-09-18T12:03:04Z" level=info msg="Initialized service cache" num_services=546
time="2023-09-18T12:03:04Z" level=info msg="Disabling Cilium API"
time="2023-09-18T12:03:04Z" level=info msg="Starting process manager" enableCilium=false enableK8s=true enableProcessCred=false enableProcessNs=false
time="2023-09-18T12:03:04Z" level=info msg="Starting JSON exporter" logger="&{/var/run/cilium/tetragon/tetragon.log 10 0 5 false false ---------- 0 <nil> {0 0} <nil> {0 {0 0}}}" request="allow_list:{event_set:PROCESS_EXEC  event_set:PROCESS_EXIT  event_set:PROCESS_KPROBE  event_set:PROCESS_UPROBE}  deny_list:{health_check:{value:true}}  deny_list:{namespace:\"\"  namespace:\"tcss\"  namespace:\"kube-system\"}  field_filters:{}"
time="2023-09-18T12:03:04Z" level=info msg="Exporter configuration" enabled=true fileName=/var/run/cilium/tetragon/tetragon.log
time="2023-09-18T12:03:04Z" level=info msg="Successfully detected bpftool path" bpftool=/usr/bin/bpftool
time="2023-09-18T12:03:04Z" level=info msg="BPF: found active BPF resources" bpf-dir=/run/cilium/bpffs/tetragon pinned-bpf="[tg_conf_map]"
time="2023-09-18T12:03:04Z" level=info msg="BTF file: using metadata file" metadata=/sys/kernel/btf/vmlinux
time="2023-09-18T12:03:04Z" level=info msg="Loading sensor" name=__base__
time="2023-09-18T12:03:04Z" level=info msg="Loading kernel version 5.4.119"
time="2023-09-18T12:03:04Z" level=info msg="Starting gRPC server" address="localhost:54321" protocol=tcp
time="2023-09-18T12:03:04Z" level=info msg="tetragon, map loaded." map=execve_map path=/run/cilium/bpffs/tetragon/execve_map sensor=__base__
time="2023-09-18T12:03:04Z" level=info msg="tetragon, map loaded." map=execve_map_stats path=/run/cilium/bpffs/tetragon/execve_map_stats sensor=__base__
time="2023-09-18T12:03:04Z" level=info msg="tetragon, map loaded." map=execve_calls path=/run/cilium/bpffs/tetragon/execve_calls sensor=__base__
time="2023-09-18T12:03:04Z" level=info msg="tetragon, map loaded." map=names_map path=/run/cilium/bpffs/tetragon/names_map sensor=__base__
time="2023-09-18T12:03:04Z" level=info msg="tetragon, map loaded." map=tcpmon_map path=/run/cilium/bpffs/tetragon/tcpmon_map sensor=__base__
time="2023-09-18T12:03:04Z" level=info msg="tetragon, map loaded." map=tg_conf_map path=/run/cilium/bpffs/tetragon/tg_conf_map sensor=__base__
time="2023-09-18T12:03:04Z" level=info msg="Loading BPF program" Program=/var/lib/tetragon/bpf_exit.o Type=kprobe
time="2023-09-18T12:03:06Z" level=info msg="Loading BPF program" Program=/var/lib/tetragon/bpf_fork.o Type=kprobe
time="2023-09-18T12:03:06Z" level=info msg="Loading registered BPF probe" Program=/var/lib/tetragon/bpf_execve_event_v53.o Type=execve
time="2023-09-18T12:03:08Z" level=info msg="Read ProcFS /proc appended 1157/1810 entries"
time="2023-09-18T12:03:08Z" level=warning msg="Procfs execve event pods/ identifier error" error="open /proc/3087789/cgroup: no such file or directory"
time="2023-09-18T12:03:08Z" level=warning msg="Procfs execve event pods/ identifier error" error="open /proc/3087665/cgroup: no such file or directory"
time="2023-09-18T12:03:08Z" level=warning msg="Procfs execve event pods/ identifier error" error="open /proc/3083006/cgroup: no such file or directory"
time="2023-09-18T12:03:08Z" level=warning msg="Procfs execve event pods/ identifier error" error="open /proc/3083189/cgroup: no such file or directory"
time="2023-09-18T12:03:09Z" level=warning msg="Procfs execve event pods/ identifier error" error="open /proc/3088312/cgroup: no such file or directory"
time="2023-09-18T12:03:09Z" level=warning msg="Procfs execve event pods/ identifier error" error="open /proc/0/cgroup: no such file or directory"
time="2023-09-18T12:03:09Z" level=info msg="Loaded BPF maps and events for sensor successfully" sensor=__base__
time="2023-09-18T12:03:09Z" level=info msg="sensor controller starts"
time="2023-09-18T12:03:09Z" level=info msg="Available sensors" sensors=
time="2023-09-18T12:03:09Z" level=info msg="Registered sensors (policy-handlers)" policy-handlers="loader sensor, tracing, uprobe sensor"
time="2023-09-18T12:03:09Z" level=info msg="Registered probe types" types="generic_tracepoint, generic_uprobe, loader, execve, generic_kprobe"
time="2023-09-18T12:03:09Z" level=info msg="Start to load internal policy" TracingPolicy="Default Policies"
time="2023-09-18T12:03:10Z" level=info msg="Added generic kprobe sensor: /var/lib/tetragon/bpf_generic_kprobe_v53.o -> do_init_module" flags=none override=false
time="2023-09-18T12:03:10Z" level=info msg="BTF file: using metadata file" metadata=/sys/kernel/btf/vmlinux
time="2023-09-18T12:03:10Z" level=info msg="Loading sensor" name=gkp-sensor-1
time="2023-09-18T12:03:10Z" level=info msg="Loading kernel version 5.4.119"
time="2023-09-18T12:03:11Z" level=info msg="tetragon, map loaded." map=fdinstall_map path=/run/cilium/bpffs/tetragon/gkp-sensor-1-fdinstall_map sensor=gkp-sensor-1
time="2023-09-18T12:03:12Z" level=info msg="tetragon, map loaded." map=config_map path=/run/cilium/bpffs/tetragon/gkp-sensor-1-gkp-0-config_map sensor=gkp-sensor-1
time="2023-09-18T12:03:13Z" level=info msg="tetragon, map loaded." map=kprobe_calls path=/run/cilium/bpffs/tetragon/gkp-sensor-1-gkp-0-kp_calls sensor=gkp-sensor-1
time="2023-09-18T12:03:13Z" level=info msg="tetragon, map loaded." map=filter_map path=/run/cilium/bpffs/tetragon/gkp-sensor-1-gkp-0-filter_map sensor=gkp-sensor-1
time="2023-09-18T12:03:15Z" level=info msg="tetragon, map loaded." map=argfilter_maps path=/run/cilium/bpffs/tetragon/gkp-sensor-1-gkp-0-argfilter_maps sensor=gkp-sensor-1
time="2023-09-18T12:03:15Z" level=info msg="tetragon, map loaded." map=addr4lpm_maps path=/run/cilium/bpffs/tetragon/gkp-sensor-1-gkp-0-addr4lpm_maps sensor=gkp-sensor-1
time="2023-09-18T12:03:16Z" level=info msg="tetragon, map loaded." map=addr6lpm_maps path=/run/cilium/bpffs/tetragon/gkp-sensor-1-gkp-0-addr6lpm_maps sensor=gkp-sensor-1
time="2023-09-18T12:03:17Z" level=info msg="tetragon, map loaded." map=retprobe_map path=/run/cilium/bpffs/tetragon/gkp-sensor-1-gkp-0-retprobe_map sensor=gkp-sensor-1
time="2023-09-18T12:03:18Z" level=info msg="tetragon, map loaded." map=process_call_heap path=/run/cilium/bpffs/tetragon/gkp-sensor-1-gkp-0-process_call_heap sensor=gkp-sensor-1
time="2023-09-18T12:03:19Z" level=info msg="tetragon, map loaded." map=sel_names_map path=/run/cilium/bpffs/tetragon/gkp-sensor-1-gkp-0-sel_names_map sensor=gkp-sensor-1
time="2023-09-18T12:03:20Z" level=info msg="tetragon, map loaded." map=socktrack_map path=/run/cilium/bpffs/tetragon/gkp-sensor-1-socktrack_map sensor=gkp-sensor-1
time="2023-09-18T12:03:20Z" level=info msg="Loading registered BPF probe" Program=/var/lib/tetragon/bpf_generic_kprobe_v53.o Type=generic_kprobe
time="2023-09-18T12:03:42Z" level=info msg="Loaded generic kprobe program: /var/lib/tetragon/bpf_generic_kprobe_v53.o -> do_init_module"
time="2023-09-18T12:03:42Z" level=info msg="Loaded BPF maps and events for sensor successfully" sensor=gkp-sensor-1
time="2023-09-18T12:03:42Z" level=info msg="Added TracingPolicy with success" metadata.name=kernel-module-load
time="2023-09-18T12:03:43Z" level=warning msg="Kprobe spec pre-validation failed, but will continue with loading" error="type (sock) of argument 0 does not match spec type (sock)\n" sensor=gkp-sensor-2
time="2023-09-18T12:03:43Z" level=info msg="Added generic kprobe sensor: /var/lib/tetragon/bpf_generic_kprobe_v53.o -> tcp_connect" flags=none override=false
time="2023-09-18T12:03:43Z" level=info msg="BTF file: using metadata file" metadata=/sys/kernel/btf/vmlinux
time="2023-09-18T12:03:43Z" level=info msg="Loading sensor" name=gkp-sensor-2
time="2023-09-18T12:03:43Z" level=info msg="Loading kernel version 5.4.119"
time="2023-09-18T12:03:43Z" level=info msg="tetragon, map loaded." map=fdinstall_map path=/run/cilium/bpffs/tetragon/gkp-sensor-2-fdinstall_map sensor=gkp-sensor-2
time="2023-09-18T12:03:44Z" level=info msg="tetragon, map loaded." map=config_map path=/run/cilium/bpffs/tetragon/gkp-sensor-2-gkp-1-config_map sensor=gkp-sensor-2
time="2023-09-18T12:03:45Z" level=info msg="tetragon, map loaded." map=kprobe_calls path=/run/cilium/bpffs/tetragon/gkp-sensor-2-gkp-1-kp_calls sensor=gkp-sensor-2
time="2023-09-18T12:03:46Z" level=info msg="tetragon, map loaded." map=filter_map path=/run/cilium/bpffs/tetragon/gkp-sensor-2-gkp-1-filter_map sensor=gkp-sensor-2
time="2023-09-18T12:03:47Z" level=info msg="tetragon, map loaded." map=argfilter_maps path=/run/cilium/bpffs/tetragon/gkp-sensor-2-gkp-1-argfilter_maps sensor=gkp-sensor-2
time="2023-09-18T12:03:48Z" level=info msg="tetragon, map loaded." map=addr4lpm_maps path=/run/cilium/bpffs/tetragon/gkp-sensor-2-gkp-1-addr4lpm_maps sensor=gkp-sensor-2
time="2023-09-18T12:03:48Z" level=info msg="tetragon, map loaded." map=addr6lpm_maps path=/run/cilium/bpffs/tetragon/gkp-sensor-2-gkp-1-addr6lpm_maps sensor=gkp-sensor-2
time="2023-09-18T12:03:49Z" level=info msg="tetragon, map loaded." map=retprobe_map path=/run/cilium/bpffs/tetragon/gkp-sensor-2-gkp-1-retprobe_map sensor=gkp-sensor-2
time="2023-09-18T12:03:50Z" level=info msg="tetragon, map loaded." map=process_call_heap path=/run/cilium/bpffs/tetragon/gkp-sensor-2-gkp-1-process_call_heap sensor=gkp-sensor-2
time="2023-09-18T12:03:50Z" level=info msg="tetragon, map loaded." map=sel_names_map path=/run/cilium/bpffs/tetragon/gkp-sensor-2-gkp-1-sel_names_map sensor=gkp-sensor-2
time="2023-09-18T12:03:51Z" level=info msg="tetragon, map loaded." map=socktrack_map path=/run/cilium/bpffs/tetragon/gkp-sensor-2-socktrack_map sensor=gkp-sensor-2
time="2023-09-18T12:03:51Z" level=info msg="Loading registered BPF probe" Program=/var/lib/tetragon/bpf_generic_kprobe_v53.o Type=generic_kprobe
time="2023-09-18T12:04:15Z" level=info msg="Loaded generic kprobe program: /var/lib/tetragon/bpf_generic_kprobe_v53.o -> tcp_connect"
time="2023-09-18T12:04:15Z" level=info msg="Loaded BPF maps and events for sensor successfully" sensor=gkp-sensor-2
time="2023-09-18T12:04:15Z" level=info msg="Added TracingPolicy with success" metadata.name=tcp-connect
time="2023-09-18T12:04:16Z" level=info msg="Added generic kprobe sensor: /var/lib/tetragon/bpf_generic_kprobe_v53.o -> __x64_sys_setuid" flags=none override=false
time="2023-09-18T12:04:16Z" level=info msg="BTF file: using metadata file" metadata=/sys/kernel/btf/vmlinux
time="2023-09-18T12:04:16Z" level=info msg="Loading sensor" name=gkp-sensor-3
time="2023-09-18T12:04:16Z" level=info msg="Loading kernel version 5.4.119"
time="2023-09-18T12:04:16Z" level=info msg="tetragon, map loaded." map=fdinstall_map path=/run/cilium/bpffs/tetragon/gkp-sensor-3-fdinstall_map sensor=gkp-sensor-3
time="2023-09-18T12:04:17Z" level=info msg="tetragon, map loaded." map=config_map path=/run/cilium/bpffs/tetragon/gkp-sensor-3-gkp-2-config_map sensor=gkp-sensor-3
time="2023-09-18T12:04:18Z" level=info msg="tetragon, map loaded." map=kprobe_calls path=/run/cilium/bpffs/tetragon/gkp-sensor-3-gkp-2-kp_calls sensor=gkp-sensor-3
time="2023-09-18T12:04:18Z" level=info msg="tetragon, map loaded." map=filter_map path=/run/cilium/bpffs/tetragon/gkp-sensor-3-gkp-2-filter_map sensor=gkp-sensor-3
time="2023-09-18T12:04:19Z" level=info msg="tetragon, map loaded." map=argfilter_maps path=/run/cilium/bpffs/tetragon/gkp-sensor-3-gkp-2-argfilter_maps sensor=gkp-sensor-3
time="2023-09-18T12:04:20Z" level=info msg="tetragon, map loaded." map=addr4lpm_maps path=/run/cilium/bpffs/tetragon/gkp-sensor-3-gkp-2-addr4lpm_maps sensor=gkp-sensor-3
time="2023-09-18T12:04:21Z" level=info msg="tetragon, map loaded." map=addr6lpm_maps path=/run/cilium/bpffs/tetragon/gkp-sensor-3-gkp-2-addr6lpm_maps sensor=gkp-sensor-3
time="2023-09-18T12:04:22Z" level=info msg="tetragon, map loaded." map=retprobe_map path=/run/cilium/bpffs/tetragon/gkp-sensor-3-gkp-2-retprobe_map sensor=gkp-sensor-3
time="2023-09-18T12:04:23Z" level=info msg="tetragon, map loaded." map=process_call_heap path=/run/cilium/bpffs/tetragon/gkp-sensor-3-gkp-2-process_call_heap sensor=gkp-sensor-3
time="2023-09-18T12:04:23Z" level=info msg="tetragon, map loaded." map=sel_names_map path=/run/cilium/bpffs/tetragon/gkp-sensor-3-gkp-2-sel_names_map sensor=gkp-sensor-3
time="2023-09-18T12:04:24Z" level=info msg="tetragon, map loaded." map=socktrack_map path=/run/cilium/bpffs/tetragon/gkp-sensor-3-socktrack_map sensor=gkp-sensor-3
time="2023-09-18T12:04:24Z" level=info msg="Loading registered BPF probe" Program=/var/lib/tetragon/bpf_generic_kprobe_v53.o Type=generic_kprobe
time="2023-09-18T12:04:45Z" level=info msg="Loaded generic kprobe program: /var/lib/tetragon/bpf_generic_kprobe_v53.o -> __x64_sys_setuid"
time="2023-09-18T12:04:45Z" level=info msg="Loaded BPF maps and events for sensor successfully" sensor=gkp-sensor-3
time="2023-09-18T12:04:45Z" level=info msg="Added TracingPolicy with success" metadata.name=sys-setuid
time="2023-09-18T12:04:45Z" level=info msg="Perf ring buffer size (bytes)" percpu=69632 total=3342336
time="2023-09-18T12:04:45Z" level=info msg="Listening for events..."
I0918 12:07:04.317893 3089669 trace.go:219] Trace[598696473]: "DeltaFIFO Pop Process" ID:kube-system/tkex-security-sync-95c6c5654-x4jkr,Depth:30,Reason:slow event handlers blocking the queue (18-Sep-2023 12:07:04.117) (total time: 100ms):
Trace[598696473]: [100.050478ms] [100.050478ms] END
I0918 12:12:05.017975 3089669 trace.go:219] Trace[468173039]: "DeltaFIFO Pop Process" ID:kube-system/csi-cbs-node-8chdf,Depth:35,Reason:slow event handlers blocking the queue (18-Sep-2023 12:12:04.118) (total time: 899ms):
Trace[468173039]: [899.780442ms] [899.780442ms] END
I0918 12:18:04.418931 3089669 trace.go:219] Trace[576527658]: "DeltaFIFO Pop Process" ID:ns-prjm987r-1639428-production/live-pullpush-interface-mumbai-0-0,Depth:19,Reason:slow event handlers blocking the queue (18-Sep-2023 12:18:04.318) (total time: 100ms):
Trace[576527658]: [100.496195ms] [100.496195ms] END

Anything else?

No response

Jay-JHe added the kind/bug label Sep 18, 2023
kkourt (Contributor) commented Sep 18, 2023

Hi,

Thanks for the report, and the detailed information!

I see that you are deploying a number of tracing policies. Does the issue appear if you don't use any policies at all? If not, is it possible to pinpoint a single policy that leads to this issue?

Also, as far as I understand, the issue does not appear on a different cluster with the same Tetragon configuration? Could you please provide the same information from the cluster where the issue does not occur?

Jay-JHe (Author) commented Sep 19, 2023

@kkourt Hi,
I've removed all policies from the production cluster and found that memory usage still keeps increasing. I also raised the CPU limit to 500m, but it had no effect on the growing memory, which continues. Below is the bugtool file captured after removing the policies from the production cluster.
tetragon-bugtool-production.tar.gz

I've noticed that Tetragon's memory in the test cluster is also constantly increasing, just not as rapidly as in the production environment, and that cluster has no resource limits at all. Here's the bugtool file from the test cluster:
tetragon-bugtoo-testl.tar.gz

Is there any data or report on Tetragon's resource usage while it runs? Tetragon's memory usage seems to grow steadily the longer it has been running.

Here is the trend chart showing the increase in memory usage.

[screenshot: memory usage trend]

kkourt (Contributor) commented Sep 19, 2023

Thank you for the info!

May I ask if there's any data report concerning the resource usage of Tetragon during its running? I've noticed that Tetragon's memory usage seems to increase consistently with its prolonged operation time.

One known cause of a growing memory footprint was having pod labels as a dimension in Prometheus metrics.

This issue was recently addressed in: #1279, which is part of 0.11 (see: 16a9408). I'm not aware of any other increasing memory issues (especially if there are no policies loaded).
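To illustrate the kind of growth this causes (illustrative Go only, not Tetragon's actual metrics code; the metric and label names below are made up): a labelled metric keeps one child series per distinct label value, and the Prometheus Go client retains every child until it is explicitly deleted, so pods that are constantly recreated turn into an ever-growing set of series.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// eventsTotal carries a "pod" label: every distinct pod name observed creates
// a new child counter, and the registry keeps each child until it is deleted.
var eventsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "example_events_total",
		Help: "Events processed, partitioned by pod (illustrative only).",
	},
	[]string{"pod"},
)

func main() {
	prometheus.MustRegister(eventsTotal)

	// In a cluster where pods are constantly recreated, the pod name keeps
	// taking new values, so memory grows unless stale children are removed
	// (e.g. with eventsTotal.DeleteLabelValues) or the label is dropped.
	for _, podName := range []string{"pod-a", "pod-b", "pod-c"} {
		eventsTotal.WithLabelValues(podName).Inc()
	}
}
```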

Could you also check with metrics disabled and see if the issue persists?

Jay-JHe (Author) commented Sep 19, 2023

@kkourt Thank you for your reply. I have removed the CPU resource constraints and disabled metrics: no resource limits, no TracingPolicies, and no metrics.

Unfortunately, during testing I found that memory usage still increases gradually. I believe this indicates an issue within Tetragon, possibly a memory leak.

Below is the memory usage chart for the 19 hours after metrics were disabled. As you can see, memory is still slowly increasing.
[screenshot: memory usage chart]

The Tetragon bugtool output is:
tetragon-disable-metric-bugtool.tar.gz

Jay-JHe (Author) commented Sep 25, 2023

@kkourt I discovered that this issue is caused by processes that are never garbage-collected, which makes the number of entries in the process cache grow continuously.

The original process PID of all the uncollected processes is 1. I also noticed that their reference count values typically range from 4294967200 to 4294967295 (just below 2^32), which I believe means the reference counter received more Dec() than Inc() operations and wrapped around below zero, leaving the count non-zero. As a result, these processes are skipped during garbage collection, which eventually leads to this problem.
[screenshot: uncollected process cache entries with wrapped reference counts]

After modifying the RefDec function, memory no longer increased and the processes were garbage-collected successfully; the change is shown below.
[screenshot: modified RefDec function]
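Since the screenshot is not reproduced here, the following is only a rough sketch of that kind of guard (illustrative Go; the type and field names are assumed and do not match Tetragon's actual code): the decrement refuses to go below zero, so an unbalanced Dec() can no longer wrap the unsigned counter and keep the entry alive forever.

```go
package cache

import (
	"log"
	"sync/atomic"
)

// processInternal is an illustrative stand-in for a cached process entry.
type processInternal struct {
	refcnt uint32
}

// RefInc increments the reference count.
func (p *processInternal) RefInc() {
	atomic.AddUint32(&p.refcnt, 1)
}

// RefDec decrements the reference count but never lets it go below zero.
// Without the guard, one extra Dec() on a zero counter wraps the uint32
// around to 4294967295 and the entry is never eligible for GC again.
func (p *processInternal) RefDec() {
	for {
		old := atomic.LoadUint32(&p.refcnt)
		if old == 0 {
			log.Printf("refcnt already zero; ignoring extra RefDec")
			return
		}
		if atomic.CompareAndSwapUint32(&p.refcnt, old, old-1) {
			return
		}
	}
}
```

A guard like this only masks the underflow, of course; the rest of the thread is about finding where the superfluous Dec() calls actually come from.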

However, there are some pieces of the code I'm struggling to understand, so I have not yet identified the cause of the superfluous Dec() calls on the reference count. I would appreciate your assistance in investigating this further.
process_cache.log

tpapagian (Member) commented

Hello @Jay-JHe, thanks for checking into that. What is the workload that the cluster runs?

To be more specific, I would like to know whether new pods/containers are continuously created and destroyed, or whether the same pods/containers run for long periods of time. Thanks!

Jay-JHe (Author) commented Sep 25, 2023

@tpapagian Yes, I noticed that some pods in the cluster are constantly being restarted and recreated. In addition, some Tetragon pods fail to start because the host kernel version is lower than 5.4, and they are also constantly being restarted and recreated.
[screenshot: restarting pods]
[screenshot: restarting pods]

tpapagian (Member) commented

OK, thanks, that makes sense! I will try to reproduce the issue locally and let you know of any updates.

tixxdz (Member) commented Sep 25, 2023

Thanks @Jay-JHe for those. We have already identified one cause (#1507 (comment)); we will try to fix it and see whether it is the only one.

Jay-JHe (Author) commented Sep 27, 2023

@tpapagian @tixxdz I think this issue is caused by improper handling of concurrency. When hooking 'do_task_dead', it is necessary to check whether the current task is the main process (the thread-group leader). Here is my bug fix.
[screenshot: proposed patch]
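The patch screenshot is not reproduced here. The proposal targets the BPF exit program, but the predicate itself is simple, so here is a rough Go sketch of the same idea (all names are assumed; none of this is Tetragon's actual code): treat an exit as a process exit, and release the cached entry, only when the exiting task is the thread-group leader, i.e. its PID equals its TGID.

```go
package main

import "fmt"

// exitEvent is an illustrative stand-in for the message produced by the exit
// hook; the field names are assumed.
type exitEvent struct {
	PID  uint32 // id of the exiting task
	TGID uint32 // thread-group id (the "process" id)
}

// handleExit releases the cached process entry only for thread-group-leader
// exits; exits of other threads in the group leave the refcount untouched.
func handleExit(ev exitEvent, refDec func(tgid uint32)) {
	if ev.PID != ev.TGID {
		// Not the main process (thread-group leader): skip the decrement
		// that would otherwise happen once per exiting thread.
		return
	}
	refDec(ev.TGID)
}

func main() {
	dec := func(tgid uint32) { fmt.Printf("refdec for tgid %d\n", tgid) }
	handleExit(exitEvent{PID: 1234, TGID: 1200}, dec) // thread exit: no decrement
	handleExit(exitEvent{PID: 1200, TGID: 1200}, dec) // leader exit: decrement
}
```

As the replies below point out, keying only on the leader is not sufficient on its own, because non-leader threads can outlive the leader.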

tpapagian (Member) commented

CC: @olsajiri

tixxdz (Member) commented Sep 27, 2023

Thanks for the proposed fix, but this could prevent the event from being received at all. If I'm reading it correctly, this will only trigger when the leader is exiting and all the other threads are exiting as well (racing on the counter). But there are cases where those other (non-leader) threads keep running after the leader has actually finished, and in that case we won't send an event.

olsajiri (Contributor) commented

Yes, I think that's the case. It's also still racy with respect to live.counter.

@Jay-JHe fyi we have pending fix in here #1509

jrfastab added the release-blocker label Oct 20, 2023
kkourt (Contributor) commented Oct 23, 2023

fixed by #1509.

kkourt closed this as completed Oct 23, 2023