
Lots of "syscall event drop" since 0.32.1 #2129

Closed · wdoekes opened this issue Jul 13, 2022 · 16 comments

@wdoekes

wdoekes commented Jul 13, 2022

Describe the bug

Since Falco 0.32.1 I'm getting many of these:

  • Falco internal: syscall event drop. 1538124 system calls dropped in last second.
  • Syscall event drop but token bucket depleted, skipping actions

This was not the case with Falco 0.32.0 as far as I can see.

So one of two things happened:

  • either I changed my rules drastically,
  • or something in Falco 0.32.1 changed to cause this.
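
For reference, the rate limiting behind the "token bucket depleted" messages above comes from the syscall_event_drops section of falco.yaml. A representative snippet with the stock defaults (the values in our deployments may differ):

  syscall_event_drops:
    actions:
      - log
      - alert
    # roughly one action token every 30 seconds
    rate: .03333
    max_burst: 1

Once the bucket is empty, further drops only produce the "skipping actions" line instead of running the configured actions.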

Falco 0.32.0 behaviour

$ sudo journalctl -t 64516b56d7be -o cat | uniq -c | sed -e 's/I: .*/I: event fired/'
      1 * Setting up /usr/src links from host
      1 * Running falco-driver-loader for: falco version=0.32.0, driver version=39ae7d40496793cf3d3e7890c9bbdc202263836b
      1 * Running falco-driver-loader with: driver=bpf, compile=yes, download=yes
      1 * Mounting debugfs
      1 * Trying to download a prebuilt eBPF probe from https://...
      1 * Skipping compilation, eBPF probe is already present in /root/.falco/falco_ubuntu-generic_5.4.0-100-generic_113.o
      1 * eBPF probe located in /root/.falco/falco_ubuntu-generic_5.4.0-100-generic_113.o
      1 * Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
      1 Mon Jul 11 16:41:49 2022: Falco version 0.32.0 (driver version 39ae7d40496793cf3d3e7890c9bbdc202263836b)
      1 Mon Jul 11 16:41:49 2022: Falco initialized with configuration file /etc/falco/falco.yaml
      1 Mon Jul 11 16:41:49 2022: Configured rules filenames:
      1 Mon Jul 11 16:41:49 2022:    /etc/falco/falco_rules.yaml
      1 Mon Jul 11 16:41:49 2022:    /etc/falco/falco_rules.local.yaml
      1 Mon Jul 11 16:41:49 2022:    /etc/falco/rules.d
      1 Mon Jul 11 16:41:49 2022: Loading rules from file /etc/falco/falco_rules.yaml:
      1 Mon Jul 11 16:41:49 2022: Loading rules from file /etc/falco/falco_rules.local.yaml:
      1 Mon Jul 11 16:41:49 2022: Loading rules from file /etc/falco/rules.d/xxx.yaml:
      1 Mon Jul 11 16:41:49 2022: Loading rules from file /etc/falco/rules.d/xxx.yaml:
      1 Mon Jul 11 16:41:50 2022: Watching /etc/falco/falco.yaml
      1 Mon Jul 11 16:41:50 2022: Watching /etc/falco/falco_rules.yaml.
      1 Mon Jul 11 16:41:50 2022: Watching /etc/falco/falco_rules.local.yaml.
      1 Mon Jul 11 16:41:50 2022: Watching /etc/falco/rules.d/xxx.yaml.
      1 Mon Jul 11 16:41:50 2022: Watching /etc/falco/rules.d/xxx.yaml.
      1 Mon Jul 11 16:41:50 2022: Watching /etc/falco/rules.d folder.
      1 Mon Jul 11 16:41:50 2022: Setting metadata download max size to 100 MB
      1 Mon Jul 11 16:41:50 2022: Setting metadata download chunk wait time to 1000 μs
      1 Mon Jul 11 16:41:50 2022: Setting metadata download watch frequency to 1 seconds
      1 Mon Jul 11 16:41:50 2022: Starting internal webserver, listening on port 8765
      1 Tue Jul 12 09:54:10 2022: SIGINT received, exiting...
      1 Syscall event drop monitoring:
      1    - event drop detected: 0 occurrences
      1    - num times actions taken: 0
      1 Events detected: 0
      1 Rule counts by severity:
      1 Triggered rules by rule name:

Falco 0.32.1 behaviour

$ sudo journalctl -t f828f2631248 -o cat | uniq -c | sed -e 's/I: .*/I: event fired/'
      1 * Setting up /usr/src links from host
      1 * Running falco-driver-loader for: falco version=0.32.1, driver version=2.0.0+driver
      1 * Running falco-driver-loader with: driver=bpf, compile=yes, download=yes
      1 * Mounting debugfs
      1 * Trying to download a prebuilt eBPF probe from https://...
      1 * Skipping compilation, eBPF probe is already present in /root/.falco/falco_ubuntu-generic_5.4.0-100-generic_113.o
      1 * eBPF probe located in /root/.falco/falco_ubuntu-generic_5.4.0-100-generic_113.o
      1 * Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
      1 Tue Jul 12 16:37:32 2022: Falco version 0.32.1
      1 Tue Jul 12 16:37:32 2022: Falco initialized with configuration file /etc/falco/falco.yaml
      1 Tue Jul 12 16:37:32 2022: Configured rules filenames:
      1 Tue Jul 12 16:37:32 2022:    /etc/falco/falco_rules.yaml
      1 Tue Jul 12 16:37:32 2022:    /etc/falco/falco_rules.local.yaml
      1 Tue Jul 12 16:37:32 2022:    /etc/falco/rules.d
      1 Tue Jul 12 16:37:32 2022: Loading rules from file /etc/falco/falco_rules.yaml:
      1 Tue Jul 12 16:37:32 2022: Loading rules from file /etc/falco/falco_rules.local.yaml:
      1 Tue Jul 12 16:37:32 2022: Loading rules from file /etc/falco/rules.d/xxx.yaml:
      1 Tue Jul 12 16:37:32 2022: Loading rules from file /etc/falco/rules.d/xxx.yaml:
      1 Tue Jul 12 16:37:32 2022: Loading rules from file /etc/falco/rules.d/xxx.yaml:
      1 Tue Jul 12 16:37:32 2022: Watching /etc/falco/falco.yaml
      1 Tue Jul 12 16:37:32 2022: Watching /etc/falco/falco_rules.yaml.
      1 Tue Jul 12 16:37:32 2022: Watching /etc/falco/falco_rules.local.yaml.
      1 Tue Jul 12 16:37:32 2022: Watching /etc/falco/rules.d/xxx.yaml.
      1 Tue Jul 12 16:37:32 2022: Watching /etc/falco/rules.d/xxx.yaml.
      1 Tue Jul 12 16:37:32 2022: Watching /etc/falco/rules.d/xxx.yaml.
      1 Tue Jul 12 16:37:32 2022: Watching /etc/falco/rules.d folder.
      1 Tue Jul 12 16:37:32 2022: Setting metadata download max size to 100 MB
      1 Tue Jul 12 16:37:32 2022: Setting metadata download chunk wait time to 1000 μs
      1 Tue Jul 12 16:37:32 2022: Setting metadata download watch frequency to 1 seconds
      1 Tue Jul 12 16:37:32 2022: Starting internal webserver, listening on port 8765
      1 Tue Jul 12 16:38:21 2022: Falco internal: syscall event drop. 380827 system calls dropped in last second.
      5 Tue Jul 12 16:38:22 2022: Syscall event drop but token bucket depleted, skipping actions
      1 Tue Jul 12 17:31:26 2022: Falco internal: syscall event drop. 187032 system calls dropped in last second.
      2 Tue Jul 12 17:31:27 2022: Syscall event drop but token bucket depleted, skipping actions
     16 Tue Jul 12 17:31:28 2022: Syscall event drop but token bucket depleted, skipping actions
      1 Tue Jul 12 17:31:28 2022: Falco internal: syscall event drop. 14 system calls dropped in last second.
      1 Tue Jul 12 18:30:08 2022: Falco internal: syscall event drop. 115867 system calls dropped in last second.
     23 Tue Jul 12 18:30:09 2022: Syscall event drop but token bucket depleted, skipping actions
      1 Tue Jul 12 18:30:09 2022: Falco internal: syscall event drop. 66 system calls dropped in last second.
      1 Tue Jul 12 19:05:16 2022: Falco internal: syscall event drop. 145866 system calls dropped in last second.
     24 Tue Jul 12 19:05:17 2022: Syscall event drop but token bucket depleted, skipping actions
      1 Tue Jul 12 19:05:17 2022: Falco internal: syscall event drop. 13 system calls dropped in last second.
      1 Tue Jul 12 19:59:38 2022: Falco internal: syscall event drop. 252247 system calls dropped in last second.
     25 Tue Jul 12 19:59:43 2022: Syscall event drop but token bucket depleted, skipping actions
...

How to reproduce it

Not sure yet.

Next steps

I've reverted the first of my test environments to 0.32.0. If the errors stay gone, I must conclude that 0.32.1 is to blame and not my changed rules.

I'll be back to confirm.

Environment

  • Linux, Ubuntu, Focal
  • Falco 0.32.1, deployed through Helm chart 1.19.4, but with the daemonset updated to point to image 0.32.1.
  • (Note to self: logs from io cluster)

Additional context

These errors popped up on both master and worker k8s nodes AND on different clusters at once, so I would rule out that this is due to some external performance spike. The only things they have in common are the rules and the Falco version.

Cheers,
Walter Doekes
OSSO B.V.

@Andreagit97
Member

That's interesting @wdoekes, thank you for pointing this out! Just two questions:

  • From the logs, it seems you are using the BPF probe; are you able to test the same thing with the kernel module? It would help us a lot in understanding where the problem is 👀
  • Which kernel versions are involved?

@wdoekes
Author

wdoekes commented Jul 14, 2022

There is definitely something different with 0.32.1.

Cluster # 1 (io):

$ sudo kubectl -n kube-falco logs -l app=falco --max-log-requests=100 --timestamps --tail=20000 --prefix=true | grep -F 'yscall event drop'
[pod/falco-wkqcj/falco] 2022-07-14T00:22:04.337702000Z Thu Jul 14 02:22:04 2022: Falco internal: syscall event drop. 180 system calls dropped in last second.

^- 0.32.0: just one notice

Cluster # 2 (mpr):

$ sudo kubectl -n kube-falco logs -l app=falco --max-log-requests=100 --timestamps --tail=20000 --prefix=true | grep -F 'yscall event drop'

^- 0.32.0: silence

Cluster # 3 (mac):

$ sudo kubectl -n kube-falco logs -l app=falco --max-log-requests=100 --timestamps --tail=20000 --prefix=true | grep -F 'yscall event drop' | wc -l
2018

^- 0.32.1: we have 2018 lines of "[Ss]yscall event drop"

All of these were started at around 15:00 yesterday, and I'm looking at them 18 hours later.

Yesterday, when they were all running 0.32.1, all of them were showing many of those log lines. Today, only the one still running 0.32.1 exhibits excessive amounts of them.

(I'll follow up with answers to your questions.)

@leogr
Member

leogr commented Jul 14, 2022

cc @LucaGuerra @FedeDP

@FedeDP
Contributor

FedeDP commented Jul 14, 2022

Agree with @Andreagit97's question here; are you able to try with the kmod?
We must try to understand who the culprit is :/
Thanks for reporting btw!

@wdoekes
Author

wdoekes commented Jul 14, 2022

@Andreagit97:


Which kernel versions are involved?

$ sudo kubectl -n kube-falco logs -l app=falco --max-log-requests=100 --timestamps --tail=20000 --prefix=true | grep 'probe located' | sed -e 's/.*located in //' | sort -u
/root/.falco/falco_ubuntu-generic_5.4.0-100-generic_113.o

$ sudo kubectl -n kube-falco logs -l app=falco --max-log-requests=100 --timestamps --tail=20000 --prefix=true | grep 'probe located' | sed -e 's/.*located in //' | sort -u
/root/.falco/falco_ubuntu-generic_5.4.0-109-generic_123.o

$ sudo kubectl -n kube-falco logs -l app=falco --max-log-requests=100 --timestamps --tail=20000 --prefix=true | grep 'probe located' | sed -e 's/.*located in //' | sort -u
/root/.falco/falco_ubuntu-generic_5.4.0-107-generic_121.o

(Also seen in the logs above.)

All of them are Ubuntu/Focal generic kernels.

E.g.:

$ uname -a
Linux xxx 5.4.0-100-generic #113-Ubuntu SMP Thu Feb 3 18:43:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Are you able to test the same thing with the kernel module?

Running them now.

Cluster # 1 (io, running Falco 0.32.1, dkms):

$ sudo kubectl -n kube-falco logs -l app=falco --max-log-requests=100 --tail=20000  | grep -E 'Falco version|yscall event drop|loaded in dkms' | sed -e 's/^.*2022: //' | uniq
* Success: falco module found and loaded in dkms
Falco version 0.32.1
...
$ sudo kubectl -n kube-falco logs -l app=falco --max-log-requests=100 --tail=20000  | grep -E 'Falco version|yscall event drop|loaded in dkms' | sed -e 's/^.*2022: //' | grep event | sort | uniq -c
      1 Falco internal: syscall event drop. 1318 system calls dropped in last second.
      1 Falco internal: syscall event drop. 1532767 system calls dropped in last second.
      1 Falco internal: syscall event drop. 2069197 system calls dropped in last second.
      1 Falco internal: syscall event drop. 2447005 system calls dropped in last second.
      1 Falco internal: syscall event drop. 341298 system calls dropped in last second.
      1 Falco internal: syscall event drop. 353175 system calls dropped in last second.
      1 Falco internal: syscall event drop. 354492 system calls dropped in last second.
     48 Syscall event drop but token bucket depleted, skipping actions

So on 0.32.1 with DKMS, we already have plenty of drops after about 15 minutes.

Cluster # 3 (mac, running Falco 0.32.0, dkms):

$ sudo kubectl -n kube-falco logs -l app=falco --max-log-requests=100 --tail=20000  | grep -E 'Falco version|yscall event drop|loaded in dkms' | sed -e 's/^.*2022: //'
* Success: falco module found and loaded in dkms
Falco version 0.32.0 (driver version 39ae7d40496793cf3d3e7890c9bbdc202263836b)
...
$ sudo kubectl -n kube-falco logs -l app=falco --max-log-requests=100 --tail=20000  | grep -E 'Falco version|yscall event drop|loaded in dkms' | sed -e 's/^.*2022: //' | grep event | sort | uniq -c
      1 Falco internal: syscall event drop. 784899 system calls dropped in last second.
      1 Syscall event drop but token bucket depleted, skipping actions

Just two log lines with Falco 0.32.0 and DKMS, after about 20 minutes.

I think we can rule out the choice of module type.

@Andreagit97
Member

Thank you @wdoekes for reporting this! These data are super useful! I suspect that for some reason the Falco 0.32.1 userspace is slower than the 0.32.0 one; we are not sure yet and have to check, but your data lead me in this direction 🤔

Just one last thing 🙏 it would be great if we could understand which type of drops these are (e.g. whether they are due to full buffers or to wrong instrumentation). To get this information, you can simply search your logs for something like n_drops; you should find something like this:

Critical Falco internal: syscall event drop. 10 system calls dropped in last second. (ebpf_enabled=1 n_drops=10 n_drops_buffer=10 n_drops_bug=0 n_drops_pf=0 n_evts=328)
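
To pull just those lines out of the pod logs, something like the kubectl invocation used earlier in this thread should work (namespace and labels are from this setup; adjust as needed):

$ sudo kubectl -n kube-falco logs -l app=falco --max-log-requests=100 --tail=20000 | grep -F 'n_drops'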

In this way we could understand what is going on 🤔

@Andreagit97
Member

Could you also compare the two versions (0.32.0, 0.32.1) with the k8s client disabled, i.e. without the -k option? This could be a possible cause... 🤔

@wdoekes
Author

wdoekes commented Jul 14, 2022

To get this information, you can simply search your logs for something like n_drops; you should find something like this:

I'm sorry, I don't have any lines that contain "Critical" or "n_drops".

Helm chart log levels are set to debug:

  # Loglevels: emergency, alert, critical, error, warning, notice,
  # informational, debug
  #logLevel: warning
  logLevel: debug
  priority: warning

[...] without the -k option?

I've dropped everything from -k onwards from the daemonset args:

      containers:
      - args:
        - /usr/bin/falco
        - --cri
        - /run/containerd/containerd.sock
        - --cri
        - /run/crio/crio.sock
        - -K
        - /var/run/secrets/kubernetes.io/serviceaccount/token

(i.e. I removed the following:)

        - -k
        - https://$(KUBERNETES_SERVICE_HOST)
        - --k8s-node
        - $(FALCO_K8S_NODE_NAME)
        - -pk

Now k8s.pod.name is null, for obvious reasons. But I have not seen an event drop in the 4 minutes since I started.

Let me run this for a while (with 0.32.1 and eBPF), and I'll get back to you...

@wdoekes
Author

wdoekes commented Jul 14, 2022

Ok. Ran without:

        - -k 
        - https://$(KUBERNETES_SERVICE_HOST) 
        - --k8s-node 
        - $(FALCO_K8S_NODE_NAME) 
        - -pk 

for 20 minutes, and had 0 "event drop" in the logs.

Restarted with the above arguments, and instantly got 45 log entries:

[pod/falco-w6vql/falco] 2022-07-14T09:29:06.602610000Z Thu Jul 14 11:29:06 2022: Starting internal webserver, listening on port 8765
[pod/falco-w6vql/falco] 2022-07-14T09:30:38.587254000Z Thu Jul 14 11:30:38 2022: Falco internal: syscall event drop. 2838222 system calls dropped in last second.
[pod/falco-4dtsq/falco] 2022-07-14T09:29:49.161864000Z Thu Jul 14 11:29:49 2022: Falco internal: syscall event drop. 421997 system calls dropped in last second.
[pod/falco-4dtsq/falco] 2022-07-14T09:29:54.400530000Z Thu Jul 14 11:29:54 2022: Syscall event drop but token bucket depleted, skipping actions

(etc.)

Looks like the -k mechanics are indeed the culprit.

@Andreagit97
Member

Yeah, it seems we have found our culprit! The funny thing is that this is due to a bug fix, which you can see here. In a nutshell, in Falco 0.32.0 we didn't collect data from daemonsets, deployments, and replicasets; in Falco 0.32.1 we fixed that. The problem is that the userspace now requests much more data from the API server, so the overall processing logic becomes slower, hence the drops. So this is not a bug, but the correct behavior :/ BTW we have to find some way to mitigate this effect... Thank you very much for pointing this out!!

@wdoekes
Author

wdoekes commented Jul 14, 2022

Ah. Good to know. Thanks for the feedback 👍

@jasondellaluce
Contributor

Since we seem to have solved the mystery, I'd suggest linking this to #1403 (I already asked to add this issue to that list) for better traceability, and closing this one. What does everyone think?

@FedeDP
Contributor

FedeDP commented Jul 14, 2022

Agree 👍

@poiana

poiana commented Oct 12, 2022

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@poiana

poiana commented Nov 12, 2022

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

@leogr
Member

leogr commented Nov 23, 2022

Since we seem to have solved the mystery, I'd suggest linking this to #1403 (I already asked to add this issue to that list) for better traceability, and closing this one. What does everyone think?

👍
