
Falco on GKE - dropped syscall events #669

Closed
caquino opened this issue Jun 13, 2019 · 93 comments

Comments

@caquino

caquino commented Jun 13, 2019

What happened: Falco is constantly dropping events on a really small GKE cluster (3 nodes, 9 containers excluding Google services containers).

What you expected to happen: I would expect that, for a cluster of this size, Falco would have no issues handling its events.

How to reproduce it (as minimally and precisely as possible): Based on my tests, simply deploying a basic GKE cluster with the monitoring and metrics extensions enabled is enough to cause drops.

Anything else we need to know?:

Environment:

  • Falco version (use falco --version): 0.15.3
  • System info
{
  "machine": "x86_64",
  "nodename": "gke-ai-cloud-us-central1-default-pool-b3942f02-7mjv",
  "release": "4.14.119+",
  "sysname": "Linux",
  "version": "#1 SMP Wed May 15 17:44:01 PDT 2019"
}
  • Cloud provider or hardware configuration: Google Cloud
  • OS (e.g: cat /etc/os-release): COS
  • Kernel (e.g. uname -a): 4.14.119+
  • Install tools (e.g. in kubernetes, rpm, deb, from source): kubernetes
  • Others:
    The logs are flooded with syscall event drop messages; the only other event showing up is the following:
12:50:28.988592573: Warning Log files were tampered (user=root command=pos_writer.rb:* -Eascii-8bit:ascii-8bit /usr/sbin/google-fluentd --under-supervisor file=/var/log/gcp-journald-kubelet.pos) k8s.ns=kube-system k8s.pod=fluentd-gcp-v3.1.1-6xbdr container=e84a9479a965 k8s.ns=kube-system k8s.pod=fluentd-gcp-v3.1.1-6xbdr container=e84a9479a965

I assume this is generating enough events to cause the syscall drops. Is there anything I can do, e.g. filter out these events or this process?

I've checked the other similar issue, and I'm using KUBERNETES_SERVICE_HOST as described there.

Log snippet:

13:03:18.688024362: Warning Log files were tampered (user=root command=pos_writer.rb:* -Eascii-8bit:ascii-8bit /usr/sbin/google-fluentd --under-supervisor file=/var/log/gcp-journald-kubelet.pos) k8s.ns=kube-system k8s.pod=fluentd-gcp-v3.1.1-6fmpm container=9b27b6fc8460 k8s.ns=kube-system k8s.pod=fluentd-gcp-v3.1.1-6fmpm container=9b27b6fc8460
13:03:34.485317730: Critical Falco internal: syscall event drop. 11 system calls dropped in last second.(ebpf_enabled=1 n_drops=11 n_drops_buffer=11 n_drops_bug=0 n_drops_pf=0 n_evts=92933)
13:04:20.686678200: Critical Falco internal: syscall event drop. 4 system calls dropped in last second.(ebpf_enabled=1 n_drops=4 n_drops_buffer=4 n_drops_bug=0 n_drops_pf=0 n_evts=8090)
13:04:24.654208479: Critical Falco internal: syscall event drop. 4 system calls dropped in last second.(ebpf_enabled=1 n_drops=4 n_drops_buffer=4 n_drops_bug=0 n_drops_pf=0 n_evts=11527)
@fntlnz
Contributor

fntlnz commented Jun 13, 2019

Thanks for opening this @caquino - I need some time to look into it because I need to get access to a GCP account to run tests.

@fntlnz
Contributor

fntlnz commented Jun 13, 2019

/assign @fntlnz
/assign @leodido

@michiels

FWIW, I'm getting the same on a 3-node (1 CPU / 2 GB RAM) DigitalOcean Kubernetes cluster. There are a bunch of additional things already deployed:

  • Traefik ingress controller
  • A rails app with a web and background process
  • Logspout logging pipeline.

Do these errors usually mean the cluster's capacity has been reached?

@caquino
Author

caquino commented Jun 14, 2019

In my case the cluster has 3 n1-standard-2 (2 vCPUs, 7.5 GB memory) nodes and, apart from what comes pre-configured on the cluster by Google, it is running only a Deployment with nginx and PHP.

This is not a production cluster and has no traffic yet, so I would not expect this to be a capacity issue.

@nuala33

nuala33 commented Jun 21, 2019

I am also having the same issue ("syscall event drop" logs several times an hour) in my almost completely unused GKE cluster with 3 nodes.

Re: the fluentd log-tampering alerts, I created #684 to address them, and there I describe the fix that stopped those alerts from flowing in.

@kbrown

kbrown commented Jul 10, 2019

Continuously seeing this after installing k8s-with-rbac/falco-daemonset-configmap.yaml in a 3-node GKE environment:

Falco internal: syscall event drop. 7 system calls dropped in last second.
Falco internal: syscall event drop. 15 system calls dropped in last second.

Occurs in:
falcosecurity/falco:dev
falcosecurity/falco:latest

Not yet checked:
falcosecurity/falco:15.0.x (container won't start with config from dev branch)

@leodido
Member

leodido commented Jul 11, 2019

Just to be sure: which syscall_event_drops.rate and syscall_event_drops.max_burst config values are you folks using?

The defaults?
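
For reference, this is roughly where those two settings live in falco.yaml (a sketch; the exact default values vary between Falco versions, so treat the numbers as illustrative rather than authoritative):

syscall_event_drops:
  actions:
    - log
    - alert
  rate: .03333     # tokens (allowed drop actions) refilled per second
  max_burst: 10    # maximum burst of drop actions taken at once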

@metalsong

@leodido I've just installed Falco on GKE using:

helm install --name falco stable/falco --set ebpf.enabled=true

I'm getting exactly the same errors.

@kbrown

kbrown commented Jul 11, 2019

@leodido

  1. I used the defaults first,
  2. then tried increased limits:

syscall_event_drops:
  actions:
    - log
    - alert
  rate: 1
  max_burst: 1000

Same result. Maybe I should have decreased the values instead?

Using falco-daemonset-configmap.yaml with the following changes:

      env:
      - name: SYSDIG_BPF_PROBE
        value: ""
      - name: KBUILD_EXTRA_CPPFLAGS
        value: -DCOS_73_WORKAROUND

@caquino
Author

caquino commented Jul 11, 2019

Same here, I'm using the default configuration.

@DannyPat44

Has anyone found a workaround for this issue?

@qbast

qbast commented Aug 5, 2019

I see the same behaviour with Falco 0.17 on a five-node GKE cluster. CPU load on the nodes is between 18% and 35%; Falco is using 1-5%.

@Aaron-ML

Seeing this in Azure as well, with Falco using abysmal amounts of CPU while running the default settings.

@bgeesaman
Contributor

bgeesaman commented Aug 13, 2019

I'm also seeing 1 n_drops per node every 60 to 61 minutes on my COS 4.14.127+ v1.12.8-gke.10 GKE cluster (8 CPUs / 30+ GB RAM nodes), running falcosecurity/falco:latest.

@arthurk

arthurk commented Sep 9, 2019

I have the same issue. As a workaround, I have set the action to "ignore" until this feature works on GKE.
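
In case it helps anyone else, this is roughly what the override looks like in falco.yaml (a sketch; adjust to however you supply your configuration, e.g. a ConfigMap or helm values):

syscall_event_drops:
  actions:
    - ignore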

@cvernooy23

I'm seeing this with Falco installed as a binary on the host OS on AWS EC2 instances running CentOS 7 with a 5.3 mainline kernel.

@stale

stale bot commented Nov 25, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Nov 25, 2019
@shane-lawrence
Contributor

Could we keep this open until it has been resolved?

@stale stale bot removed the wontfix label Nov 25, 2019
@fntlnz
Contributor

fntlnz commented Nov 25, 2019

@shane-lawrence I agree we want to keep this open; I was coming here to un-stale this but you already did :)

@kbrown

kbrown commented Nov 26, 2019 via email

@popaaaandrei

Guys, any progress/updates? Thanks!

@fntlnz
Contributor

fntlnz commented Jan 4, 2020

For everyone looking at this:
The reason Falco drops events is that it currently has no way to "offload" events coming from kernel space while receiving them in userspace.

In other words, once inputs are consumed from the Falco driver, they are sent immediately to the engine, which processes them.

For that reason, we had to implement a mechanism called the "token bucket", which is basically a rate limiter for the engine and is what is responsible for the drops.

Now, this kind of system is limited: on machines with a lot of activity in kernel space (which fill the ring buffer fast), Falco drops events.

After discussions in many office hours and repo planning calls, we decided to redesign the inputs to achieve two goals:

  • Have an input streaming interface (that offloads messages to a queue)
  • Implement inputs as a gRPC client - meaning that the inputs are not part of the Falco engine itself but a separate service

You can find a Diagram with an explanation here.

Long story short: it's likely that this issue will persist until we implement the new input API and then release 1.0. We don't have an official release date yet, but IIRC many community members wanted to target March 2020.

We are grateful for everyone's help and we are trying to do our best to make Falco more and more reliable for all kinds of workloads. To stay updated and have a voice in the process please join our weekly calls.
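
If it helps to picture why the token bucket described above leads to drops, here is a toy sketch in Python (purely illustrative, not Falco's actual C++ implementation): events can only be handled while tokens are available, and a burst that exceeds the refill rate exhausts the bucket, so the excess is counted as dropped.

import time

class TokenBucket:
    # Toy rate limiter: `rate` tokens are refilled per second, capped at `max_burst`.
    def __init__(self, rate, max_burst):
        self.rate = rate
        self.max_burst = max_burst
        self.tokens = max_burst
        self.last = time.monotonic()

    def claim(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, never exceeding the burst size.
        self.tokens = min(self.max_burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True    # a token was available: the event can be processed
        return False       # bucket empty: the event is dropped

bucket = TokenBucket(rate=1000, max_burst=1000)
dropped = sum(1 for _ in range(5000) if not bucket.claim())
print(dropped, "of 5000 events dropped")   # a fast burst exhausts the bucket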

@stale

stale bot commented Mar 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@poiana

poiana commented Jan 13, 2021

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

@poiana

poiana commented Feb 13, 2021

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

@poiana

poiana commented Feb 13, 2021

@poiana: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@poiana poiana closed this as completed Feb 13, 2021
@kbrown

kbrown commented Feb 14, 2021

I've abandoned Falco (because it never worked for me on gcloud). No idea if this is still an issue.

@bsord

bsord commented Aug 1, 2021

I've just deployed Falco today using the latest helm chart and I'm still running into this. Is the only option to set the syscall drop action to ignore until version 1.0.0 is released? With so many dropped calls, can I trust that Falco is accurately detecting the events I am using it to monitor?

@abroglesc

+1 to @bsord's question. Commenting to follow along on the discussion.

@bsda

bsda commented Sep 7, 2021

+1, still hitting this issue after 2 years.

@eelkonio

I'm hitting this too. A lot.

I see numbers like 10K, 150K, and even 458K "syscall events dropped in the last second". Any one of these events may be a potential security risk, so to me this sounds like a serious issue with Falco.

Will there be a fix for this? I see great potential for Falco in our security environment, but I cannot explain this to my customers, especially since I can't even give them a percentage of what has and has not been investigated by Falco.

@leogr
Member

leogr commented Sep 17, 2021

Hey folks,

As you can understand, it's hard to give answers without being able to reproduce the problem. Moreover, syscall event drops may be caused by different factors. In my personal experience, they usually happen when Falco doesn't have enough resources (CPU) to process the event stream. Sometimes this can happen due to a misconfiguration. I created #1403, which includes a handy checklist for debugging purposes.

It seemed to me that the most recent Falco versions fixed the majority of these issues. However, judging by your comments, something is likely still not working on GKE. I'm happy to help you with that, but I also need your help investigating and understanding what's going on.

For instance, for debugging these kinds of issues, we usually need:

  • confirmation that the problem persists after trying each item in the [UMBRELLA] Dropped events #1403 checklist
  • all version numbers (including the GKE ones)
  • machine info
  • the Falco configuration you are using
  • the manifests you used to deploy Falco
  • a few log lines reporting the "syscall events dropped" notice
  • any performance metrics you can provide

Could someone create a full report and share it (privately would be fine too)?
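
To make that easier to collect, something along these lines from a workstation with cluster access should capture most of it (a rough sketch; the falco namespace and the app=falco label are assumptions that depend on how you deployed the chart):

falco --version              # or run it inside one of the Falco pods
uname -a
cat /etc/os-release
kubectl version --short
kubectl get nodes -o wide
kubectl -n falco get pods -o wide
kubectl -n falco logs -l app=falco --tail=500 | grep "syscall event drop"
kubectl -n falco top pods    # requires metrics-server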

@leogr
Member

leogr commented Sep 17, 2021

/reopen

@poiana

poiana commented Sep 17, 2021

@leogr: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@poiana poiana reopened this Sep 17, 2021
@eelkonio

UPDATE:
I found the problem via the links you provided, @leogr.

Our Falco was making synchronous calls to an external alerting system. These calls took 100+ ms each, which stalled Falco's main event-processing thread. That was the main cause of the thousands of missed events. I had not realized that the Falco alerting requests ran on the same thread as the event processing. So I'll be adding falcosidekick and an external message queue now, which should fix most of these dropped events.
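
For reference, the output side of falco.yaml I'm planning to switch to looks roughly like this (a sketch; the service name and port assume a default falcosidekick deployment reachable in the same namespace):

json_output: true
json_include_output_property: true

http_output:
  enabled: true
  url: "http://falcosidekick:2801/"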

Thanks for your hints!

@leogr
Member

leogr commented Sep 23, 2021

Hey @eelkonio

you're welcome! I am happy to have been helpful :)
Btw, your solution is perfectly fine, I can confirm (similar situations happened to me in the past, and I solved them in a similar way).

I also want to give you some more context. Since Falco 0.27.0, alert processing has been offloaded to another thread (see #1451).
AFAIK, this mitigates most of these issues, though it might not be the definitive solution for all circumstances (in particular, it cannot solve situations where the calls take a very long time or stay pending indefinitely).
So, just curious: which Falco version are you using?
Anyway, adding falcosidekick will help for sure. Likely, it's the best solution for your case.

@eelkonio

Hi leogr,

I used Falco 0.29.1, so that other thread should have been present. However, is there a limit to how many messages per second this thread can handle before it starts stalling?

As I said, I saw 458K and even 600K+ dropped events per second a few times. That may also have been caused by the CPU limits on the container that runs both/all of these threads. I will remove those too to see if it makes any difference.

Thanks again - love the product and hope to get it working fine soon!

@leogr
Member

leogr commented Sep 23, 2021

I used Falco 0.29.1, so that other thread should have been present. However, is there a limit to how many messages per second this thread can handle before it starts stalling?

There's no hard limit; it's just a matter of how many resources Falco can use. Furthermore, the output mechanism is still sequential. For example, if just one call blocks indefinitely, all subsequent calls will be stalled (Falco will try to emit a warning if an output consumer blocks for more than 2 seconds, see here). In such a situation, Falco cannot operate and starts to drop events.

For this reason, using a responsive consumer (like falcosidekick) is still beneficial.
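
If I remember correctly, that 2-second threshold corresponds to the output_timeout setting in falco.yaml (expressed in milliseconds), so it can be raised if your consumer is occasionally slow, e.g.:

# warn when an output consumer blocks for more than 5 seconds instead of the default 2
output_timeout: 5000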

@bsda

bsda commented Sep 23, 2021

I am currently running 0.29.1, installed via helm on GKE (1.20.9-gke.1001) and only logging to stdout, and I'm seeing thousands of n_drops_buffer and n_drops_bug drops. I've tried many (if not all) of the suggestions from #1403 and can't seem to get rid of them. I'll try to put together a full report when I get some time; below is an idea of the numbers I am seeing from a single node.

[screenshot: per-node drop counters]

@leogr
Member

leogr commented Sep 23, 2021

Just trying to guess a possible issue: is the BPF JIT compiler enabled?
(i.e. the kernel has CONFIG_BPF_JIT enabled and net.core.bpf_jit_enable is set to 1)
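
A quick way to check both on a node (on COS the kernel config is usually exposed at /proc/config.gz; on other distros look under /boot):

sysctl net.core.bpf_jit_enable
zcat /proc/config.gz | grep BPF_JIT    # or: grep BPF_JIT /boot/config-$(uname -r)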

@bsda

bsda commented Sep 23, 2021

I have CONFIG_BPF_JIT=y and

net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
net.core.bpf_jit_kallsyms = 1
net.core.bpf_jit_limit = 264241152

@leogr
Member

leogr commented Oct 6, 2021

If you are using Falco with the K8s support enabled, could you please try the latest release (Falco 0.30.0)?

It comes with several fixes that reduce resource consumption when fetching metadata from the K8s API server, which may indirectly alleviate the event-dropping problem. I'm not sure that is the case here, but any testing and feedback would be very useful to us.
Thank you in advance :)
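
If you installed via helm, the upgrade should look roughly like this (assuming the falcosecurity charts repository is already added and the release is named falco):

helm repo update
helm upgrade falco falcosecurity/falco --set ebpf.enabled=true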

@leogr
Member

leogr commented Oct 14, 2021

I am currently running 0.29.1, installed via helm on GKE (1.20.9-gke.1001) and only logging to stdout, and I'm seeing thousands of n_drops_buffer and n_drops_bug drops. I've tried many (if not all) of the suggestions from #1403 and can't seem to get rid of them. I'll try to put together a full report when I get some time; below is an idea of the numbers I am seeing from a single node.

[screenshot: per-node drop counters]

hey @bsda
@FedeDP and I recently tried Falco 0.30.0 on a test GKE cluster. We weren't able to reproduce any n_drops_buffer, but I noticed some n_drops_bug during the start-up phase. We tried Falco deployed via helm plus a couple of pods running stress-ng on a 2-node cluster (COS).
Let me know if you get a chance to share a report. Thanks.

@fntlnz fntlnz removed their assignment Nov 10, 2021
@poiana

poiana commented Dec 10, 2021

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

@poiana poiana closed this as completed Dec 10, 2021
@poiana

poiana commented Dec 10, 2021

@poiana: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
