High rate of dropped events #558

Closed
prsimoes opened this issue Mar 8, 2019 · 5 comments

prsimoes (Contributor) commented Mar 8, 2019

I'm running Falco (the latest build from three weeks ago) as a DaemonSet in one of our staging environments on GKE (1.11.6-gke.6). It's a pool of around 20 Ubuntu nodes, and I'm using the BPF probe.

I'm seeing a very high event drop rate, around 72%, on nodes with a CPU load of around 8. On nodes where the CPU load is lower (below 1), the drop rate is almost nonexistent or even zero.

I also tested in a less busy GKE + COS environment, with BPF enabled, and still got around a 12% drop rate on a node with a CPU load of 1.2.

I'm attaching two PDFs with the detailed data I measured, showing CPU load, memory, disk, pod restarts, drop rate, etc.

I also tested the kernel module probe on the Ubuntu node pool, and the drop rate was actually higher (around 74% on the busier nodes), so I'm not sure it's entirely related to BPF.

Falco Dropped syscalls - GKE + Ubuntu + BPF.pdf
Falco Dropped syscalls - GKE + COS + BPF.pdf
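
For reference, in Falco builds from that era the eBPF driver is typically enabled in the DaemonSet through the FALCO_BPF_PROBE environment variable rather than by loading the kernel module. The fragment below is an illustrative sketch, not the manifest from this environment; the image tag, volume names, and mounts are assumptions.

```yaml
# Illustrative fragment of a Falco DaemonSet pod spec using the eBPF probe.
# Image tag and volume names are assumptions, not from the reported setup.
containers:
  - name: falco
    image: falcosecurity/falco:0.14.0
    securityContext:
      privileged: true
    env:
      # An empty FALCO_BPF_PROBE makes Falco load the eBPF probe from its
      # default location instead of using the kernel module.
      - name: FALCO_BPF_PROBE
        value: ""
    volumeMounts:
      # The eBPF probe needs debugfs from the host.
      - name: debugfs
        mountPath: /sys/kernel/debug
      - name: proc-fs
        mountPath: /host/proc
        readOnly: true
```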

mfdii (Member) commented Mar 8, 2019

Do you have any data on how many containers you run per node on average, and also on how quickly containers churn? We have a fix going in for how we pull container metadata, which we believe is what's causing the dropped events.

prsimoes (Contributor, Author) commented Mar 8, 2019

It's around 16 to 20 containers (including kube-system ones) on each node, and most of them seem to be long-lived (most have the same age as the node). There's one pod that has some restarts. I'm attaching a file with more detailed data from kubectl:

data.txt

prsimoes (Contributor, Author) commented

Just an update on this:

After fixing the k8s metadata issue (#562), I was able to reduce the drop rate to almost zero (a 0.0001% drop rate) on the nodes with heavy CPU load.

Was Falco's inability to reach https://kubernetes.default the root cause of all these drops?
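
For context on why API-server reachability matters: Falco pulls Kubernetes metadata by querying the API server, usually via its -k (API endpoint) and -K (bearer token) flags in the DaemonSet args. The sketch below is illustrative; the exact args in this deployment may differ, and the paths shown are the standard in-cluster service-account locations. If https://kubernetes.default does not resolve, those metadata lookups can block, which lines up with the DNS issue discussed below.

```yaml
# Illustrative Falco container args for Kubernetes metadata enrichment,
# using Falco's standard -k (API endpoint) and -K (bearer token) flags.
args:
  - /usr/bin/falco
  - -K
  - /var/run/secrets/kubernetes.io/serviceaccount/token
  - -k
  - https://kubernetes.default
```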

mfdii (Member) commented Apr 9, 2019

Have you seen any further improvement with the newer Falco builds? We've further fixed how container info is pulled, and we've also added alerting on drops.

And yes, I am sure the DNS lookup was blocking somewhere, which would have caused this.
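
The drop alerting mentioned above is controlled by the syscall_event_drops section of falco.yaml. A minimal sketch is below; the values mirror the shipped defaults of that era, but treat them as illustrative rather than prescriptive.

```yaml
# Illustrative falco.yaml fragment: what Falco does when the kernel reports
# dropped syscall events. Actions are rate-limited by a token bucket.
syscall_event_drops:
  actions:
    - log     # write a message to Falco's own log
    - alert   # emit a Falco alert through the configured outputs
  rate: .03333   # tokens per second (roughly one action every 30 seconds)
  max_burst: 10  # maximum number of actions in a burst
```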

prsimoes (Contributor, Author) commented

I haven't noticed much difference. The DNS fix was what really brought down the number of drops.
