High rate of dropped events #558

Closed
prsimoes opened this issue Mar 8, 2019 · 5 comments

prsimoes (Contributor) commented Mar 8, 2019

I'm running Falco (the latest build from three weeks ago) as a DaemonSet in one of our staging environments on GKE (1.11.6-gke.6). It's a pool of around 20 Ubuntu nodes, and I'm using the BPF probe.

I'm seeing a very high event drop rate, around 72%, on nodes with a CPU load of around 8. On nodes where the CPU load is lower (below 1), the drop rate is almost nonexistent or even zero.

I also tested in a less busy GKE + COS environment, with BPF enabled, and still got around a 12% drop rate on a node with a CPU load of 1.2.

I'm attaching two PDFs with the detailed data I measured, showing CPU load, memory, disk, pod restarts, drop rate, etc.

I also tested the kernel module probe on the Ubuntu node pool, and the drop rate was actually higher (around 74% on the busier nodes), so I'm not sure it's entirely related to BPF.

Falco Dropped syscalls - GKE + Ubuntu + BPF.pdf
Falco Dropped syscalls - GKE + COS + BPF.pdf
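
For reference, in Falco builds from that era the eBPF driver is typically enabled in the DaemonSet through the FALCO_BPF_PROBE environment variable rather than by loading the kernel module. The fragment below is an illustrative sketch, not the manifest from this environment; the image tag, volume names, and mounts are assumptions.

```yaml
# Illustrative fragment of a Falco DaemonSet pod spec using the eBPF probe.
# Image tag and volume names are assumptions, not from the reported setup.
containers:
  - name: falco
    image: falcosecurity/falco:0.14.0
    securityContext:
      privileged: true
    env:
      # An empty FALCO_BPF_PROBE makes Falco load the eBPF probe from its
      # default location instead of using the kernel module.
      - name: FALCO_BPF_PROBE
        value: ""
    volumeMounts:
      # The eBPF probe needs debugfs from the host.
      - name: debugfs
        mountPath: /sys/kernel/debug
      - name: proc-fs
        mountPath: /host/proc
        readOnly: true
```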

mfdii (Member) commented Mar 8, 2019

Do you have any data on how many containers you run per node on average, and also on how quickly containers churn? We have a fix going in for how we pull container metadata, which we believe is what's causing the dropped events.

prsimoes (Contributor, Author) commented Mar 8, 2019

It's around 16 to 20 containers (including kube-system ones) on each node, and most of them seem to be long-lived (most have the same age as the node). There's one pod that has some restarts. I'm attaching a file with more detailed data from kubectl:

data.txt

prsimoes (Contributor, Author) commented

Just an update on this:

After fixing the k8s metadata issue (#562), I was able to reduce the drop rate to almost zero (a 0.0001% drop rate) on the nodes with heavy CPU load.

Was Falco's inability to reach https://kubernetes.default the root cause of all these drops?
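
For context on why API-server reachability matters: Falco pulls Kubernetes metadata by querying the API server, usually via its -k (API endpoint) and -K (bearer token) flags in the DaemonSet args. The sketch below is illustrative; the exact args in this deployment may differ, and the paths shown are the standard in-cluster service-account locations. If https://kubernetes.default does not resolve, those metadata lookups can block, which lines up with the DNS issue discussed below.

```yaml
# Illustrative Falco container args for Kubernetes metadata enrichment,
# using Falco's standard -k (API endpoint) and -K (bearer token) flags.
args:
  - /usr/bin/falco
  - -K
  - /var/run/secrets/kubernetes.io/serviceaccount/token
  - -k
  - https://kubernetes.default
```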

mfdii (Member) commented Apr 9, 2019

Have you seen any further improvement with the newer Falco builds? We've further fixed how container info is pulled, and we've also added alerting on drops.

And yes, I am sure the DNS lookup was blocking somewhere, which would have caused this.
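
The drop alerting mentioned above is controlled by the syscall_event_drops section of falco.yaml. A minimal sketch is below; the values mirror the shipped defaults of that era, but treat them as illustrative rather than prescriptive.

```yaml
# Illustrative falco.yaml fragment: what Falco does when the kernel reports
# dropped syscall events. Actions are rate-limited by a token bucket.
syscall_event_drops:
  actions:
    - log     # write a message to Falco's own log
    - alert   # emit a Falco alert through the configured outputs
  rate: .03333   # tokens per second (roughly one action every 30 seconds)
  max_burst: 10  # maximum number of actions in a burst
```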

prsimoes (Contributor, Author) commented

I haven't noticed much difference. The DNS fix was what really brought down the number of drops.
