-
Notifications
You must be signed in to change notification settings - Fork 895
-
Notifications
You must be signed in to change notification settings - Fork 895
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SIGABRT on Falco restarts when using gRPC server #2342
Comments
Thanks for the bug report and in-depth research! /cc @leogr @jasondellaluce wdyt? |
I agree with Federico, we should research this for the upcoming release. |
Hi! I could not trigger the issue anymore with the patch. |
@FedeDP Thanks for the quick fix. Give me a while(next weekish) to be certain and ensure that the patch resolves this issue for sure if that's ok? |
Yep no problem! Let us know! 🤞 |
Hey @FedeDP, I just tried out PR #2350; unfortunately, we are still seeing this error.
falco --version output (was built internally with #2350 pulled in)
Is there additional info that we can provide on our side to help debug further? |
Weird :/ but somewhat expected, i mean if this is caused by a race condition it could be just mitigated in my own env; i will test again and try to find another solution. |
Btw love that you are using new falco systemd services :) |
Thanks for inputs @FedeDP and for following up @happy-dude |
Just a quick update:
I need to figure out a clean way to get the CircleCI build of the PR into a production node, but that's a mess because that is governed by config management, which is set to launch/configure the package according to the latest deployment I did (i.e. with settings changes, etc.) However, I see that the gRPC fix was merged into master recently; I'll generate a clean build (with my org's internal patches applied) and test that. Our build process generates a "clean" build for every release since the builder images are essentially newly created for every release. |
Hey team, We completed a fresh deployment branching from a7e991b, with some custom patches applied ontop and were able to reproduce the issue. Please see attached screenshot of the tmux session I was working off of. I was doing a All the commands from the right ( NOTE: This was on a aarch64 host. Will report back if a x86-64 host experiences the problem tomorrow. |
Hi! Can you share the Moreover, was there a lot of spamming on the grpc unix socket (like, lots of outputs)? Let us know if you can repro it on x86 :) |
Hey Fede, The
I do not believe so; this was an evaluation/testing host we were using and there were 0 events detected and no events dropped. If anything, this was a random instance of me tailing the service logs and after seeing nothing was happening for a long time, decided to stop the service on a whim -- which caused the SIGABRT 🤷♂️
Hasn't happened yet (although it has in the past); will update ticket if it does.
Are there suggestions or a path for us to debug this further? There really isn't anything special in the buffered_outputs: false
file_output:
enabled: true
filename: /var/log/falco/falco_events.log
keep_alive: true
grpc:
bind_address: unix:///var/run/falco.sock
enabled: true
threadiness: 0
grpc_output:
enabled: true
http_output:
enabled: false
url: http://some.url
user_agent: falcosecurity/falco
json_include_output_property: false
json_include_tags_property: true
json_output: true
libs_logger:
enabled: false
severity: debug
log_level: info
log_stderr: false
log_syslog: true
metadata_download:
chunk_wait_us: 1000
max_mb: 100
watch_freq_sec: 1
output_timeout: 2000
outputs:
max_burst: 1000
rate: 0
priority: debug
program_output:
enabled: false
keep_alive: false
program: 'jq ''{text: .output}'' | curl -d @- -X POST https://hooks.slack.com/services/XXX'
rules_file:
- /etc/falco/falco_rules.local.yaml
- /etc/falco/rules.d
stdout_output:
enabled: false
syscall_buf_size_preset: 5
syscall_event_drops:
actions:
- log
- alert
max_burst: 1000
rate: 1
threshold: 0.1
syscall_event_timeouts:
max_consecutives: 1000
syslog_output:
enabled: false
time_format_iso_8601: false
watch_config_files: false
webserver:
enabled: false
k8s_healthz_endpoint: /healthz
listen_port: 8765
ssl_certificate: /etc/falco/falco.pem
ssl_enabled: false
|
Hey team, just wanted to provide an update saying that besides the SIGABRT that we observed on the ARM test node, we haven't encountered the issue again on the other nodes with Falco deployed. Happy to have this be closed out; I understand that it's hard to reproduce and will file another ticket if it occurs again! |
Great to hear! |
Thanks for the PR @FedeDP . Appreciate ur quick fix and can confirm haven't seen any issues where the unclean exit of the server was responsible for the aborts. Good with closing the issue. |
Describe the bug
When restarting an already running Falco, sometimes a SIGABRT is received due to an unclean gRPC server shutdown.
Currently, I have enabled the gRPC output and server for Falco. This is done for exporting metrics via the Falco Exporter.
How to reproduce it
This may not always work which makes me believe there is a race condition of some sort when enabling the gRPC server to run in Falco:
On doing so, the follwoing debug level log messages will be observed:
I took a deeper look and the assertion failure seems to be due to a completion queue component within the grpc core repository that Falco makes use of as a library. The code in this case has a GPR_ASSERT that aborts the program, if the condition is not met.
Definition of the *GPR_ASSERT * - https://github.com/grpc/grpc/blob/dba82c9be6af2bbe22e460338eac257726102f5e/include/grpc/support/log.h#L92-L98
Source code where assert fails - https://github.com/grpc/grpc/blob/master/src/cpp/server/server_cc.cc#L255
After taking an initial glance at the library source code and some of the Falco side code, my guess is there is a race condition wherein the completion queue is not drained completely post server shutdown and there is an attempt to enqueue a request post shutting down the completion queue.
Expected behaviour
Falco should not SIGABRT.
Environment
Falco version:
Falco version: 0.33.1-1-48a2c44b
Libs version: 0.9.0
Plugin API: 2.0.0
Driver:
API version: 2.0.0
Schema version: 2.0.0
Default driver: 3.0.1+driver
Machine: x86_64
System: Linux Debian
OS: Debian GNU/Linux 11 (bullseye)
Kernel: 5.15.76
Installation method: From Source
The text was updated successfully, but these errors were encountered: