Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIGABRT on Falco restarts when using gRPC server #2342

Closed
adduali1310 opened this issue Jan 9, 2023 · 16 comments
Closed

SIGABRT on Falco restarts when using gRPC server #2342

adduali1310 opened this issue Jan 9, 2023 · 16 comments
Labels
Milestone

Comments

@adduali1310
Copy link

Describe the bug

When restarting an already running Falco, sometimes a SIGABRT is received due to an unclean gRPC server shutdown.
Currently, I have enabled the gRPC output and server for Falco. This is done for exporting metrics via the Falco Exporter.

How to reproduce it

This may not always work which makes me believe there is a race condition of some sort when enabling the gRPC server to run in Falco:

  1. Enable the gRPC output and server.
  2. Bind both the server(Falco) and the client(Falco Exporter) to unix:///var/run/falco.sock (The latest version defaults to /run/falco.sock, but I do not see any reason why it cannot bind to var/run/falco.sock as in this case, if set explicitly on both ends)
  3. Ensure, the connection is successful and metrics are being exported/are visible.
  4. Restart Falco.

On doing so, the follwoing debug level log messages will be observed:

  • Shutting down gRPC server. Waiting until external connections are closed by clients
  • grpc: unix:///var/run/falco.sock: Sending goaway err={....}
  • grpc: assertion failed: grpc_server_request_registered_call( server_->server(), registered_method, &call_, &context_->deadline_, context_->client_metadata_.arr(), payload, call_cq_->cq(), notification_cq->cq(), this) == GRPC_CALL_OK

I took a deeper look and the assertion failure seems to be due to a completion queue component within the grpc core repository that Falco makes use of as a library. The code in this case has a GPR_ASSERT that aborts the program, if the condition is not met.

Definition of the *GPR_ASSERT * - https://github.com/grpc/grpc/blob/dba82c9be6af2bbe22e460338eac257726102f5e/include/grpc/support/log.h#L92-L98

Source code where assert fails - https://github.com/grpc/grpc/blob/master/src/cpp/server/server_cc.cc#L255

After taking an initial glance at the library source code and some of the Falco side code, my guess is there is a race condition wherein the completion queue is not drained completely post server shutdown and there is an attempt to enqueue a request post shutting down the completion queue.

Expected behaviour

Falco should not SIGABRT.

Environment

Falco version:
Falco version: 0.33.1-1-48a2c44b
Libs version: 0.9.0
Plugin API: 2.0.0
Driver:
API version: 2.0.0
Schema version: 2.0.0
Default driver: 3.0.1+driver
Machine: x86_64
System: Linux Debian
OS: Debian GNU/Linux 11 (bullseye)
Kernel: 5.15.76
Installation method: From Source

@FedeDP
Copy link
Contributor

FedeDP commented Jan 9, 2023

Thanks for the bug report and in-depth research!
Given that Falco auto-restart on ruleset/config update is enabled by default, i think this is rather important to be fixed on time for 0.34.0.
Adding the milestone:
/milestone 0.34.0

/cc @leogr @jasondellaluce wdyt?

@poiana poiana added this to the 0.34.0 milestone Jan 9, 2023
@jasondellaluce
Copy link
Contributor

I agree with Federico, we should research this for the upcoming release.

@FedeDP
Copy link
Contributor

FedeDP commented Jan 12, 2023

Hi!
I have a fix in #2350.
@adduali1310 are you able to test it by chance?

I could not trigger the issue anymore with the patch.

@adduali1310
Copy link
Author

@FedeDP Thanks for the quick fix.
Let me test it out and get back to you.

Give me a while(next weekish) to be certain and ensure that the patch resolves this issue for sure if that's ok?

@FedeDP
Copy link
Contributor

FedeDP commented Jan 12, 2023

Yep no problem! Let us know! 🤞

@happy-dude
Copy link
Contributor

happy-dude commented Jan 18, 2023

Hey @FedeDP, I just tried out PR #2350; unfortunately, we are still seeing this error.

Jan 18 18:16:56 falco[4133530]: Falco version: 0.33.1-4-52bab289 (aarch64)
<lines omitted>
Jan 18 18:16:56 falco[4133530]: Starting gRPC server at unix:///var/run/falco.sock
Jan 18 18:17:59 falco[4133530]: SIGINT received, exiting...
Jan 18 18:17:59 systemd[1]: Stopping Falco Container Native Runtime Security with ebpf...
Jan 18 18:17:59 bash[4133530]: Syscall event drop monitoring:
Jan 18 18:17:59 bash[4133530]:    - event drop detected: 0 occurrences
Jan 18 18:17:59 bash[4133530]:    - num times actions taken: 0
Jan 18 18:17:59 falco[4133530]: Shutting down gRPC server. Waiting until external connections are closed by clients
Jan 18 18:17:59 falco[4133530]: grpc: assertion failed: grpc_server_request_registered_call( server_->server(), registered_method, &call_, &context_->deadline_, context_->client_metadata_.arr(), payload, call_cq_->cq(), notification_cq->cq(), this) == GRPC_CALL_OK
Jan 18 18:18:01 systemd[1]: falco-bpf.service: Main process exited, code=killed, status=6/ABRT
Jan 18 18:18:01 systemd[1]: falco-bpf.service: Failed with result 'signal'.
Jan 18 18:18:01 systemd[1]: Stopped Falco Container Native Runtime Security with ebpf.
Jan 18 18:18:01 systemd[1]: falco-bpf.service: Consumed 4.199s CPU time.

falco --version output (was built internally with #2350 pulled in)

falco --version
Falco version: 0.33.1-4-52bab289
Libs version:  0.10.1
Plugin API:    2.0.0
Engine:        15
Driver:
  API version:    3.0.0
  Schema version: 2.0.0
  Default driver: 4.0.0+driver

Is there additional info that we can provide on our side to help debug further?

@FedeDP
Copy link
Contributor

FedeDP commented Jan 19, 2023

Weird :/ but somewhat expected, i mean if this is caused by a race condition it could be just mitigated in my own env; i will test again and try to find another solution.
Thank you for your valuable feedback!

@FedeDP
Copy link
Contributor

FedeDP commented Jan 19, 2023

Btw love that you are using new falco systemd services :)

@adduali1310
Copy link
Author

Thanks for inputs @FedeDP and for following up @happy-dude

@happy-dude
Copy link
Contributor

happy-dude commented Jan 25, 2023

Just a quick update:

  • did some testing on my ARM64 VM -- was unable to reproduce
  • looking at when we did the most recent custom build deployment (clean build with grpc fix pulled in), it did occur on a ARM64 host and a AMD64 host
  • I have a (older) coredump from a AMD64 host

I need to figure out a clean way to get the CircleCI build of the PR into a production node, but that's a mess because that is governed by config management, which is set to launch/configure the package according to the latest deployment I did (i.e. with settings changes, etc.)

However, I see that the gRPC fix was merged into master recently; I'll generate a clean build (with my org's internal patches applied) and test that. Our build process generates a "clean" build for every release since the builder images are essentially newly created for every release.

@happy-dude
Copy link
Contributor

happy-dude commented Jan 26, 2023

Hey team,

We completed a fresh deployment branching from a7e991b, with some custom patches applied ontop and were able to reproduce the issue.

Please see attached screenshot of the tmux session I was working off of.

I was doing a journalctl -fu falco-bpf and on another session I decided to stop the service.

All the commands from the right (sudo systemctl stop falco-bpf) all happened within Thu 26 Jan 2023 10:22:15 PM UTC, which corresponds with the SIGABRT time on the left.

Screenshot 2023-01-26 at 4 24 51 PM

NOTE: This was on a aarch64 host. Will report back if a x86-64 host experiences the problem tomorrow.

@FedeDP
Copy link
Contributor

FedeDP commented Jan 27, 2023

Hi! Can you share the grpc server threadiness? https://github.com/falcosecurity/falco/blob/master/falco.yaml#L431

Moreover, was there a lot of spamming on the grpc unix socket (like, lots of outputs)?
I found: grpc/grpc#14001 and grpc/grpc#31579; even if they are not specifically talking about the shutdown process. Perhaps grpc has some issues on arm?

Let us know if you can repro it on x86 :)
Thank you for your testing!

@happy-dude
Copy link
Contributor

happy-dude commented Jan 27, 2023

Hey Fede,

The grpc server threadiness is 0.

Moreover, was there a lot of spamming on the grpc unix socket (like, lots of outputs)?

I do not believe so; this was an evaluation/testing host we were using and there were 0 events detected and no events dropped. If anything, this was a random instance of me tailing the service logs and after seeing nothing was happening for a long time, decided to stop the service on a whim -- which caused the SIGABRT 🤷‍♂️

Let us know if you can repro it on x86 :)

Hasn't happened yet (although it has in the past); will update ticket if it does.

Perhaps grpc has some issues on arm?

Are there suggestions or a path for us to debug this further?

There really isn't anything special in the falco.yaml, so here are the contents of the file:

buffered_outputs: false
file_output:
  enabled: true
  filename: /var/log/falco/falco_events.log
  keep_alive: true
grpc:
  bind_address: unix:///var/run/falco.sock
  enabled: true
  threadiness: 0
grpc_output:
  enabled: true
http_output:
  enabled: false
  url: http://some.url
  user_agent: falcosecurity/falco
json_include_output_property: false
json_include_tags_property: true
json_output: true
libs_logger:
  enabled: false
  severity: debug
log_level: info
log_stderr: false
log_syslog: true
metadata_download:
  chunk_wait_us: 1000
  max_mb: 100
  watch_freq_sec: 1
output_timeout: 2000
outputs:
  max_burst: 1000
  rate: 0
priority: debug
program_output:
  enabled: false
  keep_alive: false
  program: 'jq ''{text: .output}'' | curl -d @- -X POST https://hooks.slack.com/services/XXX'
rules_file:
- /etc/falco/falco_rules.local.yaml
- /etc/falco/rules.d
stdout_output:
  enabled: false
syscall_buf_size_preset: 5
syscall_event_drops:
  actions:
  - log
  - alert
  max_burst: 1000
  rate: 1
  threshold: 0.1
syscall_event_timeouts:
  max_consecutives: 1000
syslog_output:
  enabled: false
time_format_iso_8601: false
watch_config_files: false
webserver:
  enabled: false
  k8s_healthz_endpoint: /healthz
  listen_port: 8765
  ssl_certificate: /etc/falco/falco.pem
  ssl_enabled: false

@happy-dude
Copy link
Contributor

Hey team, just wanted to provide an update saying that besides the SIGABRT that we observed on the ARM test node, we haven't encountered the issue again on the other nodes with Falco deployed.

Happy to have this be closed out; I understand that it's hard to reproduce and will file another ticket if it occurs again!

@FedeDP
Copy link
Contributor

FedeDP commented Feb 2, 2023

Great to hear!
Let's wait for @adduali1310 feedback before eventually closing this 🤞

@adduali1310
Copy link
Author

Thanks for the PR @FedeDP . Appreciate ur quick fix and can confirm haven't seen any issues where the unclean exit of the server was responsible for the aborts.

Good with closing the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants