Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to solve the UNIX signal handling mess #818

Closed
hillu opened this issue Jan 27, 2018 · 4 comments
Closed

How to solve the UNIX signal handling mess #818

hillu opened this issue Jan 27, 2018 · 4 comments
Labels

Comments

@hillu
Copy link
Contributor

hillu commented Jan 27, 2018

The yr_rules_scan_* functions install signal handlers for SIGSEGV and SIGBUS. I added those so that an application using libyara would not be terminated when scanning files on corrupted file systems. Another case I found were files that were concurrently being modified (truncated).

It was a bad idea to make this behavior largely transparent to the user; adding the SCAN_FLAGS_NO_TRYCATCH for not installing any signal was not a very good solution, part of that is lack of documentation. Moreover, I am pretty sure that YR_TRYCATCH as it is implemented for UNIX so far is not thread-safe.

I'd like to collect some ideas here on what a "proper" implementation might look like

  • Signals are set process-wide, so we need to distinguish between SIGBUS/SIGSEGV being delivered due to
    • invalid memory access that should have been valid (FS-related, etc., see above) from within libyara
    • invalid memory access from within libyara, but due to a bug (e.g. out-of-bounds read from a module)
      • the signal's default action should be taken (how do we do that?)
    • invalid memory access from code running concurrently with libyara (Remove abort when catching a signal. Combined with signal handler cha… #813) or a signal delivered by another process
      • If a signal handler was installed, it should be called
      • otherwise the default action should be taken.
  • Signal handler installation
    • Is YR_TRYCATCH enough?
    • Should there be yr_rules_install_exception_handler / yr_rules_uninstall_exception_handler functions?
    • Do we want to support multiple YR_RULES instances that are concurrently being used?
@ilaidlaw
Copy link

ilaidlaw commented Apr 3, 2018

Just came across this issue, I am hitting the assert in the exception handler, eventually traced it down to the case where multiple YR_RULES are concurrently in use. The exc_jmp_buf array is indexed via tidx, yet tidx is specific to an instance of YR_RULES.

@russianfool
Copy link

russianfool commented Oct 15, 2018

The assert has been removed in 9c62bc7, which probably makes this section way more dangerous in a mutlithreaded environment (and maybe this is just user error here).

Specifically:

  1. Yara supports up to 32 threads running on a rules structure at the moment.
  2. Those threads grab a tid when starting a scan, cache the "old" signal handlers, and install the Yara one.
  3. The threads reinstall the "old" signal handler back after the scan is done.

When multiple threads are running, you very quickly end up in a situation where the "old" cache is a cache of another thread installing the Yara signal handlers. The whole process ends up with the Yara signal handler installed outside Yara scan context.

With the missing abort()/assert(), this is dangerous since SIGSEGVs outside Yara scan get ignored. I've reintroduced the assert() as an abort(); my local behavior is then:

  1. SIGSEGV from another thread might still get ignored. T1 is running Yara scan, T2 is not, T2 SIGSEGV. There's no guarantee which thread will run signal handler (EDIT: Actually, maybe this isn't true (https://stackoverflow.com/a/6533431); if Yara is only interested in SIGBUS/SIGSEGV and the "thread directed signals" hold true, then just the signal handler installation doesn't work).
  2. T1 runs signal handler. If Yara scan is ongoing, tidx is retrieved; we assume Yara scan failed and cancel it by jumping. T2 is resumed, and SIGSEGVs again, hopefully not hitting T1.
  3. T2 runs signal handler. No Yara scan is ongoing, tidx is -1. Before assert/abort, we would swallow the SIGSEGV. T2 operation resumes, SIGSEGV again, infinite "loop" (SIGSEGV -> ignore signal -> SIGSEGV). With patch, at least we crash.

@plusvic
Copy link
Member

plusvic commented Oct 28, 2021

The exception handling logic has changed since this issue was opened. Re-open if this is still an issue.

@plusvic plusvic closed this as completed Oct 28, 2021
@mark-belnap
Copy link

I just upgrade our software from v3.5 to v4.2.3 of libyara. We had an issue in production that showed this is still an issue even after the changes listed above. I dug out the issue myself and then found this discussion, which reinforces my original thoughts.

Some comments:

hillu was right on with the following:

It was a bad idea to make this behavior largely transparent to the user; adding the SCAN_FLAGS_NO_TRYCATCH for not installing any signal was not a very good solution

I had no idea we were using this in 3.5 (with only sigbus) and it was only with the introduction of sigsegv that the bad side-effects started hitting hard.

part of that is lack of documentation

This +=100 ! Nowhere in the documentation does it mention what is happening.

Moreover, I am pretty sure that YR_TRYCATCH as it is implemented for UNIX so far is not thread-safe

Again, yes! I can verify that it is NOT thread-safe.

russianfool's description was the exact same conclusion that I came to after a few days dealing with mis-behaving code. Some of his comments relate to the old tidx methodology, but the gist is still there even after the 9c62bc7 patch. His comment regarding the ease at which a multi-threaded process calling yr_rules_scan_mem() will quickly and irrevocably (assuming the non-yara code doesn't mess with signal handlers) replace the signal handler with yara's exception_handler() is verifiably correct based on my experience and testing.

Further, russianfool's comment #3 is the (mis-) behavior that we were seeing. Whereas before when our code would segfault, it would dump core (which we could examine and debug) and kubernetes would restart it. With the infinite loop, we just bumped up to 100% cpu and waited until other threads started reacting and the process failed in various other ways.

In discussing with colleagues we haven't been able to determine a way that a global signal handler can be used in a local fashion. Especially in a 3rd party library with no warning of the non-thread-safe-ness of the behavior.

Our solution was to enable the SCAN_FLAGS_NO_TRYCATCH flag. (This solution was determined by digging through source code, not via any documentation.) However, my contention is that this feature should be off by default and only enabled by the SCAN_FLAGS_THREAD_UNSAFE_TRYCATCH flag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants