AppScope segfaults node app when configured to send tls to logstream cloud #650
It's difficult to prove that building openssl with no-engine didn't just change the timing, but I do have evidence that this is not the case. With the original, unmodified build of master (cb36e96 at this time), I started gdb as I did above.
This worked! So if I take the unmodified build and run it while skipping engine_cleanup_int(), everything works. This is how I've convinced myself that we've really fixed the problem, and that it's not just a coincidence.
To be honest, I don't know why the openssl library continues to run some cleanup at exit after we've told it not to via the OPENSSL_init_ssl(OPENSSL_INIT_NO_ATEXIT, NULL) call. That said, my impression is that building with no-engine is acceptable for us. From this description: https://github.com/jmhodges/libssl/blob/master/src/README.ENGINE I've concluded that by building with no-engine, we won't be taking advantage of external crypto devices. While not ideal, this seems straightforward to me, and I prefer it to making our own modification to the SSL library.
QA INSTRUCTIONS

Scenario 1

Goal: Verify that we are able to make outgoing TLS connections from a variety of short-lived processes without crashing.

Requirements: The scenario must be done with TLS, which is probably easiest via cribl.cloud. I'd recommend registering for a LogStream Cloud instance if you don't already have one. It's not strictly required, but I like looking at the incoming data from the scoped process by navigating to Sources->AppScope->in_appscope_tls and clicking the "Live" button to confirm that data is getting from AppScope to LogStream Cloud successfully.

Procedure:
As long as we don't see a crash that prints "Segmentation fault (core dumped)", we're good.
I can observe the segfault using the repro steps and AWS instance in #650 (comment), and the patch in #656 fixes the segfault.

I've created #669 to help make these kinds of issues more repeatable.

Verified on 0.8.1-rc1.
While looking at #644, I noticed a segfault that seems to happen very repeatably.
Repro instructions...
What is happening is that at least some of the information is making it through and being received in logstream (confirmed with the live capture feature on the appscope source), but the console where node runs prints this message, apparently as the process exits:
Segmentation fault (core dumped)
What I expect is no segmentation fault.
Here's a backtrace, obtained by running gdb node and executing a few commands in gdb. If you run the same thing without tls, it works fine (no segfault).
It crashes the same way on the master branch (I tested this by moving the metriccapture integration test to another sandbox that was on master and therefore did not have any of the code changes from #644 in it).
Since it seemed like this was crashing on exit, and hasn't been seen before, I tried adding the sleep package for node:
And then added this to the end of hotshot.ts:
What I've observed on my ec2 machine (where I originally saw this problem) is that when X is 15 (meaning we delay the exit of node for 15ms), it stops crashing. When X is 10 or less (delaying the exit for 10ms or less), it crashes. This suggests the crash only appears during a small timing window, perhaps the window in which the tls connection is being established.
A few questions that I'd like to answer next:
Q1) Has it been like this since the last time we touched TLS stuff in our code?
A1) Yes, from everything I can tell. I haven't found any recent change in behavior. I've built and run the version before the two latest changes, and the same node test crashes in the same way. My conclusion is that it's been like this since we added tls.
unix socket stuff was done under 5d45e5d on 17-Sept-2021
a mutex was removed under e257968 on 31-Aug-2021
Q2) Even when it crashes with no delay, we have an active connection that is successfully conveying information to logstream cloud... So what new connection is being attempted in the backtrace? (Payload maybe?). Is that connection necessary?
A2) I rebuilt with the payload connection commented out. The crash didn't appear. So my best understanding at the moment is that the timing window exists while the payload tls connection is being established. I assume that we probably have another window of vulnerability while the main tls connection is being established, but I also assume that the timing of this particular test doesn't allow us to see it.
Q3) Can we learn what's causing the crash by inspecting the backtrace?
A3) So, by looking at the source code for the frames in the backtrace, I found this:
So this suggests that the problem is that global_engine_lock is being assigned a value (great!) but then at a later point during the execution, the value is being set to NULL. I confirmed this in gdb:
Ah HA! OPENSSL_cleanup() is getting called!
I did not expect this with our OPENSSL_init_ssl(OPENSSL_INIT_NO_ATEXIT, NULL) call.
So, does the OPENSSL_cleanup() source code offer any possible solutions? Um, yeah, maybe?
So maybe it'd be worth building with OPENSSL_NO_ENGINE?
Yep, when I add "no-engine" to the arguments to openssl/Configure in contrib/Makefile, voilà, no crash.