AppScope segfaults node app when configured to send tls to logstream cloud #650
It's difficult to prove that building openssl with no-engine didn't just change the timing, but I do have evidence that this is not the case. With the original, unmodified build of master (cb36e96 at this time), I started gdb as I did above.
This worked! So if I take the unmodified build and run it while skipping engine_cleanup_int(), everything works. This is how I've convinced myself that we've really fixed the problem, and that it's not just a coincidence.
To be honest, I don't know why the openssl library continues to run some cleanup at exit after we've told it not to via the OPENSSL_init_ssl(OPENSSL_INIT_NO_ATEXIT, NULL) call. That said, my impression is that building with no-engine is acceptable for us. From this description: https://github.com/jmhodges/libssl/blob/master/src/README.ENGINE I've concluded that by building with no-engine, we won't be taking advantage of external crypto devices. While not ideal, this seems straightforward to me, and I prefer it to making our own modification to the SSL library.
QA INSTRUCTIONS

Scenario 1

Goal: Verify that we are able to make outgoing TLS connections from a variety of short-lived processes without crashing.

Requirements: The scenario must be done with TLS, which is probably easiest via cribl.cloud. I'd recommend registering for a LogStream Cloud instance if you don't already have one. It's not strictly required, but I like looking at the incoming data from the scoped process by navigating to Sources->AppScope->in_appscope_tls and clicking the "Live" button to confirm that data is getting from AppScope to LogStream Cloud successfully.

Procedure:
As long as we don't see a crash that prints "Segmentation fault (core dumped)", we're good.
I can observe the segfault using the repro steps and AWS instance in #650 (comment), and the patch in #656 fixes the segfault.

I've created #669 to help make these kinds of issues more repeatable.

Verified on 0.8.1-rc1.
While looking at #644, I noticed a segfault that seems to happen very repeatably.
Repro instructions...
What is happening is that at least some of the information is making it through and being received in logstream (confirmed with the live capture feature on the appscope source), but the console where node runs prints this message, apparently as the process exits:
Segmentation fault (core dumped)
What I expect is no segmentation fault.
Here's a backtrace, obtained by running gdb node and executing a few commands in gdb. If you run the same thing without tls, it works fine (no segfault).
It crashes the same way on the master branch (I tested this by moving the metriccapture integration test to another sandbox that was on master and therefore did not have any of the code changes from #644 in it).
Since it seemed like this was crashing on exit, and hasn't been seen before, I tried adding the sleep package for node:
And then added this to the end of hotshot.ts:
What I've observed on my ec2 machine (where I originally saw this problem) is that when X is 15 (meaning we delay the exit of node for 15ms), it stops crashing. When X is 10 or less (delaying the exit for 10ms or less), it crashes. This suggests the crash only appears during a small timing window, perhaps the window in which the tls connection is being established.
A few questions that I'd like to answer next:
Q1) Has it been like this since the last time we touched TLS stuff in our code?
A1) Yes, from everything I can tell. I haven't found any recent change in behavior. I've built and run the version before the two latest changes, and the same node test crashes in the same way. My conclusion is that it's been like this since we added tls.
unix socket stuff was done under 5d45e5d on 17-Sept-2021
a mutex was removed under e257968 on 31-Aug-2021
Q2) Even when it crashes with no delay, we have an active connection that is successfully conveying information to logstream cloud... So what new connection is being attempted in the backtrace? (Payload maybe?). Is that connection necessary?
A2) I rebuilt with the payload connection commented out. The crash didn't appear. So my best understanding at the moment is that the timing window exists while the payload tls connection is being established. I assume that we probably have another window of vulnerability while the main tls connection is being established, but I also assume that the timing of this particular test doesn't allow us to see it.
Q3) Can we learn what's causing the crash by inspecting the backtrace?
A3) So, by looking at the source code for the frames in the backtrace, I found this:
So this suggests that the problem is that global_engine_lock is being assigned a value (great!) but then at a later point during the execution, the value is being set to NULL. I confirmed this in gdb:
Ah HA! OPENSSL_cleanup() is getting called!
I did not expect this with our OPENSSL_init_ssl(OPENSSL_INIT_NO_ATEXIT, NULL) call.
So, does the OPENSSL_cleanup() source code offer any possible solutions? Um, yeah, maybe?
So maybe it'd be worth building with OPENSSL_NO_ENGINE?
Yep, when I add "no-engine" to the arguments to openssl/Configure in contrib/Makefile, voilà, no crash.