GNUTLS segmentation fault with connection timeouts #215

Closed
anarcat opened this issue Sep 6, 2016 · 41 comments
@anarcat
Contributor

anarcat commented Sep 6, 2016

I have this strange crash here which I can reproduce only when using a qwebirc webchat frontend. From the user's point of view, the connection fails with:

[10:31] == Disconnected from server: Connection to IRC server failed.

From the server side, it's much more fun though:

Program received signal SIGSEGV, Segmentation fault.
0x00007f742bbcf720 in rb_ssl_timeout (F=0x7f742c00e128, notused=<optimized out>) at gnutls.c:85
85      gnutls.c: No such file or directory.
(gdb) bt
#0  0x00007f742bbcf720 in rb_ssl_timeout (F=0x7f742c00e128, notused=<optimized out>) at gnutls.c:85
#1  0x00007f742bbcce2a in rb_checktimeouts (notused=0x7f742c00e128) at commio.c:328
#2  0x00007f742bbd0a80 in rb_run_event (ev=ev@entry=0x17e6620) at event.c:191
#3  0x00007f742bbd459b in rb_read_timerfd (F=0x7f742c00e3d0, data=0x17e6620) at epoll.c:458
#4  0x00007f742bbd491a in rb_select_epoll (delay=<optimized out>) at epoll.c:199
#5  0x00007f742bbce9bc in rb_select (timeout=timeout@entry=18446744073709551615) at commio.c:2104
#6  0x00007f742bbd139c in rb_lib_loop (delay=delay@entry=0) at ratbox_lib.c:228
#7  0x0000000000401baf in main (argc=<optimized out>, argv=<optimized out>) at ssld.c:1219
(gdb) bt full
#0  0x00007f742bbcf720 in rb_ssl_timeout (F=0x7f742c00e128, notused=<optimized out>) at gnutls.c:85
        __PRETTY_FUNCTION__ = "rb_ssl_timeout"
#1  0x00007f742bbcce2a in rb_checktimeouts (notused=0x7f742c00e128) at commio.c:328
        td = 0x17ea200
        F = 0x7f742c00e128
        hdl = 0x7f742bbcf710 <rb_ssl_timeout>
        data = 0x0
#2  0x00007f742bbd0a80 in rb_run_event (ev=ev@entry=0x17e6620) at event.c:191
No locals.
#3  0x00007f742bbd459b in rb_read_timerfd (F=0x7f742c00e3d0, data=0x17e6620) at epoll.c:458
        data = 0x17e6620
        F = 0x7f742c00e3d0
        event = 0x17e6620
        retlen = <optimized out>
        count = 1
#4  0x00007f742bbd491a in rb_select_epoll (delay=<optimized out>) at epoll.c:199
        hdl = <optimized out>
        F = 0x7f742c00e3d0
        num = <optimized out>
        flags = <optimized out>
        old_flags = 1
        op = <optimized out>
        ep_event = {events = 24777344, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}}
        o_errno = <optimized out>
        data = <optimized out>
#5  0x00007f742bbce9bc in rb_select (timeout=timeout@entry=18446744073709551615) at commio.c:2104
        ret = <optimized out>
#6  0x00007f742bbd139c in rb_lib_loop (delay=delay@entry=0) at ratbox_lib.c:228
        next = <optimized out>
#7  0x0000000000401baf in main (argc=<optimized out>, argv=<optimized out>) at ssld.c:1219
        s_ctlfd = <optimized out>
        s_pipe = <optimized out>
        s_pid = <optimized out>
        x = <optimized out>

It looks like F->accept is NULL when the rb_ssl_timeout callback fires. it is unclear why or when that callback gets registered - it looks like it happens during client setup. i did find rb_settimeout being called with a NULL callback here:

Breakpoint 1, rb_settimeout (F=F@entry=0x7ffff7fe5e90, timeout=timeout@entry=0, callback=callback@entry=0x0, cbdata=cbdata@entry=0x0) at commio.c:265
265     in commio.c
(gdb) bt
#0  rb_settimeout (F=F@entry=0x7ffff7fe5e90, timeout=timeout@entry=0, callback=callback@entry=0x0, cbdata=cbdata@entry=0x0) at commio.c:265
#1  0x00007ffff7bab17b in rb_close (F=0x7ffff7fe5e90) at commio.c:873
#2  0x0000000000416331 in close_connection (client_p=0x7ffff7fe5e90) at client.c:1959
#3  0x0000000000416edd in exit_local_server (comment=<optimized out>, from=<optimized out>, source_p=<optimized out>, client_p=<optimized out>)
    at client.c:1442
#4  exit_client (client_p=0x7ffff7fe5e90, source_p=0x0, from=0x0, comment=0x0) at client.c:1588
#5  0x000000000041743d in exit_aborted_clients (unused=0x7ffff7fe5e90) at client.c:1188
#6  0x00007ffff7bad6d0 in rb_run_event (ev=ev@entry=0xd470a0) at event.c:201
#7  0x00007ffff7bb157b in rb_read_timerfd (F=0x7ffff7fe7788, data=0xd470a0) at epoll.c:458
#8  0x00007ffff7bb18fa in rb_select_epoll (delay=<optimized out>) at epoll.c:199
#9  0x00007ffff7bab38c in rb_select (timeout=timeout@entry=18446744073709551615) at commio.c:2105
#10 0x00007ffff7badf9c in rb_lib_loop (delay=0) at ratbox_lib.c:229
#11 0x000000000040be84 in main (argc=0, argv=0x7fffffffe6e8) at ircd.c:732
(gdb) p F->accept
$8 = (struct acceptdata *) 0x0

but it could be that F->accept is supposed to be configured only after the handler is set up? that seems fishy. in fact, with a breakpoint on rb_settimeout, all the handler setups had a NULL pointer there.

i wonder if a naive fix like this would be enough:

diff --git a/libratbox/src/gnutls.c b/libratbox/src/gnutls.c
index 05fe812..9198ff5 100644
--- a/libratbox/src/gnutls.c
+++ b/libratbox/src/gnutls.c
@@ -87,8 +87,10 @@ rb_ssl_clear_handshake_count(rb_fde_t *F)
 static void
 rb_ssl_timeout(rb_fde_t *F, void *notused)
 {
-   lrb_assert(F->accept != NULL);
-   F->accept->callback(F, RB_ERR_TIMEOUT, NULL, 0, F->accept->data);
+   if (F->accept != NULL)
+   {
+       F->accept->callback(F, RB_ERR_TIMEOUT, NULL, 0, F->accept->data);
+   }
 }


any feedback would be appreciated.

This is Charybdis 3.5.3 or 3.4.2 on Debian jessie. This server was working fine when it was running 3.3.0.

@anarcat
Contributor Author

anarcat commented Sep 6, 2016

so the suggested "naive" patch works, but it probably sweeps the problem under the rug - i can't imagine it's a good thing to just skip those callbacks ...

@aaronmdjones maybe you have a better idea here?

@kaniini
Contributor

kaniini commented Sep 6, 2016

I suspect it is related to his TLS cleanups. It looks like rb_ssl_timeout() is being called without F->accept ever being set.

@aaronmdjones
Contributor

aaronmdjones commented Sep 6, 2016

I did not change any of the logic in the GNUTLS backend. As far as I remember I only ever changed the way it reports its version, and prevented a crash bug in rb_ssl_get_cipher(). I don't even grok the callback logic which is why I copied the OpenSSL backend code in my MbedTLS cleanups.

@anarcat
Contributor Author

anarcat commented Sep 6, 2016

is there anyone who groks that code? otherwise maybe my patch should just be added to the other backends, which are probably susceptible to similar crashes.

@aaronmdjones
Contributor

I could have a look at the GNUTLS backend tomorrow and see what it does differently from the OpenSSL backend with respect to its callbacks.

@anarcat
Contributor Author

anarcat commented Sep 6, 2016

i did some of that earlier - couldn't find anything.

at this point, i wouldn't assume OpenSSL works at all - i have been running those servers with GnuTLS and 3.4/3.5 codebases without problems until we deployed qwebirc, which triggers the bug.

@aaronmdjones
Contributor

The OpenSSL backend should work, as it's using the same non-OpenSSL-specific backend code as the MbedTLS backend, which I threw a battery of tests at.

You can see with git blame who was last responsible for editing a particular line. It appears to be you:
https://github.com/charybdis-ircd/charybdis/blame/release/3.5/libratbox/src/gnutls.c

@anarcat
Contributor Author

anarcat commented Sep 6, 2016

well, that's just because i pushed to have GnuTLS support restored in Charybdis, so yeah I "own" the whole file... i didn't actually commit anything in there: https://github.com/charybdis-ircd/charybdis/commits/release/3.5/libratbox/src/gnutls.c

regarding those tests: can those be targeted at the gnutls backend to see if we can reproduce this reliably then?

@aaronmdjones
Contributor

aaronmdjones commented Sep 6, 2016

Unfortunately they were all manual so I'll give it a go tomorrow.

Off the top of my head I tried:

  1. Establishing connections without sending any data
  2. Establishing a connection and sending invalid data
  3. Establishing a connection, sending a ClientHello, and then sending no more data
  4. Completing a TLS negotiation and then not sending any more data
  5. Completing a TLS negotiation and then closing the socket
  6. Completing a TLS negotiation, sending a TLS close_notify alert, and then closing the socket

And maybe a few others. Then the usual course of tests with normal connections (completing user registration) on the 3 different libraries to ensure interop.

These were all done under Valgrind and ASan.

@anarcat
Contributor Author

anarcat commented Sep 6, 2016

  1. Establishing connections without sending any data: no crash

    nc -t 10 -w 10  -v che.indymedia.org  6697 < /dev/null 
    
  2. Establishing a connection and sending invalid data: no crash

    echo "jkfdljfdkls fjsdkfsdl " | nc -t 10 -w 10  -v che.indymedia.org  6697 
    
  3. Establishing a connection, sending a ClientHello, and then sending no more data: not sure how to replicate that (a possible approach is sketched after this list)

  4. Completing a TLS negotiation and then not sending any more data: no crash:

    sleep 60 | gnutls-cli --no-ca-verification --port=6697 foo.example.com
    
  5. Completing a TLS negotiation and then closing the socket: no crash

    gnutls-cli --no-ca-verification --port=6697 che.indymedia.org < /dev/null
    
  6. Completing a TLS negotiation, sending a TLS close_notify alert, and then closing the socket: not sure how to reproduce this easily either (see the sketch after this list)
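
a rough, untested sketch of case 6 (with a note on approximating case 3) using the GnuTLS C API directly -- hostname and port are placeholders, certificate verification is deliberately skipped, and it assumes GnuTLS >= 3.1 for gnutls_transport_set_int():

/* Build: cc close_notify_test.c $(pkg-config --cflags --libs gnutls) */
#include <gnutls/gnutls.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *host = (argc > 1) ? argv[1] : "irc.example.net";  /* placeholder */
    const char *port = (argc > 2) ? argv[2] : "6697";

    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return 1;

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
        perror("connect");
        return 1;
    }
    freeaddrinfo(res);

    gnutls_certificate_credentials_t cred;
    gnutls_session_t session;
    gnutls_global_init();
    gnutls_certificate_allocate_credentials(&cred);  /* empty CA list: no verification */
    gnutls_init(&session, GNUTLS_CLIENT);
    gnutls_set_default_priority(session);
    gnutls_credentials_set(session, GNUTLS_CRD_CERTIFICATE, cred);
    gnutls_transport_set_int(session, fd);

    /* Complete the handshake.  Case 3 could be approximated by making fd
     * non-blocking and stopping after the first GNUTLS_E_AGAIN, which is
     * returned right after the ClientHello has been written. */
    int rc;
    do {
        rc = gnutls_handshake(session);
    } while (rc < 0 && gnutls_error_is_fatal(rc) == 0);
    if (rc < 0) {
        fprintf(stderr, "handshake failed: %s\n", gnutls_strerror(rc));
        return 1;
    }

    /* Case 6: send a close_notify alert, then close the socket without
     * waiting for the server's close_notify in return. */
    gnutls_bye(session, GNUTLS_SHUT_WR);
    close(fd);

    gnutls_deinit(session);
    gnutls_certificate_free_credentials(cred);
    gnutls_global_deinit();
    return 0;
}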

it would be great to have that stress test suite available somewhere...

aaronmdjones self-assigned this Sep 6, 2016
@aaronmdjones
Contributor

I have done some experimentation against the GNUTLS backend and cannot find any problems.

See Here
Testsuite Here

If you can find what exactly in qwebirc is tripping up the IRCd I can write a reproducible test for it.
Since I cannot reproduce it from the outside, I am going to take a closer look at the logic in that backend later today.

@aaronmdjones
Contributor

I have made some progress in the investigation.

I suspect it is caused by negotiation failure -- the current GNUTLS backend code does not distinguish between negotiation success and failure before indicating to the IRCd that everything is okay.

I am currently gutting the GNUTLS backend to bring it in line with the MbedTLS one. I will update the issue when I have some code for you to test.

@aaronmdjones
Contributor

Please test commit c0f6591 in aaronmdjones/charybdis

@anarcat
Contributor Author

anarcat commented Sep 10, 2016

wow... that diff is scary... did you test that code at all?

@aaronmdjones
Contributor

Nope! That's why I asked you to test it... and why I didn't push it upstream.

However, I did just test commit fbe8a69 and I can establish a session with that.

aaronmdjones changed the title from "segfault on timeout of SSL connexions" to "GNUTLS segmentation fault with connection timeouts" on Sep 10, 2016
@aaronmdjones
Contributor

I'm happy with commit 569d5f6 if you are. I'll be throwing more tests at it and then opening a pull request for other chary developers to review. I've also done the same to the OpenSSL backend. The diffs between all 3 of them are now fairly minimal.

@anarcat
Contributor Author

anarcat commented Sep 11, 2016

if i understand this correctly, i actually need 3 patches: c0f6591, fbe8a69 and 569d5f6?

it's too bad that the diff is so far-reaching: it makes it basically impossible to review, and hard to backport to 3.4, short of just copying the file in place completely.

that's basically what i'll do to test this here, because i am not sure that those three patches are sufficient.

thanks for working on this anyways!

@aaronmdjones
Contributor

Oh yeah you'd definitely be better off just replacing the file wholesale; I did mean "the current state of the file at commit ...". I don't need you to actually work on integrating it though, I just need you to confirm it fixes the issue, and then we'll go about integrating it into upstream and you can just grab the repository at that commit instead.

Since I'm doing work on all 3 backends, and this new backend will at least be a bug fix (I also discovered several memory leaks in the current upstream GNUTLS backend), we may even end up doing a new release for it, which should make your work even easier.

@anarcat
Contributor Author

anarcat commented Sep 11, 2016

that's great news! the problem is i am still maintaining a 3.4 branch for debian stable here. i factored in a few things there, including pulling in all of gnutls.c from scratch, as it is simply not present in 3.4 :( so that would be a little harder to merge in as well... :)

@aaronmdjones
Contributor

Well if you already pulled in one whole file ... ;)

@anarcat
Contributor Author

anarcat commented Sep 11, 2016

problem is auditability: the whole file changed, so there's no way for us to figure out wth happened. gnutls support generally works very well in 3.4, it's just this one bug that was left. :p

@aaronmdjones
Contributor

The main problem (the one I suspect is causing this reported issue) is that the current upstream GNUTLS backend does not properly check the return value of do_ssl_handshake. Find the 2 places where it calls that in an if() and adjust it so that it explicitly checks for == 1. At the moment, -1 will also satisfy the ifs, even though -1 is a handshake error.

It's still going to leak memory all over the place though. Adjusting the test for handshakes that succeed is only going to make the leaks worse -- every failed handshake will also leak memory (right now they just crash the backend if they fail in the right way), in addition to the current rehash leaks.
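
To illustrate the distinction (this is not the libratbox do_ssl_handshake() itself, just a sketch of the three outcomes a GnuTLS handshake step can have and what they should map to):

#include <gnutls/gnutls.h>

/* Sketch only: 1 = negotiation complete, 0 = still in progress, -1 = failed. */
static int handshake_step(gnutls_session_t session)
{
    int rc = gnutls_handshake(session);

    if (rc == GNUTLS_E_SUCCESS)
        return 1;    /* finished -- safe to run the accept callback */

    if (rc == GNUTLS_E_AGAIN || rc == GNUTLS_E_INTERRUPTED)
        return 0;    /* in progress -- wait for more I/O and retry */

    return -1;       /* fatal failure -- a plain if(...) would treat this as success */
}

A plain if(handshake_step(...)) takes the success path for both 1 and -1, which is exactly the trap described above; hence the explicit == 1.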

@aaronmdjones
Contributor

Also, I sure hope you backported commit 818a3fd (if applicable).

@anarcat
Contributor Author

anarcat commented Sep 11, 2016

818a3fd was released in Debian last week, iirc - i made sure of that:

https://security-tracker.debian.org/tracker/CVE-2016-7143

heck, i'm the one that requested that CVE. :p

@aaronmdjones
Contributor

@kaniini Do we even still support version 3.4?

I've not been prodded (by you or anyone else) to push any security improvements or fixes to it.

@anarcat
Contributor Author

anarcat commented Sep 11, 2016

regarding == 1, that seems like a better approach for sure! it looks something like this here:

--- libratbox/src/gnutls.c.orig 2016-09-11 16:30:04.159263409 -0400
+++ libratbox/src/gnutls.c  2016-09-11 16:30:21.051829961 -0400
@@ -157,7 +157,7 @@
    gnutls_dh_set_prime_bits(*ssl, 1024);
    gnutls_transport_set_ptr(*ssl, (gnutls_transport_ptr_t) (long int)new_F->fd);
    gnutls_certificate_server_set_request(*ssl, GNUTLS_CERT_REQUEST);
-   if(do_ssl_handshake(new_F, rb_ssl_tryaccept, NULL))
+   if(do_ssl_handshake(new_F, rb_ssl_tryaccept, NULL) == 1)
    {
        struct acceptdata *ad = new_F->accept;
        new_F->accept = NULL;
@@ -189,7 +189,7 @@
    gnutls_dh_set_prime_bits(SSL_P(new_F), 1024);
    gnutls_transport_set_ptr(SSL_P(new_F), (gnutls_transport_ptr_t) (long int)rb_get_fd(new_F));
    gnutls_certificate_server_set_request(SSL_P(new_F), GNUTLS_CERT_REQUEST);
-   if(do_ssl_handshake(F, rb_ssl_tryaccept, NULL))
+   if(do_ssl_handshake(F, rb_ssl_tryaccept, NULL) == 1)
    {
        struct acceptdata *ad = F->accept;
        F->accept = NULL;

there are two other places where the return value of do_ssl_handshake() is not checked at all, but i guess that doesn't matter because the callbacks aren't called.

i would sure prefer such a patch for now, because right now i'm battling with my crappy silencing patch and it looks pretty ugly. i think the memory leaks are not so critical as to warrant a full update of the gnutls backend in stable, but i could ask the release team for approval on that explicitly.

regarding 3.4 support, unfortunately i have to support it here, at least for security, because we have that in stable.

if you make a clear statement that using 3.4 is harmful, then maybe i could convince the release team to upgrade to 3.5 altogether, but that (running the same version across all suites) happens rarely.

@aaronmdjones
Contributor

At the very least I'd say it should be strongly discouraged.

I doubt you have the configurable fingerprint digest algorithm functionality added in 3.5, for example -- which would mean your GNUTLS backend port is using SHA-1 for digesting client and server certificates. SHA-1 is no longer considered acceptable for this purpose; hell, NIST was recommending against it 6 years ago.
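
For comparison, a standalone sketch (not the charybdis code; the certificate path is a placeholder) of digesting a certificate with SHA-256 via plain GnuTLS instead of SHA-1:

/* Build: cc fingerprint.c $(pkg-config --cflags --libs gnutls) */
#include <gnutls/gnutls.h>
#include <gnutls/x509.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "server.pem";  /* placeholder */
    gnutls_datum_t pem;
    gnutls_x509_crt_t cert;
    unsigned char digest[32];          /* SHA-256 output size */
    size_t len = sizeof(digest);

    if (gnutls_load_file(path, &pem) != GNUTLS_E_SUCCESS)
        return 1;
    gnutls_x509_crt_init(&cert);
    if (gnutls_x509_crt_import(cert, &pem, GNUTLS_X509_FMT_PEM) != GNUTLS_E_SUCCESS)
        return 1;
    gnutls_free(pem.data);

    /* GNUTLS_DIG_SHA256 here instead of a hard-wired GNUTLS_DIG_SHA1 */
    if (gnutls_x509_crt_get_fingerprint(cert, GNUTLS_DIG_SHA256, digest, &len) != GNUTLS_E_SUCCESS)
        return 1;
    for (size_t i = 0; i < len; i++)
        printf("%02x", digest[i]);
    putchar('\n');

    gnutls_x509_crt_deinit(cert);
    return 0;
}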

@anarcat
Contributor Author

anarcat commented Sep 11, 2016

yeep, sha-1. :/ i guess that'll be a good argument for upgrading, but for now i'm just trying to fix this one crash. ;)

@kaniini
Contributor

kaniini commented Sep 12, 2016

@aaronmdjones i'm open to doing a 3.4.3 security release if you want, but would rather emphasize adoption of the 3.5 and 4.0 branches, as charybdis 5 is most likely not due to be stable for at least another 18 months.

@aaronmdjones
Contributor

I'd rather encourage adoption of 3.5 too.

@aaronmdjones
Contributor

@anarcat Please see the series of commits in PR #220.

I broke them down into small easily auditable chunks (except for the last one -- see its commit message).

Please let me know if that works for you.

@anarcat
Contributor Author

anarcat commented Feb 28, 2017

so i'd be glad to jump ship and join the 3.5 bandwagon, although i doubt i could convince the stable release managers to agree in debian. but that hardly matters: i need to fix this bug. i stepped away from this for a while and now i'm back to square one, because there's no release that bundles those fixes in a meaningful way. 3.5.3 was released before those branches were merged...

what's the way forward here? chary 4?

@aaronmdjones
Contributor

aaronmdjones commented Feb 28, 2017 via email

@anarcat
Contributor Author

anarcat commented Feb 28, 2017

@aaronmdjones thanks - i'll test that and report back here.

@anarcat
Contributor Author

anarcat commented Feb 28, 2017

@aaronmdjones i'm getting this in 3.5.4, investigating:

2017/2/28 16.53 libratbox reports: rb_setup_ssl_server: gnutls_x509_crt_list_import: The given memory buffer is too short to hold parameters.
2017/2/28 16.53 WARNING: Unable to setup SSL.
2017/2/28 16.53 libratbox reports: rb_setup_ssl_server: gnutls_priority_init: The request is invalid., error begins at 'VERS-TLS-ALL:!VERS-TLS1.0:!VERS-SSL3.0:%SAFE_RENEGOTIATION'? -- using defaults instead
2017/2/28 16.53 libratbox reports: rb_setup_ssl_server: TLS configuration successful

SSL connections fail:

16:55:00 -!- Irssi: warning SSL handshake failed: server closed connection unexpectedly

@anarcat
Contributor Author

anarcat commented Feb 28, 2017

btw, the .crt file on that server has a chain of 5 X509 certificates... but it could be that it's trying to parse /etc/ssl/certs/ca-certificates.crt which, in debian, has over 350 certificates. i found a similar bug in apache2 where that call was limited to 128 certificates; the fix was to dynamically allocate the certs so they would all load correctly, see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=511573
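
for reference, the dynamically-allocating variant looks something like this (a sketch assuming GnuTLS >= 3.1, not the libratbox code; the chain path is a placeholder) -- it has no fixed upper bound on the number of certificates, unlike a caller-supplied array passed to gnutls_x509_crt_list_import():

#include <gnutls/gnutls.h>
#include <gnutls/x509.h>
#include <stdio.h>

static int load_chain(const char *path)
{
    gnutls_datum_t pem;
    gnutls_x509_crt_t *certs = NULL;
    unsigned int count = 0;
    int rc;

    /* read the whole PEM file into a datum */
    rc = gnutls_load_file(path, &pem);
    if (rc != GNUTLS_E_SUCCESS)
        return rc;

    /* import2 allocates the certificate array itself */
    rc = gnutls_x509_crt_list_import2(&certs, &count, &pem, GNUTLS_X509_FMT_PEM, 0);
    gnutls_free(pem.data);
    if (rc < 0)
        return rc;

    printf("loaded %u certificate(s) from %s\n", count, path);

    for (unsigned int i = 0; i < count; i++)
        gnutls_x509_crt_deinit(certs[i]);
    gnutls_free(certs);
    return 0;
}

int main(int argc, char **argv)
{
    return load_chain((argc > 1) ? argv[1] : "/etc/ssl/chain.pem") < 0;  /* placeholder path */
}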

@anarcat
Contributor Author

anarcat commented Feb 28, 2017

well, what a clusterfuck... turns out the gnutls code doesn't really work at all the minute you actually have a certificate to load (or maybe a certificate chain? who knows...), see #238 for details.

the second error was fixed by using a custom ssl_cipher_list like this:

        /* from https://wiki.mozilla.org/Security/Server_Side_TLS#GnuTLS_ciphersuite */
        ssl_cipher_list = "NONE:+VERS-TLS1.2:+VERS-TLS1.1:+VERS-TLS1.0:+ECDHE-RSA:+DHE-RSA:+RSA:+AES-128-GCM:+AES-128-CBC:+AES-256-CBC:+SIGN-RSA-SHA256:+SIGN-RSA-SHA384:+SIGN-RSA-SHA512:+SIGN-RSA-SHA224:+SIGN-RSA-SHA1:+SIGN-DSA-SHA256:+SIGN-DSA-SHA224:+SIGN-DSA-SHA1:+CURVE-ALL:+AEAD:+SHA256:+SHA384:+SHA1:+COMP-NULL";

i would have thought that a working default would have been shipped in the gnutls module... oh well.
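
for what it's worth, the priority string can be sanity-checked against the local GnuTLS with a few lines of C (sketch only; assumes the GnuTLS dev headers) -- err_pos points at the first token the parser rejects, which is where the "error begins at ..." message above comes from:

/* Build: cc check_priority.c $(pkg-config --cflags --libs gnutls) */
#include <gnutls/gnutls.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *prio = (argc > 1) ? argv[1] : "NORMAL";  /* pass the ssl_cipher_list value here */
    gnutls_priority_t cache;
    const char *err_pos = NULL;

    int rc = gnutls_priority_init(&cache, prio, &err_pos);
    if (rc != GNUTLS_E_SUCCESS) {
        fprintf(stderr, "rejected: %s, error begins at '%s'\n",
                gnutls_strerror(rc), err_pos ? err_pos : "?");
        return 1;
    }
    puts("priority string accepted");
    gnutls_priority_deinit(cache);
    return 0;
}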

@anarcat
Contributor Author

anarcat commented Feb 28, 2017

oh, and with #238, i can confirm this issue is resolved in 3.5.4. i am not sure we can say it's really fixed until that's merged, because i can't really test it, but i assume it is...

anarcat closed this as completed Feb 28, 2017
@aaronmdjones
Contributor

Apologies. It seems the default I shipped uses a token that was introduced in a recent GNUTLS release, while Debian is still stuck on an older one. I will use older tokens.

@aaronmdjones
Contributor

With your PR merged and commit 5d8a480, does it work?

@anarcat
Contributor Author

anarcat commented Feb 28, 2017

@aaronmdjones yes, thanks.
