Race condition in the schannel SSL sessionid cache #815
just want to make sure @mback2k sees this...
Thanks Daniel, I will try to take a look at it this weekend.
@w23 In the meantime, please take a look at the following thread to understand the reasoning for the cached-flag and the need to free sessions in two places: http://thread.gmane.org/gmane.comp.web.curl.library/44652
It seems like OpenSSL does not have the same issues, because the OpenSSL library itself does the reference counting of session IDs. There the refcount is increased by each call to SSL_get1_session and decremented by calls to SSL_SESSION_free until it reaches zero and the session is finally freed. The OpenSSL backend in libcurl only calls SSL_get1_session and SSL_SESSION_free (in order to keep the refcount of an already cached session at 1) during the initial SSL handshake. Since it never decrements the refcount on connection shutdown, it seems to me that one last SSL session per host is never freed until the backend itself is freed. The OpenSSL backend basically just makes sure that there is exactly one cached session with refcount = 1 per host.

The SChannel backend actually does the refcounting itself, because Windows does not provide such a mechanism. The SChannel backend also tries to decrement the refcount and free stale sessions during connection shutdown. Additionally, the session (actually the Windows credential handle) must be kept alive during the whole SSL connection. This means that instead of freeing "stale" sessions while creating a new connection, we are only allowed to replace the session in the cache (remove it from the cache and, if the refcount is not zero, set cached to FALSE instead of freeing it). Then, in order to free SSL sessions that are no longer cached, the SSL connection shutdown code needs to take care of it.
Thanks @bagder, @mback2k for the quick feedback! There are several things I'd like to elaborate on.
Race condition in OpenSSL
I have looked at the OpenSSL codepath more closely, and I still believe that the race exists. Let's examine three lines at https://github.com/curl/curl/blob/curl-7_49_0/lib/vtls/openssl.c#L2075:
The problem here is that the cached SSL_SESSION pointer is returned by Curl_ssl_getsessionid() at line 2075 with the session cache lock already released, while its refcount is only incremented later, by SSL_set_session() at line 2077; nothing protects the session from being destroyed in that gap. The question is, can this gap actually be exploited by some other session-handling code? Let me put the relevant excerpt here as a quick reference (https://github.com/curl/curl/blob/curl-7_49_0/lib/vtls/openssl.c#L2811)
The timeline:
1. Thread A calls Curl_ssl_getsessionid() (line 2075) and gets a pointer to the cached session; the cache lock is released before the call returns.
2. Thread A is preempted before it reaches SSL_set_session() (line 2077).
3. Thread B completes a handshake to the same host and, in the code around line 2811, replaces the cached session, freeing the old one via SSL_SESSION_free() -- its refcount drops to zero.
4. Thread A resumes and passes the now-freed session to SSL_set_session().
This is not the only scenario. A sessionid can also be destroyed indirectly by being pushed out of the limited cache space due to old age. Imagine some thread getting a valid sessionid at line 2075, then getting interrupted before line 2077, and then >=8 connections to >=8 different hosts slipping through in the meantime before the first thread gets to continue with a long-dead sessionid. However unlikely, this unfortunate scheduling pattern could happen in practice, given enough aggregate time.
Really? The easy handles are not shareable between threads so I don't understand that scenario. Each handle is used single-threaded so there shouldn't be any risk for another thread there, not using the same handle/data.
Easy handles themselves are not shareable, indeed. But the session cache part is shared if you do curl_share_setopt(share, CURLSHOPT_SHARE, CURL_LOCK_DATA_SSL_SESSION).
I've been able to trigger this race in the OpenSSL engine. To do so in a reasonable amount of time, I had to make a couple of adjustments to the libcurl sources (note that these don't change any logic).
These changes and the tool used to reproduce the issue can be found here: w23@2f33977. However, to detect this corruption with confidence, instrumentation is needed. I used AddressSanitizer.
This exactly confirms the scenario I described a few comments above. To get these reports you'll need to compile both OpenSSL and curl with ASan.
Two URLs are enough. I usually get the corruption after just 10-20 seconds.
And the fix that I'm proposing looks like this: w23@c89969b. I've also looked at a couple of other engines (axtls, cyassl, darwinssl), although briefly, and I couldn't find any reference counting at all. I am confused as to how they manage session lifetimes.
The Secure Transport (darwinssl) engine simply creates a session ID string and shares it with the Security framework. The framework handles the rest internally.
@w23 Thanks for your in-depth investigation. To me your proposed fix looks good and very straightforward, but since this is a change that affects all TLS backends, Daniel should probably have the final word. Please take a look at the comments I made on w23@c89969b. |
Sessionid cache management is inseparable from managing individual session lifetimes. E.g. for reference-counted sessions (like those in the SChannel and OpenSSL engines), every session addition and removal should be accompanied by a refcount increment and decrement respectively. Failing to do so synchronously leads to a race condition that causes symptoms like use-after-free and memory corruption. This commit:
- makes the existing session cache locking explicit, thus allowing individual engines to manage the lock's scope.
- fixes the OpenSSL and SChannel engines by putting refcount management inside this lock's scope in the relevant places.
- adds these explicit locking calls to the other engines that use the sessionid cache to accommodate this change. Note, however, that it is unknown whether any of these engines could also have this race.
Bug: curl#815
Core issue
The Windows SSPI engine (https://github.com/curl/curl/blob/master/lib/vtls/schannel.c) has trouble managing curl_schannel_cred lifetime properly: calls to Curl_ssl_{get,add}sessionid() in https://github.com/curl/curl/blob/master/lib/vtls/vtls.c don't lead to the refcount being immediately incremented, as they should semantically. This leads to sporadic memory-related crashes (accessing freed memory, double free, etc.) when downloading https:// URLs concurrently in separate threads.
I also believe that the same set of issues is present in the OpenSSL engine in https://github.com/curl/curl/blob/master/lib/vtls/openssl.c, but the race gap is much narrower there and I couldn't crash it in a reasonable amount of time.
And there are also a couple of somewhat related notes on schannel.c:
- The BOOL cached field seems to be superfluous -- maintaining refcount should be enough.
- Two places free curl_schannel_cred objects: https://github.com/curl/curl/blob/master/lib/vtls/schannel.c#L1430 and https://github.com/curl/curl/blob/master/lib/vtls/schannel.c#L1458 -- one Curl_schannel_session_free() called from the appropriate places should be enough.
How to reproduce
To have a chance at reproducing this, one needs to start several threads and use them to fetch several https:// URLs concurrently. Unfortunately, I don't have a clean standalone utility for this purpose, and there is ideally a server part that is completely out of scope.
However, I can at least give you more details about my setup:
- Several (more than max_ssl_sessions*) HTTPS servers/virtual hosts that can stand several random requests per second for hours without anyone complaining or throttling. I use a trivial Python HTTP server for generating several kilobytes of random payload, and nginx as an SSL-wrapping reverse proxy in front of it. I use valid CA-signed SSL certificates, but I think it could also be reproduced using self-signed certs (allowing libcurl to accept them, of course).
- A CURLSH handle is created and set with proper locking functions. This handle is used for all requests.
- Several threads fetch https:// URLs to random servers.
- The curl_easy API is used to perform each request.
After a few minutes and a few thousand requests, a crash is usually observed.
(* Interestingly, this variable seems to be intended to be set by the user, but there is no API for it, only a hardcoded value of 8.)
Ways to fix
I could come up with two feasible-looking options. I'm asking the maintainers for their opinion, as I'm not familiar with the curl codebase.
The general ideas of these options are:
1. Analogously to Curl_ssl_kill_session(), make a Curl_ssl_retain_session() that is called by Curl_ssl_{get,add}sessionid() on appropriate occasions. Amend schannel.c accordingly. Note that if this issue is indeed present in the OpenSSL engine, then this gets a bit trickier: OpenSSL has no separate increment-refcount function for SSL_SESSION, and I see no clean way to fake it.
2. Extend the CURL_LOCK_DATA_SSL_SESSION lock's scope, e.g. by making it taken explicitly by a user of the Curl_ssl_*sessionid() API, rather than implicitly by these functions themselves. Although this is less clean in terms of who manages what lifetime, and looks somewhat ugly for multiple returns, it is a very straightforward change. I have already implemented it for the SSPI and OpenSSL engines locally and discovered that I'm not getting crashes anymore.
Note that I haven't looked at other SSL engines, and have no idea what's going on there. And, unfortunately, it is likely I won't have the resources to do that. I'd appreciate any feedback from informed people on this.
Thanks!