Race condition in the schannel SSL sessionid cache #815
Windows SSPI engine (https://github.com/curl/curl/blob/master/lib/vtls/schannel.c) has trouble managing
This leads to sporadic memory-related crashes (accessing freed memory, double free, etc) when downloading
I also believe that the same set of issues is present for OpenSSL engine in https://github.com/curl/curl/blob/master/lib/vtls/openssl.c, but the race gap is way more narrow there and I couldn't crash it in a reasonable amount of time.
And there are also a couple of somewhat related notes on
How to reproduce
To have a chance at reproducing this one needs to start several threads and use them to fetch several
However, I at least can give you more details about my setup:
After a few minutes and a few thousands requests a crash is usually observed.
(*interestingly, this variable seems to be intended to be set by user, but there is no API for it, only a hardcoded value of 8)
Ways to fix
I could come up with two feasibly-looking options. I ask maintainers for their opinion, as I'm not familiar with curl codebase.
General ideas of these options are:
Note that I haven't looked at other SSL engines, and have no idea what's going on there. And, unfortunately, it is likely I won't have resources to do that. I'd appreciate any feedback from informed people on this.
The text was updated successfully, but these errors were encountered:
@w23 In the meantime, please take a look at the following thread to understand the reasoning for the cached-flag and the need to free sessions in two places: http://thread.gmane.org/gmane.comp.web.curl.library/44652
It seems like OpenSSL does not have the same issues, because the OpenSSL library itself does the reference counting of session IDs. There the refcount is increased by each call to SSL_get1_session and decremented by calls to SSL_SESSION_free until it reaches zero and the session is finally freed.
The OpenSSL backend in libcurl only calls SSL_get1_session and SSL_SESSION_free (in order to keep the refcount of an already cached session at 1) during the initial SSL handshake. Since it never decrements the refcount on connection shutdown, to me it seems like one last SSL session per host is never freed until the backend itself is freed. The OpenSSL backend basically just makes sure that there is just one cached session with refcount = 1 per host.
The SChannel backend actually does the refcounting itself, because Windows does not provide such a mechanism. And the SChannel backend also tries to decrement the refcount and free stale sessions during connection shutdown. Also the session (actually the Windows credential handle) must be kept alive during the whole SSL connection. This means that instead of freeing "stale" sessions while creating a new connection, we are only allowed to replace the session in the cache (remove it from the cache and if refcount is zero instead of freeing it set cached to FALSE). Then in order to free SSL sessions that are no longer cached, the SSL connection shutdown code needs to take care of it.
There are several things I'd like to elaborate.
Race condition in OpenSSL
I have looked at OpenSSL codepath more closely, and I still believe that the race does exist.
Let's examine three lines at https://github.com/curl/curl/blob/curl-7_49_0/lib/vtls/openssl.c#L2075:
The problem here is that the
The question is, can this gap be actually exploited by some other session-handling code.
Let me put the relevant excerpt as a quick reference here (https://github.com/curl/curl/blob/curl-7_49_0/lib/vtls/openssl.c#L2811)
This is not the only scenario. Sessionid can also be indirectly destroyed by being pushed it out if limited cache space due to old age. Imagine some thread getting a valid sessionid at line 2075, then getting interrupted before line 2077, and then >=8 connections to >=8 different hosts slipping through in the mean time before the first thread gets to continue with long-dead sessionid. However unlikely, this unfortunate scheduling pattern could happen in practice, given there's enough aggregate time.
Really? The easy handles are not shareable between threads so I don't understand that scenario. Each handle is used single-threaded so there shouldn't be any risk for another thread there, not using the same handle/data.
Easy handles themselves are not shareable, indeed. But the session cache part is shared, if you do
I've been able to trigger this race in OpenSSL engine. To do so in a reasonable amount of time, I had to make a couple of adjustments to libcurl sources (note that these don't change any logic):
These changes and the tool used to reproduce the issue can be found here: w23@2f33977.
However, to detect this corruption with a confidence, an instrumentation is needed. I used
Which exactly confirms the scenario I described a few comments above.
To get these you'll need to compile both openssl and curl with ASAN.
Two URLs are enough. I usually get corruption just after 10-20 seconds.
And the fix that I'm proposing looks like this: w23@c89969b.
I've also looked at a couple of other engines (axtls, cyassl, darwinssl), although briefly, and I couldn't find any reference counting at all. I am confused as to how they manage sessions lifetimes.
Sessionid cache management is inseparable from managing individual session lifetimes. E.g. for reference-counted sessions (like those in SChannel and OpenSSL engines) every session addition and removal should be accompanied with refcount increment and decrement respectively. Failing to do so synchronously leads to a race condition that causes symptoms like use-after-free and memory corruption. This commit: - makes existing session cache locking explicit, thus allowing individual engines to manage lock's scope. - fixes OpenSSL and SChannel engines by putting refcount management inside this lock's scope in relevant places. - adds these explicit locking calls to other engines that use sessionid cache to accommodate for this change. Note, however, that it is unknown whether any of these engines could also have this race. Bug: curl#815