-
-
Notifications
You must be signed in to change notification settings - Fork 6.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[libcurl] Returning from a write callback with error does not immediately return from curl_easy_perform #4487
Comments
Before you manage to reproduce with an example, you can just put a break-point here: Line 614 in a626fa1
And then A) issue a back trace and show that to us and B) single-step forward a bit to a pointer where you see it "forgetting" about the error code the callback returned. |
Thanks for the tip, I'll try that... which means I'll have first to get the source (easy with apt) and find how to replace the library with the locally compiled and modified stuff! Another test I'll do, is try the same thing without tls, in case the delay might be caused by curl trying to clean things with GnuTLS, and that cleaning taking some random time. |
#define STREAM_ERROR ((CURLcode)99) /* CURL errors stop at 90 */ at 94 now and will increase as more are added. |
Thanks, I'll change that... is there a
... or the like, so that I set my own internal errors to a value not already used by curl? |
|
I wouldn't mess with that it's a similar predicament. We could add a CURLE_USER_MASK 0x80000000 but personally I think that's just masking the smell (pun). Better to use your own error type. |
Back to the bug at hand!.. More tests give me that:
So I am assuming this is either:
I used to link with GnuTLS library since I'm still running Ubuntu 16.04 on my desktop and laptop, and this version does not have the minimum OpenSSL 1.1 to be thread-safe. So OpenSSL was not an option with my heavily threaded(*) use case under Ubuntu 16.04 On the other hand, Raspbian Buster is more up to date and sports a 1.1.1d version of OpenSSL as shown:
So as far as this "alleged bug"(**) is concerned, since I have a perfectly fine workaround (using OpenSSL!), you can mark it as "low priority". Unless it concerns others that really need GnuTLS on arm, this will have lesser priority in my /TODO list of things I still have to work on to properly finish configuring my Raspberry Pi 4! ==> Please can you mark as "Low Priority" (*) "heavily threaded" (program at rest)
(**) "alleged" but it can be very easily reproduced with my program... if you have a 1fichier.com subscription! Many thanks for your reactivity and quick responses! |
By the way, "putting a breakpoint" is somehow impractical in an heavily threaded program with several curl_easy_perform happening in the background. So what I would have to do in this case is to add some traces (printf like). A little more effort than a simple "break-point + trace" unfortunately. For instance, wonderful tools such as nemiver are of no use at all, unless debugging the initialisation phase. Even if you make it work with a daemon (not sure it is possible!) that would so much disrupt the inter-thread timings that you could very well not observe some bugs. |
If the problem is in curl, then a break-point would still be valuable no matter how many other threads run because they're not the ones we care about. If the issue is indeed within GnuTLS, then a break-point won't help much. Then the question I have what is curl/GnuTLS doing all those minutes from the error is returned from the callback until you get the libcurl functions error code returned? Is it stuck in a function call? If so, which and why? If not, then I don't think it is GnuTLS' fault and where back on the first paragraph again... |
@Badger, in fact the program has (by default) 4 readers. When the bug occurs, all four readers have been used at least once (round robin), and we try to use an "idle" reader: more than 45 seconds with no requests, but the curl-callback is still active. The break-point would then be common for all readers and break each time a curl finishes on any reader. That is because of course the code is not duplicated for each thread. ;-) It might help, but it will most probably disrupt the whole scheduling of threads and possibly make the bug vanish, because each time it breaks you would have to step-by-step in case it is "the bug"... which gives time for the faulty GnuTLS to finish his background stuff. Another bigger issue (for me) is that I am not so comfortable with gdb... especially on a multi-thread daemon. Most probably also, to be able to break at breakpoints, gdb has to set its own locking that could disrupt many things! And it's easier counterpart (Nemiver) does probably not work in this case. So the most realistic to debug is indeed adding some kind of "printf". That what I did actually on the current program: a "printf" just before returning from the callback, and another one just after the exit of curl_easy_perform. On my trace, I see that those "printf" are several second (up to several minutes) apart when the bug occurs. A working case (when there is no bug) looks like that
Which means:
For debugging purpose I added a "trace" just before the return from the callback. This is not shown here because this is the "packaged" driver, but the trace shown that we indeed returned from the callback and had to wait several second to see the corresponding 'curl_easy_perform exited with 23'. |
I confirm gdb completely breaks the program, probably due to the locking it adds. Nevertheless I could try to provide a trace, just attaching it at the moment when the bug is visible... but for now fail to see any curl symbol. Here is what I have so far (working case)
So we correctly my own symbols: write_data (the curl-callback), but the trace says it is called from "??" in libcurl! Is there a method to display more useful debugging symbols from the libcurl stack? [EDIT] after some searching on the web, it seems that unlike Debian, Raspbian provides only "striped" packages with no debug symbol. So the alternative method seems to be: rebuild the package with "nostrip" option, so that we get symbols. I'll try that later! |
I managed to get a backtrace using the documentation at this page: https://wiki.debian.org/HowToGetABacktrace Unfortunately, even after rebuilding curl (which tool forever x 2!) with nostrip option, I still get a trace with NO libcurl symbols, same as the previous post. So if someone has a valid pointer on how to get those libcurl symbols, I'll try again, otherwise I guess the bug will stay, sorry! |
If you don't get the symbols then it sounds like you're not using your newly built version? |
Yes I did to the limit of my skills that go as far as, after rebuilding:
To remove the version installed from the repo. Then did: Compiled successfully, reproduced the bug (it is not systematic but easy to reproduce), and failed to get a relevant trace! I have another try to do, the debian wiki page I linked recommends to build with: I saw on another page the you can add the undocumented debug option to the packaging. Also I am wondering, and you might have the answer to that as "curl specialists". When I issued the build command above, it did not only build libcurl4-gnutls-dev, but built all flavors of lib4curl4-xxx-dev, and also built curl itself. My question would be, indeed the "dev" versions of the libraries are needed when one does a program requiring libcurl. Those versions bring for example the C headers (curl.h and others), the curl-config utility, etc... But once the program has been dynamically linked, I assume it is not using the "-dev" libraries when it runs, othewise anyone wanting to run such program would have to install the "-dev" package (sure we can put a "Depends" on the .deb). So maybe the error is there. Building the "-dev" didn't rebuild the runtime libraries that are used once the program have been compiled. Any advise on that? So in short, put up a little document about "how to backtrace curl/libcurl on Linux" would be nice. Obviously a documentation for Debian can be adapted to Ubuntu (or conversely), and the ones using Fedora would easily convert apt to dnf! |
The "dev versions" you speak of I think are Debian/Ubuntu packages and they're designed, made and maintained by maintainers in those distros. Not by anyone in the curl project. I only have very rudimentary knowledge of how they work.
I wouldn't and I never involve that at all. I build curl and libcurl from source, install that and then make my app use that custom install. Done. To "backtrace" on linux, you run a program with gdb, set a break-point with |
Many thanks for your help. So we might need the lights of Debian packagers! I'll try to guess what is the library I must reinstall after build, obviously the "dev" libraries don't help. Also I didn't manage to static link with them... possibly because they are not the right ones! As for backtrace, this page I linked is very nice and enough. In my case, since gdb completely blocks the mutli-thread process, it is better to do a "batch trace". That is a single command that give a trace and exits. So I launch the program, look at its PID and do
Then I have the backtrace, and the program resumes because gdb exits after the "batch trace". |
I don't think we/you do! Forget about the debian package for a second. Build libcurl from source, install it with 'make install' in a custom place (using configure's |
Indeed that is an elegant solution... if I can manage to link statically, which is not the default. So far it failed, but I keep hope! Another solution could also be to temporary change the dynamic library path... I'll have to look into that too. |
So here is the best I can do. When a "asynchronous reader" is waiting for work to do (on a semaphore), here is the stack
Quite straightforward, pthread_create launched async_reader() that called get_work() which waits on a semaphore. When work arrives, here is the normal situation:
So we see that async_reader() has now called stream_fs() who called curl_easy_perform() which after some stacking calls my callback cleverly named write_data(). We have 4 identical readers doing the same stuff. With GnuTLS, when the bug happens here is what we have on some of the threads:
Now our "thread 4" is back to the "idle" situation waiting for some work to do. The other 3 readers are blocked on some obscure libgnutls or libnettle function. A symbol appear though: Curl_gtls_pull, that seems to be called somewhere by GnuTLS. Since I don't have the symbols of this libraries either, the stack is not correctly shown, although after some point it comes back to the "normal" situation above with curl_easy_perform() showing in the middle of my own calls. So my guess is that when we return an error from the callback, libcurl does some of freeing or cleaning things like "close", free sockets, etc... with gnutls, and that does not work correctly at least is not immediate. Looking at the synbols GnuTLS seems to have specific code for Curl, and the "bug" could possibly there also. Does this help a bit, or is it even more confusing? |
So here is a cleaner trace of what happens when we are "blocked" Time is 21:58:59
Time is 21:59:07
Time is 21:59:14
That means that we have been waiting more than 15 seconds on the following stack: gnutls_bye seems to be doing various things, but it takes looooong time. I guess that's the best I can do for now. I leave the hand to you specialists of curl/gnutls. The only thing I can do if that helps is a trace with "full" (including parameters) but that is extremely verbose! |
[Linux Documentation: HOW-TO get a backtrace] (Note, that is NOT part of the bug report. But since documentation was hard to find for me to get that "backtrace", I wrote that in case it helps some other person. Moderator, feel free to move that comment elsewhere to a place where people searching for that would find it... or even feel free to remove it, if you feel it is irrelevant) If your distribution doesn't provide ready to use packages with "debug symbols" (as Ubuntu does for example), you will have to dot that yourself. The cleaner way I have made it work is:
As usual with Linux, libraries in /usr/local/lib/arch will have precedence over those in /usr/lib/arch where standard libraries are stored. To know what libraries you need, you can launch your program with strace on files, since the first thing linux will do is fetch the dynamic libraries.
As you see in the example, most libraries come from /usr/lib/arm-linux-gnueabihf
You are reverted back to using the standard libraries you installed with normal packages. |
gnutls_bye is probably sending a close_notify and waiting for a response on a dead connection. "GNUTLS_SHUT_RDWR actually sends an alert containing a close request and waits for the peer to reply with the same message." diff --git a/lib/vtls/gtls.c b/lib/vtls/gtls.c
index 0a83f35..3737d7c 100644
--- a/lib/vtls/gtls.c
+++ b/lib/vtls/gtls.c
@@ -1608,7 +1608,7 @@ static ssize_t gtls_send(struct connectdata *conn,
static void close_one(struct ssl_connect_data *connssl)
{
if(BACKEND->session) {
- gnutls_bye(BACKEND->session, GNUTLS_SHUT_RDWR);
+ gnutls_bye(BACKEND->session, GNUTLS_SHUT_WR);
gnutls_deinit(BACKEND->session);
BACKEND->session = NULL;
} That is just for testing and may have security implications, I can't tell from the documentation. But I found this https://savannah.gnu.org/support/?107495 I think the real issue here is that you're waiting long periods in the callback which is killing the connections. I don't think we should be responsible for that. Also I suggest if you haven't already review libcurl thread safety. |
Indeed Jay, the callback can wait up to 45 seconds (a #define in the program). It is precisely the point of a "background reader" to wait for the read requests, hoping they will come in sequence so that it can read on the same "stream" (same curl-range request). The peer (1fichier.com) is doing "keep-alive" for 5 minutes, but I noticed on a previous version of curl/gnutls that things does not do restart smoothly if the wait is over 1 minute, in spite of the server's keep-alive. So 45 second was sort of a tradeoff. I had already thoroughly reviewed the "libcurl thread safety" page, and it is why I was previously using GnuTLS instead of the default OpenSSL which was not thread-safe as of Ubuntu 16.04's release date. |
Dear Jay, I did some preliminary testing with your patch, and couldn't reproduce the bug/behaviour. So this seems to fix the bug. What I will do now is used this gnutls version of my program, as I use it daily, and monitor whether I see the "bug" again. If it doesn't show in a few days, we might consider it "fixed". Then there will be two options:
"Security implication": my understanding it that there are none here, at least considering the "truncation attack" that is mentioned in the bug you linked. Are there other known attacks? My understanding of the "truncation attack", as you can see an example here: https://bugs.chromium.org/p/chromium/issues/detail?id=244260 In our case, we are not at all in this situation. The only benefit of SHUT_RDWR would be to know that the server indeed received the alert... but what for would you do that. But anyway, whether the server acknowledges the alert or not, we have nothing else to do that clean and stop everything... and there is no code doing otherwise in case we think the server did not receive the alert! By the way, networks typically exhibit the "Byzantine General conundrum", and there might be situation where you cannot know whether the peer received the alert or not, or whether the alert was triggered by an attacker. Summary: considering the current code, I see no benefit to wait for a server's reply to the alert since anyway there is no code taking action A or action B depending on the fact that the server acknowledges the alert or didn't acknowledge it! Using SHUT_WR, considering the current code, is the right move because it will warn the server (in situation when the transport is still able to send that alert), and anyway at this point in the code we are not intend to read or write on that connection. Possibly you have to analyse if further down the way in libcurl you have some "optimisation" that would try to reuse the same socket with the server. That would obviously be wrong here because we didn't wait for the server confirmation that all is stopped and could receive data from a previous session. So if such "optimisations" occur, they would have to be changed. |
More arguments in favor of the proposed patch! 1) The current code, using GNUTLS_SHUT_RDWR, assumes the server respects the protocol and will respond to the peer alert with an alert. If the server did not respect that protocol but still note that the peer wants to close the connection, all will still be fine but the closing peer will in 100% of the case suffer from a long delay waiting for an alert response that will never come, due to the other peer not respecting the protocol! (Note that in my own case, 1fichier.com's servers probably respect the TLS1.1 alert protocol because the "bug" is not systematic. If it didn't, we would suffer from the delay for all curl terminations). 2) The "truncation attack", when it occurs, happens when the peers are in active communication (not when one of the peer wants to stop communicating as already pointed out). In the case of libcurl and the use here, it would mean that when downloading data from the server, the attacker would send an alert to the client, making libcurl believe that the server wants to stop the TLS session. In such case, what does libcurl do? If it uses the same "close_one()" function, it shouldn't use GNUTLS_SHUT_RDWR, because we just want to respond to the server with the same alert to abide by the protocol, but there is no point waiting for a server alert in return since we were alerted first. 3) as already pointed out, the correct code would be at least: |
@jay, I think we should consider doing this WR-only change to better avoid the risk of this kind of hang. Will you make a PR out of this? |
Not against it but AFAIK this is the only report we've had of this which makes me wonder if it's something in his configuration, or GnutTLS/3.6.7 has a bug? Is gnutls_bye allowed to block for a long amount of time? /cc @nmav |
It could possibly be a bug of this version of GnuTLS.
As for libcurl, not sure whether it was at the time calling gnutls_bye with WR-only and someone changed if to RDWR between libcurl 7.47 and 7.64... if you can't check that with git, I can download the source from the Ubuntu 16.04 repo and check. This is possibly the first report of this behaviour because you need a very "strange" combination to make it happen:
Not a combination that you would find in most uses of libcurl! It could also be a plain bug of GnuTLS on ARM, as I'm now on ARM (32bits at the moment). As you know, the memory model is quite different between i386/amd64 and ARM, especially on atomics that you might use for synchro and provide the thread safety. A bug in this domain is hard to track! EDIT: 3 days of usage with the patch, the "bug" couldn't be reproduced. So far, so good! |
|
Thanks @nmav! I think this counts as another vote for WR-only for this case. |
... as it can make it wait there for a long time for no good purpose. Patched-by: Jay Satiro Reported-by: Bylon2 on github Adviced-by: Nikos Mavrogiannopoulos Fixes #4487
I have run for two weeks with this patch of libcurl + GnuTLS: all went fine, and I noticed no adverse effects nor did I encounter long delays I had prior to the patch. All good! |
Excellent, thanks for the feedback! |
The bug does NOT occur on x86/amd64 version, but is visible on armV7 (Raspbian Buster 32 bits on Raspberry Pi 4)
I have a fuse driver using libcurl underneath : https://gitlab.com/BylonAkila/astreamfs/blob/master/1fichierfs.c
What it does is read chunks of file as instructed in background process (read-ahead stream optimised). When no read request arrives, the background processes are just sleeping inside the curl callback.
In the case a request arrives out of the currently opened streams, we pick a sleeping process, terminate the current curl operation (sending CURLE_WRITE_ERROR from the callback) and start a new curl_easy_perform with the new range.
What is expected:
So what can happen in the background:
-- callback read bytes and return them, when no more bytes are needed we sit in the callback
-- wait for some time
-- A new request arrives and we need to reuse that thread, so we send the new range to it (with semaphores), the callback then returns CURLE_WRITE_ERROR to terminate the current stream-read
What happens instead
And that is specific to armhf 32 bits apparently, is that in spite of the callback returning CURLE_WRITE_ERROR, curl_easy_perform does not return immediately to its caller. The return delay to the caller, after curl_easy_perform receives the error code from the callback can be anything from zero to 3 minutes!
Versions
So the libcurl version is 7.64.0 with GnutTLS/3.6.7
(I need reentrance since it is mutlithreaded use)
TODO
I'll try to build a smaller program to demonstrate the bug, since it is impractical with this driver because you need a 1fichier.com subscription to use it.
In the mean time... if you have an idea where that comes from...
The text was updated successfully, but these errors were encountered: