
[libcurl] Returning from a write callback with error does not immediately return from curl_easy_perform #4487

Closed
Bylon2 opened this issue Oct 14, 2019 · 30 comments
@Bylon2 Bylon2 commented Oct 14, 2019

The bug does NOT occur on x86/amd64, but is visible on ARMv7 (Raspbian Buster 32-bit on a Raspberry Pi 4).

I have a FUSE driver that uses libcurl underneath: https://gitlab.com/BylonAkila/astreamfs/blob/master/1fichierfs.c

It reads chunks of a file as instructed, in background threads (optimised for read-ahead streaming). When no read request arrives, the background threads just sleep inside the curl write callback.
When a request arrives outside the currently opened streams, we pick a sleeping thread, terminate its current curl transfer (by returning an error from the callback, so that curl_easy_perform fails with CURLE_WRITE_ERROR), and start a new curl_easy_perform with the new range.

What is expected:

Here is what happens in the background:

  • curl_easy_perform (range offset-end_of_file)
    -- the callback reads bytes and returns them; when no more bytes are needed we sit in the callback
    -- wait for some time
    -- a new request arrives and we need to reuse that thread, so we send the new range to it (with semaphores); the callback then returns an error so that the current stream read terminates with CURLE_WRITE_ERROR
  • on x86, curl_easy_perform IMMEDIATELY terminates (with error 23, CURLE_WRITE_ERROR, as expected) and returns to the caller. The caller then starts a new curl_easy_perform with the new range.

What happens instead

What happens, apparently specific to 32-bit armhf, is that in spite of the callback returning an error (leading to CURLE_WRITE_ERROR), curl_easy_perform does not return immediately to its caller. The delay between curl_easy_perform receiving the error code from the callback and its return to the caller can be anything from zero to 3 minutes!

Versions

$ uname -a
Linux raspberrypi 4.19.75-v7l+ #1270 SMP Tue Sep 24 18:51:41 BST 2019 armv7l GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Raspbian
Description:	Raspbian GNU/Linux 10 (buster)
Release:	10
Codename:	buster

$ 1fichierfs -V
1fichierfs: version 1.5.0
libcurl/7.64.0 (arm-unknown-linux-gnueabihf) GnuTLS/3.6.7
FUSE library version: 2.9.9
fusermount version: 2.9.9
using FUSE kernel interface version 7.19

So the libcurl version is 7.64.0 with GnuTLS/3.6.7.
(I need re-entrancy since this is multithreaded use.)

TODO

I'll try to build a smaller program to demonstrate the bug, since reproducing it with this driver is impractical: you need a 1fichier.com subscription to use it.

In the meantime... if you have an idea where that comes from...

@bagder bagder commented Oct 14, 2019

Before you manage to reproduce with an example, you can just put a break-point here:

failf(data, "Failed writing body (%zu != %zu)", wrote, chunklen);

And then A) issue a back trace and show that to us and B) single-step forward a bit to a point where you see it "forgetting" about the error code the callback returned.

@bagder bagder added the libcurl API label Oct 14, 2019
@Bylon2 Bylon2 commented Oct 15, 2019

Thanks for the tip, I'll try that... which means I'll first have to get the source (easy with apt) and find out how to replace the library with the locally compiled and modified one!

Another test I'll do is to try the same thing without TLS, in case the delay is caused by curl trying to clean things up with GnuTLS, with that cleanup taking some random time.

@jay jay commented Oct 15, 2019

#define STREAM_ERROR ((CURLcode)99) /* CURL errors stop at 90 */

at 94 now and will increase as more are added.

@Bylon2 Bylon2 commented Oct 15, 2019

Thanks, I'll change that... is there a

#define CURLE_MAX_ERROR_CODE 94

... or the like, so that I can set my own internal errors to a value not already used by curl?

@bagder bagder commented Oct 15, 2019

CURL_LAST (weirdly enough it isn't CURLE_LAST which would've been logical) is the number above the highest used number in the header file you build with. It doesn't really say what the highest error code might be from your installed library though!

@jay jay commented Oct 15, 2019

I wouldn't mess with that; it's a similar predicament. We could add a CURLE_USER_MASK 0x80000000 but personally I think that's just masking the smell (pun intended). Better to use your own error type.

@Bylon2 Bylon2 commented Oct 15, 2019

Back to the bug at hand!

More tests show that:

  • The "alleged bug" does NOT happen in plain http (no TLS)
  • The "alleged bug" does NOT happen when linking my program with OpenSSL instead of GnuTLS

So I am assuming this is either:

  • A bug with the current version of GnuTLS (3.6.7)
  • libcurl not using this version of GnuTLS correctly

I used to link with the GnuTLS library because I'm still running Ubuntu 16.04 on my desktop and laptop, and that release does not ship the minimum OpenSSL 1.1 needed to be thread-safe. So OpenSSL was not an option for my heavily threaded(*) use case under Ubuntu 16.04.

On the other hand, Raspbian Buster is more up to date and sports OpenSSL 1.1.1d, as shown:

$ ./1fichierfs -V
1fichierfs: version 1.5.0
libcurl/7.64.0 (arm-unknown-linux-gnueabihf) OpenSSL/1.1.1d
FUSE library version: 2.9.9
fusermount version: 2.9.9
using FUSE kernel interface version 7.19

So as far as this "alleged bug"(**) is concerned, since I have a perfectly fine workaround (using OpenSSL!), you can mark it as "low priority".

Unless it concerns others who really need GnuTLS on ARM, this will have a lower priority in my TODO list of things I still have to work on to properly finish configuring my Raspberry Pi 4!
For those using GnuTLS for the same reason I did, I hope this can help to work around the issue!

==> Please can you mark as "Low Priority"

(*) "heavily threaded" (program at rest)

$ ps huH p 4225
pi        4225  0.0  0.1  97776  5964 ?        Ssl  18:43   0:00 dev/astreamfs/1fichierfs 
pi        4225  0.0  0.1  97776  5964 ?        Ssl  18:43   0:00 dev/astreamfs/1fichierfs 
pi        4225  0.0  0.1  97776  5964 ?        Ssl  18:43   0:00 dev/astreamfs/1fichierfs 
pi        4225  0.0  0.1  97776  5964 ?        Ssl  18:43   0:00 dev/astreamfs/1fichierfs 
pi        4225  0.2  0.1  97776  5964 ?        Ssl  18:43   0:04 dev/astreamfs/1fichierfs 
pi        4225  0.3  0.1  97776  5964 ?        Ssl  18:43   0:05 dev/astreamfs/1fichierfs 
pi        4225  0.0  0.1  97776  5964 ?        Ssl  18:43   0:00 dev/astreamfs/1fichierfs 
pi        4225  0.0  0.1  97776  5964 ?        Ssl  18:43   0:00 dev/astreamfs/1fichierfs 
pi        4225  0.0  0.1  97776  5964 ?        Ssl  18:43   0:00 dev/astreamfs/1fichierfs 

(**) "alleged" but it can be very easily reproduced with my program... if you have a 1fichier.com subscription!

Many thanks for your responsiveness and quick replies!

@Bylon2 Bylon2 commented Oct 15, 2019

Before you manage to reproduce with an example, you can just put a break-point here:

failf(data, "Failed writing body (%zu != %zu)", wrote, chunklen);

And then A) issue a back trace and show that to us and B) single-step forward a bit to a point where you see it "forgetting" about the error code the callback returned.

By the way, "putting a breakpoint" is somewhat impractical in a heavily threaded program with several curl_easy_perform calls happening in the background. So what I would have to do in this case is add some traces (printf-like). A little more effort than a simple "break-point + trace", unfortunately.

For instance, wonderful tools such as Nemiver are of no use at all except for debugging the initialisation phase. Even if you could make it work with a daemon (not sure it is possible!), it would disrupt the inter-thread timings so much that you could very well not observe some bugs.

@bagder bagder commented Oct 15, 2019

If the problem is in curl, then a break-point would still be valuable no matter how many other threads run because they're not the ones we care about.

If the issue is indeed within GnuTLS, then a break-point won't help much. Then the question I have is: what is curl/GnuTLS doing all those minutes from when the error is returned from the callback until you get the libcurl function's error code returned? Is it stuck in a function call? If so, which one and why? If not, then I don't think it is GnuTLS's fault and we're back on the first paragraph again...

@Bylon2 Bylon2 commented Oct 15, 2019

@bagder, in fact the program has (by default) 4 readers. When the bug occurs, all four readers have been used at least once (round robin), and we try to reuse an "idle" reader: one with no requests for more than 45 seconds, but whose curl callback is still active. The break-point would then be common to all readers and would trigger each time a transfer finishes on any reader, because of course the code is not duplicated for each thread. ;-)

It might help, but it will most probably disrupt the whole thread scheduling and possibly make the bug vanish, because each time it breaks you would have to step through in case it is "the bug"... which gives time for the faulty GnuTLS to finish its background stuff.

Another, bigger issue (for me) is that I am not so comfortable with gdb... especially on a multi-threaded daemon. Most probably, to be able to stop at breakpoints, gdb has to add its own locking, which could disrupt many things! And its easier counterpart (Nemiver) probably does not work in this case.

So the most realistic way to debug is indeed adding some kind of "printf". That's what I actually did in the current program: a "printf" just before returning from the callback, and another one just after curl_easy_perform exits. In my trace, I see that those "printf"s are several seconds (up to several minutes) apart when the bug occurs.

A working case (when there is no bug) looks like this:

[1fichierfs  6836.325] DEBUG: >> unfichier_read: 0x7f3e3c01ed30 (953487360,102400) rq[521]--------
[1fichierfs  6836.325] DEBUG: dump: readers[521*0] fs=0x7f3e3c01ed30 7110656:7127040 (idle=0) (n_rq=0)
[1fichierfs  6836.325] DEBUG: dump: readers[521|1] fs=0x7f3e3c01ed30 758022144:758038528 (idle=0) (n_rq=0)
[1fichierfs  6836.325] DEBUG: dump: readers[521|2] fs=0x7f3e3c01ed30 1750368256:1750384640 (idle=0) (n_rq=0)
[1fichierfs  6836.325] DEBUG: dump: readers[521|3] fs=0x7f3e3c01ed30 2663206912:2663223296 (idle=0) (n_rq=0)
[1fichierfs  6836.325] DEBUG: write_data[0] Dump: 0x7f3e417fa6b8->(nil)  953487360:0-102400
[1fichierfs  6836.337] DEBUG: curl_easy_perform exited with 23
[1fichierfs  6836.337] DEBUG: async_reader[0] range is 953356288-
[1fichierfs  6836.337] INFO: curling https://a-2.1fichier.com/p445931726 (URL=https://1fichier.com/?7b5qwm8bw5mhsj6eqr3t)

Which means:

  • read at position 953,487,360 received from fuse.
  • all threads have been used, the read process selects reader 0 that is not idle but has no pending request (best choice in round robin algo).
  • write_data (the callback) receives this order, but since reader 0 is at position 7,110,656, it is way off range; the callback then returns with an error to end the current curl_easy_perform.
  • curl_easy_perform ends with error 23 (as expected)... which here took 12 milliseconds, at the precision of the coarse clock which is 4 ms (I call that "immediate"!)
  • then the async reader (0 since we are reader 0) now issues a new curl_easy_perform in range with what is requested.

For debugging purposes I added a "trace" just before the return from the callback. It is not shown here because this is the "packaged" driver, but that trace showed that we indeed returned from the callback and then had to wait several seconds to see the corresponding 'curl_easy_perform exited with 23'.
One could argue that the thread might have been suspended between curl_easy_perform and the "printf" that is just the next instruction, or between the "printf" in the callback and the return... but I frankly don't believe that under normal conditions a thread could be suspended for seconds, let alone minutes, in such a straightforward part of the program with no locks/semaphores!

@Bylon2 Bylon2 commented Oct 16, 2019

I confirm gdb completely breaks the program, probably due to the locking it adds. Nevertheless I could try to provide a trace by attaching at the moment the bug is visible... but for now I fail to see any curl symbols.

Here is what I have so far (working case):

(gdb) bt
#0  futex_abstimed_wait_cancelable (private=0, abstime=0xb4e997f8, expected=1, futex_word=0x30788 <readers+48>) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x30788 <readers+48>, abstime=abstime@entry=0xb4e997f8) at sem_waitcommon.c:115
#2  0xb6e1e5c4 in __new_sem_wait_slow (sem=0x30788 <readers+48>, abstime=0xb4e997f8) at sem_waitcommon.c:282
#3  0x00011ba8 in get_work ()
#4  0x00011cd4 in write_data ()
#5  0xb6e4d604 in ?? () from /usr/lib/arm-linux-gnueabihf/libcurl.so.4
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

So we correctly see my own symbols: write_data (the curl callback), but the trace says it is called from "??" in libcurl!

Is there a method to display more useful debugging symbols from the libcurl stack?

[EDIT] After some searching on the web, it seems that unlike Debian, Raspbian provides only "stripped" packages with no debug symbols. So the alternative seems to be: rebuild the package with the "nostrip" option, so that we get symbols. I'll try that later!

@Bylon2 Bylon2 commented Oct 20, 2019

I managed to get a backtrace using the documentation at this page: https://wiki.debian.org/HowToGetABacktrace

Unfortunately, even after rebuilding curl (which took forever, twice!) with the nostrip option, I still get a trace with NO libcurl symbols, same as in the previous post.

So if someone has a valid pointer on how to get those libcurl symbols, I'll try again, otherwise I guess the bug will stay, sorry!

@bagder bagder commented Oct 21, 2019

If you don't get the symbols then it sounds like you're not using your newly built version?

@Bylon2 Bylon2 commented Oct 21, 2019

Yes I did, to the limit of my skills, which go as far as, after rebuilding:

sudo apt purge libcurl4-openssl-dev

To remove the version installed from the repo.
Did a "make" to see that it didn't link (since no other libcurl4-xxx-dev is installed)

Then did:
sudo dpkg -i /path/to/build-area/libcurl4-gnutls-dev

Compiled successfully, reproduced the bug (it is not systematic but easy to reproduce), and failed to get a relevant trace!

I have another thing to try: the Debian wiki page I linked recommends building with:
DEB_BUILD_OPTIONS="nostrip noopt" apt -b source libcurl4-gnutls-dev

I saw on another page that you can add the undocumented debug option to the packaging.

Also I am wondering, and you might have the answer to that as "curl specialists".

When I issued the build command above, it not only built libcurl4-gnutls-dev, but all flavours of libcurl4-xxx-dev, and also curl itself.

My question would be: indeed, the "-dev" versions of the libraries are needed when one writes a program requiring libcurl. They provide, for example, the C headers (curl.h and others), the curl-config utility, etc.

But once the program has been dynamically linked, I assume it is not using the "-dev" libraries when it runs, otherwise anyone wanting to run such a program would have to install the "-dev" package (sure, we could put a "Depends" on the .deb). So maybe the error is there: building the "-dev" packages didn't rebuild the runtime libraries that are used once the program has been compiled.

Any advice on that?
What other packages do I need to rebuild on top of libcurl4-gnutls-dev (which builds much more, in fact)?
Or maybe the build is enough (since it builds everything!) but I also need to "dpkg -i curl" (I didn't check whether it includes some dynamic libraries in its contents).

So in short, putting up a little document about "how to backtrace curl/libcurl on Linux" would be nice. Obviously documentation for Debian can be adapted to Ubuntu (or conversely), and those using Fedora can easily convert apt to dnf!

@bagder bagder commented Oct 21, 2019

The "dev versions" you speak of I think are Debian/Ubuntu packages and they're designed, made and maintained by maintainers in those distros. Not by anyone in the curl project. I only have very rudimentary knowledge of how they work.

What other packages do I need to rebuild on top of libcurl4-gnutls-dev

I wouldn't and I never involve that at all. I build curl and libcurl from source, install that and then make my app use that custom install. Done.

To "backtrace" on linux, you run a program with gdb, set a break-point with b [function name] and then when it stops at the break-point you invoke the bt command. Or did you mean something else?

@Bylon2 Bylon2 commented Oct 21, 2019

Many thanks for your help.

So we might need the lights of Debian packagers!

I'll try to guess which library I must reinstall after the build; obviously the "-dev" libraries don't help. Also I didn't manage to link statically with them... possibly because they are not the right ones!

As for the backtrace, the page I linked is very nice and sufficient. In my case, since gdb completely blocks the multi-threaded process, it is better to do a "batch trace": a single command that gives a trace and exits.

So I launch the program, look at its PID and do

gdb -batch -ex 'thread apply all bt full' -p PID >/tmp/backtrace.txt

Then I have the backtrace, and the program resumes because gdb exits after the "batch trace".

@bagder bagder commented Oct 21, 2019

So we might need the lights of Debian packagers!

I don't think we/you do! Forget about the debian package for a second. Build libcurl from source, install it with 'make install' in a custom place (using configure's --prefix option), make your application use that custom build. Then you can keep your "regular" curl package in its normal place and the experimenting won't affect anything else...

@Bylon2 Bylon2 commented Oct 22, 2019

Indeed that is an elegant solution... if I can manage to link statically, which is not the default. So far it has failed, but I keep hope! Another solution could be to temporarily change the dynamic library path... I'll have to look into that too.

@Bylon2 Bylon2 commented Oct 22, 2019

So here is the best I can do.

When an "asynchronous reader" is waiting for work to do (on a semaphore), here is the stack:

Thread 4 (Thread 0xb50fe210 (LWP 4465)):
#0  0xb6ddf088 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=1, futex_word=0x378b0 <readers+48>) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x378b0 <readers+48>, abstime=0x0) at sem_waitcommon.c:115
#2  0xb6ddf1f4 in __new_sem_wait_slow (sem=0x378b0 <readers+48>, abstime=0x0) at sem_waitcommon.c:282
#3  0x0001834c in get_work (r=0x37880 <readers>) at 1fichierfs.c:2706
#4  0x00019f80 in async_reader (arg=0x37880 <readers>) at 1fichierfs.c:3062
#5  0xb6dd5494 in start_thread (arg=0xb50fe210) at pthread_create.c:486
#6  0xb6d58578 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6

Quite straightforward: pthread_create launched async_reader(), which called get_work(), which waits on a semaphore.

When work arrives, here is the normal situation:

Thread 4 (Thread 0xb50fe210 (LWP 4465)):
#0  futex_abstimed_wait_cancelable (private=0, abstime=0xb50dd78c, expected=1, futex_word=0x378b0 <readers+48>) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x378b0 <readers+48>, abstime=abstime@entry=0xb50dd78c) at sem_waitcommon.c:115
#2  0xb6ddf5c4 in __new_sem_wait_slow (sem=0x378b0 <readers+48>, abstime=0xb50dd78c) at sem_waitcommon.c:282
#3  0x000183d4 in get_work (r=0x37880 <readers>) at 1fichierfs.c:2713
#4  0x000188a0 in write_data (ptr=0xb2d01608 "\002\261\253R\324\360\203\026\254\252\262\202\005\263\273\224\263\203hZ\327\267\215\367\201\004\361\205\002\261\370\023 \360\201\005\273\232\263\203h`5\267\223\367\201\003\361\205\002\262\017C\345\360\203\n\"\372\262\202\016V\273\232\263\203hv\207\267\223\367\201\003\361\205\002\262\234f\205\360\203\005\027E\262\202\023\017\273\224\263\203h\201\361\267\215\367\201\004\361\205\002\262\346\302\225\360\201\005\273\232\263\203h\216z\267\223\367\201\003\361\205\002\263-B\335\360\203\005\017\255\262\202\005\212\273\232\263\203h\232^\267\223\367\201\003\361\205\002\263t\213P\360\203\005/\355\262\202\bͻ\224\263\203h\251\v\267\215\367\201\004\361\205\002\263҇\207\360\201\005\273\224\263\203h\262ѷ"..., size=1, nmemb=11051, r=0x37880 <readers>) at astream_engine.c:63
#5  0xb6e1723c in chop_write (conn=0xb2de47a8, type=1, optr=0xb2d01608 "\002\261\253R\324\360\203\026\254\252\262\202\005\263\273\224\263\203hZ\327\267\215\367\201\004\361\205\002\261\370\023 \360\201\005\273\232\263\203h`5\267\223\367\201\003\361\205\002\262\017C\345\360\203\n\"\372\262\202\016V\273\232\263\203hv\207\267\223\367\201\003\361\205\002\262\234f\205\360\203\005\027E\262\202\023\017\273\224\263\203h\201\361\267\215\367\201\004\361\205\002\262\346\302\225\360\201\005\273\232\263\203h\216z\267\223\367\201\003\361\205\002\263-B\335\360\203\005\017\255\262\202\005\212\273\232\263\203h\232^\267\223\367\201\003\361\205\002\263t\213P\360\203\005/\355\262\202\bͻ\224\263\203h\251\v\267\215\367\201\004\361\205\002\263҇\207\360\201\005\273\224\263\203h\262ѷ"..., olen=11051) at sendf.c:596
#6  0xb6e17478 in Curl_client_write (conn=0xb2de47a8, type=1, ptr=0xb2d01608 "\002\261\253R\324\360\203\026\254\252\262\202\005\263\273\224\263\203hZ\327\267\215\367\201\004\361\205\002\261\370\023 \360\201\005\273\232\263\203h`5\267\223\367\201\003\361\205\002\262\017C\345\360\203\n\"\372\262\202\016V\273\232\263\203hv\207\267\223\367\201\003\361\205\002\262\234f\205\360\203\005\027E\262\202\023\017\273\224\263\203h\201\361\267\215\367\201\004\361\205\002\262\346\302\225\360\201\005\273\232\263\203h\216z\267\223\367\201\003\361\205\002\263-B\335\360\203\005\017\255\262\202\005\212\273\232\263\203h\232^\267\223\367\201\003\361\205\002\263t\213P\360\203\005/\355\262\202\bͻ\224\263\203h\251\v\267\215\367\201\004\361\205\002\263҇\207\360\201\005\273\224\263\203h\262ѷ"..., len=11051) at sendf.c:679
#7  0xb6e37e2c in readwrite_data (data=0xb2d005b8, conn=0xb2de47a8, k=0xb2d00650, didwhat=0xb50dd9a8, done=0xb50dda1f, comeback=0xb50dda20) at transfer.c:934
#8  0xb6e38978 in Curl_readwrite (conn=0xb2de47a8, data=0xb2d005b8, done=0xb50dda1f, comeback=0xb50dda20) at transfer.c:1294
#9  0xb6e49fa8 in multi_runsingle (multi=0xb225a1e0, now=..., data=0xb2d005b8) at multi.c:1932
#10 0xb6e4a820 in curl_multi_perform (multi=0xb225a1e0, running_handles=0xb50ddae0) at multi.c:2217
#11 0xb6e3ac1c in easy_transfer (multi=0xb225a1e0) at easy.c:686
#12 0xb6e3ae48 in easy_perform (data=0xb2d005b8, events=false) at easy.c:780
#13 0xb6e3ae94 in curl_easy_perform (data=0xb2d005b8) at easy.c:799
#14 0x00019e28 in stream_fs (curl=0xb2d005b8, s=0xb2f007e0, callback=0x18724 <write_data>, userdata=0x37880 <readers>) at 1fichierfs.c:3014
#15 0x0001a37c in async_reader (arg=0x37880 <readers>) at 1fichierfs.c:3113
#16 0xb6dd5494 in start_thread (arg=0xb50fe210) at pthread_create.c:486
#17 0xb6d58578 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6

So we see that async_reader() has now called stream_fs(), which called curl_easy_perform(), which after some stack frames calls my callback, cleverly named write_data().
It in turn is now calling get_work(), because the callback has done all the work it was asked to do.

We have 4 identical readers doing the same stuff.

With GnuTLS, when the bug happens, here is what we have on some of the threads:

Thread 7 (Thread 0xb38fb210 (LWP 4468)):
#0  0xb6cf2e68 in _int_malloc (av=av@entry=0xb0400010, bytes=bytes@entry=16481) at malloc.c:3931
#1  0xb6cf4318 in __GI___libc_malloc (bytes=16481) at malloc.c:3057
#2  0xb69d962c in ?? () from /usr/lib/arm-linux-gnueabihf/libgnutls.so.30
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 6 (Thread 0xb40fc210 (LWP 4467)):
#0  __libc_recv (flags=0, len=5, buf=0xb01f4c83, fd=<optimized out>) at ../sysdeps/unix/sysv/linux/recv.c:26
#1  __libc_recv (fd=<optimized out>, buf=0xb01f4c83, len=5, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:23
#2  0xb6eab644 in Curl_gtls_pull (s=0xb07df678, buf=0xb01f4c83, len=5) at vtls/gtls.c:171
#3  0xb69db164 in ?? () from /usr/lib/arm-linux-gnueabihf/libgnutls.so.30
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 5 (Thread 0xb48fd210 (LWP 4466)):
#0  0xb6b5720c in _nettle_aes_encrypt () from /usr/lib/arm-linux-gnueabihf/libnettle.so.6
Backtrace stopped: Cannot access memory at address 0x237cae80

Thread 4 (Thread 0xb50fe210 (LWP 4465)):
#0  0xb6ddf088 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=1, futex_word=0x378b0 <readers+48>) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x378b0 <readers+48>, abstime=0x0) at sem_waitcommon.c:115
#2  0xb6ddf1f4 in __new_sem_wait_slow (sem=0x378b0 <readers+48>, abstime=0x0) at sem_waitcommon.c:282
#3  0x0001834c in get_work (r=0x37880 <readers>) at 1fichierfs.c:2706
#4  0x00019f80 in async_reader (arg=0x37880 <readers>) at 1fichierfs.c:3062
#5  0xb6dd5494 in start_thread (arg=0xb50fe210) at pthread_create.c:486
#6  0xb6d58578 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6

Now our "thread 4" is back to the "idle" situation, waiting for some work to do. The other 3 readers are blocked in some obscure libgnutls or libnettle function. One symbol appears though: Curl_gtls_pull, which seems to be called from somewhere inside GnuTLS.

Since I don't have the symbols of these libraries either, the stack is not correctly shown, although after some point it comes back to the "normal" situation above, with curl_easy_perform() showing in the middle of my own calls.

So my guess is that when we return an error from the callback, libcurl does some freeing or cleanup ("close", freeing sockets, etc.) with GnuTLS, and that does not work correctly, or at least is not immediate. Looking at the symbols, curl seems to have GnuTLS-specific code (Curl_gtls_pull), and the "bug" could possibly be there too.

Does this help a bit, or is it even more confusing?

@Bylon2 Bylon2 commented Oct 23, 2019

So here is a cleaner trace of what happens when we are "blocked":

Time is 21:58:59

$ grep -A30 "Thread 6" /tmp/trace21\:58\:59.txt 
Thread 6 (Thread 0xb3ffc210 (LWP 1539)):
#0  0xb6b1b060 in gcm_gf_add (r=0xb3fdb1fc, x=0xb3fdb1fc, y=0xb0d23148) at gcm.c:61
#1  0xb6b1b3d0 in gcm_gf_mul (x=0xb0d23da8, table=0xb0d22d88) at gcm.c:315
#2  0xb6b1b5c4 in gcm_hash (key=0xb0d22d88, x=0xb0d23da8, length=1216, data=0xb0a075d0 "\224\232x\311\343\333N\330`\355o\216\311\311\370\b\253\022fCn\v\364\202{\236\334\032\062)@7\205\n\032\241:4\323+1\327\361{\301\204\062\315\b\234\016\362\v\333\321\037LL\317y7s\177\264}\374\356\343l\227\352\063\033}@R{)\300\177],\006i\274\361\214\374\031\300Wyc8\237\035پo\352A\242'q\"-\370\220.\fu\317\061_ӭs\326$7\001\005\222", <incomplete sequence \330>) at gcm.c:372
#3  0xb6b1bf5c in nettle_gcm_decrypt (ctx=0xb0d23d88, key=0xb0d22d88, cipher=0xb0d23dc8, f=0xb6b01544 <nettle_aes128_encrypt>, length=16384, dst=0x9caccce0 "", src=0xb0a03a90 "") at gcm.c:494
#4  0xb6a8e8d4 in _gcm_decrypt (ctx=0xb0d22d38, length=16384, dst=0x9caccce0 "", src=0xb0a03a90 "") at ../../../lib/nettle/cipher.c:250
#5  0xb6a8f630 in wrap_nettle_cipher_aead_decrypt (_ctx=0xb0d22d38, nonce=0xb3fdb5ac, nonce_size=12, auth=0xb3fdb5bc, auth_size=13, tag_size=16, encr=0xb0a03a90, encr_size=16384, plain=0x9caccce0, plain_size=16703) at ../../../lib/nettle/cipher.c:875
#6  0xb68e95cc in _gnutls_aead_cipher_decrypt (handle=0xb0d8db68, nonce=0xb3fdb5ac, nonce_len=12, auth=0xb3fdb5bc, auth_len=13, tag=16, ciphertext=0xb0a03a90, ciphertextlen=16400, text=0x9caccce0, textlen=16703) at ../../lib/cipher_int.h:140
#7  0xb68ebec0 in decrypt_packet (session=0xb0f59c88, ciphertext=0xb3fdb794, plain=0xb3fdb79c, type=GNUTLS_APPLICATION_DATA, params=0xb0d8dac0, sequence=0xb0d8dbbc) at ../../lib/cipher.c:651
#8  0xb68e9e58 in _gnutls_decrypt (session=0xb0f59c88, ciphertext=0xb3fdb794, output=0xb3fdb79c, type=0xb3fdb7b4, params=0xb0d8dac0, sequence=0xb0d8dbbc) at ../../lib/cipher.c:156
#9  0xb68e637c in _gnutls_recv_in_buffers (session=0xb0f59c88, type=GNUTLS_ALERT, htype=4294967295, ms=0) at ../../lib/record.c:1374
#10 0xb68e7a28 in _gnutls_recv_int (session=0xb0f59c88, type=GNUTLS_ALERT, data=0x0, data_size=0, seq=0x0, ms=0) at ../../lib/record.c:1773
#11 0xb68e2458 in gnutls_bye (session=0xb0f59c88, how=GNUTLS_SHUT_RDWR) at ../../lib/record.c:313
#12 0xb6e7e994 in close_one (connssl=0xb0d3f00c) at vtls/gtls.c:1610
#13 0xb6e7ea68 in Curl_gtls_close (conn=0xb0d3ee50, sockindex=0) at vtls/gtls.c:1628
#14 0xb6e80bbc in Curl_ssl_close (conn=0xb0d3ee50, sockindex=0) at vtls/vtls.c:555
#15 0xb6df1390 in Curl_disconnect (data=0xb15007a0, conn=0xb0d3ee50, dead_connection=true) at url.c:802
#16 0xb6e16490 in multi_done (data=0xb15007a0, status=CURLE_WRITE_ERROR, premature=true) at multi.c:620
#17 0xb6e191a0 in multi_runsingle (multi=0xb15057e8, now=..., data=0xb15007a0) at multi.c:1996
#18 0xb6e19820 in curl_multi_perform (multi=0xb15057e8, running_handles=0xb3fdbae0) at multi.c:2217
#19 0xb6e09c1c in easy_transfer (multi=0xb15057e8) at easy.c:686
#20 0xb6e09e48 in easy_perform (data=0xb15007a0, events=false) at easy.c:780
#21 0xb6e09e94 in curl_easy_perform (data=0xb15007a0) at easy.c:799
#22 0x00019e28 in stream_fs (curl=0xb15007a0, s=0xb5822e18, callback=0x18724 <write_data>, userdata=0x37920 <readers+160>) at 1fichierfs.c:3014
#23 0x0001a37c in async_reader (arg=0x37920 <readers+160>) at 1fichierfs.c:3113
#24 0xb6da4494 in start_thread (arg=0xb3ffc210) at pthread_create.c:486
Backtrace stopped: Cannot access memory at address 0x4

Time is 21:59:07

$ grep -A30 "Thread 6" /tmp/trace21\:59\:07.txt 
Thread 6 (Thread 0xb3ffc210 (LWP 1539)):
#0  decrypt_packet (session=0xb0f59c88, ciphertext=0xb3fdb794, plain=0xb3fdb79c, type=GNUTLS_APPLICATION_DATA, params=0xb0d8dac0, sequence=0xb0d8dbbc) at ../../lib/cipher.c:660
#1  0xb68e9e58 in _gnutls_decrypt (session=0xb0f59c88, ciphertext=0xb3fdb794, output=0xb3fdb79c, type=0xb3fdb7b4, params=0xb0d8dac0, sequence=0xb0d8dbbc) at ../../lib/cipher.c:156
#2  0xb68e637c in _gnutls_recv_in_buffers (session=0xb0f59c88, type=GNUTLS_ALERT, htype=4294967295, ms=0) at ../../lib/record.c:1374
#3  0xb68e7a28 in _gnutls_recv_int (session=0xb0f59c88, type=GNUTLS_ALERT, data=0x0, data_size=0, seq=0x0, ms=0) at ../../lib/record.c:1773
#4  0xb68e2458 in gnutls_bye (session=0xb0f59c88, how=GNUTLS_SHUT_RDWR) at ../../lib/record.c:313
#5  0xb6e7e994 in close_one (connssl=0xb0d3f00c) at vtls/gtls.c:1610
#6  0xb6e7ea68 in Curl_gtls_close (conn=0xb0d3ee50, sockindex=0) at vtls/gtls.c:1628
#7  0xb6e80bbc in Curl_ssl_close (conn=0xb0d3ee50, sockindex=0) at vtls/vtls.c:555
#8  0xb6df1390 in Curl_disconnect (data=0xb15007a0, conn=0xb0d3ee50, dead_connection=true) at url.c:802
#9  0xb6e16490 in multi_done (data=0xb15007a0, status=CURLE_WRITE_ERROR, premature=true) at multi.c:620
#10 0xb6e191a0 in multi_runsingle (multi=0xb15057e8, now=..., data=0xb15007a0) at multi.c:1996
#11 0xb6e19820 in curl_multi_perform (multi=0xb15057e8, running_handles=0xb3fdbae0) at multi.c:2217
#12 0xb6e09c1c in easy_transfer (multi=0xb15057e8) at easy.c:686
#13 0xb6e09e48 in easy_perform (data=0xb15007a0, events=false) at easy.c:780
#14 0xb6e09e94 in curl_easy_perform (data=0xb15007a0) at easy.c:799
#15 0x00019e28 in stream_fs (curl=0xb15007a0, s=0xb5822e18, callback=0x18724 <write_data>, userdata=0x37920 <readers+160>) at 1fichierfs.c:3014
#16 0x0001a37c in async_reader (arg=0x37920 <readers+160>) at 1fichierfs.c:3113
#17 0xb6da4494 in start_thread (arg=0xb3ffc210) at pthread_create.c:486
#18 0xb6d27578 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Time is 21:59:14

$ grep -A30 "Thread 6" /tmp/trace21\:59\:14.txt 
Thread 6 (Thread 0xb3ffc210 (LWP 1539)):
#0  gcm_gf_add (r=0xb3fdb1fc, x=0xb3fdb1fc, y=0xb0d23678) at gcm.c:61
#1  0xb6b1b3d0 in gcm_gf_mul (x=0xb0d23da8, table=0xb0d22d88) at gcm.c:315
#2  0xb6b1b5c4 in gcm_hash (key=0xb0d22d88, x=0xb0d23da8, length=432, data=0xb0a078e0 "\263S\202\340L\342h\207{\177\061\242\225ފ\220\063\340\004tXw&\365\062\026\205\252o\002\276\235z\340uF\323\306y\235\001\306\003\027\030\335c4\005\026\325L=\326Q&\340\241\322n\004\256-r\262\345\253s\vK(óNq \361\034C\250\"\273\031.\252\366\327\322l\370J\350\036%\250\225^\215\372MI\027\341\217F\030\317`\253\211\304w\343\320\004\177M\006!-\025!\366\344\256i\202+\367p\367\340\225\003b\205\235\277\a\305jyr\300}B\355n\246\363øW:P\252v\231+\037X\200\215\004w\372Q\b\331L\243\271\360d\\\355\347\277\016\346&f\251\tY\267IF\323\364g\253t?劓Ri/"...) at gcm.c:372
#3  0xb6b1bf5c in nettle_gcm_decrypt (ctx=0xb0d23d88, key=0xb0d22d88, cipher=0xb0d23dc8, f=0xb6b01544 <nettle_aes128_encrypt>, length=16384, dst=0x92151ef0 "", src=0xb0a03a90 "\200\233\221\002/$\320&\305U\334Z!\363\325\376\032\314\311\036\257\206\217\062\312\330O\"\256e\203\363\024\070\365%\305(\315^\352q") at gcm.c:494
#4  0xb6a8e8d4 in _gcm_decrypt (ctx=0xb0d22d38, length=16384, dst=0x92151ef0 "", src=0xb0a03a90 "\200\233\221\002/$\320&\305U\334Z!\363\325\376\032\314\311\036\257\206\217\062\312\330O\"\256e\203\363\024\070\365%\305(\315^\352q") at ../../../lib/nettle/cipher.c:250
#5  0xb6a8f630 in wrap_nettle_cipher_aead_decrypt (_ctx=0xb0d22d38, nonce=0xb3fdb5ac, nonce_size=12, auth=0xb3fdb5bc, auth_size=13, tag_size=16, encr=0xb0a03a90, encr_size=16384, plain=0x92151ef0, plain_size=16703) at ../../../lib/nettle/cipher.c:875
#6  0xb68e95cc in _gnutls_aead_cipher_decrypt (handle=0xb0d8db68, nonce=0xb3fdb5ac, nonce_len=12, auth=0xb3fdb5bc, auth_len=13, tag=16, ciphertext=0xb0a03a90, ciphertextlen=16400, text=0x92151ef0, textlen=16703) at ../../lib/cipher_int.h:140
#7  0xb68ebec0 in decrypt_packet (session=0xb0f59c88, ciphertext=0xb3fdb794, plain=0xb3fdb79c, type=GNUTLS_APPLICATION_DATA, params=0xb0d8dac0, sequence=0xb0d8dbbc) at ../../lib/cipher.c:651
#8  0xb68e9e58 in _gnutls_decrypt (session=0xb0f59c88, ciphertext=0xb3fdb794, output=0xb3fdb79c, type=0xb3fdb7b4, params=0xb0d8dac0, sequence=0xb0d8dbbc) at ../../lib/cipher.c:156
#9  0xb68e637c in _gnutls_recv_in_buffers (session=0xb0f59c88, type=GNUTLS_ALERT, htype=4294967295, ms=0) at ../../lib/record.c:1374
#10 0xb68e7a28 in _gnutls_recv_int (session=0xb0f59c88, type=GNUTLS_ALERT, data=0x0, data_size=0, seq=0x0, ms=0) at ../../lib/record.c:1773
#11 0xb68e2458 in gnutls_bye (session=0xb0f59c88, how=GNUTLS_SHUT_RDWR) at ../../lib/record.c:313
#12 0xb6e7e994 in close_one (connssl=0xb0d3f00c) at vtls/gtls.c:1610
#13 0xb6e7ea68 in Curl_gtls_close (conn=0xb0d3ee50, sockindex=0) at vtls/gtls.c:1628
#14 0xb6e80bbc in Curl_ssl_close (conn=0xb0d3ee50, sockindex=0) at vtls/vtls.c:555
#15 0xb6df1390 in Curl_disconnect (data=0xb15007a0, conn=0xb0d3ee50, dead_connection=true) at url.c:802
#16 0xb6e16490 in multi_done (data=0xb15007a0, status=CURLE_WRITE_ERROR, premature=true) at multi.c:620
#17 0xb6e191a0 in multi_runsingle (multi=0xb15057e8, now=..., data=0xb15007a0) at multi.c:1996
#18 0xb6e19820 in curl_multi_perform (multi=0xb15057e8, running_handles=0xb3fdbae0) at multi.c:2217
#19 0xb6e09c1c in easy_transfer (multi=0xb15057e8) at easy.c:686
#20 0xb6e09e48 in easy_perform (data=0xb15007a0, events=false) at easy.c:780
#21 0xb6e09e94 in curl_easy_perform (data=0xb15007a0) at easy.c:799
#22 0x00019e28 in stream_fs (curl=0xb15007a0, s=0xb5822e18, callback=0x18724 <write_data>, userdata=0x37920 <readers+160>) at 1fichierfs.c:3014
#23 0x0001a37c in async_reader (arg=0x37920 <readers+160>) at 1fichierfs.c:3113
#24 0xb6da4494 in start_thread (arg=0xb3ffc210) at pthread_create.c:486
#25 0xb6d27578 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

That means that we have been waiting more than 15 seconds on the following stack:
multi_done (with status CURLE_WRITE_ERROR as expected)
Curl_disconnect
Curl_ssl_close
Curl_gtls_close
close_one
gnutls_bye

gnutls_bye seems to be doing various things, but it takes a very long time.
15 seconds is actually not the worst I have had, I saw more than 2 minutes once!

I guess that's the best I can do for now.

I leave the hand to you specialists of curl/gnutls.
It does NOT happen when my program accesses a non-SSL resource (http instead of https); in that case gnutls is not involved and multi_done/Curl_disconnect return without delay.
It does NOT happen either with curl/openssl flavor, without changing anything in my program!

The only thing I can do if that helps is a trace with "full" (including parameters) but that is extremely verbose!

@Bylon2 Bylon2 commented Oct 23, 2019

[Linux Documentation: HOW-TO get a backtrace]

(Note: this is NOT part of the bug report. But since documentation on getting such a "backtrace" was hard for me to find, I wrote it down in case it helps someone else. Moderator, feel free to move this comment to a place where people searching for it would find it... or to remove it if you feel it is irrelevant.)

If your distribution doesn't provide ready-to-use packages with "debug symbols" (as Ubuntu does, for example), you will have to do that yourself. The cleanest way I have made it work is:

  • rebuild the concerned packages following the link I posted above: https://wiki.debian.org/HowToGetABacktrace (one command per package... but it can take some time!)
  • you don't need to install any of the "deb" packages you just built; just grab the dynamic libraries with "debug symbols" you built and put them in /usr/local/lib/arm-linux-gnueabihf (or whatever matches your architecture)
  • then issue:
    sudo ldconfig -v

As usual with Linux, libraries in /usr/local/lib/arch take precedence over those in /usr/lib/arch, where the standard libraries are stored.
This will do the trick and the dynamic linker will get your "debug" libraries.

To know what libraries you need, you can launch your program under strace, tracing file accesses, since the first thing Linux will do is fetch the dynamic libraries.
Example:

$ strace -e trace=%file ~/dev/astreamfs/1fichierfs-gnutls -o ipv4,api-key=@/home/pi/.1fichier.key,refresh-file=\!000_refresh.txt,refresh-time=20,log-level=7,log-file=/tmp/debug.txt,stat-file=.stats,no-ssl=/.crypt,no-ssl=/Divers/Test/,uid=1000,gid=1000 /tmp/1fichier
execve("/home/pi/dev/astreamfs/1fichierfs-gnutls", ["/home/pi/dev/astreamfs/1fichierf"..., "-o", "ipv4,api-key=@/home/pi/.1fichier"..., "/tmp/1fichier"], 0xbee2d584 /* 25 vars */) = 0
access("/etc/ld.so.preload", R_OK)      = 0
openat(AT_FDCWD, "/etc/ld.so.preload", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
readlink("/proc/self/exe", "/home/pi/dev/astreamfs/1fichierf"..., 4096) = 40
openat(AT_FDCWD, "/usr/lib/arm-linux-gnueabihf/libarmmem-v7l.so", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/arm-linux-gnueabihf/libfuse.so.2", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/arm-linux-gnueabihf/libatomic.so.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/arm-linux-gnueabihf/libcurl-gnutls.so.4", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/arm-linux-gnueabihf/libpthread.so.0", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/arm-linux-gnueabihf/libc.so.6", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/arm-linux-gnueabihf/libdl.so.2", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/arm-linux-gnueabihf/libnghttp2.so.14", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/arm-linux-gnueabihf/libidn2.so.0", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/arm-linux-gnueabihf/librtmp.so.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/arm-linux-gnueabihf/libssh2.so.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/arm-linux-gnueabihf/libpsl.so.5", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/arm-linux-gnueabihf/libnettle.so.6", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/arm-linux-gnueabihf/libgnutls.so.30", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/arm-linux-gnueabihf/libgssapi_krb5.so.2", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
(...)

As you see in the example, most libraries come from /usr/lib/arm-linux-gnueabihf
We have 3 libraries that come from /usr/local/lib/arm-linux-gnueabihf instead, the 3 needed for the above trace: libgnutls.so.30, libnettle.so.6, libcurl-gnutls.so.4
(note that those are symbolic links to the real libraries, which have longer names like libcurl-gnutls.so.4.5.0; you need to copy the "real" file with the longer name, and ldconfig will add the relevant symbolic links)

  • Once you are done with your debugging, simply remove the debug libraries from /usr/local/lib/arm-linux-gnueabihf then issue again
    sudo ldconfig -v

You are reverted back to using the standard libraries you installed with normal packages.

@jay jay commented Oct 25, 2019

gnutls_bye is probably sending a close_notify and waiting for a response on a dead connection.

"GNUTLS_SHUT_RDWR actually sends an alert containing a close request and waits for the peer to reply with the same message."
Try replacing it with GNUTLS_SHUT_WR, whose documented behaviour is different:

diff --git a/lib/vtls/gtls.c b/lib/vtls/gtls.c
index 0a83f35..3737d7c 100644
--- a/lib/vtls/gtls.c
+++ b/lib/vtls/gtls.c
@@ -1608,7 +1608,7 @@ static ssize_t gtls_send(struct connectdata *conn,
 static void close_one(struct ssl_connect_data *connssl)
 {
   if(BACKEND->session) {
-    gnutls_bye(BACKEND->session, GNUTLS_SHUT_RDWR);
+    gnutls_bye(BACKEND->session, GNUTLS_SHUT_WR);
     gnutls_deinit(BACKEND->session);
     BACKEND->session = NULL;
   }

That is just for testing and may have security implications, I can't tell from the documentation. But I found this https://savannah.gnu.org/support/?107495

I think the real issue here is that you're waiting long periods in the callback, which is killing the connections. I don't think we should be responsible for that. Also, if you haven't already, I suggest reviewing libcurl's thread-safety documentation.

@Bylon2 Bylon2 commented Oct 25, 2019

Indeed Jay, the callback can wait up to 45 seconds (a #define in the program). That is precisely the point of a "background reader": wait for the read requests, hoping they will come in sequence so that it can serve them from the same "stream" (the same curl range request). The peer (1fichier.com) does keep-alive for 5 minutes, but I noticed with a previous version of curl/gnutls that things do not restart smoothly if the wait exceeds 1 minute, in spite of the server's keep-alive. So 45 seconds was sort of a tradeoff.
I'll try your suggestion, and also see if it improves the previous behaviour I got when waiting over 1 minute. The "old" version on Ubuntu 16.04 is libcurl/7.47.0 (x86_64-pc-linux-gnu) GnuTLS/3.4.10

I had already thoroughly reviewed the "libcurl thread safety" page, which is why I was previously using GnuTLS instead of the default OpenSSL, which was not thread-safe as of Ubuntu 16.04's release date.
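For reference, the abort mechanism discussed in this thread can be sketched as a libcurl write callback. This is a minimal illustration, not the actual 1fichierfs code: struct reader and its abort_requested flag are invented for the example. Per the libcurl documentation, returning a count different from size * nmemb from the write callback makes the transfer fail with CURLE_WRITE_ERROR (23).

```c
#include <stddef.h>

/* Hypothetical per-reader state; in the real program this lives elsewhere
 * (e.g. the readers[] array in 1fichierfs.c). */
struct reader {
    volatile int abort_requested;  /* set by another thread on a range change */
    size_t received;
};

/* libcurl write-callback shape: returning a count different from
 * size * nmemb makes curl_easy_perform() fail with CURLE_WRITE_ERROR (23). */
static size_t write_data(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    struct reader *r = userdata;
    (void)ptr;                     /* payload handling elided */
    if (r->abort_requested)
        return 0;                  /* != size * nmemb: abort the transfer */
    r->received += size * nmemb;
    return size * nmemb;           /* accept the data, keep streaming */
}
```

With a real handle this would be wired up via curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_data) and curl_easy_setopt(curl, CURLOPT_WRITEDATA, &r).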

@Bylon2 Bylon2 commented Oct 26, 2019

Dear Jay, I did some preliminary testing with your patch, and couldn't reproduce the bug/behaviour.

So this seems to fix the bug.

What I will do now is use this GnuTLS version of my program, as I use it daily, and monitor whether I see the "bug" again. If it doesn't show up in a few days, we might consider it "fixed".

Then there will be two options:

  • either we close this bug
  • or we wait until you have considered with the main mailing list the possible "security implications".

"Security implication": my understanding is that there are none here, at least considering the "truncation attack" mentioned in the bug you linked. Are there other known attacks?

My understanding of the "truncation attack" (see an example here: https://bugs.chromium.org/p/chromium/issues/detail?id=244260)
is that the attacker sends a close alert to one of the peers (either server or client) to trick it into believing there is no more data, so that it acts on the data it already has. The other peer, unaware of the trick, thinks all the data has been transmitted. In the example above, the browser gets a partial HTTP message and is tricked into believing it is the whole message; it has been "truncated" by the attacker in a clever way so that a cookie that is normally not accessible can be stolen. If the server were made aware of the attack (by the client responding to the alert), it could take action such as revoking that cookie.

In our case, we are not at all in this situation.
We are in a part of the code where we willingly want to stop receiving data from the server. Thus the only "polite" thing to do is warn the server that we are no longer interested in any data it might send, and SHUT_WR does exactly that: https://linux.die.net/man/3/gnutls_bye

The only benefit of SHUT_RDWR would be knowing that the server indeed received the alert... but what would you do with that knowledge?
The code is flawed anyway, because the return status of gnutls_bye is not even tested. The documentation says it can return something like GNUTLS_E_AGAIN when the call was interrupted. In that case we might believe the alert was not sent to the server because of the interruption, and might want to send it again. So correct code should test that and call gnutls_bye again.

But anyway, whether the server acknowledges the alert or not, there is nothing left to do but clean up and stop everything... and there is no code doing otherwise in the case where we think the server did not receive the alert!

By the way, networks typically exhibit the "Byzantine Generals" problem: there are situations where you cannot know whether the peer received the alert, or whether the alert was triggered by an attacker.
The standard does not solve everything. It relies on the transport to carry the alert (and to retry on packet loss), but the transport itself will give up at some point (at least TCP does), and you end up in an unknown situation, unable to agree on the fact that both peers know they have to stop.

Summary: considering the current code, I see no benefit in waiting for the server's reply to the alert, since there is no code taking action A or action B depending on whether the server acknowledged it. Using SHUT_WR is the right move for the current code: it still warns the server (whenever the transport is able to deliver the alert), and at this point in the code we do not intend to read or write on that connection anyway.

Possibly you should analyse whether, further down the line, libcurl has some "optimisation" that would try to reuse the same socket with the server. That would obviously be wrong here, because we didn't wait for the server's confirmation that everything is stopped, and could receive data from a previous session. If such "optimisations" exist, they would have to be changed.

@Bylon2 Bylon2 commented Oct 26, 2019

More arguments in favor of the proposed patch!

1) The current code, using GNUTLS_SHUT_RDWR, assumes the server respects the protocol and will respond to the peer's alert with an alert of its own. If the server does not respect the protocol but still notes that the peer wants to close the connection, everything will still work, but the closing peer will, in 100% of cases, suffer a long delay waiting for an alert response that will never come, because the other peer does not respect the protocol!

(Note that in my own case, 1fichier.com's servers probably respect the TLS1.1 alert protocol because the "bug" is not systematic. If it didn't, we would suffer from the delay for all curl terminations).

2) The "truncation attack", when it occurs, happens while the peers are in active communication (not when one of them wants to stop communicating, as already pointed out). In the case of libcurl and the use here, it would mean that while downloading data from the server, the attacker sends an alert to the client, making libcurl believe that the server wants to stop the TLS session. In that case, what does libcurl do? If it uses the same close_one() function, it shouldn't use GNUTLS_SHUT_RDWR: we just want to respond to the server with the same alert to abide by the protocol, and there is no point waiting for a server alert in return since we were alerted first.
Granted, if the server responded to our alert, we could detect the attack: if the alert really came from the server first, all we need to do is respond to it; if the alert came from an attacker and the server gets our alert (assuming the attacker cannot trap it!), the server, which never sent an alert in the first place, would respond (if it respects the protocol, see 1 above).
But issuing GNUTLS_SHUT_RDWR just to detect attacks, knowing that in the normal case there is no response and we will have to suffer a timeout delay, does not seem very optimised!

3) As already pointed out, the correct code would be at least:
while (GNUTLS_E_SUCCESS != gnutls_bye(BACKEND->session, GNUTLS_SHUT_RDWR));
But that can obviously turn into an infinite loop, in case 1) above where the server does not respect the protocol and will never respond to the alert, or when something is wrong with the underlying transport and alerts do not go through.
So we would have to define something like a "max_retry" and stop once we hit it... again for very little benefit: we would be sure (barring a timeout) that the server received the information, but since we do nothing different when we are assured the server knows our intentions than when we are not, what is the point of having that information (and, more importantly, of waiting long delays for it)?
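The bounded-retry idea in point 3) could look like the sketch below. Everything here is illustrative: bye_with_retries and MAX_BYE_RETRIES are invented names, the error-code values are stand-ins mirroring the gnutls headers, and the function pointer stands in for a call to gnutls_bye() on the real session.

```c
/* Stand-in error codes (values as in the gnutls headers); real code
 * would include <gnutls/gnutls.h> instead. */
#define GNUTLS_E_SUCCESS       0
#define GNUTLS_E_AGAIN       (-28)
#define GNUTLS_E_INTERRUPTED (-52)
#define MAX_BYE_RETRIES        3   /* invented limit for the sketch */

/* bye() stands in for gnutls_bye(BACKEND->session, how): retry only the
 * "interrupted, call again" results, and give up after a bounded number
 * of attempts instead of looping forever. */
static int bye_with_retries(int (*bye)(void))
{
    int rc, tries = 0;
    do {
        rc = bye();
    } while ((rc == GNUTLS_E_AGAIN || rc == GNUTLS_E_INTERRUPTED)
             && ++tries < MAX_BYE_RETRIES);
    return rc;  /* best effort: the caller tears down regardless */
}
```

The point of the bound is exactly the one made above: a hostile or non-conforming peer must not be able to keep the close path spinning.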

@bagder bagder commented Oct 28, 2019

@jay, I think we should consider doing this WR-only change to better avoid the risk of this kind of hang. Will you make a PR out of this?

@jay jay commented Oct 28, 2019

@jay, I think we should consider doing this WR-only change to better avoid the risk of this kind of hang. Will you make a PR out of this?

Not against it, but AFAIK this is the only report we've had of this, which makes me wonder whether it's something in his configuration, or whether GnuTLS/3.6.7 has a bug. Is gnutls_bye allowed to block for a long amount of time?

/cc @nmav

@Bylon2 Bylon2 commented Oct 28, 2019

It could possibly be a bug in this version of GnuTLS.
Having used my driver for more than a year, I had never noticed this "bug"/"behaviour" of strange delays with the same code compiled against:

libcurl/7.47.0 (x86_64-pc-linux-gnu) GnuTLS/3.4.10

As for libcurl, I am not sure whether it was calling gnutls_bye with WR-only at the time and someone changed it to RDWR between libcurl 7.47 and 7.64... if you can't check that with git, I can download the source from the Ubuntu 16.04 repo and check.

This is possibly the first report of this behaviour because you need a very "strange" combination to make it happen:

  • download some data with a callback
  • make the callback wait for some time (let's say 30 seconds)
  • then make the callback decide it does not want any more data and return zero (different from size * nmemb).
  • and obviously you need to link with GnuTLS (which is not the default; as I understand it, the library named 'libcurl.so' is in fact 'libcurl-openssl.so', so OpenSSL seems to be the natural default linking library). Actually I wanted to use OpenSSL but couldn't, for thread-safety reasons.

Not a combination that you would find in most uses of libcurl!
And yes, strangely enough the same combination with libcurl/7.47.0 - x86_64 - GnuTLS/3.4.10 didn't "bug"!

It could also be a plain bug of GnuTLS on ARM, as I'm now on ARM (32 bits at the moment). As you know, the memory model is quite different between i386/amd64 and ARM, especially regarding the atomics one might use for synchronization and thread safety. A bug in this domain is hard to track!

EDIT: after 3 days of usage with the patch, the "bug" could not be reproduced. So far, so good!

@nmav nmav commented Oct 29, 2019

gnutls_bye with RDWR can block for an indefinite amount of time unless gnutls_record_set_timeout is set. In certain cases it may be more appropriate to use just WR when closing the session (e.g. if you are not re-using the socket, or expecting data).
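As a self-contained model of that suggestion (this is not gnutls code: both helper functions only mimic the documented behaviour of gnutls_record_set_timeout() and gnutls_bye(), and the GNUTLS_E_TIMEDOUT value is a stand-in), bounding a blocking RDWR bye could look like:

```c
#define GNUTLS_E_SUCCESS    0
#define GNUTLS_E_TIMEDOUT (-319)   /* stand-in value for the sketch */

struct tls_session {
    unsigned timeout_ms;   /* 0 means "block indefinitely" */
    int peer_replies;      /* does the peer answer our close alert? */
};

/* models gnutls_record_set_timeout(session, ms) */
static void record_set_timeout(struct tls_session *s, unsigned ms)
{
    s->timeout_ms = ms;
}

/* models gnutls_bye(session, GNUTLS_SHUT_RDWR): with a timeout armed it
 * comes back with GNUTLS_E_TIMEDOUT instead of hanging when no reply
 * arrives (without one, this is where the reported multi-minute wait lives) */
static int bye_rdwr(struct tls_session *s)
{
    if (s->peer_replies)
        return GNUTLS_E_SUCCESS;
    return GNUTLS_E_TIMEDOUT;
}

/* close path: arm the bound before the possibly blocking bye */
static int close_bounded(struct tls_session *s)
{
    record_set_timeout(s, 5000);   /* wait at most ~5 s for the peer */
    return bye_rdwr(s);            /* a timeout is acceptable here: we are
                                      tearing the connection down anyway */
}
```

This keeps the RDWR semantics (we still wait for the peer's reply when it comes) while capping the damage a silent peer can do.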

@bagder bagder commented Oct 29, 2019

Thanks @nmav! I think this counts as another vote for WR-only for this case.

bagder added a commit that referenced this issue Oct 30, 2019
... as it can make it wait there for a long time for no good purpose.

Patched-by: Jay Satiro
Reported-by: Bylon2 on github
Adviced-by: Nikos Mavrogiannopoulos

Fixes #4487
@bagder bagder closed this in c2b01cc Oct 31, 2019