HTTP2 connections stuck when paused and resumed multiple times #10988
The title of this issue says "connections stuck when paused and resumed multiple times" and then you draw a connection to #10972. But that PR is about when a paused transfer is aborted; it is not about being "paused/unpaused" at all. It makes me curious: which one is actually necessary to trigger this problem?
Are you 100% certain of this? Because the only thing my PR there did was make sure data was freed that otherwise was not freed at all. That data had already been read off the stream and connection, and freeing it or not does not change the state of the connection. To me, it seems super-weird and unlikely that PR #10972 changes the state of the connection in any way.
The title might be a bit imprecise. The test program does a combination of abort/pause/resume:
All requests are fired at the same URL. The same code works fine with HTTP/1.1.
Just one? Will the others go on? Are these h2 requests all done over the same connection?
It's not deterministic, anything from 0 to 5 will be stuck; it probably depends on timing.
They should, no? All requests use the same multi handle.
I'm looking at the
The freeing of that data is completely unrelated to the connection's lifetime or when it gets closed or not.
That would be most helpful!
744dcf2 is the culprit.
/cc @icing ^^
Could you check if #10998 resolves your issue?
Tested, #10998 doesn't help.
Thanks for checking. We definitely need to add a test case with this sort of lib use.
- curl#10988 reports transfer stalling when pausing/unpausing: if the server's send window is exhausted before the pause, it will not send more on unpausing until the transfer consumes its data.
- make sure unpaused transfers get drained and expire, so any data already buffered will be processed. This triggers a WINDOW_UPDATE to the server and it will start sending again.
@lemourin I added a test with pause/unpausing of transfers and could reproduce a situation where transfers got stuck. That I fixed in #11005. Now, I do not know if this exactly matches what you observe, but I know it has one bug less. If you could find the time to run that in your environment, that'd be helpful. Thanks.
It's still the same for me, unfortunately. For context, in case it's useful: I have an HTTP server which generates thumbnails for videos stored in cloud storage (e.g. OneDrive). It renders html that looks like this. The video data is fetched using curl and then passed through to FFmpeg (through the C API). When FFmpeg says "please read n bytes from the stream", I tell curl to fire a request, and as soon as n bytes are fetched I pause the connection to avoid buffering. When FFmpeg asks for m more bytes, I resume the connection. With curl at head, the first time I try to load the file list it works fine; very rarely it gets stuck on the last thumbnail. When I do a hard reload (to force thumbnail regeneration) it loads around 3 thumbnails and the rest are stuck; it's 100% reproducible.
I tried to find your code, but I think we need a standalone program that reproduces your issue.
My curl usage is in the coro-http library. It wraps
The program will start an HTTP server on port 12345. Go to The app can also be built without The app can be compiled with
I updated #11005 with a fix for my current theory of what is going wrong in your case.

With 81b2b57 we introduced more restrictive HTTP/2 flow control windows for transfers. This has many benefits, but in the case of paused/unpaused transfers it needs extra care. In an event-based app like yours, transfers are only run when socket events happen or when curl timeouts trigger. The unpausing of an HTTP/2 transfer did not trigger a timeout; it just relied on new DATA arriving and generating socket events. That is not always the case. On a pause, we restrict the flow window to 0 and on unpause, we increase it again. But the server will not always send more DATA: it can happen that, before it received the pause's window of 0, it had already sent enough DATA to exhaust the window. When the window is increased again, the server sees that its earlier DATA still exhausts the newly enlarged window and sends nothing. This means no socket events and, without a curl timeout set, no events at all, so the transfer stalls.

The fix is to always generate a timeout on unpausing, so a receive on the transfer will be made. I hope this works for you and catches the underlying issue.
Seems we need a real trace of your app's libcurl use. How to do that:
Logs are on Google Drive. On shutdown all stalled requests are aborted. curl was built at #11005.
Thanks! I'm missing 2 things in the log:
which will give us all the details of how the HTTP/2 implementation is called and what streams it opens, etc. Could you try and get a log with that information?
Sorry, I'm using cmake to build curl. I updated the log file. |
Thanks! There are more interesting things to see in that file:
So, there is a little bit more going on. The 3 RST requests are the problem, and those are probably the ones you see as stalled.

The cause is that HTTP/2 not only has flow control for individual streams, but also for the overall connection. I set that - unwisely - to 16 times the stream window. That would limit the amount of memory buffered in total and still allow for good performance, if all streams were indeed being read. I did not take into account a case like yours where - and that is what is happening in your app - the PAUSED streams bring the overall connection window to 0. This is why the server cannot send you any more response data. It makes transfers seem stalled, but it is the whole connection that is.

In #11043 I increase the overall window to 100 * the window size of each stream. So one may PAUSE 99/100 streams and the last one will still find enough space to continue. Please give that PR a try.
Thanks @icing, looks like it works. So, if I understand correctly, if I try to pause 100 of 101 transfers, then the 101st transfer will stall? If so, then maybe this behavior should be surfaced in the documentation, or maybe the window size should be made configurable? I'm just concerned that if I set the parallelism too high for my app, it would break.
No, this counts the active transfers on a single connection. And, commonly, servers and clients like curl only allow up to 100 simultaneous transfers on one connection. When your app starts the 101st transfer against the same server, curl will open another connection.
FYI: just added #11052, which fixes "one last thing" with regard to flow control handling of cancelled transfers.
- fixes stalled connections
- make the connection window large enough, so that there is some room left should 99/100 streams be PAUSED by the application

Reported-by: Paweł Wegner
Fixes curl#10988
Closes curl#11043
I'd reopen #10971 if I could. The fix for that bug introduced a new regression, please see #10971 (comment).
I did more testing today with my real app, coro-cloudstorage, and I noticed that the downloads are now getting stuck. I can provide more context on what the app does if necessary. I changed the test program a bit; it now resumes some transfers. With the memory leak fix, one transfer gets stuck and returns "Stream error in the HTTP/2 framing layer" after ~5 minutes.
Without #10972 it leaks, with #10972 the connections get stuck.