HTTP/2 connection returning CURLE_HTTP2 error after a while #5389
Comments
We would need to know more about the error to know if we're handling it correctly. I don't recall a similar post. The timeout error sticks out; that strikes me as a server issue, though possibly there is some DNS issue and curl is contacting a server that has since gone down.
Set CURLOPT_VERBOSE in your app. If you are using curl w/OpenSSL then you can also set the SSLKEYLOGFILE environment variable, then run the app and monitor the traffic in Wireshark.
Please note that it first returns CURLE_HTTP2 (after an hour of working all right) before it starts returning timeout errors. That also happens after quite a while (O(hours)).
My blind guess is that perhaps the server sends a |
Ok then!
But connection 1 also dies in a less worthy way, that is, it is a zombie for a while (~4 mins). You can see that before connection 1 is declared dead there are a lot of POST requests that go unanswered; they simply time out (and report CURLE_HTTP2 as recorded in a separate log). The window of time seems to go from 21:29:35 to 21:33:35. What I find interesting is that the error passed to the user is CURLE_HTTP2 instead of timeout :/ Logfile: https://gist.github.com/davidgfnet/f0d6188c5a035c3ad3bb5b52a78c2f66
Oh quick update, this seems to happen roughly every hour, the connection dies from the server, I guess in a silent? way? Cause sometimes CURL will notice and sometimes it will keep pushing data for a few minutes before dying. Just saw it happen another time on the next hour boundary.
I don't see any error in that log?
That's the curl verbose log (CURLOPT_VERBOSE + CURLOPT_STDERR). May 14 21:29:42 (... redacted ...)[4005989]: msg->data.result = 16 (that's a separate log that records errors on HTTP completion, the code looks approximately like:)
So how do they relate? The first log seems to not show any problems. The latter one shows a bunch of transfers apparently returning The key we're looking for is why they suddenly start to fail. For this, I presume we will need some level of HTTP/2 details from the connections or streams. Perhaps by enabling HTTP2 debugging in http2.c, perhaps by wiresharking the connection, perhaps some other means. I'm open for suggestions. |
The reason why it suddenly fails I don't know. My guess is that the Google server kills the connection unilaterally after 3600s and stops responding completely, likely not sending any sort of message to tell the client it did. What I do not understand is why CURL then returns that error; shouldn't the requests just time out? Perhaps the server sends a funny message (like "I've never seen this connection")? Should I capture a pcap or go for http2.c debugging? Which is better? I assume the connection is using TLS and we won't be able to see much?
I don't think a guess helps us much here. My guess would be different, but as that's a guess as well it's not too relevant.
If the connection closed they shouldn't time out, they should return errors of some sort. Closing a connection at the wrong time (like when there are live streams on it) is a bad thing and can of course trigger http2 (and other) errors. But we're talking about a possible bug so then of course curl isn't behaving like it should and then it's hard to say exactly what's going on.
Both will generate a lot of content that's hard to interpret, when probably only the last few requests before the error triggers are actually relevant and interesting for us to inspect. They would be sort of complementary so I can't tell which one would be better.
You'd have to do it the SSLKEYLOGFILE way to make sure you can inspect the TLS protected connections as well. Can you make the problem happen if you just make one request, wait 59 minutes and then make another one or similar? It would be good to reduce the amount of traffic but still see the problem...
I can certainly give that approach a try. I will try to create a test case for this instead of testing in production systems :) If I do nothing for 59 mins, will curl close the connection? I mean, does the connection cache expire? It can take a bit to reproduce, since sometimes it works (previous connection is closed more or less gracefully) and sometimes it does not.
No, it won't work that easily. 😢 The server will most likely close the connection within N minutes/seconds unless it's used and curl will only reuse connections that were used within the I presume a better way then is to basically go on like before for the first 59 minutes and then for the last minute or so, increase logging of connection and HTTP/2 details. See |
Have you made any more progress in debugging this problem? I have also observed the same issue when making requests to Google services, but haven't yet set up a test case that reproduces it easily.
Nope sorry! I sort of fixed the bug short-term by forcing HTTP1.1 in all my clients and forgot about building a test case, since it is quite time consuming to do (you need to wait to ensure it happens).
Okay, thanks for your initial analysis and the update. We are now looking into this at Google, and can reliably reproduce this: there does appear to be a problem in the interaction between curl and the Google frontend proxy server ("GFE") when the maximum connection age, which is normally 1 hour, is reached.
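One client-side way to sidestep a server-side maximum connection age is to stop using a connection shortly before that age is reached. The following is only a minimal sketch of that idea in plain C, not anything curl does internally; the function name `connection_should_retire` and the 60 second margin are assumptions made for illustration, and the 3600 second figure is the "normally 1 hour" limit mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical safety margin: retire a connection this many seconds
 * before the server-side limit would be reached. */
#define RETIRE_MARGIN_SECONDS 60L

/* Returns true when a connection opened at `opened_at` should no longer
 * be used for new requests at time `now`, given a server-side maximum
 * connection age of `server_max_age` seconds (about 3600 in the case
 * described in this thread). */
bool connection_should_retire(time_t opened_at, time_t now,
                              long server_max_age)
{
    assert(server_max_age >= 0);
    double age = difftime(now, opened_at);
    return age >= (double)(server_max_age - RETIRE_MARGIN_SECONDS);
}
```

An application would record the time at which it first used a connection and consult this check before each request. Newer libcurl releases also offer `CURLOPT_MAXLIFETIME_CONN`, which limits connection lifetime inside the library, if the libcurl in use has it.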
Here is a packet capture and a debug log that reproduces the error: https://gist.github.com/jbms/18c93e28392a5244457de0e1db460f82 This is using curl head as of today 01afe4e, with #5641 applied to fix the http2 debug messages, using nghttp2 version 1.40.0. It appears that on_stream_close gets called due to the GOAWAY message, but http2_handle_stream_close is never called to handle it. However, it was not clear to me what the correct fix would be. I would much appreciate any help in fixing this. Let me know if any additional information/captures would be helpful.
Ah lol you also work at Google :P Thanks for digging further into this, I'm happy to know I'm not the only one.
I'm not too knowledgeable about HTTP2 (and I don't work on the GFE server) but from looking at the packet trace and the HTTP/2 spec it appears that the GFE server is following the spec.
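For readers following the GOAWAY discussion, the relevant rule in the HTTP/2 spec (RFC 7540, section 6.8) is that a GOAWAY frame names the last stream the sender might have processed; streams with a higher identifier were not processed and may be retried on a new connection. The sketch below expresses just that rule. It is not taken from curl's http2.c or from nghttp2, and the function name `stream_safe_to_retry` is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per RFC 7540 section 6.8, streams whose identifier is higher than the
 * last stream identifier carried in a GOAWAY frame were not processed by
 * the peer, so their requests can be sent again on a new connection.
 * Streams at or below that identifier may still complete or may have had
 * side effects, so they are not automatically safe to retry. */
bool stream_safe_to_retry(uint32_t stream_id, uint32_t goaway_last_stream_id)
{
    /* Stream identifiers are 31-bit values; the top bit is reserved. */
    assert((stream_id & 0x80000000u) == 0);
    assert((goaway_last_stream_id & 0x80000000u) == 0);
    return stream_id > goaway_last_stream_id;
}
```

For example, if the server sends GOAWAY with last stream identifier 5 while the client has streams 5 and 7 open, only stream 7 is known to be unprocessed.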
This makes it sound like #5611 is the same/very similar issue and basically it is "just" a problem with GOAWAY so maybe the easiest way to reproduce (and subsequently fix) is to for example run nginx and set We've had a similar issue in the past and I know I've fixed it before, but apparently something has made it come back. I'll work on it, but I'm almost in vacation mode here for a while so it might not go super fast for me. |
Thanks for looking into this @bagder. It seems like it may be related, though the error log in issue #5611 does not show curl error 16 (CURLE_HTTP2); it seems like it is failing in a slightly different way. From looking briefly at http2.c it seems that STREAM_REFUSED is handled in several places, but perhaps not all cases are covered. The packet capture that reproduces this is quite short (just 18 http2 packets), so perhaps a test case could be created from that.
Did the landing of #5643 change anything?
I'm pretty sure that it fixes the first issue mentioned of failure after an hour. This is identical to the state that I tested, where the connection is closed by the server, but the client side of the connection ends up in the cache. I'm not sure about the 4 minute issue mentioned later. I suspect that any place nghttp2 indicates a framing error that the connection will be impossible to recover and it should be closed and not reused. But I didn't audit for all of those places.
In my eyes, that's what this issue (#5389) is about and if so, we consider this issue handled. If there are more issues still present, please file a new issue and let us know how to reproduce or provide enough details so that we can properly understand it!
It was found that the google storage api would cause curl to return a CURLE_HTTP2 error after around an hour. The workaround implemented here is to close the connection and open a new one. The issue appears similar to curl/curl#5389 and curl/curl#5643. The fix curl/curl@ef86daf didn't solve the issue (version 7.74.0).
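The workaround described above (drop the connection and retry on a new one) boils down to a small decision on each failed transfer. The sketch below shows one way that decision could look; it is an assumption about the shape of such a workaround, not the code from the referenced commit. The value 16 is the numeric value of `CURLE_HTTP2` as logged earlier in this thread; the names `RESULT_HTTP2`, `MAX_FRESH_RETRIES` and `should_retry_on_fresh_connection` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Numeric values as defined by libcurl's CURLcode enum, redefined here
 * only so the sketch compiles without the curl headers. */
enum { RESULT_OK = 0, RESULT_HTTP2 = 16 };

/* Hypothetical limit on how often one request is retried this way. */
#define MAX_FRESH_RETRIES 1

/* Returns true when a transfer that finished with `result` should be
 * attempted again on a brand new connection. Only the HTTP/2 framing
 * error is treated as a reason to retry; other failures are left to the
 * caller's normal error handling. */
bool should_retry_on_fresh_connection(int result, int retries_done)
{
    assert(retries_done >= 0);
    return result == RESULT_HTTP2 && retries_done < MAX_FRESH_RETRIES;
}
```

With libcurl, the retried transfer can be forced onto a new connection by setting `CURLOPT_FRESH_CONNECT` to `1L` on the easy handle before it is run again.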
Hello there!
I have some code that uses libcurl to push data to Google Analytics [1]. This code was recently updated to use a slightly different endpoint (POST). Since I upgraded I'm having issues due to HTTP2. I get CURLE_HTTP2 error as a result. This happens in a very interesting pattern and the code never recovers (I'm guessing due to connection reuse? perhaps TLS session reuse? I'm no expert in HTTP2 but I know it's a complicated proto).
The events happen as follows:
(this might be related to the endpoint behaviour on 'broken' connections or something, since it's more variable)
My guess here is that a single HTTP2 connection is created with the remote server google-analytics.com and it is used (via multiple streams?) to push the different POST requests. The implementation uses the multi interface but I do not think that is super-relevant here [2].
I added a flag to use HTTP1.1 and this works flawlessly :) Interestingly I have another instance that pushes data at a much slower pace (like a query every 10-60 seconds) and this one never triggers the error, which is odd given that the error seems time based and not query count based (see it happens after an hour exactly).
I'm guessing this is an HTTP2 implementation error, but given that the server is Google's implementation it can be a bit tricky to figure out. Any ideas on how I could get more info on this?
The libcurl version is 7.68. I saw some HTTP2 bugs being fixed in previous versions, but could not find any HTTP2-related bug fix in versions 7.68-7.70.
Thanks!
[1] https://github.com/davidgfnet/tgbot-framework/blob/master/galogger.h
[2] https://github.com/davidgfnet/tgbot-framework/blob/master/httpclient.h