HTTP/2 connection returning CURLE_HTTP2 error after a while #5389

Closed
davidgfnet opened this issue May 13, 2020 · 23 comments

davidgfnet commented May 13, 2020

Hello there!

I have some code that uses libcurl to push data to Google Analytics [1]. This code was recently updated to use a slightly different endpoint (POST). Since I upgraded I've been having issues due to HTTP/2: I get a CURLE_HTTP2 error as a result. This happens in a very interesting pattern and the code never recovers (I'm guessing due to connection reuse? Perhaps TLS session reuse? I'm no expert in HTTP/2, but I know it's a complicated protocol).

The events happen as follows:

  • The service starts pushing data via HTTP/2 every 3-7 seconds on average.
  • After exactly an hour (plus some seconds), curl starts returning the CURLE_HTTP2 error.
  • The service retries these queries and also makes other new queries, but still gets CURLE_HTTP2 errors.
  • After some amount of time (it can be O(hours)) I get CURLE_OPERATION_TIMEDOUT. This never recovers.
    (this might be related to the endpoint's behaviour on 'broken' connections or something, since it's more variable)

My guess here is that a single HTTP/2 connection is created to the remote server google-analytics.com and is used (via multiple streams?) to push the different POST requests. The implementation uses the multi interface, but I do not think that is super relevant here [2].

I added a flag to use HTTP/1.1 and this works flawlessly :) Interestingly, I have another instance that pushes data at a much slower pace (a query every 10-60 seconds or so) and that one never triggers the error, which is odd given that the error seems time-based and not query-count-based (it happens after exactly an hour).
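
For reference, a minimal sketch of what that HTTP/1.1 fallback boils down to (the URL and payload here are placeholders; the real setup lives in [2]):

#include <curl/curl.h>

// Hedged sketch: pin an easy handle to HTTP/1.1 so no HTTP/2 connection
// (or stream multiplexing) is involved. The URL and payload are placeholders.
int main(void) {
  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURL *h = curl_easy_init();
  curl_easy_setopt(h, CURLOPT_URL, "https://www.google-analytics.com/");
  curl_easy_setopt(h, CURLOPT_POSTFIELDS, "placeholder=1");
  curl_easy_setopt(h, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
  CURLcode rc = curl_easy_perform(h);
  curl_easy_cleanup(h);
  curl_global_cleanup();
  return rc == CURLE_OK ? 0 : 1;
}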

I'm guessing this is an HTTP/2 implementation error, but given that the server is Google's implementation, it can be a bit tricky to figure out. Any ideas on how I could get more info on this?

The libcurl version is 7.68; I saw some HTTP/2 bugs being fixed in previous versions, but could not find any HTTP/2-related bug fix in versions 7.68-7.70.
Thanks!

[1] https://github.com/davidgfnet/tgbot-framework/blob/master/galogger.h
[2] https://github.com/davidgfnet/tgbot-framework/blob/master/httpclient.h

jay added the HTTP/2 label May 13, 2020
jay (Member) commented May 13, 2020

We would need to know more about the error to know if we're handling it correctly. I don't recall a similar post. The timeout error sticks out; that strikes me as a server issue, though possibly there is some DNS issue and curl is contacting a server that has since gone down.

Any ideas on how I could get more info on this?

Set CURLOPT_VERBOSE in your app. If you are using curl with OpenSSL then you can also set the SSLKEYLOGFILE environment variable, monitor in Wireshark, and run the app.
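
A minimal sketch of turning that on per easy handle (the log file handling is just an example, not taken from the reporter's code):

#include <cstdio>
#include <curl/curl.h>

// Hedged sketch: enable libcurl's verbose protocol log and send it to a
// file so it can be inspected once the CURLE_HTTP2 errors start showing up.
void enable_verbose(CURL *handle, std::FILE *logfile) {
  curl_easy_setopt(handle, CURLOPT_VERBOSE, 1L);
  if (logfile)
    curl_easy_setopt(handle, CURLOPT_STDERR, logfile);
}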

davidgfnet (Author) commented

Please note that it first returns CURLE_HTTP2 (after an hour of working all right) before it starts returning timeout errors. That also happens only after quite a while (O(hours)).
The host is "google-analytics.com", which I'm confident is up and unlikely to go down :)
I can try enabling the VERBOSE option after I receive the first error, so as not to clutter the log too much (this service has some amount of traffic); let's see how that goes :)

bagder (Member) commented May 13, 2020

My blind guess is that perhaps the server sends a GOAWAY after an hour and curl doesn't treat it correctly. But I have nothing to back that up, and I'm not aware of any GOAWAY issues...

davidgfnet (Author) commented

Ok then!
I got a log; it's very long, but I partially debugged it for you. Check "Connection #2" in the log. So connection #0 dies after an hour:

  • Connection 0 seems to be dead!
  • Closing connection 0

But connection 1 also dies, in a less graceful way: it is a zombie for a while (~4 mins). You can see that before connection 1 is declared dead there are a lot of POST requests that go unanswered; they simply time out (and report CURLE_HTTP2, as recorded in a separate log). The window of time seems to go from 21:29:35 to 21:33:35. What I find interesting is that the error passed to the user is CURLE_HTTP2 instead of a timeout :/
Do you need a more detailed log somehow? Let me know if you want more info.

Logfile: https://gist.github.com/davidgfnet/f0d6188c5a035c3ad3bb5b52a78c2f66

davidgfnet (Author) commented

Oh, quick update: this seems to happen roughly every hour. The connection is killed by the server, I guess silently, because sometimes curl will notice and sometimes it will keep pushing data for a few minutes before dying. I just saw it happen again on the next hour boundary.

bagder (Member) commented May 14, 2020

I don't see any error in that log?

davidgfnet (Author) commented

That's the curl verbose log (CURLOPT_VERBOSE + CURLOPT_STDERR).
The error log would be:

May 14 21:29:42 (... redacted ...)[4005989]: msg->data.result = 16
May 14 21:29:43 (... redacted ...)[4005989]: msg->data.result = 16
May 14 21:29:44 (... redacted ...)[4005989]: msg->data.result = 16
May 14 21:29:44 (... redacted ...)[4005989]: msg->data.result = 16
May 14 21:29:45 (... redacted ...)[4005989]: msg->data.result = 16
May 14 21:29:45 (... redacted ...)[4005989]: msg->data.result = 16
May 14 21:29:45 (... redacted ...)[4005989]: msg->data.result = 16
May 14 21:29:45 (... redacted ...)[4005989]: msg->data.result = 16
May 14 21:29:45 (... redacted ...)[4005989]: msg->data.result = 16
May 14 21:29:45 (... redacted ...)[4005989]: msg->data.result = 16
May 14 21:29:45 (... redacted ...)[4005989]: msg->data.result = 16
May 14 21:29:45 (... redacted ...)[4005989]: msg->data.result = 16

(that's a separate log that records errors on HTTP completion; the code looks approximately like this:)

CURLMsg *msg;
while ((msg = curl_multi_info_read(multi_handle, &msgs_left))) {
  if (msg->msg == CURLMSG_DONE) {
    CURL *h = msg->easy_handle;
    bool okcode = (msg->data.result == CURLE_OK);
    if (!okcode)
      std::cerr << msg->data.result << std::endl;
    curl_multi_remove_handle(multi_handle, h);
    curl_easy_cleanup(h);
  }
}
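
(As an aside, result 16 is CURLE_HTTP2; the raw code could be made self-describing with curl_easy_strerror(), e.g. a small tweak to the snippet above:)

if (!okcode)
  // print both the numeric code and libcurl's human-readable description
  std::cerr << msg->data.result << " ("
            << curl_easy_strerror(msg->data.result) << ")" << std::endl;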

bagder (Member) commented May 15, 2020

So how do they relate? The first log seems to not show any problems. The latter one shows a bunch of transfers apparently returning CURLE_HTTP2. I don't understand what I should do with these...

The key we're looking for is why they suddenly start to fail. For this, I presume we will need some level of HTTP/2 details from the connections or streams. Perhaps by enabling HTTP/2 debugging in http2.c, perhaps by wiresharking the connection, perhaps some other means. I'm open to suggestions.

davidgfnet (Author) commented

The first log seems to not show any problems.
Well, are you sure? You can see a lot of POSTs that get no answer; I do consider that a problem, in a way, right? See this shorter log if it helps: https://gist.github.com/davidgfnet/eee9c67359be8b2059b71e2bbed74bf4

The latter one shows a bunch of transfers apparently returning CURLE_HTTP2. I don't understand what I should do with these...
During a 4-minute period curl sends POST requests that immediately fail with CURLE_HTTP2 (I can deduce they are immediate since the timestamps in the log are only milliseconds apart, and the CPU usage during that period goes through the roof, since I do not use backoff for retries). My question here would be: why is curl not trying to create a new connection if the current one has problems? Is CURLE_HTTP2 some transient error? It clearly never recovers.

Why it suddenly fails, I don't know. My guess is that the Google server kills the connection unilaterally after 3600s and stops responding completely, likely without sending any sort of message to tell the client it did. What I do not understand is why curl then returns that error; shouldn't the requests just time out? Perhaps the server sends a funny message (like "I've never seen this connection")?

Should I capture a pcap or go for http2.c debugging? Which is better? I assume the connection is using TLS and we won't be able to see much?
Thanks!
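
(For what it's worth, a hedged sketch of the brute-force workaround for a poisoned cached connection would be to force the retry onto a brand-new connection; build_retry_handle() here is a hypothetical helper, not something from [2]:)

// Hedged sketch: make the retry bypass the connection cache entirely.
// CURLOPT_FRESH_CONNECT forces a new connection for this transfer and
// CURLOPT_FORBID_REUSE keeps it from being cached afterwards.
// build_retry_handle() is a hypothetical helper, not from the real code.
CURL *retry = build_retry_handle();
curl_easy_setopt(retry, CURLOPT_FRESH_CONNECT, 1L);
curl_easy_setopt(retry, CURLOPT_FORBID_REUSE, 1L);
curl_multi_add_handle(multi_handle, retry);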

bagder (Member) commented May 17, 2020

My guess is that the Google server kills the connection unilaterally after 3600s and stops responding completely

I don't think a guess helps us much here. My guess would be different, but as that's a guess as well it's not too relevant.

What I do not understand is why curl then returns that error; shouldn't the requests just time out?

If the connection closed, they shouldn't time out; they should return errors of some sort. Closing a connection at the wrong time (like when there are live streams on it) is a bad thing and can of course trigger HTTP/2 (and other) errors. But we're talking about a possible bug, so of course curl isn't behaving like it should, and then it's hard to say exactly what's going on.

Should I capture a pcap or go for http2.c debugging? Which is better?

Both will generate a lot of content that's hard to interpret, when probably only the last few requests before the error triggers are actually relevant and interesting for us to inspect. They would be sort of complementary so I can't tell which one would be better.

I assume the connection is using TLS and we won't be able to see much?

You'd have to do it the SSLKEYLOGFILE way to make sure you can inspect the TLS-protected connections as well.

Can you make the problem happen if you just make one request, wait 59 minutes and then make another one, or similar? It would be good to reduce the amount of traffic but still see the problem...

davidgfnet (Author) commented

I can certainly give that approach a try. I will try to create a test case for this instead of testing in production systems :) If I do nothing for 59 mins, will curl close the connection? I mean, does the connection cache expire? It can take a bit to reproduce, since sometimes it works (previous connection is closed more or less gracefully) and sometimes it does not.

bagder (Member) commented May 18, 2020

No, it won't work that easily. 😢

The server will most likely close the connection within N minutes/seconds unless it's used, and curl will only reuse connections that were used within the last CURLOPT_MAXAGE_CONN seconds. The latter is of course easy to fix, but the server behaviour will basically force you to do very frequent requests during the hour to maintain the connection.

I presume a better way then is to basically go on like before for the first 59 minutes and then, for the last minute or so, increase logging of connection and HTTP/2 details. See DEBUG_HTTP2 in lib/http2.c for one way to get more h2 debug info.
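
The client-side part of that is indeed a one-liner; a hedged sketch (the two-hour value is an arbitrary example, just large enough to outlive the server's one-hour limit):

// Hedged sketch: let curl consider a cached connection reusable for up
// to two hours, so the client-side age limit does not expire it before
// the server's one-hour cutoff is reached. 7200 is an arbitrary example.
curl_easy_setopt(handle, CURLOPT_MAXAGE_CONN, 7200L);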

jbms (Contributor) commented Jul 1, 2020

Have you made any more progress in debugging this problem? I have also observed the same issue when making requests to Google services, but haven't yet set up a test case that reproduces it easily.

davidgfnet (Author) commented

Nope, sorry! I sort of fixed the bug short-term by forcing HTTP/1.1 in all my clients and forgot about building a test case, since it is quite time-consuming to do (you need to wait to ensure it happens).
Since this looks like Google Frontend behaviour, it should be reproducible with any Google service. As usual, do not take any of this as a confirmation; this is just a personal suspicion on how this works. Did you find any user-friendly endpoint we could test and play with (that's not Google Analytics)? Thanks!
Also, I'm surprised there aren't more people hitting this, right?

jbms (Contributor) commented Jul 2, 2020

Okay, thanks for your initial analysis and the update. We are now looking into this at Google, and can reliably reproduce this --- there does appear to be a problem in the interaction between curl and the Google frontend proxy server ("GFE") when the maximum connection age, which is normally 1 hour, is reached.

jbms (Contributor) commented Jul 2, 2020

Here is a packet capture and a debug log that reproduces the error:

https://gist.github.com/jbms/18c93e28392a5244457de0e1db460f82

This is using curl head as of today 01afe4e, with #5641 applied to fix the http2 debug messages, using nghttp2 version 1.40.0.

It appears that on_stream_close gets called due to the GOAWAY message, but http2_handle_stream_close is never called to handle it. However, it was not clear to me what the correct fix would be.

I would much appreciate any help in fixing this. Let me know if any additional information/captures would be helpful.

davidgfnet (Author) commented

Ah lol you also work at Google :P Thanks for digging further into this, I'm happy to know I'm not the only one.
@bagder Is this good enough to figure out what's wrong with this? Now I'm afraid the bug could be on either side :P Thank you both for your help!

jbms (Contributor) commented Jul 2, 2020

I'm not too knowledgeable about HTTP/2 (and I don't work on the GFE server), but from looking at the packet trace and the HTTP/2 spec it appears that the GFE server is following the spec.

bagder (Member) commented Jul 2, 2020

This makes it sound like #5611 is the same or a very similar issue, and basically it is "just" a problem with GOAWAY. So maybe the easiest way to reproduce (and subsequently fix) it is to, for example, run nginx, set http2_max_requests to something low like '3', and then do 4 transfers from that same host...
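
A hedged sketch of that nginx setup (directive names from nginx's HTTP/2 module; the port, certificate paths, and response body are arbitrary examples):

# Hedged sketch of the reproduction: an HTTP/2 server that sends GOAWAY
# after 3 requests on a connection. Port, cert paths and response body
# are arbitrary examples.
server {
    listen 8443 ssl http2;
    ssl_certificate     /etc/nginx/test.crt;
    ssl_certificate_key /etc/nginx/test.key;
    http2_max_requests  3;
    location / {
        return 200 "ok\n";
    }
}

With that in place, running four transfers in a row against that server through one multi handle should hit the server's GOAWAY on the fourth.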

We've had a similar issue in the past and I know I've fixed it before, but apparently something has made it come back. I'll work on it, but I'm almost in vacation mode here for a while so it might not go super fast for me.

jbms (Contributor) commented Jul 2, 2020

Thanks for looking into this @bagder. It seems like it may be related, though the error log in issue #5611 does not show curl error 16 (CURLE_HTTP2) --- it seems to be failing in a slightly different way. From looking briefly at http2.c, it seems that STREAM_REFUSED is handled in several places, but perhaps not all cases are covered. The packet capture that reproduces this is quite short (just 18 http2 packets), so perhaps a test case could be created from it.

bagder (Member) commented Jul 3, 2020

Did the landing of #5643 change anything?

laramiel (Contributor) commented Jul 3, 2020

I'm pretty sure that it fixes the first issue mentioned, the failure after an hour. This is identical to the state that I tested, where the connection is closed by the server but the client side of the connection ends up staying in the cache.

I'm not sure about the 4-minute issue mentioned later.

I suspect that anywhere nghttp2 indicates a framing error, the connection will be impossible to recover and should be closed and not reused. But I didn't audit all of those places.

bagder (Member) commented Aug 12, 2020

I'm pretty sure that it fixes the first issue mentioned, the failure after an hour.

In my eyes, that's what this issue (#5389) is about and if so, we consider this issue handled. If there are more issues still present, please file a new issue and let us know how to reproduce or provide enough details so that we can properly understand it!

bagder closed this as completed Aug 12, 2020
philipnbbc added a commit to bbc/bmx that referenced this issue Mar 7, 2023
It was found that the google storage api would cause curl to return
a CURLE_HTTP2 error after around an hour. The workaround implemented here
is to close the connection and open a new one. The issue appears similar
to curl/curl#5389 and
curl/curl#5643. The fix
curl/curl@ef86daf
didn't solve the issue (version 7.74.0).
philipnbbc added a commit to bbc/bmx that referenced this issue Mar 28, 2023
It was found that the google storage api would cause curl to return
a CURLE_HTTP2 error after around an hour. The workaround implemented here
is to close the connection and open a new one. The issue appears similar
to curl/curl#5389 and
curl/curl#5643. The fix
curl/curl@ef86daf
didn't solve the issue (version 7.74.0).