Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Error in the HTTP2 framing layer" after 1000 requests #3750

Closed
TvdW opened this issue Apr 8, 2019 · 15 comments

Comments

@TvdW
Copy link

commented Apr 8, 2019

I did this

This started out as doing several thousand requests to some company's API, and getting occasional errors stating "Error in the HTTP2 framing layer". It was found out that this always happened with stream id 7d1 (2001), also known as request 1001. This lead me to http2_max_requests in nginx. A few hours of minimizing the test case later, I've found two ways to reproduce this bug, generating different errors for the same task, when they should both just work fine.

Method one

The sleep server from my previous bug report returns :) nginx is running on the server with http2_max_requests set to 10, for easier debugging. The sleep server just sleeps the given amount of milliseconds, and the issue seems to become reproducible around 5ms, but consistently so around 100ms.

curl -v $(yes https://home.xifon.eu/sleep/100 | head -n 11) -d0 -m5

Running the command above will execute 11 requests (one more than the limit). The last request times out, but there's no reason for it to.

Method two

This uses the multi socket api, and produces the "Error in the HTTP2 framing layer" error message: https://gist.github.com/TvdW/a8bf6eabc99bf812d2fd0be1d32f9313

It's a crude eventloop implementation doing the bare minimum for making the multi socket API work. This is the last version of my code before I managed to reproduce the bug using the curl command line utility. What's notable about it is that it reproduces the issue, but with a different error message...

curl/libcurl version

curl 7.64.1 (x86_64-pc-linux-gnu) libcurl/7.64.1 OpenSSL/1.0.2k zlib/1.2.7 brotli/1.0.1 c-ares/1.15.0 nghttp2/1.36.0
Release-Date: 2019-03-27
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp smb smbs smtp smtps telnet tftp
Features: AsynchDNS brotli HTTP2 HTTPS-proxy IPv6 Largefile libz NTLM NTLM_WB SSL UnixSockets

operating system

Linux, CentOS 7

thoughts

Some things stand out:

  • this is only reproducible for requests with bodies, and only if they are not zero-length
  • timing seems to matter. setting the 'time' component of the sleep server to 0ms makes the final request succeed quite often (maybe half the time), but when set to 100ms it seems to fail in 95% of cases or more
  • the two methods produce very different error messages
@bagder bagder added the HTTP/2 label Apr 8, 2019
@bagder

This comment has been minimized.

Copy link
Member

commented Apr 8, 2019

Tricky. I tried both of your versions and I've not made any of them error for me, in spite of me having run both versions tens of times. I've tried both my 7.65.0-DEV version and my 7.64.0 system install!

@kdudka

This comment has been minimized.

Copy link
Collaborator

commented Apr 8, 2019

We had a similar bug report in Fedora: https://bugzilla.redhat.com/1690971 ... but it was resolved (or worked around) server-side.

@TvdW

This comment has been minimized.

Copy link
Author

commented Apr 8, 2019

I realized I forgot to add to the report that this is happening behind a HTTP proxy 😞. It didn't seem too relevant at the time of writing, but I'm now increasingly convinced that the proxy is changing some behavior on the TCP level, making the issue much more likely to occur. Yay for automatic configuration from environment variables 😄

The last two requests look like this:

* Uses proxy env variable https_proxy == 'http://proxy:3128/'
* Found bundle for host home.xifon.eu: 0xe7e440 [can multiplex]
* Re-using existing connection! (#0) with proxy proxy
* Connected to proxy (1.2.3.4) port 3128 (#0)
* Using Stream ID: 13 (easy handle 0xe67480)
> POST /sleep/100 HTTP/2
> Host: home.xifon.eu
> User-Agent: curl/7.64.1
> Accept: */*
> Content-Length: 1
> Content-Type: application/x-www-form-urlencoded
>
* We are completely uploaded and fine
< HTTP/2 200
< server: nginx/1.13.7
< date: Mon, 08 Apr 2019 21:32:51 GMT
< strict-transport-security: max-age=15768000
<
* TLSv1.2 (IN), TLS alert, close notify (256):
* Connection #0 to host proxy left intact
* Uses proxy env variable https_proxy == 'http://proxy:3128/'
* Found bundle for host home.xifon.eu: 0xe7e440 [can multiplex]
* Re-using existing connection! (#0) with proxy proxy
* Connected to proxy (1.2.3.4) port 3128 (#0)
* Using Stream ID: 15 (easy handle 0xe67480)
> POST /sleep/100 HTTP/2
> Host: home.xifon.eu
> User-Agent: curl/7.64.1
> Accept: */*
> Content-Length: 1
> Content-Type: application/x-www-form-urlencoded
>
* Operation timed out after 4999 milliseconds with 0 bytes received
* stopped the pause stream!
* Connection #0 to host proxy left intact
curl: (28) Operation timed out after 4999 milliseconds with 0 bytes received
* Closing connection 0

Those are streams 0x13 (19) and 0x15 (21). The second one is supposed to fail (or not happen at all), because http2_max_requests is 10.

What stands out most are these:

  • TLSv1.2 (IN), TLS alert, close notify (256):
  • Connection #0 to host proxy left intact

In cases where it works properly, I see these:

  • Closing connection 0
  • TLSv1.2 (OUT), TLS alert, close notify (256):

I suspect there's an issue here around how the tls alerts are handled. But, it sounds like that may be really hard to reproduce...

[edit] I'm able to reproduce this on the latest 'alpine' docker image, by running 'apk add curl' (adds 7.64.0), and then running the relevant command. Only when setting the https_proxy environment variable, though. So there's definitely some proxy magic going on on the TCP level.

@jay

This comment has been minimized.

Copy link
Member

commented Apr 9, 2019

Is this a race condition with RST because if so I think we're pretty much SOL. Is this related: https://trac.nginx.org/nginx/ticket/1250

@TvdW

This comment has been minimized.

Copy link
Author

commented Apr 9, 2019

Is this a race condition with RST because if so I think we're pretty much SOL. Is this related: https://trac.nginx.org/nginx/ticket/1250

It could well be a race condition, but I don't think we're SOL just yet :) curl receives the TLS close notify before sending the next request, so it could know better than to send it out. There should also be a GOAWAY frame indicating the last accepted stream ID, and curl should use that frame to decide whether to retry the request on a new connection (but it's not logged in curl's verbose output...).

@TvdW

This comment has been minimized.

Copy link
Author

commented Apr 9, 2019

I'm able to cause a curl segfault if I do concurrent h2 requests over a connection that's going away. Maybe this somehow relates to the original issue as well, so instead of creating a new report I'll add the code here: https://gist.github.com/TvdW/3bb1415313e259b79522f9128525dbf2

The code essentially does 1000 requests to a h2 server, but since http2_max_requests is set much lower it needs to recycle connections long before that, and segfaults while doing so.

@TvdW

This comment has been minimized.

Copy link
Author

commented Apr 9, 2019

(Just double-checked I can actually reproduce the segfault on a different machine in a completely different environment, eg. no proxy, etc. I can. Now let's hope the two issues are related and this should be an easy bug fix 👍 )

@bagder

This comment has been minimized.

Copy link
Member

commented Apr 10, 2019

@TvdW, I can reproduce the crash with 7.64.1 using your example but for some reason it doesn't happen with git master. Can you check if that is the same for you or just me having something weird?

it's not logged in curl's verbose output

If you use configure --enable-debug you'll see the GOAWAY in the verbose output when the connection is closed because of it. Perhaps we should add a "normal" infof() call for that...

@TvdW

This comment has been minimized.

Copy link
Author

commented Apr 10, 2019

@bagder Yeah, I can confirm the segfault is no longer reproducible on git master. git-bisect traces it to 2f44e94... 😃

Sadly the original bug still exists, so they were independent problems...

BUT, with --enable-debug, a lot more information is revealed!

* Connection #0 to host proxy left intact
* Expire cleared (transfer 0xa4f1b8)
* STATE: INIT => CONNECT handle 0xa4f1b8; line 1348 (connection #-5000)
* Uses proxy env variable https_proxy == 'http://proxy:3128/'
* Found bundle for host home.xifon.eu: 0xa660f8 [can multiplex]
* Re-using existing connection! (#0) with proxy proxy
* Connected to proxy (1.2.3.4) port 3128 (#0)
* STATE: CONNECT => WAITDO handle 0xa4f1b8; line 1392 (connection #0)
* Using Stream ID: 13 (easy handle 0xa4f1b8)
> POST /sleep/100 HTTP/2
> Host: home.xifon.eu
> User-Agent: curl/7.65.0
> Accept: */*
> Content-Length: 1
> Content-Type: application/x-www-form-urlencoded
> 
* STATE: WAITDO => DO_MORE handle 0xa4f1b8; line 1606 (connection #0)
* STATE: DO_MORE => DO_DONE handle 0xa4f1b8; line 1728 (connection #0)
* We are completely uploaded and fine
* HTTP/2 found, allow multiplexing
< HTTP/2 200 
< server: nginx/1.13.7
< date: Wed, 10 Apr 2019 22:29:55 GMT
< strict-transport-security: max-age=15768000
< 
* TLSv1.2 (IN), TLS alert, close notify (256):
* nread <= 0, server closed connection, bailing
* STATE: DO_DONE => PERFORM handle 0xa4f1b8; line 1916 (connection #0)
* multi_done
* Connection #0 to host proxy left intact
* Expire cleared (transfer 0xa4f1b8)
* STATE: INIT => CONNECT handle 0xa4f1b8; line 1348 (connection #-5000)
* Uses proxy env variable https_proxy == 'http://proxy:3128/'
* Found bundle for host home.xifon.eu: 0xa660f8 [can multiplex]
* Re-using existing connection! (#0) with proxy proxy
* Connected to proxy (1.2.3.4) port 3128 (#0)
* STATE: CONNECT => WAITDO handle 0xa4f1b8; line 1392 (connection #0)
* Using Stream ID: 15 (easy handle 0xa4f1b8)
> POST /sleep/100 HTTP/2
> Host: home.xifon.eu
> User-Agent: curl/7.65.0
> Accept: */*
> Content-Length: 1
> Content-Type: application/x-www-form-urlencoded
> 
* STATE: WAITDO => DO_MORE handle 0xa4f1b8; line 1606 (connection #0)
* STATE: DO_MORE => DO_DONE handle 0xa4f1b8; line 1728 (connection #0)
* Curl_readwrite: forcibly told to drain data
* Curl_readwrite: forcibly told to drain data
* Curl_readwrite: forcibly told to drain data
* Curl_readwrite: forcibly told to drain data
* Curl_readwrite: forcibly told to drain data

The "forcibly told to drain data" then goes on for many thousands of lines before I can ^C it.

Maybe this reveals enough to be able to reproduce it?

@bagder

This comment has been minimized.

Copy link
Member

commented Apr 11, 2019

This piece from the log above seems to indicate that the server closes the connection but yet curl reuses it for the subsequent request and then it goes totally bananas:

* TLSv1.2 (IN), TLS alert, close notify (256):
* nread <= 0, server closed connection, bailing
* STATE: DO_DONE => PERFORM handle 0xa4f1b8; line 1916 (connection #0)
* Connection #0 to host proxy left intact
* STATE: INIT => CONNECT handle 0xa4f1b8; line 1348 (connection #-5000)
* Uses proxy env variable https_proxy == 'http://proxy:3128/'
* Found bundle for host home.xifon.eu: 0xa660f8 [can multiplex]

Can you try and see if a patch like this below makes anything better?

diff --git a/lib/transfer.c b/lib/transfer.c
index a451fb74e..1c3734f41 100644
--- a/lib/transfer.c
+++ b/lib/transfer.c
@@ -622,10 +622,15 @@ static CURLcode readwrite_data(struct Curl_easy *data,
     else {
       /* if we receive 0 or less here, the server closed the connection
          and we bail out from this! */
       DEBUGF(infof(data, "nread <= 0, server closed connection, bailing\n"));
       k->keepon &= ~KEEP_RECV;
+      if(!(conn->handler->flags & PROTOPT_DUAL))
+        /* Dirty, but avoids this for FTP since we get this for the data
+           transfer and it is then not a signal to kill the control connection
+           too! */
+        streamclose(conn, "Server-initiated connection close");
       break;
     }
 
     /* Default buffer to use when we write the buffer, it may be changed
        in the flow below before the actual storing is done. */
@TvdW

This comment has been minimized.

Copy link
Author

commented Apr 11, 2019

No joy.

* nread <= 0, server closed connection, bailing
* Kill stream: Server-initiated connection close
* STATE: DO_DONE => PERFORM handle 0x22941b8; line 1916 (connection #0)
* multi_done
* Connection #0 to host proxy left intact
* Expire cleared (transfer 0x22941b8)
* STATE: INIT => CONNECT handle 0x22941b8; line 1348 (connection #-5000)
* Uses proxy env variable https_proxy == 'http://proxy:3128/'
* Found bundle for host home.xifon.eu: 0x22ab0f8 [can multiplex]
* Re-using existing connection! (#0) with proxy proxy
* Connected to proxy (1.2.3.4) port 3128 (#0)
* STATE: CONNECT => WAITDO handle 0x22941b8; line 1392 (connection #0)
* Using Stream ID: 15 (easy handle 0x22941b8)
> POST /sleep/100 HTTP/2
> Host: home.xifon.eu
> User-Agent: curl/7.65.0
> Accept: */*
> Content-Length: 1
> Content-Type: application/x-www-form-urlencoded
> 
* STATE: WAITDO => DO_MORE handle 0x22941b8; line 1606 (connection #0)
* STATE: DO_MORE => DO_DONE handle 0x22941b8; line 1728 (connection #0)
* Curl_readwrite: forcibly told to drain data
* Curl_readwrite: forcibly told to drain data
* Curl_readwrite: forcibly told to drain data
* Curl_readwrite: forcibly told to drain data
* Curl_readwrite: forcibly told to drain data
@bagder

This comment has been minimized.

Copy link
Member

commented Apr 11, 2019

Try connclose instead of streamclose in the patch. The connection is closed there, the entire connection should be marked as for not reuse and not just a stream within it. It is interesting that no GOAWAY is logged before the TLS close notify. (This problem still doesn't reproduce for me.)

@TvdW

This comment has been minimized.

Copy link
Author

commented Apr 11, 2019

@bagder Switching to connclose makes it execute all requests successfully, but... only one request per h2 connection is executed. 😃

The "nread <= 0, server closed connection, bailing" message is shown after every request, so that may explain why.

In the case where curl does the right thing, this is shown:

> Content-Type: application/x-www-form-urlencoded                                                                                                                                                                                             
>                                                                                                                                                                                                                                             
* STATE: WAITDO => DO_MORE handle 0xd8f1b8; line 1606 (connection #0)                                                                                                                                                                         
* STATE: DO_MORE => DO_DONE handle 0xd8f1b8; line 1728 (connection #0)                                                                                                                                                                        
* We are completely uploaded and fine                                                                                                                                                                                                         
* HTTP/2 found, allow multiplexing                                                                                                                                                                                                            
< HTTP/2 200                                                                                                                                                                                                                                  
< server: nginx/1.13.7                                                                                                                                                                                                                        
< date: Thu, 11 Apr 2019 13:41:08 GMT                                                                                                                                                                                                         
< strict-transport-security: max-age=15768000                                                                                                                                                                                                 
<                                                                                                                                                                                                                                             
* Marked for [closure]: GOAWAY received                                                                                                                                                                                                       
* nread <= 0, server closed connection, bailing                                                                                                                                                                                               
* STATE: DO_DONE => PERFORM handle 0xd8f1b8; line 1916 (connection #0)                                                                                                                                                                        
* multi_done                                                                                                                                                                                                                                  
* Closing connection 0                                                                                                                                                                                                                        
* TLSv1.2 (OUT), TLS alert, close notify (256):                                                                                                                                                                                               
@bagder

This comment has been minimized.

Copy link
Member

commented Apr 11, 2019

Thanks, yes that's not the correct fix for this issue. I just wanted to verify that marking the connection as closed would at least avoid the following situation.

So, let's drop that patch and instead see if we can act on the close_notify hint instead - when there's no GOAWAY. If I understand things correctly, OpenSSL returns an SSL_ERROR_ZERO_RETURN error for this situation. This has the unfortunate downside that this patch is OpenSSL specific:

diff --git a/lib/vtls/openssl.c b/lib/vtls/openssl.c
index eff5c2106..5d2aac7d3 100644
--- a/lib/vtls/openssl.c
+++ b/lib/vtls/openssl.c
@@ -3754,11 +3754,14 @@ static ssize_t ossl_recv(struct connectdata *conn, /* connection data */
     /* failed SSL_read */
     int err = SSL_get_error(BACKEND->handle, (int)nread);
 
     switch(err) {
     case SSL_ERROR_NONE: /* this is not an error */
+      break;
     case SSL_ERROR_ZERO_RETURN: /* no more data */
+      /* close_notify alert */
+      connclose(conn, "TLS close_notify");
       break;
     case SSL_ERROR_WANT_READ:
     case SSL_ERROR_WANT_WRITE:
       /* there's data pending, re-invoke SSL_read() */
       *curlcode = CURLE_AGAIN;
@TvdW

This comment has been minimized.

Copy link
Author

commented Apr 11, 2019

Works! 😃

< server: nginx/1.13.7                                                                                                                                                                                                                        
< date: Thu, 11 Apr 2019 15:10:16 GMT                                                                                                                                                                                                         
< strict-transport-security: max-age=15768000                                                                                                                                                                                                 
<                                                                                                                                                                                                                                             
* TLSv1.2 (IN), TLS alert, close notify (256):                                                                                                                                                                                                
* Marked for [closure]: TLS close_notify                                                                                                                                                                                                      
* nread <= 0, server closed connection, bailing                                                                                                                                                                                               
* STATE: DO_DONE => PERFORM handle 0x23151b8; line 1916 (connection #0)                                                                                                                                                                       
* multi_done                                                                                                                                                                                                                                  
* Closing connection 0                                                                                                                                                                                                                        
* TLSv1.2 (OUT), TLS alert, close notify (256):                                                                                                                                                                                               
* The cache now contains 0 members                                                                                                                                                                                                            
bagder added a commit that referenced this issue Apr 11, 2019
Without this, detecting and avoid reusing a closed TLS connection
(without a previous GOAWAY) when doing HTTP/2 is tricky.

Reported-by: Tom van der Woerdt
Fixes #3750
@bagder bagder closed this in 3f5da4e Apr 12, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jul 11, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
4 participants
You can’t perform that action at this time.