Trafficserver holds chunked response for backend down case #7880
Comments
I'm working on reproducing this but haven't been able to yet. We have a negative-revalidating test which I've modified like so:
That's a modification of this test: Note I modified the server to not contain a content-length header in its response and I increased the body size to be greater than 2048 (3200 bytes). I then ran the test. The first three transactions do this:
Looking at the proxy verifier output, I see the following:
The line with the chunk body is truncated to the width of my terminal, so it doesn't contain all 3200 characters. But note that it does have the zero-length terminating chunk in this case. I repeated the test by hand, this time taking down the server so that it is not reachable rather than having it reply with a 503. With the server down and the resource's age greater than Cache-Control's max-age but less than max_stale_age, I noticed that ATS replies with a Content-Length of 3200 rather than chunked encoding. Anyway, that's what I've tried so far. I'll keep thinking about what else to try to match @lukenowak's setup. @lukenowak: if, reading the above, you notice any configuration or behavior I might be missing to reproduce this, please let me know. Thanks. |
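For readers following along, the window being tested here (older than max-age, younger than max_stale_age, origin failing) can be modelled roughly as below. This is a simplified sketch and not ATS's actual logic: the function name, parameters, and return labels are all hypothetical, and treating an unreachable origin the same as a 5xx reply is an assumption made for illustration.

```python
from typing import Optional


def cache_decision(age_s: int, max_age_s: int, max_stale_age_s: int,
                   origin_status: Optional[int]) -> str:
    """Hypothetical model of the negative revalidating window.

    origin_status is the HTTP status from the origin, or None when the
    origin could not be reached at all (assumption: both count as failure).
    """
    if age_s <= max_age_s:
        # Still within Cache-Control max-age: no revalidation needed.
        return "fresh"
    origin_failed = origin_status is None or 500 <= origin_status <= 599
    if origin_failed and age_s <= max_stale_age_s:
        # Stale, but the origin is failing and the object is not too old.
        return "stale"
    # Otherwise the outcome of the origin request is what the client sees.
    return "origin"
```

The two scenarios in this thread (origin replies 503, origin unreachable) both land in the "stale" branch of this model; they differ only in how ATS frames the body it then sends.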
@lukenowak : can you please attach the two packet captures, filtered for these two transactions? |
Some more data points and details. In the original setup where we noticed this issue (which @lukenowak then reproduced in a smaller test scenario with less identifying data) there was another odd behaviour in the chunking: for an 8211-byte cached resource, both
So not only is the response's body lacking a final empty chunk, it is even lacking part of a chunk, and a further 3rd 0x13 bytes-long chunk, so 499 + 13 bytes are missing from the final decoded resource. This resource has Also, as I just described, in our setup the origin is not necessarily down from a network perspective: an alternative origin may be reached instead of the normal one, and it would only serve error pages with a EDIT: s/502/503/ |
Aha ! And if I actually stop origin (hence making ATS notice an immediate "connection refused"), then the issue does not manifest itself, and it is actually served in a single body (EDIT: by which I mean "not chunk-encoded") with a Content-Length visibly generated by ATS itself (there is no such header visible in an |
The packet captures. |
@vpelletier is right regarding the origin state: if it's down, the issue does not occur; it happens only when the origin returns a 5xx. |
@bneradt here are the full records.config, the configure options, and the way we start ATS. We use Debian 10 AMD64. I am ready to provide whatever you need to be able to reproduce it. |
Thank you, both. This is helpful feedback. It's good to know that a 5xx from the server is sufficient to repro this because writing a test where the server actually disappears is possible in AuTest but more complicated than me just configuring the test server to reply with a 503. I've now reproduced the issue on this debug branch: Running this new AuTest like so:
The Proxy Verifier client shows this output for the first, correct request:
For the second, during the negative revalidating period, it shows this:
Note that the chunk is cut off prematurely, just as described by @lukenowak and @vpelletier . Interestingly, I cannot reproduce this problem on master. Only when I cherry-picked these changes to 9.0.x, rebuilt, and ran the test there did I see this issue. |
@bneradt do you see this problem on 9.1? |
Recording this now so I don't forget: in an offline conversation with @bryancall, he points out that we should look into why ATS is replying with chunk encoded content in the first place for this scenario. When we serve from the cache as a fresh resource we do a content-length reply. As we are seeing with this issue, when we reply from the cache with a stale resource via the negative revalidating feature, we are serving chunk encoded. At least theoretically we should be able to reply with a content-length for the latter case as well as with the former. And we should reply with content-length bodies if we can because body communicated via content-length has parsing and preparatory advantages for the client. Thus there are two things that should be considered when implementing a fix for this issue:
|
I've verified that the following PR, when I locally cherry-pick it to 9.0.x in my dev environment, fixes this issue: |
This adds a negative revalidating test for a chunk encoded response. This functions as a regression test for apache#7880.
@lukenowak and @vpelletier: thank you for the detailed information. Using your data I added a regression test for this issue. It turns out that @shinrich has submitted a fix for this issue in #7577. I've requested backports for the fix to go into the next 8.1.x and 9.0.x releases. I'll leave this issue open until the following cherry-picks are merged in: |
Thanks @bneradt. Is there a chance that this fix will end up in 9.0.2? And do you have a due date for this release? I saw there are no dates on the milestones (I assume you'll release when it's ready). |
The fix just got merged into 9.0.x (see #7577), so it will be in 9.0.2. We are currently working on this release. It should be out within a couple months. |
Closing since Susan's fix (#7577) is now merged to 9.0.x and 8.1.x. |
This adds a negative revalidating test for a chunk encoded response. This functions as a regression test for apache#7880. (cherry picked from commit 50f2f40) (cherry picked from commit 28e2449)
We have a situation where the backend returns a response with a body larger than 2048 bytes and without Content-Length. This leads TrafficServer to pass it on with Transfer-Encoding: chunked. If the body size is <= 2048 bytes, TrafficServer adds Content-Length by itself and serves the content happily.
We are using TrafficServer 8.1.1, but we see the same behaviour with 9.0.1.
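To make the two framings concrete, here is a minimal Python sketch of how the same body looks on the wire with Content-Length versus chunked encoding. The function names and the chunk size are hypothetical and chosen only for illustration; this is not how TrafficServer builds its responses, and only the framing-related header is shown.

```python
def frame_with_content_length(body: bytes) -> bytes:
    # The receiver knows up front exactly how many body bytes to expect.
    return b"Content-Length: %d\r\n\r\n" % len(body) + body


def frame_chunked(body: bytes, chunk_size: int = 4096) -> bytes:
    # Each chunk is "<hex size>\r\n<data>\r\n"; the body ends with a
    # zero-length chunk followed by a final CRLF.
    out = bytearray(b"Transfer-Encoding: chunked\r\n\r\n")
    for i in range(0, len(body), chunk_size):
        piece = body[i:i + chunk_size]
        out += b"%x\r\n" % len(piece) + piece + b"\r\n"
    out += b"0\r\n\r\n"
    return bytes(out)
```

With chunked framing the client cannot tell the body is finished until it sees the zero-length chunk, which is why a missing terminator matters so much more than it would with Content-Length.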
Our configuration:
If the backend is up (returns 200) and negative_revalidating_enabled does not kick in, TrafficServer returns the full response (headers and body) immediately.
The backend up case communication between client and TrafficServer:
We also want to support the case when the backend is down. In that case, however, after returning the headers TrafficServer sends part of the body as chunked but does not finish it with \r\n0\r\n. This leaves the client “stuck” for the duration of proxy.config.http.keep_alive_no_activity_timeout_in, after which TrafficServer sends FIN/ACK to the client, finally releasing it. It seems the RFC is not followed.
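A small Python sketch of a chunked-body decoder shows how a client can tell a complete body from one that was cut off. It is a simplified illustration, not a full HTTP parser: the function name is hypothetical, trailer fields are not handled, and it works on an already-captured byte string rather than a live socket.

```python
from typing import Tuple


def decode_chunked(raw: bytes) -> Tuple[bytes, bool]:
    """Decode a chunked body; return (decoded_bytes, terminated_properly)."""
    body = bytearray()
    pos = 0
    while True:
        eol = raw.find(b"\r\n", pos)
        if eol == -1:
            # No further chunk-size line: the stream stopped early.
            return bytes(body), False
        # Ignore any chunk extensions after ';'.
        size_field = raw[pos:eol].split(b";", 1)[0].strip()
        try:
            size = int(size_field.decode("ascii"), 16)
        except ValueError:
            return bytes(body), False
        pos = eol + 2
        if size == 0:
            # Zero-length chunk: complete only if the final CRLF follows
            # (this sketch assumes there are no trailer fields).
            return bytes(body), raw[pos:pos + 2] == b"\r\n"
        chunk = raw[pos:pos + size]
        body += chunk
        if len(chunk) < size:
            # The chunk announced more bytes than were delivered.
            return bytes(body), False
        pos += size
        if raw[pos:pos + 2] != b"\r\n":
            return bytes(body), False
        pos += 2
```

On a live connection the "stopped early" branches correspond to the client waiting for more data, which matches the hang until the keep-alive timeout described above.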
The backend down case communication between client and TrafficServer:
Setting proxy.config.http.chunking_enabled to 0 fixes the issue, but we can't accept it as a workaround, as we rely on chunked encoding for many cases, especially for backend-up ones.