Compressed, chunked response inhibits connection reuse #549

Closed
OrIdow6 opened this issue Oct 17, 2022 · 9 comments

OrIdow6 commented Oct 17, 2022

use std::io;
use ureq::Agent;

fn main() {
    env_logger::init();
    let agent: Agent = ureq::AgentBuilder::new().build();
    for _ in 0..2 {
        let resp = agent.get("https://www.jammersreviews.com/st-tos/s1/dagger.php").call().unwrap();
        println!("Connection: {}", resp.header("connection").unwrap());
        std::io::copy(&mut resp.into_reader(), &mut io::sink()).unwrap();
    }
}

results in (RUST_LOG=trace)

[2022-10-17T01:02:37Z DEBUG ureq::stream] connecting to www.jammersreviews.com:443 at 172.67.140.111:443
[2022-10-17T01:02:37Z DEBUG rustls::client::hs] No cached session for DnsName(DnsName(DnsName("www.jammersreviews.com")))
[...]
[2022-10-17T01:02:37Z DEBUG ureq::stream] created stream: Stream(RustlsStream)
[2022-10-17T01:02:37Z DEBUG ureq::unit] sending request GET https://www.jammersreviews.com/st-tos/s1/dagger.php
[2022-10-17T01:02:37Z DEBUG ureq::unit] writing prelude: GET /st-tos/s1/dagger.php HTTP/1.1
    Host: www.jammersreviews.com
    User-Agent: ureq/2.5.0
    Accept: */*
    accept-encoding: gzip
[2022-10-17T01:02:37Z DEBUG rustls::client::tls13] Ticket saved
[2022-10-17T01:02:37Z DEBUG rustls::client::tls13] Ticket saved
[2022-10-17T01:02:37Z DEBUG ureq::response] Chunked body in response
[2022-10-17T01:02:37Z DEBUG ureq::unit] response 200 to GET https://www.jammersreviews.com/st-tos/s1/dagger.php
Connection: keep-alive
[2022-10-17T01:02:37Z DEBUG ureq::stream] dropping stream: Stream(RustlsStream)
[2022-10-17T01:02:37Z DEBUG ureq::stream] connecting to www.jammersreviews.com:443 at 172.67.140.111:443
[2022-10-17T01:02:37Z DEBUG rustls::client::hs] Resuming session
[...]
[2022-10-17T01:02:37Z DEBUG ureq::stream] created stream: Stream(RustlsStream)
[2022-10-17T01:02:37Z DEBUG ureq::unit] sending request GET https://www.jammersreviews.com/st-tos/s1/dagger.php
[2022-10-17T01:02:37Z DEBUG ureq::unit] writing prelude: GET /st-tos/s1/dagger.php HTTP/1.1
    Host: www.jammersreviews.com
    User-Agent: ureq/2.5.0
    Accept: */*
    accept-encoding: gzip
[2022-10-17T01:02:37Z DEBUG rustls::client::tls13] Ticket saved
[2022-10-17T01:02:37Z DEBUG rustls::client::tls13] Ticket saved
[2022-10-17T01:02:37Z DEBUG ureq::response] Chunked body in response
[2022-10-17T01:02:37Z DEBUG ureq::unit] response 200 to GET https://www.jammersreviews.com/st-tos/s1/dagger.php
Connection: keep-alive
[2022-10-17T01:02:37Z DEBUG ureq::stream] dropping stream: Stream(RustlsStream)

As can be seen, even though TLS does session resumption, the TCP stream gets closed and reopened between requests. I believe this is because the GzDecoder knows where the gzip stream ends and never tries to read beyond it, so it never makes the read that would cause the ChunkDecoder to return Ok(0), which is what the PoolReturnRead looks for to know when to return the Stream to its pool. I was able to get it to work properly by using a MultiGzDecoder instead of a plain GzDecoder, at least in my private, heavily modified version of ureq. But I don't think any such thing is available for Brotli.
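
To illustrate the pattern being described here, a minimal sketch of a repool-on-EOF wrapper follows. The name mirrors ureq's internal PoolReturnRead, but the code is illustrative only, not ureq's actual implementation, and repool() is a hypothetical hook.

use std::io::{self, Read};

// Illustrative stand-in for ureq's PoolReturnRead: the inner stream is
// only handed back for reuse once a read observes 0 bytes (EOF).
struct PoolReturnRead<R: Read> {
    inner: Option<R>,
}

impl<R: Read> PoolReturnRead<R> {
    fn repool(&mut self) {
        // In ureq this would return the connection to the agent's pool;
        // here we simply drop our handle.
        self.inner.take();
    }
}

impl<R: Read> Read for PoolReturnRead<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = match self.inner.as_mut() {
            Some(inner) => inner.read(buf)?,
            None => return Ok(0),
        };
        if n == 0 {
            // Only a read that reaches the true end of the chunked body gets
            // here. A gzip decoder that stops exactly at the end of the
            // compressed data never issues that final read, so repool() is
            // never called.
            self.repool();
        }
        Ok(n)
    }
}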

algesten (Owner) commented

@OrIdow6 Thanks for reporting this! I've written a test case in #550 illustrating the problem. I don't have a solution yet, but it probably means the Compression wrapper around the PoolReturnRead needs to signal when it reaches the end.


jsha commented Oct 19, 2022

@OrIdow6 also thanks for taking the time to write such a detailed report. I suspect the MultiGzDecoder only coincidentally fixes this because it will try additional reads after the first gzip stream is done.

In terms of how to fix this: the main problem is that the chunked encoding may have a few more bytes that it hasn't read, specifically the 0\r\n\r\n that terminates the response. It's not enough for the GzDecoder to say it's done; if there are any unread bytes left in the response stream, we can't return it to the pool.

We also don't want to do any syscall-invoking reads in a Drop impl, since that could produce surprising hangs.

Proposal: fork chunked_transfer and make it aware of BufRead. When a read is finished, peek at the next 5 bytes in the buffer (if available). If those bytes are 0\r\n\r\n consume them and set an internal piece of state saying "this stream is done." Expose that state with a done() method that can be forwarded by ureq's internal Done trait. Implement Done for chunked_transfer::Decoder. I think we don't actually need to touch the Compression layer at all, since all we care about is "did we read all the bytes from the chunked encoding yet."

Note this approach is not perfect: it's possible we could land right on a chunk boundary, with no bytes left in the buffer, but if we read more from the network, the next bytes we would read would be 0\r\n\r\n. In that case we would fail to recognize we were almost at the end, and would therefore fail to pool the connection. That's probably okay.

Another approach, which would be more perfect: if chunked_transfer::Decoder receives a read that ends on a chunk boundary, and there are no more bytes left in the buffer, then it should try to fill the buffer and peek at the next five bytes before returning.
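
A rough sketch of that peek, using std's BufReader as a stand-in for a BufRead-aware chunked decoder (this is not the chunked_transfer API, and try_consume_terminator is a made-up name):

use std::io::{BufRead, BufReader, Read};

// After finishing a chunk, inspect whatever is already buffered without
// doing any extra I/O. If the terminating 0\r\n\r\n is sitting there,
// consume it and report that the stream is done (safe to repool).
fn try_consume_terminator<R: Read>(src: &mut BufReader<R>) -> bool {
    // BufReader::buffer() exposes the buffered bytes without triggering a
    // read, matching the "if available" requirement above.
    if src.buffer().starts_with(b"0\r\n\r\n") {
        src.consume(5);
        true
    } else {
        false
    }
}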

algesten (Owner) commented

@jsha I pushed a commit 8f06cf8 which potentially fixes the problem without forking anything.

Might be something I haven't thought about, but I think it should be ok.


jsha commented Oct 19, 2022

The approach in #550 has a problem: it assumes that the end user reads the decompressed body until it gets an Ok(0) from the GzDecoder. It's quite possible to have another length-aware format inside the gzip stream, and to be using a reader that only reads the exact number of bytes. In that case, the if n == 0 { clause for CompressionRead would not trigger and the remaining 0\r\n\r\n would not be consumed.
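
As a concrete, made-up example of such a consumer: the caller knows the decompressed body is exactly 1024 bytes and uses read_exact, so it never performs the extra read that would return Ok(0).

use std::io::{self, Read};

// Hypothetical caller: it knows the exact decompressed length up front,
// so it never drives the decompressor to its end-of-stream Ok(0) read.
fn read_fixed_record<R: Read>(decompressed_body: &mut R) -> io::Result<[u8; 1024]> {
    let mut record = [0u8; 1024];
    decompressed_body.read_exact(&mut record)?;
    // The caller stops here, so any "if n == 0" logic in the compression
    // layer never runs and the trailing 0\r\n\r\n is never consumed.
    Ok(record)
}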

Also, the #550 approach only fixes things for the compression use case. The same problem exists for any length-aware format even if it's not compressed. Solving this at the chunked_transfer layer solves things more generally, and is also a better use of abstraction. chunked_transfer is the component that knows when an HTTP message is done.

An interesting thing to note in both cases: our current implementation of automatic decompression will not raise an error if the server includes some extra junk after the compressed stream ends.

I don't think it's a big deal to fork chunked_transfer for this purpose; it's something we've considered before. And I bet we could get the improvements upstreamed, too!


algesten commented Oct 20, 2022

It's quite possible to have another length-aware format inside the gzip stream, and to be using a reader that only reads the exact number of bytes. In that case, the if n == 0 { clause for CompressionRead would not trigger

Sure. But that's how repooling currently works: it repools if the user reads to the end.

A user might always have some reason to stop reading early, like having another length-aware format inside (that's orthogonal to chunked/gzip).

The contract is: read to end and ureq repools.

For this bug, the problem is that the user really does read to the end, yet ureq breaks the contract.


jsha commented Oct 20, 2022

I did some more thinking and reading about this. What should we do if there is extra garbage after a gzip stream ends? I tested and both curl and Go's net/http generate an error.

According to https://datatracker.ietf.org/doc/html/rfc9112#section-12.3, gzip is defined in RFC 1952. In section 2.2:

    A gzip file consists of a series of "members" (compressed data
    sets).  The format of each member is specified in the following
    section.  The members simply appear one after another in the file,
    with no additional information before, between, or after them.

The GzDecoder docs don't really make it clear, but that decoder only decodes a single member. The gzip specification (and therefore HTTP) specifies a sequence of members; that's what MultiGzDecoder implements. In other words, MultiGzDecoder is what you want if you actually want to read gzip response bodies, even though its documentation makes it sound like something esoteric.

That plays nicely with our intuition that extra garbage after a gzip member should cause an error. Once you've read a full member, the next thing you read must be either EOF or another fully-valid gzip member. So I take back what I said earlier: it's not really a coincidence that MultiGzDecoder repools, because it will keep reading to EOF.

So, I propose we switch to MultiGzDecoder, which I think should solve this problem in a simple way.
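
To make the GzDecoder/MultiGzDecoder difference concrete, here is a small standalone demonstration using flate2 directly (outside ureq): a body made of two concatenated gzip members is only fully decoded by MultiGzDecoder.

use std::io::{Read, Write};

use flate2::read::{GzDecoder, MultiGzDecoder};
use flate2::write::GzEncoder;
use flate2::Compression;

fn main() -> std::io::Result<()> {
    // Two concatenated gzip members; per RFC 1952 this is still one valid
    // gzip stream.
    let mut body = Vec::new();
    for part in [&b"hello "[..], &b"world"[..]] {
        let mut enc = GzEncoder::new(Vec::new(), Compression::default());
        enc.write_all(part)?;
        body.extend(enc.finish()?);
    }

    // GzDecoder stops after the first member...
    let mut single = String::new();
    GzDecoder::new(&body[..]).read_to_string(&mut single)?;
    assert_eq!(single, "hello ");

    // ...while MultiGzDecoder keeps decoding members until EOF.
    let mut multi = String::new();
    MultiGzDecoder::new(&body[..]).read_to_string(&mut multi)?;
    assert_eq!(multi, "hello world");

    Ok(())
}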


jsha commented Oct 21, 2022

By the way, Brotli doesn't have the same convenient property, where streams happen to be terminated by end-of-file. AFAICT the Brotli decoder behaves similarly to the GzDecoder: it only consumes the compressed part of its input, possibly leaving some extra garbage at the end. So while MultiGzDecoder neatly solves this problem for gzip, we still need to solve it for Brotli.

I think we should error if there is extra garbage. One way to do that: wrap BrotliDecoder in a PreciseReader. The PreciseReader, after getting an Ok(0) from the BrotliDecoder, calls into_inner() on it and does a one-byte read on the underlying reader. If that one-byte read produces anything other than Ok(0), that means there's extra garbage and PreciseReader should return an error (which leads to not pooling the connection).

Some reasons to error on extra garbage:

  • Would prefer not to silently ignore bad input
  • It avoids an obligation to read arbitrarily many bytes during a single read call
  • curl does it :-D
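
A rough sketch of the PreciseReader idea above. PreciseReader is not an existing type, and the probe closure is a made-up way of reaching the underlying reader (a flate2-style decoder could supply it via get_mut(); a Brotli decoder would need an equivalent hook).

use std::io::{self, Read};

// Sketch only: once the wrapped decoder reports EOF, probe the raw reader
// for one more byte; anything other than EOF means trailing garbage.
struct PreciseReader<D, F> {
    decoder: D,
    probe: F, // given &mut D, does a one-byte read on the underlying reader
    checked: bool,
}

impl<D, F> PreciseReader<D, F> {
    fn new(decoder: D, probe: F) -> Self {
        PreciseReader { decoder, probe, checked: false }
    }
}

impl<D, F> Read for PreciseReader<D, F>
where
    D: Read,
    F: FnMut(&mut D) -> io::Result<usize>,
{
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.decoder.read(buf)?;
        if n == 0 && !self.checked {
            self.checked = true;
            if (self.probe)(&mut self.decoder)? != 0 {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "extra data after compressed body",
                ));
            }
        }
        Ok(n)
    }
}

For example, with a flate2-style decoder the probe could be |d| d.get_mut().read(&mut [0u8; 1]); returning an error from read here also means the connection will not be pooled.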

algesten (Owner) commented

@jsha great thoughts!

I've pushed a revision of #550 that simplifies the impl somewhat and throws an error if we read anything but 0.


jsha commented Dec 7, 2022

This is fixed (for the gzip case) by switching to MultiGzDecoder in #560! 🎉

For the Brotli case, this will require a little more work. See #559; we'll track the work over there.

jsha closed this as completed Dec 7, 2022