Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mimic: rgw: use chunked encoding to get partial results out faster #28014

merged 1 commit into from May 10, 2019


None yet
2 participants
Copy link

commented May 7, 2019

rgw: use chunked encoding to get partial results out faster
Some operations can take a long time to have their complete result.

If a RGW operation does not set a content-length header, the RGW
frontends (CivetWeb, Beast) buffer the entire request so that a
Content-Length header can be sent.

If the RGW operation takes long enough, the buffering time may exceed
keepalive values, and because no bytes have been sent in the connection,
the connection will be reset.

If a HTTP response header contains neither Content-Length or chunked
Transfer-Encoding, HTTP keep-alive is not possible.

To fix the issue within these requirements, use chunked
Transfer-Encoding for the following operations:

* RGWCopyObj_ObjStore_S3 **
* RGWDeleteMultiObj_ObjStore_S3 **
* RGWGetUsage_ObjStore_S3
* RGWListBucketMultiparts_ObjStore_S3
* RGWListBucket_ObjStore_S3
* RGWListBuckets_ObjStore_S3
* RGWListMultipart_ObjStore_S3

RGWCopyObj & RGWDeleteMultiObj specifically use send_partial_response
for long-running operations, and are the most impacted by this issue,
esp. for large inputs. RGWCopyObj attempts to send a Progress header
during the copy, but it's not actually passed on to the client until the
end of the copy, because it's buffered by the RGW frontends!

The HTTP/1.1 specification REQUIRES chunked encoding to be supported,
and the specification does NOT require "chunked" to be included in the
"TE" request header.

This patch has one side-effect: this causes many more small IP packets.
When combined with high-latency links this can increase the apparent
deletion time due to round trips and TCP slow start. Future improvements
to the RGW frontends are possible in two seperate but related ways:
- The FE could continue to push more chunks without waiting for the ACK
  on the previous chunk, esp. while under the TCP window size.
- The FE could be patched for different buffer flushing behaviors, as
  that behavior is presently unclear (packets of 200-500 bytes seen).

Performance results:
- Bucket with 5M objects, index sharded 32 ways.
- Index on SSD 3x replicas, Data on spinning disk, 5:2
- Multi-delete of 1000 keys, with a common prefix.
- Cache of index primed by listing the common prefix immediately before
- Timing data captured at the RGW.
- Timing t0 is the TCP ACK sent by the RGW at the end of the response
- Client is ~75ms away from RGW.
Time to first byte of response header: 11.3 seconds.
Entire operation: 11.5 seconds.
Response packets: 17
Time to first byte of response header: 3.5ms
Entire operation: 16.36 seconds
Response packets: 206

Backport: mimic, luminous
Signed-off-by: Robin H. Johnson <>
(cherry picked from commit d22c1f9)

This comment has been minimized.

Copy link

commented May 8, 2019


yuriw approved these changes May 10, 2019

@yuriw yuriw merged commit 0746c13 into ceph:mimic May 10, 2019

4 checks passed

Docs: build check OK - docs built
Signed-off-by all commits in this PR are signed
Unmodified Submodules submodules for project are unmodified
make check make check succeeded

@cbodley cbodley deleted the cbodley:wip-39614 branch May 10, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.