Reuse TCP connections when uploading files #1353

pquentin · 2019-10-04T11:50:23Z

Reuse TCP connections when uploading files

Description

It's easy to break connection reuse when using the requests API: just use `stream=True` and never read the response. The connection used to make the request will never be reused, and will be dropped when urllib3's connection pool is full.

It turns out that uploading objects using the S3 API goes through `prepared_request`, which incorrectly sets `stream` to the value of `raw`, which is `True` in our case. Since we don't read the response data, the connections are never reused, and each upload requires its own connection.

This is particularly wasteful when uploading many small objects, which can easily happen with JSON or Parquet files generated by Apache Spark, where setting up the connection takes significant time compared to uploading a few bytes.

Setting `stream=stream` in the `prepared_request` method matches the code in the `request` method and fixes the bug.
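The underlying constraint can be illustrated without requests or libcloud. The sketch below is not the code touched by this PR: it uses the standard library's `http.client` against a throwaway local server (the `Handler`, `QuietServer`, `can_reuse` and `demo` names are made up for the illustration) to show that a keep-alive connection cannot serve a second request until the first response body has been read. A connection pool faces the same situation when a streamed response is never consumed.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    # HTTP/1.1 enables keep-alive, so one connection can serve several requests.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = b"x" * 1024
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


class QuietServer(ThreadingHTTPServer):
    def handle_error(self, request, client_address):
        pass  # silence server-side errors, e.g. clients hanging up with unread data


def can_reuse(port, read_first_body):
    """Return True if a second request works on the same connection."""
    conn = http.client.HTTPConnection("127.0.0.1", port)
    try:
        conn.request("GET", "/")
        first = conn.getresponse()
        if read_first_body:
            first.read()  # draining the body frees the connection
        conn.request("GET", "/")
        try:
            conn.getresponse().read()
            return True
        except http.client.ResponseNotReady:
            return False  # the unread first response still owns the socket
    finally:
        conn.close()


def demo():
    """Run both cases against a local server on an ephemeral port."""
    server = QuietServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return can_reuse(port, True), can_reuse(port, False)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    reused_when_read, reused_when_unread = demo()
    print("body read, connection reusable:", reused_when_read)
    print("body unread, connection reusable:", reused_when_unread)
```

In this model the only way to make the connection usable again is to read (or close) the pending response, which is why an upload path that requests a streamed response and then ignores it ends up opening a fresh connection every time.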

Status

work in progress

Checklist (tick everything that applies)

Code linting (required, can be done after the PR checks)
Documentation
Tests
ICLA (required for bigger changes)

cc @Kami @tonybaloney


Kami · 2019-10-06T08:00:54Z

Thanks for contributing this bug fix, I will have a look shortly.

There were indeed quite a few regressions introduced when we moved to the requests library (I fixed some in #1339, but more need to be fixed).

Kami · 2019-10-06T08:02:24Z

I had a look and the change looks good, but can you confirm that with this change, streaming upload will still work correctly (i.e., the whole input file won't be buffered in memory, but sent in chunks)?

pquentin · 2019-10-06T08:38:24Z

Thank you for the review!

Streaming upload should not be affected as this change is about response streaming. But that's a good idea anyway, I will check tomorrow. I will also check streaming download and report my findings here.

pquentin · 2019-10-07T07:24:20Z

I can confirm that streaming upload does not load the whole file in memory.

Streaming download is broken in this regard, but this is already the case in the trunk branch because raw and stream are both True, so my change does not affect streaming download. I believe this should be fixed in another pull request.

pquentin · 2019-10-07T11:06:56Z

@Kami I made a mistake when recording memory usage for streaming download, and was simply misled by the 5MB chunk size. When using a smaller chunk size, it's clear that the file isn't in memory, but sent in chunks. So I believe we're good here. 👍
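The thread does not say how memory usage was recorded, so the following is only a sketch of one way to see the chunk-size effect, assuming `tracemalloc` as the measuring tool. `ZeroStream` and `peak_while_streaming` are hypothetical helpers, not libcloud code. With a chunk size close to the file size, peak memory looks as if the whole file were buffered; with small chunks it stays near the chunk size.

```python
import tracemalloc


class ZeroStream:
    """Hypothetical file-like source that produces `size` zero bytes on demand."""

    def __init__(self, size):
        self.remaining = size

    def read(self, n):
        n = min(n, self.remaining)
        self.remaining -= n
        return bytes(n)  # empty bytes once the stream is exhausted


def peak_while_streaming(total_size, chunk_size):
    """Consume the stream chunk by chunk and return peak traced memory in bytes."""
    stream = ZeroStream(total_size)
    tracemalloc.start()
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


MiB = 1024 * 1024
small_peak = peak_while_streaming(8 * MiB, 64 * 1024)
large_peak = peak_while_streaming(8 * MiB, 5 * MiB)
print(f"64 KiB chunks: peak {small_peak / MiB:.2f} MiB")
print(f"5 MiB chunks:  peak {large_peak / MiB:.2f} MiB")
```

The 8 MiB total and the two chunk sizes are arbitrary values chosen for the illustration; only the 5MB chunk size comes from the comment above.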

Kami · 2019-10-08T20:37:46Z

Merged, thanks!

pquentin added 2 commits October 4, 2019 15:38

Add changelog entry for apache#1353

4c4db0d

Kami added the api: storage label Oct 6, 2019

Kami added the drivers: aws s3 label Oct 6, 2019

Kami approved these changes Oct 8, 2019

Kami merged commit 6f6f16c into apache:trunk Oct 8, 2019
