Reuse TCP connections when uploading files #1353
Conversation
It's easy to break connection reuse when using the requests API: just use `stream=True` and never read the response. The connection used to make the request will never be reused, and will be dropped when urllib3's connection pool is full.

It turns out uploading objects via the S3 API goes through `prepared_request`, which incorrectly sets `stream` to the value of `raw` (`True` in our case). Since we never read the response data, connections are never reused, and each upload requires its own connection. This is particularly wasteful when uploading many small objects, which can easily happen with JSON or Parquet files generated by Apache Spark, where setting up the connection takes significant time compared to uploading a few bytes.

Setting `stream=stream` in the `prepared_request` method matches the code in the `request` method and fixes the bug.
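The mechanics can be sketched without any network traffic. The `ConnectionPool` and `Response` classes below are illustrative stand-ins (not urllib3's real implementation): a connection only goes back to the pool once the response body has been consumed, so a streamed response that is never read leaks its connection and forces the next request onto a fresh one.

```python
# Minimal sketch of why an unconsumed streamed response blocks connection
# reuse. These classes are illustrative stand-ins, not urllib3's real code.

class ConnectionPool:
    def __init__(self):
        self.idle = []          # connections available for reuse

    def get(self):
        # Reuse an idle connection if one exists, else "open" a new one.
        return self.idle.pop() if self.idle else object()

    def release(self, conn):
        self.idle.append(conn)

class Response:
    def __init__(self, pool, conn, stream):
        self._pool, self._conn = pool, conn
        if not stream:
            self.read()         # non-streamed: body is consumed eagerly

    def read(self):
        # Only once the body is fully read can the connection return
        # to the pool for reuse.
        self._pool.release(self._conn)

# Bug: stream=True and the body is never read -> no reuse, new conn each time.
pool = ConnectionPool()
c1 = pool.get()
Response(pool, c1, stream=True)
c2 = pool.get()
assert c2 is not c1

# With stream=False (or after reading the body), the connection is reused.
pool2 = ConnectionPool()
c3 = pool2.get()
Response(pool2, c3, stream=False)
c4 = pool2.get()
assert c4 is c3
```

In the real requests API the same effect is achieved by reading the body, calling `response.close()`, or simply not passing `stream=True` when the body isn't needed.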
Thanks for contributing this bug fix, I will have a look shortly. There were indeed quite a few regressions introduced when we moved to the requests library (I fixed some here #1339, but more need to be fixed).
I had a look and the change looks good, but can you confirm that with this change, streaming upload will still work correctly (i.e. the whole input file won't be buffered in memory, but sent in chunks)?
Thank you for the review! Streaming upload should not be affected, as this change is about response streaming. But that's a good idea anyway, I will check tomorrow. I will also check streaming download and report my findings here.
I can confirm that streaming upload does not load the whole file in memory. Streaming download is broken in this regard, but this is already the case in the trunk branch because …
@Kami I made a mistake when recording memory usage for streaming download, and was simply misled by the 5MB chunk size. When using a smaller chunk size, it's clear that the file isn't held in memory, but sent in chunks. So I believe we're good here. 👍
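The memory check described above can be sketched as follows. The `fake_body` generator is an illustrative stand-in for a streamed response body: if the download is truly chunked, the largest buffer ever held at once equals the chunk size, not the full file size, which is why a small chunk size makes the behaviour obvious.

```python
# Sketch of the memory-usage check: peak buffer size should track the
# chunk size, not the total download size. fake_body is illustrative only.

def fake_body(total, chunk_size):
    """Yield `total` bytes in pieces of at most `chunk_size` bytes."""
    sent = 0
    while sent < total:
        n = min(chunk_size, total - sent)
        yield b"x" * n
        sent += n

total, chunk = 10 * 1024 * 1024, 4096   # 10 MB file, 4 KB chunks
peak = max(len(c) for c in fake_body(total, chunk))
assert peak == chunk   # memory scales with chunk size, not file size
```

With a 5MB chunk size the peak buffer is close to the size of a small file, which is exactly how the measurement above was initially misread.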
Merged, thanks!