Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reuse TCP connections when uploading files #1353

Merged
merged 2 commits into from Oct 8, 2019

Conversation

pquentin
Copy link
Contributor

@pquentin pquentin commented Oct 4, 2019

Reuse TCP connections when uploading files

Description

It's easy to break connection reuse when using the requests API: just use stream=True and never read the response. The connection used to make the request will never be reused, and will be dropped when the urllib3's connection pool is full.

It turns out uploading objects using the S3 API goes through prepared_request, which incorrectly sets stream to the value of raw, True in our case. And since we don't read the response data, the connection are never reused, and each upload requires its own connection.

This is particularly wasteful when uploading many small objects, which can easily happen with JSON or Parquet files generated by Apache Spark, where setting up the connection takes significant time compared to uploading a few bytes.

Setting stream=stream in the prepared_request method matches the code in the request method and fixes the bug.

Status

  • work in progress

Checklist (tick everything that applies)

  • Code linting (required, can be done after the PR checks)
  • Documentation
  • Tests
  • ICLA (required for bigger changes)

cc @Kami @tonybaloney

It's easy to break connection reuse when using the requests API: just
use `stream=True` and never read the response. The connection used to
make the request will never be reused, and will be dropped when the
urllib3's connection pool is full.

It turns out uploading objects using the S3 API goes through
`prepared_request`, which incorrectly sets `stream` to the value of
`raw`, `True` in our case. And since we don't read the response data,
the connection are never reused, and each upload requires its own
connection.

This is particularly wasteful when uploading many small objects, which
can easily happen with JSON or Parquet files generated by Apache Spark,
where setting up the connection takes significant time compared to
uploading a few bytes.

Setting `stream=stream` in the `prepared_request` method matches the
code in the `request` method and fixes the bug.
@Kami
Copy link
Member

Kami commented Oct 6, 2019

Thanks for contributing this bug fix, I will have a look shortly.

There were indeed quite many regressions introduced when we moved to the requests library (I fixed some here #1339, but more need to be fixed).

@Kami
Copy link
Member

Kami commented Oct 6, 2019

I had a look and the change looks good, but can you confirm that with this change, streaming upload will still work correctly (aka the whole input file won't be buffered in memory, but sent in chunks)?

@pquentin
Copy link
Contributor Author

pquentin commented Oct 6, 2019

Thank you for the review!

Streaming upload should not be affected as this change is about response streaming. But that's a good idea anyway, I will check tomorrow. I will also check streaming download and report my findings here.

@pquentin
Copy link
Contributor Author

pquentin commented Oct 7, 2019

I can confirm that streaming upload does not load the whole file in memory.

Streaming download is broken in this regard, but this is already the case in the trunk branch because raw and stream are both True, so my change does not affect streaming download. I believe this should be fixed in another pull request.

@pquentin
Copy link
Contributor Author

pquentin commented Oct 7, 2019

@Kami I made a mistake when recording memory usage for streaming download, and was simply misled by the 5MB chunk size. When using a smaller chunk size, it's clear that the file isn't in memory, but sent ink chunks. So I believe we're good here. 👍

@Kami Kami merged commit 6f6f16c into apache:trunk Oct 8, 2019
@Kami
Copy link
Member

Kami commented Oct 8, 2019

Merged, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants