Sending streams to s3manager uses gobs of ram #2591

Closed
erikh opened this issue May 6, 2019 · 5 comments
erikh commented May 6, 2019

Version of AWS SDK for Go?

Master; verified in code (see links).

Version of Go (go version)?

1.12.4

What issue did you see?

Large memory usage when sending a stream (not a physical file; see the links below) with a multi-gigabyte part size to S3 via s3manager. RAM usage is roughly 3x the part size per concurrent upload.

The problem seems to be that the multipart upload implementation depends on io.ReadSeeker so that it can SHA/MD5 the content after the fact. This is further compounded by the use of SectionReaders (which require ReadSeekers). Instead, I think using io.CopyN for the sections in the stream case, combined with a hash.Hash stream reader for computing the checksums inline, would be much more space-efficient, and you could ditch the sync.Pool buffer implementation. From my progress-meter measurements it would be much more time-efficient too: the whole upload stalls while the buffer fills, which also gives me misleading data about the transfer rate (the bytes are really just going to RAM), with no option to measure the bandwidth usage myself inline.

I don't have time to do this myself; I'm just wiring up backups for a large data set. I just thought you should be aware, if you weren't, and might be willing to consider an alternative to the existing implementation to make the library more efficient. It's not really a complaint, but needing ~18GB of RAM to upload 500GB at 1Gbps, with part sizes large enough to produce an object of that size in S3, is pretty upsetting.

Refs:

buffer pool initialized: https://github.com/aws/aws-sdk-go/blob/master/service/s3/s3manager/upload.go#L392-L394
stream (no seek capability) forces a buffer fill here: https://github.com/aws/aws-sdk-go/blob/master/service/s3/s3manager/upload.go#L451
this is where the seek implementation is actually required, and probably the only spot: https://github.com/aws/aws-sdk-go/blob/master/service/s3/body_hash.go#L122

Steps to reproduce

https://github.com/erikh/snowplow exhibits this behavior.

jasdel (Contributor) commented Jun 3, 2019

Thanks for reaching out to us @erikh with the detailed information. The S3 Upload Manager does use significantly more memory than it should for streaming uploads. Refactoring the S3 Upload Manager is in our backlog to improve its performance. Simplifying the handling of single vs. multipart uploads and streaming vs. file uploads will help us improve the uploader's performance.

An additional improvement would be to not compute the SHA of the object's payload at all; it is not required for HTTPS uploads. This does mean that the ContentMD5 would also not be set, which may need to be considered. We're looking at other techniques for ensuring data-integrity metadata is available with the uploaded object, such as a trailing checksum, which can be computed inline.

Related to #2036

erikh (Author) commented Jun 4, 2019 via email

skmcgrail (Member) commented Mar 31, 2020

We recently introduced an improvement to the Upload manager that removes the use of sync.Pool and replaces it with our own custom pool implementation. This has drastically reduced the total amount of memory allocated when using a streaming payload. The improvements are documented in #3183 and are available in SDK versions >= v1.29.21.

richardartoul commented Oct 2, 2020

@skmcgrail Could you elaborate on why the part pool throws away all of its buffers between uploads? This causes my service to allocate like crazy when doing lots of streaming uploads, even though I'm reusing the same Uploader.

The pooling would be a lot more useful if it were shared between calls to Upload. I could put together a PR to address this if that would help?

EDIT: alternatively, making the part pool pluggable would be really helpful as well, since then I could at least supply my own implementation.
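The kind of pool being asked for here can be sketched in a few lines. This is a hypothetical illustration of the design, not the SDK's actual pool: a fixed-capacity, channel-backed pool that keeps its buffers alive between Upload calls (unlike sync.Pool, which may drop them at any GC), and whose blocking Get also puts a hard ceiling on total buffer memory regardless of upload concurrency.

```go
package main

import "fmt"

// bufPool is a tiny fixed-capacity pool of part-sized buffers.
// Buffers are never discarded, so reuse across uploads is guaranteed,
// and total buffer memory is bounded at count*size.
type bufPool struct {
	free chan []byte
	size int
}

func newBufPool(count, size int) *bufPool {
	p := &bufPool{free: make(chan []byte, count), size: size}
	for i := 0; i < count; i++ {
		p.free <- make([]byte, size)
	}
	return p
}

// Get blocks until a buffer is free, throttling concurrent parts.
func (p *bufPool) Get() []byte { return <-p.free }

// Put returns a buffer, restoring its full length in case the caller
// re-sliced it.
func (p *bufPool) Put(b []byte) { p.free <- b[:p.size] }

func main() {
	pool := newBufPool(2, 5<<20) // two 5 MiB buffers: ~10 MiB ceiling
	b := pool.Get()
	fmt.Println(len(b)) // → 5242880
	pool.Put(b)
}
```

Making such a pool a field on the Uploader (or an injectable interface, as suggested above) is what would let it be shared across Upload calls.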

github-actions bot commented Oct 3, 2021

We have noticed this issue has not received attention in 1 year. We will close this issue for now. If you think this is in error, please feel free to comment and reopen the issue.

github-actions bot added the closed-for-staleness label and added and removed the closing-soon label on Oct 3, 2021
@github-actions github-actions bot closed this as completed Oct 7, 2021