Sending streams to s3manager uses gobs of ram #2591

Closed
erikh opened this issue May 6, 2019 · 5 comments
erikh commented May 6, 2019

Version of AWS SDK for Go?

Master; verified in code (see links).

Version of Go (go version)?

1.12.4

What issue did you see?

Large memory usage when sending a stream (not a physical file; see the links below) with a multi-gigabyte part size to S3 via s3manager. RAM usage is roughly 3x the part size per concurrent upload.

The problem seems to be that the multipart upload implementation depends on io.ReadSeeker so that it can SHA/MD5 the content after the fact. This is further compounded by the use of SectionReaders (which require ReadSeekers). Instead, I think using io.CopyN for the sections in the stream case, combined with a hash.Hash stream reader for computing the checksums inline, would be much more space-efficient, and you could ditch the sync.Pool buffer implementation. From my progress-meter measurements it would be much more time-efficient too: the whole upload stalls while the buffer fills, which also gives me misleading data about the transfer rate (the bytes are really just going to RAM), with no option to measure the bandwidth usage myself inline.

I don't have time to do this myself; I'm just wiring up backups for a large data set. I just thought you should be aware, if you weren't, and might be willing to consider an alternative to the existing implementation to make the library more efficient. It's not really a complaint, but needing ~18GB of RAM to upload 500GB at 1Gbps, with part sizes large enough to produce an object of that size in S3, is pretty upsetting.

Refs:

buffer pool initialized: https://github.com/aws/aws-sdk-go/blob/master/service/s3/s3manager/upload.go#L392-L394
stream (no seek capability) forces a buffer fill here: https://github.com/aws/aws-sdk-go/blob/master/service/s3/s3manager/upload.go#L451
this is where the seek implementation is actually required, and probably the only spot: https://github.com/aws/aws-sdk-go/blob/master/service/s3/body_hash.go#L122

Steps to reproduce

https://github.com/erikh/snowplow exhibits this behavior.

jasdel (Contributor) commented Jun 3, 2019

Thanks for reaching out to us @erikh with the detailed information. The S3 Upload Manager does use significantly more memory than it should for streaming uploads. Refactoring the S3 Upload Manager is in our backlog to improve its performance. Simplifying the handling of single vs. multipart uploads and streaming vs. file uploads will help us improve the uploader's performance.

An additional improvement would be to not compute the SHA of the object's payload at all; it is not required for HTTPS uploads. This does mean that the ContentMD5 would also not be set, which may need to be considered. We're looking at other techniques for ensuring data-integrity metadata is available with the uploaded object, such as a trailing checksum, which can be computed inline.

Related to #2036

erikh (Author) commented Jun 4, 2019 via email

skmcgrail (Member) commented Mar 31, 2020

We recently introduced an improvement to the Upload manager that removes the use of sync.Pool and replaces it with our own custom pool implementation. This has drastically reduced the total amount of memory allocated when using a streaming payload. The improvements are documented in #3183 and are available in SDK versions >= v1.29.21.

richardartoul commented Oct 2, 2020

@skmcgrail Could you elaborate on why the part pool throws away all of its buffers between uploads? This causes my service to allocate like crazy when doing lots of streaming uploads, even though I'm reusing the same Uploader.

The pooling would be a lot more useful if it were shared between calls to Upload. I could put together a PR to address this if that would help?

EDIT: alternatively, making the part pool pluggable would be really helpful as well, since then I could at least supply my own implementation.
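The kind of pool being asked for here can be sketched in a few lines. This is a hypothetical illustration of the design, not the SDK's actual pool: a fixed-capacity, channel-backed pool that keeps its buffers alive between Upload calls (unlike sync.Pool, which may drop them at any GC), and whose blocking Get also puts a hard ceiling on total buffer memory regardless of upload concurrency.

```go
package main

import "fmt"

// bufPool is a tiny fixed-capacity pool of part-sized buffers.
// Buffers are never discarded, so reuse across uploads is guaranteed,
// and total buffer memory is bounded at count*size.
type bufPool struct {
	free chan []byte
	size int
}

func newBufPool(count, size int) *bufPool {
	p := &bufPool{free: make(chan []byte, count), size: size}
	for i := 0; i < count; i++ {
		p.free <- make([]byte, size)
	}
	return p
}

// Get blocks until a buffer is free, throttling concurrent parts.
func (p *bufPool) Get() []byte { return <-p.free }

// Put returns a buffer, restoring its full length in case the caller
// re-sliced it.
func (p *bufPool) Put(b []byte) { p.free <- b[:p.size] }

func main() {
	pool := newBufPool(2, 5<<20) // two 5 MiB buffers: ~10 MiB ceiling
	b := pool.Get()
	fmt.Println(len(b)) // → 5242880
	pool.Put(b)
}
```

Making such a pool a field on the Uploader (or an injectable interface, as suggested above) is what would let it be shared across Upload calls.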

github-actions bot commented Oct 3, 2021

We have noticed this issue has not received attention in 1 year. We will close this issue for now. If you think this is in error, please feel free to comment and reopen the issue.

github-actions bot added the closed-for-staleness label and added and removed the closing-soon label on Oct 3, 2021
@github-actions github-actions bot closed this as completed Oct 7, 2021