S3 streaming with s3 cp uses several GB of memory on upload #923

Closed
rlmcpherson opened this issue Sep 29, 2014 · 8 comments · Fixed by #924
Labels
bug, s3

Comments

@rlmcpherson

In testing the streaming upload feature implemented in #903, I found that it reads the entire stream into memory, causing very high memory usage for the tool. On an Ubuntu EC2 instance running the latest master branch, uploading a 9 GB file resulted in 6.5-6.9 GB of real memory usage.

Test command:

cat <large_file> | aws s3 cp - s3://bucket/key
@kyleknap
Contributor

Interesting. I will look into it. Also, on a side note, make sure you use --expected-size with its value in bytes. This ensures that the number of parts when uploading is less than 1000 (which is required for S3 uploads). We default to 5 MB chunks, so the threshold for using this parameter is about 5 GB.
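For reference, a hedged example of the kind of invocation described above, with --expected-size given in bytes (the 10 GB value is only an illustrative estimate, not a figure from this thread):

cat <large_file> | aws s3 cp - s3://bucket/key --expected-size 10737418240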

@rlmcpherson
Author

It's 10,000 parts max according to the documentation (http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPart.html), so that's a limit of ~50 GB at the minimum part size.
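Spelling out the arithmetic behind that ~50 GB figure (assuming the 5 MB default chunk means 5 MiB, which is an assumption on my part):

# 10,000 parts x 5 MiB per part
echo $((10000 * 5 * 1024 * 1024))   # 52428800000 bytes, roughly 48.8 GiB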

@kyleknap
Contributor

Yep that's right. The good news is that I have confirmed the bug, and it is a very easy fix. The wrong constant was being used to limit the amount of data in memory. It must have been changed when I rebased off develop to merge the original pull request.

When streaming a file for upload, the maximum memory usage you should expect is around 90 MB. For fast producers like cat, you will tend to see it reach that ceiling; for slower producers, memory usage will be noticeably lower. Memory usage does increase, though, if the file is larger than 50 GB, due to a bump in the chunk size.

Thanks for the catch! I will send a pull request out soon.
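One hedged way to sanity-check the memory ceiling described above on a Linux host is GNU time's maximum-resident-set-size report (the file and bucket names here are illustrative):

cat large_file | /usr/bin/time -v aws s3 cp - s3://bucket/key
# then read the "Maximum resident set size (kbytes)" line in the report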

@smboy

smboy commented Feb 13, 2015

It's a closed issue, but I'm still commenting. For some reason, this is not working on the EMR instance I'm using. Could you please let me know what might be wrong?

cat filename.csv | aws s3 cp - s3://test-store/test-bucket/folder/filename.csv

@jamesls
Member

jamesls commented Feb 13, 2015

What version of the CLI are you using? In what way is it not working? Do you have more information you can share?

@smboy

smboy commented Feb 13, 2015

The version of the CLI is:
aws --version
aws-cli/1.3.9 Python/2.6.9 Linux/3.14.20-20.44.amzn1.x86_64

Here is the error:
cat ins.csv | aws s3 cp - s3://test-store/test-bucket/folder/ins.csv
[Errno 2] No such file or directory: '/home/hadoop/testuser/-'
Completed 1 part(s) with ... file(s) remaining

@smboy

smboy commented Feb 16, 2015

I just spun up a new EMR instance and upgraded the AWS CLI to 1.7. This feature is working as expected. Sorry for the false alarm.

thanks!

@RRAlex

RRAlex commented Jun 19, 2019

--expected-size is only used to segment the upload and doesn't have to be the exact file size, right?
Because otherwise, doing tar ... | aws s3 cp - s3://... would become very difficult without writing to disk.
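For context, a hedged sketch of the kind of pipeline being asked about, with --expected-size passed as a rough upper-bound estimate in bytes (the path and the 100 GB figure are purely illustrative; whether an estimate is good enough is exactly the question above):

tar cf - /path/to/dir | aws s3 cp - s3://bucket/archive.tar --expected-size 107374182400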
