
Added the ability to stream data using cp. #903

Merged: 5 commits, Sep 29, 2014
Conversation

@kyleknap (Contributor) commented Sep 2, 2014

This feature enables users to stream from stdin to s3 or from s3 to stdout.
Streaming large files is both multithreaded and uses multipart transfers.
The streaming feature is limited to single file cp commands.

You can look at some of the documentation changes to see how to run the commands.
Here is a synopsis:
For uploading a stream from stdin to s3, use:
aws s3 cp - s3://my-bucket/stream

For downloading an s3 object as a stdout stream, use:
aws s3 cp s3://my-bucket/stream -

So for example, if I had the object s3://my-bucket/stream, I could run this command:
aws s3 cp s3://my-bucket/stream - | aws s3 cp - s3://my-bucket/new-stream

This command would download the object stream from the bucket my-bucket and write it to stdout. The data would then be piped from stdout to the stdin of the second command and uploaded to an object with the key new-stream in the s3 bucket my-bucket.

cc @jamesls @danielgtaylor

@coveralls
Coverage increased (+0.02%) when pulling abce027 on kyleknap:streams into 2bdb58b on aws:develop.

# Need to save the data to be able to check the etag for a stream
# because once the data is written to the stream there is no
# undoing it.
payload = write_to_file(None, etag, md5, file_chunks, True)
Contributor:
It's a little unclear to me here - is this actually reading in the entire contents of the file to be printed later?

Contributor Author:
Yes, it is, if the object is being streamed to standard out. This is needed because if you write the object to stdout while doing the MD5 calculation, there is no way to erase the data already sent to stdout if there is an MD5 error and the download needs to be retried. Therefore, I write to a buffer that is written to stdout once I have ensured the MD5 is correct. On the other hand, for a file, I write to the file as I calculate the MD5 because I can delete the file if the MD5s do not match.
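
A minimal sketch of that approach, purely for illustration (not the code in this PR; the chunk iterable and expected digest are placeholders):

```python
import hashlib
import io
import sys


def write_stream_after_md5_check(chunks, expected_md5_hex):
    """Buffer every chunk, verify the MD5, and only then write to stdout.

    Once bytes reach stdout they cannot be taken back, so verification
    has to finish before anything is emitted.
    """
    buffered = io.BytesIO()
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
        buffered.write(chunk)
    if digest.hexdigest() != expected_md5_hex:
        # Nothing has been written yet, so the download can simply be retried.
        raise ValueError("MD5 mismatch; retry the download")
    sys.stdout.buffer.write(buffered.getvalue())
```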

Contributor:
This is kind of concerning to me given the size of files people will put into S3. Have you considered using a temporary file? You could use temp files only when the download is large, and it would behave the same as a normal file except that it is eventually written to stdout and removed from disk. What about writing a message out to stderr and returning a non-zero exit code (leaving retries up to the calling script if they want to use stdout)? Any other ideas you considered?

Contributor Author:
This is for a single download, and the multipart threshold is 8 MB, so there will be at most that much in memory for a non-multipart download, since you can only operate on one file when streaming. The memory issue is more concerning for multipart operations, which I will discuss at the bottom of the comment section. On a side note, I like the idea of temporary files.

@danielgtaylor (Contributor)

Overall I'd say this looks pretty good. My main concerns:

  1. Reading the entire file into memory
  2. It could use a little high-level implementation documentation. What happens internally when a stream is read in, or when a stream is output? How are chunks handled? How is multipart handled, and when is it used? Stuff like that, as it's a little tough to follow at the moment.

@kyleknap (Contributor Author) commented Sep 4, 2014

Yeah that's a good idea. Here is a synopsis of all the different transfer scenarios with worst-case memory usage.

Upload:

  1. _pull_from_stream in s3handler reads data from stdin and inserts it into a BytesIO object.
  2. If the length of the BytesIO object is less than the multipart threshold, then you simply upload the BytesIO object directly (see the sketch after the memory estimate below).

Maximum memory estimation: 8 MB (the maximum size of a non-multipart upload)
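
A rough sketch of step 1 under the assumptions above (illustrative only, not the PR's actual _pull_from_stream; the 8 MB threshold comes from the estimate):

```python
import io
import sys

MULTIPART_THRESHOLD = 8 * 1024 * 1024  # 8 MB, matching the estimate above


def pull_initial_chunk(stdin=None, threshold=MULTIPART_THRESHOLD):
    """Read the first chunk of stdin into memory.

    Returns (payload, more_pending). If the stream ended within the
    threshold, `payload` can be sent as a single non-multipart upload;
    otherwise it becomes the first part of a multipart upload and the
    rest of stdin is pulled later.
    """
    stdin = stdin or sys.stdin.buffer
    data = stdin.read(threshold)
    probe = stdin.read(1)               # one extra byte tells us whether more remains
    payload = io.BytesIO(data + probe)  # keep the probe byte with the first chunk
    return payload, bool(probe)
```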

Multipart Upload:

  1. Repeat step 1 from the upload scenario.
  2. If there is more data to read from stdin, begin a multipart upload.
  3. Submit a create-multipart-upload task and, soon after, a task to upload the first part of stdin taken in via step 1.
  4. Continue to pull from the stdin stream and place parts to upload in a queue to be processed (see the sketch after the memory estimate below). All of these parts are BytesIO objects, and the maximum size of the queue is 10 for this operation. The pull-from-stream operation must wait if the queue is full.

Maximum memory estimation: 50 MB (5 MB chunks * 10 chunks in queue at a time)
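
In sketch form, the bounded-queue producer described above could look like the following (illustrative; the part size and queue depth come from the numbers in this comment):

```python
import io
import queue
import sys

PART_SIZE = 5 * 1024 * 1024  # 5 MB parts, per the estimate above
MAX_QUEUED_PARTS = 10        # caps worst-case memory at roughly 50 MB


def produce_parts(part_queue, stdin=None):
    """Pull fixed-size parts from stdin and enqueue them for upload workers.

    A bounded queue.Queue blocks the producer when it is full, which is
    what keeps memory usage capped while the upload threads catch up.
    """
    stdin = stdin or sys.stdin.buffer
    part_number = 1
    while True:
        data = stdin.read(PART_SIZE)
        if not data:
            break
        part_queue.put((part_number, io.BytesIO(data)))  # blocks once 10 parts are queued
        part_number += 1
    part_queue.put(None)  # sentinel: no more parts


def upload_worker(part_queue, upload_part):
    """Consume parts and hand them to an upload_part(part_number, body) callable."""
    while True:
        item = part_queue.get()
        if item is None:
            part_queue.put(None)  # let the other workers see the sentinel too
            break
        part_number, body = item
        upload_part(part_number, body)


# Example wiring; upload_part is a placeholder for the real UploadPart call:
# part_queue = queue.Queue(maxsize=MAX_QUEUED_PARTS)
```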

Download:

  1. While downloading the object, calculate the MD5 and store the data in a BytesIO object.
  2. Once we know the MD5 is correct, we write it to stdout (the buffer-then-verify approach discussed earlier in the thread).

Maximum memory estimation: 8 MB (the maximum size of a non-multipart download)

Multipart Download:

  1. Each thread begins performing a range download on a specific part of the object.
  2. Each thread must wait until its turn has arrived (meaning the part it is currently downloading is the next part required in the stream).
  3. Once it is a thread's turn, it reads its part in chunks and places each chunk on a queue in the order it was read.
  4. An IO thread takes these chunks off the queue and writes them out in the same order that they came in (see the sketch after the memory estimate below).

Maximum memory estimation: 20 MB = 1 MB (the size of items in the queue, which are chunks of a thread's specified part) * 20 (the maximum size of the write queue)
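
The turn-taking hand-off to the IO thread could be sketched like this (illustrative only; the chunk size and queue depth follow the numbers above, and part_stream stands in for the streaming body of a ranged GET):

```python
import queue
import sys
import threading

CHUNK_SIZE = 1024 * 1024  # 1 MB chunks, per the estimate above
MAX_WRITE_QUEUE = 20      # caps worst-case memory at roughly 20 MB


def feed_part(write_queue, part_index, part_stream, turn, turn_cv):
    """A downloader thread waits for its turn, then enqueues its chunks in order.

    `turn` is a one-element list holding the index of the next part the
    stream needs; `turn_cv` is a threading.Condition guarding it.
    """
    with turn_cv:
        while turn[0] != part_index:  # wait until this part is next in the stream
            turn_cv.wait()
    while True:
        chunk = part_stream.read(CHUNK_SIZE)
        if not chunk:
            break
        write_queue.put(chunk)        # blocks once 20 chunks are already queued
    with turn_cv:
        turn[0] += 1                  # hand the turn to the next part
        turn_cv.notify_all()


def io_writer(write_queue, out=None):
    """The IO thread drains chunks and writes them to stdout in arrival order."""
    out = out or sys.stdout.buffer
    while True:
        chunk = write_queue.get()
        if chunk is None:             # sentinel placed after the last part
            break
        out.write(chunk)


# write_queue = queue.Queue(maxsize=MAX_WRITE_QUEUE)
# turn, turn_cv = [0], threading.Condition()
```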

Conclusion:
Currently, 50 MB is the maximum amount of memory used when streaming. What are your thoughts on that amount? I originally did not think of using temporary files, but now that I think of it, they will be very useful when the streams get very large. Currently, with 5 MB parts, the maximum stream size that someone can upload is 5 GB (which is too small). That is why I added an --expect-size parameter, so that the chunk size can be adjusted to fit the entire upload in fewer than 1000 parts. The issue, though, is that the chunk sizes will grow beyond the originally expected 5 MB, which may use too much memory if, say, the stream is a TB in size.
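
For example, the chunk-size adjustment being described could be computed like this (a sketch; the option name and the exact logic in the PR may differ):

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # the default 5 MB part size discussed above
MAX_PARTS = 1000                 # the part-count limit discussed above


def adjust_chunksize(expected_size, default=MIN_PART_SIZE, max_parts=MAX_PARTS):
    """Grow the part size so expected_size fits within max_parts parts."""
    required = (expected_size + max_parts - 1) // max_parts  # ceiling division
    return max(default, required)


# With 5 MB parts and a 1000-part limit, the cap is 5 GB. A 1 TB stream
# would therefore need parts of roughly 1 GB each, which is where buffering
# whole parts in memory stops being reasonable:
# adjust_chunksize(1024 ** 4)  # -> about 1.1 GB per part
```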

Given that temporary files can save memory, are there any drawbacks I should be aware of? If not, I will probably convert everywhere I use a BytesIO object to a temporary file.
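
One way the temp-file idea could be applied (an assumption, not what this PR does): Python's tempfile.SpooledTemporaryFile behaves like a BytesIO object until a size threshold and then spills to disk, and the file is deleted automatically when closed, so cleanup matches deleting a partially written download.

```python
import tempfile

SPILL_THRESHOLD = 8 * 1024 * 1024  # keep small payloads in memory, as today


def make_buffer():
    """Drop-in replacement for io.BytesIO() that spills to disk past 8 MB.

    The temporary file is removed automatically on close, so an MD5
    mismatch can discard the data just like deleting a regular file.
    """
    return tempfile.SpooledTemporaryFile(max_size=SPILL_THRESHOLD)


# buf = make_buffer()
# buf.write(chunk)   # same write/seek/read interface as BytesIO
# buf.seek(0)
```

The usual trade-offs would be extra disk I/O and the need for writable temp space on the machine running the CLI.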

@coveralls
Coverage decreased (-0.04%) when pulling 19ea686 on kyleknap:streams into 999ad81 on aws:develop.

@coveralls
Coverage increased (+0.01%) when pulling 0e1ff2d on kyleknap:streams into 999ad81 on aws:develop.

@coveralls
Coverage increased (+0.04%) when pulling 9022a59 on kyleknap:streams into 999ad81 on aws:develop.

@jamesls (Member) commented Sep 20, 2014

:shipit: Looks good.

This feature enables users to stream from stdin to s3 or from s3 to stdout.
Streaming large files is both multithreaded and uses multipart transfers.
The streaming feature is limited to single file ``cp`` commands.
This includes adding more tests, simplifying the code, and some PEP8 cleaning.
@coveralls
Coverage increased (+0.05%) when pulling 4716948 on kyleknap:streams into ab363c3 on aws:develop.

