How to Use botocore.response.StreamingBody as stdin PIPE #426
Comments
Can anyone help?
@mslinn Sorry for the delays!
I'm going to close this issue, but feel free to update here if you find a working solution.
I wish it were so simple. I must have dumped at least 30 hours into this problem, and I've tried a variety of approaches. I've tried the AWS & Python communities on StackOverflow but got no response. I think this issue requires too much setup for a generic Python programmer to address. I'm sure this will be an often-repeated question; I'm just the first to hit it. I think someone from AWS should take it up. For my part, I've got a broken product in production as a result of this issue. I am willing to pay cash money for a fix.
I was working with JSON files specifically, so far from your large video file requirement. However, I was able to access response['Body']._raw_stream.data to get at the data in an I-should-not-access-this-member kinda way. I hope you can do the same; I don't know the specifics of how S3 works ATM.
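For readers landing here, a minimal sketch of that member access (hedged: `_raw_stream` is a private attribute of StreamingBody and not part of the public API, so it may change between versions; the bucket and key names are placeholders):

```python
import boto3

s3 = boto3.client("s3")
response = s3.get_object(Bucket="my-bucket", Key="my-key")  # placeholder names

# StreamingBody wraps a urllib3 HTTPResponse; reaching into the private
# attribute is the "I-should-not-access-this-member" approach described above.
raw = response["Body"]._raw_stream   # urllib3.response.HTTPResponse
data = raw.data                      # note: this buffers the whole body in memory
```

Because `.data` reads the whole body, this only helps for objects that fit in memory, which is why it works for small JSON files but not for the large-video case.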
if
Buffering the entire file in memory is not an option. I am putting together some sample code that almost works; I will share it later today or tomorrow.
A test project that works in a variety of environments is here.
If you can't fit the data in memory, you won't be able to use communicate.
I did not use communicate in the sample code I posted 1/2 hour ago. Take a look.
Not sure if this helps, but I struggled with a similar issue streaming from boto3:s3 to a Flask output stream. Some sample code here, may help you:

```python
s3_response = s3_client.get_object(Bucket=BUCKET, Key=FILENAME)

def generate(result):
    for chunk in iter(lambda: result['Body'].read(self.CHUNK_SIZE), b''):
        yield chunk

return Response(generate(s3_response), mimetype='application/zip',
                headers={'Content-Disposition': 'attachment;filename=' + FILENAME})
```
Hmm, looks interesting, thanks!
I tested that it does indeed iterate per chunk size, but I have not profiled it - meaning I'm hoping StreamingBody really is a stream and it's not all consumed in memory.
Performance is bad, though. Does Python have something like this? Node supports multi-IO concurrency natively, and Golang's io.Copy uses goroutines internally; in the Python world I have only found werkzeug's IterIO as a wrapper to write a stream to, which internally uses greenlet as the lightweight process model to simulate multi-IO concurrency.
Thanks! I'm looking at this same issue too. I'm trying to "stream" a StreamingBody S3 input file and copy it to an S3 output file. I want to do txt file processing on potentially LARGE files. I'm a newbie to Python and AWS, but this information is exactly what I was looking for.
I spent a lot of time but never got this to work. If someone does, please show the juicy details.
I'm resorting to using buffered reads (4096 chars/read) at the moment, but I'm getting farther. I saw your GitHub code submission; if I find anything I'll share.
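A minimal sketch of that buffered-read approach, feeding fixed-size chunks from the StreamingBody into a subprocess (bucket, key, and the cat command are placeholders for illustration):

```python
import boto3
from subprocess import Popen, PIPE, DEVNULL

CHUNK_SIZE = 4096

s3 = boto3.client("s3")
body = s3.get_object(Bucket="my-bucket", Key="big-video")["Body"]  # placeholders

# "cat" stands in for the real processing command; stdout is discarded here
# because capturing it via PIPE without reading it could deadlock.
proc = Popen(["cat"], stdin=PIPE, stdout=DEVNULL)

# Read the StreamingBody in fixed-size chunks so the object never has to fit in memory.
for chunk in iter(lambda: body.read(CHUNK_SIZE), b""):
    proc.stdin.write(chunk)

proc.stdin.close()   # signal EOF to the child
proc.wait()
```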
What happens if you do
That should be equivalent to
in bash
Hey
@JordonPhillips I'm confused by this sentence:
I looked at the code of the StreamingBody class. To me it would make a lot of sense to expose the underlying raw stream.
As pointed out above by @Scoots, the actual file-like object is found in the _raw_stream attribute.
@mslinn: Were you ever able to solve this? Does download_fileobj() with TransferConfig(max_concurrency=1, max_io_queue=1) solve this problem? Thanks.
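For anyone trying that route, a sketch of what it might look like; this is an assumption rather than a confirmed fix: it uses use_threads=False (instead of the max_io_queue tuning above) so parts are written strictly in order, and it relies on download_fileobj accepting a non-seekable pipe, which depends on the installed s3transfer version:

```python
import boto3
from boto3.s3.transfer import TransferConfig
from subprocess import Popen, PIPE, DEVNULL

s3 = boto3.client("s3")
config = TransferConfig(max_concurrency=1, use_threads=False)  # sequential writes

# "cat" stands in for the real processing command.
proc = Popen(["cat"], stdin=PIPE, stdout=DEVNULL)

# download_fileobj streams the object into any writable file-like object;
# with a single-threaded config the chunks arrive in order.
s3.download_fileobj("my-bucket", "big-video", proc.stdin, Config=config)

proc.stdin.close()
proc.wait()
```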
Gave up as this turned into a research project.
Thank you for responding. I did encounter this library, which supports streaming large files to/from S3 and is written on top of boto: https://github.com/RaRe-Technologies/smart_open
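A rough sketch of what piping through smart_open might look like (assuming a recent release, where the entry point is smart_open.open; older versions used smart_open.smart_open):

```python
import smart_open
from subprocess import Popen, PIPE, DEVNULL

# "cat" stands in for the real processing command.
proc = Popen(["cat"], stdin=PIPE, stdout=DEVNULL)

# smart_open returns a true file-like object that streams the S3 object lazily,
# so the whole file is never held in memory.
with smart_open.open("s3://my-bucket/big-video", "rb") as fin:  # placeholder URI
    for chunk in iter(lambda: fin.read(64 * 1024), b""):
        proc.stdin.write(chunk)

proc.stdin.close()
proc.wait()
```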
@bishtpradeep looks interesting and useful, thanks!
@mslinn I think I found a possible solution (at least for Python 3), looking at the objects in boto3. I had a similar problem using pickle with a particularly large object. Here's what I got working:
Looking at the docs for this response here (http://urllib3.readthedocs.io/en/latest/reference/#module-urllib3.response) you can see:
This is good news because Python offers some nice IO objects to allow for buffered IO reading (https://docs.python.org/3/library/io.html#io.BufferedReader). So continuing from the code above:
The BufferedReader is a true file-like object. @JordonPhillips, is this dangerous?
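A sketch of that io.BufferedReader idea, as I understand it (hedged: _raw_stream is private, and on newer urllib3 releases the stream's auto_close behaviour can get in the way, as a later comment notes):

```python
import io
import boto3

s3 = boto3.client("s3")
response = s3.get_object(Bucket="my-bucket", Key="my-key")  # placeholder names

raw = response["Body"]._raw_stream   # the underlying urllib3 HTTPResponse
raw.auto_close = False               # newer urllib3 closes at EOF, which BufferedReader dislikes

# BufferedReader turns the raw HTTP stream into a standard file-like object
# with read(), readline(), peek(), and line iteration.
reader = io.BufferedReader(raw, buffer_size=1024 * 1024)

for line in reader:                  # fine for text-ish payloads; use read(n) for binary
    handle(line)                     # `handle` is a placeholder for your own processing
```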
I had to do this recently. In the end the simplest solution that I found was simply to have a loop that alternately read chunks of data from the source
@nickovs Got any code you could share? Feel inspired to write an article about this?
I'll see if I can dig it out. I seem to recall that the key was to use non-blocking reads. Note that an alternative option, if you have trouble getting non-blocking IO to work, is just to start a thread that reads from the stream and feeds the pipe.
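A sketch of the thread-based variant described here, assuming the destination is still a subprocess pipe (bucket, key, and the cat command are placeholders):

```python
import threading
import boto3
from subprocess import Popen, PIPE

CHUNK = 64 * 1024

s3 = boto3.client("s3")
body = s3.get_object(Bucket="my-bucket", Key="big-video")["Body"]  # placeholders

# "cat" stands in for the real processing command.
proc = Popen(["cat"], stdin=PIPE, stdout=PIPE)

def pump():
    # Copy the S3 stream into the child's stdin, then close it so the child sees EOF.
    for chunk in iter(lambda: body.read(CHUNK), b""):
        proc.stdin.write(chunk)
    proc.stdin.close()

# Writing happens in its own thread, so the main thread can drain the child's
# stdout and the two ends never deadlock on full pipe buffers.
threading.Thread(target=pump, daemon=True).start()

for out_chunk in iter(lambda: proc.stdout.read(CHUNK), b""):
    handle(out_chunk)   # `handle` is a placeholder for streaming the output onward
proc.wait()
```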
@mslinn Would streaming it through BytesIO be sufficient to emulate the file for piping through? This also would work, I think. I'm currently using it like so:
Of course you could yield instead, or use seek to slice bytes too. I haven't tested it with Popen but I feel like it could be useful.
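A sketch of that BytesIO variant; worth noting that it drains the whole object into memory, so it only helps when the object actually fits (which rules it out for the original huge-video case):

```python
import io
import boto3

s3 = boto3.client("s3")
body = s3.get_object(Bucket="my-bucket", Key="smallish-file.bin")["Body"]  # placeholders

# Drain the StreamingBody into an in-memory, seekable file-like object.
buf = io.BytesIO(body.read())

# `buf` now supports read/seek/tell, so it can be handed to anything
# that expects a real file-like object.
buf.seek(0)
header = buf.read(16)
buf.seek(0)
```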
This issue is still something you have to either read the boto3 source or go to Stack Overflow to resolve, which means the library needs work and this issue should not have been closed, but instead assigned as a feature request.
I just ran into this issue.
Does anybody have a better idea how to efficiently copy objects from one bucket to the other without having to worry about tuning CHUNK_SIZE?
This is especially frustrating since apparently I'm trying to copy files between S3-compatible services:

```python
# both `src_client` and `tgt_client` are valid AWS S3 clients from different accounts, `key` is a valid key
obj = src_client.get_object(Bucket="bucket", Key=key)
tgt_client.put_object(Bucket="tgt_bucket", Key=key, Body=obj['Body'])
```

responds: AttributeError: 'StreamingBody' object has no attribute 'tell'

Using instead:

```python
obj = src_client.get_object(Bucket="bucket", Key=key)
tgt_client.put_object(Bucket="tgt_bucket", Key=key, Body=obj['Body']._raw_stream)
```

fails with "UnsupportedOperation: seek"
@nicornk The problem with urllib3 seems to have a workaround, but it still doesn't work anyway:

```python
obj = src_client.get_object(Bucket="bucket", Key=key)
stream = obj['Body']._raw_stream
stream.auto_close = False
tgt_client.put_object(Bucket="sharethis_archive", Key=key, Body=io.BufferedReader(stream))
```

fails with "UnsupportedOperation: File or stream is not seekable."
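One workaround for the bucket-to-bucket case that avoids both put_object's need for a seekable body and manual CHUNK_SIZE tuning is to hand the StreamingBody to upload_fileobj, which does its own multipart chunking; a sketch, assuming both clients have plain get/put permissions:

```python
import boto3

src_client = boto3.client("s3")   # in practice these may point at different accounts/endpoints
tgt_client = boto3.client("s3")

key = "some/key"                  # placeholder
obj = src_client.get_object(Bucket="bucket", Key=key)

# upload_fileobj only needs read() on the source, so a non-seekable StreamingBody
# should be accepted (it goes through the non-seekable upload path) and the data
# is streamed through in multipart chunks.
tgt_client.upload_fileobj(obj["Body"], "tgt_bucket", key)
```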
I have a question about this... If I pass around this streaming body object, does this mean that the HTTP connection isn't closed? Is it only closed once the streaming body object is garbage collected, or when the entire stream is read?
Hello there, I've read the whole discussion and the merged PRs, but I have not yet found a proper way to do it. Did anyone find a good solution?
To be honest, I don't remember much about this issue.
@eprochasson @Hiryus and for anyone discovering this thread going forward, the way to copy between S3 buckets is to use the copy() method, something like:
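A minimal sketch, assuming boto3 and one set of credentials that can read the source bucket and write the target bucket (names are placeholders):

```python
import boto3

s3 = boto3.client("s3")

copy_source = {"Bucket": "source-bucket", "Key": "some/key"}

# copy() performs a managed, server-side copy (multipart for large objects),
# so the object's bytes never pass through this machine.
s3.copy(copy_source, "target-bucket", "some/key")
```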
If copying between accounts you just need to set up a trust relationship using policies. The original question of streaming data from an S3 object is still a good one though, to which I haven't found an ideal solution.
Please re-open.
Original issue:

I want to pipe large video files from AWS S3 into Popen's stdin. This code runs as an AWS Lambda function, so these files won't fit in memory or on the local file system. Also, I don't want to copy these huge files anywhere; I just want to stream the input, process on the fly, and stream the output. I've already got the processing and streaming-output bits working. The problem is how to obtain an input stream as a Popen pipe.

I can access a file in an S3 bucket:

body is a botocore.response.StreamingBody. I intend to use body something like this:

But of course body needs to be converted into a file-like object. The question is how?
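A sketch of one way the pieces described above might fit together, with placeholder bucket/key names and cat standing in for the real video-processing command; shutil.copyfileobj pumps the StreamingBody into the pipe in bounded chunks:

```python
import shutil
import boto3
from subprocess import Popen, PIPE, DEVNULL

s3 = boto3.client("s3")
body = s3.get_object(Bucket="my-bucket", Key="big-video")["Body"]  # placeholders

# "cat" stands in for the real processing command; its output would be
# streamed onward rather than discarded in a real Lambda.
proc = Popen(["cat"], stdin=PIPE, stdout=DEVNULL)

# StreamingBody implements read(), which is all shutil.copyfileobj needs,
# so the object is copied into the pipe one buffer at a time.
shutil.copyfileobj(body, proc.stdin, length=1024 * 1024)

proc.stdin.close()
proc.wait()
```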