
S3 - Accept Range argument for download_file / download_fileobj methods #3339

Closed
2 tasks done
vEpiphyte opened this issue Jul 12, 2022 · 7 comments
Labels
closed-for-staleness feature-request This issue requests a feature. p2 This is a standard priority issue response-requested Waiting on additional information or feedback. s3

Comments

@vEpiphyte

Describe the feature

I'd like to be able to use a Range argument in the S3 download_file / download_fileobj methods to download a subset of a file, per the S3 byte-range requests documented here:

https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html
https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html#API_GetObject_RequestSyntax

Use Case

My use case is to seek into blobs at given offsets and read a portion of them, without needing to read the entire blob. I primarily need to read these blobs into a file descriptor.

Proposed Solution

I believe the s3transfer library can be modified to allow this behavior, but I am uncertain whether there are other considerations (such as multi-threaded downloads) that would be complicated by simply adding Range to the ALLOWED_DOWNLOAD_ARGS constant. When I modified that constant, I ran into issues when testing with moto, and I didn't want to go much further in case there was a better or more correct way to add support for Range here.

Other Information

If this is more appropriate for the s3transfer project, or if there is already a way to provide a range header, please let me know and we can close this out accordingly :)

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change

SDK version used

1.24.27

Environment details (OS name and version, etc.)

Python 3.8 / Python 3.10; Debian and Ubuntu latest stable releases.

@vEpiphyte vEpiphyte added feature-request This issue requests a feature. needs-triage This issue or PR still needs to be triaged. labels Jul 12, 2022
@tim-finnigan
Contributor

Hi @vEpiphyte have you tried using the Range argument with get_object? For example:

s3 = boto3.client('s3')
response = s3.get_object(Bucket='bucket', Key='key', Range='bytes={}-{}'.format(start_byte, stop_byte))

@tim-finnigan tim-finnigan added response-requested Waiting on additional information or feedback. s3 and removed needs-triage This issue or PR still needs to be triaged. labels Jul 13, 2022
@vEpiphyte
Author

Hi @tim-finnigan! The reason I'm looking at the two helper APIs is that my main target is a file descriptor backed by a socket, which provides a mechanism for limiting memory consumption (writes to that file descriptor block until the reader drains chunks from it). Potentially large requests for entire blobs, or ranges thereof, don't end up exhausting memory, since the application relies on the requester to handle those chunks, draining the socket and allowing additional data to be written.

It does look like the get_object response represents the response data in a StreamingBody object (https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html?highlight=streamingbody#botocore.response.StreamingBody) that could be used to feed a file descriptor in my case. I can give that a shot and see how that works.

@github-actions github-actions bot removed the response-requested Waiting on additional information or feedback. label Jul 13, 2022
@pradoz

pradoz commented Jul 25, 2022

Also experiencing issues due to this. get_object returns a botocore.response.StreamingBody object that can't be processed the same way as an object returned from download_fileobj.

Working Code:

        img_data_buf = BytesIO()
        S3.download_fileobj(bucket, key, img_data_buf)
        img_data_buf.seek(0)
        result = get_image_info(img_data_buf)

Failing Code:

        s3_response_object = S3.get_object(
            Bucket=bucket,
            Key=key,
            # Range="bytes=0-512" # fetching the full range throws the same error
        )
        img_data_buf = s3_response_object["Body"].read()
        result = get_image_info(img_data_buf)

Error message:

[ERROR] ValueError: embedded null byte
Traceback (most recent call last):
  File "/var/task/index.py", line X, in lambda_handler
    total_chips, src_height, src_width = get_image_info(img_data_buf)
  File "/var/task/index.py", line 16, in get_image_info
    width, height = imagesize.get(img_data)
  File "/var/task/imagesize.py", line 97, in get
    fhandle = open(filepath, "rb")

It looks like there is a workaround to cast the StreamingBody into a BytesIO object, see this comment

@vEpiphyte
Author

@pradoz the problem with casting the StreamingBody into a BytesIO is that you have to read the entire response into memory to do so, so a large file downloaded from S3 that exceeds available memory could cause a Python MemoryError (and that's not good).

@aBurmeseDev aBurmeseDev added the p2 This is a standard priority issue label Nov 9, 2022
@tim-finnigan
Contributor

Checking in again - I think this may actually be a duplicate of an older issue: #1215. A corresponding issue was also created here in the s3transfer repository: boto/s3transfer#248. To make issue tracking easier we generally combine overlapping issues. Please let us know if there are any distinctions you'd like to make between the issues.

@tim-finnigan tim-finnigan added the response-requested Waiting on additional information or feedback. label Nov 18, 2022
@github-actions

Greetings! It looks like this issue hasn’t been active in longer than five days. We encourage you to check if this is still an issue in the latest release. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or upvote with a reaction on the initial post to prevent automatic closure. If the issue is already closed, please feel free to open a new one.

@github-actions github-actions bot added closing-soon This issue will automatically close in 4 days unless further comments are made. closed-for-staleness and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Nov 23, 2022
@vEpiphyte
Author

@tim-finnigan I will try to review these today. I missed this over the american holiday week.
