
S3 - Accept Range argument for download_file / download_fileobj methods #3339

Closed
2 tasks done
vEpiphyte opened this issue Jul 12, 2022 · 7 comments
Labels
closed-for-staleness feature-request This issue requests a feature. p2 This is a standard priority issue response-requested Waiting on additional information or feedback. s3

Comments

@vEpiphyte

Describe the feature

I'd like to be able to use a Range argument in the S3 download_file / download_fileobj methods to download a subset of a file, per the S3 byte-range requests documented here:

https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html
https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html#API_GetObject_RequestSyntax

Use Case

My use case is to seek into blobs at given offsets and read a portion of them, without needing to read the entire blob. I primarily need to read these blobs into a file descriptor.

Proposed Solution

I believe the s3transfer library can be modified to allow this behavior, but I am uncertain whether there are other considerations (such as multi-threaded downloads) that would be complicated by simply adding Range to the ALLOWED_DOWNLOAD_ARGS constant. When I modified that constant, I ran into issues when testing with moto, and I didn't want to go much further in case there was a better or more correct way to add support for Range here.

Other Information

If this is more appropriate for the s3transfer project, or if there is already a way to provide a range header, please let me know and we can close this out accordingly :)

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change

SDK version used

1.24.27

Environment details (OS name and version, etc.)

Python 3.8 / Python 3.10; Debian and Ubuntu latest stable releases.

@vEpiphyte vEpiphyte added feature-request This issue requests a feature. needs-triage This issue or PR still needs to be triaged. labels Jul 12, 2022
@tim-finnigan
Contributor

Hi @vEpiphyte have you tried using the Range argument with get_object? For example:

s3 = boto3.client('s3')
response = s3.get_object(Bucket='bucket', Key='key', Range='bytes={}-{}'.format(start_byte, stop_byte))

@tim-finnigan tim-finnigan added response-requested Waiting on additional information or feedback. s3 and removed needs-triage This issue or PR still needs to be triaged. labels Jul 13, 2022
@vEpiphyte
Author

Hi @tim-finnigan! The reason I'm looking at the two helper APIs is that my main target is a file descriptor backed by a socket, which provides a mechanism for limiting memory consumption (writes to that file descriptor block until the reader drains chunks from it). Potentially large requests for entire blobs, or ranges thereof, don't end up exhausting memory, since the application relies on the requester to handle those chunks, draining the socket and allowing additional data to be written.

It does look like the get_object response represents the response data in a StreamingBody object (https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html?highlight=streamingbody#botocore.response.StreamingBody) that could be used to feed a file descriptor in my case. I can give that a shot and see how that works.

@github-actions github-actions bot removed the response-requested Waiting on additional information or feedback. label Jul 13, 2022
@pradoz

pradoz commented Jul 25, 2022

Also experiencing issues due to this. get_object returns a botocore.response.StreamingBody object that can't be processed the same way as an object returned from download_fileobj.

Working Code:

        img_data_buf = BytesIO()
        S3.download_fileobj(bucket, key, img_data_buf)
        img_data_buf.seek(0)
        result = get_image_info(img_data_buf)

Failing Code:

        s3_response_object = S3.get_object(
            Bucket=bucket,
            Key=key,
            # Range="bytes=0-512" # fetching the full range throws the same error
        )
        img_data_buf = s3_response_object["Body"].read()
        result = get_image_info(img_data_buf)

Error message:

[ERROR] ValueError: embedded null byte
Traceback (most recent call last):
  File "/var/task/index.py", line X, in lambda_handler
    total_chips, src_height, src_width = get_image_info(img_data_buf)
  File "/var/task/index.py", line 16, in get_image_info
    width, height = imagesize.get(img_data)
  File "/var/task/imagesize.py", line 97, in get
    fhandle = open(filepath, "rb")

It looks like there is a workaround to cast the StreamingBody into a BytesIO object, see this comment

@vEpiphyte
Author

@pradoz the problem with casting the StreamingBody into a BytesIO is that you have to read the entire response into memory to do so, so a large file downloaded from S3 that exceeds available memory could cause a Python MemoryError (and that's not good).

@aBurmeseDev aBurmeseDev added the p2 This is a standard priority issue label Nov 9, 2022
@tim-finnigan
Contributor

Checking in again - I think this may actually be a duplicate of an older issue: #1215. A corresponding issue was also created here in the s3transfer repository: boto/s3transfer#248. To make issue tracking easier we generally combine overlapping issues. Please let us know if there are any distinctions you'd like to make between the issues.

@tim-finnigan tim-finnigan added the response-requested Waiting on additional information or feedback. label Nov 18, 2022
@github-actions

Greetings! It looks like this issue hasn’t been active in longer than five days. We encourage you to check if this is still an issue in the latest release. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or upvote with a reaction on the initial post to prevent automatic closure. If the issue is already closed, please feel free to open a new one.

@github-actions github-actions bot added closing-soon This issue will automatically close in 4 days unless further comments are made. closed-for-staleness and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Nov 23, 2022
@vEpiphyte
Author

@tim-finnigan I will try to review these today. I missed this over the american holiday week.
