
Support Multipart Range Requests in S3Transfer's download_file #3466

Closed · 1 of 2 tasks
forrestfwilliams opened this issue Oct 21, 2022 · 4 comments
Assignees
tim-finnigan

Labels
duplicate (This issue is a duplicate.) · feature-request (This issue requests a feature.) · p3 (This is a minor priority issue)

Comments

forrestfwilliams commented Oct 21, 2022

Describe the feature

Boto3 supports ranged GET requests and multipart downloads; however, it is not possible to perform a multipart download over a specific byte range. This results in slow download times when, for example, you are trying to download a 1 GB range of data from a 4 GB file in S3. It would be great if a range argument were added to TransferConfig that could then be passed to a download_file call. This would download only the specified range of data, but would use multipart downloading if the range size exceeds the multipart_threshold.
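A rough sketch of how this could look (the range argument is hypothetical and does not exist in boto3 today; bucket and key names are placeholders):

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # split into parts past 8 MB
    max_concurrency=10,
    range="bytes=0-1073741823",           # hypothetical: first 1 GB of the object
)

# Would download only the requested 1 GB of a 4 GB object, using concurrent
# part downloads because the range size exceeds multipart_threshold.
s3.download_file("my-bucket", "large-archive.zip", "subset.bin", Config=config)
```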

Use Case

I work at the Alaska Satellite Facility, where we distribute large amounts of remote sensing data to users across the globe via AWS. Many of these datasets come in legacy formats, such as zip files, that are not cloud-friendly. Because these datasets are highly structured, we can identify byte ranges that contain subsets of data our users would want to download directly. However, since these subsets are still large (~1 GB within a larger 4 GB zip file) and multipart downloads are not supported for range requests, we cannot offer low-latency extraction of these datasets.

Proposed Solution

I have developed a workaround that uses aiobotocore to issue concurrent GET requests over the desired range of data. This can be found within this benchmarking script. It is still much slower than the native multipart download.
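The core idea looks roughly like this (a minimal sketch, not the actual benchmarking script; bucket, key, and chunk size are placeholders):

```python
import asyncio

from aiobotocore.session import get_session

CHUNK = 8 * 1024 * 1024  # bytes per ranged GET

async def fetch_chunk(client, bucket, key, start, end):
    # S3 Range headers are inclusive on both ends.
    resp = await client.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    async with resp["Body"] as stream:
        return start, await stream.read()

async def ranged_download(bucket, key, range_start, range_end):
    # Fire off one ranged GET per chunk, then reassemble the chunks in order.
    session = get_session()
    async with session.create_client("s3") as client:
        tasks = [
            fetch_chunk(client, bucket, key, s, min(s + CHUNK - 1, range_end))
            for s in range(range_start, range_end + 1, CHUNK)
        ]
        buf = bytearray(range_end - range_start + 1)
        for start, data in await asyncio.gather(*tasks):
            buf[start - range_start : start - range_start + len(data)] = data
        return bytes(buf)

# e.g. data = asyncio.run(ranged_download("my-bucket", "large-archive.zip", 0, 2**30 - 1))
```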

Other Information

I have also started a discussion about this issue on Stack Overflow, but no one has found a good solution.

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change

SDK version used

1.24.59

Environment details (OS name and version, etc.)

r5d.xlarge EC2 instance running the latest Amazon Linux (same region as S3 bucket)

@forrestfwilliams forrestfwilliams added feature-request This issue requests a feature. needs-triage This issue or PR still needs to be triaged. labels Oct 21, 2022
@forrestfwilliams forrestfwilliams changed the title Support Range Requests in S3Transfer's download_file Support Multipart Range Requests in S3Transfer's download_file Oct 21, 2022
@tim-finnigan
Contributor

Hi @forrestfwilliams, thanks for reaching out. It looks like this may be a duplicate of #1215. (Also, the s3transfer repo may be the best place to track these requests.) I brought this up for discussion with the team, and they weren't sure about supporting multipart download over a specific range. It seems like there was some debate on that Stack Overflow post as well, although there may be some workarounds. Have you tried any workarounds, and if so, what has worked for you?

@tim-finnigan tim-finnigan added response-requested Waiting on additional information or feedback. and removed needs-triage This issue or PR still needs to be triaged. labels Oct 25, 2022
@forrestfwilliams
Author

Hi @tim-finnigan, thanks for your reply. Yes, this does look like the same issue as #1215. So far I have tried solutions using both Python's asyncio and a ThreadPoolExecutor. When accessing a 1.3 GB region of data in an open bucket from an in-region r5d.xlarge EC2 instance, the asyncio approach downloads the data in 6.28 seconds, and the ThreadPoolExecutor approach downloads it in 4.76 seconds. For comparison, using the boto3-native multipart download functionality to download the same amount of data under the same conditions takes 3.96 seconds (i.e., the ThreadPoolExecutor solution takes 1.2x as long as the native one). These differences are further exacerbated under less ideal download conditions.
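For reference, the ThreadPoolExecutor variant boils down to something like this (a minimal sketch rather than the exact benchmark code; the worker count, chunk size, and object names are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")  # boto3 clients are safe to share across threads

def fetch_chunk(bucket, key, start, end):
    # One ranged GET; S3 Range headers are inclusive on both ends.
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    return start, resp["Body"].read()

def ranged_download(bucket, key, range_start, range_end, chunk_size=8 * 1024 * 1024):
    # Fetch the chunks concurrently, then reassemble them in order.
    buf = bytearray(range_end - range_start + 1)
    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = [
            pool.submit(fetch_chunk, bucket, key, s, min(s + chunk_size - 1, range_end))
            for s in range(range_start, range_end + 1, chunk_size)
        ]
        for future in futures:
            start, data = future.result()
            buf[start - range_start : start - range_start + len(data)] = data
    return bytes(buf)
```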

Overall, this is a non-trivial difference in performance for our use case, and it would be great to work towards adding this functionality. I'm also happy to move this discussion to the s3transfer repo if that is more appropriate. Is there an open issue there along these lines?

@github-actions github-actions bot removed the response-requested Waiting on additional information or feedback. label Oct 27, 2022
@aBurmeseDev aBurmeseDev added the p3 This is a minor priority issue label Nov 8, 2022
@tim-finnigan tim-finnigan self-assigned this Nov 16, 2022
@tim-finnigan
Contributor

Hi @forrestfwilliams, thanks for your patience. I'll go ahead and close this issue so we can continue tracking #1215 in the boto3 repo and boto/s3transfer#248, which you opened in the s3transfer repo. I plan to bring this feature request up with the team soon for further review and feedback.

@tim-finnigan tim-finnigan added the duplicate This issue is a duplicate. label Nov 16, 2022
@forrestfwilliams
Author

@tim-finnigan thank you. This feature would be a major improvement for my organization, as well as for anyone trying to access subsets of data files in AWS.
