As you can see from the final step, we get mismatching MD5s. We are going to have to do some digging to figure out why. I have a feeling it comes from the combination of threading and append mode. Thanks for reporting, though!
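For anyone who wants to run the same kind of check locally, here is a minimal sketch of an MD5 comparison. It assumes you have the appended file plus reference copies of each chunk on disk (for example, downloaded one by one in write mode). The function names are made up for illustration and are not part of boto3.

```python
import hashlib


def md5_of_file(path, block_size=1024 * 1024):
    """Return the hex MD5 digest of a single file, read in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def md5_of_parts(paths, block_size=1024 * 1024):
    """Return the hex MD5 digest of several files concatenated in order."""
    digest = hashlib.md5()
    for path in paths:
        with open(path, "rb") as fh:
            for block in iter(lambda: fh.read(block_size), b""):
                digest.update(block)
    return digest.hexdigest()
```

If `md5_of_file(combined_path)` differs from `md5_of_parts(reference_paths)`, the appended file does not match the chunks concatenated in that order.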
Is this planned to ever be addressed? The suggested PR has been open for 5 years, assuming the code works. The only time I need to download in append mode is for collections of very large files in a storage-limited environment, where appending saves me from downloading all the parts and then creating a new copy that combines them. The use_threads=False trick works, but as you'd expect it makes downloads significantly slower, approx. 2.5x on my system, which really adds up when you're talking about 100s of GB to download.
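One possible middle ground, sketched below and not tested against the bug itself: download each chunk into a temporary file opened in write mode (where the issue reports no corruption), then append it to the destination and discard it. This should need extra disk space for only one chunk at a time while leaving threaded transfers enabled. The function name is hypothetical. `client` is assumed to be a boto3 S3 client and `config` an optional `TransferConfig`.

```python
import os
import shutil
import tempfile


def append_chunks_via_temp(client, bucket, keys, dest_path, config=None):
    """Append each S3 object to dest_path, staging one chunk at a time."""
    # Stage next to the destination so the temp file lives on the same volume.
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    with open(dest_path, "ab") as dest:
        for key in keys:
            # TemporaryFile opens in "w+b" mode and is deleted when closed.
            with tempfile.TemporaryFile(dir=dest_dir) as tmp:
                client.download_fileobj(bucket, key, tmp, Config=config)
                tmp.seek(0)
                # Sequential copy into the append-mode destination.
                shutil.copyfileobj(tmp, dest)
```

The trade-off is one extra local copy per chunk, which costs some disk I/O but avoids writing to the append-mode file from multiple threads.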
Environment
Python 3.6.1
boto3 (1.4.5)
Why do I use it
I am using the UNLOAD function from Redshift to dump data as a csv to S3.
Redshift maximizes performance by handling multiple chunks simultaneously and writes them on S3 separately.
As the final destination will be a local file on the user's hard drive, I use download_fileobj() to stream each chunk into a single file.
The problem
boto3 handles this with a multipart download, which often corrupts the local file content when the file object is opened in append mode.
This does not happen when using a file object in write mode.
This does not happen when disabling the threaded multipart download with:
config = boto3.s3.transfer.TransferConfig(use_threads=False)
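For context, a minimal sketch of how that config might be used when streaming several UNLOAD chunks into one local file in append mode. The helper name, bucket name, and chunk keys are placeholders, not values from this issue. `client` is assumed to be a boto3 S3 client.

```python
def download_chunks(client, bucket, keys, dest_path, config=None):
    """Append each S3 object in keys to dest_path, in the given order."""
    with open(dest_path, "ab") as fh:
        for key in keys:
            # With use_threads=False the chunk is written sequentially,
            # so each write lands at the end of the append-mode file.
            client.download_fileobj(bucket, key, fh, Config=config)


def main():
    # Imported here so the helper above can be exercised without boto3.
    import boto3
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(use_threads=False)
    client = boto3.client("s3")
    # Placeholder bucket and keys; replace with your own UNLOAD output.
    keys = ["unload/part_0000", "unload/part_0001"]
    download_chunks(client, "example-bucket", keys, "export.csv", config)
```

Calling `main()` requires valid AWS credentials and real object keys. The cost of this workaround is that transfers are no longer parallelized.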
Related issues
This issue seems to be related to #1304
Reproduce the issue
I created a mock dataset that I put on a public S3 bucket and created a code snippet to reproduce the issue.