download_fileobj with a file object in append mode can mess up file content #1446

Open
remiloze opened this issue Feb 8, 2018 · 3 comments
Labels: automation-exempt · bug (This issue is a confirmed bug.) · p3 (This is a minor priority issue) · s3

Comments

remiloze commented Feb 8, 2018

Environment

Python 3.6.1
boto3 (1.4.5)

Why I use it

I am using the UNLOAD function from Redshift to dump data as CSV to S3.
Redshift maximizes performance by handling multiple chunks simultaneously and writes them to S3 as separate objects.
Since the final destination is a local file on the user's hard drive, I use download_fileobj() to stream each chunk into a single file.

The problem

boto3 handles this with a multipart download and often corrupts the local file content.
This does not happen when using a file object in write mode.
It also does not happen when disabling threaded downloads with

config = boto3.s3.transfer.TransferConfig(use_threads=False)
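
Note that the config only takes effect if it is passed to the transfer call; a minimal sketch using the Config parameter of Bucket.download_fileobj (bucket and key names taken from the snippet below):

import boto3
from boto3.s3.transfer import TransferConfig

# Sequential (non-threaded) download: parts are written in order.
config = TransferConfig(use_threads=False)

s3 = boto3.resource('s3')
bucket = s3.Bucket('rl-boto3-issues-mock-data')

with open('output_file.csv', 'ab') as f:
    bucket.download_fileobj(
        Key='mock_dataset.csv',
        Fileobj=f,
        Config=config,  # without this, the default threaded config is used
    )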

Related issues

This issue seems to be related to #1304

Reproduce the issue

I created a mock dataset, uploaded it to a public S3 bucket, and wrote a code snippet that reproduces the issue.

import boto3
import csv

bucket           = 'rl-boto3-issues-mock-data' 
file_key         = 'mock_dataset.csv' 
output_file_path = 'output_file.csv'

s3     = boto3.resource('s3')
bucket = s3.Bucket(bucket)

#Create file & write header
header = 'index1;index2;location;consumption\n'
with open(output_file_path, 'wt') as file_object:
    file_object.write(header)

#Dump binary in append mode
with open(output_file_path, 'ab') as file_object:

    bucket.download_fileobj(
        Key     = file_key, 
        Fileobj = file_object,
    )

#Read whole csv
with open(output_file_path, 'r') as file_object:

    header = next(file_object)

    reader = csv.reader(
        file_object,
        delimiter  = ';',
    )

    csv_data = list(reader)

#Check csv integrity
for index, row in enumerate(csv_data):
    
    nb_cells = len(row)

    if nb_cells != 4:
        print('Row {0} : number of cells is {1} and should be 4 !'.format(index, nb_cells))
        print('Row is : \n')
        print(row)
        print('\n')
        break
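
A note on the likely mechanism (an inference from the symptoms, not confirmed in this thread): s3transfer downloads parts on several threads and seek()s to each part's offset before writing, but a file opened in append mode writes every chunk at the current end of file regardless of the seek position, so parts land in completion order rather than at their intended offsets. A minimal demonstration that append mode ignores seek():

import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.bin')

# In write mode, seek() is honored, so out-of-order writes still
# land at the right offsets.
with open(path, 'wb') as f:
    f.seek(4)
    f.write(b'WORLD')
    f.seek(0)
    f.write(b'HELL')
with open(path, 'rb') as f:
    print(f.read())  # b'HELLWORLD'

# In append mode, every write goes to the end of the file and the
# seek() position is silently ignored.
with open(path, 'ab') as f:
    f.seek(0)
    f.write(b'X')  # appended, not written at offset 0
with open(path, 'rb') as f:
    print(f.read())  # b'HELLWORLDX'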
kyleknap (Member) commented Feb 9, 2018

Thanks for the nice writeup. That really helped! I am able to reproduce this as well. Here is what I did:

1. Create a csv file with this script:
input_file = 'myinput.csv'
num_rows = 1024 * 1024 * 50


with open(input_file, 'w') as f:
    for i in range(num_rows):
        f.write('%s\n' % i)
2. Upload it with the CLI:
$ aws s3 cp myinput.csv s3://mybucketfoo/
3. Run this script that just downloads the file in append mode:
import boto3
bucket           = 'mybucketfoo' 
file_key         = 'myinput.csv' 
output_file_path = 'myoutput.csv'

s3     = boto3.resource('s3')
bucket = s3.Bucket(bucket)

#Dump binary in append mode
with open(output_file_path, 'ab') as file_object:
    bucket.download_fileobj(
        Key     = file_key, 
        Fileobj = file_object,
    )
4. Run an MD5 on it:
$ md5 myinput.csv myoutput.csv
MD5 (myinput.csv) = 6122e1db4d4fda33e49fe7e53690db52
MD5 (myoutput.csv) = a5bc1d5b2e4d67b1ec62bea227589ea4

As you can see from the final step, we get mismatching MD5s. We are going to have to do some digging to figure out why this is the case. I have a feeling it has to do with the combination of the threading being used and the append mode. Thanks for reporting!
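
(For checking the digests on platforms without the BSD md5 tool, a hashlib equivalent of step 4 — file names taken from the steps above:)

import hashlib

def md5_of(path, chunk_size=1024 * 1024):
    # Stream the file so large downloads don't have to fit in memory.
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

print(md5_of('myinput.csv'))
print(md5_of('myoutput.csv'))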

hhamalai commented

Proposed a fix in boto/s3transfer.

@swetashre swetashre added the auto-label-exempt Issue will not be subject to stale-bot label Aug 18, 2020
@aBurmeseDev aBurmeseDev added the p3 This is a minor priority issue label Nov 7, 2022
@tim-finnigan tim-finnigan added s3 automation-exempt and removed auto-label-exempt Issue will not be subject to stale-bot labels Nov 17, 2022
eth10 commented Nov 2, 2023

Is this ever planned to be addressed? The suggested PR has been open for 5 years, assuming the code works. The only time I need to download in append mode is for collections of very large files in a storage-space-limited environment, where appending lets me avoid downloading all the parts and then creating a new copy that combines them. The use_threads=False trick works, but it also results in significantly slower downloads, as you'd expect (approx. 2.5x slower on my system), which really adds up when you're downloading hundreds of GB.
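
(A possible workaround for this use case, sketched under the assumption that sequential ranged GETs are acceptable; the function name, part size, bucket, and key are illustrative, not from the thread:)

import boto3

def append_download(bucket, key, fileobj, part_size=8 * 1024 * 1024):
    # Fetch the object with sequential ranged GETs and write the bytes
    # in order, so an append-mode file object stays consistent and no
    # temporary combined copy is needed.
    s3 = boto3.client('s3')
    size = s3.head_object(Bucket=bucket, Key=key)['ContentLength']
    for start in range(0, size, part_size):
        end = min(start + part_size, size) - 1
        resp = s3.get_object(Bucket=bucket, Key=key,
                             Range='bytes={}-{}'.format(start, end))
        fileobj.write(resp['Body'].read())

with open('myoutput.csv', 'ab') as f:
    append_download('mybucketfoo', 'myinput.csv', f)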
