download_fileobj with a file object in append mode can mess up file content #1446

Open
remiloze opened this issue Feb 8, 2018 · 3 comments
Labels: automation-exempt · bug (This issue is a confirmed bug.) · p3 (This is a minor priority issue) · s3

Comments

remiloze commented Feb 8, 2018

Environment

Python 3.6.1
boto3 (1.4.5)

Why I use it

I am using the UNLOAD function from Redshift to dump data as CSV to S3.
Redshift maximizes performance by handling multiple chunks simultaneously and writes them to S3 as separate objects.
Since the final destination is a local file on the user's hard drive, I use download_fileobj() to stream each chunk into a single file.

The problem

boto3 handles this with a multipart download and often corrupts the local file content.
This does not happen when using a file object in write mode.
It also does not happen when disabling threaded downloads with

config = boto3.s3.transfer.TransferConfig(use_threads=False)
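
Note that the config only takes effect if it is passed to the transfer call; a minimal sketch using the Config parameter of Bucket.download_fileobj (bucket and key names taken from the snippet below):

import boto3
from boto3.s3.transfer import TransferConfig

# Sequential (non-threaded) download: parts are written in order.
config = TransferConfig(use_threads=False)

s3 = boto3.resource('s3')
bucket = s3.Bucket('rl-boto3-issues-mock-data')

with open('output_file.csv', 'ab') as f:
    bucket.download_fileobj(
        Key='mock_dataset.csv',
        Fileobj=f,
        Config=config,  # without this, the default threaded config is used
    )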

Related issues

This issue seems to be related to #1304

Reproduce the issue

I created a mock dataset, uploaded it to a public S3 bucket, and wrote a code snippet that reproduces the issue.

import boto3
import csv

bucket           = 'rl-boto3-issues-mock-data' 
file_key         = 'mock_dataset.csv' 
output_file_path = 'output_file.csv'

s3     = boto3.resource('s3')
bucket = s3.Bucket(bucket)

#Create file & write header
header = 'index1;index2;location;consumption\n'
with open(output_file_path, 'wt') as file_object:
    file_object.write(header)

#Dump binary in append mode
with open(output_file_path, 'ab') as file_object:

    bucket.download_fileobj(
        Key     = file_key, 
        Fileobj = file_object,
    )

#Read whole csv
with open(output_file_path, 'r') as file_object:

    header = next(file_object)

    reader = csv.reader(
        file_object,
        delimiter  = ';',
    )

    csv_data = list(reader)

#Check csv integrity
for index, row in enumerate(csv_data):
    
    nb_cells = len(row)

    if nb_cells != 4:
        print('Row {0} : number of cells is {1} and should be 4 !'.format(index, nb_cells))
        print('Row is : \n')
        print(row)
        print('\n')
        break
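
A note on the likely mechanism (an inference from the symptoms, not confirmed in this thread): s3transfer downloads parts on several threads and seek()s to each part's offset before writing, but a file opened in append mode writes every chunk at the current end of file regardless of the seek position, so parts land in completion order rather than at their intended offsets. A minimal demonstration that append mode ignores seek():

import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.bin')

# In write mode, seek() is honored, so out-of-order writes still
# land at the right offsets.
with open(path, 'wb') as f:
    f.seek(4)
    f.write(b'WORLD')
    f.seek(0)
    f.write(b'HELL')
with open(path, 'rb') as f:
    print(f.read())  # b'HELLWORLD'

# In append mode, every write goes to the end of the file and the
# seek() position is silently ignored.
with open(path, 'ab') as f:
    f.seek(0)
    f.write(b'X')  # appended, not written at offset 0
with open(path, 'rb') as f:
    print(f.read())  # b'HELLWORLDX'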
kyleknap (Member) commented Feb 9, 2018

Thanks for the nice writeup. That really helped! I am able to reproduce this as well. Here is what I did:

1. Create a csv file with this script:
input_file = 'myinput.csv'
num_rows = 1024 * 1024 * 50


with open(input_file, 'w') as f:
    for i in range(num_rows):
        f.write('%s\n' % i)
2. Upload it with the CLI:
$ aws s3 cp myinput.csv s3://mybucketfoo/
3. Run this script that just downloads the file in append mode:
import boto3
bucket           = 'mybucketfoo' 
file_key         = 'myinput.csv' 
output_file_path = 'myoutput.csv'

s3     = boto3.resource('s3')
bucket = s3.Bucket(bucket)

#Dump binary in append mode
with open(output_file_path, 'ab') as file_object:
    bucket.download_fileobj(
        Key     = file_key, 
        Fileobj = file_object,
    )
4. Run an MD5 on it:
$ md5 myinput.csv myoutput.csv
MD5 (myinput.csv) = 6122e1db4d4fda33e49fe7e53690db52
MD5 (myoutput.csv) = a5bc1d5b2e4d67b1ec62bea227589ea4

As you can see from the final step, we get mismatching MD5s. We are going to have to do some digging to figure out why this is the case. I have a feeling it has to do with the combination of the threading being used and the append mode. Thanks for reporting!
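
(For checking the digests on platforms without the BSD md5 tool, a hashlib equivalent of step 4 — file names taken from the steps above:)

import hashlib

def md5_of(path, chunk_size=1024 * 1024):
    # Stream the file so large downloads don't have to fit in memory.
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

print(md5_of('myinput.csv'))
print(md5_of('myoutput.csv'))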

hhamalai commented

Proposed a fix in boto/s3transfer.

@swetashre swetashre added the auto-label-exempt Issue will not be subject to stale-bot label Aug 18, 2020
@aBurmeseDev aBurmeseDev added the p3 This is a minor priority issue label Nov 7, 2022
@tim-finnigan tim-finnigan added s3 automation-exempt and removed auto-label-exempt Issue will not be subject to stale-bot labels Nov 17, 2022
eth10 commented Nov 2, 2023

Is this ever planned to be addressed? The suggested PR has been open for 5 years, assuming the code works. The only time I need to download in append mode is for collections of very large files in a storage-space-limited environment, where appending lets me avoid downloading all the parts and then creating a new copy that combines them. The use_threads=False trick works, but it also results in significantly slower downloads, as you'd expect (approx. 2.5x slower on my system), which really adds up when you're downloading hundreds of GB.
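
(A possible workaround for this use case, sketched under the assumption that sequential ranged GETs are acceptable; the function name, part size, bucket, and key are illustrative, not from the thread:)

import boto3

def append_download(bucket, key, fileobj, part_size=8 * 1024 * 1024):
    # Fetch the object with sequential ranged GETs and write the bytes
    # in order, so an append-mode file object stays consistent and no
    # temporary combined copy is needed.
    s3 = boto3.client('s3')
    size = s3.head_object(Bucket=bucket, Key=key)['ContentLength']
    for start in range(0, size, part_size):
        end = min(start + part_size, size) - 1
        resp = s3.get_object(Bucket=bucket, Key=key,
                             Range='bytes={}-{}'.format(start, end))
        fileobj.write(resp['Body'].read())

with open('myoutput.csv', 'ab') as f:
    append_download('mybucketfoo', 'myinput.csv', f)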
