Directory upload/download with boto3 #358

Open
dduleep opened this issue Nov 11, 2015 · 49 comments

@dduleep dduleep commented Nov 11, 2015

The PHP SDK has functions for uploading and downloading a whole directory (http://docs.aws.amazon.com/aws-sdk-php/v2/guide/service-s3.html#uploading-a-directory-to-a-bucket).
Is there any similar function available in boto3?

If there is no such function, what is the most suitable approach for downloading/uploading a directory?

Note: my ultimate goal is to build a sync function like the AWS CLI's.

Right now I'm uploading/downloading individual files using https://boto3.readthedocs.org/en/latest/reference/customizations/s3.html?highlight=upload_file#module-boto3.s3.transfer

@rayluo (Contributor) commented Nov 11, 2015

Sorry, there is no directory upload/download facility in Boto 3 at the moment. We are considering backporting the CLI's sync functionality to Boto 3, but there is no specific plan yet.

@JamieCressey JamieCressey commented Feb 16, 2016

+1 for a port of the CLI sync function

@BeardedSteve BeardedSteve commented Mar 2, 2016

This would be really useful; IMHO sync is one of the more popular CLI functions.

@ernestm ernestm commented Mar 8, 2016

+1 this would save me a bunch of time

@litdream litdream commented Mar 30, 2016

+1 "aws s3 sync SRC s3://BUCKET_NAME/DIR[/DIR....] "
Porting this CLI command to boto3 would be so helpful.

@KBoehme KBoehme commented Apr 21, 2016

+1

9 similar comments
@astewart-twist astewart-twist commented May 19, 2016

+1

@aaroncutchin aaroncutchin commented Jun 21, 2016

+1

@hikch hikch commented Jul 7, 2016

+1

@pd3244 pd3244 commented Jul 11, 2016

+1

@MourIdri MourIdri commented Jul 12, 2016

+1

@ghost ghost commented Jul 22, 2016

+1

@gonwi gonwi commented Aug 2, 2016

+1

@zaforic zaforic commented Aug 19, 2016

+1

@rdickey rdickey commented Aug 19, 2016

+1

@Natim Natim commented Aug 24, 2016

I've been thinking a bit about that; it seems we have a working proof of concept here: https://github.com/seedifferently/boto_rsync

However, the project doesn't seem to have had any love for a while. Instead of forking it, I was wondering what it would take to rewrite it as a Boto3 feature.

Can I start with just a sync between the local filesystem and a boto3 client?

Does AWS provide a CRC-32 check or something similar that I could use to detect whether a file needs to be re-uploaded? Or should I base this on the file length instead?
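
Not an authoritative answer, but for what it's worth: S3 doesn't expose a CRC-32, though every object has an ETag which, for single-part uploads without SSE-KMS, is the hex MD5 of the content. A rough "does this need re-uploading?" check along those lines might look like this (the names are illustrative, and multipart ETags would need separate handling):

import hashlib

import boto3
from botocore.exceptions import ClientError

def needs_upload(s3_client, bucket, key, local_path):
    """Return True if the local file is missing from S3 or differs from it."""
    try:
        head = s3_client.head_object(Bucket=bucket, Key=key)
    except ClientError:
        return True  # object doesn't exist (or isn't readable), so upload it
    with open(local_path, 'rb') as f:
        local_md5 = hashlib.md5(f.read()).hexdigest()
    # The ETag is quoted, and equals the MD5 only for single-part, non-KMS uploads
    return head['ETag'].strip('"') != local_md5

# e.g. needs_upload(boto3.client('s3'), 'my-bucket', 'dir/file.txt', '/tmp/dir/file.txt')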

@Natim Natim commented Aug 24, 2016

Right now, the simple approach I used is:

import os
import boto3
from botocore.exceptions import ClientError

# AWS_REGION, BUCKET_NAME and logger are assumed to be defined elsewhere in the module.

def sync_to_s3(target_dir, aws_region=AWS_REGION, bucket_name=BUCKET_NAME):
    if not os.path.isdir(target_dir):
        raise ValueError('target_dir %r not found.' % target_dir)

    s3 = boto3.resource('s3', region_name=aws_region)
    try:
        s3.create_bucket(Bucket=bucket_name,
                         CreateBucketConfiguration={'LocationConstraint': aws_region})
    except ClientError:
        # The bucket already exists (or we lack permission to create it).
        pass

    for filename in os.listdir(target_dir):
        logger.warn('Uploading %s to Amazon S3 bucket %s' % (filename, bucket_name))
        s3.Object(bucket_name, filename).put(Body=open(os.path.join(target_dir, filename), 'rb'))

        logger.info('File uploaded to https://s3.%s.amazonaws.com/%s/%s' % (
            aws_region, bucket_name, filename))

It just uploads a new version of every file, but it doesn't remove previous ones or check whether a file has changed in between.
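
If one wanted to close those two gaps, the change check could reuse something like the ETag/MD5 comparison sketched a couple of comments up, and the deletion half could look roughly like this (a sketch only, reusing the target_dir/bucket_name/AWS_REGION names from above):

def delete_removed_from_s3(target_dir, bucket_name, aws_region=AWS_REGION):
    """Delete bucket objects whose corresponding local file no longer exists."""
    s3 = boto3.resource('s3', region_name=aws_region)
    local_files = set(os.listdir(target_dir))
    for obj in s3.Bucket(bucket_name).objects.all():
        if obj.key not in local_files:
            obj.delete()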

@mikaelho mikaelho commented Nov 2, 2016

+1

3 similar comments
@danielwhatmuff danielwhatmuff commented Nov 22, 2016

+1

@klj613 klj613 commented Nov 23, 2016

+1

@Cenhinen Cenhinen commented Nov 23, 2016

+1

@Natim Natim commented Nov 23, 2016

I guess you can add as many +1s as you want, but it would be more useful to start a pull request on the project. Nobody is going to do it for you, folks.

@mikaelho mikaelho commented Nov 23, 2016

Natim, you've got to be kidding. Implementing this in a reliable way is not trivial, and they already have it implemented, in Python, in the AWS CLI. It is just implemented in such a convoluted way that you need to be an AWS CLI expert to extract it.

@Natim Natim commented Nov 23, 2016

Implementing this in a reliable way is not trivial

I didn't say it was trivial, but it doesn't have to be perfect at first and we can iterate on it. I already wrote something working in about 15 lines of code; we can start from there.

I don't think reading the AWS CLI's code will help much with implementing it in boto3.

@cpury cpury commented Apr 3, 2017

What I really need is simpler than a directory sync. I just want to pass multiple files to boto3 and have it handle the upload of those, taking care of multithreading etc.

I guess this could be done with a light wrapper around the existing API, but I'd have to spend some time investigating it. Does anyone have some hints or a rough idea of how to set it up? I'd be willing to do a PR for this once I find the time.
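
For what it's worth, boto3's transfer manager already multithreads each individual upload under the hood, and boto3 clients are thread-safe, so a light wrapper can be as simple as submitting upload_file calls to a thread pool. A rough sketch (the function name and parameters here are just illustrative, not an existing boto3 API):

from concurrent import futures

import boto3

def upload_files(file_to_key, bucket, max_workers=8):
    """Upload a {local_path: s3_key} mapping concurrently."""
    s3 = boto3.client('s3')
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        jobs = {executor.submit(s3.upload_file, path, bucket, key): path
                for path, key in file_to_key.items()}
        for job in futures.as_completed(jobs):
            job.result()  # re-raise any upload error

# e.g. upload_files({'/tmp/a.txt': 'backup/a.txt', '/tmp/b.txt': 'backup/b.txt'}, 'my-bucket')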

@Sweenu Sweenu commented Aug 23, 2017

Awscli's sync function is really fast, so my current code uses subprocess to make a call to it. Having it backported to boto would be so much cleaner though. Another +1 for that to happen.

@blrnw3 blrnw3 commented Oct 5, 2017

+1

@jimmywan jimmywan commented Oct 10, 2017

I was successfully using s4cmd for a while to do this on relatively large directories, but started running into sporadic failures where it wouldn't quite get everything copied. Might be worth taking a peek at what they did there to see if some of it can be salvaged/reused.
https://github.com/bloomreach/s4cmd

@davidfischer-ch davidfischer-ch commented Oct 24, 2017

+1

@yaniv-g yaniv-g commented Nov 21, 2017

I used this method (altered from Natim's code):

import os
import boto3

def upload_directory(src_dir, bucket_name, dst_dir):
    if not os.path.isdir(src_dir):
        raise ValueError('src_dir %r not found.' % src_dir)

    all_files = []
    for root, dirs, files in os.walk(src_dir):
        all_files += [os.path.join(root, f) for f in files]

    s3_resource = boto3.resource('s3')
    for filename in all_files:
        # Note: on Windows, os.path.join would put backslashes in the S3 key.
        s3_resource.Object(bucket_name, os.path.join(dst_dir, os.path.relpath(filename, src_dir)))\
            .put(Body=open(filename, 'rb'))

The main differences (other than logging and different checks) are that this method copies all files in the directory recursively, and that it allows changing the root path in S3 (inside the bucket).

@samjgalbraith samjgalbraith commented Dec 6, 2017

In case it helps anyone, I ended up writing a library with a class that recursively downloads from S3. It was inspired by a snippet from Stack Overflow that I can't find again to give due credit. In the context of this library it was useful to make it a generator that yields the filepaths of objects as they're downloaded, but your use case may vary.
https://github.com/theflyingnerd/dlow/blob/master/dlow/s3/downloader.py

I came here hoping that someone had implemented it in boto3 by now and I could throw my code away, but no dice.

@toshke toshke commented Feb 16, 2018

Got here trying to find a simple library to do a basic S3 sync from bucket to bucket on AWS Lambda, but it seems there's no implementation yet. I created a small helper class to do this: https://gist.github.com/toshke/e96b454099e27600ee68f86e68c29b22 - hopefully it will be useful to someone else.
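
In the same spirit, a bare-bones bucket-to-bucket copy (no deletion, no change detection) can be written with a paginator plus the managed copy. A rough sketch with illustrative names, not the helper class from the gist above:

import boto3

def copy_bucket(src_bucket, dst_bucket, prefix=''):
    """Copy every object under prefix from src_bucket into dst_bucket."""
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=src_bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            # The managed copy handles multipart for objects over 5 GB
            s3.copy({'Bucket': src_bucket, 'Key': obj['Key']}, dst_bucket, obj['Key'])

# e.g. copy_bucket('source-bucket', 'destination-bucket')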

@ispulkit ispulkit commented Feb 22, 2018

Can't believe it has been 3 years here......
+1

@mattalexx mattalexx commented Mar 11, 2018

I have a workaround for this. Doesn't use a separate process.

Install awscli as a Python library:

pip install awscli

Then define this function:

import os

from awscli.clidriver import create_clidriver

def aws_cli(*cmd):
    old_env = dict(os.environ)
    try:
        # Environment: force a UTF locale so awscli handles non-ASCII paths
        env = os.environ.copy()
        env['LC_CTYPE'] = u'en_US.UTF'
        os.environ.update(env)

        # Run awscli in the same process
        exit_code = create_clidriver().main(*cmd)

        # Deal with problems
        if exit_code > 0:
            raise RuntimeError('AWS CLI exited with code {}'.format(exit_code))
    finally:
        # Restore the original environment
        os.environ.clear()
        os.environ.update(old_env)

To execute:

aws_cli('s3', 'sync', '/path/to/source', 's3://bucket/destination', '--delete')

@MourIdri MourIdri commented Mar 12, 2018

I give up! I finally used Azure: "az storage blob upload-batch --destination --source". It builds a list of files for the directory and uploads them in parallel.

@jbq jbq commented Mar 13, 2018

Directory sync is not yet supported in the API so just fork a process like this:

#! /usr/bin/python3

import subprocess

def run(cmd):
    p = subprocess.Popen(cmd)
    return p.wait()

run(["aws", "s3", "sync", "--quiet", "/path/to/files", "s3://path/to/s3/bucket"])

@mattalexx mattalexx commented Mar 13, 2018

@jbq

Why use a subprocess? Just use the same process:

#358 (comment)

@jbq jbq commented Mar 13, 2018

@mattalexx because I did it this way before you added your comment; just posting my solution since I see people moving to Azure because of this limitation :D

@adaranutsa adaranutsa commented Mar 15, 2018

@mattalexx Thanks for that solution.

I had to put the arguments inside a list for this to work. But it works.

aws_cli(['s3', 'sync', '/path/to/source', 's3://bucket/destination', '--delete'])

@mattalexx mattalexx commented Mar 15, 2018

@adaranutsa

Just make sure you include the asterisk in the function signature:

def aws_cli(*cmd):

And also on the call to main(*cmd).

@toddljones toddljones commented Aug 15, 2018

+1 👍

@reed9999 reed9999 commented Sep 2, 2018

@mattalexx @adaranutsa
I have both asterisks but still got TypeError: main() takes from 1 to 2 positional arguments but 5 were given. Could this be because I was running it with Python 3.5?

Regardless, putting it in double parens to make it a tuple worked: aws_cli(('s3', 'sync', local_dir, s3_uri)) (I chose not to pass --delete).
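
If I'm reading awscli's clidriver correctly, this isn't a Python 3.5 thing: CLIDriver.main() takes the whole argument list as a single parameter, so main(*cmd) only works when cmd is itself a one-element tuple wrapping the list of arguments, which is exactly what passing a list (or double parens) achieves. An equivalent variant that skips the double unpacking, sketched under the same assumptions as the snippet above:

import os
from awscli.clidriver import create_clidriver

def aws_cli(cmd):
    old_env = dict(os.environ)
    try:
        os.environ['LC_CTYPE'] = u'en_US.UTF'
        # CLIDriver.main() expects the full argument list as one parameter
        exit_code = create_clidriver().main(cmd)
        if exit_code > 0:
            raise RuntimeError('AWS CLI exited with code {}'.format(exit_code))
    finally:
        os.environ.clear()
        os.environ.update(old_env)

aws_cli(['s3', 'sync', '/path/to/source', 's3://bucket/destination'])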

@supersaeyan supersaeyan commented Oct 27, 2018

I have a question, but first here's some context.
(New to GitHub, so please forgive me and feel free to point me to some rules of communication if I'm doing something wrong here.)

I have a case-specific implementation of syncing where, for various reasons, I am using presigned URLs generated by a Lambda + API Gateway Flask app (Zappa) for upload and download.
I looked through the awscli codebase and found its date-modified and size strategy. My implementation only uses the date modified to check for new files.
But due to S3's eventual consistency, my sync is also eventual, i.e. it would re-upload some of the files multiple times, which can be expensive for large files and may even go into an infinite loop.
To overcome that I delayed uploads by N seconds, i.e. the timedelta must be N or more seconds for a file to be uploaded, and files with a timedelta under N seconds trigger a warning for user feedback, but obviously it is not perfect like the AWS CLI.

So my question is: even if I implement size comparison in my strategy, would the size be consistent immediately, or is all of the metadata eventually consistent?
How is the AWS CLI achieving this?

@queglay queglay commented May 6, 2019

(Quoting @mattalexx's in-process awscli workaround from above.)

I attempted to use this to sync a specific list of files, but got errors. I was able to sync a whole directory, though.

import os
from awscli.clidriver import create_clidriver


def aws_cli(*cmd):
    old_env = dict(os.environ)
    try:

        # Environment
        env = os.environ.copy()
        env['LC_CTYPE'] = u'en_US.UTF'
        os.environ.update(env)

        # Run awscli in the same process
        exit_code = create_clidriver().main(*cmd)

        # Deal with problems
        if exit_code > 0:
            raise RuntimeError('AWS CLI exited with code {}'.format(exit_code))
    finally:
        os.environ.clear()
        os.environ.update(old_env)


aws_cli(['s3', 'sync', '/prod/debug/debug/0230/houdini', 's3://man.firehawkfilm.com/testdir', '--include debug_debug_0230_wrk_dynamicworkrecook_v007.000_ag_wedge_object_tests.hip'])

Unknown options: --include debug_debug_0230_wrk_dynamicworkrecook_v007.000_ag_wedge_object_tests.hip
Traceback (most recent call last):
  File "s3_cli_sync.py", line 26, in <module>
    aws_cli(['s3', 'sync', '/prod/debug/debug/0230/houdini', 's3://man.firehawkfilm.com/testdir', '--include debug_debug_0230_wrk_dynamicworkrecook_v007.000_ag_wedge_object_tests.hip'])
  File "s3_cli_sync.py", line 20, in aws_cli
    raise RuntimeError('AWS CLI exited with code {}'.format(exit_code))
RuntimeError: AWS CLI exited with code 255
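
For what it's worth, I believe the exit code 255 above is just an argument-passing issue: the flag and its value were joined into a single list element, so awscli sees one unknown option. Each token needs to be its own element, and with aws s3 sync the --include filter only re-includes files that were first excluded, so selecting a single file usually also needs --exclude "*". Something like this (reusing the paths from the example above) should behave better:

aws_cli(['s3', 'sync', '/prod/debug/debug/0230/houdini', 's3://man.firehawkfilm.com/testdir',
         '--exclude', '*',
         '--include', 'debug_debug_0230_wrk_dynamicworkrecook_v007.000_ag_wedge_object_tests.hip'])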

@rectalogic rectalogic commented May 9, 2019

Here's another implementation that parallelizes the upload.

import os
from concurrent import futures
import boto3


def upload_directory(directory, bucket, prefix):
    s3 = boto3.client("s3")

    def error(e):
        # Propagate os.walk errors instead of silently skipping directories
        raise e

    def walk_directory(directory):
        for root, _, files in os.walk(directory, onerror=error):
            for f in files:
                yield os.path.join(root, f)

    def upload_file(filename):
        # The key preserves the path relative to the root directory, under the given prefix
        s3.upload_file(Filename=filename, Bucket=bucket, Key=prefix + os.path.relpath(filename, directory))

    with futures.ThreadPoolExecutor() as executor:
        # Note: futures.wait does not re-raise; call .result() on the returned
        # futures if you need upload errors to surface.
        futures.wait(
            [executor.submit(upload_file, filename) for filename in walk_directory(directory)],
            return_when=futures.FIRST_EXCEPTION,
        )

@CarlosDomingues CarlosDomingues commented Sep 10, 2019

This is a much-needed feature.

@salt-mountain salt-mountain commented Oct 22, 2019

3 years have passed. Are we any closer to deciding to include this much-requested functionality?

@codes-AMiT codes-AMiT commented Nov 19, 2019

Well, I am not the only one who had "Just one job" 🤷‍♂️🤷‍♂️

@caiounderscore caiounderscore commented Dec 30, 2019

+1
