Conversation

@igorborgest (Contributor)

Issue #186

Add s3.copy_objects()

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@igorborgest igorborgest added bug Something isn't working enhancement New feature or request labels Apr 16, 2020
@igorborgest igorborgest self-assigned this Apr 16, 2020
@igorborgest igorborgest changed the title Add s3.copy_objects() #186 Add s3.copy_objects() Apr 16, 2020
@JPFrancoia (Contributor)

I tested the copy_objects function with the default mode (append):

import awswrangler as wr

in_dir = f"{bucket}/{sub_path}/{old_dir}"
out_dir = f"{bucket}/{sub_path}/{new_dir}"
s3_objects = wr.s3.list_objects(in_dir)
wr.s3.copy_objects(s3_objects, in_dir, out_dir)

It does work: the objects are copied. However, if I run the copy a second time, the files are simply overwritten (because they have the same file names), so I think the "append" mode is a bit confusing here.

Also, we could get the effect of an overwrite mode quite simply by chaining delete_objects with a copy of the objects.
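That delete-then-copy chain can be modelled with a minimal in-memory sketch. This is purely illustrative: copy_overwrite and the dict-based "buckets" are hypothetical stand-ins, not awswrangler's API, which operates on real S3 paths.

```python
# In-memory sketch of "overwrite = delete_objects + copy" (hypothetical
# helper; real awswrangler talks to S3, not dicts).

def copy_overwrite(source: dict, src_prefix: str, dst_prefix: str) -> dict:
    """Return the target's contents after an overwrite-style copy:
    the previous target keys are dropped (the delete_objects step), then
    every source object under src_prefix is copied under dst_prefix."""
    target = {}  # delete_objects: whatever was in the target before is gone
    for key, body in source.items():
        if key.startswith(src_prefix):
            target[dst_prefix + key[len(src_prefix):]] = body
    return target
```

Stale files in the target disappear even when their names don't collide with anything in the source, which is exactly what distinguishes overwrite from a plain append-style copy.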

Regarding the overwrite_partitions mode, I think it's interesting: if old_dir and new_dir have colliding partitions (just the partition names) but different data "inside" these partitions (basically different file names), what would happen if we simply copied the objects from old_dir to new_dir? Would we add data to the partitions?

Example:

old_dir
s3://bucket/dataset/d=202004/1234.snappy.parquet
new_dir
s3://bucket/dataset/d=202004/1235.snappy.parquet

What happens if we copy old_dir to new_dir without overwriting the partitions?

This?

s3://bucket/dataset/d=202004/1234.snappy.parquet
s3://bucket/dataset/d=202004/1235.snappy.parquet
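Under a plain append-style copy, the target would indeed end up with the union of both key sets, since the file names differ. A quick sketch of that key arithmetic (append_copy_keys is a hypothetical helper for illustration, not awswrangler code):

```python
def append_copy_keys(source_keys, target_keys):
    """Keys present after a plain object copy with no partition handling:
    colliding names are overwritten in place, distinct names accumulate
    side by side in the same partition prefix."""
    return sorted(set(target_keys) | set(source_keys))

# The d=202004 partition from the example above: distinct file names pile up.
merged = append_copy_keys(
    ["s3://bucket/dataset/d=202004/1234.snappy.parquet"],
    ["s3://bucket/dataset/d=202004/1235.snappy.parquet"],
)
```

An overwrite_partitions mode would instead drop the target's files for any colliding partition before copying, so only the source's files would remain under d=202004.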

Overall I didn't expect the copy_objects function to be so flexible. The old version in 0.3.x was a good ol' simple copy function :)

Proposed solutions (just throwing ideas here):

  • keep the code as is, but update the doc to explain all these behaviors
  • rename the function to copy_dataset, since its usage seems tied to the data itself rather than to simple S3 objects
  • make a simple dummy function copy_objects that does a straightforward copy, and keep this function but rename it merge_dataset

@igorborgest (Contributor, Author)

@JPFrancoia great points. Based on them, I'm splitting this feature into two distinct functions: copy_objects and merge_datasets, as suggested.

Thanks again!

@JPFrancoia (Contributor)

That's awesome, thank you!

I don't have a use case to test the merge_datasets function at the moment, but I tested the copy_objects function in my workflow and it behaves as expected. So for me the PR would be good to go.

Thanks again

@igorborgest igorborgest merged commit 0002fe6 into master Apr 17, 2020
@igorborgest igorborgest deleted the s3-copy branch April 17, 2020 15:25
@igorborgest igorborgest added WIP Work in progress and removed WIP Work in progress labels May 4, 2020