Conversation

@igorborgest (Contributor)

Issue #186

Add s3.copy_objects()

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@igorborgest igorborgest added bug Something isn't working enhancement New feature or request labels Apr 16, 2020
@igorborgest igorborgest self-assigned this Apr 16, 2020
@igorborgest igorborgest changed the title Add s3.copy_objects() #186 Add s3.copy_objects() Apr 16, 2020
@JPFrancoia (Contributor)

I tested the copy_objects function with the default mode (append):

import awswrangler as wr

in_dir = f"{bucket}/{sub_path}/{old_dir}"
out_dir = f"{bucket}/{sub_path}/{new_dir}"
s3_objects = wr.s3.list_objects(in_dir)
wr.s3.copy_objects(s3_objects, in_dir, out_dir)

It does work: the objects are copied. However, if I run the copy a second time, the files are simply overwritten (because they have the same file names), so I think the "append" mode is a bit confusing here.

Also, we could get the effect of an overwrite mode quite simply by chaining delete_objects with a copy of the objects.
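That delete-then-copy chain can be modelled with a minimal in-memory sketch. This is purely illustrative: copy_overwrite and the dict-based "buckets" are hypothetical stand-ins, not awswrangler's API, which operates on real S3 paths.

```python
# In-memory sketch of "overwrite = delete_objects + copy" (hypothetical
# helper; real awswrangler talks to S3, not dicts).

def copy_overwrite(source: dict, src_prefix: str, dst_prefix: str) -> dict:
    """Return the target's contents after an overwrite-style copy:
    the previous target keys are dropped (the delete_objects step), then
    every source object under src_prefix is copied under dst_prefix."""
    target = {}  # delete_objects: whatever was in the target before is gone
    for key, body in source.items():
        if key.startswith(src_prefix):
            target[dst_prefix + key[len(src_prefix):]] = body
    return target
```

Stale files in the target disappear even when their names don't collide with anything in the source, which is exactly what distinguishes overwrite from a plain append-style copy.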

Regarding the overwrite_partitions mode, I think it's interesting: if old_dir and new_dir have colliding partitions (just the partition names) but different data "inside" these partitions (basically different file names), what would happen if we simply copied the objects from old_dir to new_dir? Would we add data to the partitions?

Example:

old_dir
s3://bucket/dataset/d=202004/1234.snappy.parquet
new_dir
s3://bucket/dataset/d=202004/1235.snappy.parquet

What happens if we copy old_dir to new_dir without overwriting the partitions?

This?

s3://bucket/dataset/d=202004/1234.snappy.parquet
s3://bucket/dataset/d=202004/1235.snappy.parquet
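Under a plain append-style copy, the target would indeed end up with the union of both key sets, since the file names differ. A quick sketch of that key arithmetic (append_copy_keys is a hypothetical helper for illustration, not awswrangler code):

```python
def append_copy_keys(source_keys, target_keys):
    """Keys present after a plain object copy with no partition handling:
    colliding names are overwritten in place, distinct names accumulate
    side by side in the same partition prefix."""
    return sorted(set(target_keys) | set(source_keys))

# The d=202004 partition from the example above: distinct file names pile up.
merged = append_copy_keys(
    ["s3://bucket/dataset/d=202004/1234.snappy.parquet"],
    ["s3://bucket/dataset/d=202004/1235.snappy.parquet"],
)
```

An overwrite_partitions mode would instead drop the target's files for any colliding partition before copying, so only the source's files would remain under d=202004.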

Overall I didn't expect the copy_objects function to be so flexible. The old version in 0.3.x was a good ol' simple copy function :)

Proposed solutions (just throwing ideas here):

  • keep the code as is, but update the doc to explain all these behaviors
  • rename the function to copy_dataset, since its usage seems tied to the data itself rather than to simple S3 objects
  • make a simple dummy function copy_objects that does a straightforward copy, and keep this function but rename it merge_dataset

@igorborgest (Contributor, Author)

@JPFrancoia great points. Based on them, I'm splitting this feature into two distinct functions: copy_objects and merge_datasets, as suggested.

Thanks again!

@JPFrancoia (Contributor)

That's awesome, thank you!

I don't have a use case to test the merge_datasets function at the moment, but I tested the copy_objects function in my workflow and it behaves as expected. So for me the PR would be good to go.

Thanks again

@igorborgest igorborgest merged commit 0002fe6 into master Apr 17, 2020
@igorborgest igorborgest deleted the s3-copy branch April 17, 2020 15:25
@igorborgest igorborgest added WIP Work in progress and removed WIP Work in progress labels May 4, 2020