Add S3CopyPrefixOperator for copying objects by prefix #59042
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
vincbeck left a comment
Pretty cool!
    :param acl_policy: String specifying the canned ACL policy for the file being
        uploaded to the S3 bucket.
    :param meta_data_directive: Whether to `COPY` the metadata from the source object or `REPLACE` it with
        metadata that's provided in the request.
Very minor nit: move these two above aws_conn_id so that it matches __init__
Sounds good.
| "source_bucket_prefix", | ||
| "dest_bucket_prefix", | ||
| "source_bucket_name", | ||
| "dest_bucket_name", |
Any reason to not template the rest?
This new operator is based on the existing S3CopyObjectOperator, which only templates these four fields (or their equivalent). I also introduced page_size and continue_on_failure, but I don't think templating is applicable to these two.
If you have specific fields in mind that you think are missing, please feel free to propose and I'd be happy to add them.
I think the two new fields are worth templating. There is usually very little rhyme or reason for why older operators template some fields vs others. So it's not necessarily worth copying that.
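For illustration, a minimal sketch of what templating the two new parameters could look like, extending the four fields quoted above; the attribute names `page_size` and `continue_on_failure` follow this discussion, not necessarily the final code:

```python
from collections.abc import Sequence

# Hypothetical template_fields tuple extending the four existing entries
# with the two new parameters discussed in this thread.
template_fields: Sequence[str] = (
    "source_bucket_prefix",
    "dest_bucket_prefix",
    "source_bucket_name",
    "dest_bucket_name",
    "page_size",
    "continue_on_failure",
)
```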
        self.dest_bucket_name, self.dest_bucket_prefix, "dest_bucket_name", "dest_bucket_prefix"
    )

    # Get paginator
I think comments like this are superfluous (often if you're using an AI agent for coding they add many many comments), and this one seems to be on the wrong line also. But I would just drop it entirely.
Sounds good.
    # [END howto_operator_s3_copy_object]

    # [START howto_operator_s3_copy_prefix]
    copy_prefix = S3CopyPrefixOperator(
Thanks for updating the system test! By chance did you run the dag and see if it's still working?
Yes, it is working!
        It should be omitted when `dest_bucket_prefix` is provided as a full s3:// url.
    :param page_size: Number of objects to list per page when paginating through S3 objects.
        Low values result in more API calls, high values increase memory usage.
        Between 1 and 1000, setting it to 0 results in no objects copied. Default is 1000.
Should the operator __init__ check if zero was provided? Not transferring anything seems like a weird silent failure. 0 often represents unbounded (and has in the past in Airflow for other configs), so I can see people making that mistake. I think it's worth detecting that situation and throwing an exception instead of just silently doing nothing.
My initial thought for allowing a value of 0 is that it can potentially be used as a dynamic mechanism to 'disable' the task without having to modify the DAG. For instance, in the event of an incident with corrupted data, one can quickly stop copying files by setting this parameter to 0, without also having to change unit tests, etc. This is just one example; I can see other scenarios in which it might be useful.
Furthermore, I feel like in this context a value of 0 is relatively self-explanatory. I also think page_size is not a 'trivial' parameter, and anyone providing a custom value (instead of using the default) would have a minimum understanding of how to use it. That being said, I didn't know that 0 often represents unbounded in Airflow, so I can totally understand that confusion might happen.
Personally, I don't have a strong opinion about this, so I'm happy to go with your approach.
We can also just remove page_size completely, see data points in the previous comment. Tagging @vincbeck and @ferruzzi for further opinions.
The idea of disabling a task is interesting, but not something that is very common in Airflow, so I'm not sure an operator would think of it in the rare case that it would be useful. Plus, it would also need a deploy to update the DAG code. I think overall it's worth simplifying and just catching that case and not allowing it.
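A minimal sketch of the suggested constructor check, assuming the 1-1000 range from the docstring quoted earlier; the class skeleton and parameter handling here are illustrative, not the PR's actual code:

```python
from airflow.models import BaseOperator


class S3CopyPrefixOperator(BaseOperator):
    """Illustrative skeleton; only the page_size validation is shown."""

    def __init__(self, *, page_size: int = 1000, **kwargs) -> None:
        super().__init__(**kwargs)
        # Reject 0 (and any other out-of-range value) up front instead of
        # silently copying nothing, as suggested in the review.
        if not 1 <= page_size <= 1000:
            raise ValueError(f"page_size must be between 1 and 1000, got {page_size}")
        self.page_size = page_size
```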
    self.log.info("Successfully copied %s object(s)", copied_object_count)

    if failed_object_count > 0:
        raise RuntimeError(f"Failed to copy {failed_object_count} object(s)")
My expectation would be that continue_on_failure would not fail the task, perhaps just log at ERROR or WARNING that some copy operations failed. But I also see the exception approach as being a "you can't miss this" communication mechanism that some things failed.
Personally, I'm not a big fan of silent failures, so my direction with continue_on_failure was more "do I want to copy as much data as possible in the event of a failure or just stop immediately?". In both cases, there has been an error so the operator should fail, it's just a matter of when. I think there are valid scenarios in which incomplete data is better than no data, but one still wants to be alerted about issues.
Of course I do understand your interpretation and I think it also makes sense. Would be great to hear other opinions as well @vincbeck @ferruzzi .
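For clarity, a rough sketch of the semantics described above (copy as much as possible, then fail the task if anything failed), written as a standalone helper rather than the PR's operator; S3Hook.list_keys and S3Hook.copy_object are existing hook methods, everything else is assumed:

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def copy_prefix(
    source_bucket: str,
    source_prefix: str,
    dest_bucket: str,
    dest_prefix: str,
    continue_on_failure: bool = False,
) -> None:
    """Illustrative helper mirroring the behaviour discussed, not the PR's code."""
    hook = S3Hook()
    copied_object_count = 0
    failed_object_count = 0
    for key in hook.list_keys(bucket_name=source_bucket, prefix=source_prefix):
        try:
            hook.copy_object(
                source_bucket_key=key,
                dest_bucket_key=dest_prefix + key[len(source_prefix):],
                source_bucket_name=source_bucket,
                dest_bucket_name=dest_bucket,
            )
            copied_object_count += 1
        except Exception:
            if not continue_on_failure:
                raise  # stop immediately on the first error
            failed_object_count += 1  # keep copying, remember the failure
    print(f"Successfully copied {copied_object_count} object(s)")
    if failed_object_count > 0:
        # Incomplete data was copied, but the task still surfaces the failure.
        raise RuntimeError(f"Failed to copy {failed_object_count} object(s)")
```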
        It should be omitted when `source_bucket_prefix` is provided as a full s3:// url.
    :param dest_bucket_name: Name of the S3 bucket to where the objects are copied. (templated)
        It should be omitted when `dest_bucket_prefix` is provided as a full s3:// url.
    :param page_size: Number of objects to list per page when paginating through S3 objects.
Is this something that many folks actually ever modify? Is there any benefit to using anything other than what S3 defaults to? It would simplify your src and test code to just not include it here, unless it's very common or a user you're working backwards from is asking for it.
This is a very good question/point. The maximum length of an S3 object key is 1,024 bytes (UTF-8). At the same time, both the default and the maximum value for page_size are 1000. This means that, in theory, the raw data retrieved by a single list_objects_v2 API call should fit in at most ~1 MB, which nowadays is a very small memory footprint.
The main reason I decided to include this parameter is the special case when it is equal to 0, which results in no data being copied. However, as we are discussing in this comment, we might not want to support that. In that case, we could just remove the parameter altogether.
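For reference, this is roughly what page_size would control under the hood: the PageSize handed to boto3's list_objects_v2 paginator (the bucket and prefix names below are placeholders):

```python
import boto3

client = boto3.client("s3")
paginator = client.get_paginator("list_objects_v2")
# One API call returns at most PageSize keys (S3 caps this at 1000).
pages = paginator.paginate(
    Bucket="my-source-bucket",
    Prefix="data/2024/",
    PaginationConfig={"PageSize": 1000},
)
for page in pages:
    for obj in page.get("Contents", []):
        print(obj["Key"])
```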
9ea19ca to b9e4dab
- Add S3CopyPrefixOperator to copy all objects under a prefix
- Support both bucket/prefix params and full S3 URLs
- Include pagination, error handling, and OpenLineage integration
- Add comprehensive unit tests with 14 test cases
- Add system test example and update documentation
- Update S3 operator documentation with new operator section
Just checking in on this one, is all the feedback addressed now @bentorb?
Description
This PR introduces a new S3CopyPrefixOperator that enables copying all S3 objects under a specified prefix from a source bucket to a destination bucket. This operator fills a gap in the current S3 operators by providing prefix-based bulk copy functionality.
What does this operator do?
• Copies all objects matching a specified prefix from source to destination S3 bucket
• Supports cross-bucket copies
• Handles large datasets through pagination
• Provides configurable error handling (continue on failure or stop on first error)
• Integrates with OpenLineage for data lineage tracking
• Supports Airflow templating for dynamic parameter values
Why is this needed?
Currently, Airflow's S3 operators allow copying individual objects. For use cases involving copying entire "directory" structures or large numbers of objects sharing a common prefix, users must implement custom solutions or use multiple operator instances.
This operator provides a native, efficient solution for prefix-based bulk operations.
Key Features
• Pagination Support: Automatically handles large object lists using S3's pagination
• Error Handling: Configurable continue_on_failure parameter for resilient operations
• Template Fields: All key parameters support Jinja templating
• OpenLineage Integration: Automatic data lineage tracking for copied objects
• Standard Exception Handling: Uses RuntimeError following new Airflow guidelines
Implementation Details
• Base Class: Based on S3CopyObjectOperator for consistency
• Dependencies: Uses existing S3Hook and AWS connection infrastructure
• Documentation: Updated providers/amazon/docs/operators/s3/s3.rst with operator documentation
• Error Handling: Follows new Airflow guidelines using standard Python exceptions
Testing
Includes 14 new unit tests covering:
• Basic functionality and successful copying
• Error scenarios and exception handling
• Pagination configuration
• Continue on failure behavior
• OpenLineage integration
• Template field functionality
• System test integration in tests/system/providers/amazon/aws/example_s3.py
• All tests pass in Breeze testing environment
Usage Example
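A sketch of how the operator might be used, based on the parameters documented in this PR; the import path, bucket names, prefixes, and connection id are placeholders:

```python
from airflow.providers.amazon.aws.operators.s3 import S3CopyPrefixOperator

copy_prefix = S3CopyPrefixOperator(
    task_id="copy_prefix",
    source_bucket_name="source-bucket",
    source_bucket_prefix="data/2024/",
    dest_bucket_name="dest-bucket",
    dest_bucket_prefix="backup/2024/",
    aws_conn_id="aws_default",
    continue_on_failure=False,
)
```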
Checklist
• [x] Tests included (14 comprehensive unit tests)
• [x] Documentation updated
• [x] Code follows project coding standards
• [x] All static code checks pass
• [x] Apache license headers added
• [x] PR is focused on single feature
• [x] Local tests pass
• [x] No unrelated changes included