Description
The S3Hook registers an output asset for every file which it uploads:
airflow/providers/amazon/src/airflow/providers/amazon/aws/hooks/s3.py
Lines 1376 to 1379 in e491aac
    # No input because file_obj can be anything - handle in calling function if possible
    get_hook_lineage_collector().add_output_asset(
        context=self, scheme="s3", asset_kwargs={"bucket": bucket_name, "key": key}
    )
This is not always the desired behaviour when using the S3Hook (see the motivating example below).
I'd propose adding a switch to the S3Hook to disable this sort of lineage:
    hook = S3Hook(enable_hook_level_lineage=False)
I am happy to submit a fix here, but I wanted to run it by y'all first to make sure that I'm not missing some previous context or undoing an intentional design decision.
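To make the proposal concrete, here is a minimal sketch of how such a switch might behave. It deliberately does not import Airflow: `FakeLineageCollector` and `SketchS3Hook` are hypothetical stand-ins written for this illustration, and `enable_hook_level_lineage` is the proposed flag, which does not exist on the real S3Hook.

```python
from dataclasses import dataclass, field


@dataclass
class FakeLineageCollector:
    """Hypothetical stand-in for the hook lineage collector; it only records outputs."""

    outputs: list = field(default_factory=list)

    def add_output_asset(self, *, context, scheme, asset_kwargs):
        self.outputs.append((scheme, dict(asset_kwargs)))


class SketchS3Hook:
    """Toy hook for illustration only; not the real S3Hook."""

    def __init__(self, collector, enable_hook_level_lineage=True):
        self.collector = collector
        # Proposed switch: defaults to True so existing behaviour is unchanged.
        self.enable_hook_level_lineage = enable_hook_level_lineage

    def load_string(self, string_data, key, bucket_name):
        # The actual upload is omitted; only the lineage side effect is modelled.
        if self.enable_hook_level_lineage:
            self.collector.add_output_asset(
                context=self,
                scheme="s3",
                asset_kwargs={"bucket": bucket_name, "key": key},
            )


def upload_chunks(hook, values):
    """Upload each value as its own object under some_prefix/."""
    for idx, data in enumerate(values):
        hook.load_string(data, f"some_prefix/file_{idx}.txt", "some_bucket")


# Default behaviour: one output asset is recorded per uploaded file.
default_collector = FakeLineageCollector()
upload_chunks(SketchS3Hook(default_collector), ["a", "b", "c"])

# With the proposed switch turned off, nothing is recorded by the hook.
quiet_collector = FakeLineageCollector()
upload_chunks(
    SketchS3Hook(quiet_collector, enable_hook_level_lineage=False), ["a", "b", "c"]
)

print(len(default_collector.outputs), len(quiet_collector.outputs))
```

Defaulting the flag to `True` keeps the change backwards compatible: only callers that opt out lose the per-file assets.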
Use case/motivation
We have seen issues where users upload chunked data to S3 within a PythonOperator like so:
    hook = S3Hook()
    for idx, data in enumerate(list_of_values):
        hook.load_string(data, key=f"some_prefix/file_{idx}.txt", bucket_name="some_bucket")
This creates one output asset per uploaded file, which quickly adds up to a ton of output assets. I know that this is limited to 100 output objects (since #45798), but it would be nice if we could disable hook-level lineage altogether and instead manage our own output asset definition at the custom operator / PythonOperator level.
In this case, we likely want to only enable a single output asset at the some_prefix/ level, not one per file.
This likely applies to other object storage hooks as well.
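As a rough illustration of the single prefix-level asset described above, the sketch below collapses a list of object keys into one `s3://` URI for their common directory. `prefix_asset_uri` is a hypothetical helper invented for this example, not part of Airflow or the Amazon provider; how the resulting URI would be attached to a task as an output asset is left out.

```python
import posixpath


def prefix_asset_uri(bucket, keys):
    """Return one s3:// URI covering the common directory of all keys.

    Hypothetical helper for illustration; assumes keys have no leading slash.
    """
    # Reduce each key to its directory, then find what all directories share.
    dirs = [posixpath.dirname(key) for key in keys]
    common = posixpath.commonpath(dirs) if dirs else ""
    if common:
        return f"s3://{bucket}/{common}/"
    # No shared directory: fall back to the bucket root.
    return f"s3://{bucket}/"


keys = [f"some_prefix/file_{idx}.txt" for idx in range(3)]
asset_uri = prefix_asset_uri("some_bucket", keys)
print(asset_uri)
```

With hook-level lineage disabled, an operator or PythonOperator callable could report this one URI instead of one asset per file.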
Related issues
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct