Allow disabling hook-level lineage in Hook constructors #63371

@SamWheating

Description

The S3Hook registers an output asset for every file which it uploads:

# No input because file_obj can be anything - handle in calling function if possible
get_hook_lineage_collector().add_output_asset(
    context=self, scheme="s3", asset_kwargs={"bucket": bucket_name, "key": key}
)

This is not always the desired behaviour when using the S3Hook (see the motivating example below).

I'd propose adding a switch to the S3Hook to disable this sort of lineage:

hook = S3Hook(enable_hook_level_lineage=False)
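To make the proposal concrete, here is a minimal, self-contained sketch of how such a flag could gate the lineage call. It does not use Airflow: `FakeLineageCollector` and `SketchS3Hook` are hypothetical stand-ins for the real collector and the real S3Hook, and the actual upload is elided. Only the `enable_hook_level_lineage` name and the `add_output_asset` call shape come from this issue.

```python
from dataclasses import dataclass, field


@dataclass
class FakeLineageCollector:
    """Hypothetical stand-in for the object returned by get_hook_lineage_collector()."""

    outputs: list = field(default_factory=list)

    def add_output_asset(self, context, scheme, asset_kwargs):
        # Record the asset instead of reporting it anywhere.
        self.outputs.append((scheme, asset_kwargs))


class SketchS3Hook:
    """Hypothetical hook used only to illustrate the flag; not the real S3Hook."""

    def __init__(self, collector, enable_hook_level_lineage=True):
        self.collector = collector
        self.enable_hook_level_lineage = enable_hook_level_lineage

    def load_string(self, string_data, key, bucket_name):
        # The actual upload to S3 is elided in this sketch.
        if self.enable_hook_level_lineage:
            self.collector.add_output_asset(
                context=self,
                scheme="s3",
                asset_kwargs={"bucket": bucket_name, "key": key},
            )


def upload_chunks(hook, values):
    """Upload each value as its own object, as in the motivating example."""
    for idx, data in enumerate(values):
        hook.load_string(data, f"some_prefix/file_{idx}.txt", "some_bucket")


enabled = FakeLineageCollector()
upload_chunks(SketchS3Hook(enabled), ["a", "b", "c"])

disabled = FakeLineageCollector()
upload_chunks(SketchS3Hook(disabled, enable_hook_level_lineage=False), ["a", "b", "c"])
```

With the flag left at its default, `enabled.outputs` holds one entry per uploaded object; with the flag off, `disabled.outputs` stays empty and the caller is free to declare lineage elsewhere.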

I am happy to submit a fix here, but I wanted to run it by y'all first to make sure that I'm not missing some previous context or undoing an intentional design decision.

Use case/motivation

We have seen issues where users upload chunked data to S3 within a PythonOperator like so:

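from airflow.providers.amazon.aws.hooks.s3 import S3Hook

hook = S3Hook()
for idx, data in enumerate(list_of_values):
    # Each call registers its own output asset via hook-level lineage.
    hook.load_string(data, key=f"some_prefix/file_{idx}.txt", bucket_name="some_bucket")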

This creates one output asset per uploaded file, which adds up quickly. I know that this is limited to 100 output objects (since #45798), but it would be nice if we could disable hook-level lineage altogether and instead manage our own output asset definition at the custom operator / PythonOperator level.

In this case, we likely want to register only a single output asset at the some_prefix/ level, not one per file.
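As a rough illustration of that prefix-level asset, the sketch below collapses a list of uploaded keys into one `s3://` URI for the deepest directory they share. The helper name `prefix_asset_uri` is hypothetical and is not part of Airflow; how the resulting URI is attached to the task (for example as an outlet on the operator) is left out here.

```python
import posixpath


def prefix_asset_uri(bucket: str, keys: list[str]) -> str:
    """Return one s3:// URI covering the deepest directory shared by all keys."""
    if not keys:
        raise ValueError("at least one key is required")
    # Compare directory parts only, so file names never end up in the prefix.
    directories = [posixpath.dirname(key) for key in keys]
    common = posixpath.commonpath(directories)
    # An empty common path means the keys only share the bucket root.
    return f"s3://{bucket}/{common}/" if common else f"s3://{bucket}/"


uploaded_keys = [f"some_prefix/file_{idx}.txt" for idx in range(3)]
single_asset_uri = prefix_asset_uri("some_bucket", uploaded_keys)
```

For the keys from the example above, this yields a single URI at the `some_prefix/` level instead of one asset per file.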

This likely applies to other object storage hooks as well.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
