Skip to content

Data transformation end-to-end view using data URIs #51978

@chsanjeev

Description

@chsanjeev

Description

Use Case:
Provide an end-to-end view of data loading for a single table using multiple data workflows.

Flow Diagram:

Image

Data Uri example -> file://tmp/txn_leads.csv

Thoughts:

  • Provides a consolidated view of data moving into a target table.
  • Not related to scheduling dependencies like "Assets".
  • It does not replace data lineage tools like Collibra or OpenLineage etc., but offers a basic utility for a high-level overview.

Nice to have:

  • Ability to click on the DAG name or data URI to navigate to the respective DAG or task.
  • Display the number of records or extract information under the data URI for a quick glance.

Airflow Version: 3

Use case/motivation

Example: In a Data Vault 2.0 model, a HUB table receives data from various sources using different data rules. It is useful to understand how data moves into a single table through different DAGs and tasks.

Lineage SQL: May not be the most optimized or accurate

select
TODR.DAG_ID,
TODR.TASK_ID,
TODR.priority_weight,
INA.URI IN_URI,
OUA.URI AS OUT_URI
from
public.task_instance TODR
inner join public.task_inlet_asset_reference INTIAR
on (INTIAR.task_id = TODR.task_id and INTIAR.dag_id = TODR.dag_id)
inner join public.asset INA
on (INTIAR.asset_id = INA.id )
inner join public.task_outlet_asset_reference OUTAR
on (OUTAR.task_id = TODR.task_id and OUTAR.dag_id = TODR.dag_id)
inner join public.asset OUA
on (OUTAR.asset_id = OUA.id )
where
(TODR.TASK_ID,TODR.DAG_ID,TODR.start_date) in (select TASK_ID,DAG_ID,MAX(start_date) from public.task_instance group by TASK_ID,DAG_ID )
and TODR.DAG_ID in ('source_system_1','source_system_2','source_system_3','source_system_4')
order by TODR.DAG_ID DESC,TODR.priority_weight DESC;

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    kind:featureFeature Requestsneeds-triagelabel for new issues that we didn't triage yet

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions