feat(ingest/dbt): dbt column-level lineage #8991

Merged

hsheth2 merged 26 commits into datahub-project:master from hsheth2:dbt-cll

Nov 14, 2023

Collaborator

hsheth2 commented Oct 11, 2023 (edited)

~~Changes stacked on top of #8989~~

Caveats

if you're referencing tables directly (outside of dbt's ref or source), those tables won't get column-level lineage (CLL)

TODOs

add tests for schema inference
add tests for CLL generation

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

github-actions bot added the ingestion label

vercel bot deployed to Preview

October 11, 2023 23:31

hsheth2 force-pushed the dbt-cll branch from d9deb86 to e70d463

October 12, 2023 05:31

hsheth2 requested a review from asikowitz

October 12, 2023 05:32

vercel bot deployed to Preview

October 12, 2023 06:06

vercel bot deployed to Preview

October 13, 2023 19:16

hsheth2 added 9 commits

October 25, 2023 13:20


          add topological sort

191ec39


          start adding schema inference

a571714


          start adding dbt schema inference

fe33bb3


          add mvp support for dbt cll

1c811ef


          code cleanup

dbcad42


          other tweaks

3410d29


          fix keyerror

01278fe


          start fix

c352766


          more fixes

741a771

hsheth2 force-pushed the dbt-cll branch from 7b3147f to 741a771

October 26, 2023 06:51

hsheth2 added 3 commits

October 26, 2023 10:50


          more tests

c3a0d42


          even more tests

a3bf5d7


          Merge branch 'master' into dbt-cll

2b9ea9d

vercel bot deployed to Preview

October 26, 2023 18:22

maggiehays added the hacktoberfest-accepted label

hsheth2 added 4 commits

November 7, 2023 17:23


          Merge branch 'master' into dbt-cll

e6a11b1


          fix validator

1c43980


          fix column casing bug

89c3334


          make graph instance not required

618eb2f

vercel bot deployed to Preview

November 8, 2023 20:39

hsheth2 added 4 commits

November 8, 2023 12:44


          Merge branch 'master' into dbt-cll

0f29cad


          fix mypy

f3f87c7


          make infer dbt schemas default true

a5683c9


          make dbt files canonical

d4b574a


          update test goldens

5377b12

hsheth2 marked this pull request as ready for review

November 8, 2023 20:54

vercel bot deployed to Preview

November 8, 2023 22:06

asikowitz approved these changes

Collaborator

asikowitz left a comment

The core functionality looked like it was in _infer_schemas_and_update_cll, which made enough sense, but I didn't really get a full picture of the process. I don't like how this logic is decently different from our CLL in other sources, but I'm guessing there are some key differences between the two.

metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py

+                              and should_fetch_target_node_schema
+                              and graph
+                          ):
+                              schema_metadata = graph.get_aspect(target_node_urn, SchemaMetadata)

Collaborator

asikowitz Nov 10, 2023

This isn't cached, right? Is it possible we end up querying this multiple times for the same urn? Could we do a bulk fetch instead?

Collaborator Author

hsheth2 Nov 13, 2023

it won't ever query for the same urn multiple times

metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py

+                          if self.config.include_column_lineage and sql_result:
+                              # We only save the debug info here. We'll report errors based on it later, after
+                              # applying the configured node filters.
+                              node.cll_debug_info = sql_result.debug_info

Collaborator

asikowitz Nov 10, 2023

Attaching lineage info to nodes is different from what we do in most other sources. How come you're doing it this way here?

Collaborator Author

hsheth2 Nov 13, 2023

lineage is determined by the view definition and the schemas of the upstreams. in most other sources, we have the schemas available, but in the case of dbt ephemeral models, we have to infer the schemas. that means we need to process the nodes in topological order, so it's easier to do it all in one go
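
For readers following along, here is a minimal sketch of why the ordering matters. It is not the code from this PR: the node names, the toy column mapping, and the `infer_schemas_and_cll` helper are all hypothetical, and real lineage comes from parsing each model's SQL rather than from a hand-written dict. The only point illustrated is that a node's schema can be inferred only after its upstreams' schemas are known, so nodes are visited upstream-first.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy project: each node maps
# output_column -> (upstream_node, upstream_column).
NODES = {
    "source.orders": {},  # a source: schema already known, nothing to infer
    "model.stg_orders": {
        "order_id": ("source.orders", "id"),
        "amount": ("source.orders", "amount"),
    },
    "model.fct_revenue": {
        "revenue": ("model.stg_orders", "amount"),
    },
}
KNOWN_SCHEMAS = {"source.orders": ["id", "amount"]}


def infer_schemas_and_cll(nodes, known_schemas):
    """Infer each node's columns and column-level lineage, upstreams first."""
    # TopologicalSorter expects node -> set of predecessors (its upstreams).
    deps = {name: {up for up, _ in cols.values()} for name, cols in nodes.items()}
    schemas = {name: list(cols) for name, cols in known_schemas.items()}
    lineage = []  # list of ((upstream_node, column), (node, column)) edges

    for name in TopologicalSorter(deps).static_order():
        if name in schemas:
            continue  # schema already available, no inference needed
        for out_col, (up, up_col) in nodes[name].items():
            # The upstream was visited earlier, so its schema must exist by now.
            if up_col not in schemas.get(up, []):
                raise ValueError(f"{name}.{out_col}: unknown column {up}.{up_col}")
            lineage.append(((up, up_col), (name, out_col)))
        # The inferred schema becomes available to downstream nodes.
        schemas[name] = list(nodes[name])
    return schemas, lineage


schemas, lineage = infer_schemas_and_cll(NODES, KNOWN_SCHEMAS)
for upstream, downstream in lineage:
    print(upstream, "->", downstream)
```

If `model.fct_revenue` were visited before `model.stg_orders` in this sketch, the lookup of `model.stg_orders.amount` would fail because that schema would not have been inferred yet, which is the situation the single topologically ordered pass avoids.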

hsheth2 added 5 commits

November 10, 2023 14:59


          Merge branch 'master' into dbt-cll

4daa7ae


          tweak incremental lineage impl

a5c5e4a


          refactor

184bc2f


          Merge branch 'master' into dbt-cll

13d71fd

fix

18cb777

vercel bot deployed to Preview

November 13, 2023 22:11

hsheth2 merged commit 19aa215 into datahub-project:master

51 checks passed

hsheth2 deleted the dbt-cll branch

November 14, 2023 00:00
