Contributor

@kyungsoo-datahub kyungsoo-datahub commented Oct 15, 2025

Change

  • Skip column lineage entries with empty column names to prevent errors during lineage generation
  • Skip lineage entries with empty downstream column names before processing
  • Add unit test to verify the fix handles edge cases

Problem

Query metadata can sometimes include empty-string column names in its column lineage. Constructing a SchemaFieldUrn object from such a name raises an error, which breaks the entire lineage generation process.

Fix

  • Only process column lineage when upstream_ref.column is not empty
  • Only process column lineage when downstream column is not empty
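
To illustrate the two guards above, here is a minimal, self-contained sketch. It does not use the DataHub classes: `ColumnRef`, `ColumnLineage`, and `drop_empty_columns` are hypothetical stand-ins invented for this example, and dropping an entry whose upstreams all turn out to be empty is an assumption of the sketch, not something this PR states.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnRef:
    # Hypothetical stand-in for an upstream column reference.
    table: str
    column: str


@dataclass
class ColumnLineage:
    # Hypothetical stand-in for one column lineage entry.
    downstream_column: Optional[str]
    upstreams: List[ColumnRef]


def drop_empty_columns(entries: List[ColumnLineage]) -> List[ColumnLineage]:
    """Keep only entries and upstream refs whose column names are non-empty."""
    cleaned: List[ColumnLineage] = []
    for entry in entries:
        # Guard 1: skip the whole entry when the downstream column is
        # missing or an empty string.
        if not entry.downstream_column:
            continue
        # Guard 2: drop upstream refs whose column name is empty.
        upstreams = [ref for ref in entry.upstreams if ref.column]
        # Assumption of this sketch: an entry with no usable upstreams
        # carries no lineage, so it is dropped as well.
        if upstreams:
            cleaned.append(ColumnLineage(entry.downstream_column, upstreams))
    return cleaned
```

Filtering before any URN is built means a single malformed entry is skipped instead of aborting lineage generation for the whole query.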

Testing

  • Added unit test with scenarios covering all empty column cases

@github-actions github-actions bot added the ingestion (PR or Issue related to the ingestion of metadata) and community-contribution (PR or Issue raised by member(s) of DataHub Community) labels Oct 15, 2025
@datahub-cyborg datahub-cyborg bot added the needs-review (Label for PRs that need review from a maintainer.) label Oct 15, 2025
aggregator.add_known_query_lineage(known_query_lineage)

# This should not raise an error even with empty string columns
mcps = list(aggregator.gen_metadata())
Contributor

I think it would be better to use golden files here, since they make it easier to understand what is going on.

See the test above, which uses this approach:

# Collect the generated metadata and keep only the upstreamLineage aspects.
mcpws = [mcp for mcp in aggregator.gen_metadata()]
lineage_mcpws = [mcpw for mcpw in mcpws if mcpw.aspectName == "upstreamLineage"]
out_path = tmp_path / "mcpw.json"
write_metadata_file(out_path, lineage_mcpws)

# Compare the written output against the checked-in golden file.
mce_helpers.check_golden_file(
    pytestconfig,
    out_path,
    pytestconfig.rootpath
    / "tests/unit/sql_parsing/aggregator_goldens/test_diamond_problem_golden.json",
)
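
For readers unfamiliar with the pattern, the idea behind a golden-file check can be sketched with the standard library alone. This is not how `mce_helpers.check_golden_file` is implemented; `check_against_golden` and its `update` flag are hypothetical names used only for illustration.

```python
import json
from pathlib import Path
from typing import Any


def check_against_golden(actual: Any, golden_path: Path, update: bool = False) -> None:
    """Compare JSON-serializable output with a stored golden file."""
    if update:
        # Explicit refresh: record the current output as the new golden file.
        golden_path.write_text(json.dumps(actual, indent=2, sort_keys=True))
        return
    # Raises FileNotFoundError if the golden file has never been recorded.
    expected = json.loads(golden_path.read_text())
    assert actual == expected, f"Output differs from golden file {golden_path}"
```

Because the expected output lives in a reviewable file, a change in generated lineage shows up as a diff in the golden file instead of being hidden inside assertions.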

Contributor Author

Thank you for the comment. Revised.

@datahub-cyborg datahub-cyborg bot added the pending-submitter-response (Issue/request has been reviewed but requires a response from the submitter) and needs-review labels, and removed the needs-review and pending-submitter-response labels Oct 16, 2025
@kyungsoo-datahub kyungsoo-datahub force-pushed the ks--fix-empty-column-failure branch from 87e9c03 to 834152b Compare October 18, 2025 18:31
@datahub-cyborg datahub-cyborg bot added the pending-submitter-response label and removed the needs-review label Oct 18, 2025

codecov bot commented Oct 20, 2025

Bundle Report

Changes will increase total bundle size by 19.29kB (0.07%) ⬆️. This is within the configured threshold ✅

Detailed changes

Bundle name              Size      Change
datahub-react-web-esm    28.59MB   19.29kB (0.07%) ⬆️

Affected assets, files, and routes for bundle datahub-react-web-esm:

Asset name           Size change   Total size   Change (%)
assets/index-*.js    19.29kB       18.93MB      0.1%

Collaborator

skrydal commented Oct 20, 2025

Very nice and thoughtful change, @kyungsoo-datahub. I just shared a couple of minor comments. Thank you.


codecov bot commented Oct 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

@datahub-cyborg datahub-cyborg bot added the needs-review and pending-submitter-response labels, and removed the pending-submitter-response and needs-review labels Oct 20, 2025
@skrydal skrydal self-requested a review October 21, 2025 12:41
Collaborator

@skrydal skrydal left a comment

Awesome work, I am approving now. We could merge it as is; if more cases turn up, we can simply create a new PR. Up to you.

@datahub-cyborg datahub-cyborg bot removed the needs-review label Oct 21, 2025
@datahub-cyborg datahub-cyborg bot added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Oct 21, 2025
Kyungsoo Lee and others added 3 commits October 21, 2025 11:00
Change

- Skip column lineage entries with empty column names to prevent errors during lineage generation
- Skip lineage entries with empty or whitespace-only downstream column names before processing
- Add comprehensive unit test to verify the fix handles various edge cases

Problem

Query metadata can sometimes include empty-string column names in its column lineage.
Constructing a SchemaFieldUrn object from such a name raises an error, which breaks
the entire lineage generation process.

Fix

- Only process column lineage when upstream_ref.column is not empty

Testing

- Added unit test with three scenarios covering edge cases

Clarify use of KnownQueryLineageInfo over ObservedQuery to avoid mocking
_run_sql_parser() for testing empty column names from external systems.
@kyungsoo-datahub kyungsoo-datahub force-pushed the ks--fix-empty-column-failure branch from 31cc611 to 78d0456 Compare October 21, 2025 18:00
@kyungsoo-datahub kyungsoo-datahub enabled auto-merge (squash) October 21, 2025 18:03
@kyungsoo-datahub kyungsoo-datahub merged commit eecc2e9 into datahub-project:master Oct 22, 2025
57 checks passed
Labels

  • community-contribution: PR or Issue raised by member(s) of DataHub Community
  • ingestion: PR or Issue related to the ingestion of metadata
  • merge-pending-ci: A PR that has passed review and should be merged once CI is green.

3 participants