-
Notifications
You must be signed in to change notification settings - Fork 3.3k
fix(ingest): Handle empty column names in SQL parsing column lineage #15013
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(ingest): Handle empty column names in SQL parsing column lineage #15013
Conversation
| aggregator.add_known_query_lineage(known_query_lineage) | ||
|
|
||
| # This should not raise an error even with empty string columns | ||
| mcps = list(aggregator.gen_metadata()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to use golden files as it is easier to understand what is going on.
Check out the test above which uses it ->
mcpws = [mcp for mcp in aggregator.gen_metadata()]
lineage_mcpws = [mcpw for mcpw in mcpws if mcpw.aspectName == "upstreamLineage"]
out_path = tmp_path / "mcpw.json"
write_metadata_file(out_path, lineage_mcpws)
mce_helpers.check_golden_file(
pytestconfig,
out_path,
pytestconfig.rootpath
/ "tests/unit/sql_parsing/aggregator_goldens/test_diamond_problem_golden.json",
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the comment. Revised.
87e9c03 to
834152b
Compare
Bundle ReportChanges will increase total bundle size by 19.29kB (0.07%) ⬆️. This is within the configured threshold ✅ Detailed changes
Affected Assets, Files, and Routes:view changes for bundle: datahub-react-web-esmAssets Changed:
|
metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Outdated
Show resolved
Hide resolved
|
Very nice and thoughtful change @kyungsoo-datahub , just shared couple of minor comments. Thank you. |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
skrydal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work, I am approving now. Maybe we should merge it now, if there are more cases, we can simply create a new PR. Up to you.
Change - Skip column lineage entries with empty column names to prevent errors during lineage generation - Skip lineage entries with empty or whitespace-only downstream column names before processing - Add comprehensive unit test to verify the fix handles various edge cases Problem A query metadata can sometimes include empty string column names in the column lineage. This causes errors when constructing SchemaFieldUrn objects, breaking the entire lineage generation process. Fix - Only process column lineage when upstream_ref.column is not empty Testing - Added unit test with three scenarios covering edge cases
Clarify use of KnownQueryLineageInfo over ObservedQuery to avoid mocking _run_sql_parser() for testing empty column names from external systems.
31cc611 to
78d0456
Compare
Change
Problem
A query metadata can sometimes include empty string column names in the column lineage. This causes errors when constructing SchemaFieldUrn objects, breaking the entire lineage generation process.
Fix
Testing