Is there an existing issue for this?
Problem statement
DQX now documents rules tracking via fingerprints and table schemas (Table Schemas and Relationships, quality checks storage, summary metrics, detailed results, and a way to identify which checks ran). These are great resources, but there is no single, practical guide for how to query these tables for common lineage and analysis tasks.
Users need to be able to:
- Navigate between layers: Summary metrics → row-level results → checks table (and back).
- Identify failing checks: For a given run, determine which checks failed and on which rows.
- Analyze runs: Track runs over time, drill down from summary to row details, and see which rules were applied.
- Trace failures: From a row with errors, get the exact check definition (function, arguments, filter) that failed.
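As a rough illustration of the last point, a query along these lines could trace a failing row back to its check definition. This is only a sketch: the table names (`output_table`, `checks_table`) and the exact shape of the `_errors` column are hypothetical placeholders here, which is exactly why a cookbook spelling out the real join keys would help.

```sql
-- Hypothetical table and column names; the real schemas are in the DQX docs.
-- Explode the _errors entries on each failing row, then join to the checks
-- table on the rule fingerprints to recover the full check definition.
SELECT
  r.run_id,
  err.rule_fingerprint,
  c.check_function,
  c.check_arguments,
  c.filter
FROM my_catalog.my_schema.output_table AS r
LATERAL VIEW explode(r._errors) e AS err
JOIN my_catalog.my_schema.checks_table AS c
  ON c.rule_fingerprint = err.rule_fingerprint
 AND c.rule_set_fingerprint = err.rule_set_fingerprint
WHERE r.run_id = '<run-id>';
```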
Today this information is spread across several docs, with partial examples and no unified “query cookbook.” Users must infer join keys and patterns from schema descriptions and scattered snippets.
Proposed Solution
Add a "Lineage Query Cookbook" (or similarly named) page to the documentation guide, with SQL and PySpark examples for common lineage tasks.
Suggested content
- Summary → row details
- Filter output/quarantine by run_id from the summary table.
- Explode _errors and _warnings and filter by run_id.
- Row details → summary
- Join output/quarantine to the summary table on run_id to get run metadata (input_location, checks_location, run_time, etc.).
- Row details → checks table
- Join exploded _errors/_warnings to the checks table on rule_fingerprint and rule_set_fingerprint to get the full check definition for each failure.
- Summary → checks table
- Use checks_location and rule_set_fingerprint to load or join the rule set used for a run.
- Which checks failed for a run
- From summary: get run_id → filter output/quarantine → explode errors → join to checks table for definitions.
- Which rules were run
- From summary: use checks_location and rule_set_fingerprint to load the rule set, or join checks table on rule_set_fingerprint.
- Run tracking over time
- Query summary table by run_time, input_location, output_location; pivot metrics; compute error/warning rates.
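For the run-tracking pattern, a sketch of the kind of query the cookbook could include (the summary table name and metric column names below are assumptions, not the actual DQX schema):

```sql
-- Hypothetical summary table and column names; adjust to the real schema.
-- Track error/warning rates per run over time for one input dataset.
SELECT
  run_time,
  run_id,
  input_location,
  error_count,
  warning_count,
  ROUND(error_count / input_row_count, 4)   AS error_rate,
  ROUND(warning_count / input_row_count, 4) AS warning_rate
FROM my_catalog.my_schema.dq_summary
WHERE input_location = 'main.sales.orders'
ORDER BY run_time DESC;
```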
Format
- Each task as a short section with a clear goal.
- Copy-pasteable SQL and PySpark snippets.
- Brief notes on join keys and edge cases (e.g. custom error/warning column names).
- Cross-links to the schema reference and related guides.
Placement
Acceptance criteria
- New doc page covering at least the 7 query patterns above.
- Each pattern has SQL and/or PySpark examples.
- Page is linked from the schema reference and related guide pages.
- Examples use realistic table and column names.
Additional Context
Some of these query patterns could eventually be offered as helpers in the DQX library itself.