[FEATURE]: Documentation: Query Cookbook #1086

@mwojtyczka

Description

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

DQX now documents rule tracking via fingerprints and table schemas (Table Schemas and Relationships, quality checks storage, summary metrics, detailed results, and how to identify which checks ran). These are great resources, but there is no single, practical guide for how to query these tables for common lineage and analysis tasks.

Users need to be able to:

  • Navigate between layers: Summary metrics → row-level results → checks table (and back).
  • Identify failing checks: For a given run, determine which checks failed and on which rows.
  • Analyze runs: Track runs over time, drill down from summary to row details, and see which rules were applied.
  • Trace failures: From a row with errors, get the exact check definition (function, arguments, filter) that failed.

Today this information is spread across several docs, with partial examples and no unified “query cookbook.” Users must infer join keys and patterns from schema descriptions and scattered snippets.

Proposed Solution

Add a Lineage Query Cookbook (or similarly named) page to the docs guide, with SQL and PySpark examples for common lineage tasks.

Suggested content

  1. Summary → row details
    • Filter output/quarantine by run_id from the summary table.
    • Explode _errors and _warnings and filter by run_id.
  2. Row details → summary
    • Join output/quarantine to the summary table on run_id to get run metadata (input_location, checks_location, run_time, etc.).
  3. Row details → checks table
    • Join exploded _errors/_warnings to the checks table on rule_fingerprint and rule_set_fingerprint to get the full check definition for each failure.
  4. Summary → checks table
    • Use checks_location and rule_set_fingerprint to load or join the rule set used for a run.
  5. Which checks failed for a run
    • From summary: get run_id → filter output/quarantine → explode errors → join to checks table for definitions.
  6. Which rules were run
    • From summary: use checks_location and rule_set_fingerprint to load the rule set, or join checks table on rule_set_fingerprint.
  7. Run tracking over time
    • Query summary table by run_time, input_location, output_location; pivot metrics; compute error/warning rates.
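To make the join keys in patterns 3 and 5 concrete, here is a minimal, self-contained sketch of the "which checks failed for a run" drill-down. SQLite stands in for Spark SQL so the snippet is runnable anywhere; the table names (`summary`, `results`, `checks`) and column shapes are hypothetical stand-ins for the DQX schemas, with `results` representing the already-exploded `_errors` entries. The cookbook itself would show the real table names and the PySpark equivalent.

```python
import sqlite3

# Illustrates pattern 5: summary -> row-level failures -> check definitions.
# Column names (run_id, rule_fingerprint, rule_set_fingerprint, ...) follow
# the fields mentioned in the schema docs; everything else is illustrative.
con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.executescript("""
CREATE TABLE summary (run_id TEXT, input_location TEXT,
                      checks_location TEXT, run_time TEXT);
CREATE TABLE results (run_id TEXT, row_id INTEGER,
                      rule_fingerprint TEXT, rule_set_fingerprint TEXT);
CREATE TABLE checks  (rule_fingerprint TEXT, rule_set_fingerprint TEXT,
                      function TEXT, arguments TEXT, check_filter TEXT);

INSERT INTO summary VALUES ('run-1', 'main.raw.orders', 'main.dq.checks', '2026-01-01');
INSERT INTO results VALUES ('run-1', 42, 'fp-a', 'rs-1');
INSERT INTO checks  VALUES ('fp-a', 'rs-1', 'is_not_null',
                            '{"col_name": "customer_id"}', NULL);
""")

# Start from the summary row for the run, drill into failing rows, then join
# to the checks table on BOTH fingerprints to recover the full definition.
rows = cur.execute("""
SELECT r.row_id, c.function, c.arguments
FROM summary s
JOIN results r ON r.run_id = s.run_id
JOIN checks  c ON c.rule_fingerprint = r.rule_fingerprint
              AND c.rule_set_fingerprint = r.rule_set_fingerprint
WHERE s.run_id = 'run-1'
""").fetchall()
print(rows)  # [(42, 'is_not_null', '{"col_name": "customer_id"}')]
```

Joining on both `rule_fingerprint` and `rule_set_fingerprint` is the important detail: the same rule fingerprint can appear in multiple rule sets, so the pair uniquely identifies the check definition that actually ran.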

Format

  • Each task as a short section with a clear goal.
  • Copy-pasteable SQL and PySpark snippets.
  • Brief notes on join keys and edge cases (e.g. custom error/warning column names).
  • Cross-links to the schema reference and related guides.

Placement

Acceptance criteria

  • New doc page with at least the 7 query patterns above.
  • Each pattern has SQL and/or PySpark examples.

  • Page is linked from the related docs pages.
  • Examples use realistic table and column names.

Additional Context

Some of these queries could potentially be offered as part of the DQX library itself.

    Labels

    Q2 2026 · enhancement (New feature or request) · good first issue (Good for newcomers or if you are looking at a small ticket)
