fix: restore column-vs-literal comparison in isGreaterThan/isLessThan family (#227) #273

Open
nikolauspschuetz wants to merge 1 commit into awslabs:master from nikolauspschuetz:fix/issue-227-comparator-literal-regression

Conversation

@nikolauspschuetz

Problem

isGreaterThanOrEqualTo (and the rest of the comparator family) regressed for column-vs-literal comparisons. This used to work:

check.isGreaterThanOrEqualTo("cluster_size", "1", hint="Cluster should have at least one element")

but now fails with the constraint message "Input data does not include column 1!" because the second operand is interpreted strictly as a column name. Reported in #227.

Root cause

The regression arrived with the bundled Deequ jar upgrade to 2.0.x (PyDeequ is pinned to com.amazon.deequ:deequ:2.0.8-... via pydeequ/configs.py). In Deequ 1.2.x the comparators built a plain SQL predicate ("<colA> >= <colB>") and passed no columns list, so a literal second operand worked. In Deequ 2.0.x the comparators pass columns = List(columnA, columnB), and Deequ validates that every entry is a real dataframe column, so literals fail. The PyDeequ wrappers themselves never changed; they simply forward to the regressed Scala methods.
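
The failure mode can be modelled in a few lines. This is a toy sketch, not Deequ's code: `check_required_columns` is a hypothetical stand-in for the column-existence check that 2.0.x applies to the `columns` list, and the error text is copied from the message quoted in this PR.

```python
def check_required_columns(required, schema_columns):
    """Toy stand-in for a 'column must exist' precondition (hypothetical helper).

    Returns None when every required name exists in the schema, otherwise an
    error message in the style reported in #227.
    """
    for name in required:
        if name not in schema_columns:
            return f"Input data does not include column {name}!"
    return None


schema = ["cluster_id", "cluster_size"]

# 2.0.x-style call: both operands are declared as columns,
# so the literal "1" is looked up as a column name and rejected.
error_2x = check_required_columns(["cluster_size", "1"], schema)

# 1.2.x-style call: no columns are declared, so nothing is rejected up front
# and only the SQL predicate itself would be evaluated.
error_1x = check_required_columns([], schema)

print(error_2x)
print(error_1x)
```

Passing an empty list sidesteps the lookup entirely, which is why the fix routes through `satisfies` with no declared columns.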

Fix

Route the whole comparator family (isLessThan, isLessThanOrEqualTo, isGreaterThan, isGreaterThanOrEqualTo) through Deequ's public satisfies(...) with an empty columns list — exactly the pre-2.0 behavior. Column-vs-column comparisons are unchanged; column-vs-literal/expression works again. Applied across the entire family for consistency.

Note: as with Deequ's own satisfies, columnB is treated as a Spark SQL expression, so string literals must be quoted by the caller (e.g. "'foo'"). This matches the original 1.2.x semantics.
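
As a rough illustration of the predicate the wrapper hands to `satisfies`, here is a minimal sketch. The helper name `comparison_predicate` and the operator whitelist are assumptions made for this example; the backtick quoting of column A follows the format described later in this thread, and the actual code in pydeequ/checks.py may differ.

```python
def comparison_predicate(column_a, operator, column_b):
    """Build a SQL predicate string for a comparator check (illustrative helper).

    column_a is backtick-quoted as an identifier. column_b is inserted verbatim
    because it may be a column name, a SQL literal, or a SQL expression.
    """
    allowed = {"<", "<=", ">", ">="}
    if operator not in allowed:
        raise ValueError(f"unsupported operator: {operator!r}")
    return f"`{column_a}` {operator} {column_b}"


# Column vs. numeric literal (the case from #227).
numeric = comparison_predicate("cluster_size", ">=", "1")

# Column vs. column: the same predicate shape as before.
columns = comparison_predicate("att2", "<", "att1")

# Column vs. string literal: the caller supplies the SQL quotes.
string = comparison_predicate("name", ">=", "'foo'")

print(numeric)
print(columns)
print(string)
```

Because column B is passed through untouched, an unquoted `foo` would be read as a column reference rather than a string, which is the reason for the quoting note above.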

Tests

Added to tests/test_checks.py:

  • test_comparator_against_literal — column-vs-literal for all four comparators, expecting Success.
  • test_fail_comparator_against_literal — a failing literal comparison, expecting Failure.

Validated against real Spark 3.5 / Deequ 2.0.8: the 2 new tests plus all 8 existing column-vs-column comparator tests pass (10 passed). CI will exercise the full pyspark 3.1/3.2/3.3/3.5 matrix.

Closes #227

@github-actions github-actions Bot left a comment

Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: 416310f3) — may not be fully accurate. Reply if this doesn't help.

Comment thread pydeequ/checks.py Outdated
Comment thread pydeequ/checks.py Outdated
Comment thread pydeequ/checks.py
@nikolauspschuetz nikolauspschuetz marked this pull request as ready for review June 27, 2026 17:30
@nikolauspschuetz

Author

Ready for review. Restores column-vs-literal comparison across the isGreaterThan*/isLessThan* family — a regression that rode in with the Deequ 2.0.x jar (it now routes through satisfies with an empty columns list, the pre-2.0 behavior). Validated locally on Spark 3.5: full test_checks.py → 88 passed. cc @sudsali @chenliu0831 — would appreciate your review. Closes #227.

…ly (awslabs#227)

Deequ 2.0.x's Check.isLessThan/isLessThanOrEqualTo/isGreaterThan/
isGreaterThanOrEqualTo forward columns = List(columnA, columnB) to
satisfies, which makes Deequ require both operands to be existing
columns. This regressed the long-supported column-vs-literal usage
(e.g. isGreaterThanOrEqualTo("cluster_size", "1")), failing with
'Input data does not include column 1!' (issue awslabs#227).

Route the comparator family through Deequ's satisfies with an empty
columns list (the pre-2.0 behaviour), building the SQL predicate in the
wrapper. columnB may now be a column name or a SQL literal/expression.
Column-vs-column comparisons are unchanged.

Adds regression tests for column-vs-literal comparisons.
@nikolauspschuetz nikolauspschuetz force-pushed the fix/issue-227-comparator-literal-regression branch from 50fc96c to 83c28e2 Compare June 27, 2026 17:41

@github-actions github-actions Bot left a comment

Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: 416310f3) — may not be fully accurate. Reply if this doesn't help.

Comment thread pydeequ/checks.py
@nikolauspschuetz

Author

Thanks for the automated review pass. The current revision addresses these findings:

  • columnA quoting: the generated predicate backtick-quotes column A (`{columnA}` {operator} {columnB}), so column-A names with spaces, special characters, or SQL reserved words stay valid.
  • columnB left raw (intentional): restoring the column-vs-literal usage from #227 ("Regression in behavior of check comparator function isGreaterThanOrEqualTo") is the whole point. columnB may be a column name, a SQL literal, or a SQL expression, so quoting is the caller's responsibility, the same contract as Deequ's own satisfies. This is documented in the _column_comparison docstring.
  • Default assertion: now routed through satisfies$default$3 (_ == 1.0), which is the correct default now that the comparator family genuinely goes through satisfies; it matches each comparator's own Deequ default.

Resolving the threads accordingly.
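
For readers unfamiliar with why `_ == 1.0` is the natural default here: as I understand it, `satisfies` measures the fraction of rows for which the predicate holds, and the assertion is applied to that fraction. The following pure-Python sketch models that idea only; `compliance` and `default_assertion` are hypothetical names, and none of this is Deequ's implementation.

```python
def compliance(rows, predicate):
    """Fraction of rows for which predicate(row) is true (toy model)."""
    if not rows:
        raise ValueError("this sketch does not define compliance for empty input")
    matching = sum(1 for row in rows if predicate(row))
    return matching / len(rows)


def default_assertion(value):
    """Model of the `_ == 1.0` default: every row must satisfy the predicate."""
    return value == 1.0


rows = [{"cluster_size": 1}, {"cluster_size": 3}, {"cluster_size": 0}]

# All three rows satisfy cluster_size >= 0, so the default assertion holds.
at_least_zero = compliance(rows, lambda row: row["cluster_size"] >= 0)

# Only two of three rows satisfy cluster_size >= 1, so the default assertion fails.
at_least_one = compliance(rows, lambda row: row["cluster_size"] >= 1)

print(default_assertion(at_least_zero))
print(default_assertion(at_least_one))
```

Under this model a single violating row is enough to fail the check, which matches the strictness one would expect from a comparator such as isGreaterThanOrEqualTo.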
