Add strict NotEqualTo/NotIn null and NaN tests #3547
This looks like the same fix as #3521.
  should_read = _StrictMetricsEvaluator(strict_data_file_schema, NotIn("some_nulls", {"abc", "def"})).eval(strict_data_file_1)
- assert should_read, "Should match: notIn on some nulls column, 'bbb' > 'abc' and 'bbb' < 'def'"
+ assert not should_read, "Should not match: mixed-null notIn cannot be proven when bounds are missing"
- some_nulls has value_count = 50 and null_count = 10.
- Its field id is 5 in strict_data_file_schema.
So 40 values are non-null, but without lower/upper bounds the strict evaluator cannot rule out "abc" or "def" among them.
Before the fix, “column can contain nulls” incorrectly caused ROWS_MUST_MATCH; now only “column contains nulls only” can do that.
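The difference between "can contain nulls" and "contains nulls only" can be sketched with a small toy model. This is not PyIceberg's _StrictMetricsEvaluator: the function names strict_not_in and contains_nulls_only, the flat argument list, and the boolean constants are assumptions made for illustration only.

```python
from typing import Iterable, Optional

ROWS_MUST_MATCH = True
ROWS_MIGHT_NOT_MATCH = False


def contains_nulls_only(value_count: Optional[int], null_count: Optional[int]) -> bool:
    """True only when the counts prove that every value in the column is null."""
    return value_count is not None and null_count is not None and value_count == null_count


def strict_not_in(value_count, null_count, lower, upper, literals: Iterable) -> bool:
    """Hypothetical helper: can NOT IN be proven for every row of a file?"""
    # An all-null column holds none of the literals, so every row matches.
    if contains_nulls_only(value_count, null_count):
        return ROWS_MUST_MATCH
    # Without bounds nothing can be proven about the non-null values.
    if lower is None or upper is None:
        return ROWS_MIGHT_NOT_MATCH
    # Every literal lies outside [lower, upper], so no non-null value equals one.
    if all(lit < lower or lit > upper for lit in literals):
        return ROWS_MUST_MATCH
    return ROWS_MIGHT_NOT_MATCH


# some_nulls: 50 values, 10 nulls, no bounds recorded.
print(strict_not_in(50, 10, None, None, {"abc", "def"}))  # prints False
```

With value_count = 50 and null_count = 10 the all-null shortcut does not apply, and the missing bounds leave the result at ROWS_MIGHT_NOT_MATCH.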
fa29f84 to 43860b7
)

should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5)).eval(data_file)
assert should_read == ROWS_MIGHT_NOT_MATCH, "Should not match: bounds prove the non-null value is 5"
value_count = 2 and null_count = 1, so there is 1 non-null value. Bounds [5..5] mean the non-null value is 5, so NotEqualTo("x", 5) / NotIn("x", {5, 6}) cannot match every row.
This feels like something that would be great as a code comment :)
I think this is restating the logic in the test
nan_value_counts=None,
)

should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5)).eval(data_file)
value_count = 2 and null_count = 2, so all values are null. That means every row matches NotEqualTo("x", 5) / NotIn("x", {5, 6}).
upper_bounds={1: to_bytes(field_type, 5.0)},
)

should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5.0)).eval(data_file)
value_count = 2 and nan_count = 1, so there is 1 non-NaN value. Bounds [5.0..5.0] mean the non-NaN value is 5.0, so NotEqualTo("x", 5.0) / NotIn("x", {5.0, 6.0}) cannot match every row.
Weird, my comments didn't show up in the right place? Anyways, you wrote out some really useful explanations here and we should have these as code comments.
nan_value_counts={1: 2},
)

should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5.0)).eval(data_file)
value_count = 2 and nan_count = 2, so all values are NaN. That means every row matches NotEqualTo("x", 5.0) / NotIn("x", {5.0, 6.0}).
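The NaN cases follow the same shape as the null cases, and a toy version may make the two outcomes easier to compare. The helper strict_not_eq_float and its flat arguments are hypothetical and do not mirror PyIceberg's API; treating NaN bounds as "no information" is also an assumption of this sketch.

```python
import math
from typing import Optional

ROWS_MUST_MATCH = True
ROWS_MIGHT_NOT_MATCH = False


def strict_not_eq_float(
    value_count: Optional[int],
    nan_count: Optional[int],
    lower: Optional[float],
    upper: Optional[float],
    literal: float,
) -> bool:
    """Hypothetical helper: can x != literal be proven for every row of a file?"""
    # All values are NaN, and NaN never equals the literal, so every row matches.
    if value_count is not None and nan_count is not None and value_count == nan_count:
        return ROWS_MUST_MATCH
    # Missing bounds prove nothing about the non-NaN values.
    if lower is None or upper is None:
        return ROWS_MIGHT_NOT_MATCH
    # Assumption: NaN bounds carry no ordering information.
    if math.isnan(lower) or math.isnan(upper):
        return ROWS_MIGHT_NOT_MATCH
    # The literal lies outside [lower, upper], so no non-NaN value can equal it.
    if literal < lower or literal > upper:
        return ROWS_MUST_MATCH
    return ROWS_MIGHT_NOT_MATCH


# One NaN plus one value pinned to 5.0 by the bounds: the predicate cannot hold for every row.
print(strict_not_eq_float(2, 1, 5.0, 5.0, 5.0))  # prints False
# Both values are NaN: every row matches.
print(strict_not_eq_float(2, 2, None, None, 5.0))  # prints True
```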
Thanks! I didn't see that one.
43860b7 to 104dfda
assert not should_read, "Should not match: no_nulls field does not have bounds"


def test_strict_not_eq_partial_nulls_within_bounds() -> None:
Removed test_strict_not_eq_partial_nulls_within_bounds because the new branch-focused tests cover the same partial-null boundary more directly, and also add the missing all-null, mixed-NaN, and all-NaN cases. The removed test also used singleton NotIn("x", {5}), which normalizes to NotEqualTo("x", 5) and therefore did not exercise the strict visit_not_in branch.
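To illustrate why a singleton NotIn never reaches the not-in branch, here is a toy normalization step. The function normalize_not_in and its tuple-based return values are hypothetical stand-ins for the expression classes, not the library's actual constructors.

```python
from typing import Hashable, Iterable, Tuple


def normalize_not_in(term: str, literals: Iterable[Hashable]) -> Tuple:
    """Hypothetical normalizer: collapse a one-element NOT IN into a not-equal node."""
    values = frozenset(literals)
    if len(values) == 1:
        (only,) = values
        # A visitor dispatching on the node kind would take its not-equal branch here.
        return ("not_eq", term, only)
    # Only a set with two or more literals is still a not-in node for the visitor.
    return ("not_in", term, values)


print(normalize_not_in("x", {5})[0])     # prints not_eq
print(normalize_not_in("x", {5, 6})[0])  # prints not_in
```

Under this model, a test written with {5} only ever covers the not-equal path, which is why the new tests use two literals such as {5, 6}.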
104dfda to a595909
rambleraptor left a comment
The tests look great!
Summary
Follow-up to #3521 for #3498.
This PR keeps the implementation from #3521 and tightens the strict metrics regression coverage:
- NotIn literals so the tests exercise visit_not_in instead of normalizing to NotEqualTo.

Why
#3521 fixed the strict metrics over-pruning bug by only short-circuiting negative predicates when a column contains only nulls or only NaNs. These tests lock in that boundary: mixed null/NaN counts must continue to bounds checks, while all-null/all-NaN counts can still prove ROWS_MUST_MATCH.

Java Parity
This follows Java StrictMetricsEvaluator, where negative predicates short-circuit only for all-null/all-NaN columns:
- notEq
- notIn

Validation
- UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run pytest tests/expressions/test_evaluator.py -k "mixed_nulls_and_matching_bounds or all_nulls or mixed_nans_and_matching_bounds or all_nans or strict_integer_not_in"
- UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run pytest tests/expressions/test_evaluator.py
- UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run prek run -a
- git diff --check