Add strict NotEqualTo/NotIn null and NaN tests by kevinjqliu · Pull Request #3547 · apache/iceberg-python

kevinjqliu · 2026-06-22T01:15:59Z

Summary

Follow-up to #3521 for #3498.

This PR keeps the implementation from #3521 and tightens the strict metrics regression coverage:

Replace the singleton partial-null regression test with branch-focused mixed-null and all-null cases.
Add equivalent mixed-NaN and all-NaN coverage for float and double columns.
Use multi-value NotIn literals so the tests exercise visit_not_in instead of normalizing to NotEqualTo.

Why

#3521 fixed the strict metrics over-pruning bug by only short-circuiting negative predicates when a column contains only nulls or only NaNs. These tests lock in that boundary: mixed null/NaN counts must continue to bounds checks, while all-null/all-NaN counts can still prove ROWS_MUST_MATCH.
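The boundary described above can be sketched as a small standalone model. This is a simplified illustration, not the pyiceberg implementation: `strict_not_equal` and its parameters are hypothetical names, bounds are plain Python values rather than serialized bytes, and the handling of NaN bounds is left out.

```python
from typing import Optional

# Assumed result constants for this toy model only.
ROWS_MUST_MATCH = True
ROWS_MIGHT_NOT_MATCH = False


def strict_not_equal(
    value_count: int,
    null_count: int,
    nan_count: int,
    lower: Optional[float],
    upper: Optional[float],
    literal: float,
) -> bool:
    """Toy model: can file metrics alone prove every row satisfies `col != literal`?"""
    # Only an all-null or all-NaN column may short-circuit.
    if value_count > 0 and null_count == value_count:
        return ROWS_MUST_MATCH
    if value_count > 0 and nan_count == value_count:
        return ROWS_MUST_MATCH
    # Mixed null/NaN columns continue to the bounds check.
    if lower is None or upper is None:
        return ROWS_MIGHT_NOT_MATCH
    # The literal lies outside [lower, upper], so no non-null value equals it.
    if literal < lower or literal > upper:
        return ROWS_MUST_MATCH
    return ROWS_MIGHT_NOT_MATCH


# Mixed nulls with bounds [5..5]: the non-null value may be 5.
print(strict_not_equal(2, 1, 0, 5, 5, 5))
# All nulls: every row trivially satisfies the negative predicate.
print(strict_not_equal(2, 2, 0, None, None, 5))
```

The ordering is the point of the sketch: the "only nulls" and "only NaNs" checks come first, and anything else is decided by bounds.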

Java Parity

This follows Java StrictMetricsEvaluator, where negative predicates short-circuit only for all-null/all-NaN columns.

Validation

UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run pytest tests/expressions/test_evaluator.py -k "mixed_nulls_and_matching_bounds or all_nulls or mixed_nans_and_matching_bounds or all_nans or strict_integer_not_in"
UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run pytest tests/expressions/test_evaluator.py
UV_CACHE_DIR=.cache/uv PYTHON_GIL=1 PYTHONPATH=. uv run prek run -a
git diff --check

rambleraptor · 2026-06-22T01:24:31Z

This looks like the same fix as #3521

kevinjqliu · 2026-06-22T01:27:12Z


    should_read = _StrictMetricsEvaluator(strict_data_file_schema, NotIn("some_nulls", {"abc", "def"})).eval(strict_data_file_1)
-    assert should_read, "Should match: notIn on some nulls column, 'bbb' > 'abc' and 'bbb' < 'def'"
+    assert not should_read, "Should not match: mixed-null notIn cannot be proven when bounds are missing"


some_nulls has value_count = 50 and null_count = 10

Its field id is 5 in strict_data_file_schema

So 40 values are non-null, but without lower/upper bounds the strict evaluator cannot rule out "abc" or "def" among them.

Before the fix, “column can contain nulls” incorrectly caused ROWS_MUST_MATCH; now only “column contains nulls only” can do that.
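To make the some_nulls reasoning concrete, here is a rough standalone model of a strict NotIn check. It is a sketch under assumptions, not the evaluator's code: `strict_not_in` is a hypothetical helper, bounds are plain strings or numbers, and the bounded call at the end uses made-up bounds purely for contrast.

```python
from typing import Iterable, Optional


def strict_not_in(
    value_count: int,
    null_count: int,
    lower: Optional[object],
    upper: Optional[object],
    literals: Iterable[object],
) -> bool:
    """Toy model: True only if metrics prove no row holds any of `literals`."""
    # A column containing only nulls cannot hold any literal.
    if value_count > 0 and null_count == value_count:
        return True
    remaining = set(literals)
    if lower is not None:
        # Literals below the lower bound cannot occur in the file.
        remaining = {v for v in remaining if v >= lower}
        if not remaining:
            return True
    if upper is not None:
        # Literals above the upper bound cannot occur either.
        remaining = {v for v in remaining if v <= upper}
        if not remaining:
            return True
    # Some literal is still possible, or there are no bounds to rule it out.
    return False


# some_nulls: 50 values, 10 nulls, no bounds, so nothing can be proven.
print(strict_not_in(50, 10, None, None, {"abc", "def"}))
# Hypothetical bounds ["bbb".."bbb"] would exclude both literals.
print(strict_not_in(50, 10, "bbb", "bbb", {"abc", "def"}))
```

With bounds missing, the 40 non-null values are unconstrained, which is why the updated assertion expects `not should_read`.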

kevinjqliu · 2026-06-22T01:51:43Z

+    )
+
+    should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5)).eval(data_file)
+    assert should_read == ROWS_MIGHT_NOT_MATCH, "Should not match: bounds prove the non-null value is 5"


value_count = 2 and null_count = 1, so there is 1 non-null value. Bounds [5..5] mean the non-null value is 5, so NotEqualTo("x", 5) / NotIn("x", {5, 6}) cannot match every row.

This feels like something that would be great as a code comment :)

I think this is restating the logic in the test

kevinjqliu · 2026-06-22T01:51:55Z

+        nan_value_counts=None,
+    )
+
+    should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5)).eval(data_file)


value_count = 2 and null_count = 2, so all values are null. That means every row matches NotEqualTo("x", 5) / NotIn("x", {5, 6}).

Code comment :)

kevinjqliu · 2026-06-22T01:52:17Z

+        upper_bounds={1: to_bytes(field_type, 5.0)},
+    )
+
+    should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5.0)).eval(data_file)


value_count = 2 and nan_count = 1, so there is 1 non-NaN value. Bounds [5.0..5.0] mean the non-NaN value is 5.0, so NotEqualTo("x", 5.0) / NotIn("x", {5.0, 6.0}) cannot match every row.

Weird, my comments didn't show up in the right place? Anyways, you wrote out some really useful explanations here and we should have these as code comments.

kevinjqliu · 2026-06-22T01:52:21Z

+        nan_value_counts={1: 2},
+    )
+
+    should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5.0)).eval(data_file)


value_count = 2 and nan_count = 2, so all values are NaN. That means every row matches NotEqualTo("x", 5.0) / NotIn("x", {5.0, 6.0}).

kevinjqliu · 2026-06-22T04:05:40Z

This looks like the same fix as #3521

Thanks! I didn't see that one.

kevinjqliu · 2026-06-22T04:17:10Z

    assert not should_read, "Should not match: no_nulls field does not have bounds"

-
-def test_strict_not_eq_partial_nulls_within_bounds() -> None:


Removed test_strict_not_eq_partial_nulls_within_bounds because the new branch-focused tests cover the same partial-null boundary more directly, and also add the missing all-null, mixed-NaN, and all-NaN cases. The removed test also used singleton NotIn("x", {5}), which normalizes to NotEqualTo("x", 5) and therefore did not exercise the strict visit_not_in branch.
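The singleton normalization mentioned above can be illustrated with a toy model. The classes and the `make_not_in` factory below are hypothetical stand-ins, not pyiceberg types, and the empty-set case is deliberately not modeled.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Union


@dataclass(frozen=True)
class ToyNotEqualTo:
    term: str
    literal: int


@dataclass(frozen=True)
class ToyNotIn:
    term: str
    literals: FrozenSet[int]


def make_not_in(term: str, literals: Iterable[int]) -> Union[ToyNotEqualTo, ToyNotIn]:
    """Build a NotIn-style predicate, collapsing a single literal to NotEqualTo."""
    values = frozenset(literals)
    if len(values) == 1:
        (only,) = values
        return ToyNotEqualTo(term, only)
    return ToyNotIn(term, values)


# A single literal never reaches the NotIn code path.
print(type(make_not_in("x", {5})).__name__)
# Two or more literals stay a NotIn and exercise that path.
print(type(make_not_in("x", {5, 6})).__name__)
```

Under this model, a test written with `{5}` only ever dispatches on the not-equal predicate, which is why the new tests use multi-value literals such as `{5, 6}`.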

rambleraptor

The tests look great!

rambleraptor · 2026-06-22T04:23:21Z

+    )
+
+    should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5)).eval(data_file)
+    assert should_read == ROWS_MIGHT_NOT_MATCH, "Should not match: bounds prove the non-null value is 5"


This feels like something that would be great as a code comment :)

rambleraptor · 2026-06-22T04:23:32Z

+        nan_value_counts=None,
+    )
+
+    should_read = _StrictMetricsEvaluator(schema, NotEqualTo("x", 5)).eval(data_file)


Code comment :)

geruh

LGTM!

kevinjqliu changed the title ~~[codex] Fix strict NotEqualTo and NotIn metrics with nulls and NaNs~~ Fix strict NotEqualTo and NotIn metrics with nulls and NaNs Jun 22, 2026

kevinjqliu force-pushed the kevinjqliu/codex-strict-metrics-not-eq-not-in branch 2 times, most recently from fa29f84 to 43860b7 Compare June 22, 2026 01:45

kevinjqliu mentioned this pull request Jun 22, 2026

Fix strict NotEqualTo/NotIn pruning with partial nulls or NaNs #3521

Merged

kevinjqliu force-pushed the kevinjqliu/codex-strict-metrics-not-eq-not-in branch from 43860b7 to 104dfda Compare June 22, 2026 04:14

kevinjqliu changed the title ~~Fix strict NotEqualTo and NotIn metrics with nulls and NaNs~~ Add strict NotEqualTo/NotIn null and NaN tests Jun 22, 2026

Add strict NotEqualTo/NotIn null and NaN tests

a595909

kevinjqliu force-pushed the kevinjqliu/codex-strict-metrics-not-eq-not-in branch from 104dfda to a595909 Compare June 22, 2026 04:19

kevinjqliu marked this pull request as ready for review June 22, 2026 04:20

rambleraptor approved these changes Jun 22, 2026

geruh approved these changes Jun 22, 2026

geruh merged commit 8c7912f into apache:main Jun 22, 2026
16 checks passed
