Fix null comparison for Parquet pruning predicate #1595

viirya · 2022-01-17T10:01:02Z

Which issue does this PR close?

Closes #1591.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

alamb · 2022-01-17T15:17:07Z

Thank you @viirya -- I will try to review this carefully, but likely won't be able to do so until tomorrow

viirya · 2022-01-17T18:24:49Z

Thank you @alamb

houqp

Thanks @viirya for the quick fix!

alamb

Thanks @viirya and @houqp

BLUF: I am fairly sure this change is ok, but I am not sure why it is needed; I have outlined my confusions below.

Note I ran this change against the IOx test suite in https://github.com/influxdata/influxdb_iox/pull/3479 and it was good.

Confusion 1: Doesn't follow definition of a pruning predicate:

As a reminder, the pruning predicate definition is

    /// A pruning predicate is one that has been rewritten in terms of
    /// the min and max values of column references and that evaluates
    /// to FALSE if the filter predicate would evaluate FALSE *for
    /// every row* whose values fell within the min / max ranges (aka
    /// could be pruned).
    ///
    /// The pruning predicate evaluates to TRUE or NULL
    /// if the filter predicate *might* evaluate to TRUE for at least
    /// one row whose values fell within the min/max ranges (in other
    /// words they might pass the predicate)

Thus, a "TRUE" or "NULL" result for a predicate means the row group must be kept. This is the safe behavior -- only if it is 100% certain that the predicate will evaluate to FALSE should the row group be removed.
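A minimal sketch of that keep/prune rule, modelling the three-valued SQL result as Rust's `Option<bool>` with `None` standing in for NULL. The function name `keep_container` is hypothetical and is not part of DataFusion:

```rust
/// Decide whether a container (e.g. a row group) must be kept, given the
/// result of evaluating the pruning predicate against its statistics.
/// `None` models SQL NULL. `keep_container` is a hypothetical name.
fn keep_container(pruning_result: Option<bool>) -> bool {
    // Only a definite FALSE allows pruning; TRUE and NULL both keep.
    pruning_result != Some(false)
}

fn main() {
    for result in [Some(true), Some(false), None] {
        println!("{:?} -> keep = {}", result, keep_container(result));
    }
}
```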

In this case, x = null doesn't seem to satisfy the stated conditions in PruningPredicate. x = null evaluates to null for all values (both null and non-null), as can be seen in Postgres:

    alamb=# select x, x = null from foo;
     x | ?column?
    ---+----------
     1 |
       |
     2 |
    (3 rows)

    alamb=#

Thus there is something wrong. Either:

1. the pruning predicate definition should be updated to say that a pruning predicate will return false if all rows will evaluate to FALSE OR NULL (which seems reasonable, as only rows that evaluate to TRUE pass a predicate, not rows that return null), or
2. this is not a correct transformation.
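To make the first option concrete, here is a small sketch (plain Rust, no DataFusion types) of SQL equality under three-valued logic, using `Option` for nullable values. The helper names `sql_eq` and `row_passes` are assumptions for illustration only. It reproduces the Postgres result above: `x = NULL` is NULL for every row, so no row passes the filter.

```rust
/// SQL `=` under three-valued logic: NULL (`None`) on either side yields NULL.
fn sql_eq(a: Option<i32>, b: Option<i32>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x == y),
        _ => None,
    }
}

/// A row passes a filter only if the predicate is definitely TRUE.
fn row_passes(result: Option<bool>) -> bool {
    result == Some(true)
}

fn main() {
    // Same data as the Postgres session: 1, NULL, 2.
    let rows = [Some(1), None, Some(2)];
    let results: Vec<Option<bool>> = rows.iter().map(|&x| sql_eq(x, None)).collect();
    // `x = NULL` is NULL for every row, so no row passes the filter.
    let any_pass = results.iter().any(|&r| row_passes(r));
    println!("results = {:?}, any row passes = {}", results, any_pass);
}
```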

Confusion 2: Why are we treating `=` specially?

If we go with this PR, I don't see any reason to handle = specially, as the same argument applies to other operators such as !=, >, etc. (though it does not apply to IS DISTINCT / IS NOT DISTINCT).
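The distinction can be sketched as follows, again with `Option` modelling nullable values. Both helper names are hypothetical. Ordinary comparisons propagate NULL, whereas `IS NOT DISTINCT FROM` treats NULL as a comparable value and always returns a definite boolean.

```rust
/// Apply a comparison under SQL three-valued logic: any NULL input gives NULL.
fn sql_cmp(a: Option<i32>, b: Option<i32>, op: impl Fn(i32, i32) -> bool) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(op(x, y)),
        _ => None,
    }
}

/// `IS NOT DISTINCT FROM`: NULL compares equal to NULL, and the result is never NULL.
fn is_not_distinct_from(a: Option<i32>, b: Option<i32>) -> bool {
    a == b
}

fn main() {
    let x = Some(1);
    // `!=` and `>` against NULL are NULL, just like `=`.
    println!("x != NULL -> {:?}", sql_cmp(x, None, |a, b| a != b));
    println!("x >  NULL -> {:?}", sql_cmp(x, None, |a, b| a > b));
    // IS NOT DISTINCT FROM gives a definite answer.
    println!("x IS NOT DISTINCT FROM NULL -> {}", is_not_distinct_from(x, None));
    println!("NULL IS NOT DISTINCT FROM NULL -> {}", is_not_distinct_from(None, None));
}
```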

alamb · 2022-01-17T14:20:12Z

datafusion/src/physical_optimizer/pruning.rs

@@ -421,6 +444,13 @@ impl<'a> PruningExpressionBuilder<'a> {
        &self.scalar_expr
    }

+    fn scalar_expr_value(&self) -> Result<&ScalarValue> {


Suggested change

-    fn scalar_expr_value(&self) -> Result<&ScalarValue> {
+    fn scalar_expr_value(&self) -> Option<&ScalarValue> {

Would save a string creation on error (not that it really matters)
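A rough illustration of the trade-off, using a hypothetical two-variant `Expr` enum rather than DataFusion's real types. The `Result` version has to build an error message on the failure path, while the `Option` version signals absence without allocating:

```rust
/// Hypothetical stand-in for an expression; not DataFusion's `Expr`.
enum Expr {
    Literal(i64),
    Column(String),
}

/// `Result` variant: the error path allocates and formats a `String`.
fn literal_value_result(expr: &Expr) -> Result<&i64, String> {
    match expr {
        Expr::Literal(v) => Ok(v),
        Expr::Column(name) => Err(format!("column {} is not a literal", name)),
    }
}

/// `Option` variant: absence is signalled with `None`, with no allocation.
fn literal_value_option(expr: &Expr) -> Option<&i64> {
    match expr {
        Expr::Literal(v) => Some(v),
        Expr::Column(_) => None,
    }
}

fn main() {
    let lit = Expr::Literal(7);
    let col = Expr::Column("c2".to_string());
    println!("{:?} {:?}", literal_value_result(&lit), literal_value_result(&col));
    println!("{:?} {:?}", literal_value_option(&lit), literal_value_option(&col));
}
```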

alamb · 2022-01-18T18:32:28Z

datafusion/src/physical_optimizer/pruning.rs

+
+    fn null_count_column_expr(&mut self) -> Result<Expr> {
+        let null_count_field = &Field::new(self.field.name(), DataType::Int64, false);
+        self.required_columns.null_count_column_expr(


alamb · 2022-01-18T18:33:00Z

datafusion/src/physical_optimizer/pruning.rs

+                {
+                    // column = null => null_count > 0
+                    let null_count_column_expr = expr_builder.null_count_column_expr()?;
+                    null_count_column_expr.gt(lit::<i64>(0))


I am curious why we use an i64 here rather than u64?

Oh, you're right. This should be u64.

Changed to u64.

alamb · 2022-01-18T18:35:44Z

datafusion/src/physical_plan/file_format/parquet.rs

-        // because the null values propagate to the end result, making the predicate result undefined
-        assert_eq!(row_group_filter, vec![true, true]);
+        // First row group was filtered out because it contains no null value on "c2".
+        assert_eq!(row_group_filter, vec![false, true]);


I actually think this could be vec![false, false] as the predicate can never be true (int > 1 AND bool = NULL is always NULL)

I am not sure about the expression semantics in DataFusion. In Spark, the predicate would be IsNull, which checks for null values. Here I followed the original expression bool = NULL.

I see there is also an IsNull predicate expression, but I don't see IsNull handled in predicate pushdown. I don't know if this is intentional (i.e. using = to do null predicate pushdown) or a bug.

I can fix it if you agree that IsNull is the correct way to handle the null predicate here.

I think this is related to the "Confusion 1 and 2". I guess this is also why you feel confused about treating = specially.

In SQL, IsNull is the correct way to test a column for null as well 👍

It would make a lot of sense to me to rewrite x IS NULL --> x_null_count > 0

Yeah, I was surprised and confused too when I looked at bool = NULL. I guessed this was how DataFusion works, but it seems not :). Let me fix it together.

Would you like me to fix it here or in a following PR?

I've updated to use IsNull for predicate pruning.
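As a sketch of what the `IS NULL` rewrite amounts to, the function below decides, per row group, whether `x IS NULL` could match any row from the null count alone. It is a simplified model with a hypothetical name, not the code in this PR. An unknown null count is modelled as `None` and treated as "keep", in line with the rule that only a definite FALSE prunes.

```rust
/// For each container, decide whether `x IS NULL` might match any row,
/// using only the per-container null count (`None` = statistic unknown).
/// `keep_for_is_null` is a hypothetical name.
fn keep_for_is_null(null_counts: &[Option<u64>]) -> Vec<bool> {
    null_counts
        .iter()
        .map(|count| match count {
            // Known count: keep only if at least one null is present.
            Some(n) => *n > 0,
            // Unknown count: the pruning result is NULL, so keep.
            None => true,
        })
        .collect()
}

fn main() {
    // First row group has no nulls in the column, the second has one.
    let filter = keep_for_is_null(&[Some(0), Some(1)]);
    println!("{:?}", filter);
}
```

With null counts of 0 and 1, the first row group is pruned and the second is kept, matching the `vec![false, true]` expectation in the updated test.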

alamb · 2022-01-18T18:37:35Z

datafusion/src/physical_optimizer/pruning.rs

+
+    /// return the number of null values for the named column.
+    /// Note: the returned array must contain `num_containers()` rows.
+    fn null_counts(&self, column: &Column) -> Option<ArrayRef>;


Suggested change

-    /// return the number of null values for the named column.
-    /// Note: the returned array must contain `num_containers()` rows.
-    fn null_counts(&self, column: &Column) -> Option<ArrayRef>;
+    /// return the number of null values for the named column as an
+    /// `Option<Int64Array>`.
+    ///
+    /// Note: the returned array must contain `num_containers()` rows.
+    fn null_counts(&self, column: &Column) -> Option<ArrayRef>;

I had to look this up to figure out what type was required
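The length contract in that doc comment can be illustrated with a simplified, hypothetical trait. This is not DataFusion's `PruningStatistics`: the real method returns `Option<ArrayRef>`, whereas this sketch uses `Option<Vec<u64>>` and a string column name so that it stays free of external crates.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a statistics provider.
trait ContainerStats {
    fn num_containers(&self) -> usize;
    /// Null count per container for `column`, or `None` if unavailable.
    /// The returned vector must have `num_containers()` entries.
    fn null_counts(&self, column: &str) -> Option<Vec<u64>>;
}

struct InMemoryStats {
    containers: usize,
    nulls: HashMap<String, Vec<u64>>,
}

impl ContainerStats for InMemoryStats {
    fn num_containers(&self) -> usize {
        self.containers
    }

    fn null_counts(&self, column: &str) -> Option<Vec<u64>> {
        self.nulls
            .get(column)
            // Enforce the length contract; treat a mismatch as "unavailable".
            .filter(|counts| counts.len() == self.containers)
            .cloned()
    }
}

/// Two containers; column "c2" has 0 nulls in the first and 1 in the second.
fn example_stats() -> InMemoryStats {
    let mut nulls = HashMap::new();
    nulls.insert("c2".to_string(), vec![0, 1]);
    InMemoryStats { containers: 2, nulls }
}

fn main() {
    let stats = example_stats();
    println!("containers = {}", stats.num_containers());
    println!("c2 = {:?}", stats.null_counts("c2"));
    println!("missing = {:?}", stats.null_counts("missing"));
}
```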

alamb

Thanks @viirya -- looks good to me. @houqp is this what you had in mind for #1591 ?

houqp · 2022-01-21T07:19:29Z

Thank you @viirya for the fix and @alamb for the detailed review 👍

viirya · 2022-01-21T07:26:43Z

Thank you @houqp @alamb !

alamb · 2022-01-21T11:59:20Z

Thanks @houqp !

Fix null comparison for Parquet pruning predicate

49e38f5

github-actions bot added the datafusion label (Changes in the datafusion crate) Jan 17, 2022

Fix clippy

416806f

houqp approved these changes Jan 17, 2022

houqp added the enhancement label (New feature or request) Jan 17, 2022

alamb approved these changes Jan 18, 2022

viirya added 2 commits January 18, 2022 23:19

Use u64

c9718cf

Address comments

eaedebb

viirya force-pushed the issue_1591 branch from 6044589 to 7501c18 Compare January 19, 2022 17:42

Use IsNull for null count predicate pruning

bc6b9b5

viirya force-pushed the issue_1591 branch from 7501c18 to bc6b9b5 Compare January 19, 2022 18:07

alamb approved these changes Jan 19, 2022

houqp added the performance label Jan 21, 2022

houqp merged commit 03075d5 into apache:master Jan 21, 2022

alamb mentioned this pull request Aug 5, 2022

Error pruning IsNull expressions: Column 'instance_null_count' is declared as non-nullable but contains null values #3042

Closed
