Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix null comparison for Parquet pruning predicate #1595

Merged
merged 5 commits into from Jan 21, 2022

Conversation

viirya
Copy link
Member

@viirya viirya commented Jan 17, 2022

Which issue does this PR close?

Closes #1591.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Jan 17, 2022
@alamb
Copy link
Contributor

alamb commented Jan 17, 2022

Thank you @viirya -- I will try to review this carefully, but likely won't be able to do so until tomorrow

@viirya
Copy link
Member Author

viirya commented Jan 17, 2022

Thank you @alamb

Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @viirya for the quick fix!

@houqp houqp added the enhancement New feature or request label Jan 17, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @viirya and @houqp

BLUF: I am fairly sure this change is ok, but I am not sure why it is needed; I have outlined my confusions below.

Note I ran this change against the IOx test suite in https://github.com/influxdata/influxdb_iox/pull/3479 and it was good.

Confusion 1: Doesn't follow definition of a pruning predicate:

As a reminder, the pruning predicate definition is

    /// A pruning predicate is one that has been rewritten in terms of
    /// the min and max values of column references and that evaluates
    /// to FALSE if the filter predicate would evaluate FALSE *for
    /// every row* whose values fell within the min / max ranges (aka
    /// could be pruned).
    ///
    /// The pruning predicate evaluates to TRUE or NULL
    /// if the filter predicate *might* evaluate to TRUE for at least
    /// one row whose vaules fell within the min/max ranges (in other
    /// words they might pass the predicate)

Thus, a "TRUE" or "NULL" result for a predicate means the row group must be kept. This is the safe behavior -- only if it is 100% certain that the predicate will evaluate to FALSE should the row group be removed

In this case, x = null doesn't seem to satisfy the stated conditions in PruningPredicate. x = null evaluates to null for all values (both null and non null) as can be seen in postgres:

alamb=# select x, x = null from foo;
 x | ?column?
---+----------
 1 |
   |
 2 |
(3 rows)

alamb=#

Thus there is something wrong. Either:

  1. the pruning predicate definition should be updated to say that a pruning predicate will return false if all rows will evaluate to FALSE OR NULL (which seems reasonable as only rows that evaluate to TRUE pass a predicate, not row that return null)
  2. this is not a correct transformation

Confusion 2: Why are we treating = specially?

If we go with this PR, I don't see any reason to handle = specially, as same argument applies to other operators such as !=, >, etc (though it does not apply to IS DISTINCT / IS NOT DISTINCT).

@@ -421,6 +444,13 @@ impl<'a> PruningExpressionBuilder<'a> {
&self.scalar_expr
}

fn scalar_expr_value(&self) -> Result<&ScalarValue> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn scalar_expr_value(&self) -> Result<&ScalarValue> {
fn scalar_expr_value(&self) -> Option<&ScalarValue> {

Would save a string creation on error (not that it really matters)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.


fn null_count_column_expr(&mut self) -> Result<Expr> {
let null_count_field = &Field::new(self.field.name(), DataType::Int64, false);
self.required_columns.null_count_column_expr(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

{
// column = null => null_count > 0
let null_count_column_expr = expr_builder.null_count_column_expr()?;
null_count_column_expr.gt(lit::<i64>(0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious why we use a i64 here rather than u64?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you're right. This should be u64.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to u64.

// because the null values propagate to the end result, making the predicate result undefined
assert_eq!(row_group_filter, vec![true, true]);
// First row group was filtered out because it contains no null value on "c2".
assert_eq!(row_group_filter, vec![false, true]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think this could be vec![false, false] as the predicate can never be true (int > 1 AND bool = NULL is always NULL)

Copy link
Member Author

@viirya viirya Jan 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about the expression semantics in datafusion. In Spark, the predicate should be IsNull that checks the null value. Here I follow the original expression bool = NULL.

I see there is also IsNull predicate expression, but I don't see IsNull is handled in predicate pushdown. I don't know if this is intentional (i.e. using = to do null predicate pushdown) or a bug.

I can fix it if you agree that IsNull is correct way to handle null predicate here.

Copy link
Member Author

@viirya viirya Jan 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is related to the "Confusion 1 and 2". I guess this is also why you feel confused about treating = specially.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In sql IsNull is the correct way to test a column for null as well 👍

It would make a lot of sense to me to rewrite x IS NULL --> 0 > x_null_count

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, I'm surprised when I looked at the bool = NULL and confused too. I guess this is how datafusion works but seems not :). Let me fix it together.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like me to fix it here or in a following PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated to use IsNull for predicate pruning.

Comment on lines 80 to 83

/// return the number of null values for the named column.
/// Note: the returned array must contain `num_containers()` rows.
fn null_counts(&self, column: &Column) -> Option<ArrayRef>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// return the number of null values for the named column.
/// Note: the returned array must contain `num_containers()` rows.
fn null_counts(&self, column: &Column) -> Option<ArrayRef>;
/// return the number of null values for the named column as an
/// `Option<Int64Array>`.
///
/// Note: the returned array must contain `num_containers()` rows.
fn null_counts(&self, column: &Column) -> Option<ArrayRef>;

I had to look this up to figure out what type this was required

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @viirya -- looks good to me. @houqp is this what you had in mind for #1591 ?

@houqp houqp merged commit 03075d5 into apache:master Jan 21, 2022
@houqp
Copy link
Member

houqp commented Jan 21, 2022

Thank you @viirya for the fix and @alamb for the detailed review 👍

@viirya
Copy link
Member Author

viirya commented Jan 21, 2022

Thank you @houqp @alamb !

@alamb
Copy link
Contributor

alamb commented Jan 21, 2022

Thanks @houqp !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet pruning predicate for IS NULL
3 participants