Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10356: [Rust][DataFusion] Add support for is_in #9038

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
2 changes: 1 addition & 1 deletion rust/benchmarks/src/bin/tpch.rs
Original file line number Diff line number Diff line change
Expand Up @@ -656,7 +656,7 @@ fn create_logical_plan(ctx: &mut ExecutionContext, query: usize) -> Result<Logic
on
l_orderkey = o_orderkey
where
(l_shipmode = 'MAIL' or l_shipmode = 'SHIP')
l_shipmode in ('MAIL', 'SHIP')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

and l_commitdate < l_receiptdate
and l_shipdate < l_commitdate
and l_receiptdate >= date '1994-01-01'
Expand Down
53 changes: 53 additions & 0 deletions rust/datafusion/src/logical_plan/expr.rs
Original file line number Diff line number Diff line change
Expand Up @@ -158,6 +158,15 @@ pub enum Expr {
/// List of expressions to feed to the functions as arguments
args: Vec<Expr>,
},
/// Returns whether the list contains the expr value.
InList {
/// The expression to compare
expr: Box<Expr>,
/// A list of values to compare against
list: Vec<Expr>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be easier to convert it here already to a vec where each element should have the same datatype,. And we check that while generating it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot find another example where we do validation like checking same datatypes in the Logical Plan. Most of this type of validation is performed in the Physical Plan: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/expressions.rs#L1650

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I see. Maybe could be a future optimization so that we can convert it to a more efficient representation upfront, and generating an error earlier when it can not be executed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the rationale / idea (largely expressed by @jorgecarleitao ) was that actual type coercion happens during physical planning (so that we could potentially have different backend physical planning mechanisms but the same logical mechanisms).

You could potentially use the coercion logic here: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/type_coercion.rs#L118

And coerce the in list items all to the same types

/// Whether the expression is negated
negated: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might keep negated out and use not instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helps keeping the logical plan simple, and also makes future code that uses the LP tree simple, e.g. an optimization rule on not(..)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mainly included negated to allow pretty printing like: 'z' NOT IN ('x','y'). I have changed this so it now uses the not expr so will now display NOT 'z' IN ('x','y')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think supporting sql style NOT IN would be nice (though no changes needed in this PR)

Copy link
Contributor

@Dandandan Dandandan Jan 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice indeed for a next PR, I think we could have a special case to match on Not (ListIn (...) in the formatter instead 👍

Copy link
Contributor

@alamb alamb Jan 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember exactly, but I think there might be some semantic difference (regarding NULLs, of course) in SQL between c NOT IN (...) and NOT c IN (...) FWIW that might require representing them differently

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm ok... in that case my initial suggestion might have been wrong... would good to have some tests for this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments. I have done some testing with Postgres 13.1 and found that it does not appear to make a difference. These are all equivalent and return NULL.

SELECT NOT NULL IN ('a');
SELECT NULL NOT IN ('a');
SELECT NOT 'a' IN (NULL);
SELECT 'a' NOT IN (NULL);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks @seddonm1 for checking . sounds good to me

},
/// Represents a reference to all fields in a schema.
Wildcard,
}
Expand Down Expand Up @@ -224,6 +233,7 @@ impl Expr {
),
Expr::Sort { ref expr, .. } => expr.get_type(schema),
Expr::Between { .. } => Ok(DataType::Boolean),
Expr::InList { .. } => Ok(DataType::Boolean),
Expr::Wildcard => Err(DataFusionError::Internal(
"Wildcard expressions are not valid in a logical query plan".to_owned(),
)),
Expand Down Expand Up @@ -278,6 +288,7 @@ impl Expr {
} => Ok(left.nullable(input_schema)? || right.nullable(input_schema)?),
Expr::Sort { ref expr, .. } => expr.nullable(input_schema),
Expr::Between { ref expr, .. } => expr.nullable(input_schema),
Expr::InList { ref expr, .. } => expr.nullable(input_schema),
Expr::Wildcard => Err(DataFusionError::Internal(
"Wildcard expressions are not valid in a logical query plan".to_owned(),
)),
Expand Down Expand Up @@ -389,6 +400,15 @@ impl Expr {
Expr::Alias(Box::new(self.clone()), name.to_owned())
}

/// InList
pub fn in_list(&self, list: Vec<Expr>, negated: bool) -> Expr {
Expr::InList {
expr: Box::new(self.clone()),
list,
negated,
}
}

/// Create a sort expression from an existing expression.
///
/// ```
Expand Down Expand Up @@ -579,6 +599,15 @@ pub fn count_distinct(expr: Expr) -> Expr {
}
}

/// Create an in_list expression
pub fn in_list(expr: Expr, list: Vec<Expr>, negated: bool) -> Expr {
Expr::InList {
expr: Box::new(expr),
list,
negated,
}
}

/// Whether it can be represented as a literal expression
pub trait Literal {
/// convert the value to a Literal expression
Expand Down Expand Up @@ -814,6 +843,17 @@ impl fmt::Debug for Expr {
write!(f, "{:?} BETWEEN {:?} AND {:?}", expr, low, high)
}
}
Expr::InList {
expr,
list,
negated,
} => {
if *negated {
write!(f, "{:?} NOT IN ({:?})", expr, list)
} else {
write!(f, "{:?} IN ({:?})", expr, list)
}
}
Expr::Wildcard => write!(f, "*"),
}
}
Expand Down Expand Up @@ -906,6 +946,19 @@ fn create_name(e: &Expr, input_schema: &DFSchema) -> Result<String> {
}
Ok(format!("{}({})", fun.name, names.join(",")))
}
Expr::InList {
expr,
list,
negated,
} => {
let expr = create_name(expr, input_schema)?;
let list = list.iter().map(|expr| create_name(expr, input_schema));
if *negated {
Ok(format!("{:?} NOT IN ({:?})", expr, list))
} else {
Ok(format!("{:?} IN ({:?})", expr, list))
}
}
other => Err(DataFusionError::NotImplemented(format!(
"Physical plan does not support logical expression {:?}",
other
Expand Down
4 changes: 2 additions & 2 deletions rust/datafusion/src/logical_plan/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -36,8 +36,8 @@ pub use display::display_schema;
pub use expr::{
abs, acos, and, array, asin, atan, avg, binary_expr, case, ceil, col, concat, cos,
count, count_distinct, create_udaf, create_udf, exp, exprlist_to_fields, floor,
length, lit, ln, log10, log2, lower, ltrim, max, min, or, round, rtrim, signum, sin,
sqrt, sum, tan, trim, trunc, upper, when, Expr, Literal,
in_list, length, lit, ln, log10, log2, lower, ltrim, max, min, or, round, rtrim,
signum, sin, sqrt, sum, tan, trim, trunc, upper, when, Expr, Literal,
};
pub use extension::UserDefinedLogicalNode;
pub use operators::Operator;
Expand Down
15 changes: 15 additions & 0 deletions rust/datafusion/src/optimizer/utils.rs
Original file line number Diff line number Diff line change
Expand Up @@ -102,6 +102,13 @@ pub fn expr_to_column_names(expr: &Expr, accum: &mut HashSet<String>) -> Result<
expr_to_column_names(high, accum)?;
Ok(())
}
Expr::InList { expr, list, .. } => {
expr_to_column_names(expr, accum)?;
for list_expr in list {
expr_to_column_names(list_expr, accum)?;
}
Ok(())
}
Expr::Wildcard => Err(DataFusionError::Internal(
"Wildcard expressions are not valid in a logical query plan".to_owned(),
)),
Expand Down Expand Up @@ -305,6 +312,13 @@ pub fn expr_sub_expressions(expr: &Expr) -> Result<Vec<Expr>> {
low.as_ref().to_owned(),
high.as_ref().to_owned(),
]),
Expr::InList { expr, list, .. } => {
let mut expr_list: Vec<Expr> = vec![expr.as_ref().to_owned()];
for list_expr in list {
expr_list.push(list_expr.to_owned());
}
Ok(expr_list)
}
Expr::Wildcard { .. } => Err(DataFusionError::Internal(
"Wildcard expressions are not valid in a logical query plan".to_owned(),
)),
Expand Down Expand Up @@ -416,6 +430,7 @@ pub fn rewrite_expression(expr: &Expr, expressions: &Vec<Expr>) -> Result<Expr>
Ok(expr)
}
}
Expr::InList { .. } => Ok(expr.clone()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise here, I think we might want to include the list -- even though at the moment it only contains constants, it is a Vec<Expr>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here this is just cloning the while InList expression (not the expr in InList) as the optimiser is not doing anything for this Expression yet.

Expr::Wildcard { .. } => Err(DataFusionError::Internal(
"Wildcard expressions are not valid in a logical query plan".to_owned(),
)),
Expand Down