ARROW-10356: [Rust][DataFusion] Add support for is_in #9038
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -158,6 +158,15 @@ pub enum Expr { | |
/// List of expressions to feed to the functions as arguments | ||
args: Vec<Expr>, | ||
}, | ||
/// Returns whether the list contains the expr value. | ||
InList { | ||
/// The expression to compare | ||
expr: Box<Expr>, | ||
/// A list of values to compare against | ||
list: Vec<Expr>, | ||
Comment: I think it might be easier to convert it here already to a vec where each element has the same datatype, and check that while generating it?

Comment: I cannot find another example where we do validation like checking same datatypes in the Logical Plan. Most of this type of validation is performed in the Physical Plan: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/expressions.rs#L1650

Comment: Yes, I see. Maybe it could be a future optimization, so that we can convert it to a more efficient representation upfront and generate an error earlier when it cannot be executed.

Comment: I think the rationale / idea (largely expressed by @jorgecarleitao) was that actual type coercion happens during physical planning (so that we could potentially have different backend physical planning mechanisms but the same logical mechanisms). You could potentially use the coercion logic here: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/type_coercion.rs#L118 and coerce the in-list items all to the same type.
||
/// Whether the expression is negated | ||
negated: bool, | ||
Comment: We might keep negated out and use

Comment: This helps keep the logical plan simple, and also makes future code that uses the LP tree simpler, e.g. an optimization rule on

Comment: I mainly included

Comment: I think supporting sql style

Comment: Would be nice indeed for a next PR, I think we could have a special case to match on Not (ListIn (...)) in the formatter instead 👍

Comment: I can't remember exactly, but I think there might be some semantic difference (regarding NULLs, of course) in SQL between

Comment: hm ok... in that case my initial suggestion might have been wrong... would be good to have some tests for this

Comment: Thanks for the comments. I have done some testing with Postgres 13.1 and found that it does not appear to make a difference. These are all equivalent and return the same result:
SELECT NOT NULL IN ('a');
SELECT NULL NOT IN ('a');
SELECT NOT 'a' IN (NULL);
SELECT 'a' NOT IN (NULL);

Comment: Ok, thanks @seddonm1 for checking. Sounds good to me.
||
}, | ||
/// Represents a reference to all fields in a schema. | ||
Wildcard, | ||
} | ||
|
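The NULL question raised in the thread above can be checked with a small model of SQL three-valued logic. The following is a minimal sketch, not DataFusion code: `sql_in`, `sql_not_in`, and `sql_not` are hypothetical helper names, and `None` stands in for SQL NULL. It evaluates `NOT (x IN list)` and `x NOT IN list` independently so the two forms can be compared.

```rust
/// Evaluate `value IN (list)` under SQL three-valued logic.
/// `None` models SQL NULL (unknown). Hypothetical helper, not a DataFusion API.
fn sql_in(value: Option<&str>, list: &[Option<&str>]) -> Option<bool> {
    // A NULL probe value can never be known to match anything.
    let v = value?;
    let mut saw_null = false;
    for item in list {
        match item {
            Some(x) if *x == v => return Some(true),
            Some(_) => {}
            None => saw_null = true,
        }
    }
    // No match found: the answer is unknown if any list item was NULL.
    if saw_null { None } else { Some(false) }
}

/// Evaluate `value NOT IN (list)` directly, without going through `sql_in`.
fn sql_not_in(value: Option<&str>, list: &[Option<&str>]) -> Option<bool> {
    let v = value?;
    let mut saw_null = false;
    for item in list {
        match item {
            Some(x) if *x == v => return Some(false),
            Some(_) => {}
            None => saw_null = true,
        }
    }
    if saw_null { None } else { Some(true) }
}

/// SQL `NOT`: NULL stays NULL, otherwise the boolean is flipped.
fn sql_not(b: Option<bool>) -> Option<bool> {
    b.map(|v| !v)
}
```

In this model the four queries from the thread all evaluate to `None`, and the two forms agree on non-NULL inputs as well, which is consistent with the Postgres 13.1 observation that a separate `negated` flag and a wrapping `NOT` do not differ semantically.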
@@ -224,6 +233,7 @@ impl Expr { | |
), | ||
Expr::Sort { ref expr, .. } => expr.get_type(schema), | ||
Expr::Between { .. } => Ok(DataType::Boolean), | ||
Expr::InList { .. } => Ok(DataType::Boolean), | ||
Expr::Wildcard => Err(DataFusionError::Internal( | ||
"Wildcard expressions are not valid in a logical query plan".to_owned(), | ||
)), | ||
|
@@ -278,6 +288,7 @@ impl Expr { | |
} => Ok(left.nullable(input_schema)? || right.nullable(input_schema)?), | ||
Expr::Sort { ref expr, .. } => expr.nullable(input_schema), | ||
Expr::Between { ref expr, .. } => expr.nullable(input_schema), | ||
Expr::InList { ref expr, .. } => expr.nullable(input_schema), | ||
Expr::Wildcard => Err(DataFusionError::Internal( | ||
"Wildcard expressions are not valid in a logical query plan".to_owned(), | ||
)), | ||
|
@@ -389,6 +400,15 @@ impl Expr { | |
Expr::Alias(Box::new(self.clone()), name.to_owned()) | ||
} | ||
|
||
/// InList | ||
pub fn in_list(&self, list: Vec<Expr>, negated: bool) -> Expr { | ||
Expr::InList { | ||
expr: Box::new(self.clone()), | ||
list, | ||
negated, | ||
} | ||
} | ||
|
||
/// Create a sort expression from an existing expression. | ||
/// | ||
/// ``` | ||
|
@@ -579,6 +599,15 @@ pub fn count_distinct(expr: Expr) -> Expr { | |
} | ||
} | ||
|
||
/// Create an in_list expression | ||
pub fn in_list(expr: Expr, list: Vec<Expr>, negated: bool) -> Expr { | ||
Expr::InList { | ||
expr: Box::new(expr), | ||
list, | ||
negated, | ||
} | ||
} | ||
|
||
/// Whether it can be represented as a literal expression | ||
pub trait Literal { | ||
/// convert the value to a Literal expression | ||
|
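The two constructors added in this file (the `Expr::in_list` method and the free `in_list` function) build the same node. A minimal sketch of how they might be used follows; the `Expr` enum here is a cut-down stand-in with only three variants, and the `col` and `lit` helpers are simplified assumptions rather than the real DataFusion signatures.

```rust
/// Cut-down stand-in for DataFusion's `Expr` (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    InList {
        expr: Box<Expr>,
        list: Vec<Expr>,
        negated: bool,
    },
}

impl Expr {
    /// Method form, mirroring `Expr::in_list` from the diff.
    fn in_list(&self, list: Vec<Expr>, negated: bool) -> Expr {
        Expr::InList {
            expr: Box::new(self.clone()),
            list,
            negated,
        }
    }
}

/// Free-function form, mirroring `in_list` from the diff.
fn in_list(expr: Expr, list: Vec<Expr>, negated: bool) -> Expr {
    Expr::InList {
        expr: Box::new(expr),
        list,
        negated,
    }
}

/// Simplified helpers for building leaf expressions (assumed for this sketch).
fn col(name: &str) -> Expr {
    Expr::Column(name.to_owned())
}

fn lit(value: &str) -> Expr {
    Expr::Literal(value.to_owned())
}
```

With these definitions, `col("c1").in_list(vec![lit("a"), lit("b")], false)` and `in_list(col("c1"), vec![lit("a"), lit("b")], false)` produce equal values; the method form clones `self`, while the free function takes ownership of its argument.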
@@ -814,6 +843,17 @@ impl fmt::Debug for Expr { | |
write!(f, "{:?} BETWEEN {:?} AND {:?}", expr, low, high) | ||
} | ||
} | ||
Expr::InList { | ||
expr, | ||
list, | ||
negated, | ||
} => { | ||
if *negated { | ||
write!(f, "{:?} NOT IN ({:?})", expr, list) | ||
} else { | ||
write!(f, "{:?} IN ({:?})", expr, list) | ||
} | ||
} | ||
Expr::Wildcard => write!(f, "*"), | ||
} | ||
} | ||
|
@@ -906,6 +946,19 @@ fn create_name(e: &Expr, input_schema: &DFSchema) -> Result<String> { | |
} | ||
Ok(format!("{}({})", fun.name, names.join(","))) | ||
} | ||
Expr::InList { | ||
expr, | ||
list, | ||
negated, | ||
} => { | ||
let expr = create_name(expr, input_schema)?; | ||
let list = list.iter().map(|expr| create_name(expr, input_schema)); | ||
if *negated { | ||
Ok(format!("{:?} NOT IN ({:?})", expr, list)) | ||
} else { | ||
Ok(format!("{:?} IN ({:?})", expr, list)) | ||
} | ||
} | ||
other => Err(DataFusionError::NotImplemented(format!( | ||
"Physical plan does not support logical expression {:?}", | ||
other | ||
|
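In the `create_name` hunk above, `list` is a lazy `map` iterator over `Result` values, so formatting it with `{:?}` appears to print the iterator itself rather than the item names, and `{:?}` on the already-built `expr` string adds quotes. A minimal sketch of one possible alternative is shown below. It assumes a cut-down `Expr` stand-in, drops the `input_schema` parameter, uses `String` as the error type, and names literals by their plain value, none of which matches the real DataFusion code.

```rust
/// Cut-down stand-in for DataFusion's `Expr` (illustration only).
enum Expr {
    Column(String),
    Literal(i64),
    InList {
        expr: Box<Expr>,
        list: Vec<Expr>,
        negated: bool,
    },
}

/// Build a display name for an expression.
/// The schema argument from the diff is omitted in this sketch.
fn create_name(e: &Expr) -> Result<String, String> {
    match e {
        Expr::Column(name) => Ok(name.clone()),
        Expr::Literal(value) => Ok(value.to_string()),
        Expr::InList {
            expr,
            list,
            negated,
        } => {
            let expr = create_name(expr)?;
            // Collect eagerly: this materialises the names and
            // propagates the first error, if any.
            let names = list
                .iter()
                .map(create_name)
                .collect::<Result<Vec<String>, String>>()?;
            let op = if *negated { "NOT IN" } else { "IN" };
            // `{}` avoids the extra quotes that `{:?}` puts around a String.
            Ok(format!("{} {} ({})", expr, op, names.join(", ")))
        }
    }
}
```

Collecting into `Result<Vec<_>, _>` is the standard-library way to turn an iterator of results into a single result, which keeps the error handling in line with the `?` used for `expr`.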
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -102,6 +102,13 @@ pub fn expr_to_column_names(expr: &Expr, accum: &mut HashSet<String>) -> Result< | |
expr_to_column_names(high, accum)?; | ||
Ok(()) | ||
} | ||
Expr::InList { expr, list, .. } => { | ||
expr_to_column_names(expr, accum)?; | ||
for list_expr in list { | ||
expr_to_column_names(list_expr, accum)?; | ||
} | ||
Ok(()) | ||
} | ||
Expr::Wildcard => Err(DataFusionError::Internal( | ||
"Wildcard expressions are not valid in a logical query plan".to_owned(), | ||
)), | ||
|
@@ -305,6 +312,13 @@ pub fn expr_sub_expressions(expr: &Expr) -> Result<Vec<Expr>> { | |
low.as_ref().to_owned(), | ||
high.as_ref().to_owned(), | ||
]), | ||
Expr::InList { expr, list, .. } => { | ||
let mut expr_list: Vec<Expr> = vec![expr.as_ref().to_owned()]; | ||
for list_expr in list { | ||
expr_list.push(list_expr.to_owned()); | ||
} | ||
Ok(expr_list) | ||
} | ||
Expr::Wildcard { .. } => Err(DataFusionError::Internal( | ||
"Wildcard expressions are not valid in a logical query plan".to_owned(), | ||
)), | ||
|
@@ -416,6 +430,7 @@ pub fn rewrite_expression(expr: &Expr, expressions: &Vec<Expr>) -> Result<Expr> | |
Ok(expr) | ||
} | ||
} | ||
Expr::InList { .. } => Ok(expr.clone()), | ||
Comment: likewise here, I think we might want to include the

Comment: Here this is just cloning the whole
||
Expr::Wildcard { .. } => Err(DataFusionError::Internal( | ||
"Wildcard expressions are not valid in a logical query plan".to_owned(), | ||
)), | ||
|
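`expr_sub_expressions` flattens an `InList` into the probe expression followed by the list items, while `rewrite_expression` in this diff returns a clone of the original node. A minimal sketch of how the rewrite could instead rebuild the node from replacement sub-expressions is given below. It assumes the same ordering as the flattening step, uses a cut-down `Expr` stand-in and a `String` error type, and is not the code merged in this PR.

```rust
/// Cut-down stand-in for DataFusion's `Expr` (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    InList {
        expr: Box<Expr>,
        list: Vec<Expr>,
        negated: bool,
    },
}

/// Flatten an expression into its direct children:
/// for `InList`, the probe expression first, then each list item.
fn expr_sub_expressions(expr: &Expr) -> Vec<Expr> {
    match expr {
        Expr::InList { expr, list, .. } => {
            let mut out = vec![(**expr).clone()];
            out.extend(list.iter().cloned());
            out
        }
        // Leaf expressions have no children.
        _ => vec![],
    }
}

/// Rebuild an expression from replacement children, assuming they are
/// in the order produced by `expr_sub_expressions`.
fn rewrite_expression(expr: &Expr, expressions: &[Expr]) -> Result<Expr, String> {
    match expr {
        Expr::InList { negated, .. } => match expressions.split_first() {
            Some((first, rest)) => Ok(Expr::InList {
                expr: Box::new(first.clone()),
                list: rest.to_vec(),
                negated: *negated,
            }),
            None => Err("InList needs at least the probe expression".to_owned()),
        },
        // Leaves are returned unchanged.
        other => Ok(other.clone()),
    }
}
```

Feeding the output of `expr_sub_expressions` straight back into `rewrite_expression` returns a node equal to the original, and passing modified children yields an `InList` that reflects them, which a plain clone would not.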
👍