New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary return types #7974
Conversation
FYI @andygrove @alamb @houqp : this is a draft because it depends on other PRs being reviewed and accepted. |
@@ -1087,7 +1101,7 @@ impl LogicalPlanBuilder { | |||
/// Apply a projection | |||
pub fn project(&self, expr: Vec<Expr>) -> Result<Self> { | |||
let input_schema = self.plan.schema(); | |||
let projected_expr = if expr.contains(&Expr::Wildcard) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to drop partialEq for expressions, and therefore also dropped this since it was strictly not needed.
@@ -87,7 +95,7 @@ impl Debug for ScalarFunctionExpr { | |||
.field("fun", &"<FUNC>") | |||
.field("name", &self.name) | |||
.field("args", &self.args) | |||
.field("return_type", &self.return_type) | |||
//.field("return_type", &self.return_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo
@@ -50,7 +58,7 @@ impl Debug for ScalarFunction { | |||
f.debug_struct("ScalarFunction") | |||
.field("name", &self.name) | |||
.field("args", &self.args) | |||
.field("return_type", &self.return_type) | |||
//.field("return_type", &self.return_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo
Currently a UDF's argument can only be a single type. This PR adds support for multiple types per argument, thus allowing users to register UDFs that can handle multiple types at once.
This operation is already by the optimizer and is more complete.
This allows to declare UDFs that support multiple types. The existing UDFs (math) now support float32 and float64.
@@ -156,7 +168,7 @@ macro_rules! sum_accumulate { | |||
|
|||
#[derive(Debug)] | |||
struct SumAccumulator { | |||
sum: Option<ScalarValue>, | |||
pub sum: Option<ScalarValue>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed?
@@ -31,6 +31,7 @@ extern crate sqlparser; | |||
|
|||
pub mod dataframe; | |||
pub mod datasource; | |||
mod datatyped; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub so that others can implement a different DataTyped?
@@ -355,7 +355,7 @@ pub enum Expr { | |||
/// List of expressions to feed to the functions as arguments | |||
args: Vec<Expr>, | |||
/// The `DataType` the expression will yield | |||
return_type: DataType, | |||
return_type: ReturnType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
THIS IS A MAJOR CHANGE: it makes Expr hold an Arc<&dyn>
, which make them unserializable. I was unable to find another way, unfortunately.
Expr::AggregateFunction { return_type, .. } => Ok(return_type.clone()), | ||
Expr::ScalarFunction { | ||
args, return_type, .. | ||
} => return_type(&args.iter().map(|x| x.as_datatyped()).collect(), schema), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not very beautiful, but was unable to find a better pattern for this :(
match self.schema_provider.get_function_meta(&name) { | ||
Some(fm) => { | ||
let args = if name == "count" { | ||
// optimization to avoid computing expressions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be moved to another place (an optimizer?). For now I left it where it here as it was.
This PR is done on top of #7971 ,
Its current runtime consequence is that all our math functions now return float32 or float64 depending on their incoming column (and use float32 for other numeric types). Its API consequences is that it allows to register UDFs of variable return types.
This PR hits both physical plans and logical plans, and abstracts a bit the typing of some of our structs. In particular, it introduces a new trait that only cares about the return datatype of an object (PhysicalExpr and LogicalExpr).