-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Add support for lambda column capture #21323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
9afcd37
c07e168
17e50e0
0a3f2e6
26bc713
960bf69
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -23,7 +23,7 @@ use arrow::array::{ArrayRef, RecordBatch}; | |||||||||||||||||||
| use arrow::datatypes::{DataType, FieldRef, Schema}; | ||||||||||||||||||||
| use arrow_schema::SchemaRef; | ||||||||||||||||||||
| use datafusion_common::config::ConfigOptions; | ||||||||||||||||||||
| use datafusion_common::{Result, ScalarValue, not_impl_err}; | ||||||||||||||||||||
| use datafusion_common::{Result, ScalarValue, exec_err, not_impl_err}; | ||||||||||||||||||||
| use datafusion_expr_common::dyn_eq::{DynEq, DynHash}; | ||||||||||||||||||||
| use datafusion_expr_common::signature::Volatility; | ||||||||||||||||||||
| use datafusion_physical_expr_common::physical_expr::PhysicalExpr; | ||||||||||||||||||||
|
|
@@ -224,35 +224,112 @@ pub struct LambdaArgument { | |||||||||||||||||||
| /// per outer sublist), avoiding the per-call `Schema::new` build that | ||||||||||||||||||||
| /// includes constructing the internal name -> index map. | ||||||||||||||||||||
| schema: SchemaRef, | ||||||||||||||||||||
| /// A RecordBatch containing the captured columns inside this lambda body, if any | ||||||||||||||||||||
| /// | ||||||||||||||||||||
| /// For example, for `array_transform([2], v -> v + a + b)`, | ||||||||||||||||||||
| /// this will be a `RecordBatch` with two columns, `a` and `b` | ||||||||||||||||||||
| captures: Option<RecordBatch>, | ||||||||||||||||||||
| } | ||||||||||||||||||||
|
|
||||||||||||||||||||
| impl LambdaArgument { | ||||||||||||||||||||
| pub fn new(params: Vec<FieldRef>, body: Arc<dyn PhysicalExpr>) -> Self { | ||||||||||||||||||||
| let schema = Arc::new(Schema::new(params.clone())); | ||||||||||||||||||||
| pub fn new( | ||||||||||||||||||||
| params: Vec<FieldRef>, | ||||||||||||||||||||
| body: Arc<dyn PhysicalExpr>, | ||||||||||||||||||||
| captures: Option<RecordBatch>, | ||||||||||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
datafusion/datafusion/physical-expr/src/higher_order_function.rs Lines 348 to 356 in 0a3f2e6
|
||||||||||||||||||||
| ) -> Self { | ||||||||||||||||||||
| let fields = match &captures { | ||||||||||||||||||||
| Some(batch) => batch | ||||||||||||||||||||
| .schema_ref() | ||||||||||||||||||||
| .fields() | ||||||||||||||||||||
| .iter() | ||||||||||||||||||||
| .cloned() | ||||||||||||||||||||
| .chain(params.clone()) | ||||||||||||||||||||
| .collect(), | ||||||||||||||||||||
| None => params.clone(), | ||||||||||||||||||||
| }; | ||||||||||||||||||||
|
|
||||||||||||||||||||
| let schema = Arc::new(Schema::new(fields)); | ||||||||||||||||||||
|
|
||||||||||||||||||||
| Self { | ||||||||||||||||||||
| params, | ||||||||||||||||||||
| body, | ||||||||||||||||||||
| schema, | ||||||||||||||||||||
| captures, | ||||||||||||||||||||
| } | ||||||||||||||||||||
| } | ||||||||||||||||||||
|
|
||||||||||||||||||||
| /// Evaluate this lambda | ||||||||||||||||||||
| /// `args` should evaluate to the value of each parameter | ||||||||||||||||||||
| /// of the correspondent lambda returned in [HigherOrderUDF::lambda_parameters]. | ||||||||||||||||||||
| /// | ||||||||||||||||||||
| /// `spread_captures` is responsible for transforming the captured column arrays | ||||||||||||||||||||
| /// so they align with the evaluation batch. Captures are snapshotted from the | ||||||||||||||||||||
| /// outer batch at construction time, giving one value per outer row, but the | ||||||||||||||||||||
| /// function may evaluate the lambda body over a batch with a different number | ||||||||||||||||||||
| /// of rows. It is the function responsibility to provide the appropriate `spread_captures` | ||||||||||||||||||||
| /// closure to expand (or otherwise reshape) the captures to match. | ||||||||||||||||||||
| /// Function working on lists, for example `array_transform(arr, v -> v + 1)` | ||||||||||||||||||||
| /// flattens all list elements into a single batch, duplicating captured | ||||||||||||||||||||
| /// values for rows with multiple elements and dropping them for empty lists. | ||||||||||||||||||||
| /// If the lambda has no captures, `spread_captures` is never called. | ||||||||||||||||||||
| pub fn evaluate( | ||||||||||||||||||||
| &self, | ||||||||||||||||||||
| args: &[&dyn Fn() -> Result<ArrayRef>], | ||||||||||||||||||||
| spread_captures: impl FnOnce(&[ArrayRef]) -> Result<Vec<ArrayRef>>, | ||||||||||||||||||||
| ) -> Result<ColumnarValue> { | ||||||||||||||||||||
| let columns = args | ||||||||||||||||||||
| let spread_captures = self | ||||||||||||||||||||
| .captures | ||||||||||||||||||||
| .as_ref() | ||||||||||||||||||||
| .map(|captures| { | ||||||||||||||||||||
| let spread_columns = spread_captures(captures.columns())?; | ||||||||||||||||||||
|
|
||||||||||||||||||||
| RecordBatch::try_new(captures.schema(), spread_columns) | ||||||||||||||||||||
| }) | ||||||||||||||||||||
| .transpose()?; | ||||||||||||||||||||
|
|
||||||||||||||||||||
| let merged = merge_captures_with_variables( | ||||||||||||||||||||
| spread_captures.as_ref(), | ||||||||||||||||||||
| Arc::clone(&self.schema), | ||||||||||||||||||||
| &self.params, | ||||||||||||||||||||
| args, | ||||||||||||||||||||
| )?; | ||||||||||||||||||||
|
|
||||||||||||||||||||
| self.body.evaluate(&merged) | ||||||||||||||||||||
| } | ||||||||||||||||||||
| } | ||||||||||||||||||||
|
|
||||||||||||||||||||
| fn merge_captures_with_variables( | ||||||||||||||||||||
| captures: Option<&RecordBatch>, | ||||||||||||||||||||
| schema: SchemaRef, | ||||||||||||||||||||
| params: &[FieldRef], | ||||||||||||||||||||
| variables: &[&dyn Fn() -> Result<ArrayRef>], | ||||||||||||||||||||
| ) -> Result<RecordBatch> { | ||||||||||||||||||||
| if variables.len() < params.len() { | ||||||||||||||||||||
| return exec_err!( | ||||||||||||||||||||
| "expected at least {} lambda arguments to merge with captures, got {}", | ||||||||||||||||||||
| params.len(), | ||||||||||||||||||||
| variables.len() | ||||||||||||||||||||
| ); | ||||||||||||||||||||
| } | ||||||||||||||||||||
|
|
||||||||||||||||||||
| let columns = match captures { | ||||||||||||||||||||
| Some(captures) => { | ||||||||||||||||||||
| let mut columns = captures.columns().to_vec(); | ||||||||||||||||||||
|
|
||||||||||||||||||||
| for arg in &variables[..params.len()] { | ||||||||||||||||||||
| columns.push(arg()?); | ||||||||||||||||||||
| } | ||||||||||||||||||||
|
|
||||||||||||||||||||
| columns | ||||||||||||||||||||
| } | ||||||||||||||||||||
| None => variables | ||||||||||||||||||||
| .iter() | ||||||||||||||||||||
| .take(self.params.len()) | ||||||||||||||||||||
| .take(params.len()) | ||||||||||||||||||||
| .map(|arg| arg()) | ||||||||||||||||||||
| .collect::<Result<_>>()?; | ||||||||||||||||||||
|
|
||||||||||||||||||||
| let batch = RecordBatch::try_new(Arc::clone(&self.schema), columns)?; | ||||||||||||||||||||
| .collect::<Result<_>>()?, | ||||||||||||||||||||
| }; | ||||||||||||||||||||
|
|
||||||||||||||||||||
| self.body.evaluate(&batch) | ||||||||||||||||||||
| } | ||||||||||||||||||||
| Ok(RecordBatch::try_new(schema, columns)?) | ||||||||||||||||||||
| } | ||||||||||||||||||||
|
|
||||||||||||||||||||
| /// Information about arguments passed to the function | ||||||||||||||||||||
|
|
||||||||||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use
ProjectionExpr? with all of its existing helpersThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ProjectionExprscan be used if we modify it to also handleLambdaVariable's, should we do that?I checked all usages I could find and didn't identify none where such modification would cause problems, I'm just not sure if we should do it.
In case you note in next reads of the PR, currently indices of lambda variables of inner lambdas are also collected and would cause problems with
ProjectionExprs, but that's easy fixable so please don't take that into account when thinking about itThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this was
PhysicalExpr, but because this is not and also created every evaluate call, then I don't think you would have a bug with the naming.I do think that we should try to have
LambdaArgumentin the physical plan so we won't have to recreate it every batch and have that costThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe try to create a different pr of supporting
ProjectionExprsand how would that look like