-
Notifications
You must be signed in to change notification settings - Fork 1.8k
DRAFT implement trait in proto de/serialization to allow finer control #18813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
milenkovicm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
proposal does make sense to me
| Arc::new(ChildPhysicalExtensionCodec {}), | ||
| ]); | ||
| ])); | ||
| let mut parser = TaskContextWithPhysicalCodec { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find parser as a name may be a bit confusing
| pub fn serialize_physical_aggr_expr<S: PhysicalSerializer>( | ||
| aggr_expr: Arc<AggregateFunctionExpr>, | ||
| codec: &dyn PhysicalExtensionCodec, | ||
| parser: &mut S, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find name parser a bit confusing, but i have no better suggestion at the moment, maybe serializer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Easily done.
| pub fn parse_physical_sort_expr<D: PhysicalDeserializer>( | ||
| proto: &protobuf::PhysicalSortExprNode, | ||
| ctx: &TaskContext, | ||
| parser: &mut D, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deserializer rather than parser ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed I think deserializer is a better parameter name
| let schema = WrappedSchema(FFI_ArrowSchema::try_from(args.schema)?); | ||
|
|
||
| let codec = DefaultPhysicalExtensionCodec {}; | ||
| let mut codec = DefaultPhysicalExtensionCodec {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a plan @timsaucer to make this configurable so we can use our own codec? It’s likely covered in other PRs, is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is my ultimate goal. I don't want to merge any PR until I've got at least a hacked together demo of it with FFI and used in df-python.
| pub fn parse_physical_sort_exprs<D: PhysicalDeserializer>( | ||
| proto: &[protobuf::PhysicalSortExprNode], | ||
| ctx: &TaskContext, | ||
| parser: &mut D, | ||
| input_schema: &Schema, | ||
| codec: &dyn PhysicalExtensionCodec, | ||
| ) -> Result<Vec<PhysicalSortExpr>> { | ||
| proto | ||
| .iter() | ||
| .map(|sort_expr| parse_physical_sort_expr(sort_expr, ctx, input_schema, codec)) | ||
| .map(|sort_expr| parse_physical_sort_expr(sort_expr, parser, input_schema)) | ||
| .collect() | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I need to change the signature from
pub fn parse_physical_sort_exprs<D: PhysicalDeserializer>(
proto: &[protobuf::PhysicalSortExprNode],
parser: &mut D,
input_schema: &Schema,
) -> Result<Vec<PhysicalSortExpr>> into
pub fn parse_physical_sort_exprs(
proto: &[protobuf::PhysicalSortExprNode],
parser: &mut dyn PhysicalDeserializer,
input_schema: &Schema,
) -> Result<Vec<PhysicalSortExpr>> My reasoning is that I do want to use this in the FFI crate. In that case I will have libraries where I do not know the underlying implementation. Instead I will probably have something like a Box<dyn PhysicalDeserializer>. So I can't use the generic on the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense
| pub trait PhysicalDeserializer { | ||
| fn try_decode_execution_plan( | ||
| &mut self, | ||
| buf: &[u8], | ||
| inputs: &[Arc<dyn ExecutionPlan>], | ||
| ) -> Result<Arc<dyn ExecutionPlan>>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if there is a good use case for having two separate traits - PhysicalDeserializer and PhysicalSerializer. Almost all of the methods so closely mirror the PhysicalExtensionCodec. Maybe we make one trait with a better name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we were to combine the two traits I could imagine calling it PhysicalCodec or PhysicalSerde. The problem with PhysicalCodec is the name is so close to PhysicalExtensionCodec and the methods are so similar I suspect it will cause more confusion than good.
adriangb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for taking a stab at this @timsaucer ! I think this achieves one of the goals of #18477 ("Optionality of features. Some use cases demand more information / capabilities for serialization / deserialization ...") but I don't think it enables the flexibility requested there.
| ExprType::Extension(extension) => { | ||
| let inputs: Vec<Arc<dyn PhysicalExpr>> = extension | ||
| .inputs | ||
| .iter() | ||
| .map(|e| parse_physical_expr(e, ctx, input_schema, codec)) | ||
| .map(|e| parse_physical_expr(e, parser, input_schema)) | ||
| .collect::<Result<_>>()?; | ||
| (codec.try_decode_expr(extension.expr.as_slice(), &inputs)?) as _ | ||
| parser.try_decode_expr(extension.expr.as_slice(), &inputs)? as _ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this is missing some key points of #18477:
- It only operates on
Extensionvariants - It doesn't allow pre/post/fork control of serde
From #18477:
design goals were:
- More flexible pre/post processing. The current Codec system only operates as a "fallback" and doesn't support hooking into the moment before trying to deserialize or the moment after a successful deserialization. I have a lot of example use cases for this that I can share.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to understand what additional hook you need. Here is an example using the latest push. Of course my example doesn't really have an encoder so it doesn't parse through to give a count on each one of the physical expressions, so the output is :
Pre-encoding: 0 expressions.
post: 1 expressions
Minimal example:
#[cfg(test)]
mod tests {
use std::sync::Arc;
use datafusion::logical_expr::Operator;
use datafusion_common::{DataFusionError, ScalarValue};
use datafusion_expr::{lit, AggregateUDF, ScalarUDF, WindowUDF};
use datafusion_physical_expr::expressions::{BinaryExpr, Column, Literal};
use datafusion_physical_expr_common::physical_expr::PhysicalExpr;
use datafusion_physical_plan::ExecutionPlan;
use crate::physical_plan::{PhysicalSerializer};
struct PrintOnSerialize {
num_exprs: u8,
}
impl PhysicalSerializer for PrintOnSerialize {
fn try_encode_execution_plan(&mut self, node: Arc<dyn ExecutionPlan>, buf: &mut Vec<u8>) -> datafusion_common::Result<()> {
todo!()
}
fn try_encode_udf(&mut self, node: &ScalarUDF, buf: &mut Vec<u8>) -> datafusion_common::Result<()> {
todo!()
}
fn try_encode_expr(&mut self, node: &Arc<dyn PhysicalExpr>, buf: &mut Vec<u8>) -> datafusion_common::Result<()> {
println!("Pre-encoding: {} expressions.", self.num_exprs);
self.num_exprs += 1;
// Call to inner actual encoder happens here
buf.push(self.num_exprs);
println!(" post: {} expressions", self.num_exprs);
Ok(())
}
fn try_encode_udaf(&mut self, _node: &AggregateUDF, _buf: &mut Vec<u8>) -> datafusion_common::Result<()> {
todo!()
}
fn try_encode_udwf(&mut self, _node: &WindowUDF, _buf: &mut Vec<u8>) -> datafusion_common::Result<()> {
todo!()
}
}
#[test]
fn demonstrate_serialization_hooks() -> Result<(), DataFusionError> {
let literal_expr = Literal::new(ScalarValue::Boolean(Some(true)));
let column_expr = Column::new("a", 0);
let binary_expr = BinaryExpr::new(Arc::new(literal_expr), Operator::And, Arc::new(column_expr));
let binary_expr = Arc::new(binary_expr) as Arc<dyn PhysicalExpr>;
let mut custom_serialize = PrintOnSerialize {
num_exprs: 0,
};
let mut bytes = Vec::new();
custom_serialize.try_encode_expr(&binary_expr, &mut bytes)?;
Ok(())
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The particular use case I have in mind is re-attaching a SchemaAdapterFactory to a ParquetSource after de-serializing it.
This is in draft form as a proof of concept
Which issue does this PR close?
Rationale for this change
This trait adds finer grained control over how processing occurs for protobuf serialization and deserialization. See the issue linked above for more discussion on why it is useful.
What changes are included in this PR?
Create traits for serialization and deserialization.
Add implementations for TaskContext and TaskContext with a PhysicalExtensionCodec
Give the same treatment to the logical side if we are in agreement
Are these changes tested?
Are there any user-facing changes?
Users will need to make some slight changes to their calls to serialization and deserialization.