Support user defined functions as expressions in FFI calls #18671

@timsaucer

Is your feature request related to a problem or challenge?

We frequently use protobuf serialization and deserialization in the FFI work. This has been a great advantage in exposing these functions, and it reduces the amount of code we need to duplicate. The problem is that calling the de/serialize functions requires passing either a FunctionRegistry or a TaskContext, depending on whether you are working with logical or physical expressions. Right now the implementation creates a default SessionContext before making the de/serialize calls.

As a result, if a user registers a custom function and uses it in an expression passed to any of the FFI calls that take expressions, the de/serialize calls fail, because the default SessionContext does not know about that function.
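To make the failure mode concrete, here is a minimal sketch using only the standard library. `Registry`, `resolve`, and `resolve_both` are hypothetical stand-ins invented for illustration, not DataFusion types; the point is only that a lookup by name against a freshly created default registry cannot find a function the user registered elsewhere.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a function registry: name -> implementation.
#[derive(Default)]
struct Registry {
    udfs: HashMap<String, fn(i64) -> i64>,
}

impl Registry {
    fn register(&mut self, name: &str, f: fn(i64) -> i64) {
        self.udfs.insert(name.to_string(), f);
    }

    // A serialized expression carries only the function name,
    // so deserialization has to look the implementation up again.
    fn resolve(&self, name: &str) -> Result<fn(i64) -> i64, String> {
        self.udfs
            .get(name)
            .copied()
            .ok_or_else(|| format!("function '{}' not found in registry", name))
    }
}

fn double(x: i64) -> i64 {
    x * 2
}

// Resolves the same function name against the user's registry and against
// a fresh default registry, returning both outcomes.
fn resolve_both() -> (Result<i64, String>, Result<i64, String>) {
    let mut user = Registry::default();
    user.register("my_double", double);

    // Stand-in for "create a default context before deserializing".
    let fresh = Registry::default();

    (
        user.resolve("my_double").map(|f| f(21)),
        fresh.resolve("my_double").map(|f| f(21)),
    )
}
```

In this sketch the first element of the tuple succeeds and the second is an error, which mirrors what happens when the FFI layer deserializes against a default context.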

Describe the solution you'd like

There are a few things I think we should do to improve this, and I have a functioning branch, tested against datafusion-python, that performs most of them. I will be putting up a series of PRs to address this issue. The order of the PRs is currently:

  1. Reduce FFI wrappers when round tripping code #18672: This adds a method to identify when a Foreign FFI struct is actually in the local library. When this is true, convert to the underlying data structure instead of keeping the FFI wrapper.
  2. Implement FFI_PhysicalExpr which mirrors the PhysicalExpr trait. This will allow us to pass expressions between foreign and local code without having to go through the protobuf physical parser. It is important to not have to go through the physical parser because this ends up creating a round trip dependency between how we pass around a physical codec and a task context. I can go into more details of the exact issue there if someone needs background.
  3. Add a TaskContextProvider trait that we can hold a weak reference to. This is used so that at a point after registration we can get the current TaskContext during de/serialization. This trait contains a single method to get the current Arc<TaskContext> and is implemented on SessionContext.
  4. Implement FFI_PhysicalExtensionCodec and FFI_LogicalExtensionCodec. We will still rely on protobuf serialization and deserialization on the logical side but not the physical side. By implementing these traits we will prepare ourselves for removal of the core datafusion crate. This has the added benefit of being able to serialize and deserialize much more across the FFI boundary, which is useful to the distributed work underway.
  5. Implement FFI_Session which covers the Session trait. This will allow us to pass the actual session without having to convert to a SessionConfig.
  6. Update FFI_ExecutionPlan to use the FFI_TaskContext. This is the first point of integration of all of the previous PRs into the existing code. Up until this point the rest have been additions without changing the existing functionality. This step will stop us from creating a SessionContext in order to create execution plans in FFI. This is a major step towards removing the core crate.
  7. Update existing methods to use the FFI_PhysicalExpr. This will remove the physical expression protobuf parsing in our code. This step is the MVP for getting to use user defined functions in many of the use cases of this issue.
  8. Pass FFI_LogicalExtensionCodec into the table provider, and therefore also into all of the structures that create table providers, such as table functions, schema providers, catalog providers, and catalog provider lists. With this PR we have all use cases covered for this issue.
  9. Remove datafusion/core from datafusion/ffi. With this PR we drop the dependency, which improves build times. It also includes some quality of life formatting of use statements.
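As an illustration of step 3, the following is a rough sketch of a single-method provider trait held through a weak reference. Only the name `TaskContextProvider` and the idea of returning an `Arc<TaskContext>` come from the description above; the method name `task_ctx`, the `Session`, `TaskContext`, and `CodecHandle` types, and `handle_for` are hypothetical stand-ins built on the standard library, not the actual DataFusion API.

```rust
use std::sync::{Arc, RwLock, Weak};

// Hypothetical stand-in for TaskContext: just the registered function names.
#[derive(Debug, Clone, Default, PartialEq)]
struct TaskContext {
    functions: Vec<String>,
}

// Single-method trait, as described in step 3 (method name is an assumption).
trait TaskContextProvider {
    fn task_ctx(&self) -> Arc<TaskContext>;
}

// Hypothetical stand-in for SessionContext with mutable registration state.
#[derive(Default)]
struct Session {
    functions: RwLock<Vec<String>>,
}

impl Session {
    fn register_udf(&self, name: &str) {
        self.functions.write().unwrap().push(name.to_string());
    }
}

impl TaskContextProvider for Session {
    // Builds a context from whatever is registered at the time of the call.
    fn task_ctx(&self) -> Arc<TaskContext> {
        Arc::new(TaskContext {
            functions: self.functions.read().unwrap().clone(),
        })
    }
}

// An FFI-side holder keeps only a weak reference, so it does not keep the
// session alive and always asks for the current context on demand.
struct CodecHandle {
    provider: Weak<dyn TaskContextProvider + Send + Sync>,
}

impl CodecHandle {
    // Returns None once the session has been dropped.
    fn current_ctx(&self) -> Option<Arc<TaskContext>> {
        self.provider.upgrade().map(|p| p.task_ctx())
    }
}

fn handle_for(session: &Arc<Session>) -> CodecHandle {
    let provider: Arc<dyn TaskContextProvider + Send + Sync> = session.clone();
    CodecHandle {
        provider: Arc::downgrade(&provider),
    }
}
```

With this shape, a function registered after `handle_for` is called is still visible through `current_ctx`, and the handle yields `None` instead of dangling once the session goes away.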

Describe alternatives you've considered

We could pass in a task context directly and pass that around the FFI structs. This has a major problem in that it would only be based on what was registered at the time of creation of that task context. I haven't been able to come up with a better alternative.
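The staleness problem with this alternative can be sketched in a few lines. The `Session` type and `snapshot_vs_live` function below are hypothetical stand-ins, assuming only that a task context captures the set of registered functions at the moment it is created.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for session state: the registered function names.
#[derive(Default)]
struct Session {
    functions: RwLock<Vec<String>>,
}

impl Session {
    fn register_udf(&self, name: &str) {
        self.functions.write().unwrap().push(name.to_string());
    }

    // Stand-in for creating a task context: a copy of the current state.
    fn snapshot(&self) -> Arc<Vec<String>> {
        Arc::new(self.functions.read().unwrap().clone())
    }
}

// Returns whether `name` is visible in a snapshot taken before registration
// and in one taken after registration.
fn snapshot_vs_live(name: &str) -> (bool, bool) {
    let session = Session::default();
    let early = session.snapshot(); // captured when the FFI struct is created
    session.register_udf(name); // the user registers a function afterwards
    let late = session.snapshot(); // fetched at de/serialization time
    (
        early.iter().any(|f| f.as_str() == name),
        late.iter().any(|f| f.as_str() == name),
    )
}
```

Here the early snapshot does not see the function while the late one does, which is why fetching the context at de/serialization time is preferable to passing a fixed one around.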

Additional context

Labels

enhancement (New feature or request), ffi (Changes to the ffi crate)
