Build and execute Java-constructed protobuf plans via SessionContext #8

@andygrove

Description

Follow-up to #4.

#4 generates Java classes for the datafusion-proto schema. This issue is the end-to-end wiring that makes those classes useful — Java code constructs a LogicalPlanNode, hands it to SessionContext, and gets back a DataFrame that streams Arrow batches the same way ctx.sql(...) does today.

Sketch

Three pieces sit between Java-built protobuf and execution:

  1. Java surface. A new method on SessionContext (usage sketch after this list):

    public DataFrame fromProto(byte[] planBytes);

    No other Java work is needed — the protobuf builders generated by #4 (Add datafusion-proto build) are the public construction API.

  2. JNI bridge. A single new native method that takes the byte array and returns a DataFrame pointer, modeled on the existing sql() JNI shim (declaration sketch after this list). Arrow FFI stays on the result path; the plan crosses JNI as a primitive byte[].

  3. Rust deserialization + execution. Add datafusion-proto = "53" to native/Cargo.toml, then in the JNI implementation:

    use datafusion_proto::logical_plan::{AsLogicalPlan, DefaultLogicalExtensionCodec};
    use datafusion_proto::protobuf::LogicalPlanNode;
    use prost::Message; // decode() comes from prost's Message trait

    let node = LogicalPlanNode::decode(&bytes[..])?;
    let plan = node.try_into_logical_plan(&ctx.state(), &DefaultLogicalExtensionCodec {})?;
    let df = runtime.block_on(ctx.execute_logical_plan(plan))?;

    Wrap the resulting DataFrame the same way sql() does and return its pointer.
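
To make the item-1 call pattern concrete, here is a minimal sketch of the Java side. The binding package and the protoc output location (datafusion.Datafusion) are assumptions about how #4 configures the build, and fromProto is the method proposed above, not an existing API:

    import org.apache.arrow.datafusion.DataFrame;
    import org.apache.arrow.datafusion.SessionContext;
    // Hypothetical protoc output location; #4 decides the real package.
    import datafusion.Datafusion.LogicalPlanNode;

    final class FromProtoExample {
        static DataFrame run(SessionContext ctx) {
            // A real caller sets one of the oneof plan variants (table scan,
            // projection, ...) via the generated builders; an empty node
            // serializes fine but will fail conversion on the Rust side.
            LogicalPlanNode plan = LogicalPlanNode.newBuilder().build();

            // The byte[] crosses JNI, Rust decodes and executes it, and the
            // returned DataFrame streams Arrow batches like ctx.sql(...) does.
            return ctx.fromProto(plan.toByteArray());
        }
    }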
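
Item 2 then reduces to a single native declaration on the Java side. The class name, parameter list, and pointer convention below are illustrative, assuming the repo keeps its pattern of static native shims that traffic in raw pointers:

    // The plan crosses JNI as a primitive byte[]; the returned long is a raw
    // pointer to the Rust DataFrame, which the Java DataFrame wrapper takes
    // ownership of, as on the sql() path today. Names are illustrative.
    final class SessionContexts {
        static native long fromProtoPlan(long runtimePointer, long contextPointer, byte[] planBytes);
    }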

Open design questions for this issue

  • Logical vs. physical plan first. Recommend logical so DataFusion's own optimizer runs; physical is a follow-up.
  • Schema discovery. A JVM caller building a TableScanNode needs the schema of registered tables. Likely a new JNI shim like SessionContext.tableSchema(name) -> org.apache.arrow.vector.types.pojo.Schema (sketch after this list). Without it, the API is usable only for trivial plans.
  • Extension codec. DefaultLogicalExtensionCodec is fine for now; a real codec arrives with Java-defined UDFs (the fourth roadmap item).
  • Round-trip tests. Build a plan in Java, execute it, and compare the results against the same query run via ctx.sql(...) over the TPC-H integration data (test sketch after this list).
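
For the schema-discovery shim, one workable shape is to have Rust serialize the table's schema as an encapsulated Arrow IPC schema message and rebuild it on the JVM with Arrow's MessageSerializer. The native method name and the IPC transport choice are assumptions, not settled design:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.nio.channels.Channels;

    import org.apache.arrow.vector.ipc.ReadChannel;
    import org.apache.arrow.vector.ipc.message.MessageSerializer;
    import org.apache.arrow.vector.types.pojo.Schema;

    final class TableSchemas {
        // Hypothetical shim: Rust looks the table up in the SessionContext
        // and writes its schema as an encapsulated Arrow IPC message.
        static native byte[] tableSchemaIpc(long contextPointer, String tableName);

        static Schema tableSchema(long contextPointer, String tableName) throws IOException {
            byte[] ipc = tableSchemaIpc(contextPointer, tableName);
            try (ReadChannel in = new ReadChannel(Channels.newChannel(new ByteArrayInputStream(ipc)))) {
                return MessageSerializer.deserializeSchema(in);
            }
        }
    }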
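
And the round-trip test reduces to the shape below. Every helper is hypothetical, named for what the real test must provide; the assertion is the point — the protobuf construction path and the SQL front end must agree over the same registered TPC-H tables:

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.util.List;

    import org.apache.arrow.datafusion.DataFrame;
    import org.apache.arrow.datafusion.SessionContext;
    import org.junit.jupiter.api.Test;

    // Hypothetical protoc output location (see the usage sketch above).
    import datafusion.Datafusion.LogicalPlanNode;

    class FromProtoRoundTripTest {
        @Test
        void protoPlanMatchesSql() throws Exception {
            SessionContext ctx = contextWithTpchTables();
            DataFrame viaSql = runSql(ctx, "SELECT l_returnflag, COUNT(*) FROM lineitem GROUP BY l_returnflag");
            DataFrame viaProto = ctx.fromProto(groupByPlan().toByteArray());
            // Equal row sets prove the protobuf construction path agrees
            // with the SQL path over the same registered tables.
            assertEquals(collectRows(viaSql), collectRows(viaProto));
        }

        // Hypothetical helpers standing in for real test plumbing.
        static SessionContext contextWithTpchTables() { throw new UnsupportedOperationException(); }
        static DataFrame runSql(SessionContext ctx, String query) { throw new UnsupportedOperationException(); }
        static LogicalPlanNode groupByPlan() { throw new UnsupportedOperationException(); }
        static List<List<Object>> collectRows(DataFrame df) { throw new UnsupportedOperationException(); }
    }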

Out of scope here

  • Java-defined UDFs and their custom extension codec.
  • Physical plan submission.
  • Any builder helpers / fluent API on top of the raw protobuf builders.
