
[SPARK-56413] Add gRPC UDF execution protocol#55657

Open
haiyangsun-db wants to merge 5 commits intoapache:masterfrom
haiyangsun-db:SPARK-56413

Conversation

@haiyangsun-db (Contributor) commented May 3, 2026

What changes were proposed in this pull request?

Adds udf_protocol.proto, the gRPC wire contract between the Spark engine and a
UDF worker process, as described in the SPIP. It sits next to the existing
worker_spec.proto.

Defines a Worker service with two RPCs:

  • Execute(stream UdfRequest) returns (stream UdfResponse) — one bidirectional
    stream per UDF execution. Lifecycle on the stream: Init → 0..N
    DataRequest / DataResponse → exactly one Finish or Cancel.
    PayloadChunk streams oversized UDF bodies.
  • Manage(WorkerRequest) returns (WorkerResponse) — unary, worker-scoped
    (heartbeat, graceful shutdown).
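
Read as a service definition, the two RPCs above might look like the following sketch. The oneof wrapper, its field numbers, and the elided supporting messages are illustrative, not the actual file contents:

```proto
syntax = "proto3";

// Sketch of the Worker service described above. Supporting messages
// (Init, DataRequest, Finish, Cancel, UdfResponse, WorkerRequest,
// WorkerResponse) are elided; the oneof layout is illustrative.
service Worker {
  // One bidirectional stream per UDF execution. Lifecycle:
  // Init -> 0..N DataRequest/DataResponse -> exactly one Finish or Cancel.
  rpc Execute(stream UdfRequest) returns (stream UdfResponse);

  // Unary, worker-scoped control plane (heartbeat, graceful shutdown).
  rpc Manage(WorkerRequest) returns (WorkerResponse);
}

message UdfRequest {
  oneof request {
    Init init = 1;
    DataRequest data = 2;
    Finish finish = 3;
    Cancel cancel = 4;
    PayloadChunk chunk = 5;  // streams oversized UDF bodies
  }
}
```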

UdfPayload carries the engine-opaque callable bytes plus a format tag,
an eval_type worker-dispatch hint, and optional input/output encoders.
Init carries data_format, schemas, session_conf, task_context, and
timezone (the first graduate from session_conf); a reserved field range
absorbs future graduates.
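
A sketch of how these two messages could be laid out. All field numbers and scalar types here are assumptions except timezone = 7, which is quoted verbatim in the review threads below; the format tag value comes from the README example quoted later in this page:

```proto
// Illustrative layout only; names and numbers other than
// "optional string timezone = 7" are not the actual file contents.
message UdfPayload {
  bytes payload = 1;   // engine-opaque callable bytes
  string format = 2;   // format tag, e.g. "py-cloudpickle-v3"
  int32 eval_type = 3; // worker-dispatch hint
  optional bytes input_encoder = 4;
  optional bytes output_encoder = 5;
}

message Init {
  UdfPayload udf = 1;
  string data_format = 2;
  bytes input_schema = 3;
  bytes output_schema = 4;
  map<string, string> session_conf = 5;
  TaskContext task_context = 6;
  // (Optional) Session timezone, the first config promoted out of
  // session_conf, since every eval needs it for timestamp encoding.
  optional string timezone = 7;
  // Reserved range that absorbs future promotions out of session_conf.
  reserved 8 to 15;
}
```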

Also fixes two typos in common.proto (exachanged → exchanged, bidrectional → bidirectional).

Out of scope

No planning info on the wire (no execution-shape / cardinality enum, no
chained-UDF metadata). Both can be added additively later.

Why are the changes needed?

Spark Connect's UDF support today is Python-only and tied to a Python-specific
socket protocol. Onboarding other client languages requires a structured,
language-neutral wire contract. This PR lands the proto layer; engine and
worker implementations will follow.

Does this PR introduce any user-facing change?

No. Wire contract only; not yet wired into any end-to-end path.

How was this patch tested?

Verified the proto compiles with protoc against common.proto and
worker_spec.proto, and inspected the generated descriptor for field-number
and oneof correctness. End-to-end conformance tests will land with the
engine-side client and first worker implementation.

Was this patch authored or co-authored using generative AI tooling?

Yes

@haiyangsun-db marked this pull request as ready for review May 3, 2026 16:13
@haiyangsun-db changed the title from "[SPARK-56413] Introduce the grpc protocol for UDF execution." to "[SPARK-56413] Add gRPC UDF execution protocol" May 3, 2026
Comment thread udf/worker/proto/src/main/protobuf/udf_protocol.proto

// (Optional) Session timezone, promoted out of [[session_conf]]
// because every eval needs it for timestamp encoding/decoding.
optional string timezone = 7;
Contributor

Is string the canonical type to represent the timezone? I am afraid all kinds of conversion errors may happen with no schema/enum enforcement.

Contributor Author

This is the convention from Spark: timezone is a string in Spark.

Comment thread udf/worker/proto/src/main/protobuf/udf_protocol.proto Outdated

// (Optional) Session timezone, promoted out of [[session_conf]]
// because every eval needs it for timestamp encoding/decoding.
optional string timezone = 7;
Contributor

We should specify the exact format in which the timezone will be reported, since it's a string.

Contributor Author

Timezone in Spark is a string config; we should get it from Spark and follow the same format.

Contributor

Ok, great. Thank you for clarifying!

Comment thread udf/worker/proto/src/main/protobuf/udf_protocol.proto Outdated
@sven-weber-db (Contributor) left a comment

Thank you for addressing the comments

Comment thread udf/worker/README.md Outdated
session.init(Init.newBuilder()
.setUdf(UdfPayload.newBuilder()
.setPayload(ByteString.copyFrom(serializedFunction))
.setFormat("py-cloudpickle-v3"))
Contributor

Does this already exist, or do we still need to create it? The only reason I am bringing it up is that examples are forever :)

Contributor Author

fixed

@haiyangsun-db (Contributor Author)

@hvanhovell could you please help take another pass?
