[SPARK-56413] Add gRPC UDF execution protocol #55657
haiyangsun-db wants to merge 5 commits into apache:master
Conversation
// (Optional) Session timezone, promoted out of [[session_conf]]
// because every eval needs it for timestamp encoding/decoding.
optional string timezone = 7;
Is `string` the canonical type to represent the timezone? I am afraid all kinds of conversion errors may happen with no schema/enum enforcement.
This is the convention from Spark; the timezone is a string in Spark.
// (Optional) Session timezone, promoted out of [[session_conf]]
// because every eval needs it for timestamp encoding/decoding.
optional string timezone = 7;
We should specify the exact format in which the timezone will be reported, since it's a string.
Timezone in Spark is a string config; we should get it from Spark and follow the same format.
Ok, great. Thank you for clarifying!
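For context on the format the reviewers agreed on: Spark's session timezone config (`spark.sql.session.timeZone`) accepts either a region-based zone ID such as `America/Los_Angeles` or a fixed zone offset such as `+08:00` or `UTC`. A hypothetical doc comment spelling this out on the field might look like the following (a sketch only; the wording is not from the PR, and the field number is taken from the quoted diff):

```proto
// (Optional) Session timezone, promoted out of [[session_conf]].
// Follows Spark's `spark.sql.session.timeZone`: either a region-based
// zone ID such as "America/Los_Angeles", or a fixed zone offset such
// as "+08:00" or "UTC".
optional string timezone = 7;
```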
sven-weber-db left a comment
Thank you for addressing the comments
session.init(Init.newBuilder()
  .setUdf(UdfPayload.newBuilder()
    .setPayload(ByteString.copyFrom(serializedFunction))
    .setFormat("py-cloudpickle-v3"))
Does this already exist? Or do we still need to create it? The only reason I am bringing it up is that examples are forever :)...
@hvanhovell could you please help take another pass?
What changes were proposed in this pull request?

Adds `udf_protocol.proto`, the gRPC wire contract between the Spark engine and a UDF worker process, as described in the SPIP. It sits next to the existing `worker_spec.proto`.

Defines a `Worker` service with two RPCs:

- `Execute(stream UdfRequest) returns (stream UdfResponse)` — one bidirectional stream per UDF execution. Lifecycle on the stream: `Init` → 0..N `DataRequest`/`DataResponse` → exactly one `Finish` or `Cancel`. `PayloadChunk` streams oversized UDF bodies.
- `Manage(WorkerRequest) returns (WorkerResponse)` — unary, worker-scoped (heartbeat, graceful shutdown).

`UdfPayload` carries the engine-opaque callable bytes plus a `format` tag, an `eval_type` worker-dispatch hint, and optional input/output encoders. `Init` carries `data_format`, schemas, `session_conf`, `task_context`, and `timezone` (the first graduate from `session_conf`); a reserved field range absorbs future graduates.

Also fixes two typos in `common.proto` (`exachanged`/`bidrectional`).

Out of scope

No planning info on the wire (no execution-shape / cardinality enum, no chained-UDF metadata). Both can be added additively later.
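To make the service shape described above concrete, here is a sketch of what the contract could look like, reconstructed only from this description; the actual `udf_protocol.proto` in the PR may differ in message layout, field names, and field numbers:

```proto
syntax = "proto3";

// Sketch reconstructed from the PR description, not the actual file.
service Worker {
  // One bidirectional stream per UDF execution. Lifecycle:
  // Init -> 0..N DataRequest/DataResponse -> exactly one Finish or Cancel.
  rpc Execute(stream UdfRequest) returns (stream UdfResponse);

  // Unary, worker-scoped control plane (heartbeat, graceful shutdown).
  rpc Manage(WorkerRequest) returns (WorkerResponse);
}

message UdfRequest {
  oneof request {
    Init init = 1;           // first message on the stream
    DataRequest data = 2;    // zero or more data batches
    PayloadChunk chunk = 3;  // streams oversized UDF bodies
    Finish finish = 4;       // exactly one Finish or Cancel ends the stream
    Cancel cancel = 5;
  }
}

message UdfPayload {
  bytes payload = 1;    // engine-opaque callable bytes
  string format = 2;    // format tag, e.g. "py-cloudpickle-v3"
  int32 eval_type = 3;  // worker-dispatch hint
  // ... plus optional input/output encoders, per the description
}
```

The `oneof` wrapper is one idiomatic way to express the stream lifecycle in proto3; the reserved field range mentioned for `Init` would let future fields graduate out of `session_conf` without renumbering.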
Why are the changes needed?
Spark Connect's UDF support today is Python-only and tied to a Python-specific
socket protocol. Onboarding other client languages requires a structured,
language-neutral wire contract. This PR lands the proto layer; engine and
worker implementations will follow.
Does this PR introduce any user-facing change?
No. Wire contract only; not yet wired into any end-to-end path.
How was this patch tested?
Verified the proto compiles with `protoc` against `common.proto` and `worker_spec.proto`, and inspected the generated descriptor for field-number and oneof correctness. End-to-end conformance tests will land with the engine-side client and first worker implementation.
Was this patch authored or co-authored using generative AI tooling?
Yes