[WIP][SPARK-56661] Introducing logical and physical planning nodes for language-agnostic Spark UDFs #55611
Conversation
case class MapPartitionExternalUDF(
    workerSpec: UDFWorkerSpecification,
    sessionFactory: WorkerSessionFactory,
    functionExpr: ExternalUDFExpression,
A typical mapPartitions takes just a function rather than an expression. The difference is:
- In UDF expressions, e.g. select some_udf(col + 1), some_udf2(col2 + col3), we have two UDF expressions; each expression is mapped to one function and takes its inputs from other expressions.
- mapPartitions usually consumes input rows as a whole and produces output rows as a whole; it usually looks like df.mapInArrow(some_func) or df.mapPartitions(some_lambda). It relies directly on one UDF function rather than depending on an expression.
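To make the contrast concrete, here is a minimal, hypothetical sketch of the two shapes; none of these names are the actual Spark classes.

```scala
sealed trait Expr
case class ColRef(name: String) extends Expr
case class Lit(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
// Expression-style UDF: one expression per call, inputs from child expressions.
case class ExternalUdfCall(fn: String, children: Seq[Expr]) extends Expr

// mapPartitions-style UDF: the node holds the function itself, which maps
// whole input-row iterators to whole output-row iterators.
case class Row(values: Seq[Any])
case class MapPartitionsNode(func: Iterator[Row] => Iterator[Row])

// select some_udf(col + 1) becomes an expression tree:
val exprStyle = ExternalUdfCall("some_udf", Seq(Add(ColRef("col"), Lit(1))))
// df.mapPartitions(some_lambda) carries the lambda directly:
val partitionStyle = MapPartitionsNode(rows => rows)
```

The expression form composes with other expressions in a projection, while the partition form sits at the plan-node level and owns the whole row stream.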
def workerSpec: UDFWorkerSpecification

/** The UDF expression describing the function to execute. */
def functionExpr: ExternalUDFExpression
As commented on the other class, this may not be shared between mapPartitions and UDF expressions.
def workerSpec: UDFWorkerSpecification

/** The UDF expression describing the function to execute. */
def functionExpr: ExternalUDFExpression
Same here.
Also, how do you plan to express multiple UDFs (for single-UDF cases) in the same node with one single functionExpr? I can imagine that we might compact chained UDFs into one single expression (where inner UDF calls are leaf expression nodes), but we could still have parallel UDF expressions that generate multiple outputs.
For example, for select foo(bar(col1)), baz(col2) from some_table, you may have one UDF node that contains two UDF expressions: one representing foo(bar(col1)) while the other represents baz(col2).
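One hypothetical way to express this (illustrative names, not the PR's actual API) is to let the node carry a sequence of UDF expressions, so chained calls compact into one tree while parallel calls become siblings:

```scala
sealed trait UdfExpr
case class Call(fn: String, args: Seq[UdfExpr]) extends UdfExpr
case class Ref(column: String) extends UdfExpr

// A node holding several parallel UDF expressions instead of one functionExpr.
case class ExternalUDFNode(udfExprs: Seq[UdfExpr])

// select foo(bar(col1)), baz(col2) from some_table:
// foo(bar(col1)) compacts into a single chained expression tree,
// while baz(col2) is a parallel sibling in the same node.
val node = ExternalUDFNode(Seq(
  Call("foo", Seq(Call("bar", Seq(Ref("col1"))))),
  Call("baz", Seq(Ref("col2")))
))
```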
 * intercept (typically [[close]]).
 */
@Experimental
abstract class DelegatedWorkerSession(
Why do we need a new class like this?
// All access guarded by `synchronized(lock)`.
private val lock = new Object
private val dispatchers =
  new HashMap[UDFWorkerSpecification, DispatcherEntry]()
Shall we use a ConcurrentHashMap here?
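The suggestion, sketched under assumed simplified types (UDFWorkerSpecification and WorkerDispatcher stand in for the PR's classes), would replace the lock-guarded HashMap with java.util.concurrent.ConcurrentHashMap, whose computeIfAbsent gives an atomic get-or-create without an external lock:

```scala
import java.util.concurrent.ConcurrentHashMap

case class UDFWorkerSpecification(language: String)
class WorkerDispatcher(val spec: UDFWorkerSpecification)

class DispatcherRegistry {
  private val dispatchers =
    new ConcurrentHashMap[UDFWorkerSpecification, WorkerDispatcher]()

  // Atomically returns the existing dispatcher for this spec or creates
  // one; concurrent callers for the same spec see a single instance.
  def getOrCreate(spec: UDFWorkerSpecification): WorkerDispatcher =
    dispatchers.computeIfAbsent(spec, s => new WorkerDispatcher(s))
}
```

Note that if other state (like a per-entry session count) must change atomically together with the map lookup, the explicit lock may still be the simpler contract.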
    securityScope: Option[WorkerSecurityScope] = None
): WorkerSession = {
  val entry = lock.synchronized {
    var e = dispatchers.get(workerSpec)
Use a more meaningful name than "e".
abstract class WorkerSessionFactory {

private class DispatcherEntry(val dispatcher: WorkerDispatcher) {
  var activeSessionCount: Int = 0
If the dispatcher already generates sessions directly, why not make the active session count a property of every dispatcher? Then we can get rid of this DispatcherEntry.
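A hypothetical sketch of that refactoring (WorkerDispatcher and WorkerSession are simplified stand-ins for the PR's classes): the counter moves into the dispatcher, and the wrapper disappears.

```scala
import java.util.concurrent.atomic.AtomicInteger

trait WorkerSession { def close(): Unit }

class WorkerDispatcher {
  // The dispatcher tracks its own live sessions, so no DispatcherEntry
  // wrapper is needed around it.
  private val activeSessionCount = new AtomicInteger(0)

  def openSession(): WorkerSession = {
    activeSessionCount.incrementAndGet()
    new WorkerSession {
      // Closing a session decrements the owning dispatcher's count.
      def close(): Unit = activeSessionCount.decrementAndGet()
    }
  }

  def activeSessions: Int = activeSessionCount.get()
}
```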
    workerSpec: UDFWorkerSpecification)
  extends DelegatedWorkerSession(underlying) {

  override def close(): Unit = {
Maybe this logic can also be a contract between session and dispatcher directly, instead of introducing an indirection here.
I do feel like we should break this PR into two parts: the Catalyst changes should be separate from the changes in the "/udf" module.
What changes were proposed in this pull request?
This PR introduces new logical and physical Catalyst nodes for language-agnostic User Defined Functions (UDF) as part of SPIP SPARK-55278, which proposes language-agnostic UDFs.
As a first step towards the goal of language-agnostic UDFs, we want to target mapPartition UDFs like pyspark.sql.DataFrame.mapInArrow or pyspark.RDD.mapPartitions. The overarching goal is to deprecate the current, language-specific Catalyst nodes (like mapInArrow). However, for now, the new nodes will exist in addition to the old ones until the new framework has reached maturity.
In summary, this PR introduces:
- ExternalUDFExpression, which captures language-agnostic UDF properties (payload, name, etc.)
- ExternalUDF, which serves as a base class for all language-agnostic UDF nodes
- MapPartitionExternalUDF, which is the new, language-agnostic map partition node
- WorkerSessionFactory, a factory class to generate new worker sessions using the new UDF worker approach
None of the changes introduced above are currently consumed in Spark.
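How these pieces relate can be sketched as follows; the signatures are illustrative assumptions, not the actual Spark code.

```scala
// Captures language-agnostic UDF properties such as name and payload.
case class ExternalUDFExpression(name: String, payload: Array[Byte])

case class UDFWorkerSpecification(language: String)
trait WorkerSession { def close(): Unit }

// Factory for worker sessions under the new UDF worker approach.
abstract class WorkerSessionFactory {
  def createSession(spec: UDFWorkerSpecification): WorkerSession
}

// Base for all language-agnostic UDF plan nodes.
trait ExternalUDF {
  def workerSpec: UDFWorkerSpecification
  def functionExpr: ExternalUDFExpression
}

// The new language-agnostic map-partition node.
case class MapPartitionExternalUDF(
    workerSpec: UDFWorkerSpecification,
    sessionFactory: WorkerSessionFactory,
    functionExpr: ExternalUDFExpression) extends ExternalUDF
```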
Why are the changes needed?
This is the first step from the Spark planning side toward language-agnostic UDF execution. The existing physical and logical planning nodes make language-specific assumptions and will eventually need to be replaced to achieve this goal.
Does this PR introduce any user-facing change?
No
How was this patch tested?
New unit tests were added.
Was this patch authored or co-authored using generative AI tooling?
Partially. However, the code was manually reviewed and adjusted.