[WIP][SPARK-56661] Introducing logical and physical planning nodes for language-agnostic Spark UDFs #55611
Conversation
case class MapPartitionExternalUDF(
    workerSpec: UDFWorkerSpecification,
    sessionFactory: WorkerSessionFactory,
    functionExpr: ExternalUDFExpression,
A typical mapPartitions takes just a function rather than an expression. The difference is:
- In UDF expressions, e.g. select some_udf(col + 1), some_udf2(col2 + col3), we have two UDF expressions; each expression is mapped to one function and takes its inputs from other expressions.
- mapPartitions usually consumes input rows as a whole and produces output rows as a whole; it usually looks like df.mapInArrow(some_func) or df.mapPartitions(some_lambda). It relies directly on one UDF function rather than depending on an expression.
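To make the contrast concrete, here is a minimal, hypothetical sketch of the two shapes; none of these names are the actual Spark classes.

```scala
sealed trait Expr
case class ColRef(name: String) extends Expr
case class Lit(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
// Expression-style UDF: one expression per call, inputs from child expressions.
case class ExternalUdfCall(fn: String, children: Seq[Expr]) extends Expr

// mapPartitions-style UDF: the node holds the function itself, which maps
// whole input-row iterators to whole output-row iterators.
case class Row(values: Seq[Any])
case class MapPartitionsNode(func: Iterator[Row] => Iterator[Row])

// select some_udf(col + 1) becomes an expression tree:
val exprStyle = ExternalUdfCall("some_udf", Seq(Add(ColRef("col"), Lit(1))))
// df.mapPartitions(some_lambda) carries the lambda directly:
val partitionStyle = MapPartitionsNode(rows => rows)
```

The expression form composes with other expressions in a projection, while the partition form sits at the plan-node level and owns the whole row stream.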
def workerSpec: UDFWorkerSpecification

/** The UDF expression describing the function to execute. */
def functionExpr: ExternalUDFExpression
As commented on the other class, this may not be shared between mapPartitions and UDF expressions.
def workerSpec: UDFWorkerSpecification

/** The UDF expression describing the function to execute. */
def functionExpr: ExternalUDFExpression
Same here.
Also, how do you plan to express multiple UDFs (for single-UDF cases) in the same node with one single functionExpr? I can imagine that we might compact chained UDFs into one single expression (where inner UDF calls are leaf expression nodes), but we could still have parallel UDF expressions that generate multiple outputs.
For example, for select foo(bar(col1)), baz(col2) from some_table, you may have one UDF node that contains two UDF expressions: one representing foo(bar(col1)) while the other represents baz(col2).
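One hypothetical way to express this (illustrative names, not the PR's actual API) is to let the node carry a sequence of UDF expressions, so chained calls compact into one tree while parallel calls become siblings:

```scala
sealed trait UdfExpr
case class Call(fn: String, args: Seq[UdfExpr]) extends UdfExpr
case class Ref(column: String) extends UdfExpr

// A node holding several parallel UDF expressions instead of one functionExpr.
case class ExternalUDFNode(udfExprs: Seq[UdfExpr])

// select foo(bar(col1)), baz(col2) from some_table:
// foo(bar(col1)) compacts into a single chained expression tree,
// while baz(col2) is a parallel sibling in the same node.
val node = ExternalUDFNode(Seq(
  Call("foo", Seq(Call("bar", Seq(Ref("col1"))))),
  Call("baz", Seq(Ref("col2")))
))
```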
 * intercept (typically [[close]]).
 */
@Experimental
abstract class DelegatedWorkerSession(
Why do we need a new class like this?
// All access guarded by `synchronized(lock)`.
private val lock = new Object
private val dispatchers =
  new HashMap[UDFWorkerSpecification, DispatcherEntry]()
Shall we use a ConcurrentHashMap here?
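The suggestion, sketched under assumed simplified types (UDFWorkerSpecification and WorkerDispatcher stand in for the PR's classes), would replace the lock-guarded HashMap with java.util.concurrent.ConcurrentHashMap, whose computeIfAbsent gives an atomic get-or-create without an external lock:

```scala
import java.util.concurrent.ConcurrentHashMap

case class UDFWorkerSpecification(language: String)
class WorkerDispatcher(val spec: UDFWorkerSpecification)

class DispatcherRegistry {
  private val dispatchers =
    new ConcurrentHashMap[UDFWorkerSpecification, WorkerDispatcher]()

  // Atomically returns the existing dispatcher for this spec or creates
  // one; concurrent callers for the same spec see a single instance.
  def getOrCreate(spec: UDFWorkerSpecification): WorkerDispatcher =
    dispatchers.computeIfAbsent(spec, s => new WorkerDispatcher(s))
}
```

Note that if other state (like a per-entry session count) must change atomically together with the map lookup, the explicit lock may still be the simpler contract.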
    securityScope: Option[WorkerSecurityScope] = None
): WorkerSession = {
  val entry = lock.synchronized {
    var e = dispatchers.get(workerSpec)
Use a more meaningful name than "e".
abstract class WorkerSessionFactory {

private class DispatcherEntry(val dispatcher: WorkerDispatcher) {
  var activeSessionCount: Int = 0
If the dispatcher already generates sessions directly, why not make the active session count a property of every dispatcher? Then we can get rid of this DispatcherEntry.
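A hypothetical sketch of that refactoring (WorkerDispatcher and WorkerSession are simplified stand-ins for the PR's classes): the counter moves into the dispatcher, and the wrapper disappears.

```scala
import java.util.concurrent.atomic.AtomicInteger

trait WorkerSession { def close(): Unit }

class WorkerDispatcher {
  // The dispatcher tracks its own live sessions, so no DispatcherEntry
  // wrapper is needed around it.
  private val activeSessionCount = new AtomicInteger(0)

  def openSession(): WorkerSession = {
    activeSessionCount.incrementAndGet()
    new WorkerSession {
      // Closing a session decrements the owning dispatcher's count.
      def close(): Unit = activeSessionCount.decrementAndGet()
    }
  }

  def activeSessions: Int = activeSessionCount.get()
}
```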
    workerSpec: UDFWorkerSpecification)
  extends DelegatedWorkerSession(underlying) {

  override def close(): Unit = {
Maybe this logic can also be a contract between session and dispatcher directly, instead of introducing an indirection here.
I do feel like we should break this PR into two parts: the Catalyst changes should be separate from the changes in the "/udf" module.
What changes were proposed in this pull request?
This PR introduces new logical and physical Catalyst nodes for language-agnostic User Defined Functions (UDF) as part of SPIP SPARK-55278, which proposes language-agnostic UDFs.
As a first step towards the goal of language-agnostic UDFs, we want to target mapPartition UDFs like pyspark.sql.DataFrame.mapInArrow or pyspark.RDD.mapPartitions. The overarching goal is to deprecate the current, language-specific Catalyst nodes (like mapInArrow). However, for now, the new nodes will exist in addition to the old ones until the new framework has reached maturity.
In summary, this PR introduces:
- ExternalUDFExpression, which captures language-agnostic UDF properties (payload, name, etc.)
- ExternalUDF, which serves as a base class for all language-agnostic UDF nodes
- MapPartitionExternalUDF, which is the new, language-agnostic map partition node
- WorkerSessionFactory, a factory class to generate new worker sessions using the new UDF worker approach
None of the changes introduced above are currently consumed in Spark.
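How these pieces relate can be sketched as follows; the signatures are illustrative assumptions, not the actual Spark code.

```scala
// Captures language-agnostic UDF properties such as name and payload.
case class ExternalUDFExpression(name: String, payload: Array[Byte])

case class UDFWorkerSpecification(language: String)
trait WorkerSession { def close(): Unit }

// Factory for worker sessions under the new UDF worker approach.
abstract class WorkerSessionFactory {
  def createSession(spec: UDFWorkerSpecification): WorkerSession
}

// Base for all language-agnostic UDF plan nodes.
trait ExternalUDF {
  def workerSpec: UDFWorkerSpecification
  def functionExpr: ExternalUDFExpression
}

// The new language-agnostic map-partition node.
case class MapPartitionExternalUDF(
    workerSpec: UDFWorkerSpecification,
    sessionFactory: WorkerSessionFactory,
    functionExpr: ExternalUDFExpression) extends ExternalUDF
```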
Why are the changes needed?
This is the first step from the Spark planning side toward language-agnostic UDF execution. The existing physical and logical planning nodes make language-specific assumptions and will eventually need to be replaced to achieve this goal.
Does this PR introduce any user-facing change?
No
How was this patch tested?
New unit tests were added.
Was this patch authored or co-authored using generative AI tooling?
Partially. However, the code was manually reviewed and adjusted.