[SPARK-48228][PYTHON][CONNECT] Implement the missing function validation in ApplyInXXX #46519

Closed

zhengruifeng (Contributor) commented May 10, 2024

What changes were proposed in this pull request?

Implement the missing function validation in ApplyInXXX

#46397 fixed this issue for `Cogrouped.ApplyInPandas`; this PR fixes the remaining methods.

Why are the changes needed?

For a better error message:

```
In [11]: from pyspark.sql.types import StructType, StructField, DoubleType

In [12]: df1 = spark.range(11)

In [13]: df2 = df1.groupby("id").applyInPandas(lambda: 1, StructType([StructField("d", DoubleType())]))

In [14]: df2.show()
```
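For contrast, and per the error text quoted below, a valid `applyInPandas` function must take either one argument (data) or two arguments (key, data). A minimal sketch continuing the session above (the function names are illustrative, not part of the PR):

```python
import pandas as pd
from pyspark.sql.types import StructType, StructField, DoubleType

# One-argument form: receives each group's data as a pandas DataFrame.
def center(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"d": pdf["id"] - pdf["id"].mean()})

# Two-argument form: additionally receives the grouping key as a tuple.
def center_with_key(key, pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"d": pdf["id"] - pdf["id"].mean()})

df_ok = df1.groupby("id").applyInPandas(
    center, StructType([StructField("d", DoubleType())])
)
```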

Before this PR, an invalid function caused confusing execution errors:

```

24/05/10 11:37:36 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 36)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1834, in main
    process()
  File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1826, in process
    serializer.dump_stream(out_iter, outfile)
  File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 531, in dump_stream
    return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 104, in dump_stream
    for batch in iterator:
  File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 524, in init_stream_yield_batches
    for series in iterator:
  File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1610, in mapper
    return f(keys, vals)
           ^^^^^^^^^^^^^
  File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 488, in <lambda>
    return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
                          ^^^^^^^^^^^^^
  File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 483, in wrapped
    result, return_type, _assign_cols_by_name, truncate_return_schema=False
    ^^^^^^
UnboundLocalError: cannot access local variable 'result' where it is not associated with a value

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:523)
	at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:117)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:479)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:896)

	...
```

After this PR, the error is raised before execution, which is consistent with Spark Classic and much clearer:

```
PySparkValueError: [INVALID_PANDAS_UDF] Invalid function: pandas_udf with function type GROUPED_MAP or the function in groupby.applyInPandas must take either one argument (data) or two arguments (key, data).
```
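With the validation in place, the failure can be caught eagerly on the client. A small illustration continuing the same session (that the exception surfaces at the `applyInPandas` call, before any job runs, is an inference from the PR description above):

```python
from pyspark.errors import PySparkValueError
from pyspark.sql.types import StructType, StructField, DoubleType

try:
    # The invalid zero-argument lambda is rejected during plan
    # construction, not during execution of df2.show().
    df1.groupby("id").applyInPandas(
        lambda: 1, StructType([StructField("d", DoubleType())])
    )
except PySparkValueError as e:
    print(e.getErrorClass())  # INVALID_PANDAS_UDF
```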

Does this PR introduce any user-facing change?

Yes, the error message changes.

How was this patch tested?

Added tests.

Was this patch authored or co-authored using generative AI tooling?

No.

```
@@ -34,6 +34,7 @@
 from pyspark.util import PythonEvalType
 from pyspark.sql.group import GroupedData as PySparkGroupedData
 from pyspark.sql.pandas.group_ops import PandasCogroupedOps as PySparkPandasCogroupedOps
+from pyspark.sql.pandas.functions import _validate_pandas_udf  # type: ignore[attr-defined]
```
zhengruifeng (Contributor, Author) commented:

Spark Classic invokes `pandas_udf` in the Pandas Functions (ApplyInXXX), and `pandas_udf` includes the function validation. In Spark Connect, we cannot use `pandas_udf` because of differences in the underlying implementations: `pandas_udf` returns a wrapper, while Spark Connect requires a `UserDefinedFunction` object.
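A minimal sketch of that asymmetry (the helper names and call shapes here are simplified assumptions for illustration, not Spark's actual internals):

```python
from inspect import signature

class ConnectUserDefinedFunction:
    """Illustrative stand-in for Spark Connect's UDF object."""
    def __init__(self, func, return_type):
        self.func = func
        self.return_type = return_type

def validate_group_map_func(func):
    # The kind of check pandas_udf performs for Spark Classic: a
    # GROUPED_MAP function must take one argument (data) or two (key, data).
    if len(signature(func).parameters) not in (1, 2):
        raise ValueError(
            "the function in groupby.applyInPandas must take either one "
            "argument (data) or two arguments (key, data)"
        )

def connect_apply_in_pandas(func, return_type):
    # Spark Connect cannot route through pandas_udf (which would return a
    # wrapper), so the validation has to be invoked explicitly before the
    # UserDefinedFunction object is built.
    validate_group_map_func(func)
    return ConnectUserDefinedFunction(func, return_type)
```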

zhengruifeng requested review from HyukjinKwon and xinrong-meng and removed the request for HyukjinKwon on May 10, 2024 03:48
dongjoon-hyun (Member) left a comment:

+1, LGTM (Pending CIs).

dongjoon-hyun (Member) commented:

Merged to master for Apache Spark 4.0.0.

zhengruifeng deleted the missing_check_in_group branch on May 10, 2024 05:11
zhengruifeng (Contributor, Author) commented:

@dongjoon-hyun and @HyukjinKwon, thanks for the reviews!

JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
[SPARK-48228][PYTHON][CONNECT] Implement the missing function validation in ApplyInXXX
Closes apache#46519 from zhengruifeng/missing_check_in_group.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
xinrong-meng (Member) commented:

Late LGTM, thank you!
