
[SPARK-48142][PYTHON][CONNECT][TESTS] Enable CogroupedApplyInPandasTests.test_wrong_args #46397

Closed

Conversation

@zhengruifeng (Contributor) commented May 6, 2024

What changes were proposed in this pull request?

Enable `CogroupedApplyInPandasTests.test_wrong_args` by adding a missing function validation check
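
For illustration, a minimal sketch of what the re-enabled test asserts, assuming illustrative names (the exact test body lives in the PySpark test suite; `left`, `right`, and the schema below are assumptions):

```python
from pyspark.errors import PySparkValueError

def test_wrong_args(self):
    left = self.spark.range(10)
    right = self.spark.range(10)
    with self.assertRaises(PySparkValueError) as ctx:
        left.groupby("id").cogroup(right.groupby("id")).applyInPandas(
            lambda: 1,  # zero-argument function: invalid for cogrouped map
            schema="id long, v double",
        )
    self.assertIn("INVALID_PANDAS_UDF", str(ctx.exception))
```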

Why are the changes needed?

For test coverage.

Does this PR introduce any user-facing change?

no

How was this patch tested?

CI.

Was this patch authored or co-authored using generative AI tooling?

no

@zhengruifeng zhengruifeng changed the title [MINOR][DOCS] Make a dataframe.join doctest deterministic [WIP] Enable CogroupedApplyInPandasTests.test_wrong_args May 6, 2024
@zhengruifeng zhengruifeng changed the title [WIP] Enable CogroupedApplyInPandasTests.test_wrong_args [SPARK-48142][PYTHON][CONNECT][TESTS] Enable CogroupedApplyInPandasTests.test_wrong_args May 6, 2024
```python
# Diff excerpt from the Spark Connect code path; the opening line of the newly
# added validation call (the factored-out check) is truncated in this view:
    func,
    schema,
    PythonEvalType.SQL_COGROUPED_MAP_PANDAS_UDF,
)

udf_obj = UserDefinedFunction(
```
@zhengruifeng (Contributor Author)
PySpark Classic directly uses the `pandas_udf` method, which includes this check:

```python
# The usage of the pandas_udf is internal so type checking is disabled.
udf = pandas_udf(
    func, returnType=schema, functionType=PythonEvalType.SQL_COGROUPED_MAP_PANDAS_UDF
)  # type: ignore[call-overload]
```

while in Spark Connect, the output of the `pandas_udf` method cannot be used here; a `UserDefinedFunction` object is required instead.

So I factored the check out into a separate method and invoke it here.
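
For context, a minimal sketch of what the factored-out check could look like; `validate_function_arity` and its signature are hypothetical (the real helper raises `PySparkValueError` with the `INVALID_PANDAS_UDF` error class):

```python
from inspect import getfullargspec

from pyspark.errors import PySparkValueError


def validate_function_arity(func, allowed_arities, detail):
    """Fail fast if `func` takes an unsupported number of arguments."""
    num_args = len(getfullargspec(func).args)
    if num_args not in allowed_arities:
        raise PySparkValueError(f"[INVALID_PANDAS_UDF] Invalid function: {detail}")


# Cogrouped applyInPandas accepts (left, right) or (key, left, right), so the
# same check can run in both PySpark Classic and Spark Connect:
validate_function_arity(
    lambda left, right: left,  # a valid two-argument function
    allowed_arities=(2, 3),
    detail="the function in cogroup.applyInPandas must take either two "
    "arguments (left, right) or three arguments (key, left, right).",
)
```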

@xinrong-meng (Member)

LGTM thank you!

@zhengruifeng (Contributor Author)

Thanks @xinrong-meng

Merged to master.

@zhengruifeng zhengruifeng deleted the fix_pandas_udf_check branch May 7, 2024 01:16
dongjoon-hyun pushed a commit that referenced this pull request May 10, 2024
…ion in ApplyInXXX

### What changes were proposed in this pull request?
Implement the missing function validation in ApplyInXXX

#46397 fixed this issue for `Cogrouped.ApplyInPandas`; this PR fixes the remaining methods.

### Why are the changes needed?
For a better error message:

```
In [12]: df1 = spark.range(11)

In [13]: df2 = df1.groupby("id").applyInPandas(lambda: 1, StructType([StructField("d", DoubleType())]))

In [14]: df2.show()
```

before this PR, an invalid function causes weird execution errors:
```
24/05/10 11:37:36 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 36)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1834, in main
    process()
  File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1826, in process
    serializer.dump_stream(out_iter, outfile)
  File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 531, in dump_stream
    return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 104, in dump_stream
    for batch in iterator:
  File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 524, in init_stream_yield_batches
    for series in iterator:
  File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1610, in mapper
    return f(keys, vals)
           ^^^^^^^^^^^^^
  File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 488, in <lambda>
    return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
                          ^^^^^^^^^^^^^
  File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 483, in wrapped
    result, return_type, _assign_cols_by_name, truncate_return_schema=False
    ^^^^^^
UnboundLocalError: cannot access local variable 'result' where it is not associated with a value

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:523)
	at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:117)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:479)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:896)

	...
```

After this PR, the error happens before execution, which is consistent with Spark Classic, and is much clearer:
```
PySparkValueError: [INVALID_PANDAS_UDF] Invalid function: pandas_udf with function type GROUPED_MAP or the function in groupby.applyInPandas must take either one argument (data) or two arguments (key, data).

```
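
For contrast, a hedged sketch of a function with a valid arity for `groupby.applyInPandas` (the helper and schema below are illustrative, continuing the `df1` example above):

```python
import pandas as pd
from pyspark.sql.types import StructType, StructField, DoubleType

# groupby.applyInPandas accepts either f(pdf) or f(key, pdf):
def as_double(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"d": pdf["id"].astype("float64")})

df2 = df1.groupby("id").applyInPandas(
    as_double, StructType([StructField("d", DoubleType())])
)
df2.show()
```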

### Does this PR introduce _any_ user-facing change?
Yes, the error message changes.

### How was this patch tested?
Added tests.

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46519 from zhengruifeng/missing_check_in_group.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…ests.test_wrong_args`

Closes apache#46397 from zhengruifeng/fix_pandas_udf_check.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…ion in ApplyInXXX

Closes apache#46519 from zhengruifeng/missing_check_in_group.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>