
[MINOR][PYTHON][TESTS] Remove the doc in error message tests to allow other PyArrow versions in tests #46453

Closed
wants to merge 1 commit into master from HyukjinKwon:minor-test

Conversation

HyukjinKwon (Member) commented:

What changes were proposed in this pull request?

This PR is a minor change to support more PyArrow versions in the test.
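
For context, the failing test (`_test_merge_error` in `test_pandas_cogrouped_map.py`, per the traceback below) asserts on the error message produced when a cogrouped `applyInPandas` UDF returns something other than a pandas.DataFrame. A minimal, hypothetical repro sketch of that error (illustrative only, not part of this PR's diff):

```python
# Hypothetical repro: a cogrouped applyInPandas UDF that returns an int
# instead of a pandas.DataFrame triggers PySparkTypeError [UDF_RETURN_TYPE].
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame([(1, 1.0)], ("id", "v"))
right = spark.createDataFrame([(1, 2.0)], ("id", "w"))

def bad_merge(left_df: pd.DataFrame, right_df: pd.DataFrame):
    return 1  # wrong: should return a pandas.DataFrame

cogrouped = left.groupBy("id").cogroup(right.groupBy("id"))
# Fails in the Python worker; the type name in the message ("int" vs
# "int64") can differ depending on the installed PyArrow/pandas versions.
cogrouped.applyInPandas(bad_merge, schema="id long, v double, w double").show()
```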

Why are the changes needed?

To support more PyArrow versions in the test. With a different PyArrow version installed, the test can fail (see https://github.com/HyukjinKwon/spark/actions/runs/8994639538/job/24708397027):

```
Traceback (most recent call last):
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/pandas/test_pandas_cogrouped_map.py", line 585, in _test_merge_error
    self.__test_merge_error(
  File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/pandas/test_pandas_cogrouped_map.py", line 606, in __test_merge_error
    with self.assertRaisesRegex(error_class, error_message_regex):
AssertionError: "Return type of the user-defined function should be pandas.DataFrame, but is int64." does not match "
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1834, in main
    process()
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1826, in process
    serializer.dump_stream(out_iter, outfile)
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 531, in dump_stream
    return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 104, in dump_stream
    for batch in iterator:
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 524, in init_stream_yield_batches
    for series in iterator:
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1694, in mapper
    return f(df1_keys, df1_vals, df2_keys, df2_vals)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 370, in <lambda>
    return lambda kl, vl, kr, vr: [(wrapped(kl, vl, kr, vr), to_arrow_type(return_type))]
                                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 364, in wrapped
    verify_pandas_result(
  File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 234, in verify_pandas_result
    raise PySparkTypeError(
pyspark.errors.exceptions.base.PySparkTypeError: [UDF_RETURN_TYPE] Return type of the user-defined function should be pandas.DataFrame, but is int.
```
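
The mismatch above arises because the test's expected regex pins the exact type name reported by the worker ("int64"), while other PyArrow/pandas combinations report "int". A minimal sketch of the kind of relaxation this change makes, with hypothetical message strings and pattern (illustrative, not the actual diff):

```python
import re
import unittest


class MergeErrorMessageTest(unittest.TestCase):
    def test_message_matches_across_pyarrow_versions(self):
        # Hypothetical messages as raised under different PyArrow/pandas
        # versions (see the "int" vs "int64" mismatch in the traceback).
        messages = [
            "Return type of the user-defined function should be pandas.DataFrame, but is int.",
            "Return type of the user-defined function should be pandas.DataFrame, but is int64.",
        ]
        # Match only the stable prefix rather than pinning the exact type
        # name, so the assertion holds on every supported PyArrow version.
        pattern = re.compile(
            r"Return type of the user-defined function should be pandas\.DataFrame"
        )
        for message in messages:
            self.assertRegex(message, pattern)


if __name__ == "__main__":
    unittest.main()
```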

Does this PR introduce any user-facing change?

No, test-only.

How was this patch tested?

CI should validate it.

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun (Member) left a comment:

+1, LGTM.


@dongjoon-hyun (Member) commented:

Merged to master.

Feel free to backport this if you need it in branch-3.5.

HyukjinKwon added a commit that referenced this pull request May 8, 2024
[MINOR][PYTHON][TESTS] Remove the doc in error message tests to allow other PyArrow versions in tests

Closes #46453 from HyukjinKwon/minor-test.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
[MINOR][PYTHON][TESTS] Remove the doc in error message tests to allow other PyArrow versions in tests

Closes apache#46453 from HyukjinKwon/minor-test.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>