[SPARK-54211][PYTHON][DOCS] Update example of mapInArrow to use arrow builtin function #52909

zhengruifeng · 2025-11-06T03:42:50Z

What changes were proposed in this pull request?

Update the example of mapInArrow to use an Arrow built-in function.

Why are the changes needed?

1. Using Arrow built-in functions is encouraged in Arrow-based UDFs, because it avoids the extra conversion between Arrow and pandas.
2. The mapInArrow doctest no longer depends on pandas, so it can still be tested when pyarrow is installed but pandas is missing.

Does this PR introduce any user-facing change?

yes, doc-only change

How was this patch tested?

ci

Was this patch authored or co-authored using generative AI tooling?

no

dongjoon-hyun

+1, LGTM.

dongjoon-hyun · 2025-11-06T05:13:28Z

Merged to master for Apache Spark 4.2.0.

dongjoon-hyun

Hi @zhengruifeng .

It seems to break build_python_minimum.yml (two modules: pyspark-sql and spark-connect). Could you take a look at the failure?

https://github.com/apache/spark/actions/workflows/build_python_minimum.yml
- https://github.com/apache/spark/actions/runs/19182257209

**********************************************************************
   2 of   5 in pyspark.sql.dataframe.DataFrame.mapInArrow
***Test Failed*** 2 failures.

dongjoon-hyun · 2025-11-08T02:16:05Z

python/pyspark/sql/dataframe.py

-        ...         pdf = batch.to_pandas()
-        ...         yield pa.RecordBatch.from_pandas(pdf[pdf.id == 1])
+        ...         yield batch.filter(pa.compute.field("id") == 1)
        >>> df.mapInArrow(filter_func, df.schema).show()


The failure happens here.

File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 409, in pyspark.sql.dataframe.DataFrame.mapInArrow
Failed example:
    df.mapInArrow(filter_func, df.schema).show()
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest pyspark.sql.dataframe.DataFrame.mapInArrow[3]>", line 1, in <module>
        df.mapInArrow(filter_func, df.schema).show()
      File "/__w/spark/spark/python/pyspark/sql/classic/dataframe.py", line 285, in show
        print(self._show_string(n, truncate, vertical))
      File "/__w/spark/spark/python/pyspark/sql/classic/dataframe.py", line 303, in _show_string
        return self._jdf.showString(n, 20, vertical)
      File "/__w/spark/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1362, in __call__
        return_value = get_return_value(
      File "/__w/spark/spark/python/pyspark/errors/exceptions/captured.py", line 269, in deco
        raise converted from None
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 3343, in main
        process()
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 3334, in process
        serializer.dump_stream(out_iter, outfile)
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 187, in dump_stream
        return super(ArrowStreamUDFSerializer, self).dump_stream(wrap_and_init_stream(), stream)
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 121, in dump_stream
        for batch in iterator:
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 167, in wrap_and_init_stream
        for batch, _ in iterator:
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 2857, in func
        for result_batch, result_type in result_iter:
      File "<doctest pyspark.sql.dataframe.DataFrame.mapInArrow[2]>", line 3, in filter_func
        yield batch.filter(pa.compute.field("id") == 1)
      File "pyarrow/table.pxi", line 2580, in pyarrow.lib.RecordBatch.filter
        return _pc().filter(self, mask, null_selection_behavior)
      File "/usr/local/lib/python3.10/dist-packages/pyarrow/compute.py", line 263, in wrapper
        return func.call(args, options, memory_pool)
      File "pyarrow/_compute.pyx", line 372, in pyarrow._compute.Function.call
        _pack_compute_args(args, &c_batch.values)
      File "pyarrow/_compute.pyx", line 505, in pyarrow._compute._pack_compute_args
        raise TypeError(f"Got unexpected argument type {type(val)} "
    pyspark.errors.exceptions.captured.PythonException:

zhengruifeng · 2025-11-10

Yeah, I am fixing it in #52965.

Closes apache#52909 from zhengruifeng/map_in_doc_test.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

nit

388f726

github-actions bot added SQL PYTHON labels Nov 6, 2025

zhengruifeng requested a review from HyukjinKwon November 6, 2025 03:43

dongjoon-hyun approved these changes Nov 6, 2025

dongjoon-hyun closed this in f9fadf2 Nov 6, 2025

zhengruifeng deleted the map_in_doc_test branch November 6, 2025 05:54

dongjoon-hyun reviewed Nov 8, 2025
