@zhengruifeng
Contributor

What changes were proposed in this pull request?

Update the example of mapInArrow to use Arrow builtin functions

Why are the changes needed?

1. It is encouraged to use Arrow builtin functions in Arrow-based UDFs, so that the extra conversion between Arrow and pandas can be avoided;
2. The mapInArrow doctest no longer depends on pandas, so it can still be tested when pyarrow is installed but pandas is missing.

Does this PR introduce any user-facing change?

yes, doc-only change

How was this patch tested?

ci

Was this patch authored or co-authored using generative AI tooling?

no

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@dongjoon-hyun
Member

Merged to master for Apache Spark 4.2.0.

@zhengruifeng zhengruifeng deleted the map_in_doc_test branch November 6, 2025 05:54
Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi @zhengruifeng .

It seems to break build_python_minimum.yml (two modules: pyspark-sql and spark-connect). Could you take a look at the failure?

**********************************************************************
   2 of   5 in pyspark.sql.dataframe.DataFrame.mapInArrow
***Test Failed*** 2 failures.

... pdf = batch.to_pandas()
... yield pa.RecordBatch.from_pandas(pdf[pdf.id == 1])
... yield batch.filter(pa.compute.field("id") == 1)
>>> df.mapInArrow(filter_func, df.schema).show()
Member

The failure happens here.

File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 409, in pyspark.sql.dataframe.DataFrame.mapInArrow
Failed example:
    df.mapInArrow(filter_func, df.schema).show()
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest pyspark.sql.dataframe.DataFrame.mapInArrow[3]>", line 1, in <module>
        df.mapInArrow(filter_func, df.schema).show()
      File "/__w/spark/spark/python/pyspark/sql/classic/dataframe.py", line 285, in show
        print(self._show_string(n, truncate, vertical))
      File "/__w/spark/spark/python/pyspark/sql/classic/dataframe.py", line 303, in _show_string
        return self._jdf.showString(n, 20, vertical)
      File "/__w/spark/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1362, in __call__
        return_value = get_return_value(
      File "/__w/spark/spark/python/pyspark/errors/exceptions/captured.py", line 269, in deco
        raise converted from None
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 3343, in main
        process()
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 3334, in process
        serializer.dump_stream(out_iter, outfile)
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 187, in dump_stream
        return super(ArrowStreamUDFSerializer, self).dump_stream(wrap_and_init_stream(), stream)
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 121, in dump_stream
        for batch in iterator:
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 167, in wrap_and_init_stream
        for batch, _ in iterator:
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 2857, in func
        for result_batch, result_type in result_iter:
      File "<doctest pyspark.sql.dataframe.DataFrame.mapInArrow[2]>", line 3, in filter_func
        yield batch.filter(pa.compute.field("id") == 1)
      File "pyarrow/table.pxi", line 2580, in pyarrow.lib.RecordBatch.filter
        return _pc().filter(self, mask, null_selection_behavior)
      File "/usr/local/lib/python3.10/dist-packages/pyarrow/compute.py", line 263, in wrapper
        return func.call(args, options, memory_pool)
      File "pyarrow/_compute.pyx", line 372, in pyarrow._compute.Function.call
        _pack_compute_args(args, &c_batch.values)
      File "pyarrow/_compute.pyx", line 505, in pyarrow._compute._pack_compute_args
        raise TypeError(f"Got unexpected argument type {type(val)} "
    pyspark.errors.exceptions.captured.PythonException: 

Contributor Author

@zhengruifeng zhengruifeng Nov 10, 2025

yeah, I am fixing it in #52965

zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
…builtin function

### What changes were proposed in this pull request?
Update the example of mapInArrow to use Arrow builtin functions

### Why are the changes needed?
1. It is encouraged to use Arrow builtin functions in Arrow-based UDFs, so that the extra conversion between Arrow and pandas can be avoided;
2. The mapInArrow doctest no longer depends on pandas, so it can still be tested when pyarrow is installed but pandas is missing.

### Does this PR introduce _any_ user-facing change?
yes, doc-only change

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#52909 from zhengruifeng/map_in_doc_test.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025