[SPARK-54211][PYTHON][DOCS] Update exmple of mapInArrow to use arrow builtin function #52909
dongjoon-hyun
+1, LGTM.
Merged to master for Apache Spark 4.2.0.
Hi @zhengruifeng.
It seems to break build_python_minimum.yml (two modules: pyspark-sql and spark-connect). Could you take a look at the failure?
**********************************************************************
2 of 5 in pyspark.sql.dataframe.DataFrame.mapInArrow
***Test Failed*** 2 failures.
-    ...         pdf = batch.to_pandas()
-    ...         yield pa.RecordBatch.from_pandas(pdf[pdf.id == 1])
+    ...         yield batch.filter(pa.compute.field("id") == 1)
     >>> df.mapInArrow(filter_func, df.schema).show()
The failure happens here.
File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 409, in pyspark.sql.dataframe.DataFrame.mapInArrow
Failed example:
    df.mapInArrow(filter_func, df.schema).show()
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.10/doctest.py", line 1350, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest pyspark.sql.dataframe.DataFrame.mapInArrow[3]>", line 1, in <module>
        df.mapInArrow(filter_func, df.schema).show()
      File "/__w/spark/spark/python/pyspark/sql/classic/dataframe.py", line 285, in show
        print(self._show_string(n, truncate, vertical))
      File "/__w/spark/spark/python/pyspark/sql/classic/dataframe.py", line 303, in _show_string
        return self._jdf.showString(n, 20, vertical)
      File "/__w/spark/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1362, in __call__
        return_value = get_return_value(
      File "/__w/spark/spark/python/pyspark/errors/exceptions/captured.py", line 269, in deco
        raise converted from None
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 3343, in main
        process()
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 3334, in process
        serializer.dump_stream(out_iter, outfile)
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 187, in dump_stream
        return super(ArrowStreamUDFSerializer, self).dump_stream(wrap_and_init_stream(), stream)
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 121, in dump_stream
        for batch in iterator:
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 167, in wrap_and_init_stream
        for batch, _ in iterator:
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 2857, in func
        for result_batch, result_type in result_iter:
      File "<doctest pyspark.sql.dataframe.DataFrame.mapInArrow[2]>", line 3, in filter_func
        yield batch.filter(pa.compute.field("id") == 1)
      File "pyarrow/table.pxi", line 2580, in pyarrow.lib.RecordBatch.filter
        return _pc().filter(self, mask, null_selection_behavior)
      File "/usr/local/lib/python3.10/dist-packages/pyarrow/compute.py", line 263, in wrapper
        return func.call(args, options, memory_pool)
      File "pyarrow/_compute.pyx", line 372, in pyarrow._compute.Function.call
        _pack_compute_args(args, &c_batch.values)
      File "pyarrow/_compute.pyx", line 505, in pyarrow._compute._pack_compute_args
        raise TypeError(f"Got unexpected argument type {type(val)} "
    pyspark.errors.exceptions.captured.PythonException:
yeah, I am fixing it in #52965
…builtin function

### What changes were proposed in this pull request?
Update the example of mapInArrow to use an Arrow built-in function.

### Why are the changes needed?
1. Arrow built-in functions are encouraged in Arrow-based UDFs, because they avoid an extra conversion between Arrow and pandas.
2. The mapInArrow doctest no longer depends on pandas, so it can still be tested when pyarrow is installed but pandas is missing.

### Does this PR introduce _any_ user-facing change?
yes, doc-only change

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#52909 from zhengruifeng/map_in_doc_test.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
Update the example of mapInArrow to use an Arrow built-in function.
Why are the changes needed?
1. Arrow built-in functions are encouraged in Arrow-based UDFs, because they avoid an extra conversion between Arrow and pandas.
2. The mapInArrow doctest no longer depends on pandas, so it can still be tested when pyarrow is installed but pandas is missing.
Does this PR introduce any user-facing change?
yes, doc-only change
How was this patch tested?
CI
Was this patch authored or co-authored using generative AI tooling?
no