Problem: I'm trying to run a pandas_udf in Spark over a column containing an array of booleans, but the Arrow-to-pandas conversion fails with ArrowNotImplementedError. I'm not sure whether to raise this in the Spark repo or here.
Stacktrace:
  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
    process()
  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 283, in dump_stream
    for series in iterator:
  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 301, in load_stream
    yield [self.arrow_to_pandas(c) for c in pa.Table.from_batches([batch]).itercolumns()]
  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 301, in <listcomp>
    yield [self.arrow_to_pandas(c) for c in pa.Table.from_batches([batch]).itercolumns()]
  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 270, in arrow_to_pandas
    s = arrow_column.to_pandas()
  File "pyarrow/array.pxi", line 441, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 468, in pyarrow.lib.Column._to_pandas
  File "pyarrow/table.pxi", line 144, in pyarrow.lib.ChunkedArray._to_pandas
  File "pyarrow/error.pxi", line 93, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Not implemented type for lists: bool
It seems that converting a list type of booleans is not (yet) supported in the arrow->pandas conversion code:
In [4]: a = pa.array([[True, False], [True, True, True]])
In [5]: a
Out[5]:
<pyarrow.lib.ListArray object at 0x7f37b71e4a98>
[
  [
    true,
    false
  ],
  [
    true,
    true,
    true
  ]
]
In [6]: a.to_pandas()
...
ArrowNotImplementedError: Not implemented type for lists: bool
A plain (non-nested) boolean array converts fine. So a workaround for now is to make sure the data reaching the conversion is a plain boolean array, or a list of some other supported element type, rather than a nested list of booleans.
pyarrow version: 0.14.1
pyspark version: 2.4.0