Pyspark pandas_udfs are failing for columns with arrays of booleans when using arrow #5203

finsqm · 2019-08-27T07:22:15Z

pyarrow version: 0.14.1
pyspark version: 2.4.0

Problem: I'm trying to run a pandas_udf in Spark over a column containing an array of booleans, but the Arrow-to-pandas conversion fails with the error below. I'm not sure whether to raise this in the Spark repo or here.

Stacktrace:

  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
    process()
  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 283, in dump_stream
    for series in iterator:
  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 301, in load_stream
    yield [self.arrow_to_pandas(c) for c in pa.Table.from_batches([batch]).itercolumns()]
  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 301, in <listcomp>
    yield [self.arrow_to_pandas(c) for c in pa.Table.from_batches([batch]).itercolumns()]
  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 270, in arrow_to_pandas
    s = arrow_column.to_pandas()
  File "pyarrow/array.pxi", line 441, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 468, in pyarrow.lib.Column._to_pandas
  File "pyarrow/table.pxi", line 144, in pyarrow.lib.ChunkedArray._to_pandas
  File "pyarrow/error.pxi", line 93, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Not implemented type for lists: bool


jorisvandenbossche · 2019-08-27T11:35:05Z

It seems that converting a list type of booleans is not (yet) supported in the arrow->pandas conversion code:

In [4]: a = pa.array(np.array([[True, False], [True, True, True]], dtype=object))

In [5]: a 
Out[5]: 
<pyarrow.lib.ListArray object at 0x7f37b71e4a98>
[
  [
    true,
    false
  ],
  [
    true,
    true,
    true
  ]
]

In [6]: a.to_pandas() 
...
ArrowNotImplementedError: Not implemented type for lists: bool

But for a plain boolean array, this works fine. So a workaround for now is to ensure you have a plain boolean array, and not a nested list with booleans.

wesm · 2019-08-27T15:45:42Z

https://issues.apache.org/jira/browse/ARROW-6369

wesm closed this as completed Aug 27, 2019

asfimport mentioned this issue Sep 6, 2019

[Python] Support list-of-boolean in Array.to_pandas conversion #22743

Closed
