Pyspark pandas_udfs are failing for columns with arrays of booleans when using arrow #5203

Closed
finsqm opened this issue Aug 27, 2019 · 2 comments


finsqm commented Aug 27, 2019

pyarrow version: 0.14.1
pyspark version: 2.4.0

Problem: I'm trying to run a pandas_udf in Spark over a column containing an array of booleans, but Arrow fails when converting that column to pandas. I'm not sure whether to raise this in the Spark repo or here.

Stacktrace:

  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
    process()
  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 283, in dump_stream
    for series in iterator:
  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 301, in load_stream
    yield [self.arrow_to_pandas(c) for c in pa.Table.from_batches([batch]).itercolumns()]
  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 301, in <listcomp>
    yield [self.arrow_to_pandas(c) for c in pa.Table.from_batches([batch]).itercolumns()]
  File "/<path-to-project>/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 270, in arrow_to_pandas
    s = arrow_column.to_pandas()
  File "pyarrow/array.pxi", line 441, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 468, in pyarrow.lib.Column._to_pandas
  File "pyarrow/table.pxi", line 144, in pyarrow.lib.ChunkedArray._to_pandas
  File "pyarrow/error.pxi", line 93, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Not implemented type for lists: bool
@jorisvandenbossche
Member

It seems that converting a list type of booleans is not (yet) supported in the arrow->pandas conversion code:

In [4]: a = pa.array(np.array([[True, False], [True, True, True]], dtype=object))

In [5]: a 
Out[5]: 
<pyarrow.lib.ListArray object at 0x7f37b71e4a98>
[
  [
    true,
    false
  ],
  [
    true,
    true,
    true
  ]
]

In [6]: a.to_pandas() 
...
ArrowNotImplementedError: Not implemented type for lists: bool

But for a plain boolean array, this works fine. So a workaround for now is to ensure you have a plain boolean array, and not a nested list with booleans.
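
To illustrate the workaround, here is a minimal sketch in plain Python. It makes no pyarrow calls, so nothing in it should be read as the library's API. It splits a nested list of booleans into a flat boolean list plus row offsets, which is conceptually how Arrow lays out a list column, and then rebuilds the rows. The function names `flatten_bool_lists` and `unflatten_bool_lists` are hypothetical, and null rows are assumed not to occur.

```python
from itertools import accumulate
from typing import List, Tuple


def flatten_bool_lists(rows: List[List[bool]]) -> Tuple[List[bool], List[int]]:
    """Split nested boolean rows into a flat value list plus row offsets.

    Hypothetical helper: row i occupies values[offsets[i]:offsets[i + 1]].
    Assumes every row is a list (no None rows).
    """
    values = [flag for row in rows for flag in row]
    offsets = [0] + list(accumulate(len(row) for row in rows))
    return values, offsets


def unflatten_bool_lists(values: List[bool], offsets: List[int]) -> List[List[bool]]:
    """Rebuild the nested rows from the flat values and the offsets."""
    return [values[start:stop] for start, stop in zip(offsets, offsets[1:])]


rows = [[True, False], [True, True, True]]
values, offsets = flatten_bool_lists(rows)
restored = unflatten_bool_lists(values, offsets)
```

The flat `values` list is the kind of plain boolean data that should convert without the error above, while `offsets` can be carried alongside it as ordinary integers and used to restore the nesting after the conversion.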


wesm commented Aug 27, 2019
