[Python] Cannot read empty DataFrame Interchange object #37050

stinodego · 2023-08-07T20:49:39Z

Describe the bug, including details regarding any error messages, version, and platform.

Creating an empty table, converting to the interchange format, then reading it back, gives an error:

import pyarrow as pa
import pyarrow.interchange

df = pa.table([[]], names=['col1'])
dfi = df.__dataframe__()
pa.interchange.from_dataframe(dfi)
# ValueError: Must pass schema, or at least one RecordBatch

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/stijn/code/polars/py-polars/.venv/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py", line 86, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stijn/code/polars/py-polars/.venv/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py", line 112, in _from_dataframe
    return pa.Table.from_batches(batches)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 3972, in pyarrow.lib.Table.from_batches
ValueError: Must pass schema, or at least one RecordBatch

I believe the reason for this is that dfi.num_chunks() is 0, when it should be 1 (a single, empty chunk).

Component(s)

Python


AlenkaF · 2023-08-16T13:58:29Z

Thank you for submitting the issue @stinodego!

That's right: num_chunks should return 1 for an empty table that holds a single empty chunk. That is also in line with the behaviour of pyarrow.ChunkedArray.num_chunks.

After some research I found that the underlying issue is in the to_batches method of pa.Table. For an empty table it returns an empty list instead of an empty record batch whose schema equals the schema of the table.

>>> # Using schema when constructing the table here otherwise
>>> # we get a Null Array which is not supported by the protocol
>>> my_schema = pa.schema([
...     pa.field('col1', pa.int64()),])
>>> df = pa.table([[]], schema=my_schema)
>>> df
pyarrow.Table
col1: int64
----
col1: [[]]

>>> df.to_batches()
[]

>>> # Should result in
>>> batch = pa.record_batch([[]], schema=my_schema)
>>> batch
pyarrow.RecordBatch
col1: int64
----
col1: []

Because we use to_batches in the protocol implementation for num_chunks and also for get_chunks, the error is raised.

I have opened a new issue to fix the behaviour of pa.Table.to_batches(): #37200.
I will keep this issue open to see whether the fix for #37200 is enough (and add a test in that case).

AlenkaF · 2023-10-05T11:21:04Z

Added a workaround for empty dataframes with 0 chunks, as it is more general (other libraries might also create interchange objects without chunks). PR: #38037

…ataframes (#38037)

### Rationale for this change
The implementation of the DataFrame Interchange Protocol does not currently support consumption of dataframes with 0 number of chunks (empty dataframes).

### What changes are included in this PR?
Add a workaround to not error in this case.

### Are these changes tested?
Yes, added `test_empty_dataframe` in `python/pyarrow/tests/interchange/test_conversion.py`.

### Are there any user-facing changes?
No.

* Closes: #37050

Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

stinodego added the Type: bug label Aug 7, 2023

github-actions bot added the Component: Python label Aug 7, 2023

stinodego changed the title ~~Cannot read empty DataFrame Interchange object~~ [Python] Cannot read empty DataFrame Interchange object Aug 8, 2023

AlenkaF self-assigned this Oct 5, 2023

github-actions bot mentioned this issue Oct 5, 2023

GH-37050: [Python][Interchange protocol] Add a workaround for empty dataframes #38037

Merged

AlenkaF added this to the 14.0.0 milestone Oct 5, 2023

jorisvandenbossche closed this as completed in #38037 Oct 10, 2023
