[Python] Cannot read empty DataFrame Interchange object #37050

Closed
stinodego opened this issue Aug 7, 2023 · 2 comments · Fixed by #38037


stinodego commented Aug 7, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Creating an empty table, converting to the interchange format, then reading it back, gives an error:

import pyarrow as pa
import pyarrow.interchange

df = pa.table([[]], names=['col1'])
dfi = df.__dataframe__()
pa.interchange.from_dataframe(dfi)
# ValueError: Must pass schema, or at least one RecordBatch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/stijn/code/polars/py-polars/.venv/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py", line 86, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stijn/code/polars/py-polars/.venv/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py", line 112, in _from_dataframe
    return pa.Table.from_batches(batches)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 3972, in pyarrow.lib.Table.from_batches
ValueError: Must pass schema, or at least one RecordBatch

I believe the reason for this is that dfi.num_chunks() is 0, when it should be 1 (a single, empty chunk).

Component(s)

Python

@stinodego stinodego changed the title Cannot read empty DataFrame Interchange object [Python] Cannot read empty DataFrame Interchange object Aug 8, 2023
Member

AlenkaF commented Aug 16, 2023

Thank you for submitting the issue @stinodego!

That's right: num_chunks should return 1 for an empty table that consists of one empty chunk. That is also in line with the behaviour of pyarrow.ChunkedArray.num_chunks.

After some research I found that the underlying issue is in the to_batches method of pa.Table. For an empty table it returns an empty list instead of a single empty record batch whose schema equals the schema of the table.

>>> # Using schema when constructing the table here otherwise
>>> # we get a Null Array which is not supported by the protocol
>>> my_schema = pa.schema([
...     pa.field('col1', pa.int64()),])
>>> df = pa.table([[]], schema=my_schema)
>>> df
pyarrow.Table
col1: int64
----
col1: [[]]

>>> df.to_batches()
[]

>>> # Should result in
>>> batch = pa.record_batch([[]], schema=my_schema)
>>> batch
pyarrow.RecordBatch
col1: int64
----
col1: []

Because the protocol implementation uses to_batches for both num_chunks and get_chunks, the interchange object reports zero chunks and the error is raised.

I have opened a new issue to fix the behaviour of pa.Table.to_batches(): #37200.
I will keep this issue open to see whether the fix for #37200 is enough (and add a test in that case).

Member

AlenkaF commented Oct 5, 2023

Added a workaround for empty dataframes with 0 chunks, as it is more general (other libraries might also create interchange objects without chunks). PR: #38037

@AlenkaF AlenkaF added this to the 14.0.0 milestone Oct 5, 2023
jorisvandenbossche pushed a commit that referenced this issue Oct 10, 2023
…ataframes (#38037)

### Rationale for this change

The implementation of the DataFrame Interchange Protocol does not currently support consuming dataframes with 0 chunks (empty dataframes).

### What changes are included in this PR?

Add a workaround to not error in this case.

### Are these changes tested?

Yes, added `test_empty_dataframe` in `python/pyarrow/tests/interchange/test_conversion.py`.

### Are there any user-facing changes?
No.
* Closes: #37050

Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>