Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow support #21

Merged
merged 15 commits into from
May 16, 2023
Merged

Arrow support #21

merged 15 commits into from
May 16, 2023

Conversation

honno
Copy link
Member

@honno honno commented Mar 29, 2023

Should resolve #20. Wrapped pa.Table but haven't gone through all the current failures, and will need to do the same with pa.RecordBatch.

@AlenkaF
Copy link

AlenkaF commented Apr 4, 2023

I think the reason for the failing test is a bug in the Vaex implementation. It seems to create a bitmask where a missing value should be defined with 0 but is in fact defined by 1/True:

>>> import pyarrow as pa
>>> table = pa.table([[False]], names=["A"])
>>> table
pyarrow.Table
A: bool
----
A: [[false]]

>>> import vaex
>>> from vaex.dataframe_protocol import from_dataframe_to_vaex
>>> vaex_df = from_dataframe_to_vaex(table)
>>> vaex_df
  #  A
  0  False

>>> vaex_df_p = vaex_df.__dataframe__()
>>> vaex_col_p = vaex_df_p.get_column_by_name("A")           
>>> vaex_col_p.describe_null
(3, 0)

>>> vaex_col_p.get_buffers()
{'data': (VaexBuffer({'bufsize': 1, 'ptr': 105553125834912, 'device': 'CPU'}), (<_DtypeKind.BOOL: 20>, 8, '|b1', '|')), 'validity': (VaexBuffer({'bufsize': 1, 'ptr': 105553125865376, 'device': 'CPU'}), (<_DtypeKind.BOOL: 20>, 8, '|b1', '|')), 'offsets': None}
>>> vaex_col_p._get_validity_buffer()[0]._x
array([False])

In the above example describe_null gives (3, 0) but the valid value has a validity buffer value equal to 0/False which would mean it is missing.

@jorisvandenbossche
Copy link
Member

FWIW I am not sure it is very important to test both Table and RecordBatch (the implementation is almost fully shared under the hood, and certainly when creating the Table in a simple way, it will typically consist of a single batch, and then it is fully the same), in case that complicates the testing code

@honno honno force-pushed the polars-arrow-support branch 3 times, most recently from 2215ab8 to 8f8d2ad Compare May 16, 2023 12:55
@honno honno marked this pull request as ready for review May 16, 2023 13:55
@honno honno merged commit 3b2da6d into main May 16, 2023
2 checks passed
@honno
Copy link
Member Author

honno commented May 16, 2023

In the end it looks like pyarrow has the most robust protocol implementation heh.

FWIW I am not sure it is very important to test both Table and RecordBatch (the implementation is almost fully shared under the hood, and certainly when creating the Table in a simple way, it will typically consist of a single batch, and then it is fully the same), in case that complicates the testing code

Good shout, fortunately its been pretty simple to test RecordBatch with the existing machinery for Table!

@AlenkaF
Copy link

AlenkaF commented May 16, 2023

Great news! 🎉
And great work @honno, thank you!

@rgommers rgommers deleted the polars-arrow-support branch May 24, 2023 18:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add wrapper for pyarrow
3 participants