ARROW-8301: [R] Handle ChunkedArray and Table in C data interface #7648

Closed
nealrichardson wants to merge 1 commit into apache:master from nealrichardson:r-py-tables

Conversation

@nealrichardson
Member

In terms of lines of code, this wasn't bad, though I don't know how efficient these methods are. Maybe there's a better way.

The one thing that would be lost is any metadata attached to the Table's schema, because the Table is reconstructed from its ChunkedArrays without a schema. I wonder if we could export the Schema on its own: the existing _export_to_c/_import_from_c methods take both an array pointer and a schema pointer.
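For reference, a minimal C++ sketch of what exporting a Schema on its own looks like through the C data interface, using ExportSchema/ImportSchema from arrow/c/bridge.h. The RoundTripSchema helper is illustrative only, not part of this PR:

```cpp
// Illustrative sketch: move a Schema across the C data interface by itself,
// so Table-level metadata need not be lost.
#include <arrow/api.h>
#include <arrow/c/bridge.h>

arrow::Result<std::shared_ptr<arrow::Schema>> RoundTripSchema(
    const std::shared_ptr<arrow::Schema>& schema) {
  ArrowSchema c_schema;
  // Export fills in the C struct; key-value metadata travels with it.
  ARROW_RETURN_NOT_OK(arrow::ExportSchema(*schema, &c_schema));
  // Import consumes (releases) the C struct and rebuilds an arrow::Schema.
  return arrow::ImportSchema(&c_schema);
}
```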

@nealrichardson nealrichardson requested a review from pitrou July 6, 2020 19:35
@github-actions

github-actions bot commented Jul 6, 2020

colnames <- maybe_py_to_r(x$column_names)  # column names from the Python Table
r_cols <- maybe_py_to_r(x$columns)         # converts each column as a ChunkedArray
names(r_cols) <- colnames                  # reattach the names
Table$create(!!!r_cols)                    # splice the named list into Table$create()
Member

Out of curiosity, what does the !!! mean?

Member Author

It's rlang's splice operator: Table$create(!!!r_cols) expands the named list r_cols into individual named arguments. It's like doing **kwargs in Python.

@pitrou
Member

pitrou commented Jul 7, 2020

The main source of potential inefficiency here is that the Schema is exported/imported once for each chunk. We may or may not care immediately about this.

Also, note that you can transfer a Table as a sequence of RecordBatches rather than as a sequence of ChunkedArrays. If you have very many columns, that would probably be more efficient.

(In C++, you can use TableBatchReader to export a Table as RecordBatches, and Table::FromRecordBatches to recreate a Table from the RecordBatches; see the sketch below.)
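As a hedged sketch of that round trip (the RoundTripTable helper is illustrative, not part of this PR, but TableBatchReader and Table::FromRecordBatches are the C++ APIs named above):

```cpp
// Illustrative sketch: export a Table as RecordBatches and rebuild it.
// TableBatchReader re-chunks columns by slicing, so each emitted batch
// is aligned across columns.
#include <arrow/api.h>
#include <memory>
#include <vector>

arrow::Result<std::shared_ptr<arrow::Table>> RoundTripTable(
    const std::shared_ptr<arrow::Table>& table) {
  arrow::TableBatchReader reader(*table);
  std::vector<std::shared_ptr<arrow::RecordBatch>> batches;
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader.ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    batches.push_back(batch);
  }
  // Passing the original schema keeps its metadata on the rebuilt Table.
  return arrow::Table::FromRecordBatches(table->schema(), batches);
}
```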

@pitrou
Member

pitrou commented Jul 7, 2020

(Otherwise, the code here looks OK, but I'm not an R expert at all :-))

@nealrichardson
Member Author

> Also, note that you can transfer a Table as a sequence of RecordBatches, rather than a sequence of ChunkedArrays.

But Tables are a sequence of ChunkedArrays, right? Those ChunkedArrays may not be chunked the same way, and dictionaries within a ChunkedArray may not be the same, etc. Can TableBatchReader handle those cases robustly and with zero-copy?

@pitrou
Member

pitrou commented Jul 7, 2020

It should :-)

@pitrou
Member

pitrou commented Jul 7, 2020

However, it may use the offset member of arrays, which might not work with the C Data Interface...

@nealrichardson
Member Author

@wesm do you have an opinion on this?

@wesm
Member

wesm commented Jul 7, 2020

As long as the implementation details / semantics aren't exposed (they don't seem to be), this seems sufficient to me to have the feature established, and we can always return later and make it more efficient (or replace it with the iterator mechanism that will hopefully get some discussion on the mailing list).

@nealrichardson
Member Author

Ok then I'll merge this and we can come back and improve it later.
