[R] Handle ChunkedArray and Table in C data interface #24492

Closed
asfimport opened this issue Mar 31, 2020 · 8 comments


@asfimport

Currently the C data interface does Array and RecordBatch, but we're also going to need ChunkedArray and Table. 

Reporter: Neal Richardson / @nealrichardson
Assignee: Neal Richardson / @nealrichardson

PRs and other links:

Note: This issue was originally created as ARROW-8301. Please see the migration documentation for further details.

@asfimport

Antoine Pitrou / @pitrou:
Could you elaborate on the use case? Is it suboptimal, for example, to export your Table one batch at a time?

Also cc @wesm for advice.

@asfimport

Wes McKinney / @wesm:
There's actually a more fundamental issue at play: C library interfaces that need to provide a sequence of C interface ArrowArray objects, which may not all be available up front. [~jacques] also brought this up when we were designing the C interface.

So the iteration API looks something like this:

struct ArrowArrayStream {
  void (*get_schema)(struct ArrowSchema*);
  int (*get_next)(struct ArrowArray*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};

Consider a canonical use case: a database query returning a sequence of RecordBatches.

You could say that this interface should be defined and redefined on an application-by-application basis, but that seems rather tedious to me.

@asfimport

Neal Richardson / @nealrichardson:
The use case I'm thinking of is: the Python package I'm using that does things with Arrow, and from which I want to pull data into R, always returns a Table. I can't "just" export its RecordBatches because Tables don't contain RecordBatches, they contain ChunkedArrays. So to export the Table, it would be something like

table.export_schema()
for col in table.chunked_arrays():
    for a in col.chunks():
        a.export_array()

and reassemble the Table. Looking at the R and Python code we have now that does the Array and RecordBatch work, I'm not sure how simple that would be to do, and I wonder if there's a better way.

@asfimport

Antoine Pitrou / @pitrou:
@wesm I wonder if an iteration API wouldn't need some kind of error signalling - e.g. the database goes down in the middle of iterating. In that case, void-returning callbacks aren't really adequate.

@nealrichardson TableBatchReader gives you a stream of record batches from a Table. It's reasonably efficient as well (no data is copied).
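To illustrate why such a reader can avoid copying: when the columns of a Table are chunked differently, row-aligned batches can be formed by repeatedly taking the smallest remaining chunk length across the columns and slicing each column at that boundary. The following is a minimal sketch of that boundary computation only, not the actual TableBatchReader implementation; `ChunkedColumn`, `aligned_batch_lengths`, and `MAX_COLS` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_COLS 16 /* arbitrary limit chosen for this sketch */

/* Hypothetical stand-in for a chunked column: only the chunk lengths matter. */
struct ChunkedColumn {
  const int64_t* chunk_lengths;
  size_t n_chunks;
};

/* Writes the row counts of the aligned batches to `out`.
 * Returns the number of batches, or -1 if `out` is too small, the column
 * count is unsupported, or the columns differ in total length. */
static int aligned_batch_lengths(const struct ChunkedColumn* cols, size_t n_cols,
                                 int64_t* out, size_t max_batches) {
  size_t chunk_idx[MAX_COLS] = {0}; /* current chunk of each column */
  int64_t offset[MAX_COLS] = {0};   /* rows already consumed in that chunk */
  size_t n_batches = 0;

  assert(cols != NULL && out != NULL);
  if (n_cols == 0 || n_cols > MAX_COLS) return -1;

  for (;;) {
    int64_t step = -1;
    size_t exhausted = 0;
    for (size_t c = 0; c < n_cols; c++) {
      /* Advance past fully consumed (or empty) chunks. */
      while (chunk_idx[c] < cols[c].n_chunks &&
             offset[c] == cols[c].chunk_lengths[chunk_idx[c]]) {
        chunk_idx[c]++;
        offset[c] = 0;
      }
      if (chunk_idx[c] == cols[c].n_chunks) {
        exhausted++;
        continue;
      }
      int64_t remaining = cols[c].chunk_lengths[chunk_idx[c]] - offset[c];
      if (step < 0 || remaining < step) step = remaining;
    }
    if (exhausted == n_cols) break;          /* every column is done */
    if (exhausted != 0) return -1;           /* unequal total lengths */
    if (n_batches == max_batches) return -1; /* output buffer too small */
    out[n_batches++] = step;
    /* A real reader would slice each column here; we only advance offsets. */
    for (size_t c = 0; c < n_cols; c++) offset[c] += step;
  }
  return (int)n_batches;
}
```

For two columns with chunk lengths {3, 2} and {1, 4}, this reports three batches of 1, 2, and 2 rows. Each batch lies inside a single chunk of every column, so it can be expressed as an offset and a length into existing buffers.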

@asfimport

Wes McKinney / @wesm:
@pitrou I agree that it would need error signaling. I was thinking that int return values might be sufficient plus perhaps a "get error message" callback.

In any case, perhaps we could propose a standard interface to add to the ABI for this.
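As a rough sketch of what the error-signaling idea discussed here could look like: each callback takes the stream itself, returns an `int` status (0 for success, an errno-style code otherwise), and a separate callback exposes the last error message. Everything below is an assumption made for illustration, not an agreed ABI: `DemoArray` is a simplified stand-in for `ArrowArray`, the `Demo*` names and the "released array means end of stream" convention are hypothetical, and `get_schema` is omitted for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for ArrowArray (hypothetical, for this sketch only). */
struct DemoArray {
  int64_t length;
  void (*release)(struct DemoArray*);
};

struct DemoArrayStream {
  /* Returns 0 on success. End of stream is success with out->release == NULL. */
  int (*get_next)(struct DemoArrayStream*, struct DemoArray* out);
  const char* (*get_last_error)(struct DemoArrayStream*);
  void (*release)(struct DemoArrayStream*);
  void* private_data;
};

/* Producer state: yields `total` batches, failing at index `fail_at` if >= 0. */
struct DemoState {
  int produced, total, fail_at;
  const char* last_error;
};

static void demo_array_release(struct DemoArray* a) { a->release = NULL; }

static int demo_get_next(struct DemoArrayStream* s, struct DemoArray* out) {
  struct DemoState* st = s->private_data;
  if (st->produced == st->fail_at) { /* e.g. the database went down */
    st->last_error = "connection lost";
    return EIO;
  }
  if (st->produced == st->total) { /* end of stream */
    out->length = 0;
    out->release = NULL;
    return 0;
  }
  out->length = 100; /* pretend each batch holds 100 rows */
  out->release = demo_array_release;
  st->produced++;
  return 0;
}

static const char* demo_get_last_error(struct DemoArrayStream* s) {
  return ((struct DemoState*)s->private_data)->last_error;
}

static void demo_release(struct DemoArrayStream* s) {
  free(s->private_data);
  s->private_data = NULL;
  s->release = NULL;
}

static int demo_stream_init(struct DemoArrayStream* s, int total, int fail_at) {
  struct DemoState* st = malloc(sizeof *st);
  if (st == NULL) return ENOMEM;
  st->produced = 0;
  st->total = total;
  st->fail_at = fail_at;
  st->last_error = NULL;
  s->get_next = demo_get_next;
  s->get_last_error = demo_get_last_error;
  s->release = demo_release;
  s->private_data = st;
  return 0;
}

/* Consumer: sums rows until end of stream; returns 0 or the first error code. */
static int drain(struct DemoArrayStream* s, int64_t* rows_out) {
  assert(s != NULL && rows_out != NULL);
  *rows_out = 0;
  for (;;) {
    struct DemoArray batch;
    int rc = s->get_next(s, &batch);
    if (rc != 0) return rc;               /* caller may ask get_last_error */
    if (batch.release == NULL) return 0;  /* end of stream */
    *rows_out += batch.length;
    batch.release(&batch);
  }
}
```

Passing the stream pointer to every callback gives the producer access to `private_data`, and a nonzero return lets a consumer stop cleanly and query `get_last_error` when the source fails partway through iteration.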

@asfimport

Antoine Pitrou / @pitrou:
@wesm We should probably launch a discussion on the ML about the ABI/API design. This depends a lot on third-party needs (though we can already take some inspiration from the Flight APIs).

@asfimport

Wes McKinney / @wesm:
Done

@asfimport

Neal Richardson / @nealrichardson:
Issue resolved by pull request 7648
#7648

@asfimport asfimport added this to the 1.0.0 milestone Jan 11, 2023