Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLib][Ruby][Dataset] Remove bindings for internal classes #27634

Closed
asfimport opened this issue Feb 25, 2021 · 11 comments
Closed

[GLib][Ruby][Dataset] Remove bindings for internal classes #27634

asfimport opened this issue Feb 25, 2021 · 11 comments

Comments

@asfimport
Copy link

asfimport commented Feb 25, 2021

GLib and ruby include bindings for internal classes such as ScanOptions, ScanContext, InMemoryScanTask, ScanTask, ... These are probably unnecessary and should be removed to present a simpler interface less prone to breakage under refactoring of the wrapped classes https://github.com/apache/arrow/pull/9532/checks?check_run_id=1974229719#step:8:2071

Reporter: Ben Kietzman / @bkietz
Assignee: Kouhei Sutou / @kou

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-11782. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Ben Kietzman / @bkietz:
@mrkn

@asfimport
Copy link
Author

Weston Pace / @westonpace:
This may be an issue for ARROW-7001.  ScanTask is changing to use some of the internal async utilities which I do not believe we are not exposing yet (AsyncGenerator, Future, etc.)

@asfimport
Copy link
Author

Kouhei Sutou / @kou:
It seems that pyarrow also has bindings of ScanTask, .... Should we also remove them from pyarrow?

@asfimport
Copy link
Author

Kouhei Sutou / @kou:
R also has ScanTask bindings.

@asfimport
Copy link
Author

Weston Pace / @westonpace:
Thanks for pointing this out.  I think we can't get rid of ScanTask.  I will create a ScanTaskInternal (or some similar name) for the work I'm doing and leave ScanTask's interface as is.

In the future though it might be better to change Scanner::Scan to return a RecordBatchIterator instead of ScanTaskIterator.  Otherwise we're putting the burden on the user to manage how many files are loaded concurrently.  I think that should either be hidden completely or controlled by some option (MaxConcurrentFiles or, better yet, MaxAllocatedBytes)

@asfimport
Copy link
Author

Ben Kietzman / @bkietz:
IIUC, python and R only use ScanTasks when materializing a stream of RecordBatches instead of using Scanner::ToTable. That functionality could easily be moved upstream into C++ by providing Scanner::GetRecordBatches or similar, which would be easier to consume from bindings in any case. Therefore I'd say the wrappers for ScanTask could also be removed from Python and R.

If that's not acceptable, I'd think that ARROW-7001 should prefer to add ScanTask::ExecuteAsync and refactor ScanTask::Execute to be a backwards compatible wrapper.

@jorisvandenbossche @nealrichardson @westonpace

@asfimport
Copy link
Author

Neal Richardson / @nealrichardson:
I would love to delete ScanTask from the R bindings. The reason they're exposed there is to support a (hacky, experimental) attempt to do computations on the stream of record batches so that it's possible to compute things that we couldn't do otherwise because we can't hold the whole Table in memory. So Scanner::ToBatches doesn't work in that case because everything would be materialized.

What I really want is to be able to essentially pass a function/lambda to something like ToTable or ToBatches and have that function be applied to every record batch in the stream. I don't want to manage consuming the ScanTasks/RecordBatchIterators, I'd prefer to have the C++ library handle that. (In my current hacky use of ScanTasks, it's actually prohibitively slow because it has to consume the iterators single-threaded.)

@asfimport
Copy link
Author

Ben Kietzman / @bkietz:
We could certainly provide Scanner::VisitBatches instead of/in addition to Scanner::ToBatches. I'll open a JIRA for both methods and make a PR today, so that everyone here can see if it accomodates their use cases

@asfimport
Copy link
Author

@asfimport
Copy link
Author

Kouhei Sutou / @kou:
Thanks. I understand.

I agree with removing GLib bindings of them.

@asfimport
Copy link
Author

Kouhei Sutou / @kou:
Issue resolved by pull request 10533
#10533

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants