-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-8376: [R] Add experimental interface to ScanTask/RecordBatch iterators #6365
Changes from all commits
9637807
4459d77
7997b88
d053c1a
8e3e846
528ecfd
9cbf733
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -233,4 +233,32 @@ std::shared_ptr<arrow::Table> dataset___Scanner__ToTable( | |
return VALUE_OR_STOP(scanner->ToTable()); | ||
} | ||
|
||
// [[arrow::export]] | ||
std::vector<std::shared_ptr<ds::ScanTask>> dataset___Scanner__Scan( | ||
const std::shared_ptr<ds::Scanner>& scanner) { | ||
auto it = VALUE_OR_STOP(scanner->Scan()); | ||
std::vector<std::shared_ptr<ds::ScanTask>> out; | ||
std::shared_ptr<ds::ScanTask> scan_task; | ||
// TODO(npr): can this iteration be parallelized? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It can, but it's a hazard, e.g. each ScanTask can be attached to an open file descriptor, so you may bust limits if you collect them before aggregating them. That's why you want to consume them immediately, because you control the number of resource in-flight. |
||
for (auto st : it) { | ||
scan_task = VALUE_OR_STOP(st); | ||
out.push_back(scan_task); | ||
} | ||
return out; | ||
} | ||
|
||
// [[arrow::export]] | ||
std::vector<std::shared_ptr<arrow::RecordBatch>> dataset___ScanTask__get_batches( | ||
const std::shared_ptr<ds::ScanTask>& scan_task) { | ||
arrow::RecordBatchIterator rbi; | ||
rbi = VALUE_OR_STOP(scan_task->Execute()); | ||
std::vector<std::shared_ptr<arrow::RecordBatch>> out; | ||
std::shared_ptr<arrow::RecordBatch> batch; | ||
for (auto b : rbi) { | ||
batch = VALUE_OR_STOP(b); | ||
out.push_back(batch); | ||
} | ||
return out; | ||
} | ||
|
||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use_threads default to option(arrow.use_threads) for consistency and other API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps so, though at least these threads should be safer because they're in the C++ library and not the R bindings. I can make this change in my current PR though.