ARROW-8376: [R] Add experimental interface to ScanTask/RecordBatch iterators #6365
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename the pull request title in the following format?
See also:
The negative numbers are because the proper type is
Any objection to merging this, @fsaintjacques? I don't plan on advocating its use, but I thought it might be useful to have in the package for experimenting and exploring things.
#' * `projection`: A character vector of column names to select
#' * `filter`: An `Expression` to filter the scanned rows by, or `TRUE` (default)
#'   to keep all rows.
#' * `use_threads`: logical: should scanning use multithreading? Default `TRUE`
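As a conceptual illustration of the two parameters above, here is a minimal plain-Python sketch (not the arrow package's API; the rows, column names, and `keep` predicate are invented for the example) of what projection and filtering do during a scan:

```python
# Hypothetical data standing in for scanned rows.
rows = [
    {"passenger_count": 1, "fare": 7.5},
    {"passenger_count": 2, "fare": 12.0},
    {"passenger_count": 1, "fare": 5.0},
]

projection = ["fare"]                       # columns to select
keep = lambda r: r["passenger_count"] == 1  # filter expression; TRUE would keep all rows

# The scan yields only the projected columns of the rows that pass the filter.
scanned = [{c: r[c] for c in projection} for r in rows if keep(r)]
assert scanned == [{"fare": 7.5}, {"fare": 5.0}]
```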
Should `use_threads` default to `option(arrow.use_threads)` for consistency with the rest of the API?
Perhaps so, though at least these threads should be safer because they're in the C++ library and not the R bindings. I can make this change in my current PR though.
auto it = VALUE_OR_STOP(scanner->Scan());
std::vector<std::shared_ptr<ds::ScanTask>> out;
std::shared_ptr<ds::ScanTask> scan_task;
// TODO(npr): can this iteration be parallelized?
It can, but it's a hazard: e.g. each ScanTask can be attached to an open file descriptor, so you may bust limits if you collect them before aggregating them. That's why you want to consume them immediately, because you control the number of resources in flight.
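The hazard can be sketched in plain Python (this is not Arrow's API; `FakeScanTask`, its counter, and the generator are invented stand-ins for tasks pinned to open file descriptors):

```python
class FakeScanTask:
    """Stand-in for a ScanTask that holds an open file descriptor."""
    open_count = 0  # number of "descriptors" currently open

    def __init__(self):
        FakeScanTask.open_count += 1   # file opened when the task is created

    def execute(self):
        FakeScanTask.open_count -= 1   # file closed once the task is consumed

def scan(n=100):
    """Lazy stream of scan tasks, one created per iteration."""
    for _ in range(n):
        yield FakeScanTask()

# Consuming each task immediately: at most one descriptor open at a time.
peak = 0
for task in scan():
    peak = max(peak, FakeScanTask.open_count)
    task.execute()
assert peak == 1

# Collecting tasks before aggregating: all 100 descriptors open at once.
tasks = list(scan())
assert FakeScanTask.open_count == 100
```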
As an alternative to calling `ToTable()` to bring everything into memory, it would be nice to expose the stream of batches so that you could aggregate (or really do whatever) on each chunk. That gives access to the full dataset, which otherwise you can't handle unless it's small. On the NYC taxi dataset (10.5 years, 125 parquet files), this gives me the tabulation of `passenger_count` in about 200s (no parallelization). And you can see all sorts of weird features in the data:
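The chunk-at-a-time aggregation pattern described above can be sketched in plain Python (not Arrow's API; `iter_batches` and its toy values are invented for illustration, each dict standing in for one RecordBatch):

```python
from collections import Counter

def iter_batches():
    """Stand-in for a stream of record batches; each batch maps a
    column name to a list of values, like a small RecordBatch."""
    yield {"passenger_count": [1, 1, 2, 5]}
    yield {"passenger_count": [1, 3, 0, 1]}

# Aggregate per batch so only one chunk is ever in memory,
# instead of materializing the whole dataset with ToTable().
counts = Counter()
for batch in iter_batches():
    counts.update(batch["passenger_count"])

assert counts[1] == 4   # running tabulation across all batches
assert counts[0] == 1
```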