
ARROW-8376: [R] Add experimental interface to ScanTask/RecordBatch iterators #6365

Closed
wants to merge 7 commits

Conversation

nealrichardson (Member)

As an alternative to calling ToTable() to bring everything into memory, it would be nice to expose the stream of batches so that you could aggregate (or really do whatever) on each chunk. That gives access to the full dataset, which you otherwise can't work with unless it's small.

On the NYC taxi dataset (10.5 years, 125 parquet files),

tab <- ds %>%
  select(passenger_count) %>%
  map_batches(~count(., passenger_count)) %>%
  group_by(passenger_count) %>%
  summarize(n = sum(n))

gives me the tabulation of passenger_count in about 200s (no parallelization). And you can see all sorts of weird features in the data:

> as.data.frame(tab)
   passenger_count          n
1             -127          7
2             -123          1
3             -122          1
4             -119          1
5             -115          1
6             -101          1
7              -98          1
8              -96          1
9              -93          1
10             -92          1
11             -91          1
12             -79          1
13             -64          2
14             -63          1
15             -48       1508
16             -45          1
17             -43          4
18             -33          1
19             -31          1
20              -9          1
21              -7          1
22              -6          3
23              -2          1
24              -1         10
25               0    5809809
26               1 1078624900
27               2  227454966
28               3   67096194
29               4   32443710
30               5   99064441
31               6   37241244
32               7       1753
33               8       1437
34               9       1304
35              10         17
36              13          1
37              15          2
38              17          1
39              19          1
40              25          1
41              33          2
42              34          1
43              36          1
44              37          1
45              38          1
46              47          1
47              49         26
48              53          1
49              58          2
50              61          1
51              65          3
52              66          1
53              69          1
54              70          1
55              84          1
56              96          1
57              97          1
58             113          1
59             125          1

github-actions bot commented Feb 5, 2020

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}


fsaintjacques (Contributor)

The negative numbers are because the proper type is uint8_t: the values show up as negative when the passenger count is greater than 127 (an unsigned byte value of 208, for example, reads back as 208 - 256 = -48 when interpreted as signed). I doubt that this column (or any of them) is reliable.

nealrichardson marked this pull request as ready for review on April 8, 2020
nealrichardson changed the title from "WIP: expose an interface to ScanTask/RecordBatch iterators in R" to "ARROW-8376: [R] Add experimental interface to ScanTask/RecordBatch iterators" on Apr 8, 2020
nealrichardson (Member, Author)

Any objection to merging this, @fsaintjacques? I don't plan on advocating its use, but I thought it might be useful to have in the package for experimenting and exploring things.

#' * `projection`: A character vector of column names to select
#' * `filter`: An `Expression` to filter the scanned rows by, or `TRUE` (default)
#' to keep all rows.
#' * `use_threads`: logical: should scanning use multithreading? Default `TRUE`
fsaintjacques (Contributor)
Should use_threads default to option(arrow.use_threads), for consistency with the rest of the API?

nealrichardson (Member, Author)

Perhaps so, though at least these threads should be safer because they're in the C++ library and not the R bindings. I can make this change in my current PR though.
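
For illustration, a minimal sketch of how these scanner arguments (and the suggested option-based default) could look from R. The Scanner$create() constructor, the $ToTable() method, and the arrow.use_threads option name are assumptions here, not necessarily the exact surface this PR adds:

library(arrow)

ds <- open_dataset("nyc-taxi/")   # hypothetical dataset directory
scan <- Scanner$create(
  ds,
  projection = "passenger_count",                      # columns to read
  filter = TRUE,                                        # keep all rows (the default)
  use_threads = getOption("arrow.use_threads", TRUE)    # suggested option-based default
)
tab <- scan$ToTable()   # materializes everything; the batch iterator avoids this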

auto it = VALUE_OR_STOP(scanner->Scan());
std::vector<std::shared_ptr<ds::ScanTask>> out;
std::shared_ptr<ds::ScanTask> scan_task;
// TODO(npr): can this iteration be parallelized?
fsaintjacques (Contributor)

It can, but it's a hazard: each ScanTask can be attached to an open file descriptor, so you may bust limits if you collect them all before aggregating them. That's why you want to consume them immediately, so that you control the number of resources in flight.
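
As a rough sketch of that "consume them immediately" pattern from R: reduce each batch to a small summary as it arrives, so only one task's resources (e.g. its open file) are live at a time. The Scan() and Execute() method names follow the C++ API and are assumptions about the R interface; scan is the Scanner from the sketch above:

counts <- list()
for (task in scan$Scan()) {         # iterate ScanTasks one at a time
  for (batch in task$Execute()) {   # materialize only this task's RecordBatches
    # reduce each batch to a plain R object right away
    counts[[length(counts) + 1]] <- table(as.vector(batch$passenger_count))
  }
}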
