
ARROW-13465: [R] to_arrow() from duckdb #11032

Closed

Conversation

@jonkeane (Member) commented Aug 30, 2021

This isn't yet fully functional, but I've unmarked [WIP] to get fuller CI on it.

This now works, something like:

ds <- InMemoryDataset$create(mtcars)

ds %>%
  filter(mpg < 30) %>%
  to_duckdb() %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  to_arrow() %>%
  collect()

r/R/duckdb.R (outdated; resolved)
@jonkeane jonkeane force-pushed the ARROW-13465-to_arrow-from-duckdb branch from 383cbb4 to 398bc03 Compare September 21, 2021 15:34
@jonkeane jonkeane force-pushed the ARROW-13465-to_arrow-from-duckdb branch 2 times, most recently from 87c3a96 to 5b9d598 Compare September 23, 2021 16:27
@jonkeane jonkeane changed the title ARROW-13465: [R] to_arrow() from duckdb [WIP] ARROW-13465: [R] to_arrow() from duckdb Sep 23, 2021
@github-actions:

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@jonkeane jonkeane force-pushed the ARROW-13465-to_arrow-from-duckdb branch from 931bc0c to 5192d72 Compare October 4, 2021 17:10
@@ -840,6 +841,7 @@ TEST(ExecPlanExecution, ScalarSourceScalarAggSink) {
}))));
}

<<<<<<< HEAD
Member:

Looks like you have a merge issue here

Member Author:

I'm working on this again this morning — I think I'm close to constructing the proper arrow_dplyr_query object

Member Author:

Weird, this wasn't showing up in vscode, but was definitely there — should be fixed now regardless.

r/R/duckdb.R Outdated
# * get the record batch reader from duckdb
# * produce the SourceNode
# * build an ExecPlan with that in place of the ScanNode you would have gotten from ExecNode_Scan
plan <- ExecPlan$create()
Member:

I don't think this will work because you're making this node with an ExecPlan here, but I think you'll be creating a different ExecPlan in collect().

Why not return the RecordBatchReader from duckdb::duckdb_fetch_record_batch(res)? Then do the wrapping of that inside plan$Build (or even plan$Scan)

Member Author:

Oh that's an interesting idea. Let me explore that. AFAIU we can't simply return the RecordBatchReader, since we want to be able to continue building an arrow_dplyr_query here (unless we make all the dplyr methods available for RecordBatchReaders too)

Member:

Right. So to restate: allow an arrow_dplyr_query to hold a RecordBatchReader in .data (similar to how I extended it recently to allow holding an arrow_dplyr_query itself, in collapse). So you'd return an arrow_dplyr_query here, containing a RecordBatchReader.
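A minimal sketch of what that could look like (the duckdb and dbplyr calls here are real APIs, but the overall shape is a guess at the design under discussion, not the final implementation):

```r
to_arrow <- function(.data) {
  # Compile the dbplyr pipeline to SQL and run it on the duckdb connection,
  # asking for an Arrow result instead of a data.frame
  res <- DBI::dbSendQuery(
    dbplyr::remote_con(.data),
    dbplyr::remote_query(.data),
    arrow = TRUE
  )

  # Wrap the resulting RecordBatchReader in an arrow_dplyr_query so that
  # the usual dplyr verbs can keep building on top of it
  arrow_dplyr_query(duckdb::duckdb_fetch_record_batch(res))
}
```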

@@ -59,6 +59,10 @@ ExecPlan <- R6Class("ExecPlan",
Scan = function(dataset) {
# Handle arrow_dplyr_query
if (inherits(dataset, "arrow_dplyr_query")) {
if(inherits(dataset$.data, "RecordBatchReader")) {
return(ExecNode_ReadFromRecordBatchReader(self, dataset$.data))
}
Member Author:

This doesn't (yet) work, I'm getting a segfault

Warning: stack imbalance in '[[', 215 then 216
Warning: stack imbalance in 'is.null', 213 then 220
Warning: stack imbalance in '!', 212 then 219
Warning: stack imbalance in '&&', 210 then 217
Error: C stack usage  17587191479824 is too close to the limit
Warning: stack imbalance in '&&', 208 then 215
Warning: stack imbalance in '(', 207 then 214
Warning: stack imbalance in '||', 205 then 212
Warning: stack imbalance in 'if', 203 then 210
Warning: stack imbalance in '{', 199 then 218
Warning: stack imbalance in '{', 192 then 211
Warning: stack imbalance in 'is.null', 187 then 206
Warning: stack imbalance in '!', 186 then 205
Warning: stack imbalance in '&&', 184 then 203
Warning: stack imbalance in '&&', 182 then 201
Warning: stack imbalance in '(', 181 then 200
Warning: stack imbalance in '||', 179 then 198
Warning: stack imbalance in 'if', 177 then 196
Warning: stack imbalance in '{', 173 then 192
Warning: stack imbalance in '<-', 168 then 187
Warning: stack imbalance in '{', 164 then 183
Warning: stack imbalance in 'if', 162 then 181

 *** caught segfault ***
address 0x0, cause 'memory not mapped'
Warning: stack imbalance in '>', 183 then 187

I'm digging into this now

Member:

This is a duckdb issue right?

Member Author:

Yes, we believe so. I don't yet have a reprex, but I'm working on one.

r/R/duckdb.R Outdated
# * build an ExecPlan with that in place of the ScanNode you would have gotten from ExecNode_Scan
# source_node <- ExecNode_ReadFromRecordBatchReader(plan, RBR)
.data <- duckdb::duckdb_fetch_record_batch(res)
structure(
Member:

You shouldn't have to copy this from arrow_dplyr_query() should you?

Member Author:

Right, I did out of expedience, but will come back and clean this up to do the right thing later.

Member:

You might need to collapse() this further, judging from how you're handling this case in plan$Scan: you don't want any filter/projection at this level of nesting

@jonkeane (Member Author) commented Oct 5, 2021

Ok, this is getting closer. I've added a test using our own RecordBatchReader export and then wrapping that with arrow_dplyr_query, which works 🎉

With the DuckDB reader, I'm getting a number of segfaults. The messages and the reported fault location vary; each of the following is an error I've seen so far (on separate runs):

...Error: no more error handlers available (recursive errors?); invoking 'abort' restart
Error: SET_VECTOR_ELT() can only be applied to a 'list', not a 'double'
Warning: stack imbalance in '{', 70 then 64
...Warning: stack imbalance in '$', 224 then 215
Warning: stack imbalance in 'is.null', 223 then 213

 *** caught bus error ***
address 0x7fbf1b470b00, cause 'non-existent physical address'
An irrecoverable exception occurred. R is aborting now ...

 *** caught segfault ***
address 0x0, cause 'memory not mapped'
...Warning: stack imbalance in '&&', 201 then 202
Warning: stack imbalance in '(', 200 then 206
Error: C stack usage  17587432755568 is too close to the limit
Warning: stack imbalance in '||', 198 then 204
Warning: stack imbalance in 'if', 196 then 202
Warning: stack imbalance in '!', 205 then 63
Warning: stack imbalance in '[', 202 then 62
Warning: stack imbalance in '{', 192 then 63
Warning: stack imbalance in '{', 185 then 61
Warning: stack imbalance in 'is.null', 180 then 59
Warning: stack imbalance in '!', 179 then 58
Warning: stack imbalance in '&&', 177 then 59
Warning: stack imbalance in '&&', 175 then 57
Warning: stack imbalance in '(', 174 then 56
Warning: stack imbalance in 'is.null', 63 then 65
Warning: stack imbalance in '!', 62 then 68
Warning: stack imbalance in '&&', 60 then 66
Warning: stack imbalance in '&&', 58 then 67
Warning: stack imbalance in '(', 57 then 66
Warning: stack imbalance in '||', 172 then 66
Warning: stack imbalance in 'if', 170 then 64
Warning: stack imbalance in '<-', 66 then 67
Warning: stack imbalance in '{', 166 then 63
Warning: stack imbalance in '<-', 161 then 58
Warning: stack imbalance in '{', 157 then 54
Warning: stack imbalance in 'if', 155 then 53
Warning: stack imbalance in 'is.null', 81 then 82
Warning: stack imbalance in 'if', 78 then 81

 *** caught segfault ***
address 0x1fb3eed17820, cause 'memory not mapped'
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

 *** caught segfault ***
address 0x0, cause 'memory not mapped'

@jonkeane jonkeane force-pushed the ARROW-13465-to_arrow-from-duckdb branch 2 times, most recently from c78e54d to aaaa9f9 Compare October 5, 2021 18:49
r/R/duckdb.R Outdated
#' summarize(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
#' to_arrow() %>%
#' collect()
to_arrow <- function(.data, as_table = TRUE) {
Member Author:

Ultimately, I don't believe we want to have as_table = TRUE, but I've added it for now to show that arrow_dplyr_query(duckdb::duckdb_fetch_record_batch(res)$read_table()) (the default, as_table = TRUE) works just fine, but when we use arrow_dplyr_query(duckdb::duckdb_fetch_record_batch(res)) directly, we get a segfault.

We might need (or want) to release with the $read_table() version this time, and then upgrade to the version that doesn't convert to a table in 7.0.0 (after duckdb resolves the issue + releases, presuming the issue is on their side)
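To make the two paths concrete, the gist of the as_table switch is something like the following (a sketch, assuming res is the duckdb result; as_table is the temporary escape hatch, not the intended final API):

```r
reader <- duckdb::duckdb_fetch_record_batch(res)
if (as_table) {
  # Safe fallback: materialize the whole result into an Arrow Table up front
  out <- arrow_dplyr_query(reader$read_table())
} else {
  # Streaming path: hand the reader itself to the query; currently segfaults
  out <- arrow_dplyr_query(reader)
}
```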

Member:

Were you going to remove this argument before merging?

Member Author:

Yes, sorry I removed the code that used it, but not the option itself.

r/R/duckdb.R Outdated
#' to_arrow() %>%
#' collect()
to_arrow <- function(.data, as_table = TRUE) {
# TODO: figure out WTAF .data is before just doing stuff
Member Author:

I will also do this, since at the very least we should gate this to tbls that are backed by duckdb connections.

There might be a few other options for things we could accept here too, in principle we could accept a DBIResult (again, backed by duckdb) which would make it easier to send SQL and avoid dbplyr totally.
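That gating could look roughly like this (a sketch; the class names being checked are assumptions about what duckdb/dbplyr attach and would need verifying):

```r
to_arrow <- function(.data) {
  if (inherits(.data, "tbl_duckdb_connection")) {
    # dbplyr pipeline backed by a duckdb connection: compile to SQL first
    res <- DBI::dbSendQuery(
      dbplyr::remote_con(.data), dbplyr::remote_query(.data), arrow = TRUE
    )
  } else if (inherits(.data, "duckdb_result")) {
    # a DBIResult backed by duckdb: lets callers send raw SQL, skipping dbplyr
    res <- .data
  } else {
    stop("to_arrow() currently only supports duckdb-backed inputs", call. = FALSE)
  }
  arrow_dplyr_query(duckdb::duckdb_fetch_record_batch(res))
}
```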

@@ -59,6 +59,10 @@ ExecPlan <- R6Class("ExecPlan",
Scan = function(dataset) {
# Handle arrow_dplyr_query
if (inherits(dataset, "arrow_dplyr_query")) {
if(inherits(dataset$.data, "RecordBatchReader")) {
Member:

Suggested change
if(inherits(dataset$.data, "RecordBatchReader")) {
if (inherits(dataset$.data, "RecordBatchReader")) {

r/R/query-engine.R (outdated; resolved)
MakeBackgroundGenerator(std::move(batch_it), io_executor, max_q, q_restart));

return std::function<Future<util::optional<ExecBatch>>()>([batch_gen] {
// TODO(ARROW-14070) Awful workaround for MSVC 19.0 (Visual Studio 2015) bug.
Member:

@bkietz is this still needed? We dropped Visual Studio 2015 didn't we?

@bkietz (Member) commented Oct 13, 2021:

We have removed vs2015, this can be simplified.

-  ARROW_ASSIGN_OR_RAISE(
-      auto batch_gen,
-      MakeBackgroundGenerator(std::move(batch_it), io_executor, max_q, q_restart));
-
-  return std::function<Future<util::optional<ExecBatch>>()>([batch_gen] {
-    // TODO(ARROW-14070) Awful workaround for MSVC 19.0 (Visual Studio 2015) bug.
-    //...
+  return MakeBackgroundGenerator(std::move(batch_it), io_executor, max_q, q_restart);

Member:

Would you mind making a suggestion that we can apply?

reader_adq <- arrow_dplyr_query(circle)

tab_from_c_new <- reader_adq %>%
dplyr::collect()
Member:

Can you do more dplyr stuff here? filter/mutate/summarize something? Just to confirm that we can do more than just collect the reader into a table.
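For example (a sketch; the column names int and dbl are assumed from the existing test fixture):

```r
tab_more <- reader_adq %>%
  dplyr::filter(int > 2) %>%
  dplyr::mutate(int_doubled = int * 2) %>%
  dplyr::summarize(total = sum(int_doubled, na.rm = TRUE)) %>%
  dplyr::collect()
```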

dplyr::collect()
expect_equal(
tab_from_c_new %>%
arrange(dbl),
Member:

You should be able to arrange before collect, that is supported.

Member Author:

I'm getting a segfault on that right now (though maybe I need a rebase to get the work that you've done to make that possible?). I'll add it as a TODO after we merge.

Member:

Yeah try rebasing

@jonkeane (Member Author) commented Oct 14, 2021:

Hmmm this actually still seems to persist, even after a rebase: https://github.com/apache/arrow/pull/11032/checks?check_run_id=3895824638#step:9:11237

Member:

If you get a segfault, please make a follow-up JIRA

# And we can continue the pipeline
ds_rt <- ds %>%
to_duckdb() %>%
# factors don't roundtrip
Member:

Jira? duckdb issue number?

Member Author:

I've added duckdb/duckdb#1879 which talks about implementing them in duckdb

@jonkeane (Member Author):
@github-actions autotune

@jonkeane jonkeane force-pushed the ARROW-13465-to_arrow-from-duckdb branch from b30c344 to 4ce5e50 Compare October 12, 2021 22:27
r/R/duckdb.R (outdated; resolved)
@@ -69,7 +69,22 @@ class SinkNode : public ExecNode {
util::BackpressureOptions backpressure) {
PushGenerator<util::optional<ExecBatch>> push_gen(std::move(backpressure));
auto out = push_gen.producer();
*out_gen = std::move(push_gen);
*out_gen = [push_gen] {
// TODO(ARROW-14070) Awful workaround for MSVC 19.0 (Visual Studio 2015) bug.
Member:

@bkietz same here, could you suggest a fix?

Member:

Just undo this change? It seems that would be enough as a fix, unless I'm missing something.

constexpr int kDefaultBackgroundQRestart = 16;

ARROW_EXPORT
Result<std::function<Future<util::optional<ExecBatch>>()>> MakeReaderGenerator(
Member:

Can you add a docstring here?

Member Author:

I've added some here, though I'm sure they could be polished by those who know the style and execplan setup better.

@jonkeane jonkeane force-pushed the ARROW-13465-to_arrow-from-duckdb branch from 20150a3 to 6b3368a Compare October 14, 2021 13:26
@jonkeane jonkeane force-pushed the ARROW-13465-to_arrow-from-duckdb branch from 870f19f to 38624c3 Compare October 14, 2021 22:45
jonkeane and others added 2 commits October 14, 2021 18:18
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
@jonkeane jonkeane closed this in b868090 Oct 15, 2021
@ursabot

ursabot commented Oct 15, 2021

Benchmark runs are scheduled for baseline = 0059d61 and contender = b868090. b868090 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Scheduled] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.51% ⬆️0.51%] ursa-i9-9960x
[Finished ⬇️0.22% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

@ElenaHenderson
Contributor

ElenaHenderson commented Oct 22, 2021

Benchmark runs are scheduled for baseline = 0059d61 and contender = b868090. b868090 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.51% ⬆️0.51%] ursa-i9-9960x
[Finished ⬇️0.22% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True
