
ARROW-13465: [R] to_arrow() from duckdb #11032

Closed

Conversation

@jonkeane (Member) commented Aug 30, 2021

This isn't yet fully functional, but I've unmarked [WIP] to get fuller CI on it.

This now works, something like:

ds <- InMemoryDataset$create(mtcars)

ds %>%
  filter(mpg < 30) %>%
  to_duckdb() %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  to_arrow() %>%
  collect()

r/R/duckdb.R (outdated; resolved)
@jonkeane jonkeane force-pushed the ARROW-13465-to_arrow-from-duckdb branch from 383cbb4 to 398bc03 Compare September 21, 2021 15:34
@jonkeane jonkeane force-pushed the ARROW-13465-to_arrow-from-duckdb branch 2 times, most recently from 87c3a96 to 5b9d598 Compare September 23, 2021 16:27
@jonkeane jonkeane changed the title ARROW-13465: [R] to_arrow() from duckdb [WIP] ARROW-13465: [R] to_arrow() from duckdb Sep 23, 2021
@github-actions:

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@jonkeane jonkeane force-pushed the ARROW-13465-to_arrow-from-duckdb branch from 931bc0c to 5192d72 Compare October 4, 2021 17:10
@@ -840,6 +841,7 @@ TEST(ExecPlanExecution, ScalarSourceScalarAggSink) {
}))));
}

<<<<<<< HEAD
Member:

Looks like you have a merge issue here

Member Author:

I'm working on this again this morning — I think I'm close to constructing the proper arrow_dplyr_query object

Member Author:

Weird, this wasn't showing up in vscode, but was definitely there — should be fixed now regardless.

r/R/duckdb.R Outdated
# * get the record batch reader from duckdb
# * produce the SourceNode
# * build an ExecPlan with that in place of the ScanNode you would have gotten from ExecNode_Scan
plan <- ExecPlan$create()
Member:

I don't think this will work because you're making this node with an ExecPlan here, but I think you'll be creating a different ExecPlan in collect().

Why not return the RecordBatchReader from duckdb::duckdb_fetch_record_batch(res)? Then do the wrapping of that inside plan$Build (or even plan$Scan)

Member Author:

Oh that's an interesting idea. Let me explore that. AFAIU we can't simply return the RecordBatchReader, since we want to be able to continue building an arrow_dplyr_query here (unless we make all the dplyr methods available for RecordBatchReaders too)

Member:

Right. So to restate: allow an arrow_dplyr_query to hold a RecordBatchReader in .data (similar to how I extended it recently to allow holding an arrow_dplyr_query itself, in collapse). So you'd return an arrow_dplyr_query here, containing a RecordBatchReader.
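A minimal sketch of what that could look like (the duckdb and dbplyr calls here are real APIs, but the overall shape is a guess at the design under discussion, not the final implementation):

```r
to_arrow <- function(.data) {
  # Compile the dbplyr pipeline to SQL and run it on the duckdb connection,
  # asking for an Arrow result instead of a data.frame
  res <- DBI::dbSendQuery(
    dbplyr::remote_con(.data),
    dbplyr::remote_query(.data),
    arrow = TRUE
  )

  # Wrap the resulting RecordBatchReader in an arrow_dplyr_query so that
  # the usual dplyr verbs can keep building on top of it
  arrow_dplyr_query(duckdb::duckdb_fetch_record_batch(res))
}
```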

@@ -59,6 +59,10 @@ ExecPlan <- R6Class("ExecPlan",
Scan = function(dataset) {
# Handle arrow_dplyr_query
if (inherits(dataset, "arrow_dplyr_query")) {
if(inherits(dataset$.data, "RecordBatchReader")) {
return(ExecNode_ReadFromRecordBatchReader(self, dataset$.data))
}
Member Author:

This doesn't (yet) work, I'm getting a segfault

Warning: stack imbalance in '[[', 215 then 216
Warning: stack imbalance in 'is.null', 213 then 220
Warning: stack imbalance in '!', 212 then 219
Warning: stack imbalance in '&&', 210 then 217
Error: C stack usage  17587191479824 is too close to the limit
Warning: stack imbalance in '&&', 208 then 215
Warning: stack imbalance in '(', 207 then 214
Warning: stack imbalance in '||', 205 then 212
Warning: stack imbalance in 'if', 203 then 210
Warning: stack imbalance in '{', 199 then 218
Warning: stack imbalance in '{', 192 then 211
Warning: stack imbalance in 'is.null', 187 then 206
Warning: stack imbalance in '!', 186 then 205
Warning: stack imbalance in '&&', 184 then 203
Warning: stack imbalance in '&&', 182 then 201
Warning: stack imbalance in '(', 181 then 200
Warning: stack imbalance in '||', 179 then 198
Warning: stack imbalance in 'if', 177 then 196
Warning: stack imbalance in '{', 173 then 192
Warning: stack imbalance in '<-', 168 then 187
Warning: stack imbalance in '{', 164 then 183
Warning: stack imbalance in 'if', 162 then 181

 *** caught segfault ***
address 0x0, cause 'memory not mapped'
Warning: stack imbalance in '>', 183 then 187

I'm digging into this now

Member:

This is a duckdb issue right?

Member Author:

Yes, we believe so. I don't yet have a reprex, but I'm working on one.

r/R/duckdb.R Outdated
# * build an ExecPlan with that in place of the ScanNode you would have gotten from ExecNode_Scan
# source_node <- ExecNode_ReadFromRecordBatchReader(plan, RBR)
.data <- duckdb::duckdb_fetch_record_batch(res)
structure(
Member:

You shouldn't have to copy this from arrow_dplyr_query() should you?

Member Author:

Right, I did out of expedience, but will come back and clean this up to do the right thing later.

Member:

You might need to collapse() this further, judging from how you're handling this case in plan$Scan: you don't want any filter/projection at this level of nesting

@jonkeane (Member Author) commented Oct 5, 2021

Ok, this is getting closer. I've added a test using our own RecordBatchReader export and then wrapping that with arrow_dplyr_query, which works 🎉

With the DuckDB reader, I'm getting a number of segfaults. The messages and the reported fault location vary; each of the following is an error I've seen so far (on separate runs):

...Error: no more error handlers available (recursive errors?); invoking 'abort' restart
Error: SET_VECTOR_ELT() can only be applied to a 'list', not a 'double'
Warning: stack imbalance in '{', 70 then 64
...Warning: stack imbalance in '$', 224 then 215
Warning: stack imbalance in 'is.null', 223 then 213

 *** caught bus error ***
address 0x7fbf1b470b00, cause 'non-existent physical address'
An irrecoverable exception occurred. R is aborting now ...

 *** caught segfault ***
address 0x0, cause 'memory not mapped'
...Warning: stack imbalance in '&&', 201 then 202
Warning: stack imbalance in '(', 200 then 206
Error: C stack usage  17587432755568 is too close to the limit
Warning: stack imbalance in '||', 198 then 204
Warning: stack imbalance in 'if', 196 then 202
Warning: stack imbalance in '!', 205 then 63
Warning: stack imbalance in '[', 202 then 62
Warning: stack imbalance in '{', 192 then 63
Warning: stack imbalance in '{', 185 then 61
Warning: stack imbalance in 'is.null', 180 then 59
Warning: stack imbalance in '!', 179 then 58
Warning: stack imbalance in '&&', 177 then 59
Warning: stack imbalance in '&&', 175 then 57
Warning: stack imbalance in '(', 174 then 56
Warning: stack imbalance in 'is.null', 63 then 65
Warning: stack imbalance in '!', 62 then 68
Warning: stack imbalance in '&&', 60 then 66
Warning: stack imbalance in '&&', 58 then 67
Warning: stack imbalance in '(', 57 then 66
Warning: stack imbalance in '||', 172 then 66
Warning: stack imbalance in 'if', 170 then 64
Warning: stack imbalance in '<-', 66 then 67
Warning: stack imbalance in '{', 166 then 63
Warning: stack imbalance in '<-', 161 then 58
Warning: stack imbalance in '{', 157 then 54
Warning: stack imbalance in 'if', 155 then 53
Warning: stack imbalance in 'is.null', 81 then 82
Warning: stack imbalance in 'if', 78 then 81

 *** caught segfault ***
address 0x1fb3eed17820, cause 'memory not mapped'
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

 *** caught segfault ***
address 0x0, cause 'memory not mapped'

@jonkeane jonkeane force-pushed the ARROW-13465-to_arrow-from-duckdb branch 2 times, most recently from c78e54d to aaaa9f9 Compare October 5, 2021 18:49
r/R/duckdb.R Outdated
#' summarize(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
#' to_arrow() %>%
#' collect()
to_arrow <- function(.data, as_table = TRUE) {
Member Author:

Ultimately, I don't believe we want to have as_table = TRUE, but I've added it for now to show that arrow_dplyr_query(duckdb::duckdb_fetch_record_batch(res)$read_table()) (the default, as_table = TRUE) works just fine, but when we use arrow_dplyr_query(duckdb::duckdb_fetch_record_batch(res)) directly, we get a segfault.

We might need (or want) to release with the $read_table() version this time, and then upgrade to the version that doesn't convert to a table in 7.0.0 (after duckdb resolves the issue + releases, presuming the issue is on their side)
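To make the two paths concrete, the gist of the as_table switch is something like the following (a sketch, assuming res is the duckdb result; as_table is the temporary escape hatch, not the intended final API):

```r
reader <- duckdb::duckdb_fetch_record_batch(res)
if (as_table) {
  # Safe fallback: materialize the whole result into an Arrow Table up front
  out <- arrow_dplyr_query(reader$read_table())
} else {
  # Streaming path: hand the reader itself to the query; currently segfaults
  out <- arrow_dplyr_query(reader)
}
```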

Member:

Were you going to remove this argument before merging?

Member Author:

Yes, sorry I removed the code that used it, but not the option itself.

r/R/duckdb.R Outdated
#' to_arrow() %>%
#' collect()
to_arrow <- function(.data, as_table = TRUE) {
# TODO: figure out WTAF .data is before just doing stuff
Member Author:

I will also do this, since at the very least we should gate this to tbls that are backed by duckdb connections.

There might be a few other options for things we could accept here too, in principle we could accept a DBIResult (again, backed by duckdb) which would make it easier to send SQL and avoid dbplyr totally.
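That gating could look roughly like this (a sketch; the class names being checked are assumptions about what duckdb/dbplyr attach and would need verifying):

```r
to_arrow <- function(.data) {
  if (inherits(.data, "tbl_duckdb_connection")) {
    # dbplyr pipeline backed by a duckdb connection: compile to SQL first
    res <- DBI::dbSendQuery(
      dbplyr::remote_con(.data), dbplyr::remote_query(.data), arrow = TRUE
    )
  } else if (inherits(.data, "duckdb_result")) {
    # a DBIResult backed by duckdb: lets callers send raw SQL, skipping dbplyr
    res <- .data
  } else {
    stop("to_arrow() currently only supports duckdb-backed inputs", call. = FALSE)
  }
  arrow_dplyr_query(duckdb::duckdb_fetch_record_batch(res))
}
```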

@@ -59,6 +59,10 @@ ExecPlan <- R6Class("ExecPlan",
Scan = function(dataset) {
# Handle arrow_dplyr_query
if (inherits(dataset, "arrow_dplyr_query")) {
if(inherits(dataset$.data, "RecordBatchReader")) {
Member:

Suggested change
if(inherits(dataset$.data, "RecordBatchReader")) {
if (inherits(dataset$.data, "RecordBatchReader")) {

r/R/query-engine.R (outdated; resolved)
MakeBackgroundGenerator(std::move(batch_it), io_executor, max_q, q_restart));

return std::function<Future<util::optional<ExecBatch>>()>([batch_gen] {
// TODO(ARROW-14070) Awful workaround for MSVC 19.0 (Visual Studio 2015) bug.
Member:

@bkietz is this still needed? We dropped Visual Studio 2015 didn't we?

@bkietz (Member) commented Oct 13, 2021:

We have removed vs2015, this can be simplified.

-  ARROW_ASSIGN_OR_RAISE(
-      auto batch_gen,
-      MakeBackgroundGenerator(std::move(batch_it), io_executor, max_q, q_restart));
-
-  return std::function<Future<util::optional<ExecBatch>>()>([batch_gen] {
-    // TODO(ARROW-14070) Awful workaround for MSVC 19.0 (Visual Studio 2015) bug.
-    //...
+  return MakeBackgroundGenerator(std::move(batch_it), io_executor, max_q, q_restart);

Member:

Would you mind making a suggestion that we can apply?

reader_adq <- arrow_dplyr_query(circle)

tab_from_c_new <- reader_adq %>%
dplyr::collect()
Member:

Can you do more dplyr stuff here? filter/mutate/summarize something? Just to confirm that we can do more than just collect the reader into a table.
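For example (a sketch; the column names int and dbl are assumed from the existing test fixture):

```r
tab_more <- reader_adq %>%
  dplyr::filter(int > 2) %>%
  dplyr::mutate(int_doubled = int * 2) %>%
  dplyr::summarize(total = sum(int_doubled, na.rm = TRUE)) %>%
  dplyr::collect()
```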

dplyr::collect()
expect_equal(
tab_from_c_new %>%
arrange(dbl),
Member:

You should be able to arrange before collect, that is supported.

Member Author:

I'm getting a segfault on that right now (though maybe I need a rebase to get the work that you've done to make that possible?). I'll add it as a TODO after we merge.

Member:

Yeah try rebasing

@jonkeane (Member Author) commented Oct 14, 2021:

Hmmm this actually still seems to persist, even after a rebase: https://github.com/apache/arrow/pull/11032/checks?check_run_id=3895824638#step:9:11237

Member:

If you get a segfault, please make a follow-up JIRA

# And we can continue the pipeline
ds_rt <- ds %>%
to_duckdb() %>%
# factors don't roundtrip
Member:

Jira? duckdb issue number?

Member Author:

I've added duckdb/duckdb#1879 which talks about implementing them in duckdb

@jonkeane (Member Author):
@github-actions autotune

@jonkeane jonkeane force-pushed the ARROW-13465-to_arrow-from-duckdb branch from b30c344 to 4ce5e50 Compare October 12, 2021 22:27
r/R/duckdb.R (outdated; resolved)
@@ -69,7 +69,22 @@ class SinkNode : public ExecNode {
util::BackpressureOptions backpressure) {
PushGenerator<util::optional<ExecBatch>> push_gen(std::move(backpressure));
auto out = push_gen.producer();
*out_gen = std::move(push_gen);
*out_gen = [push_gen] {
// TODO(ARROW-14070) Awful workaround for MSVC 19.0 (Visual Studio 2015) bug.
Member:

@bkietz same here, could you suggest a fix?

Member:

Just undo this change? It seems that would be enough as a fix, unless I'm missing something.

constexpr int kDefaultBackgroundQRestart = 16;

ARROW_EXPORT
Result<std::function<Future<util::optional<ExecBatch>>()>> MakeReaderGenerator(
Member:

Can you add a docstring here?

Member Author:

I've added some here, though I'm sure they could be polished by those who know the style and execplan setup better.

@jonkeane jonkeane force-pushed the ARROW-13465-to_arrow-from-duckdb branch from 20150a3 to 6b3368a Compare October 14, 2021 13:26
@jonkeane jonkeane force-pushed the ARROW-13465-to_arrow-from-duckdb branch from 870f19f to 38624c3 Compare October 14, 2021 22:45
jonkeane and others added 2 commits October 14, 2021 18:18
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
@jonkeane jonkeane closed this in b868090 Oct 15, 2021
@ursabot

ursabot commented Oct 15, 2021

Benchmark runs are scheduled for baseline = 0059d61 and contender = b868090. b868090 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Scheduled] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.51% ⬆️0.51%] ursa-i9-9960x
[Finished ⬇️0.22% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

@ElenaHenderson
Contributor

ElenaHenderson commented Oct 22, 2021

Benchmark runs are scheduled for baseline = 0059d61 and contender = b868090. b868090 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.51% ⬆️0.51%] ursa-i9-9960x
[Finished ⬇️0.22% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True
