ARROW-15016: [R] `show_exec_plan` for an `arrow_dplyr_query` #13541

dragosmg · 2022-07-07T17:02:12Z

This PR adds show_exec_plan(), which will allow users to inspect the ExecPlan in a similar way to dplyr's show_query().

mtcars %>% 
  arrow_table() %>% 
  filter(mpg > 20) %>% 
  mutate(x = gear/carb) %>%
  group_by(cyl) %>% 
  show_exec_plan()
#> ExecPlan with 3 nodes:
#> 2:ProjectNode{projection=[mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, "x": divide(cast(gear, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(carb, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}))]}
#>   1:FilterNode{filter=(mpg > 20)}
#>     0:TableSourceNode{}

Some design considerations are discussed in the design doc.

Summary of the approach:

I opted for the show_exec_plan() name as I believe it aligns well with the purpose of this PR: to expose the ExecPlan (in its raw state).
We could later extend or beautify the output (e.g. use {cli}) and / or add show_query() / explain() methods.

github-actions · 2022-07-07T17:02:33Z

https://issues.apache.org/jira/browse/ARROW-15016

r/src/arrowExports.cpp

jonkeane

I know this is still a draft, but I was curious and saw a few things that looked comment-able and might help drive this forward.

I appreciate how nice and targeted these changes are so far!

r/R/dplyr.R

r/tests/testthat/test-dplyr-query.R


wjones127

I like this function! I made a few optional suggestions.

One potential future enhancement: Neal is adding a query_can_stream() function in #13563, which might be helpful to know about when looking at the query exec plan. Showing it next to the plan would help people understand which kinds of plans can and can't stream.

r/R/dplyr.R

dragosmg · 2022-07-12T15:43:44Z

@wjones127 Thanks for the review. I definitely think we could expand this sort of functionality. Maybe we could keep show_exec_plan() bare-bones and add an explain.arrow_dplyr_query() method with more bells and whistles (both in terms of information and formatting). What do you think?

wjones127 · 2022-07-12T16:57:49Z

> Maybe we could keep show_exec_plan() bare-bones and add an explain.arrow_dplyr_query() method with more bells and whistles (both in terms of information and formatting). What do you think?

Yeah that makes sense to me 👍

dragosmg · 2022-07-21T19:02:18Z

In order to avoid repetition in the unit tests, I went for a combined approach:

- Used expect_output() to check for specific pieces of information that we want / expect to be present for show_exec_plan().
- Used expect_snapshot() to test show_query() and explain(), which are just wrappers around show_exec_plan().

paleolimbot

Once the CI turns green, I'm good with this! Technically the ExecPlan_prepare() changes in compute-exec.cpp are no longer needed for this PR but I am also not worried about them (they're needed for the user-defined functions PR as well).

nealrichardson

Unless I'm missing something, this looks like a major rewrite of the PR that causes show_query() to actually run the query, which is not desirable.

nealrichardson · 2022-07-21T14:01:22Z

r/src/compute-exec.cpp

+                                  cpp11::list sort_options, cpp11::strings metadata,
+                                  int64_t head = -1) {
+  auto prepared_plan = ExecPlan_prepare(plan, final_node, sort_options, metadata, head);
+  arrow::StopIfNotOk(prepared_plan.first->StartProducing());


IIUC this starts evaluating the ExecPlan, which we don't want to do.

Suggested change

arrow::StopIfNotOk(prepared_plan.first->StartProducing());

@nealrichardson I think StartProducing() is central to this (let's call it the BuildAndShow) approach. I couldn't get it to work without it. Extracting the duplicated code in ExecPlan_prepare doesn't work without starting the plan.

My f40e5d2 commit, submitted 3 days ago, was AFAICT exactly what you suggest, and it didn't work.

So was Dewey's fbb4c1e (submitted 2 days ago).

I will revert to this state minus the line you suggested we delete and see where that takes us.

r/tests/testthat/test-dplyr-query.R

nealrichardson · 2022-07-21T14:08:24Z

r/R/dplyr.R

+#'   filter(mpg > 20) %>%
+#'   mutate(x = gear/carb) %>%
+#'   show_exec_plan()
+show_exec_plan <- function(x) {


My recommendation was the latter, but I don't object to also having a standalone show_exec_plan()

r/R/query-engine.R

r/R/dplyr.R

nealrichardson · 2022-07-21T20:42:04Z

r/R/query-engine.R

@@ -191,7 +192,7 @@ ExecPlan <- R6Class("ExecPlan",
      }
      node
    },
-    Run = function(node) {
+    Run = function(node, explain = FALSE) {


IIUC this is a really bad idea: you're evaluating the whole query just to print it.

Do we know how much data gets pulled from an exec plan that is created and immediately deleted?

As I understand it, Acero is a push model, not pull. So when you call start producing, it fires up, doesn't matter that you haven't pulled the first batch, it's pushing batches to the reader. I also wouldn't assume that just because the RBR object goes out of scope that that sends any signal to stop producing. Assumptions aside, we should be able to observe this if we test with real data.

But I would think that printing the query shouldn't trigger any evaluation on the data--anything >0 data being read is probably too much.

Creating an exec plan: None
Calling StartProducing: All of it

You can call StopProducing after you call StartProducing but that isn't great right now (we still fully consume whatever files we happen to be reading at the moment) and will hopefully get better.
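The push-versus-pull point made in this thread can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyPushPlan`, its members, and the integer "batches" are hypothetical names invented for illustration and are not Acero's API. It assumes, as described above, that rendering a plan needs no data, while starting it pushes every batch to the sink regardless of whether the consumer asked for one.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model of a push-based plan; these are not Acero's classes.
class ToyPushPlan {
 public:
  ToyPushPlan(std::vector<std::string> node_labels, std::vector<int> batches)
      : node_labels_(std::move(node_labels)), batches_(std::move(batches)) {}

  // Rendering walks the node labels only; no batch is touched.
  std::string ToString() const {
    std::string out =
        "ToyPushPlan with " + std::to_string(node_labels_.size()) + " nodes:\n";
    for (const std::string& label : node_labels_) {
      out += label + "\n";
    }
    return out;
  }

  // Push model: starting the plan drives every batch into the sink,
  // whether or not the consumer ever asked for one.
  void StartProducing(const std::function<void(int)>& sink) {
    for (int batch : batches_) {
      ++batches_read_;
      sink(batch);
    }
  }

  std::size_t batches_read() const { return batches_read_; }

 private:
  std::vector<std::string> node_labels_;
  std::vector<int> batches_;
  std::size_t batches_read_ = 0;
};

// Usage: printing reads nothing, starting reads everything.
inline void DemoToyPushPlan() {
  ToyPushPlan plan({"ProjectNode", "FilterNode", "TableSourceNode"}, {1, 2, 3});
  const std::string text = plan.ToString();
  assert(!text.empty());
  assert(plan.batches_read() == 0u);  // nothing read just to print
  plan.StartProducing([](int) { /* consumer ignores the data */ });
  assert(plan.batches_read() == 3u);  // all batches were pushed anyway
}
```

In this toy, a "show" function that only calls `ToString()` satisfies the "anything >0 data being read is probably too much" requirement, whereas one that goes through `StartProducing()` does not.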

nealrichardson · 2022-07-22T01:24:31Z

@dragosmg check out #13397 (comment). I think this explains why "the filter node is gone" on the head queries--it gets evaluated 🙀 inside of Build(). For now we should probably not try to show the query if (is_collapsed(.data) && has_head_tail(.data$.data)) because that will cause the inner query to evaluate. We can back that out after ARROW-16628, I believe (should add a comment pointing to that issue too).

dragosmg · 2022-07-22T08:25:07Z

This ☝🏻 commit (5751543) includes the call to start the plan. Most tests pass (except the large memory test).

Commit e29a835 removes the call to _StartProducing => tests fail with:

/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:209: Check failed: started_ stopped an ExecPlan which never started

This makes me think that, if we want to avoid starting the plan, we need to go back to a state where the "preparation" is done with duplicated code in both ExecPlan_run and ExecPlan_ToString/ExecPlan_BuildAndShow.
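The check failure quoted above reads like a lifecycle invariant: a plan may only be stopped after it has been started. A minimal toy sketch of that invariant follows. `ToyLifecyclePlan` and `StopWithoutStartFails` are hypothetical names, and the toy throws an exception where the real code aborts on a failed check, so that the behaviour can be observed without crashing the process.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical toy with a started/stopped lifecycle; not Arrow's ExecPlan.
class ToyLifecyclePlan {
 public:
  void StartProducing() { started_ = true; }

  // Stopping a plan that never started violates the invariant.
  void StopProducing() {
    if (!started_) {
      throw std::logic_error("stopped a plan which never started");
    }
    stopped_ = true;
  }

  bool started() const { return started_; }
  bool stopped() const { return stopped_; }

 private:
  bool started_ = false;
  bool stopped_ = false;
};

// Returns true when stopping an unstarted plan is rejected.
inline bool StopWithoutStartFails() {
  ToyLifecyclePlan plan;
  try {
    plan.StopProducing();
  } catch (const std::logic_error&) {
    return true;
  }
  return false;
}

// Usage: the normal start-then-stop order is accepted.
inline void DemoToyLifecyclePlan() {
  ToyLifecyclePlan plan;
  plan.StartProducing();
  plan.StopProducing();
  assert(plan.stopped());
  assert(StopWithoutStartFails());
}
```

Under this reading, a shared "prepare" helper that hands back a plan whose cleanup path calls stop cannot simply drop the start call, which is consistent with the failure reported for e29a835.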


dragosmg · 2022-07-22T11:05:13Z

What I have done so far today:

- Reverted to an earlier version (without the ExecPlan_prepare C++ helper function).
- Removed unit tests for show_query() and explain().
- Improved the documentation for show_exec_plan() to reference dplyr::show_query() and dplyr::explain().
- Added comments indicating duplicate code that should be kept in sync, in:
  - ExecPlan_run and ExecPlan_BuildAndShow
  - $Run() and $BuildAndShow()
- Created ARROW-17184 to investigate why some nodes are missing from the printout.

dragosmg · 2022-07-22T11:11:36Z

> @dragosmg check out #13397 (comment). I think this explains why "the filter node is gone" on the head queries--it gets evaluated 🙀 inside of Build(). For now we should probably not try to show the query if (is_collapsed(.data) && has_head_tail(.data$.data)) because that will cause the inner query to evaluate. We can back that out after ARROW-16628, I believe (should add a comment pointing to that issue too).

Not sure adding a comment referencing ARROW-16628 is needed since I am no longer touching the R6 $Build() method. I added a comment on ARROW-17184.

nealrichardson · 2022-07-22T12:59:14Z

> > @dragosmg check out #13397 (comment). I think this explains why "the filter node is gone" on the head queries--it gets evaluated 🙀 inside of Build(). For now we should probably not try to show the query if (is_collapsed(.data) && has_head_tail(.data$.data)) because that will cause the inner query to evaluate. We can back that out after ARROW-16628, I believe (should add a comment pointing to that issue too).
>
> Not sure adding a comment referencing ARROW-16628 is needed since I am no longer touching the R6 $Build() method. I added a comment on ARROW-17184.

But you're still calling $Build, and in this case, the inner query will be evaluated inside of that, so we should check before calling $Build.
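The guard discussed here, skipping the printout when a collapsed query wraps an inner query with head/tail, amounts to a check that runs before any build step. The sketch below is a toy in C++, not the R code from the PR: `ToyQuery`, `IsCollapsed`, `HasHeadTail` and `CanShowPlanSafely` are hypothetical names that only mirror the condition `is_collapsed(.data) && has_head_tail(.data$.data)`.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a query object; `inner` is set when the query
// wraps another query (the "collapsed" case in the discussion).
struct ToyQuery {
  bool has_head_tail = false;
  std::shared_ptr<ToyQuery> inner;
};

inline bool IsCollapsed(const ToyQuery& query) { return query.inner != nullptr; }

inline bool HasHeadTail(const ToyQuery& query) { return query.has_head_tail; }

// Decide before building: a collapsed query whose inner query has head/tail
// would be evaluated by the build step, so it is not safe to show.
inline bool CanShowPlanSafely(const ToyQuery& query) {
  return !(IsCollapsed(query) && HasHeadTail(*query.inner));
}

// Usage: only the nested head/tail case is refused.
inline void DemoCanShowPlanSafely() {
  ToyQuery flat;
  assert(CanShowPlanSafely(flat));

  ToyQuery nested;
  nested.inner = std::make_shared<ToyQuery>();
  nested.inner->has_head_tail = true;
  assert(!CanShowPlanSafely(nested));
}
```

The short-circuit in `CanShowPlanSafely` matters: `query.inner` is only dereferenced once `IsCollapsed` has confirmed it is non-null.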

dragosmg · 2022-07-22T13:56:52Z

I think this is ready for another review @nealrichardson & @paleolimbot. 😄

paleolimbot

I don't like the duplication, but as long as it's clearly marked and there is a follow-up JIRA to remove it, I'm game. The fact that we have to copy so much code just to print out what's happening suggests to me that we need to improve our abstraction, and the solution you've provided here is exactly what we need to do that.

Thank you for sticking with this! I will keep checking https://github.com/dragosmg/arrow/branches for green CI since that's likely to come well before the Arrow CI runs.

dragosmg · 2022-07-22T15:27:25Z

> Thank you for sticking with this! I will keep checking https://github.com/dragosmg/arrow/branches for green CI since that's likely to come well before the Arrow CI runs.

Most checks are green - https://github.com/dragosmg/arrow/runs/7469491011?check_suite_focus=true - with the exception of the first one which has been failing for a few days now (OOM in the large memory tests).

Follow-up Jira: ARROW-17185.

ursabot · 2022-07-24T01:10:58Z

Benchmark runs are scheduled for baseline = 9ad2255 and contender = 9442e1c. 9442e1c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.63% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.25% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] 9442e1ce ec2-t3-xlarge-us-east-2
[Failed] 9442e1ce test-mac-arm
[Failed] 9442e1ce ursa-i9-9960x
[Finished] 9442e1ce ursa-thinkcentre-m75q
[Failed] 9ad22551 ec2-t3-xlarge-us-east-2
[Failed] 9ad22551 test-mac-arm
[Failed] 9ad22551 ursa-i9-9960x
[Finished] 9ad22551 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

…Hub issue numbers (#34260) Rewrite the Jira issue numbers to the GitHub issue numbers, so that the GitHub issue numbers are automatically linked to the issues by pkgdown's auto-linking feature. Issue numbers have been rewritten based on the following correspondence. Also, the pkgdown settings have been changed and updated to link to GitHub. I generated the Changelog page using the `pkgdown::build_news()` function and verified that the links work correctly. --- ARROW-6338 #5198 ARROW-6364 #5201 ARROW-6323 #5169 ARROW-6278 #5141 ARROW-6360 #5329 ARROW-6533 #5450 ARROW-6348 #5223 ARROW-6337 #5399 ARROW-10850 #9128 ARROW-10624 #9092 ARROW-10386 #8549 ARROW-6994 #23308 ARROW-12774 #10320 ARROW-12670 #10287 ARROW-16828 #13484 ARROW-14989 #13482 ARROW-16977 #13514 ARROW-13404 #10999 ARROW-16887 #13601 ARROW-15906 #13206 ARROW-15280 #13171 ARROW-16144 #13183 ARROW-16511 #13105 ARROW-16085 #13088 ARROW-16715 #13555 ARROW-16268 #13550 ARROW-16700 #13518 ARROW-16807 #13583 ARROW-16871 #13517 ARROW-16415 #13190 ARROW-14821 #12154 ARROW-16439 #13174 ARROW-16394 #13118 ARROW-16516 #13163 ARROW-16395 #13627 ARROW-14848 #12589 ARROW-16407 #13196 ARROW-16653 #13506 ARROW-14575 #13160 ARROW-15271 #13170 ARROW-16703 #13650 ARROW-16444 #13397 ARROW-15016 #13541 ARROW-16776 #13563 ARROW-15622 #13090 ARROW-18131 #14484 ARROW-18305 #14581 ARROW-18285 #14615 * Closes: #33631 Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>


dragosmg changed the title ~~ARROW-15016 [R] show_query() for an arrow_dplyr_query~~ ARROW-15016: [R] show_query() for an arrow_dplyr_query Jul 7, 2022

github-actions bot added the Component: R label Jul 7, 2022

dragosmg changed the title ~~ARROW-15016: [R] show_query() for an arrow_dplyr_query~~ ARROW-15016: [R] show_query() for an arrow_dplyr_query Jul 7, 2022

dragosmg commented Jul 7, 2022


dragosmg added 3 commits July 11, 2022 09:21

add a _ToString method for the ExecPlan

403d9df

add ToString method for the ExecPlan R6 object

4a1f94e

define show_arrow_query() and add it to print.arrow_dplyr_query()

f03ced0

dragosmg force-pushed the show_query branch from faeea48 to f03ced0 Compare July 11, 2022 08:21

dragosmg added 5 commits July 11, 2022 13:23

rename show_arrow_query() to show_exec_plan() + unit tests

fdd20be

lint

350bea3

use expect_snapshot() instead of expect_output()

c889ec0

remove empty row

1260540

update test and snapshot

1f4801d

github-actions bot added the Component: Documentation label Jul 11, 2022

dragosmg added 4 commits July 11, 2022 15:22

document + export show_exec_plan()

254e14c

example + redocument

6294765

add show_exec_plan to r/_pkgdown.yml under *Computation*

cb5d5b7

bump ci

424f12a

jonkeane reviewed Jul 11, 2022


dragosmg added 4 commits July 12, 2022 10:19

run example only when {dplyr} is available

ccd8f76

remove snapshots and test with expect_output()

83e5492

* add dataset test * add tests for `summarise()`, `group_by()` and `join()`

lint

b9c0c30

bump ci

07de250

dragosmg marked this pull request as ready for review July 12, 2022 13:58

dragosmg changed the title ~~ARROW-15016: [R] show_query() for an arrow_dplyr_query~~ ARROW-15016: [R] show_exec_plan for an arrow_dplyr_query Jul 12, 2022

wjones127 approved these changes Jul 12, 2022


docs

0e3b894

lints

6376bb2

paleolimbot approved these changes Jul 21, 2022


nealrichardson requested changes Jul 21, 2022


nealrichardson mentioned this pull request Jul 21, 2022

ARROW-16444: [R] Implement user-defined scalar functions in R bindings #13397

Merged

dragosmg added 2 commits July 22, 2022 09:06

revert to da73019

5751543

don't start producing the plan

e29a835

dragosmg added 4 commits July 22, 2022 10:31

remove separate tests for show_query() and explain()

1309e0f

revert to fb2a5a0 and use the BuildAndShow name

5f890dc

docs + comments to indicate the duplicate code in:

c4912a4

* `ExecPlan_run` and `ExecPlan_BuildAndShow` C++ functions * `$Run()` and `$BuildAndShow()` R6 methods

removed ungroup() call

5aaa3cd

clang lint

ce8b17e

comment

0ffa30a

dragosmg added 3 commits July 22, 2022 14:35

warn and don't build & print the ExecPlan when we have a nested query

6a8c753

Merge branch 'master' into show_query

7fc20ca

comments in ExecPlan_prepare

35e9ef8

paleolimbot approved these changes Jul 22, 2022


paleolimbot merged commit 9442e1c into apache:master Jul 22, 2022

asfimport mentioned this pull request Jul 24, 2022

[R] show_exec_plan() for an arrow_dplyr_query #30534

Closed

6 tasks

eitsupi mentioned this pull request Feb 19, 2023

GH-33631: [R] Rewrite Jira ticket numbers in pkgdown documents to GitHub issue numbers #34260

Merged


dragosmg commented Jul 7, 2022 (edited)

github-actions bot commented Jul 7, 2022

jonkeane left a comment

wjones127 left a comment

dragosmg commented Jul 12, 2022

wjones127 commented Jul 12, 2022

dragosmg commented Jul 21, 2022 (edited)

paleolimbot left a comment

nealrichardson left a comment

nealrichardson Jul 21, 2022

dragosmg Jul 22, 2022

nealrichardson Jul 21, 2022

nealrichardson Jul 21, 2022

paleolimbot Jul 22, 2022

nealrichardson Jul 22, 2022

westonpace Jul 22, 2022

nealrichardson commented Jul 22, 2022

dragosmg commented Jul 22, 2022 (edited)

dragosmg commented Jul 22, 2022

dragosmg commented Jul 22, 2022 (edited)

nealrichardson commented Jul 22, 2022

dragosmg commented Jul 22, 2022

paleolimbot left a comment

dragosmg commented Jul 22, 2022

ursabot commented Jul 24, 2022
