New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-15016: [R] show_exec_plan
for an arrow_dplyr_query
#13541
Conversation
show_query()
for an arrow_dplyr_query
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is a still a draft, but I was curious + saw a few things that looked comment-able that might help drive this forward
I appreciate how nice and targeted these changes are so far!
* add dataset test * add tests for `summarise()`, `group_by()` and `join()`
show_query()
for an arrow_dplyr_query
show_exec_plan
for an arrow_dplyr_query
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this function! I made a few optional suggestions.
One potential future enhancement: Neal is adding this function query_can_stream()
in #13563, which might be helpful info to know when looking at the query exec plan. Showing it next to the plan will help people understand what kinds of plans can and can't stream.
@wjones127 Thanks for the review. I definitely think we could expand this sort of functionality. Maybe we could keep |
Yeah that makes sense to me 👍 |
In order to avoid repetition in the unit tests, I went for a combined approach:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once the CI turns green, I'm good with this! Technically the ExecPlan_prepare()
changes in compute-exec.cpp are no longer needed for this PR but I am also not worried about them (they're needed for the user-defined functions PR as well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I'm missing something, this looks like a major rewrite of the PR that causes show_query()
to actually run the query, which is not desirable.
r/src/compute-exec.cpp
Outdated
cpp11::list sort_options, cpp11::strings metadata, | ||
int64_t head = -1) { | ||
auto prepared_plan = ExecPlan_prepare(plan, final_node, sort_options, metadata, head); | ||
arrow::StopIfNotOk(prepared_plan.first->StartProducing()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this starts evaluating the ExecPlan, which we don't want to do.
arrow::StopIfNotOk(prepared_plan.first->StartProducing()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nealrichardson I think StartProducing()
is central to this (let's call it the BuildAndShow) approach. I couldn't get it to work without it. Extracting the duplicated code in ExecPlan_prepare
doesn't work without starting the plan.
My f40e5d2 commit, submitted 3 days ago - was AFAICT exactly what you suggest and it didn't work.
So was Dewey's fbb4c1e (submitted 2 days ago).
I will revert to this state minus the line you suggested we delete and see where that takes us.
#' filter(mpg > 20) %>% | ||
#' mutate(x = gear/carb) %>% | ||
#' show_exec_plan() | ||
show_exec_plan <- function(x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My recommendation was the latter, but I don't object to also having a standalone show_exec_plan()
r/R/query-engine.R
Outdated
@@ -191,7 +192,7 @@ ExecPlan <- R6Class("ExecPlan", | |||
} | |||
node | |||
}, | |||
Run = function(node) { | |||
Run = function(node, explain = FALSE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this is a really bad idea: you're evaluating the whole query just to print it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know how much data gets pulled from an exec plan that is created and immediately deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand it, Acero is a push model, not pull. So when you call start producing, it fires up, doesn't matter that you haven't pulled the first batch, it's pushing batches to the reader. I also wouldn't assume that just because the RBR object goes out of scope that that sends any signal to stop producing. Assumptions aside, we should be able to observe this if we test with real data.
But I would think that printing the query shouldn't trigger any evaluation on the data--anything >0 data being read is probably too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating an exec plan: None
Calling StartProducing: All of it
You can call StopProducing after you call StartProducing but that isn't great right now (we still fully consume whatever files we happen to be reading at the moment) and will hopefully get better.
@dragosmg check out #13397 (comment). I think this explains why "the filter node is gone" on the head queries--it gets evaluated 🙀 inside of Build(). For now we should probably not try to show the query |
This ☝🏻 commit (5751543 ) includes the call to start the plan. Most tests pass (except the large memory test). Commit e29a835 removes the call to
This makes me think, that, if we want to avoid starting the plan, we need to go back to a state where the "preparation" is done with duplicated code in both |
* `ExecPlan_run` and `ExecPlan_BuildAndShow` C++ functions * `$Run()` and `$BuildAndShow()` R6 methods
What have I done so far today:
|
Not sure adding a comment referencing ARROW-16628 is needed since I am no longer touching the R6 |
But you're still calling $Build, and in this case, the inner query will be evaluated inside of that, so we should check before calling $Build. |
I think this is ready for another review @nealrichardson & @paleolimbot. 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the duplication but as long as its clearly marked and there is a followup JIRA to remove it I'm game. The fact that we have to copy so much code just to print out what's happening suggests to me that we need to improve our abstraction and the solution you've provided here are exactly what we need to do that.
Thank you for sticking with this! I will keep checking https://github.com/dragosmg/arrow/branches for green CI since that's likely to come well before the Arrow CI runs.
Most checks are green - https://github.com/dragosmg/arrow/runs/7469491011?check_suite_focus=true - with the exception of the first one which has been failing for a few days now (OOM in the large memory tests). Follow-up Jira: ARROW-17185. |
Benchmark runs are scheduled for baseline = 9ad2255 and contender = 9442e1c. 9442e1c is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
…Hub issue numbers (#34260) Rewrite the Jira issue numbers to the GitHub issue numbers, so that the GitHub issue numbers are automatically linked to the issues by pkgdown's auto-linking feature. Issue numbers have been rewritten based on the following correspondence. Also, the pkgdown settings have been changed and updated to link to GitHub. I generated the Changelog page using the `pkgdown::build_news()` function and verified that the links work correctly. --- ARROW-6338 #5198 ARROW-6364 #5201 ARROW-6323 #5169 ARROW-6278 #5141 ARROW-6360 #5329 ARROW-6533 #5450 ARROW-6348 #5223 ARROW-6337 #5399 ARROW-10850 #9128 ARROW-10624 #9092 ARROW-10386 #8549 ARROW-6994 #23308 ARROW-12774 #10320 ARROW-12670 #10287 ARROW-16828 #13484 ARROW-14989 #13482 ARROW-16977 #13514 ARROW-13404 #10999 ARROW-16887 #13601 ARROW-15906 #13206 ARROW-15280 #13171 ARROW-16144 #13183 ARROW-16511 #13105 ARROW-16085 #13088 ARROW-16715 #13555 ARROW-16268 #13550 ARROW-16700 #13518 ARROW-16807 #13583 ARROW-16871 #13517 ARROW-16415 #13190 ARROW-14821 #12154 ARROW-16439 #13174 ARROW-16394 #13118 ARROW-16516 #13163 ARROW-16395 #13627 ARROW-14848 #12589 ARROW-16407 #13196 ARROW-16653 #13506 ARROW-14575 #13160 ARROW-15271 #13170 ARROW-16703 #13650 ARROW-16444 #13397 ARROW-15016 #13541 ARROW-16776 #13563 ARROW-15622 #13090 ARROW-18131 #14484 ARROW-18305 #14581 ARROW-18285 #14615 * Closes: #33631 Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…to GitHub issue numbers (apache#34260) Rewrite the Jira issue numbers to the GitHub issue numbers, so that the GitHub issue numbers are automatically linked to the issues by pkgdown's auto-linking feature. Issue numbers have been rewritten based on the following correspondence. Also, the pkgdown settings have been changed and updated to link to GitHub. I generated the Changelog page using the `pkgdown::build_news()` function and verified that the links work correctly. --- ARROW-6338 apache#5198 ARROW-6364 apache#5201 ARROW-6323 apache#5169 ARROW-6278 apache#5141 ARROW-6360 apache#5329 ARROW-6533 apache#5450 ARROW-6348 apache#5223 ARROW-6337 apache#5399 ARROW-10850 apache#9128 ARROW-10624 apache#9092 ARROW-10386 apache#8549 ARROW-6994 apache#23308 ARROW-12774 apache#10320 ARROW-12670 apache#10287 ARROW-16828 apache#13484 ARROW-14989 apache#13482 ARROW-16977 apache#13514 ARROW-13404 apache#10999 ARROW-16887 apache#13601 ARROW-15906 apache#13206 ARROW-15280 apache#13171 ARROW-16144 apache#13183 ARROW-16511 apache#13105 ARROW-16085 apache#13088 ARROW-16715 apache#13555 ARROW-16268 apache#13550 ARROW-16700 apache#13518 ARROW-16807 apache#13583 ARROW-16871 apache#13517 ARROW-16415 apache#13190 ARROW-14821 apache#12154 ARROW-16439 apache#13174 ARROW-16394 apache#13118 ARROW-16516 apache#13163 ARROW-16395 apache#13627 ARROW-14848 apache#12589 ARROW-16407 apache#13196 ARROW-16653 apache#13506 ARROW-14575 apache#13160 ARROW-15271 apache#13170 ARROW-16703 apache#13650 ARROW-16444 apache#13397 ARROW-15016 apache#13541 ARROW-16776 apache#13563 ARROW-15622 apache#13090 ARROW-18131 apache#14484 ARROW-18305 apache#14581 ARROW-18285 apache#14615 * Closes: apache#33631 Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
This PR adds
show_exec_plan()
will allow users to inspect the ExecPlan, in a similar way to dplyr'sshow_query()
.Some design considerations are discussed in the design doc.
Summary of the approach:
show_exec_plan()
name as I believe it aligns well with the purpose of this PR: to expose the ExecPlan (in its raw state).show_query()
/explain()
methods).