Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-15016: [R]
show_exec_plan
for anarrow_dplyr_query
#13541ARROW-15016: [R]
show_exec_plan
for anarrow_dplyr_query
#13541Changes from 12 commits
403d9df
4a1f94e
f03ced0
fdd20be
350bea3
c889ec0
1260540
1f4801d
254e14c
6294765
cb5d5b7
424f12a
ccd8f76
83e5492
b9c0c30
07de250
0e3b894
3aab4e0
3355135
b9b4d49
4779f08
5730cec
23f5f22
2eb3cdb
2b33495
4ff3065
d686f9d
d452a40
f4a1ba4
ac7d756
761b66d
a36f826
d887d42
effc963
4571df4
d1e4ce7
e2220fe
3c96fb7
068dc6d
fb2a5a0
c14a497
aac9edc
3faecc6
1afcce9
1de3447
f40e5d2
8aae3c3
fae74ae
ebd4a02
311b70b
d0bc159
d7834b0
9a02651
ac37a9d
238e406
9a34255
1db36b6
2153451
807aa88
6141c45
9af1403
1f88ff7
fbb4c1e
da73019
ee90354
66dcd22
0482c6c
ad5024f
20799f9
21bc8c3
6376bb2
5751543
e29a835
1309e0f
5f890dc
c4912a4
5aaa3cd
ce8b17e
0ffa30a
6a8c753
7fc20ca
35e9ef8
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't follow the discussion: why not use show_query() or explain() here? I saw something about wanting to massage the output to make explain() prettier, but why not use this for explain() today since that's something people know, and it's about logical plans?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a link to the design doc, also linked in the PR description and the Jira ticket.
In short, in my opinion, a first step would be to have
show_exec_plan()
as a function that outputs exactly what we get from C++. That's what I interpreted the scope of the ticket to be.My proposal rests on this point-of-view: I think we need, for accessibility reasons, 2 separate ways of surfacing details regarding the ExecPlan, for 2 separate audiences (one readable by the seasoned Arrow developer and one readable by the regular R user).
show_exec_plan()
would cater to the first audience and, thus, focus on minimising cognitive load by keeping things the same across languages.explain()
should probably not aim to be both and would be targeted towards the second audience and do it well. By well I understand here as in a way that makes sense to the regular R user, who expects arrow to just work. In this context,explain()
would be a tool allowing them to inspect what is going on in a dplyr-like pipeline, but do so in a language they are familiar with.Having an
explain()
method would absolutely be super useful, but I do not think that is fully scoped out yet. Probably, before we start the implementation ofexplain()
, we should flesh out what the output ofexplain()
will look like. I strongly believe we should start from the description of the generic which says: "gives more details about an object (...) and is more focused on human readable output" (my underlining).In this PR there are already some really good suggestions of stuff we could include in the output of a function like
explain()
(e.g. to coverquery_can_stream()
).If we add to the above the time constraint of the impending release, I do not think extending this PR to cover
explain()
would be the best course of action.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I see your case for why not
explain()
, you want to make something more R-user friendly. So then this seems likeshow_query()
. The only reason I see in your doc for not using show_query is "show_query()
states that one of its aims is to provide a more human readable output thatstr()
". That's a really low bar here, given what str() does with R6 classes.My recommendation, for what it's worth: in this PR, make this function be both
show_query
andexplain
, and in a followup, add more human embellishment toexplain()
. (dbplyr:::explain.tbl_sql
, for example, calls show_query() and then prints more details after it.)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy with your recommendation. One of my earlier drafts looked a bit more like
explain.tbl_sql
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nealrichardson Just to clarify, is your recommendation to have
show_exec_plan()
as a stand-alone function +show_query.arrow_dplyr_query()
andexplain.arrow_dplyr_query()
or just the 2 methods and to discardshow_exec_plan()
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My recommendation was the latter, but I don't object to also having a standalone show_exec_plan()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of having
show_exec_plan()
separately as it allows us to write a bit of documentation vsshow_query()
andexplain()
. I'm also aware that this would be the first instance in which document dplyr-like behaviour (AFAIK there is no such documentation in the pkgdown website / help files).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're missing a few things here that get handled in ExecPlan_run: https://github.com/apache/arrow/blob/master/r/src/compute-exec.cpp#L67-L89
So plan$ToString() will miss sorting and sorting + head/tail (topK). If you add some tests that include
arrange()
, you'll see they're not showing up.You could add a C++ function that does that first part of ExecPlan_run and creates the sink nodes, then calls ToString(). That would print everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, if the pipeline includes calls to
arrange()
and / orhead()
they are not captured byshow_exec_plan()
. Moreover, the filter node gets lost when we usehead()
.Let me do some research / reading and will get back on this. It might be well beyond my C++ skillset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting about the filter node, I didn't expect that. I wonder what's going on there. (That's one of the great things about adding this feature: who knows what else we'll see in the exec plans that doesn't match what we think it should be.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added 2 failing tests for
arrange()
andhead()
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nealrichardson thanks for pointing me to use the first part of the
ExecPlan_run
. I managed to get some output related to theSinkNodes
, see below:but, we still lose some information when head/tail are involved:
What I noticed is that, besides the loss of information on the
FilterNode
, the starting point becomes aSourceNode
and not aTableSourceNode
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand what is going on here. We effectively have 2 ExecPlans because the
order_by_sink
option (involved inarrange()
) is a pipeline breaker and fully materialises the input into memory (thus splitting the ExecPlan into 2 parts). The first ExecPlan starts with aTableSourceNode
, includes aFilterNode
and most likely ends with anOrderBySinkNode
. The second ExecPlan starts with aSourceNode
and ends with theSinkNode
corresponding tohead()
. The print method only captures the second (final) ExecPlan. Maybe we could think about how we want to capture situations in which "multiple" ExecPlans are involved (i.e. we have an operation that is a pipeline breaker).Is my understanding correct? @nealrichardson @jonkeane @paleolimbot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a few minutes to do up master...paleolimbot:arrow:r-print-plan , which basically adds an
explain = TRUE
option toplan$Build()
,plan$Run()
, asas_record_batch_reader.arrow_dplyr_query()
and prints out the exec plan afterExecPlan_run()
. It's a bit of a hack but could work to support chained plans (and reduce some of the code duplication introduced here).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into this, but on a first pass we get less information (we lose all info on the SinkNode - similar to one of my early implementations):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a second pass, it chains the exec plans correctly, but the sink nodes themselves are missing:
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.