Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-15016: [R] show_exec_plan for an arrow_dplyr_query #13541

Merged
merged 82 commits into from
Jul 22, 2022
Merged
Show file tree
Hide file tree
Changes from 12 commits
Commits
Show all changes
82 commits
Select commit Hold shift + click to select a range
403d9df
add a `_ToString` method for the ExecPlan
dragosmg Jul 7, 2022
4a1f94e
add `ToString` method for the `ExecPlan` R6 object
dragosmg Jul 7, 2022
f03ced0
define `show_arrow_query()` and add it to `print.arrow_dplyr_query()`
dragosmg Jul 7, 2022
fdd20be
rename `show_arrow_query()` to `show_exec_plan()` + unit tests
dragosmg Jul 11, 2022
350bea3
lint
dragosmg Jul 11, 2022
c889ec0
use `expect_snapshot()` instead of `expect_output()`
dragosmg Jul 11, 2022
1260540
remove empty row
dragosmg Jul 11, 2022
1f4801d
update test and snapshot
dragosmg Jul 11, 2022
254e14c
document + export `show_exec_plan()`
dragosmg Jul 11, 2022
6294765
example + redocument
dragosmg Jul 11, 2022
cb5d5b7
add `show_exec_plan` to r/_pkgdown.yml under *Computation*
dragosmg Jul 11, 2022
424f12a
bump ci
dragosmg Jul 11, 2022
ccd8f76
run example only when {dplyr} is available
dragosmg Jul 12, 2022
83e5492
remove snapshots and test with `expect_output()`
dragosmg Jul 12, 2022
b9c0c30
lint
dragosmg Jul 12, 2022
07de250
bump ci
dragosmg Jul 12, 2022
0e3b894
docs
dragosmg Jul 12, 2022
3aab4e0
improved `show_exec_plan()` docs
dragosmg Jul 13, 2022
3355135
Merge branch 'master' into show_query
dragosmg Jul 13, 2022
b9b4d49
use `adq`
dragosmg Jul 13, 2022
4779f08
add minimal test
dragosmg Jul 13, 2022
5730cec
failing unit tests for `arrange()` and `head()` "nodes"
dragosmg Jul 13, 2022
23f5f22
add `show_query.arrow-dplyr_query()` and `explain.arrow_dplyr_query()`
dragosmg Jul 13, 2022
2eb3cdb
tests for `show_query()` & `explain()` + clarifications
dragosmg Jul 13, 2022
2b33495
move dataset tests to test-dataset-dplyr.R + unskip
dragosmg Jul 13, 2022
4ff3065
clean-up
dragosmg Jul 13, 2022
d686f9d
clean-up
dragosmg Jul 13, 2022
d452a40
define `ExecPlan_ToStringWithSink` C++ function
dragosmg Jul 18, 2022
f4a1ba4
define the `ToStringWithSink()` R6 ExecPlan method and use it in `sh…
dragosmg Jul 18, 2022
ac7d756
updated unit tests
dragosmg Jul 18, 2022
761b66d
comments
dragosmg Jul 18, 2022
a36f826
clang style
dragosmg Jul 18, 2022
d887d42
comment
dragosmg Jul 18, 2022
effc963
clang style
dragosmg Jul 18, 2022
4571df4
clang style
dragosmg Jul 18, 2022
d1e4ce7
clang style
dragosmg Jul 18, 2022
e2220fe
update dataset tests
dragosmg Jul 18, 2022
3c96fb7
comment
dragosmg Jul 18, 2022
068dc6d
bump ci
dragosmg Jul 18, 2022
fb2a5a0
don't run the cyclocomp linter on the ExecPlan R6 definition
dragosmg Jul 18, 2022
c14a497
typo
dragosmg Jul 19, 2022
aac9edc
`&&`
dragosmg Jul 19, 2022
3faecc6
rename `ToStrinWithSink` to `ToString` and keep only this method
dragosmg Jul 19, 2022
1afcce9
docs
dragosmg Jul 19, 2022
1de3447
C++ lint
dragosmg Jul 19, 2022
f40e5d2
add `ExecPlan_prepare` helper and update `ExecPlan_ToString` C++ func…
dragosmg Jul 19, 2022
8aae3c3
update ToString R6 ExecPlan method
dragosmg Jul 19, 2022
fae74ae
c++ lint
dragosmg Jul 19, 2022
ebd4a02
C++ lint
dragosmg Jul 19, 2022
311b70b
another C++ lint
dragosmg Jul 19, 2022
d0bc159
simpler ExecPlan_prepare
dragosmg Jul 19, 2022
d7834b0
C++ lint
dragosmg Jul 19, 2022
9a02651
more C++ linting
dragosmg Jul 19, 2022
ac37a9d
extend ExecPlan_prepare and use it for ExecPlan_run
dragosmg Jul 19, 2022
238e406
C++ lint
dragosmg Jul 19, 2022
9a34255
C++ lint
dragosmg Jul 19, 2022
1db36b6
include safe call into R header file
dragosmg Jul 19, 2022
2153451
revert to old ExecPlan_run
dragosmg Jul 19, 2022
807aa88
c++ lint
dragosmg Jul 19, 2022
6141c45
simplified ExecPlan_prepare
dragosmg Jul 19, 2022
9af1403
`ExecPlan_prepare` to return both the plan and the sink node
dragosmg Jul 20, 2022
1f88ff7
C++ lint
dragosmg Jul 20, 2022
fbb4c1e
fix the rabbit hole that I led Dragos down
paleolimbot Jul 20, 2022
da73019
always call startproducing
paleolimbot Jul 20, 2022
ee90354
bump ci
dragosmg Jul 21, 2022
66dcd22
updated unit tests + snapshots
dragosmg Jul 21, 2022
0482c6c
updated dataset unit tests + snapshots
dragosmg Jul 21, 2022
ad5024f
remove `BuildAndShow` + add `ToString`
dragosmg Jul 21, 2022
20799f9
update `as_record_batch_reader` and use it inside `show_exec_plan()`
dragosmg Jul 21, 2022
21bc8c3
docs
dragosmg Jul 21, 2022
6376bb2
lints
dragosmg Jul 21, 2022
5751543
revert to da730196fc62b7a1572dc2d40e2550744039a142
dragosmg Jul 22, 2022
e29a835
don't start producing the plan
dragosmg Jul 22, 2022
1309e0f
remove separate tests for `show_query()` and `explain()`
dragosmg Jul 22, 2022
5f890dc
revert to fb2a5a00fc24d16da8e0a596256e0c18b02f5289 and use the `Build…
dragosmg Jul 22, 2022
c4912a4
docs + comments to indicate the duplicate code in:
dragosmg Jul 22, 2022
5aaa3cd
removed `ungroup()` call
dragosmg Jul 22, 2022
ce8b17e
clang lint
dragosmg Jul 22, 2022
0ffa30a
comment
dragosmg Jul 22, 2022
6a8c753
warn and don't build & print the ExecPlan when we have a nested query
dragosmg Jul 22, 2022
7fc20ca
Merge branch 'master' into show_query
dragosmg Jul 22, 2022
35e9ef8
comments in `ExecPlan_prepare`
dragosmg Jul 22, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
1 change: 1 addition & 0 deletions r/NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -348,6 +348,7 @@ export(s3_bucket)
export(schema)
export(set_cpu_count)
export(set_io_thread_count)
export(show_exec_plan)
export(starts_with)
export(string)
export(struct)
Expand Down
4 changes: 4 additions & 0 deletions r/R/arrowExports.R

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

25 changes: 25 additions & 0 deletions r/R/dplyr.R
Original file line number Diff line number Diff line change
Expand Up @@ -219,6 +219,31 @@ tail.arrow_dplyr_query <- function(x, n = 6L, ...) {
x
}

#' Show the details of an Arrow ExecPlan
#'
#' This is a function which gives more details about the `ExecPlan` of an
dragosmg marked this conversation as resolved.
Show resolved Hide resolved
#' `arrow_dplyr_query` object. It is similar to `dplyr::show_query()`.
dragosmg marked this conversation as resolved.
Show resolved Hide resolved
#'
#' @param x an `arrow_dplyr_query` to print the ExecPlan for.
#'
#' @return The argument, invisibly.
dragosmg marked this conversation as resolved.
Show resolved Hide resolved
#' @export
#'
#' @examples
dragosmg marked this conversation as resolved.
Show resolved Hide resolved
#' library(dplyr)
#' mtcars %>%
#' arrow_table() %>%
#' filter(mpg > 20) %>%
#' mutate(x = gear/carb) %>%
#' show_exec_plan()
show_exec_plan <- function(x) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't follow the discussion: why not use show_query() or explain() here? I saw something about wanting to massage the output to make explain() prettier, but why not use this for explain() today since that's something people know, and it's about logical plans?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a link to the design doc, also linked in the PR description and the Jira ticket.

In short, in my opinion, a first step would be to have show_exec_plan() as a function that outputs exactly what we get from C++. That's what I interpreted the scope of the ticket to be.

My proposal rests on this point-of-view: I think we need, for accessibility reasons, 2 separate ways of surfacing details regarding the ExecPlan, for 2 separate audiences (one readable by the seasoned Arrow developer and one readable by the regular R user).

  1. show_exec_plan() would cater to the first audience and, thus, focus on minimising cognitive load by keeping things the same across languages.
  2. explain() should probably not aim to be both and would be targeted towards the second audience and do it well. By well I understand here as in a way that makes sense to the regular R user, who expects arrow to just work. In this context, explain() would be a tool allowing them to inspect what is going on in a dplyr-like pipeline, but do so in a language they are familiar with.

Having an explain() method would absolutely be super useful, but I do not think that is fully scoped out yet. Probably, before we start the implementation of explain(), we should flesh out what the output of explain() will look like. I strongly believe we should start from the description of the generic which says: "gives more details about an object (...) and is more focused on human readable output" (my underlining).

In this PR there are already some really good suggestions of stuff we could include in the output of a function like explain() (e.g. to cover query_can_stream()).

If we add to the above the time constraint of the impending release, I do not think extending this PR to cover explain() would be the best course of action.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I see your case for why not explain(), you want to make something more R-user friendly. So then this seems like show_query(). The only reason I see in your doc for not using show_query is "show_query() states that one of its aims is to provide a more human readable output that str()". That's a really low bar here, given what str() does with R6 classes.

My recommendation, for what it's worth: in this PR, make this function be both show_query and explain, and in a followup, add more human embellishment to explain(). (dbplyr:::explain.tbl_sql, for example, calls show_query() and then prints more details after it.)

Copy link
Contributor Author

@dragosmg dragosmg Jul 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy with your recommendation. One of my earlier drafts looked a bit more like explain.tbl_sql.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nealrichardson Just to clarify, is your recommendation to have show_exec_plan() as a stand-alone function + show_query.arrow_dplyr_query() and explain.arrow_dplyr_query() or just the 2 methods and to discard show_exec_plan()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My recommendation was the latter, but I don't object to also having a standalone show_exec_plan()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of having show_exec_plan() separately as it allows us to write a bit of documentation vs show_query() and explain(). I'm also aware that this would be the first instance in which document dplyr-like behaviour (AFAIK there is no such documentation in the pkgdown website / help files).

adq <- as_adq(x)
dragosmg marked this conversation as resolved.
Show resolved Hide resolved
plan <- ExecPlan$create()
final_node <- plan$Build(x)
cat(plan$ToString())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're missing a few things here that get handled in ExecPlan_run: https://github.com/apache/arrow/blob/master/r/src/compute-exec.cpp#L67-L89

So plan$ToString() will miss sorting and sorting + head/tail (topK). If you add some tests that include arrange(), you'll see they're not showing up.

You could add a C++ function that does that first part of ExecPlan_run and creates the sink nodes, then calls ToString(). That would print everything.

Copy link
Contributor Author

@dragosmg dragosmg Jul 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, if the pipeline includes calls to arrange() and / or head() they are not captured by show_exec_plan(). Moreover, the filter node gets lost when we use head().

  mtcars %>% 
    arrow_table() %>% 
    filter(mpg > 20) %>% 
    arrange(desc(wt)) %>% 
    head(3) %>% 
    show_exec_plan()
#> ExecPlan with 2 nodes:
#> 1:ProjectNode{projection=[mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb]}
#>  0:SourceNode{}

mtcars %>% 
    arrow_table() %>% 
    filter(mpg > 20) %>% 
    arrange(desc(wt)) %>%
    show_exec_plan()
#> ExecPlan with 3 nodes:
#> 2:ProjectNode{projection=[mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb]}
#>   1:FilterNode{filter=(mpg > 20)}
#>     0:TableSourceNode{}

Let me do some research / reading and will get back on this. It might be well beyond my C++ skillset.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting about the filter node, I didn't expect that. I wonder what's going on there. (That's one of the great things about adding this feature: who knows what else we'll see in the exec plans that doesn't match what we think it should be.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added 2 failing tests for arrange() and head().

Copy link
Contributor Author

@dragosmg dragosmg Jul 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nealrichardson thanks for pointing me to use the first part of the ExecPlan_run. I managed to get some output related to the SinkNodes, see below:

mtcars %>%
  arrow_table() %>%
  filter(mpg > 20) %>%
  arrange(desc(wt)) %>%
  show_exec_plan()
#> ExecPlan with 4 nodes:
#> 3:OrderBySinkNode{by={sort_keys=[FieldRef.Name(wt) DESC], null_placement=AtEnd}}
#>   2:ProjectNode{projection=[mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb]}
#>     1:FilterNode{filter=(mpg > 20)}
#>       0:TableSourceNode{}

but, we still lose some information when head/tail are involved:

mtcars %>%
  arrow_table() %>%
  filter(mpg > 20) %>%
  arrange(desc(wt)) %>%
  head(3) %>%
  show_exec_plan()
#> ExecPlan with 3 nodes:
#> 2:SinkNode{}
#>   1:ProjectNode{projection=[mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb]}
#>     0:SourceNode{}

What I noticed is that, besides the loss of information on the FilterNode, the starting point becomes a SourceNode and not a TableSourceNode.

Copy link
Contributor Author

@dragosmg dragosmg Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand what is going on here. We effectively have 2 ExecPlans because the order_by_sink option (involved in arrange()) is a pipeline breaker and fully materialises the input into memory (thus splitting the ExecPlan into 2 parts). The first ExecPlan starts with a TableSourceNode, includes a FilterNode and most likely ends with an OrderBySinkNode. The second ExecPlan starts with a SourceNode and ends with the SinkNode corresponding to head(). The print method only captures the second (final) ExecPlan. Maybe we could think about how we want to capture situations in which "multiple" ExecPlans are involved (i.e. we have an operation that is a pipeline breaker).
Is my understanding correct? @nealrichardson @jonkeane @paleolimbot

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a few minutes to do up master...paleolimbot:arrow:r-print-plan , which basically adds an explain = TRUE option to plan$Build(), plan$Run(), as as_record_batch_reader.arrow_dplyr_query() and prints out the exec plan after ExecPlan_run(). It's a bit of a hack but could work to support chained plans (and reduce some of the code duplication introduced here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this, but on a first pass we get less information (we lose all info on the SinkNode - similar to one of my early implementations):

ds %>%
  filter(lgl) %>%
  arrange(chr) %>%
  show_exec_plan()
#> ExecPlan with 3 nodes:
#> 2:ProjectNode{projection=[int, dbl, lgl, chr, fct, ts, part]}
#>   1:FilterNode{filter=lgl}
#>     0:SourceNode{}

Copy link
Contributor Author

@dragosmg dragosmg Jul 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a second pass, it chains the exec plans correctly, but the sink nodes themselves are missing:

ds %>%
  filter(lgl) %>%
  arrange(chr) %>%
  head() %>%
  show_exec_plan()
#> ExecPlan with 3 nodes:
#> 2:ProjectNode{projection=[int, dbl, lgl, chr, fct, ts, part]}
#>   1:FilterNode{filter=lgl}
#>     0:SourceNode{}
#> ExecPlan with 2 nodes:
#> 1:ProjectNode{projection=[int, dbl, lgl, chr, fct, ts, part]}
#>   0:SourceNode{}

invisible(x)
}

ensure_group_vars <- function(x) {
if (inherits(x, "arrow_dplyr_query")) {
# Before pulling data from Arrow, make sure all group vars are in the projection
Expand Down
1 change: 1 addition & 0 deletions r/R/query-engine.R
Original file line number Diff line number Diff line change
Expand Up @@ -259,6 +259,7 @@ ExecPlan <- R6Class("ExecPlan",
...
)
},
ToString = function() ExecPlan_ToString(self),
dragosmg marked this conversation as resolved.
Show resolved Hide resolved
Stop = function() ExecPlan_StopProducing(self)
)
)
Expand Down
1 change: 1 addition & 0 deletions r/_pkgdown.yml
Original file line number Diff line number Diff line change
Expand Up @@ -219,6 +219,7 @@ reference:
- match_arrow
- value_counts
- list_compute_functions
- show_exec_plan
- title: Connections to other systems
contents:
- to_arrow
Expand Down
26 changes: 26 additions & 0 deletions r/man/show_exec_plan.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

9 changes: 9 additions & 0 deletions r/src/arrowExports.cpp

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions r/src/compute-exec.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -125,6 +125,11 @@ std::shared_ptr<arrow::Schema> ExecNode_output_schema(
return node->output_schema();
}

// [[arrow::export]]
std::string ExecPlan_ToString(const std::shared_ptr<compute::ExecPlan>& plan) {
return plan->ToString();
}
dragosmg marked this conversation as resolved.
Show resolved Hide resolved

#if defined(ARROW_R_WITH_DATASET)

#include <arrow/dataset/file_base.h>
Expand Down
22 changes: 22 additions & 0 deletions r/tests/testthat/_snaps/dplyr-query.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
# show_exec_plan()

Code
tbl %>% arrow_table() %>% filter(dbl > 2, chr != "e") %>% select(chr, int, lgl) %>%
mutate(int_plus_ten = int + 10) %>% show_exec_plan()
Output
ExecPlan with 3 nodes:
2:ProjectNode{projection=[chr, int, lgl, "int_plus_ten": add_checked(cast(int, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), 10)]}
1:FilterNode{filter=((dbl > 2) and (chr != "e"))}
0:TableSourceNode{}

---

Code
tbl %>% record_batch() %>% filter(dbl > 2, chr != "e") %>% select(chr, int, lgl) %>%
mutate(int_plus_ten = int + 10) %>% show_exec_plan()
Output
ExecPlan with 3 nodes:
2:ProjectNode{projection=[chr, int, lgl, "int_plus_ten": add_checked(cast(int, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), 10)]}
1:FilterNode{filter=((dbl > 2) and (chr != "e"))}
0:TableSourceNode{}

20 changes: 20 additions & 0 deletions r/tests/testthat/test-dplyr-query.R
Original file line number Diff line number Diff line change
Expand Up @@ -293,3 +293,23 @@ test_that("No duplicate field names are allowed in an arrow_dplyr_query", {
)
)
})

test_that("show_exec_plan()", {
expect_snapshot(
tbl %>%
arrow_table() %>%
filter(dbl > 2, chr != "e") %>%
select(chr, int, lgl) %>%
mutate(int_plus_ten = int + 10) %>%
show_exec_plan()
)

expect_snapshot(
tbl %>%
record_batch() %>%
filter(dbl > 2, chr != "e") %>%
select(chr, int, lgl) %>%
mutate(int_plus_ten = int + 10) %>%
show_exec_plan()
)
})
dragosmg marked this conversation as resolved.
Show resolved Hide resolved