ARROW-15016: [R] show_exec_plan for an arrow_dplyr_query #13541

Merged Jul 22, 2022
82 commits
403d9df
add a `_ToString` method for the ExecPlan
dragosmg Jul 7, 2022
4a1f94e
add `ToString` method for the `ExecPlan` R6 object
dragosmg Jul 7, 2022
f03ced0
define `show_arrow_query()` and add it to `print.arrow_dplyr_query()`
dragosmg Jul 7, 2022
fdd20be
rename `show_arrow_query()` to `show_exec_plan()` + unit tests
dragosmg Jul 11, 2022
350bea3
lint
dragosmg Jul 11, 2022
c889ec0
use `expect_snapshot()` instead of `expect_output()`
dragosmg Jul 11, 2022
1260540
remove empty row
dragosmg Jul 11, 2022
1f4801d
update test and snapshot
dragosmg Jul 11, 2022
254e14c
document + export `show_exec_plan()`
dragosmg Jul 11, 2022
6294765
example + redocument
dragosmg Jul 11, 2022
cb5d5b7
add `show_exec_plan` to r/_pkgdown.yml under *Computation*
dragosmg Jul 11, 2022
424f12a
bump ci
dragosmg Jul 11, 2022
ccd8f76
run example only when {dplyr} is available
dragosmg Jul 12, 2022
83e5492
remove snapshots and test with `expect_output()`
dragosmg Jul 12, 2022
b9c0c30
lint
dragosmg Jul 12, 2022
07de250
bump ci
dragosmg Jul 12, 2022
0e3b894
docs
dragosmg Jul 12, 2022
3aab4e0
improved `show_exec_plan()` docs
dragosmg Jul 13, 2022
3355135
Merge branch 'master' into show_query
dragosmg Jul 13, 2022
b9b4d49
use `adq`
dragosmg Jul 13, 2022
4779f08
add minimal test
dragosmg Jul 13, 2022
5730cec
failing unit tests for `arrange()` and `head()` "nodes"
dragosmg Jul 13, 2022
23f5f22
add `show_query.arrow-dplyr_query()` and `explain.arrow_dplyr_query()`
dragosmg Jul 13, 2022
2eb3cdb
tests for `show_query()` & `explain()` + clarifications
dragosmg Jul 13, 2022
2b33495
move dataset tests to test-dataset-dplyr.R + unskip
dragosmg Jul 13, 2022
4ff3065
clean-up
dragosmg Jul 13, 2022
d686f9d
clean-up
dragosmg Jul 13, 2022
d452a40
define `ExecPlan_ToStringWithSink` C++ function
dragosmg Jul 18, 2022
f4a1ba4
define the `ToStringWithSink()` R6 ExecPlan method and use it in `sh…
dragosmg Jul 18, 2022
ac7d756
updated unit tests
dragosmg Jul 18, 2022
761b66d
comments
dragosmg Jul 18, 2022
a36f826
clang style
dragosmg Jul 18, 2022
d887d42
comment
dragosmg Jul 18, 2022
effc963
clang style
dragosmg Jul 18, 2022
4571df4
clang style
dragosmg Jul 18, 2022
d1e4ce7
clang style
dragosmg Jul 18, 2022
e2220fe
update dataset tests
dragosmg Jul 18, 2022
3c96fb7
comment
dragosmg Jul 18, 2022
068dc6d
bump ci
dragosmg Jul 18, 2022
fb2a5a0
don't run the cyclocomp linter on the ExecPlan R6 definition
dragosmg Jul 18, 2022
c14a497
typo
dragosmg Jul 19, 2022
aac9edc
`&&`
dragosmg Jul 19, 2022
3faecc6
rename `ToStrinWithSink` to `ToString` and keep only this method
dragosmg Jul 19, 2022
1afcce9
docs
dragosmg Jul 19, 2022
1de3447
C++ lint
dragosmg Jul 19, 2022
f40e5d2
add `ExecPlan_prepare` helper and update `ExecPlan_ToString` C++ func…
dragosmg Jul 19, 2022
8aae3c3
update ToString R6 ExecPlan method
dragosmg Jul 19, 2022
fae74ae
c++ lint
dragosmg Jul 19, 2022
ebd4a02
C++ lint
dragosmg Jul 19, 2022
311b70b
another C++ lint
dragosmg Jul 19, 2022
d0bc159
simpler ExecPlan_prepare
dragosmg Jul 19, 2022
d7834b0
C++ lint
dragosmg Jul 19, 2022
9a02651
more C++ linting
dragosmg Jul 19, 2022
ac37a9d
extend ExecPlan_prepare and use it for ExecPlan_run
dragosmg Jul 19, 2022
238e406
C++ lint
dragosmg Jul 19, 2022
9a34255
C++ lint
dragosmg Jul 19, 2022
1db36b6
include safe call into R header file
dragosmg Jul 19, 2022
2153451
revert to old ExecPlan_run
dragosmg Jul 19, 2022
807aa88
c++ lint
dragosmg Jul 19, 2022
6141c45
simplified ExecPlan_prepare
dragosmg Jul 19, 2022
9af1403
`ExecPlan_prepare` to return both the plan and the sink node
dragosmg Jul 20, 2022
1f88ff7
C++ lint
dragosmg Jul 20, 2022
fbb4c1e
fix the rabbit hole that I led Dragos down
paleolimbot Jul 20, 2022
da73019
always call startproducing
paleolimbot Jul 20, 2022
ee90354
bump ci
dragosmg Jul 21, 2022
66dcd22
updated unit tests + snapshots
dragosmg Jul 21, 2022
0482c6c
updated dataset unit tests + snapshots
dragosmg Jul 21, 2022
ad5024f
remove `BuildAndShow` + add `ToString`
dragosmg Jul 21, 2022
20799f9
update `as_record_batch_reader` and use it inside `show_exec_plan()`
dragosmg Jul 21, 2022
21bc8c3
docs
dragosmg Jul 21, 2022
6376bb2
lints
dragosmg Jul 21, 2022
5751543
revert to da730196fc62b7a1572dc2d40e2550744039a142
dragosmg Jul 22, 2022
e29a835
don't start producing the plan
dragosmg Jul 22, 2022
1309e0f
remove separate tests for `show_query()` and `explain()`
dragosmg Jul 22, 2022
5f890dc
revert to fb2a5a00fc24d16da8e0a596256e0c18b02f5289 and use the `Build…
dragosmg Jul 22, 2022
c4912a4
docs + comments to indicate the duplicate code in:
dragosmg Jul 22, 2022
5aaa3cd
removed `ungroup()` call
dragosmg Jul 22, 2022
ce8b17e
clang lint
dragosmg Jul 22, 2022
0ffa30a
comment
dragosmg Jul 22, 2022
6a8c753
warn and don't build & print the ExecPlan when we have a nested query
dragosmg Jul 22, 2022
7fc20ca
Merge branch 'master' into show_query
dragosmg Jul 22, 2022
35e9ef8
comments in `ExecPlan_prepare`
dragosmg Jul 22, 2022
1 change: 1 addition & 0 deletions r/NAMESPACE
@@ -352,6 +352,7 @@ export(s3_bucket)
export(schema)
export(set_cpu_count)
export(set_io_thread_count)
export(show_exec_plan)
export(starts_with)
export(string)
export(struct)
3 changes: 2 additions & 1 deletion r/R/arrow-package.R
@@ -41,7 +41,8 @@
"group_vars", "group_by_drop_default", "ungroup", "mutate", "transmute",
"arrange", "rename", "pull", "relocate", "compute", "collapse",
"distinct", "left_join", "right_join", "inner_join", "full_join",
"semi_join", "anti_join", "count", "tally", "rename_with", "union", "union_all", "glimpse"
"semi_join", "anti_join", "count", "tally", "rename_with", "union",
"union_all", "glimpse", "show_query", "explain"
)
)
for (cl in c("Dataset", "ArrowTabular", "RecordBatchReader", "arrow_dplyr_query")) {
4 changes: 4 additions & 0 deletions r/R/arrowExports.R

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

47 changes: 47 additions & 0 deletions r/R/dplyr.R
@@ -219,6 +219,53 @@ tail.arrow_dplyr_query <- function(x, n = 6L, ...) {
x
}

#' Show the details of an Arrow Execution Plan
#'
#' This is a function which gives more details about the logical query plan
#' that will be executed when evaluating an `arrow_dplyr_query` object.
#' It calls the C++ `ExecPlan` object's print method.
#' Functionally, it is similar to `dplyr::explain()`. This function is used as
#' the `dplyr::explain()` and `dplyr::show_query()` methods.
#'
#' @param x an `arrow_dplyr_query` to print the `ExecPlan` for.
#'
#' @return `x`, invisibly.
#' @export
#'
#' @examplesIf arrow_with_dataset() && requireNamespace("dplyr", quietly = TRUE)
#' library(dplyr)
#' mtcars %>%
#' arrow_table() %>%
#' filter(mpg > 20) %>%
#' mutate(x = gear/carb) %>%
#' show_exec_plan()
show_exec_plan <- function(x) {
Member

I didn't follow the discussion: why not use show_query() or explain() here? I saw something about wanting to massage the output to make explain() prettier, but why not use this for explain() today since that's something people know, and it's about logical plans?

Contributor Author

Here's a link to the design doc, also linked in the PR description and the Jira ticket.

In short, in my opinion, a first step would be to have show_exec_plan() as a function that outputs exactly what we get from C++. That's what I interpreted the scope of the ticket to be.

My proposal rests on this point-of-view: I think we need, for accessibility reasons, 2 separate ways of surfacing details regarding the ExecPlan, for 2 separate audiences (one readable by the seasoned Arrow developer and one readable by the regular R user).

  1. show_exec_plan() would cater to the first audience and, thus, focus on minimising cognitive load by keeping things the same across languages.
  2. explain() should probably not aim to serve both audiences; it would target the second one and do that well. By "well" I mean in a way that makes sense to the regular R user, who expects arrow to just work. In this context, explain() would be a tool allowing them to inspect what is going on in a dplyr-like pipeline, in a language they are familiar with.

Having an explain() method would absolutely be super useful, but I do not think that is fully scoped out yet. Probably, before we start the implementation of explain(), we should flesh out what the output of explain() will look like. I strongly believe we should start from the description of the generic which says: "gives more details about an object (...) and is more focused on human readable output" (my underlining).

In this PR there are already some really good suggestions of stuff we could include in the output of a function like explain() (e.g. to cover query_can_stream()).

If we add to the above the time constraint of the impending release, I do not think extending this PR to cover explain() would be the best course of action.

Member

Okay, I see your case for why not explain(), you want to make something more R-user friendly. So then this seems like show_query(). The only reason I see in your doc for not using show_query is "show_query() states that one of its aims is to provide a more human readable output than str()". That's a really low bar here, given what str() does with R6 classes.

My recommendation, for what it's worth: in this PR, make this function be both show_query and explain, and in a followup, add more human embellishment to explain(). (dbplyr:::explain.tbl_sql, for example, calls show_query() and then prints more details after it.)
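
The pattern suggested above, where `explain()` delegates to `show_query()` and then appends extra detail, can be sketched in base R. Everything here is a stand-in: the generics are defined locally rather than imported from dplyr, and `toy_query` with its `plan` and `can_stream` fields is hypothetical, not part of arrow.

```r
# Local stand-ins for the dplyr generics, so the sketch runs without dplyr.
show_query <- function(x, ...) UseMethod("show_query")
explain <- function(x, ...) UseMethod("explain")

# Hypothetical query object: a vector of plan lines plus a streaming flag.
new_toy_query <- function(plan, can_stream) {
  structure(list(plan = plan, can_stream = can_stream), class = "toy_query")
}

# show_query(): print the plan verbatim and return the input invisibly.
show_query.toy_query <- function(x, ...) {
  cat(x$plan, sep = "\n")
  invisible(x)
}

# explain(): reuse show_query(), then append human-oriented details.
explain.toy_query <- function(x, ...) {
  show_query(x, ...)
  cat("Can stream: ", x$can_stream, "\n", sep = "")
  invisible(x)
}

q <- new_toy_query(c("ProjectNode", "SourceNode"), can_stream = TRUE)
explain(q)
```

Keeping `show_query()` as the verbatim layer means later embellishments only touch the `explain()` method.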

Contributor Author

@dragosmg dragosmg Jul 13, 2022

Happy with your recommendation. One of my earlier drafts looked a bit more like explain.tbl_sql.

Contributor Author

@nealrichardson Just to clarify, is your recommendation to have show_exec_plan() as a stand-alone function + show_query.arrow_dplyr_query() and explain.arrow_dplyr_query() or just the 2 methods and to discard show_exec_plan()?

Member

My recommendation was the latter, but I don't object to also having a standalone show_exec_plan()

Contributor Author

I like the idea of having show_exec_plan() separately as it allows us to write a bit of documentation vs show_query() and explain(). I'm also aware that this would be the first instance in which we document dplyr-like behaviour (AFAIK there is no such documentation in the pkgdown website / help files).

adq <- as_adq(x)
plan <- ExecPlan$create()
# do not show the plan if we have a nested query (as this will force the
# evaluation of the inner query/queries)
# TODO see if we can remove after ARROW-16628
if (is_collapsed(x) && has_head_tail(x$.data)) {
warn("The `ExecPlan` cannot be printed for a nested query.")
return(invisible(x))
}
final_node <- plan$Build(adq)
cat(plan$BuildAndShow(final_node))
invisible(x)
}

show_query.arrow_dplyr_query <- function(x, ...) {
show_exec_plan(x)
}

show_query.Dataset <- show_query.ArrowTabular <- show_query.RecordBatchReader <- show_query.arrow_dplyr_query

explain.arrow_dplyr_query <- function(x, ...) {
show_exec_plan(x)
}

explain.Dataset <- explain.ArrowTabular <- explain.RecordBatchReader <- explain.arrow_dplyr_query
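
The early-return guard in `show_exec_plan()` can be illustrated without arrow. In this sketch `show_plan()` and the `nested` flag are hypothetical, and base `warning()` stands in for the `warn()` call used in the diff.

```r
# Hypothetical stand-in for show_exec_plan(): warn and skip printing when the
# query is nested, otherwise print the plan; always return x invisibly.
show_plan <- function(x) {
  if (isTRUE(x$nested)) {
    warning("The plan cannot be printed for a nested query.", call. = FALSE)
    return(invisible(x))
  }
  cat(x$plan, sep = "\n")
  invisible(x)
}

flat <- list(plan = c("ProjectNode", "SourceNode"), nested = FALSE)
nested <- list(plan = character(), nested = TRUE)

show_plan(flat)
# Capture the warning message instead of letting it surface.
msg <- tryCatch(show_plan(nested), warning = function(w) conditionMessage(w))
```

Returning `x` invisibly on both paths keeps the function usable at the end of a pipeline whether or not a plan was printed.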

ensure_group_vars <- function(x) {
if (inherits(x, "arrow_dplyr_query")) {
# Before pulling data from Arrow, make sure all group vars are in the projection
37 changes: 37 additions & 0 deletions r/R/query-engine.R
@@ -193,6 +193,8 @@ ExecPlan <- R6Class("ExecPlan",
node
},
Run = function(node, as_table = FALSE) {
# a section of this code is used by `BuildAndShow()` too - the 2 need to be in sync
# Start of chunk used in `BuildAndShow()`
assert_is(node, "ExecNode")

# Sorting and head/tail (if sorted) are handled in the SinkNode,
@@ -210,6 +212,8 @@ ExecPlan <- R6Class("ExecPlan",
sorting$orders <- as.integer(sorting$orders)
}

# End of chunk used in `BuildAndShow()`

# If we are going to return a Table anyway, we do this in one step and
# entirely in one C++ call to ensure that we can execute user-defined
# functions from the worker threads spawned by the ExecPlan. If not, we
@@ -273,6 +277,39 @@ ExecPlan <- R6Class("ExecPlan",
...
)
},
# SinkNodes (involved in arrange and/or head/tail operations) are created in
# ExecPlan_run and are not captured by the regular print method. We take a
# similar approach to expose them before calling the print method.
BuildAndShow = function(node) {
# a section of this code is copied from `Run()` - the 2 need to be in sync
# Start of chunk copied from `Run()`

assert_is(node, "ExecNode")

# Sorting and head/tail (if sorted) are handled in the SinkNode,
# created in ExecPlan_run
sorting <- node$extras$sort %||% list()
select_k <- node$extras$head %||% -1L
has_sorting <- length(sorting) > 0
if (has_sorting) {
if (!is.null(node$extras$tail)) {
# Reverse the sort order and take the top K, then after we'll reverse
# the resulting rows so that it is ordered as expected
sorting$orders <- !sorting$orders
select_k <- node$extras$tail
}
sorting$orders <- as.integer(sorting$orders)
}

# End of chunk copied from `Run()`

ExecPlan_BuildAndShow(
self,
node,
sorting,
select_k
)
},
Stop = function() ExecPlan_StopProducing(self)
)
)
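
Because `Run()` and `BuildAndShow()` must keep the same sorting and head/tail logic in sync, one option is to pull that chunk into a shared helper. The sketch below is an assumption about how that could look, not code from this PR: `sink_options()` and `sink_factory_name()` are hypothetical names, `%||%` is defined locally, and `extras` is a plain list standing in for `node$extras`.

```r
# Null-default operator, defined locally so the sketch is self-contained.
`%||%` <- function(x, y) if (is.null(x)) y else x

# Hypothetical helper: derive sort options and the row limit from the extras
# stored on the final node, mirroring the chunk duplicated in the diff.
sink_options <- function(extras) {
  sorting <- extras$sort %||% list()
  select_k <- extras$head %||% -1L
  if (length(sorting) > 0) {
    if (!is.null(extras$tail)) {
      # tail(): reverse the sort order and take the top K instead
      sorting$orders <- !sorting$orders
      select_k <- extras$tail
    }
    sorting$orders <- as.integer(sorting$orders)
  }
  list(sorting = sorting, select_k = select_k)
}

# Hypothetical helper: name of the sink the C++ side creates for these
# options, following the branches in ExecPlan_BuildAndShow.
sink_factory_name <- function(opts) {
  if (length(opts$sorting) == 0) {
    "sink"
  } else if (opts$select_k >= 0) {
    "select_k_sink"
  } else {
    "order_by_sink"
  }
}

opts <- sink_options(list(sort = list(names = "chr", orders = FALSE), tail = 3L))
sink_factory_name(opts)
```

With a helper along these lines, both R6 methods could call `sink_options(node$extras)` and the "need to be in sync" comments would no longer be necessary.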
1 change: 1 addition & 0 deletions r/_pkgdown.yml
@@ -220,6 +220,7 @@ reference:
- value_counts
- list_compute_functions
- register_scalar_function
- show_exec_plan
- title: Connections to other systems
contents:
- to_arrow
31 changes: 31 additions & 0 deletions r/man/show_exec_plan.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

12 changes: 12 additions & 0 deletions r/src/arrowExports.cpp

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

46 changes: 46 additions & 0 deletions r/src/compute-exec.cpp
@@ -60,6 +60,10 @@ std::pair<std::shared_ptr<compute::ExecPlan>, std::shared_ptr<arrow::RecordBatch
ExecPlan_prepare(const std::shared_ptr<compute::ExecPlan>& plan,
const std::shared_ptr<compute::ExecNode>& final_node,
cpp11::list sort_options, cpp11::strings metadata, int64_t head = -1) {
// a section of this code is copied and used in ExecPlan_BuildAndShow - the 2 need
// to be in sync
// Start of chunk used in ExecPlan_BuildAndShow

// For now, don't require R to construct SinkNodes.
// Instead, just pass the node we should collect as an argument.
arrow::AsyncGenerator<arrow::util::optional<compute::ExecBatch>> sink_gen;
@@ -88,6 +92,8 @@ ExecPlan_prepare(const std::shared_ptr<compute::ExecPlan>& plan,
compute::SinkNodeOptions{&sink_gen});
}

// End of chunk used in ExecPlan_BuildAndShow

StopIfNotOk(plan->Validate());

// If the generator is destroyed before being completely drained, inform plan
@@ -155,6 +161,46 @@ std::shared_ptr<arrow::Schema> ExecNode_output_schema(
return node->output_schema();
}

// [[arrow::export]]
std::string ExecPlan_BuildAndShow(const std::shared_ptr<compute::ExecPlan>& plan,
const std::shared_ptr<compute::ExecNode>& final_node,
cpp11::list sort_options, int64_t head = -1) {
// a section of this code is copied from ExecPlan_prepare - the 2 need to be in sync
// Start of chunk copied from ExecPlan_prepare

// For now, don't require R to construct SinkNodes.
// Instead, just pass the node we should collect as an argument.
arrow::AsyncGenerator<arrow::util::optional<compute::ExecBatch>> sink_gen;

// Sorting uses a different sink node; there is no general sort yet
if (sort_options.size() > 0) {
if (head >= 0) {
// Use the SelectK node to take only what we need
MakeExecNodeOrStop(
"select_k_sink", plan.get(), {final_node.get()},
compute::SelectKSinkNodeOptions{
arrow::compute::SelectKOptions(
head, std::dynamic_pointer_cast<compute::SortOptions>(
make_compute_options("sort_indices", sort_options))
->sort_keys),
&sink_gen});
} else {
MakeExecNodeOrStop("order_by_sink", plan.get(), {final_node.get()},
compute::OrderBySinkNodeOptions{
*std::dynamic_pointer_cast<compute::SortOptions>(
make_compute_options("sort_indices", sort_options)),
&sink_gen});
}
} else {
MakeExecNodeOrStop("sink", plan.get(), {final_node.get()},
compute::SinkNodeOptions{&sink_gen});
}

// End of chunk copied from ExecPlan_prepare

return plan->ToString();
}

#if defined(ARROW_R_WITH_DATASET)

#include <arrow/dataset/file_base.h>
77 changes: 77 additions & 0 deletions r/tests/testthat/test-dataset-dplyr.R
@@ -340,3 +340,80 @@ test_that("dplyr method not implemented messages", {
fixed = TRUE
)
})

test_that("show_exec_plan(), show_query() and explain() with datasets", {
# show_query() and explain() are wrappers around show_exec_plan() and are not
# tested separately

ds <- open_dataset(dataset_dir, partitioning = schema(part = uint8()))

# minimal test
expect_output(
ds %>%
show_exec_plan(),
regexp = paste0(
"ExecPlan with .* nodes:.*", # boiler plate for ExecPlan
"ProjectNode.*", # output columns
"SourceNode" # entry point
)
)

# filter and select
expect_output(
ds %>%
select(string = chr, integer = int, part) %>%
filter(integer > 6L & part == 1) %>%
show_exec_plan(),
regexp = paste0(
"ExecPlan with .* nodes:.*", # boiler plate for ExecPlan
"ProjectNode.*", # output columns
"FilterNode.*", # filter node
"int > 6.*cast.*", # filtering expressions + auto-casting of part
"SourceNode" # entry point
)
)

# group_by and summarise
expect_output(
ds %>%
group_by(part) %>%
summarise(avg = mean(int)) %>%
show_exec_plan(),
regexp = paste0(
"ExecPlan with .* nodes:.*", # boiler plate for ExecPlan
"ProjectNode.*", # output columns
"GroupByNode.*", # group by node
"keys=.*part.*", # key for aggregations
"aggregates=.*hash_mean.*", # aggregations
"ProjectNode.*", # input columns
"SourceNode" # entry point
)
)

# filter and arrange
expect_output(
ds %>%
filter(lgl) %>%
arrange(chr) %>%
show_exec_plan(),
regexp = paste0(
"ExecPlan with .* nodes:.*", # boiler plate for ExecPlan
"OrderBySinkNode.*chr.*ASC.*", # arrange goes via the OrderBy sink node
"ProjectNode.*", # output columns
"FilterNode.*", # filter node
"filter=lgl.*", # filtering expression
"SourceNode" # entry point
)
)

# printing the ExecPlan for a nested query would currently force the
# evaluation of the inner one(s), which we want to avoid => no output
expect_warning(
ds %>%
filter(lgl) %>%
arrange(chr) %>%
head() %>%
show_exec_plan(),
"The `ExecPlan` cannot be printed for a nested query."
)
})
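
The `expect_output()` checks above rely on a single regular expression assembled with `paste0()`, where each `.*` spans the text, including line breaks, between two expected node names. A minimal base R illustration follows; the plan text is hand-written for the example and is not actual `ExecPlan` output.

```r
# Hand-written stand-in for printed plan text (not real ExecPlan output).
plan_text <- paste(
  "ExecPlan with 3 nodes:",
  "2:ProjectNode{projection=[chr, int]}",
  "  1:FilterNode{filter=(int > 6)}",
  "    0:SourceNode{}",
  sep = "\n"
)

# Same construction as in the tests: fragments joined into one pattern.
pattern <- paste0(
  "ExecPlan with .* nodes:.*", # header
  "ProjectNode.*",             # output columns
  "FilterNode.*",              # filter node
  "SourceNode"                 # entry point
)

# With R's default regex engine, `.` also matches "\n", so the pattern
# can span several printed lines.
grepl(pattern, plan_text)

# Order matters: a pattern with the fragments swapped does not match.
grepl("SourceNode.*FilterNode", plan_text)
```

Because the fragments are order-sensitive, such a pattern checks the nesting order of the nodes as well as their presence, while staying tolerant of details like node counts.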