Show the result of all optimizer passes in EXPLAIN VERBOSE #759

Merged
merged 6 commits into apache:master from alamb/all_the_explain
Jul 20, 2021

Contributor

@alamb alamb commented Jul 20, 2021

Which issue does this PR close?

Resolves #733

Rationale for this Change

Previously, only some logical optimizer passes (and no physical optimizer passes) were shown in EXPLAIN VERBOSE output. Each optimizer had to special-case its own handling of explain, so unsurprisingly some (especially newly written ones) did not.

What changes are included in this PR?

  1. Handle capturing logical optimizer output in ExecutionContext::optimize
  2. Remove old "optimize_for_explain" plumbing
  3. Show plans that are identical to the previous one as "SAME TEXT AS ABOVE"
  4. Capture physical optimizer output in PhysicalPlanner
  5. Clean up how StringifiedPlans are created using traits
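
To illustrate the idea behind items 1 and 4, here is a minimal sketch of running passes through one central loop that reports to an observer callback, so individual optimizers need no explain-specific code. Everything in it is a simplified, hypothetical stand-in rather than DataFusion's actual API: `Plan` is just a string, and the `OptimizerRule` trait, `optimize_and_capture`, and the two toy rules are invented for illustration. Only the two-argument observer closure mirrors the `optimize_internal(plan, |_, _| {})` call visible in the diff.

```rust
// Hypothetical stand-in: a "plan" is just its display text.
type Plan = String;

/// One optimizer pass (assumed shape, for illustration only).
trait OptimizerRule {
    fn name(&self) -> &str;
    fn optimize(&self, plan: &Plan) -> Plan;
}

/// One row of EXPLAIN VERBOSE output: a label plus the plan text.
#[derive(Debug, Clone, PartialEq)]
struct StringifiedPlan {
    plan_type: String,
    plan: String,
}

/// Runs every rule in order, calling `observer` with the plan each pass produced.
fn optimize_internal<F>(plan: &Plan, rules: &[Box<dyn OptimizerRule>], mut observer: F) -> Plan
where
    F: FnMut(&Plan, &dyn OptimizerRule),
{
    let mut current = plan.clone();
    for rule in rules {
        current = rule.optimize(&current);
        observer(&current, &**rule);
    }
    current
}

/// Explain path: the observer records one row per pass.
fn optimize_and_capture(
    plan: &Plan,
    rules: &[Box<dyn OptimizerRule>],
) -> (Plan, Vec<StringifiedPlan>) {
    let mut captured = vec![StringifiedPlan {
        plan_type: "initial_logical_plan".to_string(),
        plan: plan.clone(),
    }];
    let optimized = optimize_internal(plan, rules, |p, rule| {
        captured.push(StringifiedPlan {
            plan_type: format!("logical_plan after {}", rule.name()),
            plan: p.clone(),
        });
    });
    (optimized, captured)
}

/// Toy rule: pretends to push the projection into the scan.
struct ProjectionPushDown;
impl OptimizerRule for ProjectionPushDown {
    fn name(&self) -> &str {
        "projection_push_down"
    }
    fn optimize(&self, plan: &Plan) -> Plan {
        plan.replace("projection=None", "projection=Some([0, 1])")
    }
}

/// Toy rule that never changes the plan.
struct LimitPushDown;
impl OptimizerRule for LimitPushDown {
    fn name(&self) -> &str {
        "limit_push_down"
    }
    fn optimize(&self, plan: &Plan) -> Plan {
        plan.clone()
    }
}
```

In this sketch the non-explain path would call `optimize_internal` with a no-op observer, so the capture costs nothing when it is not requested, and a newly written rule shows up in the output without any extra work.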

Are there any user-facing changes?

Yes, the EXPLAIN VERBOSE output changes. To see the difference, run:

echo "1,2" > /tmp/foo.csv
cargo run --bin datafusion-cli

Then run

CREATE EXTERNAL TABLE foo(c1 int, c2 int)
STORED AS CSV
LOCATION '/tmp/foo.csv';

EXPLAIN VERBOSE SELECT * from foo;

Before this change:

Note that the optimizer passes appear duplicated in this output because that is what actually happens: optimize is called once as part of ExecutionContext::sql() and again as part of DataFrame_impl::collect(). If we want to avoid the double optimization, I think we should treat that separately in a follow-on PR. This PR faithfully captures what DataFusion is actually doing.

+-----------------------------------------+--------------------------------------------------------------------------+
| plan_type                               | plan                                                                     |
+-----------------------------------------+--------------------------------------------------------------------------+
| initial_logical_plan                    | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=None                                         |
| logical_plan after projection_push_down | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after simplify_expressions | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after limit_push_down      | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after projection_push_down | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after simplify_expressions | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after limit_push_down      | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan                            | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| initial_physical_plan                   | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                         |   CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false   |
| physical_plan                           | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                         |   RepartitionExec: partitioning=RoundRobinBatch(16)                      |
|                                         |     CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false |
+-----------------------------------------+--------------------------------------------------------------------------+

After this change:

> EXPLAIN VERBOSE SELECT * from foo;
+-------------------------------------------+--------------------------------------------------------------------------+
| plan_type                                 | plan                                                                     |
+-------------------------------------------+--------------------------------------------------------------------------+
| initial_logical_plan                      | Projection: #foo.c1, #foo.c2                                             |
|                                           |   TableScan: foo projection=None                                         |
| logical_plan after constant_folding       | SAME TEXT AS ABOVE                                                       |
| logical_plan after eliminate_limit        | SAME TEXT AS ABOVE                                                       |
| logical_plan after aggregate_statistics   | SAME TEXT AS ABOVE                                                       |
| logical_plan after projection_push_down   | Projection: #foo.c1, #foo.c2                                             |
|                                           |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after filter_push_down       | SAME TEXT AS ABOVE                                                       |
| logical_plan after simplify_expressions   | SAME TEXT AS ABOVE                                                       |
| logical_plan after hash_build_probe_order | SAME TEXT AS ABOVE                                                       |
| logical_plan after limit_push_down        | SAME TEXT AS ABOVE                                                       |
| logical_plan after constant_folding       | SAME TEXT AS ABOVE                                                       |
| logical_plan after eliminate_limit        | SAME TEXT AS ABOVE                                                       |
| logical_plan after aggregate_statistics   | SAME TEXT AS ABOVE                                                       |
| logical_plan after projection_push_down   | SAME TEXT AS ABOVE                                                       |
| logical_plan after filter_push_down       | SAME TEXT AS ABOVE                                                       |
| logical_plan after simplify_expressions   | SAME TEXT AS ABOVE                                                       |
| logical_plan after hash_build_probe_order | SAME TEXT AS ABOVE                                                       |
| logical_plan after limit_push_down        | SAME TEXT AS ABOVE                                                       |
| logical_plan                              | Projection: #foo.c1, #foo.c2                                             |
|                                           |   TableScan: foo projection=Some([0, 1])                                 |
| initial_physical_plan                     | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                           |   CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false   |
| physical_plan after coalesce_batches      | SAME TEXT AS ABOVE                                                       |
| physical_plan after repartition           | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                           |   RepartitionExec: partitioning=RoundRobinBatch(16)                      |
|                                           |     CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false |
| physical_plan after add_merge_exec        | SAME TEXT AS ABOVE                                                       |
| physical_plan                             | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                           |   RepartitionExec: partitioning=RoundRobinBatch(16)                      |
|                                           |     CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false |
+-------------------------------------------+--------------------------------------------------------------------------+

@alamb alamb added the api change Changes the API exposed to users of the crate label Jul 20, 2021
@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Jul 20, 2021
    _,
) => {
    let schema = schema.as_ref().to_owned().into();
    optimize_explain(
Contributor Author

This is the old-style "explain" implementation, which had to be repeated in each optimizer for it to show explain plans correctly (and it was missing from several).

plan_builder.append_value(&*p.plan)?;
match prev {
    Some(prev) if !should_show(prev, p) => {
        plan_builder.append_value("SAME TEXT AS ABOVE")?;
Contributor Author

Once I started dumping out all the explain plans, the amount of repetition was enormous, so I also added code to avoid duplication when an optimizer pass makes no changes.
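
A minimal sketch of that de-duplication, under stated assumptions: `Row`, `is_final`, and `render` are hypothetical names, and the `should_show` body is a guess. Only the name `should_show` and the "SAME TEXT AS ABOVE" marker come from the diff. The rule that the final `logical_plan` and `physical_plan` rows are always printed in full is inferred from the "After this change" output, not taken from the code.

```rust
/// One captured row: label plus plan text (hypothetical stand-in type).
struct Row {
    plan_type: String,
    plan: String,
}

/// Assumption: rows holding a final plan are always shown in full.
fn is_final(plan_type: &str) -> bool {
    plan_type == "logical_plan" || plan_type == "physical_plan"
}

/// Show the text when it differs from the previous row, or when it is a final plan.
fn should_show(prev: &Row, this: &Row) -> bool {
    prev.plan != this.plan || is_final(&this.plan_type)
}

/// Produces (plan_type, displayed text) pairs, collapsing unchanged plans.
fn render(rows: &[Row]) -> Vec<(String, String)> {
    let mut out = Vec::with_capacity(rows.len());
    let mut prev: Option<&Row> = None;
    for row in rows {
        let text = match prev {
            Some(p) if !should_show(p, row) => "SAME TEXT AS ABOVE".to_string(),
            _ => row.plan.clone(),
        };
        out.push((row.plan_type.clone(), text));
        // Compare against the real previous plan, not the marker text.
        prev = Some(row);
    }
    out
}
```

Comparing against the previous row's actual plan text (rather than what was printed) means a run of unchanged passes all collapse to the marker, and the next pass that changes anything is printed in full.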

Contributor

+1

Contributor

@NGA-TRAN NGA-TRAN left a comment

LGTM

        schema: schema.clone(),
    })
} else {
    self.optimize_internal(plan, |_, _| {})
Contributor

Nice. Now the optimized plan is displayed.

plan_builder.append_value(&*p.plan)?;
match prev {
    Some(prev) if !should_show(prev, p) => {
        plan_builder.append_value("SAME TEXT AS ABOVE")?;
Contributor

+1

Member

@jorgecarleitao jorgecarleitao left a comment

LGTM

@jorgecarleitao jorgecarleitao merged commit 30693df into apache:master Jul 20, 2021
@alamb alamb deleted the alamb/all_the_explain branch July 20, 2021 21:10
Successfully merging this pull request may close these issues.

EXPLAIN VERBOSE does not include all the different passes nor final physical plan
3 participants