Show the result of all optimizer passes in EXPLAIN VERBOSE #759

alamb · 2021-07-20T16:24:56Z

Which issue does this PR close?

Resolves #733

Rationale for this Change

Previously, only some logical optimizer passes (and no physical optimizer passes) were shown in EXPLAIN VERBOSE output. This was because each optimizer had to special-case its own handling of explain, and unsurprisingly some (especially newly written ones) did not.

What changes are included in this PR?

Handle capturing logical optimizer output in ExecutionContext::optimize
Remove old "optimize_for_explain" plumbing
Show plans that are no different than the previous as "SAME TEXT AS ABOVE"
Capture physical optimizer output in PhysicalPlanner
Clean up how StringifiedPlans are created using traits
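
For readers unfamiliar with the last item, here is a minimal sketch of what creating stringified plans through a trait could look like. It is not the code from this PR: `ToyPlan`, the reduced `PlanType` enum, and the exact trait signature are assumptions made for illustration only.

```rust
use std::fmt;

/// Hypothetical subset of the labels shown in the `plan_type` column.
#[derive(Debug, Clone, PartialEq)]
pub enum PlanType {
    InitialLogicalPlan,
    OptimizedLogicalPlan { optimizer_name: String },
    FinalLogicalPlan,
}

impl fmt::Display for PlanType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PlanType::InitialLogicalPlan => write!(f, "initial_logical_plan"),
            PlanType::OptimizedLogicalPlan { optimizer_name } => {
                write!(f, "logical_plan after {}", optimizer_name)
            }
            PlanType::FinalLogicalPlan => write!(f, "logical_plan"),
        }
    }
}

/// A plan rendered to text, tagged with the stage that produced it.
#[derive(Debug, Clone, PartialEq)]
pub struct StringifiedPlan {
    pub plan_type: PlanType,
    pub plan: String,
}

/// Anything that can be rendered for EXPLAIN implements this one method
/// (assumed signature, for illustration).
pub trait ToStringifiedPlan {
    fn to_stringified(&self, plan_type: PlanType) -> StringifiedPlan;
}

/// Toy plan type standing in for a real logical or physical plan.
pub struct ToyPlan {
    pub description: String,
}

impl ToStringifiedPlan for ToyPlan {
    fn to_stringified(&self, plan_type: PlanType) -> StringifiedPlan {
        StringifiedPlan {
            plan_type,
            plan: self.description.clone(),
        }
    }
}
```

With a trait like this, the code that collects explain output only needs to call `to_stringified` with the right `PlanType`, regardless of which kind of plan it is holding.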

Are there any user-facing changes?

Yes. Explain output is different. To see the difference, do

echo "1,2" > /tmp/foo.csv
cargo run --bin datafusion-cli

Then run

CREATE EXTERNAL TABLE foo(c1 int, c2 int)
STORED AS CSV
LOCATION '/tmp/foo.csv';

EXPLAIN VERBOSE SELECT * from foo;

Before this change:

Note: the optimizer passes appear duplicated in this explain output because that is what actually happens: optimize is called once as part of ExecutionContext::sql() and again as part of DataFrame_impl::collect(). If we want to avoid the double optimization, I think we should treat it separately and do so in a follow-on PR. This PR faithfully captures what DataFusion is actually doing.

+-----------------------------------------+--------------------------------------------------------------------------+
| plan_type                               | plan                                                                     |
+-----------------------------------------+--------------------------------------------------------------------------+
| initial_logical_plan                    | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=None                                         |
| logical_plan after projection_push_down | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after simplify_expressions | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after limit_push_down      | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after projection_push_down | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after simplify_expressions | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after limit_push_down      | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan                            | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| initial_physical_plan                   | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                         |   CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false   |
| physical_plan                           | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                         |   RepartitionExec: partitioning=RoundRobinBatch(16)                      |
|                                         |     CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false |
+-----------------------------------------+--------------------------------------------------------------------------+

After this change:

> EXPLAIN VERBOSE SELECT * from foo;
+-------------------------------------------+--------------------------------------------------------------------------+
| plan_type                                 | plan                                                                     |
+-------------------------------------------+--------------------------------------------------------------------------+
| initial_logical_plan                      | Projection: #foo.c1, #foo.c2                                             |
|                                           |   TableScan: foo projection=None                                         |
| logical_plan after constant_folding       | SAME TEXT AS ABOVE                                                       |
| logical_plan after eliminate_limit        | SAME TEXT AS ABOVE                                                       |
| logical_plan after aggregate_statistics   | SAME TEXT AS ABOVE                                                       |
| logical_plan after projection_push_down   | Projection: #foo.c1, #foo.c2                                             |
|                                           |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after filter_push_down       | SAME TEXT AS ABOVE                                                       |
| logical_plan after simplify_expressions   | SAME TEXT AS ABOVE                                                       |
| logical_plan after hash_build_probe_order | SAME TEXT AS ABOVE                                                       |
| logical_plan after limit_push_down        | SAME TEXT AS ABOVE                                                       |
| logical_plan after constant_folding       | SAME TEXT AS ABOVE                                                       |
| logical_plan after eliminate_limit        | SAME TEXT AS ABOVE                                                       |
| logical_plan after aggregate_statistics   | SAME TEXT AS ABOVE                                                       |
| logical_plan after projection_push_down   | SAME TEXT AS ABOVE                                                       |
| logical_plan after filter_push_down       | SAME TEXT AS ABOVE                                                       |
| logical_plan after simplify_expressions   | SAME TEXT AS ABOVE                                                       |
| logical_plan after hash_build_probe_order | SAME TEXT AS ABOVE                                                       |
| logical_plan after limit_push_down        | SAME TEXT AS ABOVE                                                       |
| logical_plan                              | Projection: #foo.c1, #foo.c2                                             |
|                                           |   TableScan: foo projection=Some([0, 1])                                 |
| initial_physical_plan                     | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                           |   CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false   |
| physical_plan after coalesce_batches      | SAME TEXT AS ABOVE                                                       |
| physical_plan after repartition           | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                           |   RepartitionExec: partitioning=RoundRobinBatch(16)                      |
|                                           |     CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false |
| physical_plan after add_merge_exec        | SAME TEXT AS ABOVE                                                       |
| physical_plan                             | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                           |   RepartitionExec: partitioning=RoundRobinBatch(16)                      |
|                                           |     CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false |
+-------------------------------------------+--------------------------------------------------------------------------+

alamb · 2021-07-20T16:29:53Z

datafusion/src/optimizer/limit_push_down.rs

-            _,
-        ) => {
-            let schema = schema.as_ref().to_owned().into();
-            optimize_explain(


This is the old-style "explain" implementation that had to be written for each optimizer for it to show explain plans correctly (and it was missing from several).

alamb · 2021-07-20T16:30:38Z

datafusion/src/physical_plan/explain.rs

-            plan_builder.append_value(&*p.plan)?;
+            match prev {
+                Some(prev) if !should_show(prev, p) => {
+                    plan_builder.append_value("SAME TEXT AS ABOVE")?;


Once I started dumping out all the explain plans, the amount of repetition was enormous, so I also added code to avoid duplication when an optimizer pass did not make any changes.
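
As a rough illustration of the deduplication described in this comment, the sketch below replaces the text of a plan with "SAME TEXT AS ABOVE" when it matches the previous entry. `PlanEntry`, `render_rows`, and the `is_final` flag are hypothetical names; the rule that final plans are always printed in full is an assumption inferred from the "After this change" output, where `logical_plan` and `physical_plan` are shown even though they repeat the row above.

```rust
/// Hypothetical stand-in for a stringified plan: a label, the plan text,
/// and whether this is a final plan (assumed to always be shown in full).
#[derive(Debug, Clone, PartialEq)]
pub struct PlanEntry {
    pub plan_type: String,
    pub plan: String,
    pub is_final: bool,
}

/// Returns true when `this` should be printed in full: either its text
/// differs from the previous entry, or it is a final plan.
fn should_show(prev: &PlanEntry, this: &PlanEntry) -> bool {
    prev.plan != this.plan || this.is_final
}

/// Builds the (plan_type, plan) rows of the explain table, collapsing
/// unchanged plans into a short marker.
pub fn render_rows(plans: &[PlanEntry]) -> Vec<(String, String)> {
    let mut rows = Vec::with_capacity(plans.len());
    let mut prev: Option<&PlanEntry> = None;
    for p in plans {
        let text = match prev {
            Some(prev) if !should_show(prev, p) => "SAME TEXT AS ABOVE".to_string(),
            _ => p.plan.clone(),
        };
        rows.push((p.plan_type.clone(), text));
        prev = Some(p);
    }
    rows
}
```

Note that the comparison is always against the previous entry's real text, not against the marker, so a run of unchanged passes collapses row by row.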

NGA-TRAN

LGTM

NGA-TRAN · 2021-07-20T20:25:23Z

datafusion/src/execution/context.rs

+                schema: schema.clone(),
+            })
+        } else {
+            self.optimize_internal(plan, |_, _| {})


Nice. Now we have the optimized plan displayed
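
The `optimize_internal(plan, |_, _| {})` call in the snippet above suggests an observer-style design: the optimizer loop takes a callback that sees the plan after each pass, and the non-explain path passes a no-op. The following is a simplified sketch of that idea, not the code from this PR; `Plan`, `Rule`, and `optimize_with_capture` are hypothetical, and plans are modeled as plain strings.

```rust
/// Hypothetical, simplified plan type; real plans are trees, not strings.
pub type Plan = String;

/// Hypothetical optimizer rule: a name plus a rewrite function.
pub struct Rule {
    pub name: &'static str,
    pub rewrite: fn(&Plan) -> Plan,
}

/// Runs every rule in order and calls `observer` with the plan produced
/// by each rule, whether or not the rule changed anything.
pub fn optimize_internal<F>(plan: &Plan, rules: &[Rule], mut observer: F) -> Plan
where
    F: FnMut(&Plan, &Rule),
{
    let mut current = plan.clone();
    for rule in rules {
        current = (rule.rewrite)(&current);
        observer(&current, rule);
    }
    current
}

/// Normal path: nothing to record, so the observer is a no-op.
pub fn optimize(plan: &Plan, rules: &[Rule]) -> Plan {
    optimize_internal(plan, rules, |_, _| {})
}

/// Explain path: record a (label, plan text) pair after every rule.
pub fn optimize_with_capture(plan: &Plan, rules: &[Rule]) -> (Plan, Vec<(String, String)>) {
    let mut captured = vec![("initial_logical_plan".to_string(), plan.clone())];
    let optimized = optimize_internal(plan, rules, |p, rule| {
        captured.push((format!("logical_plan after {}", rule.name), p.clone()));
    });
    (optimized, captured)
}
```

The benefit of this shape is that individual rules need no explain-specific code: capturing happens once, in the loop that drives them.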

NGA-TRAN · 2021-07-20T20:30:51Z

datafusion/src/physical_plan/explain.rs

-            plan_builder.append_value(&*p.plan)?;
+            match prev {
+                Some(prev) if !should_show(prev, p) => {
+                    plan_builder.append_value("SAME TEXT AS ABOVE")?;


jorgecarleitao

LGTM

alamb added 5 commits July 20, 2021 09:22

Remove old optimize_for_explain plumbing

99d7ff4

Show plans that are no different than the previous as "SAME"

f78fe0d

Add all physical optimizer passes

266d1fa

tests

a1d5a4a

cleanup

a9dcd65

alamb added the api change Changes the API exposed to users of the crate label Jul 20, 2021

github-actions bot added the datafusion Changes in the datafusion crate label Jul 20, 2021

fix user_defined_plan test

8a97e07

NGA-TRAN approved these changes Jul 20, 2021

jorgecarleitao approved these changes Jul 20, 2021

jorgecarleitao merged commit 30693df into apache:master Jul 20, 2021

alamb deleted the alamb/all_the_explain branch July 20, 2021 21:10
