ARROW-9683: [Rust][DataFusion] Add debug printing to physical plans and associated types #7925

Closed

Conversation

@alamb (Contributor) commented Aug 10, 2020

While working on ARROW-9653, I was trying to debug the execution plan, and it would have been much easier to understand and isolate the issue if there had been a way to display the plan. This would also be useful as part of the EXPLAIN plan functionality in ARROW-9654.

In general, for debugging purposes, we would like to be able to dump out an execution plan. To do so in the idiomatic Rust way, I made `ExecutionPlan` also implement `std::fmt::Debug`, then followed the `rustc` guidance until everything it needed implemented `Debug` as well.
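
As a rough sketch of that pattern (with simplified names and signatures, not the actual DataFusion definitions): requiring `Debug` as a supertrait forces every `ExecutionPlan` implementation to be printable, and most implementations can then simply derive it.

```rust
use std::fmt::Debug;

// Requiring Debug as a supertrait means every implementation of the trait
// must also be printable with {:?} / {:#?}.
trait ExecutionPlan: Debug {
    fn name(&self) -> &str;
}

// Most implementations can simply derive Debug.
#[derive(Debug)]
struct HashAggregateExec {
    group_expr: Vec<String>,
}

impl ExecutionPlan for HashAggregateExec {
    fn name(&self) -> &str {
        "HashAggregateExec"
    }
}

fn main() {
    let plan: Box<dyn ExecutionPlan> = Box::new(HashAggregateExec {
        group_expr: vec!["c1".to_string(), "c2".to_string()],
    });
    // Pretty-printed Debug output, in the style of the example below.
    println!("{:#?}", plan);
}
```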

Here is an example plan for `"SELECT c1, c2, MIN(c3) FROM aggregate_test_100 GROUP BY c1, c2"` when printed using `println!("{:#?}", plan)`:

physical plan is HashAggregateExec {
    group_expr: [
        Column {
            name: "c1",
        },
        Column {
            name: "c2",
        },
    ],
    aggr_expr: [
        Min {
            expr: Column {
                name: "c3",
            },
        },
    ],
    input: DataSourceExec {
        schema: Schema {
            fields: [
                Field {
                    name: "c1",
                    data_type: Utf8,
                    nullable: false,
                    dict_id: 0,
                    dict_is_ordered: false,
                },
                Field {
                    name: "c2",
                    data_type: UInt32,
                    nullable: false,
                    dict_id: 0,
                    dict_is_ordered: false,
                },
                Field {
                    name: "c3",
                    data_type: Int8,
                    nullable: false,
                    dict_id: 0,
                    dict_is_ordered: false,
                },
            ],
            metadata: {},
        },
        partitions.len: 1,
    },
    schema: Schema {
        fields: [
            Field {
                name: "c1",
                data_type: Utf8,
                nullable: true,
                dict_id: 0,
                dict_is_ordered: false,
            },
            Field {
                name: "c2",
                data_type: UInt32,
                nullable: true,
                dict_id: 0,
                dict_is_ordered: false,
            },
            Field {
                name: "MIN(c3)",
                data_type: Int64,
                nullable: true,
                dict_id: 0,
                dict_is_ordered: false,
            },
        ],
        metadata: {},
    },
}

@@ -61,6 +73,12 @@ pub struct DatasourcePartition {
batch_iter: Arc<Mutex<dyn RecordBatchReader + Send + Sync>>,
}

impl Debug for DatasourcePartition {

`RecordBatchReader` does not implement `Debug`, so `Debug` had to be implemented manually here rather than derived.
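
For reference, a minimal self-contained sketch of what such a manual implementation can look like (the trait and struct below are simplified stand-ins, not the exact code in this PR): the `Debug` impl just names the struct and skips the field that cannot be printed.

```rust
use std::fmt::{self, Debug, Formatter};
use std::sync::{Arc, Mutex};

// Simplified stand-in for arrow's RecordBatchReader, which has no Debug impl.
trait RecordBatchReader {}

struct EmptyReader;
impl RecordBatchReader for EmptyReader {}

struct DatasourcePartition {
    batch_iter: Arc<Mutex<dyn RecordBatchReader + Send + Sync>>,
}

// #[derive(Debug)] is impossible because the boxed reader is not Debug,
// so the manual impl names the struct and omits the unprintable field.
impl Debug for DatasourcePartition {
    fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
        f.debug_struct("DatasourcePartition").finish()
    }
}

fn main() {
    let partition = DatasourcePartition {
        batch_iter: Arc::new(Mutex::new(EmptyReader)),
    };
    println!("{:?}", partition); // prints: DatasourcePartition
}
```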

@@ -110,6 +112,12 @@ struct ParquetPartition {
iterator: Arc<Mutex<dyn RecordBatchReader + Send + Sync>>,
}

impl Debug for ParquetPartition {
    fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
        f.debug_struct("ParquetPartition").finish()
    }
}

As above, this cannot use `#[derive(Debug)]` because `RecordBatchReader` does not implement `Debug`.

.field("name", &self.name)
.field("args", &self.args)
.field("return_type", &self.return_type)
.field("fun", &"<FUNC>")

`ScalarUdf` is typedef'd to a function pointer and does not implement `Debug`, which is what leads to the two manual `Debug` implementations in this file.
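
For illustration, here is a self-contained sketch of the same idea (the `ScalarUdf` alias below is a simplified stand-in, not the real DataFusion signature): the function-valued field gets a fixed `"<FUNC>"` placeholder because there is nothing useful to print for it.

```rust
use std::fmt::{self, Debug, Formatter};
use std::sync::Arc;

// Simplified stand-in for the scalar UDF function type: an Arc'd callable.
// Function pointers and closures have no meaningful Debug representation.
type ScalarUdf = Arc<dyn Fn(i64) -> i64 + Send + Sync>;

struct ScalarFunction {
    name: String,
    fun: ScalarUdf,
}

// The `fun` field prevents #[derive(Debug)], so the manual impl prints a
// placeholder for it, mirroring the `.field("fun", &"<FUNC>")` call above.
impl Debug for ScalarFunction {
    fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
        f.debug_struct("ScalarFunction")
            .field("name", &self.name)
            .field("fun", &"<FUNC>")
            .finish()
    }
}

fn main() {
    let square = ScalarFunction {
        name: "square".to_string(),
        fun: Arc::new(|x| x * x),
    };
    // prints: ScalarFunction { name: "square", fun: "<FUNC>" }
    println!("{:?}", square);
}
```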

@alamb (Contributor) commented Aug 10, 2020

Rebased to pick up 37ee600

@andygrove (Member) left a comment

LGTM. Thanks @alamb

@andygrove closed this in e31e5d4 on Aug 11, 2020
@alamb deleted the alamb/ARROW-9683-debug branch on August 11, 2020 15:28
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
ARROW-9683: [Rust][DataFusion] Add debug printing to physical plans and associated types

Closes apache#7925 from alamb/alamb/ARROW-9683-debug

Authored-by: alamb <andrew@nerdnetworks.org>
Signed-off-by: Andy Grove <andygrove@nvidia.com>