[Rust] [DataFusion] logical schema = physical schema is not true #25852

asfimport · 2020-08-20T06:20:20Z

In tests/sql.rs, we test that the physical and the optimized schema must match. However, this is not necessarily true for all our queries. An example:

#[test]
fn csv_query_sum_cast() {
    let mut ctx = ExecutionContext::new();
    register_aggregate_csv_by_sql(&mut ctx);
    // c8 = i32; c9 = i64
    let sql = "SELECT c8 + c9 FROM aggregate_test_100";
    // check that the physical and logical schemas are equal
    execute(&mut ctx, sql);
}

The physical expression (and schema) of this operation, after optimization, is CAST(c8 as Int64) Plus c9 (this test fails).

AFAIK, the invariant of the optimizer is that the output types and nullability are the same.

Also, note that the reason the optimized logical schema equals the logical schema is that our type coercer does not change the output names of the schema, even though it re-writes logical expressions. I.e. after the optimization, .to_field() of an expression may no longer match the field name nor type in the Plan's schema. IMO this is currently by (implicit?) design, as we do not want our logical schema's column names to change during optimizations, or all column references may point to non-existent columns. This is something that brought up on the mailing list about polymorphism.

Reporter: Jorge Leitão / @jorgecarleitao
Assignee: Jorge Leitão / @jorgecarleitao

Related issues:

[Rust] [DataFusion] Implement get_supertype correctly (duplicates)

PRs and other links:

GitHub Pull Request #8024

_{Note: This issue was originally created as ARROW-9809. Please see the migration documentation for further details.}

The text was updated successfully, but these errors were encountered:

asfimport · 2020-08-20T15:24:26Z

Andy Grove / @andygrove:
This is a good point. I can now see that we should not expect the schema to match after optimization, since we are adding type coercion.

I would think that the schema should match between the optimized logical plan and the physical plan though?

asfimport · 2020-08-20T15:25:32Z

Andy Grove / @andygrove:
It seems that we should determine data types and nullability only in the logical plan and then pass that information to the physical plan rather than re-compute them there. I think this is what you have been suggesting already @jorgecarleitao ?

asfimport · 2020-08-21T05:15:07Z

Jorge Leitão / @jorgecarleitao:
In general, I think that we could consider the following:

Expressions in the logical plan have a type and nullability (function that maps inputs's meta to output's meta such as Expr::get_type)
Expressions in the physical plan have a type and nullability (functions that map inputs's meta to output's meta, currently PhysicalExpr::data_type)
PhysicalExpr::data_type must match the actual calculation return type (i.e. the builder that it is actually used depending on the input type)
The physical planner must assert that both type and nullability are preserved during conversion from logical to physical.

It is our responsibility to ensure that, within datafusion, the schema from our logical plan matches the schema from the physical plan (or the planner errors). If anyone decides to use different physical plans (e.g. GPU), it is their responsibility ensure that:
PhysicalExpr::data_type must match the actual calculation return type (i.e. the builder that it is used)
The physical planner must assert that both type and nullability are preserved during conversion from logical to physical.

From this perspective, our logical plan is the user's expectation of the output schema, while the physical plan is the developer's representation of the computation, and users can switch planners to derive different physical plans from logical plans, both at the compute level (e.g. CPU vs GPU, the create_physical_expr) and at the distribution level (local vs distributed, the create_physical_plan and create_aggregate_expr).

IMO the type coercer should not be a logical optimizer, but a physical one: before type coercion, all logical types are all already set by the scanned types, and thus the whole logical plan can be derived. We use a type coercer because our physical expressions only support a subset of all operations (e.g. we support u32 * u32, but not u32 * u16). The output type is still the same (u32) and the logical plan does not care. IMO this is a physical (compute) – not logical (types and nullability) – issue. If someone else had a way to compute u32 * u16 -> u32, IMO our logical plan should not have to know about it and our coercion rule would not need to be applied.

With this said:

I would think that the schema should match between the optimized logical plan and the physical plan though?

If we want to enforce that physical expressions are always named as logical expressions, then yes, we should enforce the same schema. Strictly speaking, we only need to enforce type and nullability.

It seems that we should determine data types and nullability only in the logical plan and then pass that information to the physical plan rather than re-compute them there.

We can do that, but it will make it more difficult to find errors when someone else writes a new physical expression: they do not have to write a data_type for it, but will have to match the data_type (via which builder they use) that the logical plan passes to them.

Since all operations are dynamically typed, I think that we should continue to have both PhysicalExpr::data_type and Expr::get_type and make the assert at the planner level and not only in tests.

Currently, we do not do this, as we use statements of the form
```
match ...
    LogicalPlan::Projection { input, expr, .. }
```
in the planner, that make no use of the logical schema and derive a new physical schema that may or may not match the logical schema.

asfimport · 2020-08-21T19:19:44Z

Jorge Leitão / @jorgecarleitao:
I gathered my thoughts about this in this document: https://docs.google.com/document/d/1Asnz29uUS1t60QNbNBU9SiME274rja-hcDvX_RDraFU/edit?usp=sharing

asfimport · 2020-08-23T14:07:51Z

Andy Grove / @andygrove:
Issue resolved by pull request 8024
#8024

asfimport closed this as completed Aug 23, 2020

asfimport assigned jorgecarleitao Jan 10, 2023

asfimport added this to the 2.0.0 milestone Jan 11, 2023

asfimport mentioned this issue Jan 11, 2023

[Rust] [DataFusion] Implement get_supertype correctly #16033

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Rust] [DataFusion] logical schema = physical schema is not true #25852

[Rust] [DataFusion] logical schema = physical schema is not true #25852

asfimport commented Aug 20, 2020 •

edited

asfimport commented Aug 20, 2020

asfimport commented Aug 20, 2020

asfimport commented Aug 21, 2020

asfimport commented Aug 21, 2020

asfimport commented Aug 23, 2020

[Rust] [DataFusion] logical schema = physical schema is not true #25852

[Rust] [DataFusion] logical schema = physical schema is not true #25852

Comments

asfimport commented Aug 20, 2020 • edited

Related issues:

PRs and other links:

asfimport commented Aug 20, 2020

asfimport commented Aug 20, 2020

asfimport commented Aug 21, 2020

asfimport commented Aug 21, 2020

asfimport commented Aug 23, 2020

asfimport commented Aug 20, 2020 •

edited