Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Rust] [DataFusion] logical schema = physical schema is not true #25852

Closed
asfimport opened this issue Aug 20, 2020 · 5 comments
Closed

[Rust] [DataFusion] logical schema = physical schema is not true #25852

asfimport opened this issue Aug 20, 2020 · 5 comments

Comments

@asfimport
Copy link

asfimport commented Aug 20, 2020

In tests/sql.rs, we test that the physical and the optimized schema must match. However, this is not necessarily true for all our queries. An example:

#[test]
fn csv_query_sum_cast() {
    let mut ctx = ExecutionContext::new();
    register_aggregate_csv_by_sql(&mut ctx);
    // c8 = i32; c9 = i64
    let sql = "SELECT c8 + c9 FROM aggregate_test_100";
    // check that the physical and logical schemas are equal
    execute(&mut ctx, sql);
}

The physical expression (and schema) of this operation, after optimization, is CAST(c8 as Int64) Plus c9 (this test fails).

AFAIK, the invariant of the optimizer is that the output types and nullability are the same.

Also, note that the reason the optimized logical schema equals the logical schema is that our type coercer does not change the output names of the schema, even though it re-writes logical expressions. I.e. after the optimization, .to_field() of an expression may no longer match the field name nor type in the Plan's schema. IMO this is currently by (implicit?) design, as we do not want our logical schema's column names to change during optimizations, or all column references may point to non-existent columns. This is something that brought up on the mailing list about polymorphism.

Reporter: Jorge Leitão / @jorgecarleitao
Assignee: Jorge Leitão / @jorgecarleitao

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-9809. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Andy Grove / @andygrove:
This is a good point. I can now see that we should not expect the schema to match after optimization, since we are adding type coercion.

I would think that the schema should match between the optimized logical plan and the physical plan though?

@asfimport
Copy link
Author

Andy Grove / @andygrove:
It seems that we should determine data types and nullability only in the logical plan and then pass that information to the physical plan rather than re-compute them there. I think this is what you have been suggesting already @jorgecarleitao ?

@asfimport
Copy link
Author

Jorge Leitão / @jorgecarleitao:
In general, I think that we could consider the following:

  • Expressions in the logical plan have a type and nullability (function that maps inputs's meta to output's meta such as Expr::get_type)

  • Expressions in the physical plan have a type and nullability (functions that map inputs's meta to output's meta, currently PhysicalExpr::data_type)

  • PhysicalExpr::data_type must match the actual calculation return type (i.e. the builder that it is actually used depending on the input type)

  • The physical planner must assert that both type and nullability are preserved during conversion from logical to physical.

    It is our responsibility to ensure that, within datafusion, the schema from our logical plan matches the schema from the physical plan (or the planner errors). If anyone decides to use different physical plans (e.g. GPU), it is their responsibility ensure that:

  • PhysicalExpr::data_type must match the actual calculation return type (i.e. the builder that it is used)

  • The physical planner must assert that both type and nullability are preserved during conversion from logical to physical.

    From this perspective, our logical plan is the user's expectation of the output schema, while the physical plan is the developer's representation of the computation, and users can switch planners to derive different physical plans from logical plans, both at the compute level (e.g. CPU vs GPU, the create_physical_expr) and at the distribution level (local vs distributed, the create_physical_plan and create_aggregate_expr).

    IMO the type coercer should not be a logical optimizer, but a physical one: before type coercion, all logical types are all already set by the scanned types, and thus the whole logical plan can be derived. We use a type coercer because our physical expressions only support a subset of all operations (e.g. we support u32 * u32, but not u32 * u16). The output type is still the same (u32) and the logical plan does not care. IMO this is a physical (compute) – not logical (types and nullability) – issue. If someone else had a way to compute u32 * u16 -> u32, IMO our logical plan should not have to know about it and our coercion rule would not need to be applied.

    With this said:

    I would think that the schema should match between the optimized logical plan and the physical plan though?

    If we want to enforce that physical expressions are always named as logical expressions, then yes, we should enforce the same schema. Strictly speaking, we only need to enforce type and nullability.

    It seems that we should determine data types and nullability only in the logical plan and then pass that information to the physical plan rather than re-compute them there.

    We can do that, but it will make it more difficult to find errors when someone else writes a new physical expression: they do not have to write a data_type for it, but will have to match the data_type (via which builder they use) that the logical plan passes to them.

    Since all operations are dynamically typed, I think that we should continue to have both PhysicalExpr::data_type and Expr::get_type and make the assert at the planner level and not only in tests.

    Currently, we do not do this, as we use statements of the form

    match ...
        LogicalPlan::Projection { input, expr, .. }

    in the planner, that make no use of the logical schema and derive a new physical schema that may or may not match the logical schema.

@asfimport
Copy link
Author

@asfimport
Copy link
Author

Andy Grove / @andygrove:
Issue resolved by pull request 8024
#8024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants