
ARROW-10732: [Rust] [DataFusion] Integrate DFSchema as a step towards supporting qualified column names #8839

Closed

Conversation

andygrove (Member) commented Dec 5, 2020:

This PR builds on #8840 and integrates DFSchema with the DataFusion query planning, optimization, and execution code.

Unfortunately, this was a pretty large refactor, and I don't really see a way to break it down into smaller PRs.

There should be no functional changes in this PR. Fields are looked up using `field_with_unqualified_name`, and I will file a separate PR to add support for referencing qualified field names.

Note that I had to update `PhysicalExpr.evaluate()` to pass in the input schema, since we can no longer rely on the schema from the Arrow `RecordBatch` (it loses the qualifiers). The other methods on `PhysicalExpr` already required the input schema, so this is at least consistent: we now always use the schema from the plan.

The rest of the changes update the query planning, optimization, and execution code to use `DFSchema` instead of `Schema`.

Design Document: https://docs.google.com/document/d/1BFo7ruJayCulAHLa9-noaHXbgcaAH_4LuOJFGJnDHkc/edit#heading=h.3japu7255aut
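
For readers new to the design, the core idea is sketched below. This is a hedged illustration based on the description above and the design doc; the exact struct and method shapes are assumptions, not the merged API.

use arrow::datatypes::Field;

/// An Arrow field plus an optional relation qualifier, e.g. ("t1", "a").
#[derive(Debug, Clone)]
pub struct DFField {
    qualifier: Option<String>,
    field: Field,
}

/// A schema of possibly-qualified fields: everything Arrow's `Schema`
/// carries, plus the qualifiers that `RecordBatch` schemas lose.
#[derive(Debug, Clone)]
pub struct DFSchema {
    fields: Vec<DFField>,
}

impl DFSchema {
    /// The lookup this PR uses throughout: match on the bare field name,
    /// ignoring any qualifier.
    pub fn field_with_unqualified_name(&self, name: &str) -> Option<&DFField> {
        self.fields.iter().find(|f| f.field.name() == name)
    }
}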


@andygrove andygrove marked this pull request as ready for review December 5, 2020 20:33
@andygrove andygrove requested review from alamb and jorgecarleitao and removed request for alamb December 5, 2020 20:33
@andygrove andygrove changed the title ARROW-10732: [Rust] [DataFusion] Add SQL support for table/relation aliases and compound identifiers [WIP] ARROW-10732: [Rust] [DataFusion] Implement DFSchema as a step towards supporting qualified column names Dec 5, 2020
andygrove (Member Author) commented:

@alamb @jorgecarleitao @Dandandan fyi

@andygrove andygrove changed the title ARROW-10732: [Rust] [DataFusion] Implement DFSchema as a step towards supporting qualified column names ARROW-10732: [Rust] [DataFusion] Integrate DFSchema as a step towards supporting qualified column names Dec 5, 2020
Dandandan (Contributor) commented:

This is great! I didn't see anything strange; the code looks clean, and it sounds like this could be integrated and tested further.

jorgecarleitao (Member) commented Dec 6, 2020:

Hey @andygrove . Thanks a lot for this!

I would benefit from understanding the use case for `DFSchema` in the physical plan. Note that this is primarily for my own understanding, as I am only familiar with qualified names in SQL as a way to disambiguate columns in expressions concerning more than one table -- not in the representation of a statement at the physical plan. Maybe you could give an example of where `arrow::Schema` is not sufficient at the physical level?

My current understanding is that, without qualifiers, we can't write things like `(table1.a + 1) >= (table2.b - 1)`.

What I am trying to understand is when we need such an expression at the physical level. Typically, these plans require some form of join and are mapped to `filter(join(a, b))`, in which case I do not see how a qualifier is used: before the join there are two input nodes that are joined on a key (i.e. always an equality relationship between columns); after the join, there is a single node, and thus qualifiers are not needed.

One use case I do see is when the join is itself over an expression, e.g. `JOIN ON (table1.a + 1) == (table2.b - 1)`. However, in this case, at the physical level, this can always be mapped to `join(projection())`. I.e., it seems to me that this is more a convenience for building a logical statement than a necessity for executing one.

If the goal is to add the qualifier to the column name after the join, to disambiguate table1.a from table2.a, wouldn't it be easier to do that in the logical plan alone?

andygrove (Member Author) commented:

Hi @jorgecarleitao, did you get a chance to read the design document? There is a link to it from the JIRA.

jorgecarleitao (Member) commented:

> Hi @jorgecarleitao, did you get a chance to read the design document? There is a link to it from the JIRA.

Yeah, I missed that one and the whole discussion on the issue (https://docs.google.com/document/d/1BFo7ruJayCulAHLa9-noaHXbgcaAH_4LuOJFGJnDHkc/edit#heading=h.su3u27lcpr3l), sorry about that.

andygrove (Member Author) commented:

You may have a point about only needing this at the logical level. I am not sure, but I will take a look at this tomorrow.

impl DFSchema {
    /// Creates an empty `DFSchema`
    pub fn empty() -> Self {
        Self { fields: vec![] }
Review comment (Contributor):

Would it make sense to make this a hashset? Or convert to a vec in the last step?


    /// Find the index of the column with the given name
    pub fn index_of(&self, name: &str) -> Result<usize> {
        for i in 0..self.fields.len() {
Dandandan (Contributor) commented Dec 6, 2020:

This could use `Iterator::position`.
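
For reference, a position-based version might look like this (a sketch against the `DFSchema` sketch above; the `DataFusionError::Plan` error is borrowed from a later suggestion in this thread):

/// Find the index of the column with the given unqualified name using
/// `Iterator::position` instead of a manual index loop.
pub fn index_of(&self, name: &str) -> Result<usize> {
    self.fields
        .iter()
        .position(|f| f.field.name() == name)
        .ok_or_else(|| DataFusionError::Plan(format!("No field named '{}'", name)))
}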

alamb (Contributor) left a comment:

Thanks @andygrove -- this is looking really nice. I agree with @jorgecarleitao that, unless there is some use case we have overlooked, the `DFSchema` notion should probably live only in the `LogicalPlan`, and physical plans should still use only the Arrow `Schema`.

That is a standard division I have seen in other optimizers/planners -- at some point the distinction between relations and where the input came from is no longer relevant, and the code is just focused on sending columns of data around.

@@ -214,7 +212,7 @@ impl ExecutionContext {
             has_header: options.has_header,
             delimiter: Some(options.delimiter),
             projection: None,
-            projected_schema: csv.schema(),
+            projected_schema: Arc::new(DFSchema::from(&csv.schema())),
Review comment (Contributor):

We could make this code look better if we implemented `impl Into<DFSchemaRef> for SchemaRef` -- then we could write something like `projected_schema: csv.schema().into()`.

Doing so in some follow-on PR would be totally fine.
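
For illustration, the conversion could be shaped roughly as follows. This is only a sketch: Rust's orphan rules forbid a direct `impl From<SchemaRef> for Arc<DFSchema>` (neither `Arc`-wrapped type is local), so the follow-up PR may need a different spelling, such as a `From<&Schema>` impl plus a small helper:

use std::sync::Arc;
use arrow::datatypes::{Schema, SchemaRef};

impl From<&Schema> for DFSchema {
    fn from(schema: &Schema) -> Self {
        DFSchema {
            // No relation is known at this point, so fields are unqualified.
            fields: schema
                .fields()
                .iter()
                .map(|f| DFField { qualifier: None, field: f.clone() })
                .collect(),
        }
    }
}

/// Hypothetical helper returning the Arc-wrapped form used by the planner.
fn to_df_schema_ref(schema: &SchemaRef) -> Arc<DFSchema> {
    Arc::new(DFSchema::from(schema.as_ref()))
}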

@@ -408,7 +406,7 @@ impl ExecutionContext {
         let path = Path::new(&path).join(&filename);
         let file = fs::File::create(path)?;
         let mut writer =
-            ArrowWriter::try_new(file.try_clone().unwrap(), plan.schema(), None)?;
+            ArrowWriter::try_new(file.try_clone().unwrap(), plan.schema().to_arrow_schema(), None)?;
Review comment (Contributor):

We could likewise implement `impl Into<Schema> for DFSchema` and call `into()` rather than `to_arrow_schema()`. This is again just a stylistic thing.
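
The reverse direction, again as a hedged sketch on top of the structs above (this one is coherent as written, since the local `DFSchema` appears in the trait parameter):

impl From<&DFSchema> for Schema {
    fn from(df_schema: &DFSchema) -> Self {
        // Drop the qualifiers and keep the underlying Arrow fields.
        Schema::new(df_schema.fields.iter().map(|f| f.field.clone()).collect())
    }
}

With something like this in place, call sites could write `into()` wherever a plain Arrow `Schema` is expected.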

andygrove (Member Author) replied:

I've broken this out into a separate PR #8857

andygrove added a commit that referenced this pull request Dec 6, 2020
This PR implements a DataFusion schema that wraps the Arrow schema and adds support for qualified names.

There is a follow-up PR #8839 to integrate this with DataFusion.

Design doc: https://docs.google.com/document/d/1BFo7ruJayCulAHLa9-noaHXbgcaAH_4LuOJFGJnDHkc/edit#heading=h.3japu7255aut

Closes #8840 from andygrove/dfschema

Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Dec 6, 2020
andygrove (Member Author) commented:

@jorgecarleitao @alamb I've been looking at the question of whether the physical plan should use `DFSchema`. Here is the current (in master) implementation of the physical expression for `Column`:

impl PhysicalExpr for Column {
    /// Get the data type of this expression, given the schema of the input
    fn data_type(&self, input_schema: &Schema) -> Result<DataType> {
        Ok(input_schema
            .field_with_name(&self.name)?
            .data_type()
            .clone())
    }

    /// Decide whether this expression is nullable, given the schema of the input
    fn nullable(&self, input_schema: &Schema) -> Result<bool> {
        Ok(input_schema.field_with_name(&self.name)?.is_nullable())
    }

    /// Evaluate the expression
    fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
        Ok(ColumnarValue::Array(
            batch.column(batch.schema().index_of(&self.name)?).clone(),
        ))
    }
}

As you can see, `data_type` and `nullable` use the schema from the plan, whereas the `evaluate` method uses the schema from the record batch, which is a little inconsistent. They should probably all use the same schema.

The bigger issue, though, is that this expression looks up columns by name, so how do we support qualified names here? I see the following choices:

  1. Have `ExecutionPlan.schema()` use `DFSchema`, as I have done in this PR
  2. Use qualified names in the Arrow schema field names, e.g. "t1.foo"
  3. Change the `Column` physical expression to refer to columns by index rather than name (sketched below)

Maybe there are other options that I am not seeing?
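
To make option 3 concrete: the column expression could carry a resolved index instead of a name, roughly like this (a sketch with assumed DataFusion types such as `ColumnarValue`; this is not code from this PR):

use arrow::record_batch::RecordBatch;

/// Option 3: a column physical expression resolved to a position at
/// planning time, so execution never looks anything up by name.
#[derive(Debug)]
pub struct ColumnByIndex {
    index: usize,
}

impl ColumnByIndex {
    fn evaluate(&self, batch: &RecordBatch) -> ColumnarValue {
        // Qualified vs. unqualified naming no longer matters here: the
        // planner has already resolved any qualifier to this index.
        ColumnarValue::Array(batch.column(self.index).clone())
    }
}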

jorgecarleitao (Member) commented:

Thanks a lot for looking at this. All excellent points. I now see that this is tricky :)

Thinking about what you wrote: if we plan the logical schema as t1.a, t2.a, wouldn't the column names become a, a on the `RecordBatch`? I.e., there will be a discrepancy between the schema provided by `df.schema()` and the `RecordBatch::schema()` returned by `collect()`, no?

I think this will happen even if we pass `DFSchema` to the physical plan (1.) or use indexes (3.), as any map from qualified to unqualified names is lossy (it drops the qualifier), so the qualifier is never recoverable from the `RecordBatch`'s schema.

This IMO leaves us with 2., which is what I would try: change the physical planner to alias/rewrite column names with the qualifier when the physical plan is created. This will cause the resulting `RecordBatch`'s schema to have columns named t1.a and t2.a, thereby guaranteeing the invariant that the output schema of the physical execution matches the schema of the logical plan.

I.e., the invariant that SELECT t1.a, t2.a, c ... yields a schema whose columns are named ["t1.a", "t2.a", "c"] is preserved.

Note that we already do this when performing coercion: we preserve the logical schema name by injecting cast ops during physical (not logical) planning, so that if the user wrote SELECT sqrt(f32) ..., the resulting name in `RecordBatch::schema()` is sqrt(f32), even if the physical operation performed was sqrt(CAST(f32 AS Float64)).

andygrove (Member Author) commented:

Thanks @jorgecarleitao, I think that makes a lot of sense. Unfortunately, I am running into some issues implementing this: the physical planner calls into the logical planner to create names, and it is getting hard to mix and match these schemas.

I am going to have to take a step back and break this down into smaller steps I think.

alamb (Contributor) commented Dec 7, 2020:

> As you can see, `data_type` and `nullable` use the schema from the plan, whereas the `evaluate` method uses the schema from the record batch, which is a little inconsistent. They should probably all use the same schema.

I agree -- I recommend using the schema from the plan for consistency.

> This IMO leaves us with 2., which is what I would try: change the physical planner to alias/rewrite column names with the qualifier when the physical plan is created. This will cause the resulting `RecordBatch`'s schema to have columns named t1.a and t2.a, thereby guaranteeing the invariant that the output schema of the physical execution matches the schema of the logical plan.

I agree with @jorgecarleitao's recommendation -- when moving from the logical to the physical plan, I would always use the fully qualified name of the field, which avoids ambiguity. If we don't like t1.foo being sprinkled around in plans that only have one table, or where the column names aren't ambiguous, we could implement a (logical plan) optimizer pass to remove unneeded qualifiers.

andygrove (Member Author) commented:

Thanks for the feedback. I will try and get this rebased today.

andygrove (Member Author) commented Dec 7, 2020:

@alamb @jorgecarleitao @Dandandan This is ready for re-review.

To recap:

  • At execution time we now always* use the DataFusion schema from the plan rather than the Arrow schema from the record batch
  • When converting the DataFusion schema to an Arrow schema for use in record batches, we use fully qualified field names (see the sketch below)

(*) There may still be one or two places where we are using the batch schema, but I think it will be easier to find those in the follow-up PRs where we add support for referencing columns by qualified names
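
A sketch of that conversion, building on the `DFSchema` sketch earlier in this thread (the real method is `to_arrow_schema()`; this body is an assumption about its behavior, not the PR's code):

use arrow::datatypes::{Field, Schema};

impl DFSchema {
    /// Flatten into an Arrow schema, writing each qualified field as
    /// "qualifier.name" so the `RecordBatch` schema matches the plan schema.
    pub fn to_arrow_schema(&self) -> Schema {
        Schema::new(
            self.fields
                .iter()
                .map(|f| {
                    let name = match &f.qualifier {
                        Some(q) => format!("{}.{}", q, f.field.name()),
                        None => f.field.name().clone(),
                    };
                    Field::new(&name, f.field.data_type().clone(), f.field.is_nullable())
                })
                .collect(),
        )
    }
}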

            return Ok(i);
        }
        if let Some(i) = self.fields.iter().position(|f| f.name() == name) {
            Ok(i)
Review comment (Contributor):

Small style thing, but I guess we could do it roughly like this instead?

self.fields
    .iter()
    .position(|f| f.name() == name)
    .ok_or_else(|| DataFusionError::Plan(format!("No field named '{}'", name)))

andygrove (Member Author) replied:

Thanks. Fixed.

@github-actions github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Dec 7, 2020
jorgecarleitao (Member) commented:

I went carefully through this. As I understand this PR, the reason we pass `DFSchema` into the `ExecutionPlan` is that we need to pass it to `PhysicalExpr.evaluate`, so that we can use `field_with_unqualified_name` on the `ColumnExpr`. 95% of the changes in the PR derive from this change.

IMO this introduces complexity in the physical execution that makes it more difficult to understand and use.

IMO the signature `evaluate(&RecordBatch, &DFSchema)` indicates a design issue, as the `RecordBatch` has all the information required for evaluation by a `PhysicalExpr`.

IMO we may be able to avoid this complexity by using `field_with_unqualified_name` in the physical planner, to create a `Schema` that is passed to the `ExecutionPlan` with the fields re-written, and creating `ColumnExpr` using the qualified names.

Specifically, the suggestion is to have the physical planner convert `DFSchema` -> `Schema` by writing `DFField` (qual, name) to `Field` "qual.name", and, respectively, to pass "qual.name" to `ColumnExpr`. IMO this would let us keep all physical planning as it is in master, and IMO would make it easier to understand the physical execution and how the logical plan is converted to it.

alamb (Contributor) left a comment:

Something doesn't feel right with this PR -- specifically, `DFSchema` is leaking into physical plan execution.

I think if we can find a way to avoid introducing `DFSchema` into `ExecutionPlan`, we are going to be in much better shape.

}

#[test]
fn test_display_qualified_schema() -> Result<()> {
Review comment (Contributor):

👍

@@ -62,7 +62,7 @@ pub trait ExecutionPlan: Debug + Send + Sync {
     /// downcast to a specific implementation.
     fn as_any(&self) -> &dyn Any;
     /// Get the schema for this execution plan
-    fn schema(&self) -> SchemaRef;
+    fn schema(&self) -> DFSchemaRef;
Review comment (Contributor):

When I was saying "physical plan doesn't use `DFSchema`", I guess I was imagining that `ExecutionPlan::schema()` continued to return `SchemaRef` -- there may be some reason that `ExecutionPlan` needs to return a `DFSchema`, but I think the design would be cleaner if we avoided this.

alamb (Contributor) commented Dec 7, 2020:

Ah and now I see, like so often, @jorgecarleitao has beat me to the comment and has more thorough comments as well 👍

andygrove (Member Author) commented Dec 7, 2020:

Thanks for the continued reviews... I think I misunderstood some of the earlier feedback. Also, I did run into a design issue when trying to leave the execution path using `SchemaRef`. I will see if I can find time this evening to explain the issue.

andygrove (Member Author) commented:

@jorgecarleitao @alamb I now see where I got carried away with this 😄 .. this PR now updates 16 files instead of 41 and does not change the physical plans.

alamb (Contributor) left a comment:

I think it is looking good 🎉

 pub fn to_field(&self, input_schema: &DFSchema) -> Result<DFField> {
-    Ok(Field::new(
+    Ok(DFField::new(
+        None, //TODO qualifier
Review comment (Contributor):

Might be worth a ticket to track this work -- it could be a good initial project for a new contributor.

andygrove (Member Author) replied:

oops, I actually forgot about that TODO.. thanks

jorgecarleitao (Member) left a comment:

Sorry for the misunderstanding. Thanks for the patience and great work here! LGTM!

@alamb alamb closed this in 09c442a Dec 8, 2020
GeorgeAp pushed the commits for apache#8840 and apache#8839 to sirensolutions/arrow, referencing this pull request, on Jun 7, 2021.