ARROW-10732: [Rust] [DataFusion] Integrate DFSchema as a step towards supporting qualified column names #8839
Conversation
This is great! I didn't see any strange things now, the code looks clean, and it sounds like this could be integrated and further tested.

Hey @andygrove. Thanks a lot for this! I would benefit from understanding the use-case for … My current understanding is that, without qualifiers, we can't write things like … What I am trying to understand is when we need such an expression at the physical level. Typically, these plans require some form of join and are mapped to … One use case I see for this is when the join is itself over an expression, e.g. … If the goal is that we can add the qualifier to the column name after the join, to disambiguate …
Hi @jorgecarleitao, did you get a chance to read the design document? There is a link to it from the JIRA.

Yeah, I missed that one and the whole discussion on the issue: https://docs.google.com/document/d/1BFo7ruJayCulAHLa9-noaHXbgcaAH_4LuOJFGJnDHkc/edit#heading=h.su3u27lcpr3l -- sorry about that.

You may have a point about only needing this at the logical level. I am not sure, but I will take a look at this tomorrow.
impl DFSchema {
    /// Creates an empty `DFSchema`
    pub fn empty() -> Self {
        Self { fields: vec![] }
Would it make sense to make this a `HashSet`? Or convert to a `Vec` in the last step.
    /// Find the index of the column with the given name
    pub fn index_of(&self, name: &str) -> Result<usize> {
        for i in 0..self.fields.len() {
This could use `Iterator::position`.
Thanks @andygrove -- this is looking really nice. I agree with @jorgecarleitao that, unless there is some use case we have overlooked, the `DFSchema` notion should probably exist only in the `LogicalPlan`, and physical plans should still use only the Arrow `Schema`.

That is a standard division I have seen in other optimizers / planners -- at some point the distinction between relations / where the input came from is no longer relevant and the code is just focused on sending columns of data around.
@@ -214,7 +212,7 @@ impl ExecutionContext {
         has_header: options.has_header,
         delimiter: Some(options.delimiter),
         projection: None,
-        projected_schema: csv.schema(),
+        projected_schema: Arc::new(DFSchema::from(&csv.schema())),
We could make this code look better if we implemented `impl Into<DFSchemaRef> for SchemaRef` -- then we could write something like `projected_schema: csv.schema().into(),`. Doing so in some follow-on PR would be totally fine.
@@ -408,7 +406,7 @@ impl ExecutionContext {
     let path = Path::new(&path).join(&filename);
     let file = fs::File::create(path)?;
     let mut writer =
-        ArrowWriter::try_new(file.try_clone().unwrap(), plan.schema(), None)?;
+        ArrowWriter::try_new(file.try_clone().unwrap(), plan.schema().to_arrow_schema(), None)?;
We could likewise implement `impl Into<Schema> for DFSchema` and then call `into()` rather than `to_arrow_schema()`. This is again just a stylistic thing.
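The reverse conversion could be sketched the same way. Again, these are simplified stand-in types rather than the real Arrow/DataFusion definitions, just to show the `From` direction that would let callers write `into()` instead of `to_arrow_schema()`:

```rust
// Simplified stand-ins, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub field_names: Vec<String>,
}

#[derive(Debug, Clone)]
pub struct DFField {
    pub qualifier: Option<String>,
    pub name: String,
}

#[derive(Debug, Clone)]
pub struct DFSchema {
    pub fields: Vec<DFField>,
}

// Converting back to a plain Arrow-style schema drops the qualifiers,
// which is fine for consumers (like a file writer) that don't need them.
impl From<&DFSchema> for Schema {
    fn from(df: &DFSchema) -> Self {
        Schema {
            field_names: df.fields.iter().map(|f| f.name.clone()).collect(),
        }
    }
}
```

Because qualifiers are discarded in this direction, the conversion is lossy by design; the qualified view of the schema lives only on the `DFSchema` side.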
I've broken this out into a separate PR #8857
This PR implements a DataFusion schema that wraps the Arrow schema and adds support for qualified names. There is a follow-up PR #8839 to integrate this with DataFusion.

Design doc: https://docs.google.com/document/d/1BFo7ruJayCulAHLa9-noaHXbgcaAH_4LuOJFGJnDHkc/edit#heading=h.3japu7255aut

Closes #8840 from andygrove/dfschema

Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>

@jorgecarleitao @alamb I've been looking at the question of whether the physical plan should use DFSchema. Here is the current (in master) implementation of the physical expression for Column:

impl PhysicalExpr for Column {
/// Get the data type of this expression, given the schema of the input
fn data_type(&self, input_schema: &Schema) -> Result<DataType> {
Ok(input_schema
.field_with_name(&self.name)?
.data_type()
.clone())
}
    /// Decide whether this expression is nullable, given the schema of the input
fn nullable(&self, input_schema: &Schema) -> Result<bool> {
Ok(input_schema.field_with_name(&self.name)?.is_nullable())
}
/// Evaluate the expression
fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
Ok(ColumnarValue::Array(
batch.column(batch.schema().index_of(&self.name)?).clone(),
))
}
}

As you can see, the … The bigger issue, though, is that this expression looks up columns by name, so how do we support qualified names here? I see the following choices: …

Maybe there are other options that I am not seeing?
Thanks a lot for looking at this. All excellent points. I now see that this is tricky :)

Thinking about what you wrote, if we plan the logical … I think that this will happen even if we pass …

This IMO leaves us with 2., which is what I would try: change the physical planner to alias/rewrite column names with the qualifier when the physical plan is created. This will cause the resulting …

I.e. the invariant that …

Note that we already do this when performing coercion: we preserve the logical schema name by injecting cast ops during physical (and not logical) planning, so that if the user wrote …
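The rewrite proposed here -- folding the qualifier into the column name when the physical plan is created -- could look roughly like this sketch. The `LogicalColumn` type and the `t1.c1` naming convention are assumptions for illustration, not the actual DataFusion code:

```rust
// Hypothetical sketch: when lowering a logical column reference to a
// physical one, bake the qualifier into the column name. The physical
// schema then needs no notion of qualifiers, and columns coming out of a
// join remain unambiguous (`t1.id` vs `t2.id`).
#[derive(Debug, Clone, PartialEq)]
pub struct LogicalColumn {
    pub qualifier: Option<String>,
    pub name: String,
}

/// The name the physical plan would use for this column, e.g. "t1.c1".
pub fn physical_name(col: &LogicalColumn) -> String {
    match &col.qualifier {
        Some(q) => format!("{}.{}", q, col.name),
        None => col.name.clone(),
    }
}
```

Under this scheme the physical `Column` expression keeps its existing name-based lookup unchanged; only the names it is given become fully qualified.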
Thanks @jorgecarleitao, I think that makes a lot of sense. Unfortunately, I am running into some issues implementing this because the physical planner calls into the logical planner to create names, and it is getting hard to mix and match these schemas. I am going to have to take a step back and break this down into smaller steps, I think.
I agree -- I recommend using the schema from the plan for consistency.
I agree with @jorgecarleitao's recommendation -- when moving from the logical to the physical plan, we should always use the fully qualified name of the field, which would avoid ambiguity. If we don't like …
Thanks for the feedback. I will try to get this rebased today.
Force-pushed: de091c5 to e10ddb9
Force-pushed: e10ddb9 to bfacef5
@alamb @jorgecarleitao @Dandandan This is ready for re-review. To recap: …

(*) It is possible that there may still be one or two places where we are still using the batch schema, but I think it will be easier to find those in the follow-up PRs where we add support for referencing columns by qualified names.
            return Ok(i);
        }
        if let Some(i) = self.fields.iter().position(|f| f.name() == name) {
            Ok(i)
Small style thing, but I guess we could do it roughly like this instead:

self.fields.iter().position(|f| f.name() == name)
    .ok_or_else(|| DataFusionError::Plan(format!("No field named '{}'", name)))
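As a self-contained sketch, the suggested style looks like this. The `Field`, `DFSchema`, and error types below are simplified stand-ins (only the `DataFusionError::Plan` variant mirrors the real API), so this illustrates the `position` + `ok_or_else` pattern rather than the actual code:

```rust
// Simplified stand-ins for illustration.
#[derive(Debug)]
pub enum DataFusionError {
    Plan(String),
}

pub struct Field {
    name: String,
}

impl Field {
    fn name(&self) -> &str {
        &self.name
    }
}

pub struct DFSchema {
    fields: Vec<Field>,
}

impl DFSchema {
    /// Find the index of the column with the given name, replacing the
    /// manual index loop with `Iterator::position` + `ok_or_else`.
    pub fn index_of(&self, name: &str) -> Result<usize, DataFusionError> {
        self.fields
            .iter()
            .position(|f| f.name() == name)
            .ok_or_else(|| DataFusionError::Plan(format!("No field named '{}'", name)))
    }
}
```

Besides being shorter, this version constructs the error message lazily, only on the miss path.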
Thanks. Fixed.
I went carefully through this. As I understand this PR, the reason we pass …

IMO this introduces complexity in the physical execution that makes it more difficult to understand and use. IMO the signature …

IMO we may be able to avoid this complexity by using … Specifically, the suggestion is to have the physical planner convert …
Something doesn't feel right with this PR -- specifically that `DFSchema` is leaking into physical plan execution. I think if we can find a way to avoid introducing `DFSchema` into `ExecutionPlan` we are going to be in much better shape.
}

#[test]
fn test_display_qualified_schema() -> Result<()> {
👍
@@ -62,7 +62,7 @@ pub trait ExecutionPlan: Debug + Send + Sync {
     /// downcast to a specific implementation.
     fn as_any(&self) -> &dyn Any;
     /// Get the schema for this execution plan
-    fn schema(&self) -> SchemaRef;
+    fn schema(&self) -> DFSchemaRef;
When I was saying "physical plan doesn't use DFSchema" I guess I was imagining that `ExecutionPlan::schema()` continued to return `SchemaRef` -- there may be some reason that `ExecutionPlan` needs to return a `DFSchema`, but I think the design would be cleaner if we avoided this.
Ah, and now I see that, like so often, @jorgecarleitao has beaten me to the comment and has more thorough comments as well 👍

Thanks for the continued reviews... I think I misunderstood some of the earlier feedback. Also, I ran into a design issue when trying to leave the execution path using SchemaRef. I will see if I can find time this evening to explain this issue.
Force-pushed: b3375c9 to 1d66eff
@jorgecarleitao @alamb I now see where I got carried away with this 😄 .. this PR now updates 16 files instead of 41 and does not change the physical plans.
I think it is looking good 🎉
-        Ok(Field::new(
+    pub fn to_field(&self, input_schema: &DFSchema) -> Result<DFField> {
+        Ok(DFField::new(
+            None, //TODO qualifier
might be worth a ticket to track this work -- it would be a good initial project for someone to contribute maybe
oops, I actually forgot about that TODO.. thanks
Sorry for the misunderstanding. Thanks for the patience and great work here! LGTM!
… supporting qualified column names

This PR builds on apache#8840 and integrates DFSchema with the DataFusion query planning, optimization, and execution code. This was a pretty large refactor unfortunately and I don't really see a way to break this down into smaller PRs.

There should be no functional changes in this PR. Fields are looked up using `field_with_unqualified_name` and I will file a separate PR to add support for referencing qualified field names.

Note that I had to update `PhysicalExpr.evaluate()` to pass in the input schema since we can no longer rely on the schema from the Arrow `RecordBatch` (because it loses the qualifiers). The other methods on `PhysicalExpr` already required the input schema, so this seems consistent at least, because we now always use the schema from the plan.

The rest of the changes are updating the query planning, optimization, and execution to use `DFSchema` instead of `Schema`.

Design Document: https://docs.google.com/document/d/1BFo7ruJayCulAHLa9-noaHXbgcaAH_4LuOJFGJnDHkc/edit#heading=h.3japu7255aut

Closes apache#8839 from andygrove/sql-relation-names

Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>