Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add source_semantic_models property to DataflowPlanNode #1217

Conversation

tlento
Copy link
Contributor

@tlento tlento commented May 15, 2024

This commit adds the source_semantic_models property to the DataflowPlanNode,
with a base case implementation of returning the recursively-collected set
of all source semantic models fed into the node parents, and the single
necessary override in ReadSqlSourceNode to populate the semantic model
reference as appropriate.

We currently have two use-cases for this, one in the cloud product
that needs the semantic model inputs for a dataflow plan, and the
upcoming predicate pushdown evaluation which needs the semantic model
inputs for a given DataflowPlanNode.

This is presented as a stand-alone change for discussion, as it
represents a departure from our previous de facto approach of limiting
all traversal operations to visitor classes. There are, in fact,
three alternatives to adding this property which would all suffice
for the purposes of predicate pushdown:

  1. Write a separate class to do the traversal and fetch the property
    only from the ReadSqlSourceNodes.
  2. Use an existing property accessor to collect this information,
    either via restricting return types from the SourceNodeSet or by
    using the existing resolver and consolidating the column-level
    semantic model references
  3. Add this property as seen here.

The second option will not address the case on the cloud side, so
I chose between the traversal and the approach seen here. I went
with this for the following reasons:

  1. This is, conceptually, a property of the DataflowPlanNode. Up until
    now the existence of this property has been implicit - at some point,
    at the root of all executable DataflowPlan DAGs, is at least one ReadSqlSourceNode
    sourced from a semantic model. Adding an explicit accessor makes it more
    obvious that this is our baseline expectation. The fact that it can be
    an empty set is a shortcoming, but not a terrible one.
  2. This limits the scope of what we need to import across module (or
    even repository) boundaries in order to access this property information.
    With a separate visitor, we'd have to always import a separate class
    that implements the DataflowPlanNodeVisitor interface and pulls in
    those other elements.
  3. The property return type is always obvious to the caller, who likely
    does not care how we got the information.

The disadvantages here are:

  1. This is the first break in our current "traversals are always visitors"
    model, which is a bit annoying as now we need to look in two places to
    find traversal operations. However, this is already true on the other side
  • we have a number of implicit property accessors which we fetch via visitors -
    and I believe we'll end up in a better place if we move those to be properties
    instead.
  1. There may be other recursive property access operations we need to make,
    and these may be scattered around across multiple node objects. There are
    ways to consolidate this - including allowing a visitor-style class to operate
    on these nodes and encapsulating it within the property accesor itself - but
    that is not necessary at this juncture.

Copy link
Contributor Author

tlento commented May 15, 2024

Copy link

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

This commit adds the source_semantic_models property to the DataflowPlanNode,
with a base case implementation of returning the recursively-collected set
of all source semantic models fed into the node parents, and the single
necessary override in ReadSqlSourceNode to populate the semantic model
reference as appropriate.

We currently have two use-cases for this, one in the cloud product
that needs the semantic model inputs for a dataflow plan, and the
upcoming predicate pushdown evaluation which needs the semantic model
inputs for a given DataflowPlanNode.

This is presented as a stand-alone change for discussion, as it
represents a departure from our previous de facto approach of limiting
all traversal operations to visitor classes. There are, in fact,
three alternatives to adding this property which would all suffice
for the purposes of predicate pushdown:

1. Write a separate class to do the traversal and fetch the property
only from the ReadSqlSourceNodes.
2. Use an existing property accessor to collect this information,
either via restricting return types from the SourceNodeSet or by
using the existing resolver and consolidating the column-level
semantic model references
3. Add this property as seen here.

The second option will not address the case on the cloud side, so
I chose between the traversal and the approach seen here. I went
with this for the following reasons:

1. This is, conceptually, a property of the DataflowPlanNode. Up until
now the existence of this property has been implicit - at some point,
at the root of all executable DataflowPlan DAGs, is at least one ReadSqlSourceNode
sourced from a semantic model. Adding an explicit accessor makes it more
obvious that this is our baseline expectation. The fact that it can be
an empty set is a shortcoming, but not a terrible one.
2. This limits the scope of what we need to import across module (or
even repository) boundaries in order to access this property information.
With a separate visitor, we'd have to always import a separate class
that implements the DataflowPlanNodeVisitor interface and pulls in
those other elements.
3. The property return type is always obvious to the caller, who likely
does not care how we got the information.

The disadvantages here are:

1. This is the first break in our current "traversals are always visitors"
model, which is a bit annoying as now we need to look in two places to
find traversal operations. However, this is already true on the other side
- we have a number of implicit property accessors which we fetch via visitors -
and I believe we'll end up in a better place if we move those to be properties
instead.
2. There may be other recursive property access operations we need to make,
and these may be scattered around across multiple node objects. There are
ways to consolidate this - including allowing a visitor-style class to operate
on these nodes and encapsulating it within the property accesor itself - but
that is not necessary at this juncture.
@tlento tlento force-pushed the use-pushdown-params-for-disabling-time-constraint-pushdown branch from 26dd729 to 9435008 Compare May 15, 2024 22:47
@tlento tlento force-pushed the add-source-semantic-models-property-to-dfp-node branch from 0ce550c to 2c2a081 Compare May 15, 2024 22:47
@tlento
Copy link
Contributor Author

tlento commented May 15, 2024

Update - @plypaul and I have discussed this a bit and found a better approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant