Added proto definitions for Data Preparations#1788
Conversation
| message TableReference { | ||
| string project = 1; | ||
| string dataset = 2; | ||
| string table = 3; | ||
| } |
There was a problem hiding this comment.
It seems strange to duplicate the table target like this. Target is already defined
Line 47 in 97123fa
There was a problem hiding this comment.
It looks similar, it just uses BigQuery specific terminology. Data Preparation is based on a BigQuery Source/Destination, thus we defined the protos with BQ specific terminology.
There was a problem hiding this comment.
Definitely confusing for future readers! (fixed by separate proto suggestion)
| message DataPreparationDefinition { | ||
| repeated DataPreparationNode nodes = 1; | ||
| DataPreparationGenerated generated = 2; | ||
| } |
There was a problem hiding this comment.
These values need to be populated from somewhere, I'd imagine that the config proto needs more options that roughly correspond to the data in here?
There was a problem hiding this comment.
Yes, the Data Preparation YAML file is formatted according to this proto. During processing, we will parse the data preparation YAML and build this proto object from the parse result.
There was a problem hiding this comment.
Just to clear up any confusion - parsing YAML files as objects is the job of the config.proto, not the core.proto
| message DataPreparationDefinition { | ||
| repeated DataPreparationNode nodes = 1; | ||
| DataPreparationGenerated generated = 2; | ||
| } |
There was a problem hiding this comment.
I don't really understand why this is structured in this way - with nested nodes.
There are many downsides to having a node that contains other nested nodes:
- What prevents it from becoming a cyclic dag?
- How would one represent this in a UI?
It would be much better to keep each node as separate actions instead.
There was a problem hiding this comment.
The DAG may contain branches or nodes that join 2 other nodes. The DAG is structured in a way that a node always points into a Source:
- A BigQuery Table
- 1 previous node
- 2 previous nodes, which are combined using a SQL JOIN operation
There was a problem hiding this comment.
IIUC this is essentially a copy of the public proto. Ideally we want to keep this in a healthy DAG structure, as described in my comment about avoiding nodes that contain nested nodes.
Ideally we would destructure it to preserve the compiled graph structure, then restructure it at execution time. For example, the execution proto could contain essentially what has been put here currently.
(note this is the execution proto for the CLI - internally we have a protos, but they look very similar).
dataform/protos/execution.proto
Line 32 in 97123fa
I'm interested in your thoughts on this!
There was a problem hiding this comment.
From chat outside the PR - agree that keeping it as it is, as a nested DAG, is the best way forwards!
| string name = 1; | ||
|
|
||
| // Targets of actions that this action is dependent on. | ||
| repeated Target dependency_targets = 2; |
There was a problem hiding this comment.
You can just call it dependencies as the type is already Target, no need to repeat it in the field name.
There was a problem hiding this comment.
dependencies is a better name, but with the other action types dependencyTargets is used to keep backwards compatibility with old versions - it would be best to stay consistent with them.
More context:
- SQLX action blocks have a config option
dependencieshttps://github.com/dataform-co/dataform/blob/97123faf432781480bee6f888a4881d934aff349/core/actions/table.ts#L218C6-L218C18. - This SQLX config option is of type resolvable, which is target (as a JS object) or of string .
dataform/core/actions/table.ts
Line 579 in 97123fa
- Proto based configs are however strictly the object type.
- In unifying the SQLX and proto based definitions, we map from the string type to the object type
dataform/core/actions/assertion.ts
Line 202 in 97123fa
Ekrekr
left a comment
There was a problem hiding this comment.
In general I'm happy with this now, but one main change: I think DataPreparationDefinition, and its children definitions should be moved to its own proto file:
- It removes confusion about why we have multiple targets that look identical (agreed that it does make sense for them to be different protos), and why these nodes are different to Dataform's "action" concept.
- You can copybara the internal proto out (or in), so that keeping them in sync is easier.
- It seems to me that you'll always want the data prep config and what's sent in the compiled graph to be identical.
There was a problem hiding this comment.
From chat outside the PR - agree that keeping it as it is, as a nested DAG, is the best way forwards!
| message TableReference { | ||
| string project = 1; | ||
| string dataset = 2; | ||
| string table = 3; | ||
| } |
There was a problem hiding this comment.
Definitely confusing for future readers! (fixed by separate proto suggestion)
| message DataPreparationDefinition { | ||
| repeated DataPreparationNode nodes = 1; | ||
| DataPreparationGenerated generated = 2; | ||
| } |
There was a problem hiding this comment.
Just to clear up any confusion - parsing YAML files as objects is the job of the config.proto, not the core.proto
* Added proto definitions for Data Preparations * Moved Data preparation protos into a separate file
No description provided.