Added proto definitions for Data Preparations by fernst · Pull Request #1788 · dataform-co/dataform

fernst · 2024-07-17T18:07:31Z

No description provided.

Ekrekr · 2024-07-18T07:09:37Z

+message TableReference {
+  string project = 1;
+  string dataset = 2;
+  string table = 3;
+}


It seems strange to duplicate the table target like this. Target is already defined in:

dataform/protos/core.proto

Line 47 in 97123fa

message Target {

It looks similar, but it uses BigQuery-specific terminology. Data Preparation is based on a BigQuery source/destination, so we defined the protos with BigQuery-specific terminology.

Definitely confusing for future readers! (fixed by separate proto suggestion)

Ekrekr · 2024-07-18T07:09:40Z

+message DataPreparationDefinition {
+  repeated DataPreparationNode nodes = 1;
+  DataPreparationGenerated generated = 2;
+}


These values need to be populated from somewhere. I'd imagine that the config proto needs more options that roughly correspond to the data in here?

Yes, the Data Preparation YAML file is formatted according to this proto. During processing, we will parse the data preparation YAML and build this proto object from the parse result.

Just to clear up any confusion - parsing YAML files as objects is the job of the config.proto, not the core.proto

Ekrekr · 2024-07-18T07:12:18Z

+message DataPreparationDefinition {
+  repeated DataPreparationNode nodes = 1;
+  DataPreparationGenerated generated = 2;
+}


I don't really understand why this is structured in this way - with nested nodes.

There are many downsides to having a node that contains other nested nodes:

What prevents the graph from becoming cyclic?

How would one represent this in a UI?

It would be much better to keep each node as separate actions instead.

The DAG may contain branches, or nodes that join 2 other nodes. It is structured so that each node always points to a source, which is one of:

A BigQuery Table

1 previous node

2 previous nodes, which are combined using a SQL JOIN operation
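As a rough illustration of the three source kinds listed above, and of the earlier question about cycles, here is a minimal TypeScript sketch. The type and function names (TableRef, Source, PrepNode, findCycle) and the use of string node ids are assumptions made for this example only; they are not the fields defined in this PR's protos.

```typescript
// Hypothetical shapes for illustration only; not the PR's proto fields.
interface TableRef {
  project: string;
  dataset: string;
  table: string;
}

// A node's source is exactly one of: a table, one previous node,
// or two previous nodes combined with a SQL JOIN.
type Source =
  | { kind: "table"; table: TableRef }
  | { kind: "node"; nodeId: string }
  | { kind: "join"; leftNodeId: string; rightNodeId: string };

interface PrepNode {
  id: string;
  source: Source;
}

// Ids of the nodes that a given node reads from.
function upstreamIds(node: PrepNode): string[] {
  const source = node.source;
  switch (source.kind) {
    case "table":
      return [];
    case "node":
      return [source.nodeId];
    case "join":
      return [source.leftNodeId, source.rightNodeId];
  }
}

// Depth-first search; returns the ids along one cycle, or null if the
// node list forms a valid DAG.
function findCycle(nodes: PrepNode[]): string[] | null {
  const byId = new Map(nodes.map((n): [string, PrepNode] => [n.id, n]));
  const state = new Map<string, "visiting" | "done">();
  const path: string[] = [];

  function visit(id: string): string[] | null {
    const node = byId.get(id);
    if (!node) {
      throw new Error(`Unknown node id: ${id}`);
    }
    if (state.get(id) === "done") {
      return null;
    }
    if (state.get(id) === "visiting") {
      // We reached a node that is still on the current path: a cycle.
      return [...path.slice(path.indexOf(id)), id];
    }
    state.set(id, "visiting");
    path.push(id);
    for (const upstream of upstreamIds(node)) {
      const cycle = visit(upstream);
      if (cycle) {
        return cycle;
      }
    }
    path.pop();
    state.set(id, "done");
    return null;
  }

  for (const node of nodes) {
    const cycle = visit(node.id);
    if (cycle) {
      return cycle;
    }
  }
  return null;
}
```

Nothing in the message shape itself rules out a cycle, so a check along these lines would have to run wherever the definition is parsed or compiled.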

Ekrekr · 2024-07-18T07:21:06Z

IIUC this is essentially a copy of the public proto. Ideally we want to keep this in a healthy DAG structure, as described in my comment about avoiding nodes that contain nested nodes.

Ideally we would destructure it to preserve the compiled graph structure, then restructure it at execution time. For example, the execution proto could contain essentially what has been put here currently.

(note this is the execution proto for the CLI - internally we have different protos, but they look very similar).

dataform/protos/execution.proto

Line 32 in 97123fa

message ExecutionAction {

I'm interested in your thoughts on this!

From chat outside the PR - agree that keeping it as it is, as a nested DAG, is the best way forwards!

chtyim · 2024-07-18T11:25:38Z

+    string name = 1;
+
+    // Targets of actions that this action is dependent on.
+    repeated Target dependency_targets = 2;


You can just call it dependencies as the type is already Target, no need to repeat it in the field name.

dependencies is a better name, but with the other action types dependencyTargets is used to keep backwards compatibility with old versions - it would be best to stay consistent with them.

More context:

SQLX action blocks have a config option dependencies https://github.com/dataform-co/dataform/blob/97123faf432781480bee6f888a4881d934aff349/core/actions/table.ts#L218C6-L218C18.

This SQLX config option is of type Resolvable, which is either a target (as a JS object) or a string:

dataform/core/actions/table.ts

Line 579 in 97123fa

public dependencies(value: Resolvable | Resolvable[]) {

Proto-based configs, however, are strictly the object type.

In unifying the SQLX and proto-based definitions, we map from the string type to the object type:

dataform/core/actions/assertion.ts

Line 202 in 97123fa

if (unverifiedConfig.dependencies) {
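To make the string-to-object mapping concrete, here is a minimal TypeScript sketch. It is not the code in assertion.ts: the helper names toTarget and toTargets are hypothetical, the Target shape (database, schema, name) is an assumption about the target object, and treating strings as dot-separated "database.schema.name" is also an assumption made for illustration.

```typescript
// Assumed shape of a target object; treat the field names as illustrative.
interface Target {
  database?: string;
  schema?: string;
  name: string;
}

// A dependency given either as a plain string or as a target object.
type Resolvable = string | Target;

// Hypothetical helper: normalise one Resolvable into a Target.
// Assumes strings look like "name", "schema.name" or "database.schema.name".
function toTarget(resolvable: Resolvable): Target {
  if (typeof resolvable !== "string") {
    return resolvable;
  }
  const parts = resolvable.split(".");
  if (parts.length > 3 || parts.some((part) => part.length === 0)) {
    throw new Error(`Cannot interpret "${resolvable}" as a target`);
  }
  const target: Target = { name: parts[parts.length - 1] };
  if (parts.length >= 2) {
    target.schema = parts[parts.length - 2];
  }
  if (parts.length === 3) {
    target.database = parts[0];
  }
  return target;
}

// Hypothetical helper: accept a single value or an array, as the SQLX
// dependencies option does, and always return an array of Target objects.
function toTargets(value: Resolvable | Resolvable[]): Target[] {
  const list = Array.isArray(value) ? value : [value];
  return list.map(toTarget);
}
```

With a mapping like this, both definition styles end up with a list of target objects, which is what a field such as dependency_targets would hold.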

Ekrekr

In general I'm happy with this now, but I'd ask for one main change: I think DataPreparationDefinition and its child definitions should be moved to their own proto file:

It removes confusion about why we have multiple targets that look identical (agreed that it does make sense for them to be different protos), and why these nodes are different to Dataform's "action" concept.
You can copybara the internal proto out (or in), so that keeping them in sync is easier.
It seems to me that you'll always want the data prep config and what's sent in the compiled graph to be identical.

* Added proto definitions for Data Preparations * Moved Data preparation protos into a separate file

Added proto definitions for Data Preparations

0ba1236

fernst requested review from Ekrekr and chtyim July 17, 2024 18:07

Ekrekr reviewed Jul 18, 2024

chtyim reviewed Jul 18, 2024

Ekrekr reviewed Jul 18, 2024

Moved Data preparation protos into a separate file

4b6b489

Ekrekr approved these changes Jul 18, 2024

fernst merged commit 6635abc into main Jul 18, 2024

fernst deleted the data-preparation-proto branch July 18, 2024 19:51

bmagyarkuti pushed a commit to bmagyarkuti/dataform that referenced this pull request Jul 23, 2024

Added proto definitions for Data Preparations (dataform-co#1788)

301d860

* Added proto definitions for Data Preparations * Moved Data preparation protos into a separate file
