New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Moves standard URNs into .proto files #4672
Conversation
CC: @herohde |
9e78fe4
to
4f30d1c
Compare
OK, fixed the build dependencies in build.gradle too I think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the approach. Since we are working out a standard that folks will depend on ~forever I am advising caution in coupling the pieces and being very careful about distinguishing required bits / primitives from stuff that we just have a URN for.
@@ -175,6 +176,72 @@ message PTransform { | |||
DisplayData display_data = 6; | |||
} | |||
|
|||
message StandardPTransformUrns { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate primitives, deprecated primitives, standardized composites.
Primitives are: ParDo, Flatten, GroupByKey, Impulse, AssignWindows
Deprecated primitives are: Read, CreateView
There's some mystery surrounding the status of GBK / CoGBK
The rest are just composite that have well known URNs. They should live as annotations closer to their payloads, but not alongside the primitives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
// part of the translation for a `ParDo` side input. | ||
CREATE_VIEW = 15 [(standard_urn) = "beam:transform:create_view:v1"]; | ||
|
||
MAP_WINDOWS = 16 [(standard_urn) = "beam:transform:map_windows:v1"]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't even know what this is
|
||
COMBINE_PGBKCV = 10 [(standard_urn) = "beam:transform:combine_pgbkcv:v1"]; | ||
COMBINE_MERGE_ACCUMULATORS = 11 [(standard_urn) = "beam:transform:combine_merge_accumulators:v1"]; | ||
COMBINE_EXTRACT_OUTPUTS = 12 [(standard_urn) = "beam:transform:combine_extract_outputs:v1"]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate fn api-only URNs from pipeline-level URNs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disagree here, much better to document all the combine URNs together than some in one place and some in another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can accept this. They should live perhaps in an enum inside or alongside the Combine
payload.
As you know, I broadly believe there is value is being clear about what is a PCollection(s)-to-PCollection(s) transformation versus what is an instruction that we can tell an SDK harness to apply to a bundle. Losing that distinction has already led to proposals that things that used to be automatic - like combiner lifting - being pushed back into the submitted pipeline.
So my only issue is I can't really tell from this PR if we are taking the position that these are all valid PCollection(s)-to-PCollection(s) transformations that just happen to only be used as instructions, or if we are just not being clear about the distinction.
@@ -403,6 +470,33 @@ message Coder { | |||
repeated string component_coder_ids = 2; | |||
} | |||
|
|||
message StandardCoderUrns { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe these are all actually required, but check to be sure.
Missing TIMESTAMP (I think durations are not primitives)
// beam:windowfn:global_windows:v0.1 | ||
// empty payload | ||
message StandardWindowFnUrns { | ||
enum Enum { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting example. No WindowFn is required as a primitive. They are all just "well-known". There is no particular reason they should be in the same list, except the SDK today ships with them.
What about putting these in an enum inside their payload?
message FixedWindows {
enum Urn {
FIXED_WINDOWS = 0 [(beam_urn) = "beam:windowfn:fixed_windows:v0.1"]
}
google.protobuf.Duration size = 1;
google.protobuf.Timestamp offset = 2; google.protobuf.Timestamp offset = 2;
}
(the Payload
suffix has always been extraneous; I regret introducing it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may require GlobalWindows (e.g. when we need to pass elements and don't care about the windowing).
Putting these URNs as part of the payload is an interesting idea.
// The recommended pattern for declaring it is (exemplified by coders): | ||
// | ||
// enum StandardCoderUrns { | ||
// BYTES = 0 [(standard_urn) = "beam:coder:bytes:v1"]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think beam_urn
might be better than standard_urn
@@ -175,6 +176,72 @@ message PTransform { | |||
DisplayData display_data = 6; | |||
} | |||
|
|||
message StandardPTransformUrns { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
// Less well-known. Payload: WriteFilesPayload. | ||
WRITE_FILES = 14 [(standard_urn) = "beam:transform:write_files:0.1"]; | ||
|
||
// Deprecated: runners should move away from translating `CreatePCollectionView` and treat this as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not put known-deprecated URNs here, but rather in a separate file.
// beam:windowfn:global_windows:v0.1 | ||
// empty payload | ||
message StandardWindowFnUrns { | ||
enum Enum { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may require GlobalWindows (e.g. when we need to pass elements and don't care about the windowing).
Putting these URNs as part of the payload is an interesting idea.
throw new IllegalArgumentException( | ||
"Unknown or unsupported WindowFn: " + windowFnSpec.getSpec().getUrn()); | ||
String s = windowFnSpec.getSpec().getUrn(); | ||
if (s.equals(getStandardUrn(StandardWindowFnUrns.Enum.GLOBAL_WINDOWS))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could remove the case statements, but inlining getStandardUrn(...) is quite verbose here.
sdks/python/setup.py
Outdated
'model/pipeline/src/main/resources/org/apache/beam/model/common_urns.md') | ||
out = os.path.join( | ||
os.path.dirname(__file__), | ||
'apache_beam/portability/common_urns.py') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're not auto-generating this, we need to check it in (or something else that extracts all the names from the annotations).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh snap. I did check it in, but it is ignored in .gitignore (forgot to remove this) and cleaned by "mvn clean" (remembered to remove from there when it was too late) :-/ Guess I'll have to redo it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not just check this in, that would be a regression. The key point is that there's a common source of truth (namely the protos).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I mean, checking in the auto-generated file. I assume what you did is rewrite this file to extract the proto annotations.)
Reviewed 24 of 24 files at r1. a discussion (no related file): model/pipeline/src/main/proto/standard_window_fns.proto, line 37 at r1 (raw file): Previously, robertwb (Robert Bradshaw) wrote…
GlobalWindows as required makes some sense, as it would be the windowing for an impulse (formerly, the windowing of anything just read from the outside world). But I don't actually know what functionality is gained by having any required for the runner to understand. I don't think the SDK has any that are required, since the SDK harness won't see window fns that don't originate with a pipeline the SDK core constructed. (at least in the basic 1:1 core:harness scenario) Comments from Reviewable |
(I'll resume working on this next week - currently busy preparing the Strata talk) |
OK, I refactored this a little: categorized the entities into different enums in the same message where appropriate (reads better, and allows to potentially have more properties specified as an option than just URN), and re-created the lost common_urns.py. PTAL. |
2ae5252
to
2e60039
Compare
Ping. (python failed lint - fixing; but the unit tests passed so it's generally ready for review) |
This looks like it could use a rebase and revise. |
Rebased and revised, PTAL :) |
retest this please |
retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is valuable as-is so I think it should merge. Can you file some JIRAs about the other thoughts here: https://issues.apache.org/jira/browse/BEAM-3221
Specific items:
- Separate primitives & critical composites (Combine, etc) from other transforms (KafkaIO, WriteFiles) in terms of the actual artifact dependencies. Eventually as per discussion on mailing list we intend an open ended unspecified universe here.
- I really really like embedding the URN-carrying enum inside the payloads
- The suffix
Payload
adds no value and might imply something to someone, so I'd drop it if/when any breaking change migration takes place.
Thanks! Oh, realized I forgot to refactor the Go code the same way. I'll file another JIRA for that alongside those that you recommended. |
Centralizes the location of standard URNs - now they are encoded in .proto files, implemented as options on enum values. This way they are located only in one place, and references are compile-time checked (where available), rather than hard-coded in many places.
common_urns.md
was a step in this direction, but it only validated that an URN is valid, but did not provide a named language entity for each one.Fixes a bunch of code that referred to them by old URNs.
This removes the necessity for UrnUtils and for the
resources
directory.