Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Moves standard URNs into .proto files #4672

Merged
merged 6 commits into from Apr 25, 2018
Merged

Conversation

jkff
Copy link
Contributor

@jkff jkff commented Feb 13, 2018

Centralizes the location of standard URNs - now they are encoded in .proto files, implemented as options on enum values. This way they are located only in one place, and references are compile-time checked (where available), rather than hard-coded in many places. common_urns.md was a step in this direction, but it only validated that an URN is valid, but did not provide a named language entity for each one.

Fixes a bunch of code that referred to them by old URNs.

This removes the necessity for UrnUtils and for the resources directory.

@jkff
Copy link
Contributor Author

jkff commented Feb 13, 2018

CC: @herohde

@jkff jkff force-pushed the standard-urns branch 4 times, most recently from 9e78fe4 to 4f30d1c Compare February 14, 2018 22:37
@jkff
Copy link
Contributor Author

jkff commented Feb 14, 2018

OK, fixed the build dependencies in build.gradle too I think.

Copy link
Member

@kennknowles kennknowles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the approach. Since we are working out a standard that folks will depend on ~forever I am advising caution in coupling the pieces and being very careful about distinguishing required bits / primitives from stuff that we just have a URN for.

@@ -175,6 +176,72 @@ message PTransform {
DisplayData display_data = 6;
}

message StandardPTransformUrns {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate primitives, deprecated primitives, standardized composites.

Primitives are: ParDo, Flatten, GroupByKey, Impulse, AssignWindows

Deprecated primitives are: Read, CreateView

There's some mystery surrounding the status of GBK / CoGBK

The rest are just composite that have well known URNs. They should live as annotations closer to their payloads, but not alongside the primitives.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

// part of the translation for a `ParDo` side input.
CREATE_VIEW = 15 [(standard_urn) = "beam:transform:create_view:v1"];

MAP_WINDOWS = 16 [(standard_urn) = "beam:transform:map_windows:v1"];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't even know what this is


COMBINE_PGBKCV = 10 [(standard_urn) = "beam:transform:combine_pgbkcv:v1"];
COMBINE_MERGE_ACCUMULATORS = 11 [(standard_urn) = "beam:transform:combine_merge_accumulators:v1"];
COMBINE_EXTRACT_OUTPUTS = 12 [(standard_urn) = "beam:transform:combine_extract_outputs:v1"];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate fn api-only URNs from pipeline-level URNs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disagree here, much better to document all the combine URNs together than some in one place and some in another.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can accept this. They should live perhaps in an enum inside or alongside the Combine payload.

As you know, I broadly believe there is value is being clear about what is a PCollection(s)-to-PCollection(s) transformation versus what is an instruction that we can tell an SDK harness to apply to a bundle. Losing that distinction has already led to proposals that things that used to be automatic - like combiner lifting - being pushed back into the submitted pipeline.

So my only issue is I can't really tell from this PR if we are taking the position that these are all valid PCollection(s)-to-PCollection(s) transformations that just happen to only be used as instructions, or if we are just not being clear about the distinction.

@@ -403,6 +470,33 @@ message Coder {
repeated string component_coder_ids = 2;
}

message StandardCoderUrns {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe these are all actually required, but check to be sure.

Missing TIMESTAMP (I think durations are not primitives)

// beam:windowfn:global_windows:v0.1
// empty payload
message StandardWindowFnUrns {
enum Enum {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting example. No WindowFn is required as a primitive. They are all just "well-known". There is no particular reason they should be in the same list, except the SDK today ships with them.

What about putting these in an enum inside their payload?

message FixedWindows {
   enum Urn {
     FIXED_WINDOWS = 0 [(beam_urn) = "beam:windowfn:fixed_windows:v0.1"]
   }
   google.protobuf.Duration size = 1;
   google.protobuf.Timestamp offset = 2;	   google.protobuf.Timestamp offset = 2;
}

(the Payload suffix has always been extraneous; I regret introducing it)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may require GlobalWindows (e.g. when we need to pass elements and don't care about the windowing).

Putting these URNs as part of the payload is an interesting idea.

// The recommended pattern for declaring it is (exemplified by coders):
//
// enum StandardCoderUrns {
// BYTES = 0 [(standard_urn) = "beam:coder:bytes:v1"];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think beam_urn might be better than standard_urn

@@ -175,6 +176,72 @@ message PTransform {
DisplayData display_data = 6;
}

message StandardPTransformUrns {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

// Less well-known. Payload: WriteFilesPayload.
WRITE_FILES = 14 [(standard_urn) = "beam:transform:write_files:0.1"];

// Deprecated: runners should move away from translating `CreatePCollectionView` and treat this as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not put known-deprecated URNs here, but rather in a separate file.

// beam:windowfn:global_windows:v0.1
// empty payload
message StandardWindowFnUrns {
enum Enum {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may require GlobalWindows (e.g. when we need to pass elements and don't care about the windowing).

Putting these URNs as part of the payload is an interesting idea.

throw new IllegalArgumentException(
"Unknown or unsupported WindowFn: " + windowFnSpec.getSpec().getUrn());
String s = windowFnSpec.getSpec().getUrn();
if (s.equals(getStandardUrn(StandardWindowFnUrns.Enum.GLOBAL_WINDOWS))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could remove the case statements, but inlining getStandardUrn(...) is quite verbose here.

'model/pipeline/src/main/resources/org/apache/beam/model/common_urns.md')
out = os.path.join(
os.path.dirname(__file__),
'apache_beam/portability/common_urns.py')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're not auto-generating this, we need to check it in (or something else that extracts all the names from the annotations).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh snap. I did check it in, but it is ignored in .gitignore (forgot to remove this) and cleaned by "mvn clean" (remembered to remove from there when it was too late) :-/ Guess I'll have to redo it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not just check this in, that would be a regression. The key point is that there's a common source of truth (namely the protos).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I mean, checking in the auto-generated file. I assume what you did is rewrite this file to extract the proto annotations.)

@kennknowles
Copy link
Member

Reviewed 24 of 24 files at r1.
Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


a discussion (no related file):
I really like this change at its core. Can we move it forward at all? My only major ask is to clearly distinguish required primitives from other stuff. But if you are waiting to do a large revision, I've really started to like the idea of URNs living in the payload message definitions, and those payload definitions living independently from the existing proto files.


model/pipeline/src/main/proto/standard_window_fns.proto, line 37 at r1 (raw file):

Previously, robertwb (Robert Bradshaw) wrote…

We may require GlobalWindows (e.g. when we need to pass elements and don't care about the windowing).

Putting these URNs as part of the payload is an interesting idea.

GlobalWindows as required makes some sense, as it would be the windowing for an impulse (formerly, the windowing of anything just read from the outside world). But I don't actually know what functionality is gained by having any required for the runner to understand. I don't think the SDK has any that are required, since the SDK harness won't see window fns that don't originate with a pipeline the SDK core constructed. (at least in the basic 1:1 core:harness scenario)


Comments from Reviewable

@jkff
Copy link
Contributor Author

jkff commented Feb 28, 2018

(I'll resume working on this next week - currently busy preparing the Strata talk)

@jkff
Copy link
Contributor Author

jkff commented Mar 19, 2018

OK, I refactored this a little: categorized the entities into different enums in the same message where appropriate (reads better, and allows to potentially have more properties specified as an option than just URN), and re-created the lost common_urns.py. PTAL.

@jkff jkff force-pushed the standard-urns branch 2 times, most recently from 2ae5252 to 2e60039 Compare March 23, 2018 22:09
@jkff
Copy link
Contributor Author

jkff commented Mar 23, 2018

Ping. (python failed lint - fixing; but the unit tests passed so it's generally ready for review)

@kennknowles
Copy link
Member

This looks like it could use a rebase and revise.

@jkff
Copy link
Contributor Author

jkff commented Apr 15, 2018

Rebased and revised, PTAL :)

@jkff
Copy link
Contributor Author

jkff commented Apr 16, 2018

retest this please

@jkff
Copy link
Contributor Author

jkff commented Apr 23, 2018

retest this please

Copy link
Member

@kennknowles kennknowles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is valuable as-is so I think it should merge. Can you file some JIRAs about the other thoughts here: https://issues.apache.org/jira/browse/BEAM-3221

Specific items:

  • Separate primitives & critical composites (Combine, etc) from other transforms (KafkaIO, WriteFiles) in terms of the actual artifact dependencies. Eventually as per discussion on mailing list we intend an open ended unspecified universe here.
  • I really really like embedding the URN-carrying enum inside the payloads
  • The suffix Payload adds no value and might imply something to someone, so I'd drop it if/when any breaking change migration takes place.

@jkff jkff merged commit bbabb42 into apache:master Apr 25, 2018
@jkff jkff deleted the standard-urns branch April 25, 2018 23:08
@jkff
Copy link
Contributor Author

jkff commented Apr 25, 2018

Thanks! Oh, realized I forgot to refactor the Go code the same way. I'll file another JIRA for that alongside those that you recommended.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants