Skip to content

[BEAM-115] Unify Java and Python WindowingStrategy representations.#3222

Closed
robertwb wants to merge 4 commits into
apache:masterfrom
robertwb:runner-windows
Closed

[BEAM-115] Unify Java and Python WindowingStrategy representations.#3222
robertwb wants to merge 4 commits into
apache:masterfrom
robertwb:runner-windows

Conversation

@robertwb
Copy link
Copy Markdown
Contributor

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [BEAM-<Jira issue #>] Description of pull request
  • Make sure tests pass via mvn clean verify.
  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.
  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

@robertwb
Copy link
Copy Markdown
Contributor Author

R: @tgroh

// This URN says that the WindowFn is just a UDF blob the Java SDK understands
// TODO: standardize such things
public static final String CUSTOM_WINDOWFN_URN = "urn:beam:windowfn:javasdk:0.1";
public static final String SERIALIZED_JAVA_WINDOWFN_URN = "beam:windowfn:javasdk:v0.1";
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious - does the v add anything in particular?

Copy link
Copy Markdown
Contributor Author

@robertwb robertwb May 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes it clearer that this number is a version field.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say drop it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use 'v' elsewhere (e.g. in the package name itself). Otherwise it looks like a ratio or something.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is in the package name to make it a valid package name. Not a huge deal. If you feel strongly about it I'm fine with it.

package org.apache.beam.runner_api.v1;

option java_package = "org.apache.beam.sdk.common.runner.v1";
option java_outer_classname = "RunnerApiPayloads";
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's name the classes and files consistently so they are easy to find.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open for suggestions (haven't come up with names that feel right yet).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, how about StandardWindowFns in standard_window_fns.proto?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Done.

@kennknowles
Copy link
Copy Markdown
Member

R: @kennknowles

SLIDING_WINDOWS_FN = "beam:window_fn:sliding_windows:v0.1"
SESSION_WINDOWS_FN = "beam:window_fn:session_windows:v0.1"
PICKLED_WINDOW_FN = "beam:windowfn:pickled_python:v0.1"
GLOBAL_WINDOWS_FN = "beam:windowfn:global_windows:v0.1"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had discussed using a proto definition like

message WellKnownUrns {
  string global_windows = 1 [default = "urn:beam:windowfn:global_windows:0.1"]
  ...
}

But I guess this is proto 2 only. For normal proto use, I agree with this change, but it is too bad we don't have that hack available. Any other ideas? A data file seems mostly pointless because whatever you look up the URN by is effectively just another unique name anyhow (it would probably prevent some errors).

Not that I think there is a huge danger of these getting out of sync, since they can never change once they are really used. But it would be nice for new SDKs, etc.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have a script that goes through the proto file (or something else) and pulls the name out and generates a .java, .py, .go, ... file of constants. Perhaps overkill though. Ideally this'd be co-located with the semantic definition of the urn as well, if needed.

public static final String SESSION_WINDOWS_FN = "beam:windowfn:session_windows:v0.1";
// This URN says that the WindowFn is just a UDF blob the Java SDK understands
// TODO: standardize such things
public static final String CUSTOM_WINDOWFN_URN = "urn:beam:windowfn:javasdk:0.1";
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW the urn: at the beginning is the URI scheme that makes it a URN. The official syntax is "urn:" Namespace : Namespace-specific format

(https://tools.ietf.org/html/rfc8141#section-2)

I agree the scheme is mostly pointless but wanted to just note the departure.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(in particular, top level schemes and the remainder of the URI have exactly the same relationship as URN namespaces and the NSS, as far as I can tell)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know urns were an official thing with a spec... We can prefix everything with urn: if you want.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I just wanted to note it. I find the prefixing a bit tedious. I've been doing it because following internet standards is my first instinct, but I don't think it is worthwhile. We should lay out the namespaces in a section of the runner guide on the website.

@kennknowles
Copy link
Copy Markdown
Member

(btw in commit message s/Stragegy/Strategy/)

@robertwb robertwb changed the title [BEAM-115] Unify Java and Python WindowingStragegy representations. [BEAM-115] Unify Java and Python WindowingStrategy representations. May 25, 2017
Copy link
Copy Markdown
Member

@tgroh tgroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that it would be worth adding the serialization as well here and testing the round tripping.

@robertwb
Copy link
Copy Markdown
Contributor Author

We have serialization and round-tripping tests.

@tgroh
Copy link
Copy Markdown
Member

tgroh commented May 25, 2017

What I mean is my understanding is that we do not have any logic to serialize known types (so this new deserialization code is not exercised). I'm happy to be wrong though.

@robertwb
Copy link
Copy Markdown
Contributor Author

Ah, yes, the serialization logic is only available in Python.

@robertwb
Copy link
Copy Markdown
Contributor Author

PTAL

import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.TimestampCombiner;
import org.apache.beam.sdk.transforms.windowing.Trigger;
import org.apache.beam.sdk.transforms.windowing.*;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No * imports

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, just disabled that setting in intellej...

windowFnSpec.getSpec().getParameter().unpack(
RunnerApiPayloads.SlidingWindowsPayload.class);
return SlidingWindows.of(
Duration.millis(Durations.toMillis(slidingParams.getSize())))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The formatting here is pretty funky.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are long lines, you have to break somewhere. I generally abide by the rectangle rule when I have to break up large expressions.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW since you mentioned IntelliJ there's google-java-format plugin which is stellar.

.build())))
.build();
// TODO: Set environment IDs
if (windowFn instanceof GlobalWindows) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should probably translated similarly to how we're translating the PTransform Payloads (i.e. with a registrar and some interface that can translate to and from a known type in some meaningful format. We probably want to wait until we've determined that API, though.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll defer on that.

@robertwb
Copy link
Copy Markdown
Contributor Author

Jenkins: retest this please

@tgroh
Copy link
Copy Markdown
Member

tgroh commented May 26, 2017

retest this please

@robertwb
Copy link
Copy Markdown
Contributor Author

Jenkins: retest this please

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+0.1%) to 70.836% when pulling f5ff727 on robertwb:runner-windows into 3643676 on apache:master.

@robertwb
Copy link
Copy Markdown
Contributor Author

Connect to conjars.org:80 [conjars.org/107.22.169.25] failed: Connection timed out

retest this please

@robertwb
Copy link
Copy Markdown
Contributor Author

retest this please

@asfbot
Copy link
Copy Markdown

asfbot commented May 30, 2017

--none--

1 similar comment
@asfbot
Copy link
Copy Markdown

asfbot commented May 30, 2017

--none--

@robertwb
Copy link
Copy Markdown
Contributor Author

Retest this please

@robertwb
Copy link
Copy Markdown
Contributor Author

Jenkins: retest this please (Apache Beam :: SDKs :: Java :: IO :: HBase failed downloading dependencies.)

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage increased (+0.01%) to 70.679% when pulling 97e72a0 on robertwb:runner-windows into 2df9dbd on apache:master.

@robertwb
Copy link
Copy Markdown
Contributor Author

robertwb commented Jun 1, 2017

Jenkins is happy, PTAL.

@kennknowles
Copy link
Copy Markdown
Member

Jenkins appears sad due to a typo in an attribute name.

Copy link
Copy Markdown
Member

@kennknowles kennknowles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once you get Jenkins happy.

return (WindowFn<?, ?>) deserializedWindowFn;
default:
throw new IllegalArgumentException(
"Unknown or unsupported WindowFn: " + windowFnSpec.getSpec().getUrn());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use WindowFn.class.getSimpleName()

self.assertEqual(
strategy,
DataflowRunner.deserialize_windowing_strategy(
DataflowRunner.serialize_windwoing_strategy(strategy)))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the typo.

offset=proto_utils.from_micros(
timestamp_pb2.Timestamp, self.offset.micros)))

@urns.RunnerApiFn.register_urn(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One design decision on the Java side (which I don't feel wedded to) is that URNs and payloads are entirely divorced from their Java native classes. I had imagined adding PTransform.getUrn() but instead we have a registrar from Java class to URN and payload which lives entirely outside the core SDK to keep proto off the API surface.

I don't entirely understand the ramifications of the Python SDK making both of these decisions separately. (PTransform having their URN, proto messages being right there too). I'd love your thoughts on it sometime, and whether we can easily make Java more natural.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are certainly pros and cons to each, but we should be consistent. Certainly going from the urn to object requires some kind of registration (even if the other way around can be handled by polymorphism).

Probably wouldn't hurt to put together a doc detailing these design choices, which wouldn't have to all be resolved now (but before the Runner/Fn Api is public we would make sure all SDKs are consistent).

Copy link
Copy Markdown
Member

@kennknowles kennknowles Jun 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally think that two-way translation is better via separate registered translators than dynamic dispatch. So that is where the design comes from. But actually, we don't really do it, for a specific reason. When we are parsing a proto that has URN "beam:transforms:pardo" we don't try to make a Java ParDo object because the types wouldn't work out, etc. Instead all transforms just become RawPTransform that can vend its URN and language-agnostic bits of the payload.

I am currently fleshing out the kinds of access runners will need and will then add all this to the runner guide.

Just notes, nothing particularly needed here.

@asfbot
Copy link
Copy Markdown

asfbot commented Jun 1, 2017

Build finished.
--none--

Copy link
Copy Markdown
Member

@tgroh tgroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with a nit.

SerializableUtils.serializeToByteArray(windowFn)))
.build())))
.build();
} else if (windowFn instanceof GlobalWindows) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have some sort of registration scheme for WindowFns in the medium-long term, even if only to simplify this method. File a JIRA?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-0.003%) to 70.678% when pulling 3b23198 on robertwb:runner-windows into 217f085 on apache:master.

@asfgit asfgit closed this in 2f9428c Jun 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants