Skip to content

[BEAM-7850] Makes environment ID a top level attribute of PTransform.#10183

Merged
chamikaramj merged 1 commit intoapache:masterfrom
chamikaramj:mv_environment_to_ptransform
Dec 20, 2019
Merged

[BEAM-7850] Makes environment ID a top level attribute of PTransform.#10183
chamikaramj merged 1 commit intoapache:masterfrom
chamikaramj:mv_environment_to_ptransform

Conversation

@chamikaramj
Copy link
Contributor

@chamikaramj chamikaramj commented Nov 21, 2019

Removes SDKFunctionSpec and replaces all usages of it with FunctionSpec.

Maintains the current behavior as much as possible. For example, added a list of known transform for which we'll be setting the environment (will leave the default value, empty string, for others). If needed, this behavior can be changed in a future change.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status
Build Status
--- --- Build Status
XLang --- --- --- Build Status --- --- ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status
Build Status
Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@chamikaramj
Copy link
Contributor Author

cc: @robertwb and @lukecwik

I'm adding the rest of the refactoring needed for this but can you take a quick look to see if the proto changes look good ?

I did not preserve tags since I think we do not worry about backwards compatibility at this point but lemme know if I should. Thanks.

@robertwb
Copy link
Contributor

The proto changes makes sense to me. The one bit I'm not sure of is how this will look for cross-language UDFs, but I think that will still look very different than the current SdfFunctionSpecs, and possibly will be modeled as "side" PTransforms which this would be in line with.

@chamikaramj
Copy link
Contributor Author

chamikaramj commented Nov 21, 2019

Thanks. Yeah, design for cross-language UDFs should be done separately and SdkFunctionSpec is inadequate for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All other uses of SdkFunctionSpec are part of a transform but finding the envrionment for the windowing strategy that the window_fn is part of when looking at a PCollection is difficult since one needs to find the environment that the upstream assign windows transform is part of. It is possible but annoying. Also this is different then coders which don't have an environment since both the upstream and downstream transforms need to understand the encoding and runners have the length prefixing technique to convert coders from unknown ones to known ones.

For now I'm for not adding an environment id here and forcing that graph traversal for the assign windows transform to find the environment and in the future to truly support custom window fns its likely we will have to come up with techniques like the coder length prefixing to make it so that they can be treated opaquely. (note that rest of the change looks good)

@chamikaramj @robertwb what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense and I'm actually running into this issue while updating SDKs. We probably have to add environment_id to WindowingStrategy since it's not directly attached to PTransform.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The windowing strategy cannot be coerced into an environment-agnostic one the same way a coder can be, as it involves actually executing user code as oppose to specifying constraints on its behavior. (In some sense, it is in a very real sense a cross-language UDF.)

Having to traverse the graph is ugly, but I suppose possible for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After hearing about some of the difficulties that Cham is running into. I would go either way on whether we add an environment id here or remove it completely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an environment_id to WindowingStrategy for now (retaining the current behavior). This can be further simplified in the future if needed.

@chamikaramj chamikaramj force-pushed the mv_environment_to_ptransform branch 5 times, most recently from 16d484b to 6f57355 Compare December 10, 2019 01:38
@chamikaramj chamikaramj force-pushed the mv_environment_to_ptransform branch from 66d8153 to a636ea0 Compare December 12, 2019 19:42
@chamikaramj
Copy link
Contributor Author

R: @robertwb

I'm trying to get the last set of failing tests to work but I think this is ready to be reviewed now.

Copy link
Contributor

@robertwb robertwb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this cleanup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: for consistency, order this as above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems it'd be better to have a prohibited list. Or possibly call these sdkTransformsWithEnvironment as "known" means lots of different things, and Flatten, GBK, ... would certainly seem to belong to a known this list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to sdkTransformsWithEnvironment since we are trying to maintain current behavior in this PR. In the future I think we should set environment ID in all transforms by default and let runners override for naively implemented transforms. Created https://issues.apache.org/jira/browse/BEAM-9001 for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for Java, known is probably not the right adjective here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed anymore. Removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(On that note, a set of environment-less ones probably makes more sense, as it's more likely to be a fixed set.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. Renamed the variable and created https://issues.apache.org/jira/browse/BEAM-9001.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. Renamed the variable and created https://issues.apache.org/jira/browse/BEAM-9001.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed given the above?

(Also, if it is needed, should it be nested for clarity, i.e.

if not environment_id:
  if transform_urn and transform_urn in KNOWN_SDK_TRANSFORMS:
      environment_id = context.default_environment_id()
  elif ...:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one is needed. Removed the other.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are specific to a runner, shouldn't be here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved back to original location.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unneeded now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved back to original location.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit out of place here, I'd move it back to where it was.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved back to original location.

@chamikaramj chamikaramj force-pushed the mv_environment_to_ptransform branch from dedbe59 to 31eddd1 Compare December 17, 2019 22:05
@youngoli
Copy link
Contributor

Heads up, in #10393 I updated the Go proto compiler version and rebuilt the generated code.You'll probably need to update the generated Go code before merging this (and make sure you update the protoc-gen-go version to the most recent one this time). If it says the generated code is using ProtoVersion3 instead of 2, then your change is good to go.

@chamikaramj chamikaramj force-pushed the mv_environment_to_ptransform branch from 31eddd1 to dc3fd94 Compare December 19, 2019 04:04
@chamikaramj
Copy link
Contributor Author

Run Python2_PVR_Flink PreCommit

@chamikaramj
Copy link
Contributor Author

Run Flink ValidatesRunner

@chamikaramj
Copy link
Contributor Author

Run Python PreCommit

Copy link
Contributor Author

@chamikaramj chamikaramj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. PTAL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an environment_id to WindowingStrategy for now (retaining the current behavior). This can be further simplified in the future if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to sdkTransformsWithEnvironment since we are trying to maintain current behavior in this PR. In the future I think we should set environment ID in all transforms by default and let runners override for naively implemented transforms. Created https://issues.apache.org/jira/browse/BEAM-9001 for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed anymore. Removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. Renamed the variable and created https://issues.apache.org/jira/browse/BEAM-9001.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. Renamed the variable and created https://issues.apache.org/jira/browse/BEAM-9001.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one is needed. Removed the other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved back to original location.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved back to original location.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved back to original location.

@chamikaramj
Copy link
Contributor Author

Run PythonLint PreCommit

@chamikaramj
Copy link
Contributor Author

Run Flink ValidatesRunner

@chamikaramj
Copy link
Contributor Author

Run Spark ValidatesRunner

@chamikaramj
Copy link
Contributor Author

Run Dataflow ValidatesRunner

Copy link
Contributor

@robertwb robertwb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks!

Removes SDKFunctionSpec and replaces all usages of it with FunctionSpec.
@chamikaramj chamikaramj force-pushed the mv_environment_to_ptransform branch from 084d4d9 to edff83f Compare December 20, 2019 00:05
Copy link
Contributor Author

@chamikaramj chamikaramj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@chamikaramj chamikaramj merged commit 3174445 into apache:master Dec 20, 2019
@lostluck lostluck self-assigned this Dec 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants