@angoenka angoenka commented Apr 12, 2019

The behavior of sdkWorkerParallelism in the previous implementation was inconsistent and hard to describe.
The new behavior of sdkWorkerParallelism is as follows.

A) User input sdkWorkerParallelism = null -> PortablePipelineOptions sets the default value of 0 -> the job server override (default value 1) replaces it with 1 -> 1 is used as the final value and sdkWorkerParallelism is dynamically decided.

B) User input sdkWorkerParallelism = 5 -> PortablePipelineOptions applies no default and 5 is passed along -> the job server override is not used -> 5 is used as the final value.

C) User input sdkWorkerParallelism = null -> PortablePipelineOptions sets the default value of 0 -> the job server override (set to 2) is used -> 2 is used as the final value.

D) User input sdkWorkerParallelism = 0 -> PortablePipelineOptions applies no default and 0 is passed along -> the job server override (default value 0) replaces it with 0 -> 0 is used as the final value and the actual worker parallelism is determined based on the available CPUs.
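
A minimal sketch of the four cases above, assuming the rule is "a pipeline value of 0 defers to the job server value, any other value wins". The function name, its parameters, and the constant are hypothetical and are not part of the Beam API; the job server values passed in the calls are the ones quoted in cases A to D.

```python
from typing import Optional

# Assumption: default applied by the pipeline options when the user passes nothing.
PIPELINE_OPTION_DEFAULT = 0


def resolve_sdk_worker_parallelism(user_value: Optional[int],
                                   job_server_value: int) -> int:
    """Hypothetical helper (not Beam API) mirroring cases A-D."""
    # Step 1: the pipeline options fill in the default only for a missing value.
    pipeline_value = PIPELINE_OPTION_DEFAULT if user_value is None else user_value
    # Step 2: 0 means "not decided by the pipeline", so the job server value applies.
    if pipeline_value == 0:
        return job_server_value
    # Step 3: any explicit non-zero value takes precedence over the job server.
    return pipeline_value


print(resolve_sdk_worker_parallelism(None, 1))  # case A -> 1
print(resolve_sdk_worker_parallelism(5, 1))     # case B -> 5
print(resolve_sdk_worker_parallelism(None, 2))  # case C -> 2
print(resolve_sdk_worker_parallelism(0, 0))     # case D -> 0
```

A final value of 0 is not a process count in itself; per case D it signals that the runner picks the parallelism.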



@angoenka
Contributor Author

R: @tvalentyn

@angoenka angoenka force-pushed the portable_options branch 2 times, most recently from 8bbe1a1 to 0a5c1af on April 12, 2019 21:24
@angoenka angoenka changed the title from "Make sdkWorkerParallelism option consistent" to "[BEAM-6853] Make sdkWorkerParallelism option consistent" on Apr 12, 2019
@angoenka
Contributor Author

Run Java PreCommit

Contributor

@tvalentyn tvalentyn left a comment

Thanks, @angoenka


class PortableOptions(PipelineOptions):

"""Portable options are common options expected to be understood by most of
Contributor

@angoenka What do you think about removing current RunnerOptions in Python SDK, and renaming PortableOptions to either RunnerOptions or PortableRunnerOptions in all SDKs? See also: #8225 (comment)
cc: @mxm, @robertwb .

Contributor

Fine with me to remove RunnerOptions because it is just a stub. Not sure about PortableOptions => PortableRunnerOptions because we currently do not include Runner in any option name, e.g. FlinkPipelineOptions, not FlinkRunnerPipelineOptions.

Contributor Author

I agree that we can remove RunnerOptions, as it's not used anywhere.
I would like to stick with the current convention of not adding "Runner" to PortableOptions.

Contributor

I think PortableOptions is understandable by developers but might be less user-friendly, since users might not know what portability refers to.

usage = "Default parallelism for SDK worker processes (see portable pipeline options)")
private Long sdkWorkerParallelism = 1L;
usage =
"Default parallelism for SDK worker processes (see portable pipeline options). "
Contributor

Consider a slightly different order: "Parallelism level for SDK worker processes. This value is used only when the pipeline option sdkWorkerParallelism is set to 0, otherwise the value provided in pipeline options takes precedence. Default is 0, which means worker parallelism will be dynamically decided by a runner. See also: sdkWorkerParallelism Pipeline Option." Also, should the last sentence be a Javadoc reference?

Contributor Author

Makes sense. I will update the wording.

'command.'))
parser.add_argument(
'--sdk_worker_parallelism', default=None,
'--sdk_worker_parallelism', default=1,
Contributor

Do we have strong reasons not to make 0 the default value? Folks working on portability probably have an informed opinion on this. cc: @mxm, @robertwb.

Contributor

I agree, we probably want 0.

Contributor Author

I have seen machines lock up when running as many Python processes as there are cores.
We can still go with 0 here as long as we set the default to 1 in JobServerDriver.java.
I think this is reasonable.
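
A rough sketch of what a default of 0 on the Python side could look like, assuming 0 is later expanded to the number of CPU cores. Only the flag name `--sdk_worker_parallelism` comes from the diff above; `build_parser`, `effective_worker_count`, and the help text are hypothetical, and the real expansion of 0 happens in the runner, not in argparse.

```python
import argparse
import multiprocessing


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-alone parser; Beam registers this flag elsewhere."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--sdk_worker_parallelism', type=int, default=0,
        help='Number of SDK worker processes; 0 lets the runner decide.')
    return parser


def effective_worker_count(sdk_worker_parallelism: int) -> int:
    """Assumption: 0 maps to one worker process per CPU core."""
    if sdk_worker_parallelism == 0:
        return multiprocessing.cpu_count()
    return sdk_worker_parallelism


# Parse an explicit argument list so the sketch does not read sys.argv.
default_args = build_parser().parse_args([])
explicit_args = build_parser().parse_args(['--sdk_worker_parallelism', '5'])
print(default_args.sdk_worker_parallelism)    # 0, i.e. the runner decides
print(explicit_args.sdk_worker_parallelism)   # 5
print(effective_worker_count(explicit_args.sdk_worker_parallelism))  # 5
```

With one process per core, the lock-up concern above applies directly, which is why a job server default of 1 acts as the safer fallback.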

@angoenka
Contributor Author

Run Java PreCommit

@angoenka
Contributor Author

Run Java_Examples_Dataflow PreCommit

@angoenka
Contributor Author

Ping @tvalentyn

Contributor

@tvalentyn tvalentyn left a comment

Thanks. Please update the PR description; it looks like it does not reflect the changes in the last iteration.

@angoenka
Contributor Author

Thanks @tvalentyn
Updated the description and merging the PR.

@angoenka angoenka merged commit fe676c5 into apache:master Apr 18, 2019

3 participants