[BEAM-1074] Set default-partitioner in SourceRDD.Unbounded #2288

Closed

Member

@aviemzur aviemzur commented Mar 22, 2017

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [BEAM-<Jira issue #>] Description of pull request
  • Make sure tests pass via mvn clean verify. (Even better, enable
    Travis-CI on your fork and ensure the whole test matrix passes).
  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.
  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

@aviemzur
Member Author

Run Spark RunnableOnService

@asfbot

asfbot commented Mar 22, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Spark/1323/
--none--

@coveralls

Coverage Status

Coverage increased (+0.003%) to 69.899% when pulling 581233f on aviemzur:sourcerdd-unbounded-default-partitioner into 2d9bf27 on apache:master.

@asfbot

asfbot commented Mar 22, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/8668/
--none--

@aviemzur
Member Author

R: @amitsela

Member

@amitsela amitsela left a comment

My only comments are, well, on comments 😄
The source implementations are tricky and rely on some members being persisted in the checkpoint, so it is important to note this.
Besides that, LGTM.
Feel free to merge after addressing the notes in the code.

@@ -60,47 +59,64 @@
private final UnboundedSource<T, CheckpointMarkT> unboundedSource;
private final SparkRuntimeContext runtimeContext;
private final Duration boundReadDuration;
private final int numPartitions;
Member

Since this is a member of the DStream, it is initialized once and would be recovered on checkpoint recovery, right?
So it should be noted here that this is a "one-time" set of partitions/splits that cannot change throughout the entire life of the application (including recovery/resume).
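
To make the "one-time" point concrete, here is a minimal sketch in plain Java. It assumes that checkpoint recovery restores the serialized object instead of re-running its constructor; `FixedPartitionsSketch` and `roundTrip` are hypothetical names for illustration and are not part of the PR or of Spark.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a stream object that would be checkpointed.
class FixedPartitionsSketch implements Serializable {
  private static final long serialVersionUID = 1L;

  // Set once at construction; restored as-is when the object is deserialized.
  private final int numPartitions;

  FixedPartitionsSketch(int numPartitions) {
    this.numPartitions = numPartitions;
  }

  int getNumPartitions() {
    return numPartitions;
  }

  // Simulates checkpoint followed by recovery using plain Java serialization.
  static FixedPartitionsSketch roundTrip(FixedPartitionsSketch original) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(original);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (FixedPartitionsSketch) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("serialization round trip failed", e);
    }
  }

  public static void main(String[] args) {
    FixedPartitionsSketch recovered = roundTrip(new FixedPartitionsSketch(8));
    // The recovered object carries the value chosen at first construction.
    System.out.println(recovered.getNumPartitions());
  }
}
```

Under that assumption, a partition count computed before the first checkpoint stays in effect after every resume, which is why the reviewer asks for it to be documented.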

this.boundMaxRecords = boundMaxRecords > 0 ? boundMaxRecords : rateControlledMaxRecords();

try {
this.numPartitions =
Member

A few more notes here too: this is an initialization of the splits to figure out the source parallelism...
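
As a rough illustration of deriving parallelism from an initial split, the sketch below asks a source for a desired number of splits and takes the size of the result as the partition count. The names `SplitCountSketch` and `initialSplits`, and the notion of integer "shards", are hypothetical; the actual code in the PR calls into the Beam source, which is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

class SplitCountSketch {
  // Hypothetical split: distributes shard ids round-robin over at most
  // desiredSplits groups. A source cannot split finer than its shards.
  static List<List<Integer>> initialSplits(int shardCount, int desiredSplits) {
    if (shardCount < 0 || desiredSplits < 1) {
      throw new IllegalArgumentException("invalid shardCount or desiredSplits");
    }
    int splitCount = Math.min(shardCount, desiredSplits);
    List<List<Integer>> splits = new ArrayList<>();
    for (int i = 0; i < splitCount; i++) {
      splits.add(new ArrayList<>());
    }
    for (int shard = 0; shard < shardCount; shard++) {
      splits.get(shard % splitCount).add(shard);
    }
    return splits;
  }

  public static void main(String[] args) {
    // The number of partitions is whatever the source actually returned,
    // which may be fewer than the number requested.
    int numPartitions = initialSplits(3, 8).size();
    System.out.println(numPartitions);
  }
}
```

The point of the sketch is that the requested split count is only a hint, so the partition count has to be read back from the result of the split.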

@@ -112,6 +128,10 @@ public String name() {
return "Beam UnboundedSource [" + id() + "]";
}

public int getNumPartitions() {
Member

Notes: "exposing the number of partitions for this SourceDStream so we can set the appropriate partitioner on the read via mapWithState..." or something.

@@ -247,6 +253,13 @@ public Unbounded(SparkContext sc,
}

@Override
public Option<Partitioner> partitioner() {
// setting the partitioner helps to "keep" the same partitioner in the following
Member

👍
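
For readers unfamiliar with why the partition count matters to the partitioner, here is a minimal plain-Java sketch of hash partitioning. `HashPartitionSketch` is a hypothetical stand-in and not Spark's `HashPartitioner`; it only mirrors the general idea of mapping a key's hash code to a non-negative partition index, and of treating two partitioners with the same partition count as equal.

```java
// Plain-Java stand-in for a hash partitioner; not Spark's class.
final class HashPartitionSketch {
  private final int numPartitions;

  HashPartitionSketch(int numPartitions) {
    if (numPartitions < 1) {
      throw new IllegalArgumentException("numPartitions must be positive");
    }
    this.numPartitions = numPartitions;
  }

  int numPartitions() {
    return numPartitions;
  }

  // Null keys go to partition 0; floorMod keeps the index non-negative
  // even when hashCode() is negative.
  int getPartition(Object key) {
    return key == null ? 0 : Math.floorMod(key.hashCode(), numPartitions);
  }

  // Two partitioners with the same partition count place keys identically.
  @Override
  public boolean equals(Object other) {
    return other instanceof HashPartitionSketch
        && ((HashPartitionSketch) other).numPartitions == numPartitions;
  }

  @Override
  public int hashCode() {
    return Integer.hashCode(numPartitions);
  }

  public static void main(String[] args) {
    HashPartitionSketch partitioner = new HashPartitionSketch(4);
    System.out.println(partitioner.getPartition("some-key"));
  }
}
```

If the source and the downstream stateful step agree on an equal partitioner, records are presumably already where the next step expects them, which is the motivation the code comment gives for "keeping" the same partitioner.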

    JavaDStream<WindowedValue<T>> readUnboundedStream = mapWithStateDStream.flatMap(
        new FlatMapFunction<Tuple2<Iterable<byte[]>, Metadata>, byte[]>() {
          @Override
          public Iterable<byte[]> call(Tuple2<Iterable<byte[]>, Metadata> t2) throws Exception {
            return t2._1();
          }
        }).map(CoderHelpers.fromByteFunction(coder));

    if (sourceDStream.getNumPartitions() < defaultParallelism) {
      // Repartition up to default parallelism if there are too few partitions.
Member

How about "Optimizing parallelism" instead of "Repartition up"?
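
The condition in the hunk above can be read as a small decision rule: only widen the stream when the source produced fewer partitions than the cluster's default parallelism. The sketch below expresses just that rule; `ParallelismSketch`, `needsRepartition`, and `targetPartitions` are hypothetical helper names, and no Spark call is made.

```java
class ParallelismSketch {
  // True only when the source yields fewer partitions than the default parallelism.
  static boolean needsRepartition(int sourcePartitions, int defaultParallelism) {
    return sourcePartitions < defaultParallelism;
  }

  // Never shrinks: a source that is already wide enough is left alone.
  static int targetPartitions(int sourcePartitions, int defaultParallelism) {
    return Math.max(sourcePartitions, defaultParallelism);
  }

  public static void main(String[] args) {
    System.out.println(needsRepartition(2, 8));
    System.out.println(targetPartitions(2, 8));
  }
}
```

Because the rule never reduces the partition count, a source that is already wide enough is left untouched and incurs no extra shuffle.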

@aviemzur aviemzur force-pushed the sourcerdd-unbounded-default-partitioner branch from 581233f to 0fa4d29 Compare March 23, 2017 09:36
@aviemzur aviemzur changed the title [BEAM-848] A better shuffle after reading from within mapWithState. [BEAM-1074] Set default-partitioner in SourceRDD.Unbounded Mar 23, 2017
@aviemzur
Member Author

Moved the BEAM-1075 commit to a different branch for now; we'll sit on it and decide whether and when to integrate it.

@coveralls

Coverage Status

Coverage increased (+0.01%) to 70.159% when pulling 0fa4d29 on aviemzur:sourcerdd-unbounded-default-partitioner into 5e1be9f on apache:master.

@asfbot

asfbot commented Mar 23, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/8707/
--none--

@asfbot

asfbot commented Mar 23, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Spark/1338/
--none--

@coveralls

Coverage Status

Coverage increased (+0.006%) to 70.156% when pulling 8de455c on aviemzur:sourcerdd-unbounded-default-partitioner into 5e1be9f on apache:master.

@asfbot

asfbot commented Mar 23, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Spark/1339/
--none--

@asfbot

asfbot commented Mar 23, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/8717/
--none--

@asfgit asfgit closed this in 9ac1ffc Mar 23, 2017
@aviemzur aviemzur deleted the sourcerdd-unbounded-default-partitioner branch March 28, 2017 03:55
4 participants