[BEAM-4783] Fix invalid parameter to set the partitioner in Spark GbK #6884
Changes from all commits
```diff
@@ -65,6 +65,8 @@
 import org.apache.beam.sdk.values.TupleTag;
 import org.apache.beam.sdk.values.WindowingStrategy;
 import org.apache.spark.Accumulator;
+import org.apache.spark.HashPartitioner;
+import org.apache.spark.Partitioner;
 import org.apache.spark.api.java.JavaPairRDD;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
```
```diff
@@ -130,17 +132,14 @@ public void evaluate(GroupByKey<K, V> transform, EvaluationContext context) {
         WindowedValue.FullWindowedValueCoder.of(coder.getValueCoder(), windowFn.windowCoder());

     // --- group by key only.
+    Long bundleSize =
+        context.getSerializableOptions().get().as(SparkPipelineOptions.class).getBundleSize();
+    Partitioner partitioner =
+        (bundleSize > 0)
+            ? new HashPartitioner(context.getSparkContext().defaultParallelism())
+            : null;
     JavaRDD<WindowedValue<KV<K, Iterable<WindowedValue<V>>>>> groupedByKey =
-        GroupCombineFunctions.groupByKeyOnly(
-            inRDD,
-            keyCoder,
-            wvCoder,
-            context
-                .getSerializableOptions()
-                .get()
-                .as(SparkPipelineOptions.class)
-                .getBundleSize()
-                > 0);
+        GroupCombineFunctions.groupByKeyOnly(inRDD, keyCoder, wvCoder, partitioner);

     // --- now group also by window.
     // for batch, GroupAlsoByWindow uses an in-memory StateInternals.
```

Contributor
For anyone observing: this is the real issue, where it should have been […]. The refactor removes the likelihood of error by explicitly taking a `Partitioner`.
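For context on what `new HashPartitioner(context.getSparkContext().defaultParallelism())` does here: Spark's `HashPartitioner` routes each key to a non-negative modulus of its `hashCode()`, with null keys always going to partition 0. A minimal plain-Java sketch of that behavior, with no Spark dependency (the class name is invented for illustration; the real class is `org.apache.spark.HashPartitioner`):

```java
// Illustrative re-implementation of Spark's HashPartitioner partition
// selection; HashPartitionerSketch is an invented name, not Spark's class.
final class HashPartitionerSketch {
  private final int numPartitions;

  HashPartitionerSketch(int numPartitions) {
    this.numPartitions = numPartitions;
  }

  // Spark maps a key to a non-negative modulus of its hashCode;
  // null keys always land in partition 0.
  int getPartition(Object key) {
    if (key == null) {
      return 0;
    }
    int mod = key.hashCode() % numPartitions;
    return mod < 0 ? mod + numPartitions : mod;
  }

  public static void main(String[] args) {
    HashPartitionerSketch p = new HashPartitionerSketch(4);
    System.out.println(p.getPartition("a"));  // "a".hashCode() == 97, so 97 % 4 == 1
    System.out.println(p.getPartition(null)); // 0
  }
}
```

Supplying such a partitioner forces the shuffle to produce exactly `numPartitions` output partitions, which is what the batch path above opts into when `bundleSize > 0`.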
|
|
```diff
@@ -433,7 +432,7 @@ private static <K, V, OutputT> JavaPairRDD<TupleTag<?>, WindowedValue<?>> statef
         WindowedValue.FullWindowedValueCoder.of(kvCoder.getValueCoder(), windowCoder);

     JavaRDD<WindowedValue<KV<K, Iterable<WindowedValue<V>>>>> groupRDD =
-        GroupCombineFunctions.groupByKeyOnly(kvInRDD, keyCoder, wvCoder, true);
+        GroupCombineFunctions.groupByKeyOnly(kvInRDD, keyCoder, wvCoder, null);

     return groupRDD
         .map(
```

Contributor
I believe this is actually correct and I did it wrong originally.
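The change above (and the contributor's earlier remark about the refactor "explicitly taking a `Partitioner`") comes down to replacing a boolean argument with a typed, nullable one. A hedged sketch of why the typed parameter is harder to misuse; every name here is invented for illustration and is not Beam's actual signature:

```java
// Invented names; illustrates the API-shape change only, not Beam's real code.
interface Partitioner {
  int numPartitions();
}

final class GroupByKeySketch {
  // Old shape: the flag's meaning is invisible at the call site, so
  // groupByKeyOnlyOld(data, true) can silently mean the wrong thing.
  static String groupByKeyOnlyOld(String data, boolean useDefaultParallelism) {
    return useDefaultParallelism ? "hash-partitioned" : "upstream partitioning";
  }

  // New shape: the caller must construct a Partitioner (or deliberately
  // pass null), making the repartitioning decision explicit and reviewable.
  static String groupByKeyOnlyNew(String data, Partitioner partitioner) {
    return partitioner != null ? "hash-partitioned" : "upstream partitioning";
  }
}
```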
Second file:

```diff
@@ -301,7 +301,7 @@ public void evaluate(GroupByKey<K, V> transform, EvaluationContext context) {
     JavaDStream<WindowedValue<KV<K, Iterable<WindowedValue<V>>>>> groupedByKeyStream =
         dStream.transform(
             rdd ->
-                GroupCombineFunctions.groupByKeyOnly(rdd, coder.getKeyCoder(), wvCoder, true));
+                GroupCombineFunctions.groupByKeyOnly(rdd, coder.getKeyCoder(), wvCoder, null));

     // --- now group also by window.
     JavaDStream<WindowedValue<KV<K, Iterable<V>>>> outStream =
```

Contributor
To maintain old functionality:

```java
GroupCombineFunctions.groupByKeyOnly(
    rdd, coder.getKeyCoder(), wvCoder,
    new HashPartitioner(context.getSparkContext().defaultParallelism()));
```
It should be flipped to maintain old functionality:

I also think it will have basically the same effect if we always use `null` in the batch context. If we split a source on bundleSize and it has `n` partitions, or split it on defaultParallelism and it has `defaultParallelism` partitions, either way going forward we want to maintain that many partitions, which is what no `HashPartitioner` does. The only case where this is weird is if we try to split on defaultParallelism but the source doesn't support splitting that much, causing `< defaultParallelism` partitions.
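The reasoning above can be sketched as a tiny model of the shuffle's output partition count, under the simplifying assumption that `spark.default.parallelism` is unset so a null partitioner keeps the upstream count (hypothetical helper, not Spark API):

```java
// Hypothetical model of the comment's argument; not Spark code.
final class PartitionCountModel {
  // With an explicit HashPartitioner(n) the shuffle yields n partitions;
  // with null, Spark's default keeps the upstream partition count
  // (assuming spark.default.parallelism is unset).
  static int resultingPartitions(Integer explicitPartitions, int upstreamPartitions) {
    return explicitPartitions != null ? explicitPartitions : upstreamPartitions;
  }

  public static void main(String[] args) {
    int defaultParallelism = 8;
    // Source split on bundleSize into 20 partitions: null keeps 20.
    System.out.println(resultingPartitions(null, 20));              // 20
    // Source split on defaultParallelism into 8 partitions: null keeps 8,
    // the same as HashPartitioner(defaultParallelism) would produce.
    System.out.println(resultingPartitions(null, 8));               // 8
    System.out.println(resultingPartitions(defaultParallelism, 8)); // 8
    // The "weird" case: source only managed 3 partitions; null keeps 3,
    // while HashPartitioner(8) would still widen the shuffle to 8.
    System.out.println(resultingPartitions(null, 3));               // 3
    System.out.println(resultingPartitions(defaultParallelism, 3)); // 8
  }
}
```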