
[FLINK-2590] fixing DataSetUtils.zipWithUniqueId() #1075

Closed · wants to merge 3 commits

Conversation

s1ck
Contributor

@s1ck commented Aug 29, 2015

  • modified algorithm as explained in the issue
  • updated method documentation

@rmetzger
Contributor

Thanks a lot for the contribution.
Can you add a test case for the method to make sure the issue is not reintroduced when somebody changes the code later?

@HuangWHWHW
Contributor

@rmetzger +1. I think adding a test would be helpful.
Otherwise, can you give us information that proves that id = (counter << shifter) + taskId will never generate the same id in different tasks?
And a minor thing in your issue description:
isn't log2(8) = 3, not 4?

@@ -121,6 +122,7 @@ public void mapPartition(Iterable<T> values, Collector<Tuple2<Long, T>> out) thr

return input.mapPartition(new RichMapPartitionFunction<T, Tuple2<Long, T>>() {

long maxLength = log2(Long.MAX_VALUE);
Contributor


You can make this static final.

@StephanEwen
Contributor

+1 for a test, otherwise this looks good!

@s1ck
Contributor Author

s1ck commented Aug 31, 2015

There is already a test case for zipWithUniqueId() in https://github.com/apache/flink/blob/master/flink-tests/src/test/java/org/apache/flink/test/util/DataSetUtilsITCase.java#L66
However, that test assumes that only one task is running, which is why it did not fail in the first place.
If there are multiple tasks, the resulting unique id is not deterministic for a single dataset element. I would implement a test that creates a dataset, applies the zipWithUniqueId method, calls distinct(0) on the created ids, and checks the number of resulting elements (which must be equal to the size of the input dataset). Would this be sufficient?
Furthermore, the current test cases for DataSetUtils write the resulting dataset out as strings and compare those after each test run. My proposed test would not fit into that scheme. Should I create a new test class for this method?

@StephanEwen I wanted to do this, but static doesn't work with anonymous classes. However, I can declare the UDF as a private inner class (I didn't want to change too much code).
@HuangWHWHW the log2 method already existed, and in the issue I proposed to rename it. Maybe getBitSize(long value)? As for the "proof": if each task id is smaller than the total number of parallel tasks t, its bit representation is also no longer than the bit representation of t. Thus, when we shift the counter by the number of bits of t, there cannot be a collision between different task ids.
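The collision argument above can be sketched in plain Java. This is a minimal, self-contained simulation (the class and helper names are hypothetical, not the actual Flink code): shifting each task's local counter left by the bit width of the task count keeps the low taskId bits disjoint from the counter bits, so ids from different tasks never collide.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the id scheme discussed above: each of the `parallelism`
// tasks holds a 0-based taskId and a local counter. The counter is
// shifted past the bits occupied by any valid taskId.
class UniqueIdSketch {

    // number of bits needed to represent `value` (0 for value == 0)
    static int bitSize(long value) {
        return 64 - Long.numberOfLeadingZeros(value);
    }

    static long uniqueId(long counter, int taskId, int parallelism) {
        int shifter = bitSize(parallelism); // bits of the task count t
        return (counter << shifter) + taskId;
    }

    public static void main(String[] args) {
        int parallelism = 7;
        Set<Long> seen = new HashSet<>();
        for (int task = 0; task < parallelism; task++) {
            for (long counter = 0; counter < 1000; counter++) {
                if (!seen.add(uniqueId(counter, task, parallelism))) {
                    throw new AssertionError("collision for task " + task);
                }
            }
        }
        System.out.println("all " + seen.size() + " ids distinct");
    }
}
```

Since every taskId is strictly smaller than parallelism, it fits entirely below the shift boundary, which is the core of the argument.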

@StephanEwen
Contributor

@s1ck Good idea. You can also call collect(), add the IDs to a set and make sure the set has the right cardinality. In general, avoiding temp files and Strings for comparison is a good idea.
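The suggested test shape can be sketched without a Flink cluster. This is a hedged simulation (hypothetical class name; a real test would call collect() on the zipped DataSet instead of building the id list by hand): gather the ids into a set and check that its cardinality equals the input size.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the proposed test: simulate the ids that zipWithUniqueId
// would attach across several "tasks", collect them into a set, and
// assert that no duplicates were produced.
class CardinalityTestSketch {

    public static void main(String[] args) {
        int parallelism = 4;
        int elementsPerTask = 100;

        int shifter = 64 - Long.numberOfLeadingZeros(parallelism);
        List<Long> ids = new ArrayList<>();
        for (int task = 0; task < parallelism; task++) {
            for (long counter = 0; counter < elementsPerTask; counter++) {
                ids.add((counter << shifter) + task);
            }
        }

        // the essence of the test: distinct ids == input size
        Set<Long> distinct = new HashSet<>(ids);
        if (distinct.size() != parallelism * elementsPerTask) {
            throw new AssertionError("duplicate ids generated");
        }
        System.out.println("cardinality check passed: " + distinct.size());
    }
}
```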

@tillrohrmann
Contributor

@s1ck, the testZipWithUniqueId test is bogus. You can remove this test case and replace it with the test you described. It would also be great if you could set the parallelism of testZipWithIndex to something greater than 1. Here it would also make sense to use collect instead of writing to disk.

+1 for renaming log2 into getBitSize(long value). When you rename the method, could you also change the line shifter = getBitSize(getRuntimeContext().getNumberOfParallelSubtasks()) into shifter = getBitSize(getRuntimeContext().getNumberOfParallelSubtasks() - 1). That way, we would also get the right unique ids in case of parallelism = 1.

@s1ck
Contributor Author

s1ck commented Aug 31, 2015

@tillrohrmann While writing the new tests for both methods, I found that zipWithIndex is broken, too. It sometimes throws a ConcurrentModificationException. This is because each task sorts a broadcast list in the open method. This could not fail before due to parallelism = 1.
I would fix this by creating a local copy of that list (which should be small in this specific case). Shall I fix this in the same issue, or do you want me to create a new issue for it?
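The local-copy fix proposed above can be sketched in plain Java (hypothetical names; the real code sorts a broadcast variable inside a Flink open() method): sorting a private copy leaves the shared list untouched, so parallel tasks can no longer race on it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the fix: when several parallel tasks receive the same
// broadcast list, sorting it in place from each task races. Sorting
// a private copy avoids mutating the shared instance.
class BroadcastCopySketch {

    // what each task's open() would do with the broadcast list
    static List<Integer> openTask(List<Integer> broadcast) {
        List<Integer> local = new ArrayList<>(broadcast); // copy first
        Collections.sort(local);                          // then sort
        return local;
    }

    public static void main(String[] args) {
        // stand-in for the broadcast variable shared by all tasks
        List<Integer> shared = Collections.unmodifiableList(List.of(3, 1, 2));
        List<Integer> sorted = openTask(shared);
        if (!sorted.equals(List.of(1, 2, 3))) {
            throw new AssertionError("copy not sorted");
        }
        if (!shared.equals(List.of(3, 1, 2))) {
            throw new AssertionError("shared list was mutated");
        }
    }
}
```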

@StephanEwen
Contributor

There is an issue that tracks the ConcurrentModificationException problem. As per the discussion in that issue, can you use a BroadcastVariableInitializer? That saves redundant sorts.

@s1ck
Contributor Author

s1ck commented Aug 31, 2015

@StephanEwen thanks for the hint, works fine! Will clean up and commit now.

* added tests for parallel execution of both zip functions
* renamed log2 -> getBitSize
* updated documentation
@s1ck
Contributor Author

s1ck commented Aug 31, 2015

@tillrohrmann I did not include shifter = getBitSize(getRuntimeContext().getNumberOfParallelSubtasks() - 1), as your hint only applies to power-of-2 values. E.g., getBitSize(7) returns 3, and we need 3 bits to cover the range from 0 to 6.

@HuangWHWHW
Contributor

Ah, thank you for the proof.
And I didn't look at log2 in detail before, sorry.

@tillrohrmann
Contributor

@s1ck, it's important to note that 1 will be subtracted from getRuntimeContext().getNumberOfParallelSubtasks() and not getBitSize(). The reason is that we have 0 based indices for the subtasks. Thus, we only have to calculate the maximum number of bits to represent the highest index we can encounter. And this is getRuntimeContext().getNumberOfParallelSubtasks() - 1. Thus if getNumberOfParallelSubtasks == 7, then we would calculate getBitSize(6) == 3.
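The off-by-one point above can be made concrete with a small sketch (one plausible getBitSize implementation, shown for illustration only; the actual Flink code may differ):

```java
// getBitSize returns the number of bits needed to represent `value`.
// With 0-based subtask indices, the highest index for parallelism p
// is p - 1, so getBitSize(p - 1) bits suffice for the taskId part.
class BitSizeSketch {

    static int getBitSize(long value) {
        return 64 - Long.numberOfLeadingZeros(value);
    }

    public static void main(String[] args) {
        // 7 subtasks -> highest 0-based index is 6 -> 3 bits suffice
        if (getBitSize(6) != 3) throw new AssertionError();
        // getBitSize(7) is also 3 (7 = 0b111), so p = 7 happens to work
        if (getBitSize(7) != 3) throw new AssertionError();
        // but for p = 8 (a power of 2) the difference shows: the highest
        // index 7 needs 3 bits, while getBitSize(8) would waste a bit
        if (getBitSize(7) != 3 || getBitSize(8) != 4) {
            throw new AssertionError();
        }
    }
}
```

This is why subtracting 1 from the subtask count, not from the bit size, gives the tight shift width.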

* maximum bit size is changed to getNumberOfParallelSubTasks() - 1
@s1ck
Contributor Author

s1ck commented Sep 1, 2015

@tillrohrmann of course you are right, I got that wrong. It's committed.

@tillrohrmann
Contributor

@s1ck, looks really good. Thanks for your contribution. Will merge it now.

@s1ck
Contributor Author

s1ck commented Sep 1, 2015

Sorry, I did not see that there are also identical test cases in Scala which now fail due to the -1 change. As those Scala methods wrap the Java methods, is it necessary to run the same tests on them again?

@tillrohrmann
Contributor

No problem @s1ck. It might be a bit redundant, but it tests that the forwarding is done correctly. Therefore, I fixed the test case.

@s1ck
Contributor Author

s1ck commented Sep 1, 2015

Ok, thank you.

asfgit pushed a commit that referenced this pull request Sep 2, 2015
…ipWithIndex()

* modified algorithm as explained in the issue
* updated method documentation

[FLINK-2590] reducing required bit shift size

* maximum bit size is changed to getNumberOfParallelSubTasks() - 1

This closes #1075.
@asfgit closed this in ab14f90 Sep 2, 2015
nikste pushed a commit to nikste/flink that referenced this pull request Sep 29, 2015
…ipWithIndex()
