-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-2997] Support range partition with user customized data distribution. #1776
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a3cd265
to
e91407d
Compare
@fhueske @ChengXiangLi Can you please help me with review work? The error of CI build failure is not relevant. |
|
||
String expected = "[(1), (4), (2), (3), (5), (12)]"; | ||
assertEquals(expected, out1.collect().toString()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you add a test which use 2 fields as partition key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, i will modify the test code.
Just a minor comment, mostly looks good to me. @fhueske , do you want to take a look at this PR? |
this.pKeys = pKeys; | ||
this.partitionLocationName = partitionLocationName; | ||
this.customPartitioner = customPartitioner; | ||
this.distribution = distribution; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a check that a distribution
is only set if pMethod == PartitionMethod.RANGE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other pMethod
like hashPartition
and customPartition
don't need the distribution
and the distribution
use its default value: null
, i think no check here also can make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check would prevent an invalid parameter combination and also shows readers of the code that such a combination is not valid. Hence, I think the check should be added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check here that the key types of distribution.getKeyTypes()
are equal to pKeys.getKeyFieldTypes()
.
Thanks @gallenvara for opening the PR. Looks mostly good but the tests need to be improved, IMO. It would also be good to extend the |
Hi, @fhueske . I have modified the relevant code. I still use the generic class |
Sorry @gallenvara, I was busy the last days. I will look into this PR tomorrow. |
prPlanNode.setParallelism(targetParallelism); | ||
GlobalProperties globalProperties = new GlobalProperties(); | ||
globalProperties.setRangePartitioned(new Ordering(0, null, Order.ASCENDING)); | ||
globalProperties.setRangePartitioned(new Ordering(0, null, Order.ASCENDING), channel.getDataDistribution()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you pass null
instead of channel.getDataDistribution()
to make it clear that no provided data distribution is available?
checkPartition = false; | ||
} | ||
} | ||
out.collect(checkPartition); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to emit something if assert is done in function.
I had a few comments, but we are getting closer :-) |
@fhueske , thanks a lot for review work, codes have been modified based on your advice. I change the second test with modifying the range boundary from |
7094543
to
ecdb437
Compare
if (distribution != null) { | ||
Preconditions.checkArgument(distribution.getNumberOfFields() == pKeys.getNumberOfKeyFields(), "The number of key fields in the distribution and range partitioner should be the same."); | ||
for (int i = 0; i < pKeys.getNumberOfKeyFields(); i++) { | ||
Preconditions.checkArgument(distribution.getKeyTypes()[i].equals(pKeys.getKeyFieldTypes()[i]), "The types of key from the distribution and range partitioner are not equal."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you fetch the key type arrays only once and not for every key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code modified and rebase the new commit with previous one.( you must have stayed up late last night :) )
…n the second test.
/** | ||
* The class is used to do the tests of range partition with customed data distribution. | ||
*/ | ||
public static class TestDataDist implements DataDistribution { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two distributions defined by this class have not a lot in common. How about we define two classes, one for each distribution?
@fhueske , PR has been updated. |
public static class TestDataDist2 implements DataDistribution { | ||
|
||
public int rightBoundary[] = new int[]{6, 4, 9, 1, 2}; | ||
private int dim; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dim
should always be 2
. Remove the field, the constructor, and update write()
and read()
.
Thanks for the quick updates @gallenvara! I think we are almost done. The test distribution and input data looks good. Only the boundary checks need to be fixed. |
@fhueske codes has been modified :) |
…with lower and upper boundary.
Thanks for the fast update. The PR looks good. :-) |
… data distribution. This closes apache#1776
Sometime user have better knowledge of the source data, and they can build customized
data distribution
to do range partition more efficiently.