[FLINK-7] [Runtime] Enable Range Partitioner. #1255
Conversation
Force-pushed 8a41b18 to a22fc27
Hi @ChengXiangLi, thanks a lot for the PR!
Thanks for the review in advance, @fhueske. One thing I want to mention: the range partitioner implementation in this PR follows the existing partitioner design pattern, which in my personal view is not well designed from an object-oriented programming perspective. The parameters of the different partitioners are all passed through Flink together, and all partitioner logic is mixed into OutputEmitter, so I'm actually not very satisfied with the implementation in this PR. I think we need to refactor the partitioners as a follow-up, but this should not block this PR. The range partitioner is a little different from the other partitioners; with it enabled, we get a better overview of how to design the partitioner abstraction. The refactoring should get a design doc and be fully discussed in the community. What do you think?
Hi @ChengXiangLi, I had a look at your PR and I think we need to change the implementation a bit. It would be better to inject the sampling into the actual job. This can be done, for example, as follows.
could be translated into:
This would inject the sampling into the original program. The sampling is done as before, but the data distribution is broadcasted to a map operator that uses the distribution to assign partition IDs to records. I agree it is not super nice, but this implementation would cache the intermediate result instead of recomputing it. In addition, it barely touches the internals. It is also possible to integrate the partitioning more tightly into the runtime by providing the data distribution directly to the Partitioner. However, that would mean we need to implement a partitioning operator for the runtime (instead of using the regular operator and a NOOP driver). Btw. I have some code lying around (for a not-yet-completed feature) to extract keys from a record given the key specification. Let me know if that would help for your implementation. What do you think? Thanks, Fabian
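A rough stdlib-only sketch of this assign-partition-ID-then-strip-it pattern (class and method names are hypothetical, not Flink API; the sorted boundary array stands in for the broadcast data distribution):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: tag each record with a partition ID derived from
// the broadcast distribution, shuffle on that ID, then strip the wrapper again.
public class AssignAndStrip {

    // Pick the target channel by binary search over the sorted range boundaries.
    static int channelFor(String[] boundaries, String key) {
        int pos = Arrays.binarySearch(boundaries, key);
        return pos >= 0 ? pos : -(pos + 1);
    }

    // Wrap every record into (channel, record), mimicking Tuple2<Integer, IN>.
    static List<Map.Entry<Integer, String>> assignIds(List<String> records, String[] boundaries) {
        List<Map.Entry<Integer, String>> tagged = new ArrayList<>();
        for (String r : records) {
            tagged.add(new AbstractMap.SimpleEntry<>(channelFor(boundaries, r), r));
        }
        return tagged;
    }

    // After the network exchange on the integer field, remove the wrapper again.
    static List<String> stripIds(List<Map.Entry<Integer, String>> tagged) {
        List<String> records = new ArrayList<>();
        for (Map.Entry<Integer, String> e : tagged) {
            records.add(e.getValue());
        }
        return records;
    }
}
```

In the actual PR the wrapping and unwrapping happen in dedicated operators, so the shuffle only ever sees the integer field.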
Thanks, @fhueske. It would definitely be great if we could execute the pre-sampling logic just once; I will update the code later.
I think the right way to implement the range partitioner feature is to inject the sampler during the JobGraph generation phase, after the OptimizedPlan has been created. This would allow us to handle range partitioning transparently during optimization and make it a completely runtime-related feature.
Force-pushed a22fc27 to cd78ac6
Implemented the range partitioner with KeySelector based on a broadcast data distribution. It is not fully finished yet, @fhueske: range partitioning with a field index or field name will require the tool you mentioned before for extracting keys from a record.
public void mapPartition(Iterable<Tuple2<K, IN>> values, Collector<Tuple2<Integer, IN>> out) throws Exception {
    List<Object> broadcastVariable = getRuntimeContext().getBroadcastVariable("DataDistribution");
    if (broadcastVariable == null || broadcastVariable.size() != 1) {
You can move the broadcast variable initialization into the open method.
Nevermind, I thought you were using a MapFunction, but its a MapPartitionFunction. So this is only done once.
Hi, I left a few comments inside. Let me know what you think.
Force-pushed a39dc9a to fe279eb
I added the
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.flink.api.java.functions;
The range partition functions (AssignRangeIndex, RangeBoundaryBuilder, and PartitionIDRemoveWrapper) should be moved from flink-java to flink-runtime because they are rather part of the runtime than of the user-facing API.
Should they be part of flink-optimizer? The rewrite happens during the optimization phase; we have not touched the runtime yet.
My intention was to move the classes to flink-runtime because this is where most of the execution code is located, including the execution vertices. flink-optimizer does not contain execution code AFAIK and only translates the API code into an executable representation.
Ok, makes sense to me.
Force-pushed 1383039 to 303bbf3
Hi @ChengXiangLi, I tried the PR locally on my machine and it was working quite well.
What do you think? Btw., the Record-specific code paths (incl. the RecordOutputEmitter) have been removed.
Thanks for the fast update! I also got feedback from @StephanEwen about the use of the IntDistribution. He shares your opinion that it is not really needed and just incurs additional overhead due to the second binary search. One last very small remark: could you please set the parallelism there as well? I think we're almost there. :-)
… and some code refactor.
Force-pushed be08030 to a9c1af7
Thanks, @fhueske. I've updated the PR to use
Thanks for the update. Looks good. Will try it once more on a cluster. @uce, @StephanEwen, we inject a
Sorry for not responding earlier, I somehow missed the notification. Batch exchanges (blocking intermediate results) do not work within iterations; therefore, the optimizer currently never sets the exchange mode to batch within iterations. I think we need to suppress range partitionings within iterations for the time being.
Thanks @uce! I also found that no combiners are inserted if the input of a reduce or combinable groupreduce is explicitly partitioned (FLINK-3179). This is not directly related to this PR. I am trying the PR right now on a cluster setup. Thanks @ChengXiangLi for your patience with this one!
I executed the Flink WordCount example on 4 nodes with 8 parallel tasks and roughly 17GB of input data once with hash partitioning and once with range partitioning. Both times no combiner was used. The hash partitioned WC took 8:00 mins and the range partitioned 13:17 mins.
The breakdown of the hash partitioned WC is:
So, the overhead of the range-partitioned WC comes from the additional IO of reading the flatMapped words plus the additional 4-byte integer. Also, the sorting of the group reduce does not happen concurrently with the source IO. Reducing (w/o sort) and sink take about the same amount of time. I also checked the distribution of input and output records/bytes for the GroupReduce.
We see that the range partitioner does not perform better (in fact a bit worse) than the hash partitioner (the differences for output records are expected). Maybe increasing the sample size helps? The overhead of reading the intermediate data set from disk is so high that a more fine-grained histogram can be justified, IMO. How about increasing the sample size? Other thoughts?
Hi @fhueske, for the partition part, I think it's normal that
Range partitioning serves two purposes:
Producing sorted results is working fine. However, producing balanced partitions does not seem to work so well. Looking at the numbers I posted, the partitions produced by the range partitioner are less balanced than the hash-partitioned ones (records-in / bytes-in). The difference is not huge, but range partitioning should still be able to do better than hash partitioning. I proposed to increase the sample size because this should improve the accuracy of the histogram without having a (measurable) impact on the performance. If we spend so much time generating a histogram, the histogram should be accurate enough to result in balanced partitions. Can you explain how you calculated the sample size?
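As a hedged illustration of why the sample size matters, here is a toy stdlib experiment (not code from the PR; all names are made up): boundaries are taken from equidistant points of a sorted random sample, and records are then counted per partition. Larger samples generally put the boundaries closer to the true quantiles, which yields more balanced counts.

```java
import java.util.Arrays;
import java.util.Random;

// Toy experiment only: build range boundaries from a random sample of the data
// and count how many records each partition would receive.
public class SampleSizeSketch {

    // Derive numPartitions - 1 boundaries from equidistant points of a sorted sample.
    static int[] boundariesFromSample(int[] data, int sampleSize, int numPartitions, long seed) {
        Random rnd = new Random(seed);
        int[] sample = new int[sampleSize];
        for (int i = 0; i < sampleSize; i++) {
            sample[i] = data[rnd.nextInt(data.length)];
        }
        Arrays.sort(sample);
        int[] boundaries = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            boundaries[i - 1] = sample[i * sampleSize / numPartitions];
        }
        return boundaries;
    }

    // Count records per partition under the sampled boundaries (binary search per record).
    static int[] partitionCounts(int[] data, int[] boundaries) {
        int[] counts = new int[boundaries.length + 1];
        for (int v : data) {
            int pos = Arrays.binarySearch(boundaries, v);
            counts[pos >= 0 ? pos : -(pos + 1)]++;
        }
        return counts;
    }
}
```

Comparing the spread of the counts for, say, a 10x larger sample on the same data makes the accuracy gain visible without touching the job's runtime cost in any measurable way.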
Sorry, @fhueske, I misunderstood your test data; the keys should be skewed toward some values, while in my previous test the keys were not skewed. It is complicated to calculate how many samples should be taken from a dataset to meet an a priori specified accuracy guarantee. One such algorithm, which I have used before, is described at http://research.microsoft.com/pubs/159275/MSR-TR-2012-18.pdf, but it does not fully fit the case where the keys are skewed.
Thanks @ChengXiangLi, and thanks for your patience!
Updated at 11/05
This PR handles range partitioning with an automatically sampled data distribution. Currently, the PR rewrites the PlanNode DAG in the Optimizer to support the range partitioner; the rewrite logic is as follows. To be noted, the PartitionNode and PartitionIDRemoverNode are switched in our final implementation.
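The rewritten dataflow discussed in this thread can be sketched, very roughly, as a plain-Java simulation. Everything here is an illustrative stand-in for the Flink operators named above (sampling, RangeBoundaryBuilder, AssignRangeIndex, PartitionIDRemoveWrapper), not the actual implementation; for simplicity the "sample" step sorts the full input instead of sampling it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hedged end-to-end sketch of the rewritten plan: derive boundaries from the
// data, tag each record with a partition index, group by the index, drop it.
public class RangePartitionFlow {

    static List<List<Integer>> rangePartition(List<Integer> input, int numPartitions) {
        // 1) "Sampling": here we cheat and sort the full input; the PR samples instead.
        List<Integer> sorted = new ArrayList<>(input);
        Collections.sort(sorted);

        // 2) RangeBoundaryBuilder: pick numPartitions - 1 boundary keys.
        int[] boundaries = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            boundaries[i - 1] = sorted.get(i * sorted.size() / numPartitions);
        }

        // 3) AssignRangeIndex + partition on the index (binary search per record).
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int v : input) {
            int pos = Arrays.binarySearch(boundaries, v);
            partitions.get(pos >= 0 ? pos : -(pos + 1)).add(v);
        }

        // 4) PartitionIDRemoveWrapper: the index is implicit in the list position here.
        return partitions;
    }
}
```

Each partition then holds a contiguous key range, which is what makes globally sorted output (and, with a good sample, balanced partition sizes) possible.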