[FLINK-16012][runtime] Reduce the default number of buffers per channel from 2 to 1 #11088
Conversation
[FLINK-16012][runtime] Reduce the default number of buffers per channel from 2 to 1. To speed up checkpoints in the case of back pressure, this commit reduces the amount of in-flight data by lowering the default number of buffers per channel from 2 to 1. Together with the default of 8 floating buffers, one buffer per channel should be enough for most cases without performance regression, and it can be increased if any performance issues arise.
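For reference, the defaults touched by this change correspond to the following configuration keys (a minimal sketch, assuming the key names in use at the time of this PR; adjust to your Flink version):

```java
import org.apache.flink.configuration.Configuration;

public class BufferDefaultsExample {
    public static void main(String[] args) {
        Configuration config = new Configuration();
        // This PR lowers the default exclusive buffers per channel from 2 to 1.
        config.setString("taskmanager.network.memory.buffers-per-channel", "1");
        // The default of 8 floating buffers per input gate is kept unchanged.
        config.setString("taskmanager.network.memory.floating-buffers-per-gate", "8");
        System.out.println(config);
    }
}
```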
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request.
Automated Checks: last check on commit 424b744 (Thu Sep 23 18:02:23 UTC 2021). Mention the bot in a comment to re-run the automated checks.
Review Progress: please see the Pull Request Review Guide for a full explanation of the review process. The bot tracks the review progress through labels, which are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required. The @flinkbot bot also supports a set of commands.
Theoretically, reducing the number of buffers may break the data processing pipeline, which can influence performance. For verification, I have tested the change using the Flink micro benchmarks and a simple benchmark job. Unfortunately, regressions are seen in both tests. For the micro benchmarks, I compared runs with 2 buffers against runs with 1 buffer (because of unstable results, I ran each test three times), and the results show about a 20% performance regression. For the benchmark job, there are also regressions (about 10% - 20%) in some cases where the number of input channels is small, for example 2 input channels, which means the number of usable buffers is limited.
@wsry thanks for the results. Have you tried increasing the floating buffers to compensate for the fewer exclusive buffers? It would also be nice to check the micro benchmark results on our benchmarking machine, but I think @zhijiangW is already doing that.
Yes, I executed the micro benchmarks on our benchmarking machine yesterday, and almost all of the network throughput tests show a slight regression, except for the data skew case. I asked @wsry to also verify the results when reducing only the receiver side, keeping the sender-side buffers the same as before. Those results still show a slight regression, but better than reducing both the receiver and sender sides. As a next step, we can test how many buffers are really needed on both the output and input sides at different scales to avoid regression. Then we can decide how to tune this setting, based on job scale or a dynamic option.
I did not increase the number of floating buffers. I guess we can make up for the regression if we increase the number of floating buffers (maybe the only problem is that floating buffers cannot always be obtained). I'd like to do some tests to see how many floating buffers we need to compensate for the regression in the micro benchmarks.
We previously had the JIRA ticket FLINK-9142 about reducing the exclusive buffers from 2 to 1 and putting them into the floating buffer pool instead, so that the total number of buffers on the receiver side stays unchanged. We could also verify this approach before determining how many floating buffers are required to keep the performance. If the performance regression still exists with the same total buffer amount, then we should consider the cost difference between the exclusive and floating buffer roles.
Keep in mind that the purpose of this is to reduce the amount of in-flight data, so just moving the exclusive buffers to the floating pool does not help us with this. I was hoping that in setups with 1000 channels, decreasing the exclusive buffers from 2000 to 1000 while increasing the floating buffers from the default 8 to, for example, 30 might give us a good balance.
That's also true for exclusive buffers: if there is not enough memory/buffers on the TM, the network stack will allocate only one exclusive buffer per channel. That also means that for some users (I would guess quite a lot of them), decreasing the exclusive buffers from 2 to 1 will not reduce the amount of in-flight records.
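To make that balance concrete, here is a rough back-of-envelope sketch of the in-flight data upper bound for the 1000-channel example (my own arithmetic, assuming the default 32 KB memory segment size):

```java
public class InFlightDataEstimate {
    // Upper bound on in-flight receiver data: exclusive buffers per channel plus the shared floating pool.
    static long inFlightBytes(int channels, int exclusivePerChannel, int floating, int segmentSizeBytes) {
        return (long) (channels * exclusivePerChannel + floating) * segmentSizeBytes;
    }

    public static void main(String[] args) {
        int segment = 32 * 1024; // default taskmanager.memory.segment-size
        // 1000 channels, 2 exclusive + 8 floating: about 62 MB of potential in-flight data
        System.out.println(inFlightBytes(1000, 2, 8, segment) / (1024 * 1024) + " MB");
        // 1000 channels, 1 exclusive + 30 floating: about 32 MB, roughly half
        System.out.println(inFlightBytes(1000, 1, 30, segment) / (1024 * 1024) + " MB");
    }
}
```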
Yes, that is not the formal solution for our final purpose. My suggestion was only to analyze the performance regression step by step for better tracing. I suspect the performance might be sensitive to two factors: one is the total amount of buffers on the receiver side, and the other is the ratio of exclusive to floating buffers. So we could keep the total buffer amount the same and only change the buffer roles. If there is no regression in the benchmarks at all, then we can be relieved and focus only on the total floating buffer amount. If not, we should also keep an eye on whether there is extra cost for requesting floating buffers. I hope the second factor turns out not to be a concern.
I have tested reducing the exclusive buffers from 2 to 1 and putting them into the floating buffer pool instead. It has no evident influence on performance.
I did some further performance tests using the micro benchmarks with the changes in #11155, under the following settings (three rounds each, five rounds for the last variant):

- 2 buffers per outgoing channel / 2 buffers per incoming channel / 8 floating buffers
- 1 buffer per outgoing channel / 1 buffer per incoming channel / 8 floating buffers
- 1 buffer per outgoing channel / 1 buffer per incoming channel / 508 floating buffers
- 1 buffer per outgoing channel / 1 buffer per incoming channel / 1008 floating buffers
- 1 buffer per incoming channel only, keeping the upstream unchanged

From the results we can see that if we reduce the buffers per channel on both the upstream and downstream sides, then we need almost the same number of floating buffers to compensate for the reduced ones and mitigate the performance regression. If we reduce the exclusive buffers of the downstream side only and keep the upstream unchanged, there is no visible regression, and the performance of the fast-flush cases (1ms flushTimeout) becomes even better. IMO, the reason is that less credit prevents the upstream from sending too many small buffers to the downstream, which can increase the TPS. What do you think? Should we reduce the number of buffers statically? @pnowojski @zhijiangW
Thanks for the updates @wsry! If we only adjust the exclusive buffers from 2 to 1 on the receiver side and still keep 2 on the sender side, then the micro benchmarks show no regressions. We could further verify this with cluster jobs, because the network transport delay might behave differently than in the micro benchmarks. If the conclusion is the same, then I think it is reasonable to make this change for the receiver side; we can save almost 25% of the in-flight buffers. But we might need to introduce another buffer setting parameter, which would break compatibility. Regarding the upstream side, I guess it might not be feasible to give a static buffer setting at setup time, as it likely depends on the data distribution among subpartitions. The hash partitioner might behave similarly to the rebalance partitioner in practice. We can further consider a dynamic way for the sender side in another ticket.
Yes, thank you for the results @wsry. Indeed, it looks like the idea of decreasing the default exclusive buffers on the receiver side to 1 is worth pursuing. Actually, I'm a bit surprised that reducing the exclusive buffers to 1 on the sender side causes problems. The number of 2 exclusive buffers + ~8 floating buffers was estimated to provide good bandwidth for single-channel performance. Assuming a 1ms round trip for requesting/receiving a network credit, 10 buffers alone should support ~320MB/s of network traffic (32KB * 10 / 1ms = 320MB/s). What is the traffic of our network benchmarks? 28161 ops/ms * 32 bytes/op = ~900MB/s. From this, I would expect 30 floating buffers to suffice on your machine, unless the round trip of credit assignment on localhost is more than 1ms. Regardless of that, as I discussed with @zhijiangW offline: for many scenarios (probably most) we can accept a performance regression in favour of smaller memory requirements and faster checkpointing under back pressure - at least as an alternative configuration of Flink. In other words, we could document two recommended configuration setups.
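As a sanity check on the arithmetic above, a minimal sketch (my own, using only the figures quoted in this thread):

```java
public class CreditBandwidthEstimate {
    public static void main(String[] args) {
        // 2 exclusive + ~8 floating = 10 buffers of 32 KB, each recycled once per 1 ms credit round trip.
        double buffers = 10;
        double bufferBytes = 32 * 1024;
        double roundTripSeconds = 0.001;
        double creditLimited = buffers * bufferBytes / roundTripSeconds;
        // 10 * 32768 / 0.001 = 327,680,000 bytes/s, matching the ~320 MB/s estimate above.
        System.out.printf("credit-limited bandwidth: %.0f MB/s%n", creditLimited / 1_000_000);

        // Benchmark traffic: 28161 ops/ms with 32-byte records.
        double benchmark = 28161.0 * 1000 * 32;
        // 28161 * 1000 * 32 = 901,152,000 bytes/s, matching the ~900 MB/s figure above.
        System.out.printf("benchmark traffic: %.0f MB/s%n", benchmark / 1_000_000);
    }
}
```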
@pnowojski IMO, the reason why reducing the buffers on the sender side causes problems is that our benchmark can request a large number of buffers in a really short time. Our benchmark uses a round-robin partitioner and the record size is fixed, so almost all subpartitions fill up their current buffer and need to request the next one at the same time. If there are not enough buffers at that moment, the sender thread needs to wait for the previously sent buffers to be recycled, which takes some time. I agree that we need to document the configuration setups you mentioned. Apart from reducing the number of buffers, users can also reduce the size of the network buffers to reduce in-flight data in cases of back pressure (or maybe we can reduce the buffer size dynamically).
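A minimal sketch of that effect (a hypothetical selector, not Flink's actual implementation): with a round-robin pattern and fixed-size records, every subpartition crosses its buffer boundary within a few records of the others, so the new-buffer requests arrive in a burst.

```java
/** Hypothetical round-robin selector: records spread evenly, so all channel buffers fill in lockstep. */
public class RoundRobinSelector {
    private final int numChannels;
    private int next = -1;

    public RoundRobinSelector(int numChannels) {
        this.numChannels = numChannels;
    }

    public int selectChannel() {
        next = (next + 1) % numChannels;
        return next;
    }

    public static void main(String[] args) {
        int channels = 4, recordBytes = 32, bufferBytes = 32 * 1024;
        RoundRobinSelector selector = new RoundRobinSelector(channels);
        int[] bytesInCurrentBuffer = new int[channels];
        for (int record = 1; record <= channels * (bufferBytes / recordBytes); record++) {
            int ch = selector.selectChannel();
            bytesInCurrentBuffer[ch] += recordBytes;
            if (bytesInCurrentBuffer[ch] == bufferBytes) {
                // With fixed-size records, these requests cluster on consecutive records (4093..4096 here).
                System.out.println("record " + record + ": channel " + ch + " requests a new buffer");
                bytesInCurrentBuffer[ch] = 0;
            }
        }
    }
}
```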
I guess you might be right @wsry. If the distribution were random, the buffers would fill up gradually, and fewer floating buffers should suffice for smooth progress. We could also explore this option @wsry - replacing the round-robin partitioner in the benchmarks with a random one.
Sorry for the late update. I implemented an imbalanced partitioner which emits records unevenly across the channels and re-ran the benchmarks with the following settings:

- 2 buffers per channel and 8 floating buffers per gate
- 1 buffer per channel and 8 floating buffers per gate
- 1 buffer per channel and 128 floating buffers per gate

@pnowojski @zhijiangW What do you think about the results? Should we reduce both the upstream and downstream buffers per channel to 1, or should we only reduce the downstream buffers (in which case we may need to add a new config option)?
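For illustration, a skewed selector along these lines might look as follows (a hypothetical sketch; the 80/20 skew ratio is my assumption, not the ratio used in the actual test):

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical imbalanced selector: sends most records to channel 0, the rest uniformly elsewhere. */
public class ImbalancedSelector {
    private final int numChannels;
    private final double hotRatio; // e.g. 0.8 sends ~80% of records to the hot channel

    public ImbalancedSelector(int numChannels, double hotRatio) {
        this.numChannels = numChannels;
        this.hotRatio = hotRatio;
    }

    public int selectChannel() {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        if (numChannels == 1 || rnd.nextDouble() < hotRatio) {
            return 0; // hot channel
        }
        return 1 + rnd.nextInt(numChannels - 1); // remaining channels share the rest
    }

    public static void main(String[] args) {
        ImbalancedSelector selector = new ImbalancedSelector(4, 0.8);
        int[] counts = new int[4];
        for (int i = 0; i < 100_000; i++) {
            counts[selector.selectChannel()]++;
        }
        // Roughly 80% of records land on channel 0, so one hot subpartition dominates buffer usage.
        System.out.println(java.util.Arrays.toString(counts));
    }
}
```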
Thanks for the update @wsry, good that we at least confirmed what the issue is here, and good job on figuring this out. Let's maybe try to gather the data for 0 exclusive buffers first. Since 1 exclusive buffer can cause performance regressions in some rare cases, we might prefer in the end to just advertise two main setups: full throughput with 2 exclusive buffers, and low throughput with 1/0 exclusive buffers. Anything in between would be for power users to self-tune.
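In configuration terms, the two setups might look roughly like this (a sketch with the raw keys; the floating-buffer values are today's defaults, not tuned recommendations):

```java
import org.apache.flink.configuration.Configuration;

public class RecommendedSetups {
    /** Full-throughput setup: the current defaults, 2 exclusive buffers per channel. */
    static Configuration fullThroughput() {
        Configuration c = new Configuration();
        c.setString("taskmanager.network.memory.buffers-per-channel", "2");
        c.setString("taskmanager.network.memory.floating-buffers-per-gate", "8");
        return c;
    }

    /** Low in-flight-data setup: fewer exclusive buffers, faster checkpoints under back pressure. */
    static Configuration lowInFlightData() {
        Configuration c = new Configuration();
        // 0 on the receiver side would need the FLINK-16641 deadlock fix first.
        c.setString("taskmanager.network.memory.buffers-per-channel", "1");
        c.setString("taskmanager.network.memory.floating-buffers-per-gate", "8");
        return c;
    }

    public static void main(String[] args) {
        System.out.println(fullThroughput());
        System.out.println(lowInFlightData());
    }
}
```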
@pnowojski I will update the results with 0 exclusive buffers (downstream only) later, after the deadlock problem is solved by FLINK-16641. For the upstream side, at least 1 buffer is currently needed for each subpartition, so 0 exclusive buffers is not enough. (Maybe an easy way to fix that would be to further split buffers into smaller ones when not enough buffers are available.)
I meant 0 exclusive buffers for the receiver. Yes, the sender needs at least 1 buffer :)
Let's not go down that path just now. An easy hot fix for a user is to configure smaller buffers.
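For reference, that hot fix is just lowering the network memory segment size (a sketch using the raw configuration key; 16kb is an arbitrary example value, not a recommendation):

```java
import org.apache.flink.configuration.Configuration;

public class SmallerBuffersHotFix {
    public static void main(String[] args) {
        Configuration config = new Configuration();
        // Halving the segment size halves the in-flight data per buffer,
        // at the cost of more per-buffer overhead. Default is 32kb.
        config.setString("taskmanager.memory.segment-size", "16kb");
        System.out.println(config);
    }
}
```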
Agreed, I think that is the right way to go. After we solve the deadlock issue for 0 exclusive buffers on the receiver side, we can accept any settings from users and only explain how the different settings impact performance. Adjusting the default floating buffers might not make sense ATM, because we do not yet fully know how many floating buffers are really needed at different scales and partitioner modes to avoid impacting performance. I guess we will eventually need a separate exclusive buffer configuration for the receiver side, distinct from the sender side. Then users can tune the exclusive buffers to 0 for the receiver side separately, but keep at least 1 exclusive buffer for the sender side. Even the default exclusive buffers for the receiver could be reduced to 1 instead of 2, as we have already verified that this does not impact performance.
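Such a receiver-side option could be sketched with Flink's ConfigOptions builder roughly like this (the option name and default here are purely hypothetical, not an agreed design):

```java
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

public class ReceiverBufferOptions {
    /** Hypothetical option: exclusive buffers per incoming channel, independent of the sender side. */
    public static final ConfigOption<Integer> RECEIVER_EXCLUSIVE_BUFFERS_PER_CHANNEL =
            ConfigOptions.key("taskmanager.network.memory.receiver-exclusive-buffers-per-channel")
                    .intType()
                    .defaultValue(1)
                    .withDescription(
                            "Number of exclusive network buffers per incoming channel. "
                                    + "Set to 0 to rely on floating buffers only (requires FLINK-16641).");
}
```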
Really impressive discussion! I've learnt a lot :). |
What is the purpose of the change
To speed up checkpoints in the case of back pressure, this change reduces the amount of in-flight data by lowering the default number of buffers per channel from 2 to 1. Together with the default of 8 floating buffers, one buffer per channel should be enough for most cases without performance regression, and it can be increased if any performance issues arise.
Brief change log
Verifying this change
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation