[FLINK-4545] preparations for removing the network buffers parameter #3467

NicoK · 2017-03-03T12:08:08Z

This PR includes some preparations for following PRs that ultimately lead to removing the network buffer parameter that was hard to tune.

The fixed-size BufferPool instances were unused except for unit tests and will be replaced with bounded BufferPool instances.

zhijiangW · 2017-03-06T09:46:13Z

Hi @NicoK, I am interested in this issue, and I like how this PR uses assertions to document which locks must be held.

The framework really should manage the network buffers, because it is difficult for users to set the exact number of buffers. Our current, simple solution is to extend ResourceProfile with the total number of input and output edges of an Execution. The ResourceManager then calculates the number of buffers from that and overwrites the parameter value in the TaskManager configuration.

I know a little about this issue from what @StephanEwen mentioned before. If you have a more detailed design or plan for it, could you share it so that I can learn from it and follow the progress? Thank you!

NicoK · 2017-03-06T10:25:36Z

Hi @zhijiangW,
actually, the solution I am working on is to replace the network buffers parameter with something like "max memory in percent" and "min MB to use". So that this does not create buffer bloat in our network stack, I have started to implement limited LocalBufferPool instances which tune their size based on the actual number of outgoing and incoming channels. It is actually not much more complicated than this, and I have already started on it in my local branch at https://github.com/NicoK/flink/tree/flink-4545 - expect a new PR within the week with more details.
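To illustrate the idea of a size-limited pool whose bound follows the channel count, here is a minimal sketch. It is not the code from the linked branch: the class `BoundedPoolSketch`, the parameters `buffersPerChannel` and `extraBuffers`, and the sizing formula are all assumptions made for illustration only.

```java
/**
 * Hypothetical sketch of a size-limited buffer pool. The bound is derived
 * from the number of channels; the formula below is an assumption, not
 * the one used by Flink's LocalBufferPool.
 */
final class BoundedPoolSketch {
    private final int maxBuffers;
    private int inUse;

    BoundedPoolSketch(int numChannels, int buffersPerChannel, int extraBuffers) {
        // Assumed sizing rule: a fixed share per channel plus some slack.
        this.maxBuffers = numChannels * buffersPerChannel + extraBuffers;
    }

    int maxBuffers() {
        return maxBuffers;
    }

    /** Returns true if a buffer was granted, false once the bound is reached. */
    synchronized boolean tryRequest() {
        if (inUse >= maxBuffers) {
            return false;
        }
        inUse++;
        return true;
    }

    /** Gives one buffer back to the pool. */
    synchronized void recycle() {
        if (inUse == 0) {
            throw new IllegalStateException("no buffer is currently in use");
        }
        inUse--;
    }
}
```

With such a bound, a task with few channels cannot claim an arbitrarily large share of the global pool, which is the buffer-bloat concern raised above.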

…t partition type This removes JobVertex#connectNewDataSetAsInput(JobVertex input, DistributionPattern distPattern) and requires the developer to call JobVertex#connectNewDataSetAsInput(JobVertex input, DistributionPattern distPattern, ResultPartitionType partitionType) instead and think about the partition type to add.

These were implying a default result partition type which we want the developer to actively decide upon.

StephanEwen · 2017-03-08T20:32:58Z

I think this is a good change, merging this...

@zhijiangW The buffer management will change in some follow-up PRs: first adjusting the local pools, then the global pool. Managing buffers in a global pool can help when caching data, such as for batch jobs. But we can take suggestions for follow-up improvements as a separate thread, after this improvement is in.

wenlong88 · 2017-03-09T03:43:04Z

flink-runtime/src/main/java/org/apache/flink/runtime/io/network/buffer/LocalBufferPool.java

@@ -265,11 +281,15 @@ public String toString() {
 	// ------------------------------------------------------------------------

 	private void returnMemorySegment(MemorySegment segment) {
+		assert Thread.holdsLock(availableMemorySegments);


Hi, I have a question about the assert: I found that assertions are disabled in Java by default. Why not use an explicit synchronized(availableMemorySegments), which may be the more common usage?

NicoK · 2017-03-09

Using synchronized again would impact performance, while assertions only do so when they are enabled, which is the case in our unit tests (see https://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html#enableAssertions).
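The pattern under discussion can be sketched in isolation as follows. This is an illustrative example, not Flink's LocalBufferPool: the class `GuardedPool` and its methods are hypothetical. Only `Thread.holdsLock(Object)` and the `assert` statement are standard Java.

```java
import java.util.ArrayDeque;

/** Hypothetical pool showing a lock-documenting assertion in a private helper. */
final class GuardedPool {
    private final ArrayDeque<byte[]> availableSegments = new ArrayDeque<>();

    /** Public entry point: acquires the lock once, then delegates. */
    void recycle(byte[] segment) {
        synchronized (availableSegments) {
            returnSegment(segment);
        }
    }

    /** Private helper: the caller must already hold the lock. */
    private void returnSegment(byte[] segment) {
        // Checked only when assertions are enabled (java -ea).
        assert Thread.holdsLock(availableSegments) : "caller must hold the lock";
        availableSegments.add(segment);
    }

    /** Number of segments currently available. */
    int available() {
        synchronized (availableSegments) {
            return availableSegments.size();
        }
    }
}
```

When the JVM runs with assertions disabled (the default), the check is skipped, so the helper does not pay for a second lock acquisition. With `-ea`, a call path that forgets to take the lock fails with an AssertionError.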

This closes apache#3467

zhijiangW · 2017-03-14T04:29:58Z

@NicoK, thank you for the explanation. I have already traced the code in your local branch and look forward to your further changes to the global pool.

@StephanEwen, thanks for the further elaboration. From my understanding, each task can decide the core number of buffers in its LocalBufferPool based on its input and output channels and the configuration, and the maximum number of buffers based on the ResultPartitionType. All the LocalBufferPools together affect the total number of buffers in the NetworkBufferPool, so the maximum memory usage may need to be considered.

My concern is that the memory usage of the NetworkBufferPool should be considered before the TaskManager starts, and this part of the memory should be added to the total resources of the TaskManager.
I am willing to do that as part of my current work on Fine-grained Resource Configuration after this feature is complete.

StephanEwen · 2017-03-14T08:55:14Z

@zhijiangW Yes, let's discuss this when the feature is complete.
Our thinking so far is:

One can specify an absolute amount of network memory (similar to how one can specify an absolute amount of managed memory for batch).
If no absolute amount is specified, a relative fraction of the JVM heap will be pre-allocated as network buffers.
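The two rules above could be expressed roughly as in the sketch below. The class `NetworkMemorySizing`, its method names, and the parameters `heapFraction` and `segmentSizeBytes` are assumptions for illustration. They are not Flink configuration keys or APIs.

```java
import java.util.OptionalLong;

/** Hypothetical sizing helper: absolute amount if given, else a heap fraction. */
final class NetworkMemorySizing {

    /** Returns the number of bytes to reserve for network buffers. */
    static long networkMemoryBytes(OptionalLong absoluteBytes, double heapFraction, long heapBytes) {
        if (absoluteBytes.isPresent()) {
            // Rule 1: an explicitly configured absolute amount wins.
            return absoluteBytes.getAsLong();
        }
        // Rule 2: otherwise take a relative fraction of the JVM heap.
        return (long) (heapFraction * heapBytes);
    }

    /** Number of whole buffers of the given size that fit into the reserved memory. */
    static long bufferCount(long networkBytes, int segmentSizeBytes) {
        return networkBytes / segmentSizeBytes;
    }
}
```

For example, under these assumptions a 1 GiB heap with a fraction of 0.25 and 32 KiB segments gives 268435456 bytes, which is 8192 buffers.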

Nico Kruber added 4 commits March 3, 2017 11:46

[docs] improve some documentation around network buffers

11557c0

[hotfix][network] add some assertions documenting on which locks we rely

cd99906

[FLINK-4545] remove fixed-size BufferPool instances

8f529bb

These were unused except for unit tests and will be replaced with bounded BufferPool instances.

[FLINK-4545] remove (unused) persistent partition type

dfea1ba

Nico Kruber added 2 commits March 6, 2017 14:19

[FLINK-4545] remove unused IntermediateDataSet constructors

83d1404

These were implying a default result partition type which we want the developer to actively decide upon.

NicoK mentioned this pull request Mar 6, 2017

[FLINK-4545] use size-restricted LocalBufferPool instances for network communication #3480

Closed

wenlong88 reviewed Mar 9, 2017

StephanEwen pushed a commit to StephanEwen/flink that referenced this pull request Mar 9, 2017

[FLINK-4545] [network] remove fixed-size BufferPool instances

3cc3e3e

This closes apache#3467

StephanEwen pushed a commit to StephanEwen/flink that referenced this pull request Mar 9, 2017

[FLINK-4545] [network] remove fixed-size BufferPool instances

a0ea564

This closes apache#3467

StephanEwen pushed a commit to StephanEwen/flink that referenced this pull request Mar 9, 2017

[FLINK-4545] [network] remove fixed-size BufferPool instances

233ddb7

This closes apache#3467

asfgit closed this in 8b49ee5 Mar 9, 2017

p16i pushed a commit to p16i/flink that referenced this pull request Apr 16, 2017

[FLINK-4545] [network] remove fixed-size BufferPool instances

68c2c4f

This closes apache#3467

rmetzger added the component=Runtime/Network label Mar 14, 2019
