
Conversation

pnowojski
Contributor

@pnowojski pnowojski commented Feb 7, 2018

This big PR depends on #4552 and #5314. The main purpose of this change is to increase network throughput/performance in low latency cases. On their own, #4552 and #5314 cause a huge performance degradation at ~1ms flushing intervals (on top of Flink's already very poor performance in that case). This PR makes throughput at a ~1ms flushing interval more or less similar to that at a ~100ms flushing interval.

Quick (noisy) benchmark results:

master branch:

Benchmark    (number of output channels, flush interval)   Mode  Cnt      Score       Error   Units
networkThroughput                 1,100ms  thrpt    5  53776.816 ± 8566.861  ops/ms
networkThroughput                 100,1ms  thrpt    5    536.800 ±  821.872  ops/ms
networkThroughput              1000,100ms  thrpt    5  30679.754 ± 3737.085  ops/ms

master + credit based flow control:

Benchmark    (number of output channels, flush interval)   Mode  Cnt      Score       Error   Units
networkThroughput                 1,100ms  thrpt    5  49768.778 ± 13329.952  ops/ms
networkThroughput                 100,1ms  thrpt    5  BENCHMARK TIMEOUT! below ~150 ops/ms
networkThroughput              1000,100ms  thrpt    5  27793.594 ±  3428.951  ops/ms

credit based + low latency fixes (this PR):

Benchmark    (number of output channels, flush interval)   Mode  Cnt      Score       Error   Units
networkThroughput                 1,100ms  thrpt    5  47576.352 ± 12641.958  ops/ms
networkThroughput                 100,1ms  thrpt    5  41898.764 ±  4450.404  ops/ms
networkThroughput              1000,100ms  thrpt    5  27642.259 ±  9086.744  ops/ms

Brief change log

This last one ([FLINK-8591]) is the one commit that actually improves the performance, by allowing the sender to append records to a memory segment while the PartitionRequestQueue in Netty is busy handling/processing/flushing the previous memory segment, or while it is blocked waiting for a new credit to arrive.
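The mechanism described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Flink classes: the producer/consumer split over one shared segment mirrors the BufferBuilder/BufferConsumer idea from the PR, but the implementation details here are simplified.

```java
import java.nio.ByteBuffer;

// Simplified sketch: the producing task thread appends records through a
// builder while the Netty thread can independently read the bytes that have
// been published so far, instead of waiting for a whole finished segment.
class SharedBuffer {
    private final ByteBuffer segment = ByteBuffer.allocate(1024);
    // volatile high-water mark published by the writer, read by the consumer
    private volatile int writerPosition = 0;

    /** Producer side (task thread): append without waiting for the consumer. */
    void append(byte[] record) {
        ByteBuffer writer = segment.duplicate();
        writer.position(writerPosition);
        writer.put(record);
        writerPosition = writer.position(); // publish the new readable range
    }

    /** Consumer side (Netty thread): slice out only what was published so far. */
    ByteBuffer sliceFrom(int readerPosition) {
        ByteBuffer reader = segment.duplicate();
        reader.position(readerPosition);
        reader.limit(writerPosition);
        return reader.slice();
    }
}
```

With this split, a flush does not require requesting a fresh memory segment: the consumer simply reads up to the current high-water mark and the producer keeps appending behind it.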

Verifying this change

This change is a trivial rework ;)

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)

@pnowojski pnowojski changed the title Low latency network changes [FLINK-8581] Improve performance for low latency network Feb 8, 2018
@pnowojski pnowojski force-pushed the buffer-consumer branch 3 times, most recently from 3938b02 to 0896f88 Compare February 8, 2018 13:27
Contributor

@StefanRRichter StefanRRichter left a comment

Overall, the changes look good and I had some comments (inlined), but nothing blocking.

* {@link BufferBuilder} and there can be a different thread reading from it using {@link BufferConsumer}.
*/
@NotThreadSafe
public class BufferConsumer implements Closeable {
Contributor

Just a thought about names: this is called BufferConsumer, but it does not "consume" buffers. It coordinates the production of read slices from a shared buffer, so BufferBuilder would make more sense than this name. Even worse, this class has a build() : Buffer method :-(.

Contributor Author

Yes, I know. Can you propose some different naming scheme? BufferWriter and BufferBuilder?

* @return how much information was written to the target buffer and
* whether this buffer is full
*/
SerializationResult setNextBufferBuilder(BufferBuilder bufferBuilder) throws IOException;
Contributor

One remark from reading the code: I found it a bit surprising that a method that looks like a setter will cause the write to continue. Maybe this is better called something like continueWritingWithNextBufferBuilder, or the setter could be split from a continueWrite method?
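For illustration, the suggested split could look like the toy serializer below. All names here (setBufferBuilder, continueWriting) follow the reviewer's proposal and are assumptions, not Flink's actual RecordSerializer interface; the record is a plain byte array copied into fixed-size buffers.

```java
import java.nio.ByteBuffer;

// Toy serializer illustrating the suggested API split: setBufferBuilder()
// only assigns the target buffer, continueWriting() explicitly resumes the
// copy of the current record into it.
class ToySerializer {
    enum Result { FULL_RECORD, PARTIAL_RECORD_BUFFER_FULL }

    private byte[] record;
    private int recordOffset;
    private ByteBuffer target;

    Result addRecord(byte[] record) {
        this.record = record;
        this.recordOffset = 0;
        return continueWriting();
    }

    void setBufferBuilder(ByteBuffer target) { // just a setter, no side effects
        this.target = target;
    }

    Result continueWriting() {                 // explicit "resume the copy" step
        int n = Math.min(target.remaining(), record.length - recordOffset);
        target.put(record, recordOffset, n);
        recordOffset += n;
        return recordOffset == record.length
            ? Result.FULL_RECORD
            : Result.PARTIAL_RECORD_BUFFER_FULL;
    }
}
```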

bufferBuilders[targetChannel] = Optional.empty();

numBytesOut.inc(bufferBuilder.getWrittenBytes());
bufferBuilder.finish();
Contributor

You could combine this into numBytesOut.inc(bufferBuilder.finish()) or maybe finish() should not need to have a return value?
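The first half of the suggestion amounts to something like the following hypothetical sketch; the real BufferBuilder differs, and this only shows the shape of a finish() that returns the byte count.

```java
// Hypothetical sketch of finish() returning the written byte count, so the
// caller can write numBytesOut.inc(bufferBuilder.finish()) in one step.
class BufferBuilder {
    private int writtenBytes = 0;
    private boolean finished = false;

    void append(byte[] data) {
        if (finished) throw new IllegalStateException("already finished");
        writtenBytes += data.length;
    }

    /** Marks the builder as finished and returns the total bytes written. */
    int finish() {
        finished = true;
        return writtenBytes;
    }
}
```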

result = serializer.setNextBufferBuilder(bufferBuilder);
SerializationResult result = serializer.addRecord(record);

while (result.isFullBuffer()) {
Contributor

I wonder if this loop could not be simplified to

		while (!result.isFullRecord()) {
			tryFinishCurrentBufferBuilder(targetChannel, serializer);
			BufferBuilder bufferBuilder = requestNewBufferBuilder(targetChannel);
			result = serializer.setNextBufferBuilder(bufferBuilder);
		}

This would introduce a minor change in behaviour in cases where the end of a record falls exactly at the end of a buffer: with the change, the buffer is only finished by the next record rather than on the spot. However, this should not be a problem, because that outcome is what already happens for almost every record outside those corner cases, so the code should handle it well.
With this change, tryFinishCurrentBufferBuilder also no longer requires a return value.

Contributor Author

As we discussed, I'm not entirely sure. This "minor change" can be a significant overhead in case of many channels and large records. I don't want to risk increasing the scope of potential problems with this PR :(

Contributor

👍 Can introduce this change later after some more extensive tests.

public static void waitForAll(long timeoutMillis, Collection<Future<?>> futures) throws Exception {
long startMillis = System.currentTimeMillis();
Set<Future<?>> futuresSet = new HashSet<>();
for (Future<?> future : futures) {
Contributor

Could be replaced with addAll() or even the constructor taking collection.
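For illustration, the copy-constructor form the reviewer means uses the standard java.util.HashSet API; the helper name below is just for the example.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Future;

class FuturesExample {
    static Set<Future<?>> copyToSet(Collection<Future<?>> futures) {
        // The HashSet copy constructor replaces the explicit for-loop
        return new HashSet<>(futures);
    }
}
```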

Buffer buffer,
private boolean tryFinishCurrentBufferBuilder(
int targetChannel,
RecordSerializer<T> serializer) throws IOException {
Contributor

This code no longer throws IOException.

reader.setRegisteredAsAvailable(true);
}

private NetworkSequenceViewReader poolAvailableReader() {
Contributor

This should probably be pollAvailableReader()

} else {
// This channel was now removed from the available reader queue.
// We re-add it into the queue if it is still available
if (next.moreAvailable()) {
Contributor

This looks like the most common case, and I wonder why we cannot just peek the queue and only remove reader in the other cases?

Contributor Author

This is not the most common case: except in super-low-latency scenarios, the network is much faster than our ability to produce data.

Secondly, there are three branches that we need to cover here. As it is now, we poll the reader once and re-enqueue it only once (in the case that you commented on). With peek we would have to pop it in two places.
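The poll-once / re-enqueue-once pattern being defended here can be sketched as below. This is a simplified model under stated assumptions; the class and method names are illustrative, not the actual PartitionRequestQueue code.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Simplified model of the available-reader queue: each reader is removed
// from the queue exactly once, and re-enqueued in exactly one place.
class ReaderQueue {
    static class Reader {
        private final Queue<String> data = new ArrayDeque<>();
        Reader(String... items) { for (String s : items) data.add(s); }
        String getNext() { return data.poll(); }
        boolean moreAvailable() { return !data.isEmpty(); }
    }

    private final ArrayDeque<Reader> available = new ArrayDeque<>();

    void enqueue(Reader r) { available.add(r); }

    /** Returns the next element to send, or null if no reader is available. */
    String writeNext() {
        Reader reader = available.poll();   // single poll point ...
        if (reader == null) return null;
        String next = reader.getNext();
        if (next != null && reader.moreAvailable()) {
            available.add(reader);          // ... single re-enqueue point
        }
        return next;
    }
}
```

With peek instead of poll, the reader would have to be removed in each of the exhausted-reader branches, duplicating the removal logic.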

Contributor

👍


@Override
public synchronized ResultPartitionID getPartitionId() {
return new ResultPartitionID();
Contributor

What is the intended effect of having this synchronized? It looks like it does nothing.


@Override
public synchronized BufferProvider getBufferProvider() {
return bufferProvider;
Contributor

How does this synchronized help? The field is final, so I would assume this change is not required.

…dSerializationTest

Deduplicated code was effectively identical, but implemented in a slightly different way.
BufferConsumer will be used in the future for reading partially written
MemorySegments. On flushes, instead of requesting a new MemorySegment, the BufferConsumer
code will allow writing to continue into the partially filled MemorySegment.
notifyBuffersAvailable is a quick call that doesn't need to be executed outside of the lock
SpilledSubpartitionViewTest duplicates a lot of production logic (TestSubpartitionConsumer is
duplicated logic of LocalInputChannel and a mix of CreditBasedSequenceNumberingViewReader with PartitionRequestQueue).
Also, it seems like most of the logic is covered by SpillableSubpartitionTest.
…nputGate and handle redundant data notifications
@pnowojski
Contributor Author

I have rebased the PR and squashed the fixup commits.

@StefanRRichter
Contributor

Thanks for those very good improvements, I will merge this.
