
ARTEMIS-3282 Expose replication response batching tuning #3566

Closed
wants to merge 3 commits

Conversation

@franz1981 (Contributor Author)

This can be relevant for you, @michaelandrepearce, because with super fast kernel bypass drivers batching can cause some tail latency outliers; this tuning can prevent that from happening (by setting the batching to 0) and can be used to improve performance too.

I still need to perform some tests and see if the new default (~ the TCP MTU, i.e. 1500 bytes) is good enough for most users.

@franz1981 franz1981 marked this pull request as draft May 4, 2021 16:23
@franz1981 (Contributor Author)

I believe this can be further improved too, I just need to think twice about how :)

@franz1981 (Contributor Author)

Results show that, with a long enough stream of packet types, the switch performs slightly better than the if cascade:

Benchmark                              (mostSeenProb)  (size)  Mode  Cnt   Score   Error  Units
SwitchVsIfCascadeBenchmark.testIf                  40    8192  avgt   16  21.344 ± 1.075  ns/op
SwitchVsIfCascadeBenchmark.testIf                  60    8192  avgt   16  17.265 ± 0.091  ns/op
SwitchVsIfCascadeBenchmark.testIf                  80    8192  avgt   16  11.571 ± 0.046  ns/op
SwitchVsIfCascadeBenchmark.testSwitch              40    8192  avgt   16  19.416 ± 0.166  ns/op
SwitchVsIfCascadeBenchmark.testSwitch              60    8192  avgt   16  15.342 ± 0.074  ns/op
SwitchVsIfCascadeBenchmark.testSwitch              80    8192  avgt   16  10.942 ± 0.013  ns/op

So there's no point in using the if cascade, given that it's a bit less readable too.
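
For reference, a minimal sketch of the two dispatch styles being compared; the packet type constants below are hypothetical stand-ins, not the actual PacketImpl values:

// Minimal sketch of the two dispatch styles compared above.
// The byte constants are hypothetical stand-ins for PacketImpl type IDs.
final class PacketDispatchSketch {

   private static final byte APPEND = 1;
   private static final byte COMMIT_ROLLBACK = 2;
   private static final byte PAGE_WRITE = 3;

   // if cascade: one comparison per candidate until a match is found
   static int dispatchIf(byte type) {
      if (type == APPEND) {
         return 0;
      } else if (type == COMMIT_ROLLBACK) {
         return 1;
      } else if (type == PAGE_WRITE) {
         return 2;
      }
      return -1;
   }

   // switch: the JIT can emit a tableswitch/lookupswitch over the (dense) type values
   static int dispatchSwitch(byte type) {
      switch (type) {
         case APPEND:
            return 0;
         case COMMIT_ROLLBACK:
            return 1;
         case PAGE_WRITE:
            return 2;
         default:
            return -1;
      }
   }
}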


franz1981 commented May 6, 2021

I was expecting that exposing the batching size would give a good speedup with kernel bypass drivers, but it seems not: instead I see that the current implementation can hide a subtle perf regression with paging/large messages.
I'm going to perform some tests to verify it, but it seems related to the fact that, on the backup, large message writes and page writes still happen on the Netty thread, and given that they perform synchronous operations on file channels (some form of FileChannel::write) they can delay for some time any already batched responses to be sent back to the master broker.

@franz1981 (Contributor Author)

Preliminary tests using paging don't show such a regression anyway:
no batching

**************
RUN 1	EndToEnd Throughput: 10626 ops/sec
**************
EndToEnd SERVICE-TIME Latencies distribution in MICROSECONDS
mean               1522.66
min                 129.54
50.00%             1376.26
90.00%             1507.33
99.00%             5734.40
99.90%            10747.90
99.99%            18350.08
max               26214.40
count               160000

batching:

**************
RUN 1	EndToEnd Throughput: 11202 ops/sec
**************
EndToEnd SERVICE-TIME Latencies distribution in MICROSECONDS
mean               1446.31
min                 111.10
50.00%             1376.26
90.00%             1482.75
99.00%             3784.70
99.90%             8224.77
99.99%            14024.70
max               17563.65
count               160000

Same scenario: 16 producers/consumers sending 100-byte persistent messages to 16 queues, with replication, against a broker that's already paging due to

./artemis producer --message-count 22000 --message-size 100000

During the whole load test the broker is paging over the 16 queues != TEST, and batching still seems beneficial.


franz1981 commented May 6, 2021

Tests have shown something interesting going on on the backup side, although it's not a regression:
[image]
In violet there are calls to FileChannel::write due to handlePageWriteEvent and FileChannelImpl::force due to handlePageEvent. The latter doesn't seem right, considering what we deliver with common journal writes, i.e. writes are performed asynchronously and we send the response back before hitting the journal. @clebertsuconic wdyt?
I'm thinking of preventing the fdatasync due to paging from happening in the Netty event loop...
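
A minimal sketch of that idea (assumed names, not the actual ReplicationEndpoint code): hand the blocking force() off to a dedicated executor so the event loop can keep flushing already-batched responses:

import java.nio.channels.FileChannel;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: offload the page fdatasync so it doesn't run on the Netty event loop.
final class PageSyncOffloadSketch {

   private final ExecutorService syncExecutor = Executors.newSingleThreadExecutor();

   // Called from the event loop: schedule the blocking force() elsewhere and return immediately,
   // so already-batched replication responses aren't delayed behind the sync.
   void requestPageSync(FileChannel pageChannel, Runnable onSynced) {
      syncExecutor.execute(() -> {
         try {
            pageChannel.force(false);   // the blocking fdatasync happens off the event loop
            onSynced.run();             // e.g. send the replication response back to the live broker
         } catch (java.io.IOException e) {
            // real code would propagate the failure over the replication channel
            e.printStackTrace();
         }
      });
   }
}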

@franz1981 (Contributor Author)

The last commit isn't giving the nice speedup I was expecting:

**************
RUN 1	EndToEnd Throughput: 11155 ops/sec
**************
EndToEnd SERVICE-TIME Latencies distribution in MICROSECONDS
mean               1438.52
min                 116.74
50.00%             1376.26
90.00%             1482.75
99.00%             3424.26
99.90%             7143.42
99.99%            15400.96
max               17825.79
count               160000

Latencies are on par with normal batching + fdatasync, although the box used has a very fast disk and the amount of paged data wasn't enough to make the fdatasync take long enough.

@@ -221,6 +221,8 @@ public void handlePacket(final Packet packet) {
          handleCommitRollback((ReplicationCommitMessage) packet);
          break;
       case PacketImpl.REPLICATION_PAGE_WRITE:
+         // potential blocking I/O operation! flush existing packets to save long tail latency
+         endOfBatch();
@franz1981 (Contributor Author)

@clebertsuconic @gtully
I'm not yet sure about it: a page write can be a very fast operation depending on the amount of data to be written, i.e. it's really just the cost of copying some buffer X times. Flushing any existing batch is just to ensure no further delay affects the overall response time of replication, but if the batch is too small, poor network utilization will still hurt scalability and average latencies.

Contributor

I think if blocking ops are removed, there is no need to eagerly flush the batch here. However, I really don't see how a user can choose a value for maxReplicaResponseBatchBytes. I think it has to be automatic or automagical, based on some limit on what can be read.
5.x has optimiseAck on by default, at 70% of the prefetch for OpenWire consumers. That covers the auto-ack case and only sends an actual ack every X non-persistent messages. I wonder if something similar based on confirmation-window could work here. Though the endpoint probably has no way of knowing what is set on the other end, I guess. Is there a relationship between the confirmation window and responses already?

@franz1981 (Contributor Author) May 12, 2021

I really don't see how a user can choose a value for maxReplicaResponseBatchBytes

It can be 0 for users that care about 99.XXX percentile latencies and have configured kernel bypass drivers.
It could be the MTU size for users that know it, and it should be -1 for "common" users (right now I've chosen 1500, which is the typical MTU size).
I think it's a very low-level config really, and it remains to be seen how useful it can be: I'm still in the process of validating it before dropping the "draft" status of the PR.
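
To make those semantics concrete, a hedged sketch of how the three values could be interpreted; the class and field names are illustrative, not the final code:

// Illustrative only: interpretation of the proposed maxReplicaResponseBatchBytes values.
final class ResponseBatchPolicySketch {

   private static final int DEFAULT_BATCH_BYTES = 1500; // ~ typical Ethernet MTU

   private final int maxBatchBytes;
   private int pendingBytes;

   ResponseBatchPolicySketch(int configuredValue) {
      // -1 -> "common" users: fall back to the MTU-sized default
      //  0 -> latency-sensitive users (e.g. kernel bypass drivers): flush every response immediately
      // >0 -> explicit threshold chosen by users who know their MTU
      this.maxBatchBytes = configuredValue == -1 ? DEFAULT_BATCH_BYTES : configuredValue;
   }

   // Returns true when the accumulated responses should be flushed to the network.
   boolean shouldFlush(int responseBytes) {
      pendingBytes += responseBytes;
      if (pendingBytes >= maxBatchBytes) {
         pendingBytes = 0;
         return true;
      }
      return false;
   }
}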

I think it has to be automatic or automagical, based on the some limit on what can be read.

From the point of view of network utilization and memory usage, just using -1 or 1500 is already a huge step forward if compared with the "previous" (pre-https://issues.apache.org/jira/browse/ARTEMIS-2877) behaviour.

Size of Ethernet frame - 24 Bytes
Size of IPv4 Header (without any options) - 20 bytes
Size of TCP Header (without any options) - 20 Bytes

Total size of an Ethernet Frame carrying an IP Packet with an empty TCP Segment - 24 + 20 + 20 = 64 bytes

When the packet size is > MTU, the TCP packets are going to be fragmented, but that's fine because it amortizes the syscall cost while maximizing network usage too.
Just sending responses one by one, instead, means a ~3X overhead of data for each response sent, which hurts both latencies and CPU/network usage.
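As a worked example (assuming a replication response payload of roughly 30 bytes, an illustrative figure rather than a measured one): with the 64 bytes of frame/IP/TCP overhead above, an unbatched send puts about (64 + 30) / 30 ≈ 3X the useful data on the wire, whereas batching, say, 40 such responses into a single ~1200-byte write amortizes those 64 bytes to under 2 bytes per response.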

I wonder if something similar based on confirmation-window could work here

No idea; IIRC the replication flow of packets doesn't obey any of the other cluster connection channel packet flow rules, no duplicate checks, no confirmation window or anything similar. @clebertsuconic can you confirm?

Contributor

For sure batching is good; what I am struggling with is the use case for setting a limit on accumulating, because the other end is waiting, and there also needs to be some timeout in case the limit is not reached. Batching on low utilisation seems unfair on the sender in this case, and the sender is blocking. I need to peek more to see what -1 does :-) thanks for the detail.

Contributor

OK, I understand better: the batching is limited by the consumed buffers, so before the next read there will always be a batch end.
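
For context, this is the standard Netty pattern that guarantees a batch end per read loop: write during channelRead and flush once in channelReadComplete (a generic sketch, not the actual Artemis handler):

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Generic sketch of the "batch per read loop" pattern: responses written while draining
// the inbound buffers are flushed once the read loop ends, so a batch end is guaranteed
// before the next read regardless of any byte threshold.
public class BatchedResponseHandlerSketch extends ChannelInboundHandlerAdapter {

   @Override
   public void channelRead(ChannelHandlerContext ctx, Object msg) {
      Object response = process(msg);   // handle the replicated packet
      ctx.write(response);              // queue the response, no syscall yet
   }

   @Override
   public void channelReadComplete(ChannelHandlerContext ctx) {
      ctx.flush();                      // end of the read batch: one write syscall for all queued responses
   }

   private Object process(Object msg) {
      return msg; // placeholder for the real packet handling
   }
}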

@franz1981 franz1981 force-pushed the ARTEMIS-3282 branch 2 times, most recently from 31819a3 to 0d6d57a Compare May 20, 2021 14:45
@franz1981 franz1981 marked this pull request as ready for review May 20, 2021 14:50
@franz1981 (Contributor Author)

I think this one is in good shape now and it's fixing the bug of sync'ing on the replica too @clebertsuconic

@franz1981 (Contributor Author)

I've decided NOT to expose the response batch size, but instead to prevent the sync on paging/large messages from happening; I'm going to create a separate PR to address this.

@franz1981 franz1981 closed this Jun 9, 2021