
ARTEMIS-3282 Expose replication response batching tuning #3566

Closed
wants to merge 3 commits

Conversation

@franz1981 (Contributor Author)

This can be relevant for you, @michaelandrepearce, because with super fast kernel bypass drivers batching can cause some tail latency outliers; this tuning can prevent that from happening (by setting the batching to 0) and can be used to improve performance too.

I still need to perform some tests and see if the new default (~ the TCP MTU, i.e. 1500 bytes) is good enough for most users.

@franz1981 franz1981 marked this pull request as draft May 4, 2021 16:23
@franz1981 (Contributor Author)

I believe this can be further improved too, I just need to think twice about how :)

@franz1981 (Contributor Author)

Results show that, with a long enough stream of packet types, the switch performs slightly better than the if cascade:

Benchmark                              (mostSeenProb)  (size)  Mode  Cnt   Score   Error  Units
SwitchVsIfCascadeBenchmark.testIf                  40    8192  avgt   16  21.344 ± 1.075  ns/op
SwitchVsIfCascadeBenchmark.testIf                  60    8192  avgt   16  17.265 ± 0.091  ns/op
SwitchVsIfCascadeBenchmark.testIf                  80    8192  avgt   16  11.571 ± 0.046  ns/op
SwitchVsIfCascadeBenchmark.testSwitch              40    8192  avgt   16  19.416 ± 0.166  ns/op
SwitchVsIfCascadeBenchmark.testSwitch              60    8192  avgt   16  15.342 ± 0.074  ns/op
SwitchVsIfCascadeBenchmark.testSwitch              80    8192  avgt   16  10.942 ± 0.013  ns/op

So there's no point in using the if cascade, given that it's a bit less readable too.
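
For reference, a minimal sketch of the two dispatch styles being compared; the packet type constants below are hypothetical stand-ins, not the actual PacketImpl values:

// Minimal sketch of the two dispatch styles compared above.
// The byte constants are hypothetical stand-ins for PacketImpl type IDs.
final class PacketDispatchSketch {

   private static final byte APPEND = 1;
   private static final byte COMMIT_ROLLBACK = 2;
   private static final byte PAGE_WRITE = 3;

   // if cascade: one comparison per candidate until a match is found
   static int dispatchIf(byte type) {
      if (type == APPEND) {
         return 0;
      } else if (type == COMMIT_ROLLBACK) {
         return 1;
      } else if (type == PAGE_WRITE) {
         return 2;
      }
      return -1;
   }

   // switch: the JIT can emit a tableswitch/lookupswitch over the (dense) type values
   static int dispatchSwitch(byte type) {
      switch (type) {
         case APPEND:
            return 0;
         case COMMIT_ROLLBACK:
            return 1;
         case PAGE_WRITE:
            return 2;
         default:
            return -1;
      }
   }
}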


franz1981 commented May 6, 2021

I was expecting that exposing the batching size would give a good speedup with kernel bypass drivers, but it seems not: instead I see that the current implementation can hide a subtle perf regression with paging/large messages.
I'm going to perform some tests to verify it, but it seems related to the fact that, on the backup, large message writes and page writes still happen on the Netty thread, and given that they perform synchronous operations on file channels (some form of FileChannel::write) they can delay for some time any already batched responses to be sent back to the master broker.

@franz1981 (Contributor Author)

Preliminary tests using paging don't show such a regression anyway:
no batching

**************
RUN 1	EndToEnd Throughput: 10626 ops/sec
**************
EndToEnd SERVICE-TIME Latencies distribution in MICROSECONDS
mean               1522.66
min                 129.54
50.00%             1376.26
90.00%             1507.33
99.00%             5734.40
99.90%            10747.90
99.99%            18350.08
max               26214.40
count               160000

batching:

**************
RUN 1	EndToEnd Throughput: 11202 ops/sec
**************
EndToEnd SERVICE-TIME Latencies distribution in MICROSECONDS
mean               1446.31
min                 111.10
50.00%             1376.26
90.00%             1482.75
99.00%             3784.70
99.90%             8224.77
99.99%            14024.70
max               17563.65
count               160000

Same scenario: 16 producers/consumers sending 100-byte persistent messages to 16 queues, with replication, against a broker that's already paging due to

./artemis producer --message-count 22000 --message-size 100000

During the whole load test the broker is paging over the 16 queues != TEST, and batching still seems beneficial.


franz1981 commented May 6, 2021

Tests have shown something interesting going on on the backup side, although it's not a regression:
[image]
In violet there are calls to FileChannel::write due to handlePageWriteEvent and FileChannelImpl::force due to handlePageEvent. The latter doesn't seem right, considering what we deliver with common journal writes, i.e. writes are performed asynchronously and we send the response back before hitting the journal. @clebertsuconic wdyt?
I'm thinking of preventing the fdatasync due to paging from happening in the Netty event loop...
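
A minimal sketch of that idea (assumed names, not the actual ReplicationEndpoint code): hand the blocking force() off to a dedicated executor so the event loop can keep flushing already-batched responses:

import java.nio.channels.FileChannel;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: offload the page fdatasync so it doesn't run on the Netty event loop.
final class PageSyncOffloadSketch {

   private final ExecutorService syncExecutor = Executors.newSingleThreadExecutor();

   // Called from the event loop: schedule the blocking force() elsewhere and return immediately,
   // so already-batched replication responses aren't delayed behind the sync.
   void requestPageSync(FileChannel pageChannel, Runnable onSynced) {
      syncExecutor.execute(() -> {
         try {
            pageChannel.force(false);   // the blocking fdatasync happens off the event loop
            onSynced.run();             // e.g. send the replication response back to the live broker
         } catch (java.io.IOException e) {
            // real code would propagate the failure over the replication channel
            e.printStackTrace();
         }
      });
   }
}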

@franz1981 (Contributor Author)

The last commit isn't giving the nice speedup I was expecting:

**************
RUN 1	EndToEnd Throughput: 11155 ops/sec
**************
EndToEnd SERVICE-TIME Latencies distribution in MICROSECONDS
mean               1438.52
min                 116.74
50.00%             1376.26
90.00%             1482.75
99.00%             3424.26
99.90%             7143.42
99.99%            15400.96
max               17825.79
count               160000

Latencies are on par with normal batching + fdatasync, although the box used has a very fast disk and the amount of paged data wasn't enough to make the fdatasync take long enough.

@@ -221,6 +221,8 @@ public void handlePacket(final Packet packet) {
          handleCommitRollback((ReplicationCommitMessage) packet);
          break;
       case PacketImpl.REPLICATION_PAGE_WRITE:
+         // potential blocking I/O operation! flush existing packets to save long tail latency
+         endOfBatch();
@franz1981 (Contributor Author)

@clebertsuconic @gtully
I'm not yet sure about it: a page write can be a very fast operation depending on the amount of data to be written, i.e. it's really just the cost of copying some buffer X times. Flushing any existing batch is just to ensure no further delay affects the overall response time of replication, but if the batch is too small, poor network utilization will still hurt scalability and average latencies.

Contributor

I think if blocking ops are removed, there is no need to eagerly flush the batch here. However, I really don't see how a user can choose a value for maxReplicaResponseBatchBytes. I think it has to be automatic or automagical, based on some limit on what can be read.
5.x has optimiseAck on by default, at 70% of the prefetch for OpenWire consumers. That covers the auto-ack case and only sends an actual ack every X non-persistent messages. I wonder if something similar based on confirmation-window could work here. Though the endpoint probably has no way of knowing what is set on the other end, I guess. Is there a relationship between the confirmation window and responses already?

@franz1981 (Contributor Author) May 12, 2021

I really don't see how a user can choose a value for maxReplicaResponseBatchBytes

It can be 0 for users that care about 99.XXX percentile latencies and have configured kernel bypass drivers.
It could be the MTU size for users that know it, and it should be -1 for "common" users (right now I've chosen 1500, which is the typical MTU size).
I think it's a very low-level config really, and it remains to be seen how useful it can be: I'm still in the process of validating it before dropping the "draft" status of the PR.
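
To make those semantics concrete, a hedged sketch of how the three values could be interpreted; the class and field names are illustrative, not the final code:

// Illustrative only: interpretation of the proposed maxReplicaResponseBatchBytes values.
final class ResponseBatchPolicySketch {

   private static final int DEFAULT_BATCH_BYTES = 1500; // ~ typical Ethernet MTU

   private final int maxBatchBytes;
   private int pendingBytes;

   ResponseBatchPolicySketch(int configuredValue) {
      // -1 -> "common" users: fall back to the MTU-sized default
      //  0 -> latency-sensitive users (e.g. kernel bypass drivers): flush every response immediately
      // >0 -> explicit threshold chosen by users who know their MTU
      this.maxBatchBytes = configuredValue == -1 ? DEFAULT_BATCH_BYTES : configuredValue;
   }

   // Returns true when the accumulated responses should be flushed to the network.
   boolean shouldFlush(int responseBytes) {
      pendingBytes += responseBytes;
      if (pendingBytes >= maxBatchBytes) {
         pendingBytes = 0;
         return true;
      }
      return false;
   }
}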

I think it has to be automatic or automagical, based on the some limit on what can be read.

From the point of view of network utilization and memory usage, just using -1 or 1500 is already a huge step forward if compared with the "previous" (pre-https://issues.apache.org/jira/browse/ARTEMIS-2877) behaviour.

Size of Ethernet frame - 24 Bytes
Size of IPv4 Header (without any options) - 20 bytes
Size of TCP Header (without any options) - 20 Bytes

Total size of an Ethernet Frame carrying an IP Packet with an empty TCP Segment - 24 + 20 + 20 = 64 bytes

When the packet size is > MTU, the TCP packets are going to be fragmented, but that's fine because it amortizes the syscall cost while maximizing network usage too.
Just sending responses one by one, instead, means a ~3X overhead of data for each response sent, which hurts both latencies and CPU/network usage.
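As a worked example (assuming a replication response payload of roughly 30 bytes, an illustrative figure rather than a measured one): with the 64 bytes of frame/IP/TCP overhead above, an unbatched send puts about (64 + 30) / 30 ≈ 3X the useful data on the wire, whereas batching, say, 40 such responses into a single ~1200-byte write amortizes those 64 bytes to under 2 bytes per response.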

I wonder if something similar based on confirmation-window could work here

No idea; IIRC the replication flow of packets doesn't obey any of the other cluster connection channel packet flow rules, no duplicate checks, no confirmation window or anything similar. @clebertsuconic can you confirm?

Contributor

For sure batching is good; what I am struggling with is the use case for setting a limit on accumulating, because the other end is waiting, and there also needs to be some timeout in case the limit is not reached. Batching on low utilisation seems unfair on the sender in this case, and the sender is blocking. I need to peek more to see what -1 does :-) thanks for the detail.

Contributor

OK, I understand better: the batching is limited by the consumed buffers, so before the next read there will always be a batch end.
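
For context, this is the standard Netty pattern that guarantees a batch end per read loop: write during channelRead and flush once in channelReadComplete (a generic sketch, not the actual Artemis handler):

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Generic sketch of the "batch per read loop" pattern: responses written while draining
// the inbound buffers are flushed once the read loop ends, so a batch end is guaranteed
// before the next read regardless of any byte threshold.
public class BatchedResponseHandlerSketch extends ChannelInboundHandlerAdapter {

   @Override
   public void channelRead(ChannelHandlerContext ctx, Object msg) {
      Object response = process(msg);   // handle the replicated packet
      ctx.write(response);              // queue the response, no syscall yet
   }

   @Override
   public void channelReadComplete(ChannelHandlerContext ctx) {
      ctx.flush();                      // end of the read batch: one write syscall for all queued responses
   }

   private Object process(Object msg) {
      return msg; // placeholder for the real packet handling
   }
}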

@franz1981 franz1981 force-pushed the ARTEMIS-3282 branch 2 times, most recently from 31819a3 to 0d6d57a Compare May 20, 2021 14:45
@franz1981 franz1981 marked this pull request as ready for review May 20, 2021 14:50
@franz1981 (Contributor Author)

I think this one is in good shape now and it's fixing the bug of sync'ing on the replica too @clebertsuconic

@franz1981 (Contributor Author)

I've decided NOT to expose the response batch size, but instead to prevent the sync on paging/large messages from happening; I'm going to create a separate PR to address this.

@franz1981 franz1981 closed this Jun 9, 2021