ARTEMIS-3282 Expose replication response batching tuning #3566
Conversation
This can be relevant for you, @michaelandrepearce, because with super fast kernel bypass drivers batching can cause some tail latency outliers; this tuning can prevent that (by setting the batch size to 0) and can be used to improve performance too. I still need to perform some tests and see if the new default (~ TCP MTU = 1500 bytes) is good enough for most users.
I believe this can be further improved too, I just need to think twice about how :)
Results show that, with a long enough stream of packet types, there's no point in using the `if` variant, given that it's a bit less readable too.
I was expecting that exposing the batching size would give a good speedup with kernel bypass drivers, but it seems not: I see instead that the current implementation can hide a subtle perf regression with paging/large messages.
Preliminary tests using paging don't show such a regression anyway.
Same scenario: 16 producers/consumers sending 100-byte persistent messages with replication to 16 queues, vs a broker that's already paging.
During the whole load test the broker is paging over the 16 queues != TEST, and batching still seems beneficial.
Tests have shown something interesting going on on the backup side, although it's not a regression:
The last commit isn't giving the nice speedup I was expecting:
latencies are on par with normal batching + fdatasync, although the box used has a very fast disk and the amount of paged data wasn't enough to make fdatasync take long enough.
@@ -221,6 +221,8 @@ public void handlePacket(final Packet packet) {
         handleCommitRollback((ReplicationCommitMessage) packet);
         break;
      case PacketImpl.REPLICATION_PAGE_WRITE:
+        // potential blocking I/O operation! flush existing packets to avoid long tail latency
+        endOfBatch();
@clebertsuconic @gtully
I'm not yet sure about it: a page write can be a very fast operation depending on the amount of data to be written, i.e. it's just the cost of copying some buffer X times, really. Flushing any existing batch is just to ensure that no further delay affects the overall response time of replication, but if the batch is too small, poor network utilization will still hurt scalability and average latencies.
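To make the shape of this concrete, here is a minimal sketch of the batching pattern under discussion, with an eager flush before a potentially blocking operation. The class and method names are mine, not the Artemis code:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of size-bounded response batching with an eager flush
// before blocking I/O. ResponseBatcher/onResponse/send are hypothetical names.
final class ResponseBatcher {

   private final int maxBatchBytes;            // e.g. ~1500, a typical MTU
   private final List<byte[]> pending = new ArrayList<>();
   private int pendingBytes;

   ResponseBatcher(int maxBatchBytes) {
      this.maxBatchBytes = maxBatchBytes;
   }

   void onResponse(byte[] encodedResponse) {
      pending.add(encodedResponse);
      pendingBytes += encodedResponse.length;
      // maxBatchBytes == 0 degenerates to "flush every response immediately"
      if (pendingBytes >= maxBatchBytes) {
         endOfBatch();
      }
   }

   void beforeBlockingOperation() {
      // flush what we have so already-completed responses don't pay
      // the latency of the upcoming (possibly slow) page write
      endOfBatch();
   }

   void endOfBatch() {
      if (pending.isEmpty()) {
         return;
      }
      send(pending); // single write for the whole batch (left abstract here)
      pending.clear();
      pendingBytes = 0;
   }

   private void send(List<byte[]> batch) {
      // transport write, e.g. a Netty channel writeAndFlush
   }
}
```

With `maxBatchBytes == 0` the accumulation degenerates to flushing every response, which matches the tail-latency-sensitive setting discussed above.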
I think if blocking ops are removed, there is no need to eagerly flush the batch here. However, I really don't see how a user can choose a value for maxReplicaResponseBatchBytes. I think it has to be automatic or automagical, based on some limit on what can be read.
5.x has optimiseAck on by default, at 70% of the prefetch for openwire consumers. That covers the auto-ack case and only sends an actual ack every X non-persistent messages. I wonder if something similar based on confirmation-window could work here. Though the endpoint probably has no way of knowing what is set on the other end, I guess. Is there already a relationship between the confirmation window and responses?
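For readers who don't know the 5.x behaviour referenced here, a minimal illustrative sketch of deferred/cumulative acking at a fraction of the prefetch (my names; not the actual optimiseAck implementation):

```java
// Illustrative sketch of 5.x-style deferred/cumulative acking at 70% of the
// prefetch (not the actual ActiveMQ 5.x optimizeAcknowledge code).
final class DeferredAcker {

   private final int ackThreshold; // 70% of prefetch
   private int unacked;

   DeferredAcker(int prefetch) {
      this.ackThreshold = Math.max(1, (int) (prefetch * 0.7));
   }

   void onMessageConsumed() {
      if (++unacked >= ackThreshold) {
         sendCumulativeAck(unacked); // one wire-level ack covers all pending messages
         unacked = 0;
      }
   }

   private void sendCumulativeAck(int count) {
      // transport write left abstract
   }
}
```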
I really don't see how a user can choose a value for maxReplicaResponseBatchBytes
It can be 0 for users that care about 99.XXX percentile latencies and have configured kernel bypass drivers.
It could be the MTU size for users that know it, and it should be -1 for "common" users (right now I've chosen 1500, which is the typical MTU size).
I think it's a very low-level config, really, and it's yet to be seen how useful it can be: I'm still in the process of validating it before dropping the "draft" status of the PR.
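Putting those three cases together, a small sketch of the value semantics described above (parameter name and values taken from this thread; this is not a finalized Artemis API):

```java
// Sketch of the value semantics discussed in this thread for
// maxReplicaResponseBatchBytes (not a finalized Artemis configuration):
final class BatchBytesConfig {

   static final int DEFAULT_BATCH_BYTES = 1500; // ~ typical Ethernet MTU

   //   0 -> no batching: flush every replica response immediately
   //        (kernel-bypass / tail-latency-sensitive deployments)
   //  >0 -> batch responses up to this many bytes (e.g. the real MTU)
   //  -1 -> "common" users: fall back to the default
   static int effectiveBatchBytes(int configured) {
      return configured == -1 ? DEFAULT_BATCH_BYTES : configured;
   }
}
```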
I think it has to be automatic or automagical, based on some limit on what can be read.
From the point of view of network utilization and memory usage, just using -1 or 1500 is already a huge step forward compared with the "previous" (pre-https://issues.apache.org/jira/browse/ARTEMIS-2877) behaviour.
Size of Ethernet frame overhead - 24 bytes
Size of IPv4 header (without any options) - 20 bytes
Size of TCP header (without any options) - 20 bytes
Total size of an Ethernet frame carrying an IP packet with an empty TCP segment - 24 + 20 + 20 = 64 bytes
When the batch size is > MTU, the TCP packets are going to be fragmented, but that's fine: it amortizes the syscall cost while maximizing network usage too.
Sending responses one by one, by contrast, means a ~3X overhead of data for each response sent, which hurts both latencies and CPU/network usage.
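A quick back-of-the-envelope check of that ~3X figure, assuming an encoded replication response of roughly 21 bytes (the payload size here is my assumption, not a measured value):

```java
// Back-of-the-envelope overhead arithmetic for the ~3X claim above.
// Assumes a ~21-byte encoded replication response; 64 bytes is the
// per-frame header overhead quoted above (24 + 20 + 20).
public final class OverheadMath {

   public static void main(String[] args) {
      final int headersPerFrame = 24 + 20 + 20; // Ethernet + IPv4 + TCP = 64 bytes
      final int responseBytes = 21;             // assumed response payload size

      // Unbatched: one frame per response -> 64 header bytes per 21-byte payload
      double unbatchedOverhead = (double) headersPerFrame / responseBytes;
      System.out.printf("unbatched overhead: ~%.1fX the payload%n", unbatchedOverhead); // ~3.0X

      // Batched to one MTU: many responses share a single set of headers
      final int mtu = 1500;
      int responsesPerFrame = mtu / responseBytes; // ~71 responses
      double batchedOverhead = (double) headersPerFrame / (responsesPerFrame * responseBytes);
      System.out.printf("batched overhead:   ~%.2fX the payload%n", batchedOverhead); // ~0.04X
   }
}
```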
I wonder if something similar based on confirmation-window could work here
No idea; IIRC the replication packet flow doesn't obey any of the other cluster connection channel flow rules: no duplicate checks, no confirmation window, nor anything similar. @clebertsuconic can you confirm?
For sure batching is good; what I am struggling with is the use case for setting a limit on accumulating: because the other end is waiting, there also needs to be some timeout in case the limit is not reached. Batching on low utilisation seems unfair on the sender in this case, and the sender is blocking. I need to peek more to see what -1 does :-) thanks for the detail.
OK, I understand better now: the batching is limited by the consumed buffers, so before the next read there will always be a batch end.
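That boundary falls out naturally from a Netty-style pipeline: responses produced while draining one read are written unflushed, then flushed when the read completes. A minimal sketch of that shape (hypothetical handler, not the actual Artemis replication endpoint):

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Minimal sketch of "batch ends when the read buffer is drained":
// responses for packets from one read are queued unflushed, then flushed
// once Netty signals that the read has completed.
final class BatchingResponseHandler extends ChannelInboundHandlerAdapter {

   @Override
   public void channelRead(ChannelHandlerContext ctx, Object packet) {
      Object response = handle(packet); // process one replication packet
      ctx.write(response);              // queue the response, no flush yet
   }

   @Override
   public void channelReadComplete(ChannelHandlerContext ctx) {
      ctx.flush(); // end of batch: everything from this read goes out together
   }

   private Object handle(Object packet) {
      return packet; // placeholder for the real packet handling
   }
}
```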
31819a3 to 0d6d57a
I think this one is in good shape now, and it fixes the bug of sync'ing on the replica too. @clebertsuconic
I've decided NOT to expose the response batch size, but to prevent sync on paging/large messages from happening instead; I'm going to create a separate PR to address this.
https://issues.apache.org/jira/browse/ARTEMIS-3282