Commit latency is fluctuating and too high #8551
As a first step I would like to understand what is included in a commit (measured by the commit latency, related #8552) to understand what might take longer than usual. I will use this issue to summarize this.
I don't want to derail your investigation, so take this as a single data point and not a lead per se: someone mentioned that we have the wrong snapshot replication threshold for our throughput - would lots of install requests lead to a higher commit latency? Also, I wish I could remember who mentioned this 😅
This is my current assumption, since we first need to send the first snapshot chunk before we can commit the entry, even if we could send the entry to another node right away. Sending the chunk and the entry to the other node is synchronous.
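For illustration only, here is a minimal sketch of that assumption, using a hypothetical `Follower` interface (this is not the actual Zeebe/Atomix code): if the follower still needs a snapshot chunk, the entry is only appended after the chunk was acknowledged, so a slow install request directly delays the commit of that entry.

```java
import java.util.concurrent.CompletableFuture;

// Minimal sketch, assuming a hypothetical Follower interface (not the real Raft code).
public class SequentialReplicationSketch {

  interface Follower {
    CompletableFuture<Void> install(byte[] snapshotChunk); // hypothetical

    CompletableFuture<Void> append(byte[] entry); // hypothetical
  }

  static CompletableFuture<Void> replicate(
      final Follower follower, final byte[] pendingChunk, final byte[] entry) {
    if (pendingChunk != null) {
      // the entry is only sent after the chunk was acknowledged, so a slow install
      // directly adds to the commit latency of that entry
      return follower.install(pendingChunk).thenCompose(ok -> follower.append(entry));
    }
    return follower.append(entry);
  }
}
```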
Just to give an update on what I found and did so far (10-01-2022)
**Metrics**

As I wrote before, I compared existing metrics which are a subset of the commit latency, like the record write latency or the append metrics. BTW, we had no panel which shows that metric. The write latency looks ok'ish. We have some outliers, which might be due to compaction or segment creation, but this doesn't explain the commit latency. The append latency looks a bit more fluctuating, but it is still not 100% clear whether this is the reason.

**Execution Time**

The new metrics show that it sometimes takes quite a while until a scheduled job is executed. Furthermore, some jobs take quite long, which of course blocks others.

**New Issues**

During my investigation I found several other things, e.g. that the append requests are not equally distributed #8566 and that install requests are always sent when the leader has a new snapshot #8565.

**Next Steps**

I found something which might slow the complete process of committing down a bit: https://github.com/camunda-cloud/zeebe/blob/develop/atomix/cluster/src/main/java/io/atomix/raft/roles/LeaderRole.java#L597
Here we ask for a
This is a low-hanging fruit which I will try out next (see the sketch below for the kind of change). Furthermore, I will investigate what might take longer in the Raft execution or could potentially slow down the commit.
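To make the "remove the async step" idea concrete, here is a plain JDK `CompletableFuture` sketch (illustrative only, not the actual LeaderRole code): the async variant schedules an extra job on the executor for every completion, while the non-async variant runs the callback on the completing thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WhenCompleteSketch {

  public static void main(String[] args) {
    final ExecutorService raftThread = Executors.newSingleThreadExecutor();
    final CompletableFuture<Long> appendFuture = new CompletableFuture<>();

    // async variant: every completion schedules an additional job on the executor,
    // which adds queueing delay when the executor is already busy
    appendFuture.whenCompleteAsync((index, error) -> commit(index), raftThread);

    // non-async variant: the callback runs directly on the completing thread,
    // avoiding the extra scheduling round trip
    appendFuture.whenComplete((index, error) -> commit(index));

    appendFuture.complete(1L);
    raftThread.shutdown();
  }

  private static void commit(final Long index) {
    System.out.println("commit up to index " + index);
  }
}
```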
Investigation 11-01-2022
**Execution Metrics**

As written yesterday, I added some execution metrics which show how long it takes until a job is executed and how long it takes to execute that job. Furthermore, we can see the rate of scheduled jobs and executions.

**No async whenComplete**

I mentioned yesterday that a low-hanging fruit would be to remove the async step which causes a new job to be scheduled all the time. After doing so we reduced the execution rate by half. At first it looked like it would help with the commit latency, but it turned out later that this was only for a short period of time.

**Turn off flush on commit**

Based on yesterday's investigation I realized that we always flush to disk when we commit. I wanted to verify how the system behaves when we skip the flush and what kind of effect this would have. We can clearly see that the commit latency went down tremendously, but we can also see some significant outliers. This setting of course also affects the throughput: here we can reach more than ~90 PI/s again. @oleschoenburg mentioned it might also be worth verifying against 1.2.9 how much difference we see in throughput and latency. I will try to run a benchmark soon.

**More parallel appends**

In our current Raft configuration we allow only 2 concurrent append requests. I increased that setting to 5 to verify whether this has an effect on performance (a minimal sketch of such an in-flight limit is shown at the end of this comment). It showed a much better throughput, of course not the same as turning off the flush, but still better. Interesting is that this is not as stable (in regards to throughput) as no flush. We can see that most of the time the commit latency is better, but we also have time frames where it becomes bad (100-250 ms).

**IO Throttling**

Based on the previous results we thought about issues with IO throttling, or whether we might write more than in earlier versions. For that I compared the long-running clusters.

Used query:

@oleschoenburg helped me here to also investigate that on the GKE metrics level. Used query:

This metric shows how many MB are throttled per second. We can see that reads themselves are not throttled, but writes are. Via the filter

**Bigger Disk**

When increasing the disk size we see much less write throttling: we increased the disk to 256 Gi and still see some throttling, but it is much smaller. Interesting is also that the throughput is not really better; it actually looks worse than before. We can see in most of the benchmarks that at the beginning the commit latency often looks ok'ish, but it goes up quite fast (as if something is filling up).

**Write throttling 1.2**

We also checked the write throttling with 1.2 and can see that it is similar to the benchmark where we increased the disk size, so this means we have some throttling, but much less.

**Next Steps**
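Coming back to the "more parallel appends" experiment above, here is a small sketch of an in-flight limit on append requests, with a hypothetical `sendAppend` method (this is not the actual Raft transport code): with a limit of 2, a single slow follower response stalls half of the pipeline, while a higher limit keeps more requests in flight.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

public class AppendPipelineSketch {

  private final Semaphore inflight;

  AppendPipelineSketch(final int maxInflightAppends) {
    this.inflight = new Semaphore(maxInflightAppends);
  }

  void replicate(final byte[] entry) throws InterruptedException {
    inflight.acquire(); // blocks once maxInflightAppends requests are pending
    sendAppend(entry).whenComplete((response, error) -> inflight.release());
  }

  private CompletableFuture<Void> sendAppend(final byte[] entry) {
    // hypothetical network call; a real implementation would use the Raft transport
    return CompletableFuture.completedFuture(null);
  }
}
```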
Investigation 12-01-2022
**1.2.4**

We can see the same pattern here: it started with a smaller throughput, but at some point the commit latency went up and with it the throughput was impacted. I feel that with 1.2.x the chance is much higher that it recovers. What we can see here is that the append latency has a high impact; I don't know how I overlooked that before.
It is also interesting to check the quantiles here.

**Write throttling**

I checked the write throttling again (based on our investigation yesterday) and we can see that due to the higher throughput we are throttled more, but it has no real negative impact. In general it seems we have no problems with writes here.

**Follower**

Interesting for #8565 is that the node which continuously gets the install requests is never able to catch up; it is always missing ~4k records.

**1.2.9 and base**

We can see in 1.2.9 and the base benchmark the same effect: a better append latency improves the commit latency, and as soon as the append latency goes up to ~25 ms we reach the point of ~100-250 ms commit latency, which has a high negative impact on throughput.

**No limit**

When starting the benchmark with no limits I was surprised that the numbers were so bad. I had to add some limits for the statefulset again so we reach better numbers, but we still never use more than ~1-2 CPUs. Still, we are not reaching the same numbers as on 1.2.x.

**1.2.9 no flush**

As expected, also in 1.2.9 we see quite good throughput with no flush. The interesting part is that the append latency also went down. I think the reason here is that the followers do not flush anymore either.

**Less Load**

With less load we can see that the commit latency has fewer issues, but the append latency still seems to be quite high. So stressing the system seems to make it worse.

**Latencies**

What I can see based on the data is that mostly one follower has an append latency of ~25 ms, which might also cause #8566. As soon as both have this latency we can see that the commit latency is affected and the throughput as well. If we have good commit latencies, it is mostly because one of the followers is able to append faster, ~9 ms in the p90 quantile (see the quorum sketch below). The commit latency is not only affected by the appends, but also by the flush latency of the leader, the job scheduling/execution latency, etc.

**Next Steps**
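Coming back to the Latencies observation above, here is a back-of-the-envelope model of the quorum effect (my own simplification, not code from the repository): with replication factor 3 the leader needs only one follower acknowledgement, so the commit latency roughly tracks the faster of the two follower appends and only degrades once both followers are slow.

```java
import java.util.Arrays;

public class QuorumLatencySketch {

  // crude model: ignores the scheduling and flush overhead mentioned above
  static double commitLatencyMs(final double leaderAppendMs, final double... followerAppendMs) {
    final double[] sorted = followerAppendMs.clone();
    Arrays.sort(sorted);
    // quorum of 2 out of 3 = leader + fastest follower
    return Math.max(leaderAppendMs, sorted[0]);
  }

  public static void main(String[] args) {
    System.out.println(commitLatencyMs(5, 9, 25));  // one fast follower -> ~9 ms
    System.out.println(commitLatencyMs(5, 25, 27)); // both followers slow -> ~25 ms
  }
}
```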
Investigation 17-01-2022
**Benchmarks**

I had another deeper look at the 1.2.4 benchmark from last week. As I have written before, we can see that the benchmark runs into the same issue, but there are also times where it reaches its full potential. We can clearly see how the append/commit latency then affects the throughput. Digging deeper into the metrics and code I found the segment flush metric (`atomix_segment_flush_time_bucket`). This https://github.com/camunda-cloud/zeebe/blob/develop/journal/src/main/java/io/camunda/zeebe/journal/file/JournalMetrics.java#L74 is called every time we commit on the leader, or we append on the followers and send a response back. The metric observes how long it takes to flush the entire buffer to disk (see the sketch below for what such a flush amounts to). I created some new panels and views to compare the different latencies better. What we can see is that IF the segment flush latency goes up, the append latency goes up, and this can directly impact the processing throughput, as we can see in the panels. If more than one node is affected by the segment flush issue, the commit latency is impacted accordingly, which of course also impacts processing throughput. I verified that on several other benchmarks; on all of them we have high flush latency from time to time, which then affects the whole system.
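Since the metric (as noted at the end of this comment) only observes the mapped-buffer flush, here is a rough sketch of what such a flush amounts to, using plain JDK APIs and a hypothetical segment file name (simplified, not the actual journal code): the duration of `force()` depends on how many dirty pages have accumulated and on how busy or throttled the disk is.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SegmentFlushSketch {

  public static void main(String[] args) throws IOException {
    final Path segment = Path.of("segment-1.log"); // hypothetical segment file
    try (FileChannel channel =
        FileChannel.open(
            segment,
            StandardOpenOption.CREATE,
            StandardOpenOption.READ,
            StandardOpenOption.WRITE)) {
      final MappedByteBuffer buffer = channel.map(MapMode.READ_WRITE, 0, 4 * 1024 * 1024);
      buffer.put("some record data".getBytes());

      final long start = System.nanoTime();
      // forcing the dirty pages of the memory-mapped segment to disk; this kind of
      // call is what the flush latency measures, and its duration depends on the disk
      buffer.force();
      System.out.printf("flush took %d us%n", (System.nanoTime() - start) / 1_000);
    }
  }
}
```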
Interesting, I think, is that the throughput differences are not that visible with three partitions, but we can also see the high latencies there.
My conclusion is that this also happens on 1.2.x, which I have written above as well (but it still seems to be less likely, probably because some changes were not made in 1.2). Furthermore, it seems to be less likely, or to have less of an impact, with multiple partitions. It seems to be related to the segment flush latency. We made multiple changes in 1.3 (see above) which might add additional latency that sums up and makes the issue worse. I added some new issues today which we could tackle to also improve in some areas: #8602 #8601. I would like to pair with someone (@npepinpe @oleschoenburg) in order to discuss how we want to continue. The segment flush metric only observes the mapped buffer flush; currently I don't know how this can sometimes be impacted. Maybe it depends on the size of the buffer which it flushes? Since I also have another topic I should work on this quarter, I don't know how much time I should spend here. BTW, it might also be that I totally misinterpret the numbers and metrics, so I think it makes sense that someone challenges that.
Investigation 18-01-2022

I ran benchmarks over night and made some observations.

**Observations**

The benchmark which keeps the readers (reverting #8124) went up in throughput to ~193 PI/s. It looks like the latency goes down completely over night and the throughput goes up accordingly. Around ~8 am we see an increase in latency again and a drop in throughput. Yesterday evening I saw that the performance became better and wanted to verify that with a second benchmark with the same image. Here we still do not reach the expected goal of ~195-200 PI/s. I thought about the flush latency and wanted to check again whether it is disk related. Actually I verified that before, but I wanted to be sure. Reading docs:

I decided to try again to increase the disk size and the CPUs.

We can see in the metrics that the overall maximum flush latency is lower than before, BUT the distribution is still wide. The avg commit latency is also higher, which means we are not able to reach our expected throughput. So it seems we can fix it neither with bigger disks nor with more CPUs. Somehow I feel it is more related to how many benchmarks are running and how many brokers are running on the same node, because in the benchmark above we see it goes up around ~8 am when I started more benchmarks.

**Debian Image**

Another guess was that this might again be related to our different base image, since we changed from Debian to Ubuntu due to the JDK 17 migration, and that this might have different flush performance 🤷 We can see here as well that this doesn't have any effect.

Benchmark result for one partition:
Discussed the next steps together with @npepinpe. @npepinpe will validate the priority of this issue, and in general of performance over the other issues we need to tackle before our upcoming release in April. @Zelldon will produce flamegraphs for the different versions; maybe this gives us some hints.
**Additional observations**

As described above, I ran new benchmarks to take flamegraphs of the different versions (1.1.10, 1.2.9, 1.3.0) to see what has changed. Unfortunately, I haven't seen/found anything problematic.

**Flamegraphs**

From left to right, the flamegraphs are from version 1.3.0, 1.2.9 and 1.1.10. They are also attached to this comment.

**Actor**

Zooming into an actor we can see that we yield quite often, but this was already the case before. Still, I think it makes sense to configure the actor threads wisely depending on the partition count.

**Append**

Zooming into the Raft thread and searching for the term
**Commit**

Searching for the term

**Follower**

All of the above was on the leader nodes, but I thought the results from the followers might also be interesting. I did that for 1.3.0 and 1.2.9, since we had no replay in 1.1.10. In general it looks quite similar; nothing really stands out.

**Append**

On the follower append side, we can see that we have an additional copy in 1.3.0. To be specific, in the MappedSegmentWriter, when we writeData we call https://github.com/camunda-cloud/zeebe/blame/3da1a75d52e524b123f91a8ec9e480a0337bbf42/journal/src/main/java/io/camunda/zeebe/journal/file/record/SBESerializer.java#L60 (an illustrative sketch of such an extra copy is shown below). But the commit which added this is 11 months old, and I'm also not sure whether this can really cause our issue. Looking at the git history, I also found #7967, which I tried to revert, and ran several benchmarks over night.

**Replay**

The replay on followers looks quite similar; some stacktraces are different, but this can also be because it is just a snapshot of what happens.

**Benchmarks**

As I had the benchmarks set up for the flamegraphs, I also had time to observe them and look at their metrics.
Interesting is that 1.1.10 performed quite similarly to 1.3.0 (with an avg throughput of ~80 PI/s), and 1.2.9 was able to do ~88 PI/s on average.

**Nodes**

I still feel that our current benchmarks highly depend on how they are scheduled in Kubernetes. The leader and follower of 1.2.9 are scheduled on the same node, which might give them some performance benefits. We can also look at the node metrics; I'm currently not sure what they tell us, but they might be interesting for later. The followers for 1.3.0 are scheduled together on another node. The leader of 1.1.10, which performs similarly to 1.3.0, is also not scheduled with its followers. Here we can see that the disk IO is much higher, but tbh I don't know why.

**Latencies**

Last week I opened an issue which describes that some nodes are preferred when appending (#8566). The issue seems to have another impact: I observed that if the node which is always preferred has a good append latency, the commit latency is also quite good, but as soon as the append latency of that node goes up, the commit latency also goes up! We can see that in the benchmark where I reverted #7967. Around 22:00 the latencies swapped: appends to one follower are now faster and to the other slower, but the commit latency goes up and never comes back. Here it seems to prefer node 2.

**Revert 7967**

As written before, due to the flamegraphs and looking at the git history, I guessed that it might be interesting to revert #7967. At the beginning it also looked quite good. We can see that the leader and the follower are again scheduled on the same node. But over night it went down again to the usual performance 😢 Taking a look at the nodes, we can see that the leader is now alone on its node (with other pods from different namespaces), which brings me again to the point that the general performance highly depends on where things are scheduled, which makes it really hard to compare short benchmarks / time frames. Running benchmarks over a longer time (like 12 or 24 hours), the avg throughput shows a good approximation (due to rescheduling, preemption, etc.). So yes, we still have a performance regression and it was not resolved by reverting the above PR, but what I learned is to take the results of short benchmarks with a grain of salt.

**Next Steps**

@npepinpe any updates on your side regarding priority? Another idea which came to my mind is to bring back the medic benchmarks between 1.2 and 1.3.0 to see when it changed. I think I have seen before that it was between alpha1 and alpha2 #8425 (comment), but looking at the comment we already see a degradation in alpha1. So I would start with the first week after the 1.2.0 release.
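To illustrate what the "additional copy" on the follower append path means, here is a tiny sketch with hypothetical helper methods (not the actual SBESerializer/MappedSegmentWriter code): serializing into an intermediate buffer and then copying it into the segment adds one memcpy per record compared to writing directly into the destination buffer.

```java
import java.nio.ByteBuffer;

public class ExtraCopySketch {

  // variant with the extra hop: serialize into a scratch buffer, then copy into the segment
  static void writeWithCopy(final ByteBuffer segment, final byte[] record) {
    final ByteBuffer scratch = ByteBuffer.allocate(record.length);
    scratch.put(record).flip();
    segment.put(scratch); // additional copy per record
  }

  // variant without the extra hop: write straight into the destination buffer
  static void writeDirect(final ByteBuffer segment, final byte[] record) {
    segment.put(record);
  }
}
```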
Moving to the backlog for now, probably until Q2. See the last comment in #8425
Did some tests over the last days. What I observed is that always one node seems to be really preferred (related to #8566). If this one has a higher append latency, then the avg commit latency goes up, even if the append latency of the other node goes down. This is something I wouldn't expect. See the screenshot below: the append latency to node one goes up, and to node two goes down, but still the avg commit latency goes up. Considering the flush rate, we can see that it seems to go down on node one, which causes the flush duration to increase (if we flush less often, but more bytes at once, it takes longer). It looks like it is related to install requests, since at the same time node one receives install requests from the leader (related #8565).
This week, I spent a bit of time trying to understand the impact of taking snapshots, replicating snapshots, and compaction. Therefore, I did some micro-benchmarks (in an old-fashioned way) to understand which operations spike, how they are related to each other, and how they influence each other. Eventually, I ended up micro-benchmarking the "Commit Latency" as this brings everything together. I just want to share my findings here because on one side they underline some of @Zelldon's observations and on the other side they might give additional input to this issue. Hopefully, this gives a bit more insight into how things are tied together.

tl;dr When running the micro-benchmarks, the following operations often ended up in latency spikes:

To counter-check this, I turned off all these operations, and as expected the cluster was capable of holding the load:

In that counter-check, the only limiting factors were:
Now, the long version follows with some more details 😄

**Record Write Latency**

Here are my observations:
To sum up, the creation of a new log segment may cause latency spikes.

**Append Entry Latency**

Here are my observations:
To sum up, the following mostly influences this metric:
**Commit Latency**

Here are my observations:
Also here, in addition to the
**Additional Overhead**

All the latency metrics above contain additional overhead, in the sense that they may include latency spikes of operations belonging to other jobs. Meaning, the "Raft thread" maintains a worker queue to which it and others can submit work (i.e., jobs), and the Raft thread executes those jobs sequentially. That means, if there are 10 jobs inside the worker queue, they are executed one by one, which influences the latency as follows:
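A toy model of that effect (the numbers are made up, not a real measurement): the Raft thread drains its queue sequentially, so the observed latency of a job is its own execution time plus the execution time of every job queued before it.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class QueueLatencySketch {

  public static void main(String[] args) {
    final Queue<Long> executionTimesMs = new ArrayDeque<>();
    executionTimesMs.add(5L);   // e.g. an append
    executionTimesMs.add(150L); // e.g. a compaction
    executionTimesMs.add(5L);   // e.g. a commit

    long waitedMs = 0;
    int jobIndex = 0;
    while (!executionTimesMs.isEmpty()) {
      final long ownMs = executionTimesMs.poll();
      // observed latency = time spent waiting behind earlier jobs + own execution time
      System.out.printf("job %d: waited %d ms, executed %d ms, total %d ms%n",
          jobIndex++, waitedMs, ownMs, waitedMs + ownMs);
      waitedMs += ownMs;
    }
  }
}
```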
Basically, the latency of a job also includes the execution time of every job queued before it.

**Taking Snapshot**

A broker takes snapshots in intervals for each partition. Taking a snapshot happens asynchronously and has no impact on the processing or replication path. At least, I couldn't find anything that has an impact here.

**Compaction**

After taking a snapshot, the log might get compacted. Meaning, it will delete log segments up to the snapshotted index. Depending on how many log segments need to be deleted, this may take some time. In my benchmarks with a snapshot interval of 5 minutes, the compaction took roughly ~150ms and more. In addition, if all log segments get deleted, it will require the creation of a new log segment, which again adds additional latency. However, compaction is part of the replication path in Raft. Meaning, after taking a snapshot, the compaction is submitted to the respective Raft thread. Eventually, the Raft thread will execute that compaction, which takes time. Basically, during compaction, the leader won't replicate any entries. And in the case of compaction on the follower side, the follower won't handle any append requests during that time and starts lagging behind. Things get even worse if the follower triggers compaction on all partitions it is following at the same time, and the follower is part of the quorum. This will have an impact on the commit latency. Also to consider: the compaction happens after installing a received snapshot from a leader on a partition. Just a side note: another potential issue of compaction is that it does not consider any in-flight requests. For example, the most recent follower might lag 10 entries behind, but with the compaction those 10 entries may get deleted. As a consequence, the leader needs to replicate its latest snapshot to the follower, which again might cause the follower to lag behind.

**Snapshot Replication**

Whenever a snapshot is replicated (for whatever reason), the replication typically takes up to ~200ms. That means, during that time the follower the snapshot is replicated to does not receive any append requests - which is expected. The downside is that after installing the snapshot the follower is already lagging behind the leader. This increases the probability of receiving a new snapshot when the leader takes a new one, which may end up in a snapshot replication loop. In addition, on the follower side, after installing the replicated snapshot, the follower will trigger compaction. This again blocks the Raft thread for a while, so that it won't be able to handle any incoming append requests. This may result in lagging behind the leader even more. There might also be cases where taking a snapshot and receiving a snapshot happen frequently in a short period of time. In this case, the follower took a snapshot on its own, afterwards got another snapshot replicated from the leader, and just a minute later took another snapshot on its own. So it triggered 3 compactions in a row. This may happen multiple times, so the follower had a hard time catching up with the leader over time. On the leader side, after replicating the snapshot, it resets the follower's next index to replicate. This seek typically takes ~50ms.

To close the loop, there are multiple factors/variables that influence the commit latency. I think the latency spike caused by Raft's metastore update might be fixed easily by switching to mmap.
Updates to the metastore must be flushed to disk immediately to handle crashes. So switching to mmap might not help.
@deepthidevaki, thanks for your comment, and you are right about
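A small sketch of the constraint @deepthidevaki points out, with a hypothetical metastore file name (not the actual Raft metastore code): if the update must survive a crash before it is acknowledged, some sync to disk is unavoidable. A regular write needs `force(true)`, and a memory-mapped write would equally need `MappedByteBuffer.force()`, so mmap by itself does not remove that cost.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DurableMetastoreWriteSketch {

  static void storeTerm(final Path metastoreFile, final long term) throws IOException {
    try (FileChannel channel =
        FileChannel.open(
            metastoreFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      final ByteBuffer update = ByteBuffer.allocate(Long.BYTES).putLong(term).flip();
      channel.write(update);
      channel.force(true); // this sync is the part that mmap cannot make free
    }
  }
}
```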
JFYI #8741 (comment): my current benchmark is able to reach ~200 PI/s again.
Discussed with @romansmirnov. The current assumption is that the actor threads are quite aggressive in work stealing and cycle quickly back and forth between idle and running, which consumes a lot of CPU time. This can also be seen here: #8551 (comment). Reducing the threads means that the actor threads themselves get more work to do and don't run into these issues where they have to compete for jobs/work.
@romansmirnov the benchmarks I started with the reduced threads are now as slow as the other benchmarks. But the current medic week benchmark looks a bit faster. 🤷♂️
Today I had another look at this, since I was wondering why it does not happen on the long-running cluster. I realized that we use different nodes in our long-running cluster. We use
I checked some metrics in GKE and, as expected, saw that we have much more disk throughput per VM in our preemptible Zeebe cluster.

Zeebe cluster:

Long-running cluster:

Machine Type n8

I think we can verify that with anti-affinity rules and check how it behaves and performs.
I tried it out with:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      # do not schedule a broker onto a node that already runs a broker
      # from any of the listed namespaces
      - labelSelector:
          matchExpressions:
            - key: "app.kubernetes.io/component"
              operator: In
              values:
                - zeebe-broker
        topologyKey: "kubernetes.io/hostname"
        namespaces:
          - ccsm-helm
          - medic-cw-07-b83992c74d-benchmark
          - medic-cw-08-12c4ea63e6-benchmark
          - medic-cw-09-e4718dc49c-benchmark
          - medic-cw-10-ba172d7b06-benchmark
          - zell-affinity
```

This allows the brokers to run on nodes without other brokers (also without brokers from other namespaces).
Still, we see low performance (or at least lower than I would expect) and a high commit latency.
What about Elasticsearch running on the same Kubernetes node as the broker?
@falko I was not able to schedule them on different nodes; I failed to configure this properly. I added anti-affinity also for the Elastic master labels, but somehow they always ended up on the same node 🤷♂️. I can try again in the next days. But still, there is already much less load on these nodes than on the others.
I did another benchmark with:

```yaml
resources:
  limits:
    cpu: 15
    memory: 4Gi
  requests:
    cpu: 15
    memory: 4Gi
```

This results in one broker per node:
The numbers look far better. In the following screenshot we see the benchmark first with affinity only, which is still fluctuating. Around 2 pm I restarted the benchmark with higher CPU requests/limits. We can see a flat line now. If we take a look at the last 3 hours, we can see that the avg is almost 200 again, as we want it to be.

Latency: in the latency we can see a bit of a difference. There are fewer outliers, so the max is not as high. I haven't invested more time in investigating the latency here.

CPU

Snapshots

The install request rate still looks the same as before.

Concluding, I would say that our benchmarks are highly affected by the benchmarks running in parallel, meaning pods which are scheduled on the same node. Not only brokers but also Elasticsearch impacts our throughput. This started to happen when we decreased the resources for our benchmarks. This was not reproducible in our long-running cluster, since we use smaller nodes there. Still, we do not reach the 200 PI/s I would expect, so we potentially have other issues which can influence this, probably things we already discussed above. How do we want to continue here? @npepinpe \cc @falko @deepthidevaki @romansmirnov
Today I discussed that topic with @npepinpe. In general we are currently not focusing on it and will not spend much time on this, but if I have time I will set up a new node pool with local SSDs attached. This helps to determine whether it is network-IO related. If the benchmarks still show low performance with local SSDs, then it is probably related to CPU cache contention; otherwise it is related to network IO. In general I will soon add some anti-affinity rules to the Helm charts, which should help a bit here as well. We should probably document all the insights here better for users, so they know what to do if they face performance issues.
8977: Add random protocol record/value factory r=npepinpe a=npepinpe

## Description
Adds a new utility to the `protocol-test-util` module which allows generating random records and values in a deterministic way. Note that as the data is randomly generated, it has no meaning in itself - keys, positions, etc., are completely random. However, the `value` and `intent` are guaranteed to always be derived from the `valueType`. This is currently used to properly test the deserialization of the protocol via Jackson, but can later be used for exporter-related unit tests.

## Related issues
closes #8837

8999: Add commit and record write quantile panels r=Zelldon a=Zelldon

## Description
Taking a look at performance issues often forces me in the end to look at the commit latencies, because of #8551. It is currently quite hard to compare latencies from two benchmarks, since this is shown as a heatmap.

![old-panels](https://user-images.githubusercontent.com/2758593/160366627-bfd3e8f3-9c59-4bd6-8831-6df5f4a6e8ca.png)

This PR should solve this issue and allows me to not always recreate my panels (to see a difference). It adds two new panels, one for the commit latency and one for the record write latency. Both show the quantiles (p90, p99), the median and the avg. The formulas were based on: https://theswissbay.ch/pdf/Books/Computer%20science/prometheus_upandrunning.pdf (I use the book)

The new panels look like this:

![new-panels](https://user-images.githubusercontent.com/2758593/160366881-52720cfc-1bcd-4d46-8055-27c59a873c64.png)

This makes it easier to compare against other benchmarks which perform worse, like:

![otherbench-new-panels](https://user-images.githubusercontent.com/2758593/160366975-ffc7ac5d-3488-4ea7-9fd9-0839b365bfb1.png)

## Related issues
related to #8551

Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
9017: chore(maven): add trailing slashes to new Artifactory URL r=cmur2 a=cmur2

## Description
I got review feedback that it is recommended to have trailing slashes on the URL, so I'm adding them for consistency.

## Related issues
Related to INFRA-3107

Co-authored-by: Christian Nicolai <christian.nicolai@camunda.com>
It is crazy: I checked the benchmarks over the last days and we have really volatile backpressure metrics; one day it is 3%, today it was even 30% on one benchmark. It often seems to be related to which broker is currently the leader of all partitions (it seems to be almost always the case that one broker is the leader for all partitions), probably related to #8566. What I can see is that the throughput is of course also affected by this, ranging from 175 PI/s up to 188 PI/s. All of that seems to be caused by the fluctuating commit latency, which in turn is related to how many other pods (other brokers or Elasticsearch) are running on the same Kubernetes node.
BTW, you may ask why this is happening: why do we have multiple brokers on the same node when we have anti-affinity defined? Well, the anti-affinity only applies per namespace, which means brokers from other namespaces can be assigned to the same node.
Closing as the version is EOL.
Describe the bug
Based on experiments and benchmarks done in #8425 (comment), we have observed a high commit latency (100-250 ms), which of course seems to affect all other latencies and in the end also impacts the throughput.
Normally we can reach 25-50ms in commit latency (especially in 1.2.9 it is more likely to reach this latency), but it happens from time to time (in 1.3.0 more often) that we jump to a latency of 100-250 ms.
We should investigate further what can cause this and why.
Might be related to #8132
To Reproduce
See #8425: run a benchmark with 1.2.x or with 1.3 (more likely). Ideally, to decrease the blast radius/scope, use one partition and less load (100 PI/s and 3 workers are enough).
Expected behavior
The commit latency is constant at a level of 25-50 ms.
Environment: