
Commit latency is fluctuating and too high #8551

Closed · Zelldon opened this issue Jan 7, 2022 · 28 comments
Labels: area/performance, kind/bug, kind/research, scope/broker, severity/high

Comments

Zelldon (Member) commented Jan 7, 2022

Describe the bug

Based on experiments and benchmarks done in #8425 (comment) we have observed a high commit latency (100-250 ms), which of course affects all other latencies and in the end also impacts the throughput.

Normally we reach a commit latency of 25-50 ms (especially in 1.2.9 this is likely), but from time to time (more often in 1.3.0) it jumps to 100-250 ms.

dev-both

We should investigate further what can cause this and why.

Might be related to #8132

To Reproduce

See #8425: run a benchmark with 1.2.x or with 1.3 (more likely to reproduce). Ideally, to decrease the blast radius/scope, use one partition and less load (100 PI/s and 3 workers is enough).

Expected behavior

The commit latency is constant at a level of 25-50 ms.

Environment:

  • Zeebe Version: 1.2.9, 1.3.0, develop
Zelldon added the kind/bug, scope/broker, area/performance, severity/high and 1.2.9 labels on Jan 7, 2022

Zelldon (Member Author) commented Jan 7, 2022

As a first step I would like to understand what is included in a commit (measured by the commit latency, related to #8552), to see what might take longer than usual. I will use this issue to summarize this.

Zelldon added this to In progress in Zeebe on Jan 7, 2022

Zelldon (Member Author) commented Jan 10, 2022

  • The CommitLatencyTimer is started after the blockPeek is read from the Dispatcher and we append the block to the storage (only if the claim succeeded)
  • This calls #appendEntry in LeaderRole -> does a context switch (enqueues into the RaftContext); if there is a lot of work to do (in the queue) this might take longer
    • Calls #safeAppendEntry
    • Validates the entry
    • Calls #append - when complete, the listener is called (onWrite) and replication starts
      • Calls #tryToAppend
      • Calls raftLog#append
      • Serializes the entry, calls journal#append
    • On append complete:
      • We should verify whether the write metrics show a similar pattern (or only the commit metrics); the write latency is a subset of the commit latency
      • Calling #replicate causes appendEntries requests to the members in the cluster
        • If the follower is not up to date, this causes a snapshot to be replicated
        • We can see in our recent benchmarks that this happens quite often
        • A snapshot is sent either:
          • when the follower has a lower index than our first index (which is likely, since when we take a snapshot we compact)
          • when the follower lags behind by a threshold (~100 events)
        • When events are replicated, LeaderAppender#replicateEvents is called
          • Calls #sendAppendRequest - we can check the appendLatency metric to see whether it shows a similar pattern
        • Commit futures are completed when the append response is received and we reached a quorum (in LeaderAppender#handleAppendResponseOk)
        • After the commit future is complete we enqueue the future-completion job and record the commit latency (commit listeners are called)
        • The job is scheduled in the LogAppenderActor to record the metrics (see the sketch below)
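To make these measurement points concrete, here is a minimal sketch (not the actual Zeebe code; the class and listener wiring are simplified assumptions) of where such a commit latency timer starts and stops, using the Prometheus Java client the broker metrics are based on:

import io.prometheus.client.Histogram;

/** Minimal sketch: the timer starts when a block is handed to the log storage and is
 *  observed once the raft layer reports the entry as committed (replicated to a quorum). */
public final class CommitLatencySketch {

  private static final Histogram COMMIT_LATENCY =
      Histogram.build()
          .namespace("zeebe")
          .name("log_appender_commit_latency_sketch")
          .help("Time from appending a block until it is committed")
          .register();

  private final LogStorage logStorage;

  public CommitLatencySketch(final LogStorage logStorage) {
    this.logStorage = logStorage;
  }

  /** Called by the appender once a block was claimed from the dispatcher. */
  public void appendBlock(final byte[] block) {
    final Histogram.Timer timer = COMMIT_LATENCY.startTimer();
    // appendEntry -> raft context -> raftLog#append -> replicate -> quorum
    logStorage.append(block, timer::observeDuration);
  }

  /** Hypothetical interface standing in for the real log storage API. */
  interface LogStorage {
    void append(byte[] block, Runnable onCommit);
  }
}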

npepinpe (Member) commented Jan 10, 2022

I don't want to derail your investigation, so take this as a single data point and not a lead per se: someone mentioned that we have the wrong snapshot replication threshold for our throughput - would lots of install requests lead to a higher commit latency? Also wish I could remember who mentioned this 😅

Zelldon (Member Author) commented Jan 10, 2022

would lots of install requests lead to a higher commit latency?

This is my current assumption, since we first need to send the first snapshot chunk before we can commit the entry, even if we can send the entry to another node right away. Sending the chunk and the entry to the other node is synchronous.

lenaschoenburg (Member):
@npepinpe I think that was @korthout who recently did an experiment where he set the threshold much higher.

Zelldon (Member Author) commented Jan 10, 2022

Just to give an update on what I found and did so far.

10-01-2022

  • I had a deeper look at the code, see above.
  • I found some interesting existing metrics and compared them with the commit latency
  • I added new metrics for the single-thread executor to show how long it takes until a job is executed and what the execution time is

Metrics

As I wrote before, I compared existing metrics which are a subset of the commit latency, like the record write latency or the append metrics. BTW, we had no panel which shows that metric.

commitWriteAppendLatency

The write latency looks OK-ish. We have some outliers, which might be due to compaction or segment creation, but this doesn't explain the commit latency.

The append latency fluctuates a bit more, but it is still not 100% clear whether this is the reason.

Execution Time

The new metrics show that it sometimes takes quite a while until a scheduled job is executed. Furthermore, some jobs take quite long, which of course blocks others.

executionLatency

New Issues

During my investigation I found several other things, e.g. that the append requests are not equally distributed (#8566) and that install requests are always sent when the leader has a new snapshot (#8565).

Next Steps

I found something which might slow the complete committing process down a bit: https://github.com/camunda-cloud/zeebe/blob/develop/atomix/cluster/src/main/java/io/atomix/raft/roles/LeaderRole.java#L597 Here we use whenCompleteAsync, which causes a new job to be scheduled when the future completes. If the execution queue is long this might take a while. In this place it is actually not necessary, since we already switched to the raft context before, so it is enough to check for the context in a plain whenComplete. A sketch of the difference follows below.
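A minimal sketch of the difference (not the Zeebe code; the single-threaded executor stands in for the raft context's queue, and isOnRaftContextThread is a made-up check):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class WhenCompleteSketch {

  public static void main(final String[] args) {
    final ExecutorService raftContext = Executors.newSingleThreadExecutor();
    final CompletableFuture<Long> appendFuture = new CompletableFuture<>();

    // Variant in question: always enqueue a new job on the raft context, even if the
    // future completes on that very thread; the callback then waits behind the queue.
    appendFuture.whenCompleteAsync(WhenCompleteSketch::onAppended, raftContext);

    // Cheaper variant: run the callback inline and only re-enqueue when we are not on
    // the raft context anymore.
    appendFuture.whenComplete(
        (index, error) -> {
          if (isOnRaftContextThread()) {
            onAppended(index, error);
          } else {
            raftContext.execute(() -> onAppended(index, error));
          }
        });

    appendFuture.complete(1L);
    raftContext.shutdown();
  }

  private static void onAppended(final Long index, final Throwable error) {
    // handle the completed append, e.g. trigger replication / metrics
  }

  private static boolean isOnRaftContextThread() {
    // simplified stand-in for the real thread-context check
    return Thread.currentThread().getName().startsWith("raft");
  }
}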

This is a low hanging fruit which I will try out next. Furthermore, I will investigate what might take longer in the raft execution or what could potentially slow down the commit.

Zelldon self-assigned this on Jan 11, 2022

Zelldon (Member Author) commented Jan 11, 2022

Investigation 11-01-2022

  • I ran several benchmarks based on yesterday's findings
  • Looked at GKE metrics together with @oleschoenburg
  • Tried some different benchmarks

Execution Metrics

As written yesterday, I added some execution metrics which show how long it takes until a job is executed and how long it takes to execute that job. Furthermore, we can see the rate of scheduled jobs and executions.

executionrate

No async whenComplete

I mentioned yesterday that a low hanging fruit would be to remove the async step which causes a new job to be scheduled all the time.

After doing so we reduced the execution rate by half.

executionrate-reduced

At first it looked like it would help with the commit latency, but it later turned out that this was only for a short period of time.

executionrate-reduced-commit-latency

Turn off Flush on commit

Based on yesterday's investigation I realized that we always flush to disk when we commit. I wanted to verify how the system behaves when we skip the flush and what kind of effect this has.
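For context, what "flush on commit" boils down to, as a minimal sketch (not the Zeebe code; the flushOnCommit flag is made up to illustrate what the experiment effectively disabled):

import java.nio.MappedByteBuffer;

public final class FlushOnCommitSketch {

  // Hypothetical flag: the experiment effectively behaves as if this were false.
  private final boolean flushOnCommit;
  private final MappedByteBuffer segmentBuffer;

  public FlushOnCommitSketch(final boolean flushOnCommit, final MappedByteBuffer segmentBuffer) {
    this.flushOnCommit = flushOnCommit;
    this.segmentBuffer = segmentBuffer;
  }

  void onCommitIndexAdvanced(final long newCommitIndex) {
    if (flushOnCommit) {
      // msync of the mapped journal segment; this synchronous disk flush is what the
      // commit has to wait for before the commit futures are completed.
      segmentBuffer.force();
    }
    // ... afterwards complete commit futures / notify commit listeners
  }
}

Skipping that force() trades single-node durability guarantees for latency, which is why this is an experiment rather than a straightforward fix.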

no-flush-commit-latency

We can clearly see that the commit latency went down tremendously, but we can also see some significant outliers. This setting of course also affects the throughput: here we can reach more than ~90 PI/s again.

no-flush-throughput

@oleschoenburg mentioned it might also be worth verifying against 1.2.9 how much difference we see in throughput and latency. I will try to run such a benchmark soon.

More parallel appends

In our current Raft configuration we allow only 2 concurrent append requests. I increased that setting to 5 to verify whether this has an effect on performance. It showed a much better throughput; of course not the same as turning off the flush, but still better.

more-appends-throughput

Interestingly, this is not as stable (in regards to throughput) as with no flush. Most of the time the commit latency is better, but there are also time frames where it becomes bad (100-250 ms).

more-appends-commit-latency

IO Throttling

Based on the previous results we thought about issues with IO throttling or whether we might write more than in earlier versions. For that I compared the long running clusters.

writes-per-second
What we can see is that 1.3 is writing ~3-5 MB per second more than 1.2.

Used Query:

sum(rate(container_fs_writes_bytes_total{container="zeebe-cluster-helm", namespace=~"release-1-2-x", pod=~".*-zeebe-[0-9]", device="/dev/sdb"}[1h])) by (namespace)

sum(rate(container_fs_writes_bytes_total{container="zeebe-cluster-helm", namespace=~"release-1-3-x", pod=~".*-zeebe-[0-9]", device="/dev/sdb"}[1h])) by (namespace)

@oleschoenburg helped me here to also investigate that on the GKE metrics level.

write-throttling-1 3

Used query:

fetch gce_instance
| filter
    resource.project_id == "zeebe-io"
    && resource.zone == "europe-west1-b"
| { t_read:
      metric 'compute.googleapis.com/instance/disk/throttled_read_bytes_count'
      | filter metric.device_name == "gke-zeebe-cluster-100d-pvc-787e1424-d589-48ae-9712-621fefaff8fe"
      | align rate(1m)
      | group_by [], [total: sum(value.throttled_read_bytes_count)]
      | map add[direction: 'Read']
  ; t_write:
      metric 'compute.googleapis.com/instance/disk/throttled_write_bytes_count'
      | filter metric.device_name == "gke-zeebe-cluster-100d-pvc-787e1424-d589-48ae-9712-621fefaff8fe"
      | filter metric.throttle_reason == "PER_GB"
      | align rate(1m)
      | group_by [], [total: sum(value.throttled_write_bytes_count)]
      | map add[direction: 'Write'] }
| union

This metric shows how many MB are throttled per second. We can see that reads are not throttled, but writes are. Via the filter metric.throttle_reason == "PER_GB" we are able to find out the reason (which in this case is a disk that is too small).

https://cloud.google.com/compute/docs/disks/review-disk-metrics#throttling_metrics
The throttling metrics include a throttle_reason label that indicates whether throttling is due to limits based on the disk size or limits based on the number of vCPUs on the VM instance. Consider the following steps to increase performance, especially for latency-sensitive workloads such as databases:

If the throttle_reason is PER_GB, increase the size of your disk.
If the throttle_reason is PER_VM, add more vCPUs to your VM instance.

Bigger Disk

When increasing the disk size we see much less write throttling:

write-throttling-1 3-disk-larger

We increased the disk to 256 Gi and still see some throttling, but it is much smaller. Interestingly, the throughput is not really better; it actually looks worse than before.

write-throttling-1 3-disk-larger-throughput

In most of the benchmarks we can see that the commit latency often looks OK-ish at the beginning, but it goes up quite fast (as if something is filling up).

write-throttling-1 3-disk-larger-commit-latency

Write throttling 1.2

We also checked the write throttling with 1.2 and can see that it is similar to the benchmark where we increased the disk size, so this means we have some throttling, but much less.

write-throttling-1 2

Next Steps

  • I would like to understand why we have such outliers in the execution metrics (10+ seconds of scheduling delay or high job execution times).
  • Create a 1.2.9 benchmark with no flush
  • Benchmark an earlier 1.2.x version
  • Verify what we changed in the engine in regard to records or state
  • Run JMH with eze and the different versions (just to verify)

Zelldon (Member Author) commented Jan 13, 2022

Investigation 12-01-2022

  • Run several new benchmarks:
    • 1.2.4
    • 1.2.9
    • 1.2.9 without flush
    • current dev as base
    • current dev with less load
    • current dev with no resource limits

1.2.4

We can see the same pattern here: it started with lower throughput, but at some point the commit latency improved and so did the throughput. I feel that with 1.2.x the chance that it recovers is much higher.

124-general

What we can see here is that the append latency has a high impact; I don't know how I overlooked that before.

124-appends-relates
At the point when the append latency went down, the commit latency also went down, which of course makes sense since we have to wait for the append before we can commit.

It is also interesting to check the quantiles here.

124-quantiles-fluhs-latency

Write throttling

I checked the write throttling again (based on our investigation yesterday) and we can see that due to the higher throughput we are throttled more, but it does not really have a negative impact.

124-write-throttling
124-written

In general it seems we have no problems with writes here.

no-problem-with-writes

Follower

Interesting for #8565 is that the follower which continuously gets the install requests is never able to catch up; it is always missing ~4k records.

124-lagging-behind-follower

1.2.9 and base

In 1.2.9 and the base benchmark we can see the same effect of the append latency driving the commit latency: as soon as the append latency goes up to ~25 ms, we reach the point of ~100-250 ms commit latency, which has a highly negative impact on the throughput.

129-slower-appends-relates

No limit

When starting the benchmark with no limits I was surprised that the numbers were so bad.

no-limits
no-limits-cpu

I had to add some limits to the statefulset again to reach better numbers, but we still never use more than ~1-2 CPUs.
no-limit-cpu7-general

But we are still not reaching the same numbers as on 1.2.x.

1.2.9 no flush

As expected, also in 1.2.9 we see quite good throughput with no flush.

no-fluhs-general

The interesting part is that the append latency also went down. I think the reason is that the followers also no longer flush.
This reduced both latencies to ~5 ms.

no-fluhs-latency

Less Load

With less load we can see that the commit latency has fewer issues, but the append latency still seems quite high. So stressing the system seems to make it worse.

less-load-quantiles-fluhs-latency

Latencies

What I can see based on the data is that mostly one follower has an append latency of ~25 ms (which might also relate to #8566); as soon as both followers have this latency, we can see that the commit latency is affected, and the throughput as well.

If we have good commit latencies, it is mostly because one of the followers is able to append faster (~9 ms in the p90 quantile). The commit latency is not only affected by the appends, but also by the flush latency of the leader, the job scheduling/execution latency, etc.

Next Steps

  • Understand what affects the append latency and why some nodes are slower than others
  • Furthermore I would like to understand what has changed in 1.3
  • Still need to check the engine etc.

Zelldon (Member Author) commented Jan 17, 2022

Investigation 17-01-2022

Benchmarks

I had another deeper look at the 1.2.4 benchmark from last week. As I have written before, we can see that the benchmark runs into the same issue, but there are also times where it reaches its full potential. We can clearly see how the append/commit latency then affects the throughput. Digging deeper into the metrics and code I found the segment flush metric (atomix_segment_flush_time_bucket). This https://github.com/camunda-cloud/zeebe/blob/develop/journal/src/main/java/io/camunda/zeebe/journal/file/JournalMetrics.java#L74 is called every time we commit on the leader, or when we append on a follower and send a response back. The metric observes how long it takes to flush the entire buffer to disk.

I created some new panels and views to compare the different latencies better.

124-lats-modified

What we can see is that IF the segment flush latency goes up, the append latency goes up; this can directly impact the processing throughput, as we can see in the panels.

If more than one node is affected by the segment flush issue, the commit latency is impacted accordingly, which of course also impacts the processing throughput.

I verified that on several other benchmarks; on all of them we see high flush latency from time to time, which then affects the whole system.

  • Benchmark with 1.2.9 (129-lats): Here we see that even in 1.2.9 the throughput can really go down. The flush latency seems to be higher over a longer period of time.
  • Benchmark Revert #8124 (keep-readers-one-part-lats): Even with the reverted PR, we can see that the avg commit latency is still high and we are not reaching our expected performance, since the change doesn't affect the segment flush latency.
  • Benchmark not always commit (3e4556c) (not-alwayscommit-lats): Same here.

Interestingly, I think the throughput differences are not that visible with three partitions, but we can also see the high latencies there.

  • Benchmark medic CW 2 (p3-medic-cw2-lats): This benchmark looks quite bad; here we reach really low performance and we see indications of the high latencies.
  • Benchmark Revert #8124 with three partitions (p3-keep-reader-lats): The avg throughput looks better, and so does the latency. Over a longer time it seems to get even better. I started a second benchmark to verify that.
  • Benchmark not always commit (3e4556c) (p3-notalways-commit-lats): Here we see no improvement at all.
  • Benchmark Release 1.3.0-alpha1 (alpha1-lats): The throughput fluctuates quite a bit here; the segment latency is as high as above.
  • Benchmark Release 1.3.0-alpha2 (alpha2-lats): With alpha2 we have fewer drops; in general it looks a bit better.
  • Benchmark release 1.2.x (release-12x-lats): This is the only benchmark left which reaches ~200 throughput; here we can clearly see that the segment flush is also better (red is better) than in the other benchmarks.

My conclusion is that this also happens on 1.2.x, as I wrote above (but it still seems to be less likely there, probably because some changes were not made in 1.2). Furthermore, it seems to be less likely, or to have less impact, with multiple partitions. It seems to be related to the segment flush latency. We made multiple changes in 1.3 (see above) which might add additional latency; these sum up and make the issue worse. I added some new issues today which we could tackle to improve in some areas: #8602, #8601.

I would like to pair with someone (@npepinpe, @oleschoenburg) in order to discuss how we want to continue. The segment flush metric only observes the mapped buffer flush, and currently I don't know how this can sometimes be impacted; maybe it depends on the size of the buffer which is flushed? Since I also have another topic I should work on this quarter, I don't know how much time I should spend here. BTW, it might also be that I totally misinterpret the numbers and metrics, so I think it makes sense that someone challenges that.

Zelldon added the kind/research label on Jan 17, 2022

Zelldon (Member Author) commented Jan 18, 2022

Investigation 18-01-2022

I ran benchmarks over night and made some observations.

Observations

The benchmark keeping the readers (revert of #8124) went up in throughput to ~193 PI/s.

keep-lats

It looks like the latency goes down completely over night, and the throughput correspondingly goes up. Around ~8 am we see an increase in latency again and a drop in throughput.

Yesterday evening I saw that the performance got better and wanted to verify that with a second benchmark using the same image. Here we still do not reach the expected goal of ~195-200 PI/s.

keep2-lats

I thought again about the flush latency and wanted to check whether it is disk related. I actually verified that before, but I wanted to be sure.
I checked the disk metrics and saw that we are heavily write-throttled again.

keep2-disk

Reading the docs, I decided to try again to increase the disk size and the CPUs.

# RESOURCES
resources:
  limits:
    cpu: 8
    memory: 4Gi
  requests:
    cpu: 8
    memory: 4Gi

# PVC
pvcAccessMode: ["ReadWriteOnce"]
pvcSize: 512Gi
pvcStorageClassName: ssd

diskincrease

We can see in the metrics that the overall maximum flush latency is lower than before, BUT the distribution is still wide. The avg commit latency is also higher, which means we are not able to reach our expected throughput. So it seems we can fix it neither with bigger disks nor with more CPUs.

Somehow I feel it is more related to how many benchmarks are running and how many brokers are running on the same node, because in the benchmark above we see it go up around ~8 am, when I started more benchmarks.

Debian Image

Another guess was that this might again be related to our different base image, since we changed from Debian to Ubuntu due to the JDK 17 migration, and that this might have different flush performance 🤷

debian-lats

We also see here that this doesn't have any effect.

Benchmark result for one partition:

debian-lats-one-partition

Zelldon (Member Author) commented Jan 18, 2022

Discussed the next steps together with @npepinpe.

@npepinpe will validate the priority of this issue, and of performance in general, against other issues we need to tackle before our upcoming release in April.

@Zelldon will produce flamegraphs for the different versions; maybe this gives us some hints.
@Zelldon will add some more fine-grained metrics which show what happens during a commit.

Zelldon (Member Author) commented Jan 19, 2022

Additional observations

As described above, I ran new benchmarks to take flamegraphs of the different versions (1.1.10, 1.2.9, 1.3.0) to see what has changed. Unfortunately, I haven't seen/found anything problematic.

Flamegraphs

From left to right, the flamegraphs are from versions 1.3.0, 1.2.9 and 1.1.10. They are also attached to this comment:
flames.zip

flames-general

Actor

Zooming into an actor, we can see that we yield quite often, but this was already the case before. I think it makes sense to configure the actor threads wisely depending on the partition count.
flames-actor

Append

Zooming into the Raft thread and searching for the term append gives us similar results in all versions.

flames-append
One thing I saw is that the stack of the MappedBuffer flush/force changed due to the JDK change (to v17). Furthermore, it seems some more methods are called in the AbstractAppender.

Commit

Searching for the term commit doesn't show any unexpected results. We can see that the listeners are now called a bit differently.

flames-commit

Follower

All of the above was taken on the leader nodes, but I thought the results from the followers might also be interesting. I did that for 1.3.0 and 1.2.9, since we had no replay in 1.1.10.

In general it looks quite similar; nothing really stands out.

followers-append

Append

On the follower append side, we can see that we have an additional copy in 1.3.0. To be specific, in the MappedSegmentWriter when we writeData we call https://github.com/camunda-cloud/zeebe/blame/3da1a75d52e524b123f91a8ec9e480a0337bbf42/journal/src/main/java/io/camunda/zeebe/journal/file/record/SBESerializer.java#L60. But the commit which added this is 11 months old, and I'm not sure whether this can really cause our issue. Looking at the git history, I also found #7967, which I tried to revert; I ran several benchmarks with that over night.

follower-append-write

Replay

The replay on the followers looks quite similar; some stack traces are different, but this can also be because it is just a snapshot of what happens.

follower-replay

Benchmarks

As I had the benchmarks set up for the flamegraphs, I also had time to observe them and look at their metrics.

1.1.10 1.2.9 1.3.0
stable1110-general stable129-general stable130-general

Interestingly, 1.1.10 performed quite similarly to 1.3.0 (with ~80 PI/s avg throughput), while 1.2.9 was able to do ~88 PI/s on average.

Nodes

I still feel that our current benchmarks are highly dependent on how they are scheduled in Kubernetes.
I checked the nodes for each of the benchmarks and saw something interesting.

stable-130-nodes

The Leader and follower of 1.2.9 are scheduled on the same node, which might give them some performance benefits.
The leader of 1.3.0 is scheduled together with them.

We can also look at the node metrics; I'm currently not sure what they tell us, but it might be interesting for later.
stable-130-nodes-metrics

The followers for 1.3.0 are together scheduled on another node.

stable-130-nodes-2

The leader of 1.1.10, which performs similarly to 1.3.0, is also not scheduled together with its followers.
stable-1110-nodes

Here we can see that the disk IO is much higher, but to be honest I don't know why.
stable-1110-nodes-metrics

Latencies

Last week I opened an issue describing that some nodes are preferred for appending (#8566). That issue seems to have another impact: I observed that if the node which is always preferred has a good append latency, the commit latency is also quite good, but as soon as the append latency of that node goes up, the commit latency goes up as well!

We can see that in the benchmark where I reverted #7967.

revert-7967-latency-increase

Around 22:00 the latencies swapped: appends to one follower are now faster and to the other slower. But the commit latency goes up and never comes back. Here it seems to prefer node 2.

Revert 7967

As written before, based on the flamegraphs and the git history I guessed that it might be interesting to revert #7967. At the beginning it also looked quite good.

revert-7967-general

We can see that the leader and the follower are again scheduled on the same node.

revert-7967-nodes
revert-7967-nodes-metrics

But over the night it went down again to the usual performance 😢

revert-7967-latency-increase

Taking a look at the nodes, we can see that the leader is now alone on the node (with other pods from different namespaces)
revert7967-nodes2

Which brings me again to the point that the general performance highly depends on where the pods are scheduled, which makes it really hard to compare short benchmarks / time frames. Running benchmarks over a longer time (like 12 or 24 hours), the avg throughput gives a good approximation (due to rescheduling, preemption, etc.). So yes, we still have a performance regression and it was not resolved by reverting the above PR, but what I learned is to take the results of short benchmarks with a grain of salt.

Next Steps

@npepinpe any updates on your side regarding the priority?
Otherwise I will start with adding more metrics to profile what happens during a commit.

Another idea which came to my mind is to bring back the medic benchmarks between 1.2 and 1.3.0 to see when it changed. I think I have seen before that it was between alpha1 and alpha2 (#8425 (comment)), but looking at the comment we already see a degradation in alpha1. So I would start with the first week after the 1.2.0 release.

Zelldon moved this from In progress to Ready in Zeebe on Jan 19, 2022
npepinpe removed this from Ready in Zeebe on Jan 19, 2022

npepinpe (Member):
Moving to the backlog for now, probably until Q2. See the last comment in #8425

Zelldon (Member Author) commented Jan 22, 2022

I did some tests over the last days.

What I observed is that one node always seems to be strongly preferred (related to #8566); if this one has a higher append latency, the avg commit latency goes up, even if the append latency to the other node goes down. This is something I wouldn't expect. See the screenshot below: the append latency to node one goes up and to node two goes down, but the avg commit latency still goes up.

append-flush

Looking at the flush rate, we can see that it seems to go down on node one, which causes the flush duration to increase (if we flush less often, but more bytes at once, each flush takes longer). It looks like it is related to install requests, since at the same time node one receives install requests from the leader (related: #8565).

flush-increase

romansmirnov (Member):
This week, I spent a bit of time trying to understand the impact of taking snapshots, replicating snapshots, and compaction. Therefore, I did some micro-benchmarks (in an old-fashioned way) to understand which operations spike, how they are related to each other, and how they influence each other. Eventually, I ended up micro-benchmarking the "Commit Latency" as this brings everything together. I just want to share my findings here because on the one hand they underline some of @Zelldon's observations, and on the other hand they might give additional input to this issue. Hopefully, this gives a bit more insight into how things are tied together.

tl;dr: When running the micro-benchmarks, the following operations often ended up in latency spikes:

  • Flushing to the journal (mostly because of IO latency spikes, for example when iowait on a node that is leader for at least one partition is >= 10%, etc.)
  • Updating metadata in Raft's metastore (writing to the file - without flush - via FileChannel seems to be inefficient)
  • Replicating snapshots (this includes the entire path, i.e., replicating from leader to follower, the subsequent compaction on the follower side, resetting the next index on the leader side, ...)
  • Compaction (highly depends on the snapshot interval and load)

As a counter-check, I turned off all these operations, and as expected the cluster was capable of handling the load:

  • Starting 200pi/s: Cluster with 3 partitions and replication factor 3:
    image
  • Starting 100pi/s: Cluster with 1 partition and replication factor 3:
    image

In this counter-check, the only limiting factors were:

  • Log segment creation and switching (if a log segment is full)
  • Potential network glitches (which did not occur in the tests so far)

Now, the long version follows with some more details 😄

Record Write Latency

The Record Write Latency measures the roundtrip from the LogStorageAppender to the corresponding "Raft thread" and then back to the LogStorageAppender once the entry got written to the log storage.

Here are my observations:

  • As long as the log segment is not full, writing to the log segment is "just" a memcopy, which usually is not expensive - at least I could not find any sample indicating spikes here.
  • Only when the log segment is full does a new log segment need to be created. The creation of a new log segment takes up to 10 ms, but it can also spike up to 100-250 ms (and even higher).
  • The context switch back to the LogStorageAppender happens almost in no time, meaning even if the LogStorageAppender is idle it will pick up the job to execute instantly. There is not much time lost.

To sum up, the creation of a new log segment may cause latency spikes.
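For context, a minimal sketch of what creating a new mapped segment typically involves (not the actual journal code): the file is created, grown to the full segment size, and memory-mapped, and that file-system work is where the occasional spike can come from.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public final class SegmentCreationSketch {

  /** Creates a segment file of the given size and maps it into memory. */
  static MappedByteBuffer createSegment(final Path segmentFile, final int segmentSize)
      throws IOException {
    try (FileChannel channel =
        FileChannel.open(
            segmentFile,
            StandardOpenOption.CREATE,
            StandardOpenOption.READ,
            StandardOpenOption.WRITE)) {
      // Growing the file to its full size and mapping it hits the file system; when the
      // disk is busy or throttled this is where the 100 ms+ spikes can show up.
      return channel.map(MapMode.READ_WRITE, 0, segmentSize);
    }
  }
}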

Append Entry Latency

The Append Entry Latency measures the roundtrip between sending an AppendRequest to a follower and receiving an AppendResponse from the follower.

Here are my observations:

  • The time between the leader sending and the follower receiving the AppendRequest is up to 2 ms (this includes deserialization on the follower side).
  • As long as the log segment is not full, writing to the log segment is again "just" a memcopy which is not expensive.
  • Again, if the log segment is full, this requires creating a new log segment which may result in a latency spike.
  • After writing to the log segment, it flushes the writes to the log segment's file which results in latency spikes.
  • After flushing, it updates Raft's metastore by setting the last written index to the new index. This often has a latency spike.
  • When the follower is done with request handling, it responds to the leader.

To sum up, the following influences mostly this metric:

  • Flushing on the follower side
  • Storing the last written index in Raft's metastore

Commit Latency

The Commit Latency measures the roundtrip of an entry submitted by the LogStorageAppender to the "Raft thread" including Record Write Latency and Append Entry Latency back to the LogStorageAppender once replicated to a quorum of members.

Here are my observations:

  • When handling the AppendResponse, the leader calculates the commit index which happens in a constant time.
  • When the commit index grows, then the leader flushes to the log segment's file. This may result in latency spikes.
  • After flushing, it will update Raft's metastore. Again this operation might have latency spikes.
  • After updating the metastore, it will submit a new "job" in case a quorum is reached to notify the LogStorageAppender that some records got committed.

Also here, in addition to the Record Write Latency and Append Entry Latency, the following influences mostly this metric:

  • Flushing on the leader
  • Storing the last written index in Raft's metastore (on the leader side)

Additional Overhead

All the latency metrics above contain additional overhead, in the sense that they may include latency spikes of operations from other jobs. The "Raft thread" maintains a worker queue to which it and others can submit work (i.e., jobs), and the Raft thread executes those jobs sequentially. That means if there are 10 jobs inside the worker queue, they are executed one by one, which influences the latency as follows:

  • j1: execution time 10 ms -> latency >= 10 ms
  • j2: execution time 10 ms -> latency >= 20 ms (latency of j1 + execution time of j2)
  • j3: execution time 10 ms -> latency >= 30 ms (latency of j2 + execution time of j3)
  • ...
  • j9: execution time 10 ms -> latency >= 90 ms
  • j10: execution time 10 ms -> latency >= 100 ms (latency of j9 + execution time of j10)

Basically, the latency of job j10 includes its own execution time of 10 ms and the total execution time of the 9 jobs queued before it. So when one of the jobs experiences any kind of latency spike, this impacts the latency of all following jobs in the queue.
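A minimal sketch of that effect (not the Zeebe actor code), using a plain single-threaded executor; the queue wait measured here is exactly what the scheduling-latency metric added earlier in this issue is meant to capture:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public final class QueueWaitSketch {

  public static void main(final String[] args) throws InterruptedException {
    final ExecutorService raftThread = Executors.newSingleThreadExecutor();

    for (int job = 1; job <= 10; job++) {
      final int jobId = job;
      final long enqueuedAtNanos = System.nanoTime();
      raftThread.execute(
          () -> {
            final long queueWaitMs = (System.nanoTime() - enqueuedAtNanos) / 1_000_000;
            sleepMillis(10); // every job "executes" for 10 ms
            // j1 waits ~0 ms, j10 waits ~90 ms: its latency includes all jobs before it
            System.out.printf("j%d waited %d ms in the queue%n", jobId, queueWaitMs);
          });
    }

    raftThread.shutdown();
    raftThread.awaitTermination(5, TimeUnit.SECONDS);
  }

  private static void sleepMillis(final long millis) {
    try {
      Thread.sleep(millis);
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}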

Taking Snapshot

A broker takes snapshots in intervals for each partition. Taking a snapshot happens asynchronously and has no impact on the processing or replication path. At least, I couldn't find anything that has an impact here.

Compaction

After taking a snapshot, the log might get compacted. Meaning, it will delete log segments up to the snapshotted index. Depending on how many log segments need to be deleted, this may take some time. In my benchmarks with a snapshot interval of 5 minutes, the compaction took roughly ~150ms and more. In addition, if all log segments get deleted, it will require the creation of a new log segment which again adds additional latency.

However, compaction is part of the replication path in Raft. Meaning, after taking a snapshot, the compaction is submitted to the respective Raft thread. Eventually, the Raft thread will execute that compaction, which takes time. Basically, during compaction the leader won't replicate any entries. And in case of compaction on the follower side, the follower won't handle any append requests during that time and starts lagging behind. Things get even worse if the follower triggers compaction on all partitions it is following at the same time and the follower is part of the quorum. This will have an impact on the Commit Latency and the overall throughput.

Also worth considering: compaction also happens after installing a snapshot received from the leader of a partition.

Just a side note: another potential issue of compaction is that it does not consider any in-flight requests. For example, the most recent follower might lag 10 entries behind, but with the compaction those 10 entries may get deleted. As a consequence, the leader needs to replicate its latest snapshot to the follower, which again might cause the follower to lag behind.

Snapshot Replication

Whenever a snapshot is replicated (for whatever reason), the replication typically takes up to ~200 ms. During that time the follower the snapshot is replicated to does not receive any append requests, which is expected. The downside is that after installing the snapshot the follower is already lagging behind the leader. This increases the probability of receiving a new snapshot when the leader takes a new one, which may end up in a snapshot replication loop.

In addition, on the follower side, after installing the replicated snapshot, the follower will trigger compaction. This again blocks the Raft thread for a while so that it won't be able to handle any incoming append requests. This may result in lagging behind the leader even more. There might also be cases where taking a snapshot and receiving a snapshot happen frequently in a short period of time:

image

In this case, the follower took a snapshot on its own, afterwards got another snapshot replicated from the leader, and just a minute later took another snapshot on its own. So it triggered 3 compactions in a row. This may happen multiple times, so the follower had a hard time catching up with the leader over time.

On the leader side, after replicating the snapshot, it resets the follower's next index for replication. This seek typically takes ~50 ms.

To close the loop: there are multiple factors/variables that influence the commit latency. I think the latency spike caused by Raft's metastore update might be fixed easily by switching to mmap. Others might need a bit more thinking, I don't know yet... For example, compaction should happen in the background and should not be part of the replication path. However, the operations listed above mostly interact with the file system, and it looks like they are impacted by IO bottlenecks/limitations.
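For reference, a minimal sketch (not the actual metastore code) of the two variants discussed here: writing the last written index through a FileChannel versus through a mapped buffer. As the next comment points out, the metastore must be flushed immediately for crash safety, so even the mmap variant keeps a force() call:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public final class MetaStoreWriteSketch {

  /** Variant 1: write the 8-byte index through a FileChannel (a syscall per update). */
  static void storeLastWrittenIndex(final FileChannel channel, final long index)
      throws IOException {
    final ByteBuffer buffer = ByteBuffer.allocate(Long.BYTES);
    buffer.putLong(index).flip();
    channel.write(buffer, 0);
    channel.force(false); // needed for crash safety; also where the latency spike shows up
  }

  /** Variant 2: write through a mapped buffer (a plain memory write)... */
  static void storeLastWrittenIndexMapped(final MappedByteBuffer metaBuffer, final long index) {
    metaBuffer.putLong(0, index);
    metaBuffer.force(); // ...but durability still requires an msync, so the flush cost remains
  }
}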

deepthidevaki (Contributor):

I think the latency spike caused by Raft's metastore update might be fixed easily by switching to mmap.

Updates to the metastore must be flushed to disk immediately to handle crashes, so switching to mmap might not help.

romansmirnov (Member) commented Feb 7, 2022

Zelldon (Member Author) commented Feb 7, 2022

Zelldon (Member Author) commented Feb 7, 2022

Discussed with @romansmirnov.

The current assumption is that the actor threads are quite aggressive in work stealing and cycle quickly back and forth between idle and running, which consumes a lot of CPU time. This can also be seen here: #8551 (comment)

Reducing the number of threads means that each actor thread gets more work to do and does not run into these issues, where the threads have to compete for jobs/work.

Zelldon (Member Author) commented Feb 9, 2022

@romansmirnov the benchmarks I started with the reduced thread count are now as slow as the other benchmarks. But the current medic week benchmark looks a bit faster. 🤷‍♂️

Zelldon (Member Author) commented Mar 10, 2022

Today I had another look at this, since I was wondering why it does not happen on the long-running cluster.

I realized that we use different node types in our long-running cluster: we use n1-standard-16 in our normal zeebe-cluster and n1-standard-8 in the long-running cluster. This means that we might run more brokers/pods on one node in our normal zeebe-cluster, which could cause more pressure on IO/disks. It would also explain why this started to happen after we reduced the benchmark resources.

I checked some metrics in GKE and, as expected, saw that we have much more disk throughput per VM in our preemptible zeebe-cluster.

Zeebe-cluster:
Up to 100 MB/s, on avg 50 MB/s
Top write throughput (1)

Machine Type n16
n16

Long Running cluster:
Not really higher than 40 MB/s
Top write throughput

Machine Type n8

n8

I think we can verify that with anti-affinity rules, and check how it behaves and performs.

Zelldon (Member Author) commented Mar 10, 2022

I tried it out with:

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: "app.kubernetes.io/component"
                operator: In
                values:
                  - zeebe-broker
          topologyKey: "kubernetes.io/hostname"
          namespaces:
            - ccsm-helm
            - medic-cw-07-b83992c74d-benchmark
            - medic-cw-08-12c4ea63e6-benchmark
            - medic-cw-09-e4718dc49c-benchmark
            - medic-cw-10-ba172d7b06-benchmark
            - zell-affinity

This allows the brokers to run on nodes without other brokers (also without brokers from other namespaces):

ProviderID:                   gce://zeebe-io/europe-west1-b/gke-zeebe-cluster-default-pool-1f44de48-4scq
Non-terminated Pods:          (9 in total)
  Namespace                   Name                                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                       ------------  ----------  ---------------  -------------  ---
  default                     metrics-prometheus-node-exporter-8kwgm                     100m (0%)     0 (0%)      0 (0%)           0 (0%)         80m
  kube-system                 fluentbit-gke-9545p                                        100m (0%)     0 (0%)      200Mi (0%)       500Mi (0%)     80m
  kube-system                 gke-metrics-agent-pvgpj                                    3m (0%)       0 (0%)      50Mi (0%)        50Mi (0%)      80m
  kube-system                 kube-proxy-gke-zeebe-cluster-default-pool-1f44de48-4scq    100m (0%)     0 (0%)      0 (0%)           0 (0%)         80m
  kube-system                 pdcsi-node-kjvhr                                           10m (0%)      0 (0%)      20Mi (0%)        100Mi (0%)     80m
  zell-affinity               elasticsearch-master-0                                     3 (18%)       3 (18%)     8Gi (14%)        8Gi (14%)      73m
  zell-affinity               worker-6b9f675476-trz5x                                    500m (3%)     500m (3%)   256Mi (0%)       256Mi (0%)     71m
  zell-affinity               zell-affinity-zeebe-2                                      5 (31%)       5 (31%)     4Gi (7%)         4Gi (7%)       73m
  zell-affinity               zell-affinity-zeebe-gateway-69ccdd4fb5-78ddd               1 (6%)        1 (6%)      512Mi (0%)       512Mi (0%)     73m


ProviderID:                         gce://zeebe-io/europe-west1-b/gke-zeebe-cluster-default-pool-1f44de48-d7r8
Non-terminated Pods:                (8 in total)
  Namespace                         Name                                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                         ----                                                       ------------  ----------  ---------------  -------------  ---
  default                           metrics-prometheus-node-exporter-xwskt                     100m (0%)     0 (0%)      0 (0%)           0 (0%)         3h6m
  kube-system                       fluentbit-gke-zs9bl                                        100m (0%)     0 (0%)      200Mi (0%)       500Mi (0%)     3h6m
  kube-system                       gke-metrics-agent-4q7vd                                    3m (0%)       0 (0%)      50Mi (0%)        50Mi (0%)      3h6m
  kube-system                       kube-proxy-gke-zeebe-cluster-default-pool-1f44de48-d7r8    100m (0%)     0 (0%)      0 (0%)           0 (0%)         3h6m
  kube-system                       pdcsi-node-6v7bh                                           10m (0%)      0 (0%)      20Mi (0%)        100Mi (0%)     3h6m
  medic-cw-10-ba172d7b06-benchmark  elasticsearch-master-1                                     3 (18%)       3 (18%)     8Gi (14%)        8Gi (14%)      60m
  zell-affinity                     elasticsearch-master-2                                     3 (18%)       3 (18%)     8Gi (14%)        8Gi (14%)      60m
  zell-affinity                     zell-affinity-zeebe-1                                      5 (31%)       5 (31%)     4Gi (7%)         4Gi (7%)       73m


ProviderID:                   gce://zeebe-io/europe-west1-b/gke-zeebe-cluster-default-pool-1f44de48-bd7n
Non-terminated Pods:          (11 in total)
  Namespace                   Name                                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                       ------------  ----------  ---------------  -------------  ---
  default                     metrics-prometheus-node-exporter-vwvb4                     100m (0%)     0 (0%)      0 (0%)           0 (0%)         3h10m
  kube-system                 fluentbit-gke-nqrf5                                        100m (0%)     0 (0%)      200Mi (0%)       500Mi (0%)     3h10m
  kube-system                 gke-metrics-agent-d7l52                                    3m (0%)       0 (0%)      50Mi (0%)        50Mi (0%)      3h10m
  kube-system                 kube-proxy-gke-zeebe-cluster-default-pool-1f44de48-bd7n    100m (0%)     0 (0%)      0 (0%)           0 (0%)         3h9m
  kube-system                 pdcsi-node-5qj49                                           10m (0%)      0 (0%)      20Mi (0%)        100Mi (0%)     3h10m
  zell-affinity               elasticsearch-master-1                                     3 (18%)       3 (18%)     8Gi (14%)        8Gi (14%)      72m
  zell-affinity               starter-7cc79b44c7-6w2vx                                   250m (1%)     250m (1%)   256Mi (0%)       256Mi (0%)     71m
  zell-affinity               worker-6b9f675476-gwlnz                                    500m (3%)     500m (3%)   256Mi (0%)       256Mi (0%)     71m
  zell-affinity               zell-affinity-zeebe-0                                      5 (31%)       5 (31%)     4Gi (7%)         4Gi (7%)       72m
  zell-affinity               zell-affinity-zeebe-gateway-69ccdd4fb5-728r6               1 (6%)        1 (6%)      512Mi (0%)       512Mi (0%)     72m
  zell-affinity               zell-affinity-zeebe-gateway-69ccdd4fb5-grpbr               1 (6%)        1 (6%)      512Mi (0%)       512Mi (0%)     72m

Still, we see low performance (or at least lower than I would expect) and a high commit latency.

general
latency

falko (Member) commented Mar 11, 2022

What about ElasticSearch running on the same Kube node as the broker?

Zelldon (Member Author) commented Mar 11, 2022

@falko I was not able to schedule them on different nodes; I failed to configure this properly. I added anti-affinity also for the elasticsearch-master labels, but somehow they always ended up on the same node 🤷‍♂️. I can try again in the next days. But there is already much less load on these nodes than on the others.

Zelldon (Member Author) commented Mar 14, 2022

I did another benchmark with

  resources:
    limits:
      cpu: 15
      memory: 4Gi
    requests:
      cpu: 15
      memory: 4Gi

This results in one broker per node:

ProviderID:                   gce://zeebe-io/europe-west1-b/gke-zeebe-cluster-default-pool-1f44de48-dgb8
Non-terminated Pods:          (6 in total)
  Namespace                   Name                                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                       ------------  ----------  ---------------  -------------  ---
  default                     metrics-prometheus-node-exporter-jlz7t                     100m (0%)     0 (0%)      0 (0%)           0 (0%)         71m
  kube-system                 fluentbit-gke-v94vl                                        100m (0%)     0 (0%)      200Mi (0%)       500Mi (0%)     71m
  kube-system                 gke-metrics-agent-z8tgk                                    3m (0%)       0 (0%)      50Mi (0%)        50Mi (0%)      71m
  kube-system                 kube-proxy-gke-zeebe-cluster-default-pool-1f44de48-dgb8    100m (0%)     0 (0%)      0 (0%)           0 (0%)         71m
  kube-system                 pdcsi-node-7qtsl                                           10m (0%)      0 (0%)      20Mi (0%)        100Mi (0%)     71m
  zell-affinity               zell-affinity-zeebe-0                                      15 (94%)      15 (94%)    4Gi (7%)         4Gi (7%)       72m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests      Limits
  --------                   --------      ------
  cpu                        15313m (96%)  15 (94%)
  memory                     4366Mi (7%)   4746Mi (8%)
  ephemeral-storage          0 (0%)        0 (0%)
  hugepages-1Gi              0 (0%)        0 (0%)
  hugepages-2Mi              0 (0%)        0 (0%)
  attachable-volumes-gce-pd  0             0
Events:                      <none>

The numbers look far better.

In the following screenshot we first see the benchmark with affinity only, which is still fluctuating. Around 2 pm I restarted the benchmark with higher CPU requests/limits. We can see a flat line now.

general-both

If we take a look at the last 3 hours, we can see that the avg is almost 200 again, as we want it to be.

general-last-3h

Latency:

In the latency we can see a bit of a difference. There are fewer outliers, so the max is not as high.

latency
raft-latency

I haven't invested more time in investigating the latency here.

CPU
Interestingly, it seems we also use less CPU?

cpu-both

Snapshots

The install request rate looks still the same as before.

install


Concluding this, I would say that our benchmarks are highly affected by the benchmarks running in parallel, i.e. by pods which are scheduled on the same node. Not only other brokers but also Elasticsearch impacts our throughput. This started to happen when we decreased the resources for our benchmarks. It was not reproducible in our long-running cluster, since we use smaller nodes there.

Still, we do not reach the 200 PI/s I would expect, so we potentially have other issues which influence this as well; probably things we already discussed above.

How do we want to continue here? @npepinpe /cc @falko @deepthidevaki @romansmirnov

Zelldon (Member Author) commented Mar 16, 2022

Today I discussed this topic with @npepinpe. In general we are currently not focusing on it and will not spend much time on it, but if I have time I will set up a new node pool with local SSDs attached. This helps to figure out whether it is network-IO related: if the benchmarks still show low performance with local SSDs, then it is probably related to CPU cache contention; otherwise it is related to network IO. In general I will soon add some anti-affinity rules to the Helm charts, which should help a bit here as well.

We should probably document all the insights here better for users, so they know what to do if they face performance issues.

zeebe-bors-camunda bot added a commit that referenced this issue Mar 29, 2022
8977: Add random protocol record/value factory r=npepinpe a=npepinpe

## Description

Adds a new utility to the `protocol-test-util` module which allows generating random records and values in a deterministic way. Note that as the data is randomly generated, it has no meaning in itself - keys, positions, etc., are completely random. However, the `value` and `intent` are guaranteed to always be derived from the `valueType`.

This is currently used to properly test the deserialization of the protocol via Jackson, but can later be used for exporter related unit tests.

## Related issues

closes #8837 



8999: Add commit and record write quantile panels r=Zelldon a=Zelldon

## Description

Taking a look at performance issues often forces me in the end to look at the commit latencies, because of #8551. It is currently quite hard to compare latencies from two benchmarks, since they are shown as a heatmap.

![old-panels](https://user-images.githubusercontent.com/2758593/160366627-bfd3e8f3-9c59-4bd6-8831-6df5f4a6e8ca.png)

This PR should solve this, and means I don't always have to recreate my panels (to see a difference).

It adds two new panels, one for the commit latency and one for the record write latency. Both show the quantiles (p90, p99), the median, and the avg.

The formulas are based on: https://theswissbay.ch/pdf/Books/Computer%20science/prometheus_upandrunning.pdf (I use the book).


The new panels look like this:

![new-panels](https://user-images.githubusercontent.com/2758593/160366881-52720cfc-1bcd-4d46-8055-27c59a873c64.png)

This makes it easier to compare against other benchmarks which perform worse, like:

![otherbench-new-panels](https://user-images.githubusercontent.com/2758593/160366975-ffc7ac5d-3488-4ea7-9fd9-0839b365bfb1.png)



## Related issues


related to #8551



Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com>
Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
zeebe-bors-camunda bot added a commit that referenced this issue Mar 29, 2022 (8999: Add commit and record write quantile panels; 9017: chore(maven): add trailing slashes to new Artifactory URL)

zeebe-bors-camunda bot added a commit that referenced this issue Mar 29, 2022 (8999: Add commit and record write quantile panels)
Zelldon removed their assignment on Jun 9, 2022

Zelldon (Member Author) commented Jul 6, 2022

It is crazy: I checked the benchmarks over the last days and we have really volatile backpressure metrics; one day it is 3%, today on one benchmark it was even 30%. It often seems to be related to which broker is currently the leader of all partitions (it seems to almost always be the case that one broker is the leader for all partitions). Probably related to #8566.

general

What I can see is that the throughput is of course also affected by this, ranging from 175 PI/s up to 188 PI/s.

All of that seems to be caused by the fluctuating commit latency.
latency

This in turn is related to how many other pods (other brokers or Elasticsearch) are running on the same Kubernetes node.

$ k describe node gke-***-default-pool-1f44de48-rhgm
Non-terminated Pods:                                               (17 in total)
  Namespace                                                        Name                                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                                                        ----                                                       ------------  ----------  ---------------  -------------  ---
  camunda-platform-kpf42u-d6c8bc48244f8774746d5297782809c0b4167fe  camunda-platform-it-operate-6cfdbf7889-v8rrn               600m (3%)     2 (12%)     400Mi (0%)       2Gi (3%)       10h
  camunda-platform-kpf42u-d6c8bc48244f8774746d5297782809c0b4167fe  camunda-platform-it-zeebe-2                                800m (5%)     960m (6%)   1200Mi (2%)      1920Mi (3%)    10h
  camunda-platform-neaobs-01d774b3603f9d1fd603deb2d59c91edea1acf9  camunda-platform-it-zeebe-1                                800m (5%)     960m (6%)   1200Mi (2%)      1920Mi (3%)    11h
  camunda-platform-neaobs-01d774b3603f9d1fd603deb2d59c91edea1acf9  elasticsearch-master-1                                     1 (6%)        2 (12%)     1Gi (1%)         2Gi (3%)       11h
  default                                                          metrics-prometheus-node-exporter-dhv8z                     100m (0%)     0 (0%)      0 (0%)           0 (0%)         11h
  kube-system                                                      fluentbit-gke-7pfhq                                        100m (0%)     0 (0%)      200Mi (0%)       500Mi (0%)     11h
  kube-system                                                      gke-metrics-agent-zn7kv                                    3m (0%)       0 (0%)      50Mi (0%)        50Mi (0%)      11h
  kube-system                                                      konnectivity-agent-967db5996-cws9d                         10m (0%)      0 (0%)      30Mi (0%)        30Mi (0%)      114m
  kube-system                                                      kube-proxy-gke-zeebe-cluster-default-pool-1f44de48-rhgm    100m (0%)     0 (0%)      0 (0%)           0 (0%)         11h
  kube-system                                                      metrics-server-v0.4.5-85bdf86c4d-xnbcj                     100m (0%)     195m (1%)   521Mi (0%)       771Mi (1%)     114m
  kube-system                                                      pdcsi-node-hgddm                                           10m (0%)      0 (0%)      20Mi (0%)        100Mi (0%)     11h
  medic-cw-25-006383a639-benchmark                                 worker-7bcf94ccd8-d9tjp                                    500m (3%)     500m (3%)   256Mi (0%)       256Mi (0%)     10h
  medic-cw-26-f834c4866b-benchmark                                 medic-cw-26-f834c4866b-benchmark-zeebe-0                   5 (31%)       5 (31%)     4Gi (7%)         4Gi (7%)       11h
  medic-cw-26-f834c4866b-benchmark                                 worker-554d65bc6-g4tpb                                     500m (3%)     500m (3%)   256Mi (0%)       256Mi (0%)     10h
  medic-cw-27-56ad2b36c8-benchmark                                 medic-cw-27-56ad2b36c8-benchmark-zeebe-0                   5 (31%)       5 (31%)     4Gi (7%)         4Gi (7%)       11h
  medic-cw-27-56ad2b36c8-benchmark                                 starter-6c98c758b9-rnhjq                                   250m (1%)     250m (1%)   256Mi (0%)       256Mi (0%)     114m
  zell-playground                                                  elasticsearch-master-0                                     1 (6%)        2 (12%)     1Gi (1%)         2Gi (3%)       11h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests       Limits
  --------                   --------       ------
  cpu                        15873m (99%)   19365m (121%)
  memory                     14629Mi (26%)  20395Mi (37%)
  ephemeral-storage          0 (0%)         0 (0%)
  hugepages-1Gi              0 (0%)         0 (0%)
  hugepages-2Mi              0 (0%)         0 (0%)
  attachable-volumes-gce-pd  0              0
Events:                      <none>


BTW, you may ask why this is happening: why do we have multiple brokers on the same node when we have anti-affinity defined? Well, the anti-affinity only applies per namespace, which means brokers from other namespaces can be assigned to the same node.

menski (Contributor) commented Oct 28, 2022

Closing as version is EOL.
