
Performance regression in 1.3.0-alpha2 #8425

Closed
Zelldon opened this issue Dec 17, 2021 · 17 comments
Labels
area/performance Marks an issue as performance related kind/bug Categorizes an issue or PR as a bug severity/mid Marks a bug as having a noticeable impact but with a known workaround

Comments

@Zelldon
Member

Zelldon commented Dec 17, 2021

Describe the bug

It looks like we have a performance regression in our latest alpha release; at least, it can be seen in the long-running benchmarks.

Alpha 2

perf

We can see that the throughput is much lower than on alpha1, and the current events and handled requests fluctuate heavily.

Latency

The latency seems to be much higher and unstable as well.

alpha2-latency
alpha2-latency2

Leader

I can't see any related leader changes or similar things

alpha2-leader

Previous alpha version:

perfalpa1

Latency

alpha1-latency
alpha1-latency2

To Reproduce

Run a long running benchmark?

Expected behavior

Similar throughput and latency as in the previous version, e.g. 200 PI/s.

Environment:

  • OS:
  • Zeebe Version: 1.3.0-alpha2
  • Configuration:
@Zelldon Zelldon added kind/bug Categorizes an issue or PR as a bug area/performance Marks an issue as performance related severity/mid Marks a bug as having a noticeable impact but with a known workaround labels Dec 17, 2021
@npepinpe npepinpe added this to Ready in Zeebe Dec 20, 2021
@npepinpe
Member

@Zelldon are you already looking into it?

@Zelldon
Member Author

Zelldon commented Dec 20, 2021

Nope @npepinpe

@Zelldon
Member Author

Zelldon commented Dec 21, 2021

Still the case:

general

@lenaschoenburg
Member

Pattern reminds me of this: #8244

@Zelldon
Member Author

Zelldon commented Dec 21, 2021

True @oleschoenburg, the only difference is that we are only running normal starters and workers.

When we zoom in, it looks really similar:
zoom

@korthout
Member

We've not seen this pattern again on benchmarks for release-1.3.0 like release-1-3-0 (preemptible nodes on zeebe cluster) and zell-test-release-1-3 (non-preemptible nodes on long-running cluster). We also did not encounter it on long-running-v1-for-minor-updates (which is currently running on 1.2.7).

We should still investigate what is going wrong with release-1-3-alpha2 (non-preemptible nodes on long-running cluster).

@Zelldon
Member Author

Zelldon commented Jan 5, 2022

@npepinpe it seems the alpha2 benchmark has recovered; the release benchmark on stable nodes also looks okayish.

Alpha2

alpha2

Still we have this weird processing pattern.

Release 1.3

release

Preemptible Nodes

On preemptible nodes it looks different, and it seems that performance is degrading. But this might also be related to our benchmark resources. I will start one benchmark with higher resources.

Release

preempt-release

Medic CW 51

medic51

Medic CW 50

As a comparison to show how latency and throughput looked before.

medic50

medic50-latency

Medic CW 01

What we can see is that the latency increased by at least a factor of two. The throughput graph also looks as if it is slowly decreasing.

medic01
medic01-latency

\cc @korthout

@Zelldon Zelldon self-assigned this Jan 5, 2022
@Zelldon Zelldon moved this from Ready to In progress in Zeebe Jan 5, 2022
@Zelldon
Member Author

Zelldon commented Jan 6, 2022

I ran two new benchmarks: one with the resource reduction (our current develop) and one without (reverting the merge of #8268).

What we can see is that the performance is degraded in both (we are not reaching our expected throughput of ~200 PI/s).

With the old setup (no resource reduction) we reach ~184 PI/s, which is better than with the resource reduction (~177 PI/s). But we need to be aware that in the other benchmark one node is the leader for all partitions, so it is not completely comparable. For better comparability we could run a benchmark with one partition. To be able to compare, we need to run the current setup, the old setup with develop, and an older release with both configurations.

No resource reduction

withoutresreduction

withoutresreduction-latency

Current Develop

devwithresreduction

devwithresreduction-latency

@Zelldon
Member Author

Zelldon commented Jan 6, 2022

New Benchmarks

New benchmarks configuration:

  • 3 workers with a max activation count of 120 each
  • 1 starter at 100 PI/s
  • 1 partition
  • Cluster size 3
  • Replication factor 3

With that we can make sure that only one node is the leader for the single partition. First we need to verify whether the benchmark with resource reduction (on dev) reaches 100 PI/s. If this is the case, we need to increase the load until we reach a limit (where it is not able to keep up with the load). Then we can start with the other benchmarks; otherwise, if all of them reach 100 PI/s, the tests would be useless. A rough client-side sketch of this setup follows below.
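Purely as an illustration of the setup, here is a minimal sketch assuming the Zeebe Java client. The gateway address, the process id "benchmark", and the job type "benchmark-task" are placeholders and not the actual benchmark project's values; partition count, cluster size, and replication factor are broker-side settings and not shown here.

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BenchmarkSetupSketch {
  public static void main(String[] args) {
    final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500") // placeholder gateway address
            .usePlaintext()
            .build();

    // 3 workers, each with a max activation count (maxJobsActive) of 120
    for (int i = 0; i < 3; i++) {
      client
          .newWorker()
          .jobType("benchmark-task") // placeholder job type
          .handler((jobClient, job) -> jobClient.newCompleteCommand(job.getKey()).send())
          .maxJobsActive(120)
          .open();
    }

    // 1 starter at 100 PI/s: create one process instance every 10 ms
    final ScheduledExecutorService starter = Executors.newSingleThreadScheduledExecutor();
    starter.scheduleAtFixedRate(
        () ->
            client
                .newCreateInstanceCommand()
                .bpmnProcessId("benchmark") // placeholder process id
                .latestVersion()
                .send(),
        0,
        10,
        TimeUnit.MILLISECONDS);
  }
}
```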

Update 1:

We can see that with one partition and resource reduction on the develop state, we are not able to reach 100 PI/s; we reach around ~71 PI/s. This means we can continue with the benchmarks.

first-bench-dev-p1

Next step: creating benchmarks with the develop state but without resource reduction. Furthermore, I will create two benchmarks with the latest 1.2.x release.

Update 2:

I ran all benchmarks for one hour. The results look like the following.

all

Benchmarks and measured throughput after one hour:

  • 1.3 release: benchmark with develop state (~1.3 release) with resource reduction; throughput 76.7 PI/s
  • 1.3 release without resource reduction: benchmark with develop state (~1.3 release) without resource reduction (revert merge #8268); throughput 74.9 PI/s
  • Release 1.2.9: benchmark with release 1.2.9 with resource reduction; throughput 89.3 PI/s
  • Release 1.2.9 without resource reduction: benchmark with release 1.2.9 without resource reduction; throughput 83.5 PI/s

Results (screenshots): dev, dev-without, rel129-without, rel129

Relative change calculation: actual change / reference = (x - reference) / reference

Relative change                         1.3 release   1.3 without   1.2.9         1.2.9 without
Reference 1.3 release (76.7)            0%            -2.34%        16.42%        8.8657%
Reference 1.3 release without (74.9)    2.4%          0%            19.22%        11.48%
Reference 1.2.9 (89.3)                  -14.10%       -16.12%       0%            -6.49%
Reference 1.2.9 without (83.5)          -8.14%        -10.299%      6.94%         0%
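For reference, the relative-change numbers above can be re-derived with a few lines of code. This is an illustration only; the class and method names are made up, and the throughput values are the measured ones from the table.

```java
public final class RelativeChange {
  // relative change = (x - reference) / reference, expressed in percent
  static double relativeChangePercent(double x, double reference) {
    return (x - reference) / reference * 100.0;
  }

  public static void main(String[] args) {
    final double dev13 = 76.7;  // 1.3 release (with resource reduction)
    final double rel129 = 89.3; // release 1.2.9 (with resource reduction)

    // 1.2.9 relative to 1.3: (89.3 - 76.7) / 76.7, roughly +16.4%
    System.out.printf("1.2.9 vs 1.3: %+.2f%%%n", relativeChangePercent(rel129, dev13));
    // 1.3 relative to 1.2.9: (76.7 - 89.3) / 89.3, roughly -14.1%
    System.out.printf("1.3 vs 1.2.9: %+.2f%%%n", relativeChangePercent(dev13, rel129));
  }
}
```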

Interpretation:

  • We can see that all benchmarks with reduced resources are operating slightly better.
  • The worst is the current develop without resource reduction (currently completely unclear to me)
  • The best is 1.2.9 with resource reduction (89.3 PI/s).
  • We can clearly see that there is a performance regression between 1.3 and 1.2.9.
    • If we compare 1.3 with resource reduction against 1.2.9 with resource reduction we see a difference of 14-16%.
    • If we compare 1.3 without resource reduction against 1.2.9 without resource reduction we see a difference of 11%.

Due to the above insights, we need to investigate further what has introduced such a degradation of performance. \cc @npepinpe


Next Steps: I will run experiments with different versions (alphas) with the development benchmark state, which means reduced resources. We can do this since we have seen that the resource reduction is not the cause of the performance degradation (at least for one partition), and it makes comparing easier (using the same configs).

@npepinpe
Member

npepinpe commented Jan 6, 2022

I'm convinced; we don't need to prove further that there is a performance degradation. What are the next steps, then, to identify what happened? Do we have any leads? Is the performance constantly degraded, or does it degrade over time?

@Zelldon
Member Author

Zelldon commented Jan 6, 2022

@npepinpe it seems to be constant. I will try the alphas first and then try to go over to the medic week benchmarks.

@Zelldon
Member Author

Zelldon commented Jan 6, 2022

Currently it seems to have been introduced between alpha1 and alpha2. I will write an update tomorrow morning; I think I have to run it a bit longer.

Alpha1

alpha1

Alpha2

alpha2

@Zelldon
Member Author

Zelldon commented Jan 7, 2022

I have run the benchmarks overnight and have some more insights.

In general it seems that the degradation of performance is caused by higher commit latency. From time to time the commit latency goes up to 100-250 ms; when this happens, the throughput goes down. This can also happen on 1.2.9, it seems, but in 1.2.9 it is not as likely (?).

Benchmarks last 24h

1.2.9

The throughput dropped in 1.2.9 for a short period of time to the same level we see in the other benchmarks, which is an indicator for me that we already had the issue before @npepinpe
129-general-24h

129-latency-24h

We can see in the metrics that 1.2.9 has a commit latency of 25-50 ms most of the time (and partly 50-70 ms). If we reach 100-250 ms, then the throughput drops to the same level as in the other benchmarks.

129-latency-2-24h

Alpha 1

Alpha1 seems to show a similar pattern. If we reach the high-latency bucket of 100-250 ms, the throughput drops to ~74 PI/s. If we are in the 25-50 ms bucket, the throughput seems to be ~90 PI/s, and in the 50-100 ms bucket (which is most of the time) we reach ~80 PI/s.

alpha1-general-24h
alpha1-latency-24h

Alpha 2

This benchmark is interesting because here we see a really high commit latency and low throughput most of the time, and it does not seem to get better. I think we currently have some issues with the resolution of the latency metrics: the 100-250 ms bucket is quite large. I think it makes sense to adjust the defined buckets to get a better and more fine-granular view.

alpha2-general-24h
alpha2-latency-24h

Alpha 2 - 2

I started a second alpha2 benchmark to make sure that this happens again. This and the next benchmark show a similar pattern (unstable latency).

alpha2-2-general-24h
alpha2-2-latency-24h

Develop (1.3)

dev-general-24h
dev-latency-24h

Conclusion

I think it makes sense to look into the commit latency: why it sometimes fluctuates so much, what can affect it, etc. Since this was already the case before, I would propose closing this issue and creating a new issue that concentrates on the commit latency. Wdyt? @npepinpe

Furthermore, I would like to improve our current metrics to make them more fine-granular (reduce the bucket ranges); a rough sketch of what that could look like is below.
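For illustration only, a minimal sketch assuming the Prometheus Java client; the metric name, label, and bucket boundaries are made up for this example and do not reflect Zeebe's actual metric definitions.

```java
import io.prometheus.client.Histogram;

public final class CommitLatencySketch {
  // Hypothetical histogram with extra boundaries between 100 ms and 250 ms,
  // instead of one coarse 100-250 ms bucket.
  static final Histogram COMMIT_LATENCY =
      Histogram.build()
          .namespace("zeebe")
          .name("example_commit_latency_seconds")
          .help("Example commit latency histogram with fine-granular buckets")
          .labelNames("partition")
          .buckets(0.01, 0.025, 0.05, 0.075, 0.1, 0.125, 0.15, 0.175, 0.2, 0.25, 0.5, 1)
          .register();

  static void observeCommit(int partitionId, long latencyMs) {
    // Prometheus histograms conventionally record seconds.
    COMMIT_LATENCY.labels(String.valueOf(partitionId)).observe(latencyMs / 1000.0);
  }
}
```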

@npepinpe
Member

npepinpe commented Jan 7, 2022

Create a new issue and leave this issue open. We can put this one back into planned for now and work on the new one. But until we've fixed the performance issue, I would leave it open, as right now the commit latency is just a theory.

@npepinpe
Member

Moving to the backlog until Q2. We'll accept a 15% throughput loss for now, as we could not pinpoint a single root cause, implying it may simply be an accumulation of things. Reverting each change is not exactly worth it; instead we'll commit to improving performance in Q2, as well as commit to no further performance loss for 1.4. That means throughput losses between 1.3 and 1.4 would be release blockers, and medics should pay close attention to the weekly benchmarks until then so we catch these as fast as possible.

@npepinpe npepinpe removed this from Planned in Zeebe Jan 19, 2022
@Zelldon
Member Author

Zelldon commented Jan 22, 2022

I did some benchmarks with 1.2.9 and develop, with one partition and only the simple starter (which starts a process with just a start and an end event). In both benchmarks we can see that the throughput is almost the same. It seems that the regression might be more related to tasks or jobs, or at least it is more visible with them?

We see the same pattern in both: the throughput goes up and down, the commit latency is quite high (with really high spikes), and the backlog is also really high most of the time.
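For context, the "simple starter" scenario essentially deploys and starts a process with no jobs at all. A minimal, hypothetical sketch using the Zeebe Java client and BPMN model API (gateway address and process id "simple-process" are placeholders) could look like this:

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.model.bpmn.Bpmn;
import io.camunda.zeebe.model.bpmn.BpmnModelInstance;

public final class SimpleProcessSketch {
  public static void main(String[] args) {
    // A process consisting only of a start and an end event, i.e. no tasks/jobs.
    final BpmnModelInstance simpleProcess =
        Bpmn.createExecutableProcess("simple-process").startEvent().endEvent().done();

    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder().gatewayAddress("localhost:26500").usePlaintext().build()) {
      client.newDeployCommand().addProcessModel(simpleProcess, "simple.bpmn").send().join();
      client
          .newCreateInstanceCommand()
          .bpmnProcessId("simple-process")
          .latestVersion()
          .send()
          .join();
    }
  }
}
```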

1.2.9

simple129

Develop

simple

@Zelldon
Member Author

Zelldon commented Jun 2, 2022

Long-running benchmarks are back to ~200 PI/s.

longrun

Medic benchmarks are around ~180 PI/s.

medicrun

@Zelldon Zelldon closed this as completed Jun 2, 2022