
Performance regression in 1.3.0-alpha2 #8425

Closed
Zelldon opened this issue Dec 17, 2021 · 17 comments
Labels
area/performance Marks an issue as performance related kind/bug Categorizes an issue or PR as a bug severity/mid Marks a bug as having a noticeable impact but with a known workaround

Comments

@Zelldon
Member

Zelldon commented Dec 17, 2021

Describe the bug

It looks like we have a performance regression in our latest alpha release; at least, it can be seen in the long-running benchmarks.

Alpha 2

perf

We can see that the throughput is much lower than on alpha1, and the current events and handled requests fluctuate heavily.

Latency

The latency seems to be much higher and unstable as well.

alpha2-latency
alpha2-latency2

Leader

I can't see any related leader changes or similar things

alpha2-leader

Previous alpha version:

perfalpa1

Latency

alpha1-latency
alpha1-latency2

To Reproduce

Run a long running benchmark?

Expected behavior

Similar throughput and latency as in the previous version, e.g. 200 PI/s.

Environment:

  • OS:
  • Zeebe Version: 1.3.0-alpha2
  • Configuration:
@Zelldon Zelldon added kind/bug Categorizes an issue or PR as a bug area/performance Marks an issue as performance related severity/mid Marks a bug as having a noticeable impact but with a known workaround labels Dec 17, 2021
@npepinpe npepinpe added this to Ready in Zeebe Dec 20, 2021
@npepinpe
Member

@Zelldon are you already looking into it?

@Zelldon
Member Author

Zelldon commented Dec 20, 2021

Nope @npepinpe

@Zelldon
Member Author

Zelldon commented Dec 21, 2021

Still the case:

general

@lenaschoenburg
Member

Pattern reminds me of this: #8244

@Zelldon
Member Author

Zelldon commented Dec 21, 2021

True @oleschoenburg, the only difference is that we are only running normal starters and workers.

When we zoom in, it looks really similar:
zoom

@korthout
Member

We've not seen this pattern again on benchmarks for release-1.3.0 like release-1-3-0 (preemptible nodes on zeebe cluster) and zell-test-release-1-3 (non-preemptible nodes on long-running cluster). We also did not encounter it on long-running-v1-for-minor-updates (which is currently running on 1.2.7).

We should still investigate what is going wrong with release-1-3-alpha2 (non-preemptible nodes on long-running cluster).

@Zelldon
Member Author

Zelldon commented Jan 5, 2022

@npepinpe it seems the alpha2 benchmark has recovered; the release benchmark on stable nodes also looks okayish.

Alpha2

alpha2

Still we have this weird processing pattern.

Release 1.3

release

Preemptible Nodes

On preemptible nodes it looks different, and it seems that performance is degrading. But this might also be related to our benchmark resources. I will start one benchmark with higher resources.

Release

preempt-release

Medic CW 51

medic51

Medic CW 50

As a comparison to show how latency and throughput looked before.

medic50

medic50-latency

Medic CW 01

What we can see is that the latency increased by at least a factor of two. The throughput graph also looks as if it is slowly decreasing.

medic01
medic01-latency

\cc @korthout

@Zelldon Zelldon self-assigned this Jan 5, 2022
@Zelldon Zelldon moved this from Ready to In progress in Zeebe Jan 5, 2022
@Zelldon
Member Author

Zelldon commented Jan 6, 2022

I ran two new benchmarks: one with the resource reduction (our current develop) and one without (reverting the merge of #8268).

What we can see is that the performance is degraded in both (we are not reaching our expected throughput of ~200 PI/s).

With the old setup (no resource reduction) we reach ~184 PI/s, which is better than with the resource reduction (~177 PI/s). But we need to be aware that in the other benchmark one node is the leader for all partitions, so it is not completely comparable. For better comparability we could run a benchmark with one partition. To be able to compare, we need to run the current setup, the old setup with develop, and an older release with both configurations.

No resource reduction

withoutresreduction

withoutresreduction-latency

Current Develop

devwithresreduction

devwithresreduction-latency

@Zelldon
Member Author

Zelldon commented Jan 6, 2022

New Benchmarks

New benchmarks configuration:

  • 3 workers with a max activation count of 120 each
  • 1 starter at 100 PI/s
  • 1 partition
  • Cluster size 3
  • Replication factor 3

With that we can make sure that only one node is the leader for the single partition. First we need to verify whether the benchmark with resource reduction (on dev) reaches 100 PI/s. If this is the case, we need to increase the load until we reach a limit (where it is not able to keep up with the load). Then we can start with the other benchmarks; otherwise, if all of them reach 100 PI/s, the tests would be useless. A rough client-side sketch of this setup follows below.
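Purely as an illustration of the setup, here is a minimal sketch assuming the Zeebe Java client. The gateway address, the process id "benchmark", and the job type "benchmark-task" are placeholders and not the actual benchmark project's values; partition count, cluster size, and replication factor are broker-side settings and not shown here.

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BenchmarkSetupSketch {
  public static void main(String[] args) {
    final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500") // placeholder gateway address
            .usePlaintext()
            .build();

    // 3 workers, each with a max activation count (maxJobsActive) of 120
    for (int i = 0; i < 3; i++) {
      client
          .newWorker()
          .jobType("benchmark-task") // placeholder job type
          .handler((jobClient, job) -> jobClient.newCompleteCommand(job.getKey()).send())
          .maxJobsActive(120)
          .open();
    }

    // 1 starter at 100 PI/s: create one process instance every 10 ms
    final ScheduledExecutorService starter = Executors.newSingleThreadScheduledExecutor();
    starter.scheduleAtFixedRate(
        () ->
            client
                .newCreateInstanceCommand()
                .bpmnProcessId("benchmark") // placeholder process id
                .latestVersion()
                .send(),
        0,
        10,
        TimeUnit.MILLISECONDS);
  }
}
```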

Update 1:

We can see that with one partition and resource reduction on the develop state, we are not able to reach 100 PI/s; we reach around ~71 PI/s. This means we can continue with the benchmarks.

first-bench-dev-p1

Next step: creating benchmarks with the develop state but without resource reduction. Furthermore, I will create two benchmarks with the latest 1.2.x release.

Update 2:

I ran all benchmarks for one hour. The results look like the following.

all

Benchmarks and measured throughput after one hour:

  • 1.3 release: benchmark with develop state (~1.3 release) with resource reduction; throughput 76.7 PI/s
  • 1.3 release without resource reduction: benchmark with develop state (~1.3 release) without resource reduction (revert merge #8268); throughput 74.9 PI/s
  • Release 1.2.9: benchmark with release 1.2.9 with resource reduction; throughput 89.3 PI/s
  • Release 1.2.9 without resource reduction: benchmark with release 1.2.9 without resource reduction; throughput 83.5 PI/s

Results (screenshots): dev, dev-without, rel129-without, rel129

Relative change calculation: actual change / reference = (x - reference) / reference

Relative change                         1.3 release   1.3 without   1.2.9         1.2.9 without
Reference 1.3 release (76.7)            0%            -2.34%        16.42%        8.8657%
Reference 1.3 release without (74.9)    2.4%          0%            19.22%        11.48%
Reference 1.2.9 (89.3)                  -14.10%       -16.12%       0%            -6.49%
Reference 1.2.9 without (83.5)          -8.14%        -10.299%      6.94%         0%
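For reference, the relative-change numbers above can be re-derived with a few lines of code. This is an illustration only; the class and method names are made up, and the throughput values are the measured ones from the table.

```java
public final class RelativeChange {
  // relative change = (x - reference) / reference, expressed in percent
  static double relativeChangePercent(double x, double reference) {
    return (x - reference) / reference * 100.0;
  }

  public static void main(String[] args) {
    final double dev13 = 76.7;  // 1.3 release (with resource reduction)
    final double rel129 = 89.3; // release 1.2.9 (with resource reduction)

    // 1.2.9 relative to 1.3: (89.3 - 76.7) / 76.7, roughly +16.4%
    System.out.printf("1.2.9 vs 1.3: %+.2f%%%n", relativeChangePercent(rel129, dev13));
    // 1.3 relative to 1.2.9: (76.7 - 89.3) / 89.3, roughly -14.1%
    System.out.printf("1.3 vs 1.2.9: %+.2f%%%n", relativeChangePercent(dev13, rel129));
  }
}
```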

Interpretation:

  • We can see that all benchmarks with reduced resources are operating slightly better.
  • The worst is the current develop without resource reduction (currently completely unclear to me)
  • The best is 1.2.9 with resource reduction (89.3 PI/s).
  • We can clearly see that there is a performance regression between 1.3 and 1.2.9.
    • If we compare 1.3 with resource reduction against 1.2.9 with resource reduction we see a difference of 14-16%.
    • If we compare 1.3 without resource reduction against 1.2.9 without resource reduction we see a difference of 11%.

Due to the above insights, we need to investigate further what has introduced such a degradation of performance. \cc @npepinpe


Next Steps: I will run experiments with different versions (alphas) with the development benchmark state, which means reduced resources. We can do this since we have seen that the resource reduction is not the cause of the performance degradation (at least for one partition), and it makes comparing easier (using the same configs).

@npepinpe
Member

npepinpe commented Jan 6, 2022

I'm convinced; we don't need to prove further that there is a performance degradation. What are the next steps, then, to identify what happened? Do we have any leads? Is the performance constantly degraded, or does it degrade over time?

@Zelldon
Member Author

Zelldon commented Jan 6, 2022

@npepinpe it seems to be constant. I will try the alphas first and then try to go over to the medic week benchmarks.

@Zelldon
Member Author

Zelldon commented Jan 6, 2022

Currently it seems to have been introduced between alpha1 and alpha2. I will write an update tomorrow morning; I think I have to run it a bit longer.

Alpha1

alpha1

Alpha2

alpha2

@Zelldon
Member Author

Zelldon commented Jan 7, 2022

I have run the benchmarks overnight and have some more insights.

In general it seems that the degradation of performance is caused by higher commit latency. From time to time the commit latency goes up to 100-250 ms; when this happens, the throughput goes down. This can also happen on 1.2.9, it seems, but in 1.2.9 it is not as likely (?).

Benchmarks last 24h

1.2.9

The throughput dropped in 1.2.9 for a short period of time to the same level we see in the other benchmarks, which is an indicator for me that we already had the issue before @npepinpe
129-general-24h

129-latency-24h

We can see in the metrics that 1.2.9 has a commit latency of 25-50 ms most of the time (and partly 50-70 ms). If we reach 100-250 ms, then the throughput drops to the same level as in the other benchmarks.

129-latency-2-24h

Alpha 1

Alpha1 seems to show a similar pattern. If we reach the high-latency bucket of 100-250 ms, the throughput drops to ~74 PI/s. If we are in the 25-50 ms bucket, the throughput seems to be ~90 PI/s, and in the 50-100 ms bucket (which is most of the time) we reach ~80 PI/s.

alpha1-general-24h
alpha1-latency-24h

Alpha 2

This benchmark is interesting because here we see a really high commit latency and low throughput most of the time, and it does not seem to get better. I think we currently have some issues with the resolution of the latency metrics: the 100-250 ms bucket is quite large. I think it makes sense to adjust the defined buckets to get a better and more fine-granular view.

alpha2-general-24h
alpha2-latency-24h

Alpha 2 - 2

I started a second alpha2 benchmark to make sure that this happens again. This and the next benchmark show a similar pattern (unstable latency).

alpha2-2-general-24h
alpha2-2-latency-24h

Develop (1.3)

dev-general-24h
dev-latency-24h

Conclusion

I think it makes sense to look into the commit latency: why it sometimes fluctuates so much, what can affect it, etc. Since this was already the case before, I would propose closing this issue and creating a new issue that concentrates on the commit latency. Wdyt? @npepinpe

Furthermore, I would like to improve our current metrics to make them more fine-granular (reduce the bucket ranges); a rough sketch of what that could look like is below.
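For illustration only, a minimal sketch assuming the Prometheus Java client; the metric name, label, and bucket boundaries are made up for this example and do not reflect Zeebe's actual metric definitions.

```java
import io.prometheus.client.Histogram;

public final class CommitLatencySketch {
  // Hypothetical histogram with extra boundaries between 100 ms and 250 ms,
  // instead of one coarse 100-250 ms bucket.
  static final Histogram COMMIT_LATENCY =
      Histogram.build()
          .namespace("zeebe")
          .name("example_commit_latency_seconds")
          .help("Example commit latency histogram with fine-granular buckets")
          .labelNames("partition")
          .buckets(0.01, 0.025, 0.05, 0.075, 0.1, 0.125, 0.15, 0.175, 0.2, 0.25, 0.5, 1)
          .register();

  static void observeCommit(int partitionId, long latencyMs) {
    // Prometheus histograms conventionally record seconds.
    COMMIT_LATENCY.labels(String.valueOf(partitionId)).observe(latencyMs / 1000.0);
  }
}
```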

@npepinpe
Member

npepinpe commented Jan 7, 2022

Create a new issue and leave this issue open. We can put this one back into planned for now and work on the new one. But until we've fixed the performance issue, I would leave it open, as right now the commit latency is just a theory.

@npepinpe
Member

Moving to the backlog until Q2. We'll accept a 15% throughput loss for now, as we could not pinpoint a single root cause, implying it may simply be an accumulation of things. Reverting each change is not exactly worth it; instead we'll commit to improving performance in Q2, as well as commit to no further performance loss for 1.4. That means throughput losses between 1.3 and 1.4 would be release blockers, and medics should pay close attention to the weekly benchmarks until then so we catch these as fast as possible.

@npepinpe npepinpe removed this from Planned in Zeebe Jan 19, 2022
@Zelldon
Member Author

Zelldon commented Jan 22, 2022

I did some benchmarks with 1.2.9 and develop, with one partition and only the simple starter (which starts a process with just a start and an end event). In both benchmarks we can see that the throughput is almost the same. It seems that the regression might be more related to tasks or jobs, or at least it is more visible with them?

We see the same pattern in both: the throughput goes up and down, the commit latency is quite high (with really high spikes), and the backlog is also really high most of the time.
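For context, the "simple starter" scenario essentially deploys and starts a process with no jobs at all. A minimal, hypothetical sketch using the Zeebe Java client and BPMN model API (gateway address and process id "simple-process" are placeholders) could look like this:

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.model.bpmn.Bpmn;
import io.camunda.zeebe.model.bpmn.BpmnModelInstance;

public final class SimpleProcessSketch {
  public static void main(String[] args) {
    // A process consisting only of a start and an end event, i.e. no tasks/jobs.
    final BpmnModelInstance simpleProcess =
        Bpmn.createExecutableProcess("simple-process").startEvent().endEvent().done();

    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder().gatewayAddress("localhost:26500").usePlaintext().build()) {
      client.newDeployCommand().addProcessModel(simpleProcess, "simple.bpmn").send().join();
      client
          .newCreateInstanceCommand()
          .bpmnProcessId("simple-process")
          .latestVersion()
          .send()
          .join();
    }
  }
}
```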

1.2.9

simple129

Develop

simple

@Zelldon
Member Author

Zelldon commented Jun 2, 2022

Long-running benchmarks are back to ~200 PI/s.

longrun

Medic benchmarks are around ~180 PI/s.

medicrun

@Zelldon Zelldon closed this as completed Jun 2, 2022