Performance regression in 1.3.0-alpha2 #8425
Comments
@Zelldon are you already looking into it? |
Nope @npepinpe |
Pattern reminds me of this: #8244 |
We've not seen this pattern again on other benchmarks. We should still investigate what is going wrong with release-1-3-alpha2 (non-preemptible nodes on a long-running cluster). |
@npepinpe it seems the alpha2 benchmark has recovered; the release benchmark on stable nodes also looks OK-ish.
Alpha2
Still we have this weird processing pattern.
Release 1.3
Preemptable Nodes
On preemptable nodes it looks different, and it seems that we are degrading performance. But this might also be related to our benchmark resources. I will start one benchmark with higher resources.
Release
Medic CW 51
Medic CW 50
As a comparison, to show how latency and throughput looked before.
Medic CW 01
What we can see is that the latency increased by at least a factor of two. The throughput graph also looks as if it is slowly decreasing. \cc @korthout |
I ran two new benchmarks: one with the resource reduction (our current develop) and one without (reverting the merge of #8268). What we can see is that in both the performance is degraded (we are not reaching our expected throughput of ~200 PI/s). With the old setup (no resource reduction) we reach ~184 PI/s, which is better than with the resource reduction (~177 PI/s). But we need to be aware that on the other benchmark one node is leader for all partitions, so it is not completely comparable. For better comparability we could run a benchmark with one partition. To be able to compare we need to run the current setup, the old setup with develop, and an older release with both configurations.
No resource reduction
Current Develop |
New Benchmarks
New benchmarks configuration:
With that we can make sure that only one node is leader for one partition. First we need to verify whether the benchmark with resource reduction (on dev) reaches 100 PI/s. If this is the case, we need to increase the load until we reach a limit (where it is not able to keep up with the load); a sketch of such a rate-limited starter is shown after this comment. Then we can start with the other benchmarks; otherwise, if all of them reach 100 PI/s, the tests would be useless.
Update 1: We can see that with one partition and resource reduction on the develop state we are not able to reach 100 PI/s. We reach around ~71 PI/s. This means we can continue with the benchmarks. Next step: creating benchmarks without resource reduction on the develop state. Furthermore, I will create two benchmarks with the latest 1.2.x release.
Update 2: I ran all benchmarks over one hour. The results look like the following.
Interpretation:
Due to the above insights we need to investigate further what has introduced such a performance degradation. \cc @npepinpe
Next Steps: I will run experiments with different versions (alphas) using the development benchmark state, which means reduced resources. We can do this since we have seen that the resource reduction is not the cause of the performance degradation (at least for one partition), and it makes comparing easier (using the same configs). |
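To make the load generation concrete, here is a minimal sketch of such a rate-limited starter using the Zeebe Java client. The gateway address, process id, target rate, and run duration are illustrative assumptions, not the benchmark project's actual code; a process with BPMN id `benchmark` is assumed to be deployed already.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import io.camunda.zeebe.client.ZeebeClient;

public final class RateLimitedStarter {

  public static void main(final String[] args) throws InterruptedException {
    // target load of the benchmark, e.g. 100 process instances per second (assumed)
    final int targetRatePerSecond = 100;
    final long periodMicros = 1_000_000L / targetRatePerSecond;

    final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500") // assumed local gateway
            .usePlaintext()
            .build();

    final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // fire one create-instance command per period; send() is asynchronous, so a broker
    // that cannot keep up shows up as a growing backlog rather than a lower send rate
    scheduler.scheduleAtFixedRate(
        () ->
            client
                .newCreateInstanceCommand()
                .bpmnProcessId("benchmark") // assumed process id
                .latestVersion()
                .send(),
        0,
        periodMicros,
        TimeUnit.MICROSECONDS);

    try {
      // run for the duration of the experiment (e.g. one hour), then shut down
      Thread.sleep(Duration.ofHours(1).toMillis());
    } finally {
      scheduler.shutdownNow();
      client.close();
    }
  }
}
```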
I'm convinced; we don't need further proof that there is a performance degradation. What are the next steps, then, to identify what happened? Do we have any leads? Is the performance constantly degraded, or does it degrade over time? |
@npepinpe it seems to be constant. I will try the alphas first and then move on to the medic-week benchmarks. |
I have run the benchmarks over night and have some more insights. In general it seems that the performance degradation is caused by higher commit latency. It seems that from time to time the commit latency goes up to 100-250 ms, and when this happens the throughput goes down. This can apparently also happen on 1.2.9, but in 1.2.9 it seems less likely (?).
Benchmarks last 24h
1.2.9
The throughput dropped in 1.2.9 for a short period of time to the same level we see in the other benchmarks, which seems to be an indicator for me that we had the issue already before. @npepinpe We can see in the metrics that 1.2.9 has most of the time a commit latency of 25-50 ms (and partly 50-70 ms). If we reach 100-250 ms, then the throughput drops to the same level as on the other benchmarks.
Alpha 1
Alpha1 seems to show a similar pattern. If we reach the high latency bucket (100-250 ms), the throughput drops to ~74 PI/s. If we are in the 25-50 ms bucket the throughput seems to be ~90 PI/s, and in 50-100 ms (which is most of the time) we reach ~80 PI/s.
Alpha 2
This benchmark is interesting because we see a really high commit latency most of the time together with the low throughput, and here it does not seem to get better. I think we currently have some issues with the resolution of the latency metrics: the 100-250 ms bucket is quite large. I think it makes sense to adjust the defined buckets to get a better, more fine-granular view.
Alpha 2 - 2
I started a second alpha2 benchmark to make sure that this happens again. This and the next benchmark show a similar pattern (unstable latency).
Develop (1.3)
Conclusion
I think it makes sense to look into the commit latency: why it sometimes fluctuates so much, what can affect it, etc. Since this was already the case before, I would propose to close this issue and create a new issue which concentrates on the commit latency. Wdyt? @npepinpe Furthermore, I would like to improve our current metrics to make them more fine-granular (reduce the bucket range). |
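To illustrate the bucket-resolution point, here is a minimal sketch of a commit-latency histogram with finer buckets using the Prometheus simpleclient API. The metric name and bucket boundaries are made-up examples, not Zeebe's actual metric definition.

```java
import io.prometheus.client.Histogram;

public final class CommitLatencyMetrics {

  // Finer-grained buckets (in seconds) so the 100-250 ms range is split into several
  // buckets instead of being one coarse bucket.
  private static final Histogram COMMIT_LATENCY =
      Histogram.build()
          .namespace("zeebe")
          .name("example_commit_latency_seconds") // hypothetical metric name
          .help("Latency from appending an entry until it is committed")
          .buckets(0.010, 0.025, 0.050, 0.075, 0.100, 0.125, 0.150, 0.200, 0.250, 0.500, 1.0)
          .register();

  public static void observeCommit(final long appendTimestampMillis) {
    final double latencySeconds = (System.currentTimeMillis() - appendTimestampMillis) / 1000.0;
    COMMIT_LATENCY.observe(latencySeconds);
  }
}
```

Narrower buckets between 100 ms and 250 ms would make it visible whether the latency sits at the lower or the upper end of that range when the throughput drops.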
Create a new issue and leave this issue open. We can put this one back into planned for now and work on the new one. But until we've fixed the performance issue, I would leave it open, as right now the commit latency is just a theory. |
Moving to the backlog until Q2. We'll accept a 15% throughput loss for now, as we could not pinpoint a single root cause, implying it may simply be an accumulation of things. Reverting each is not exactly worth it; instead we'll commit to improving performance in Q2, as well as commit to no further performance loss for 1.4. That means throughput losses between 1.3 and 1.4 would be release blockers, and medics should pay close attention to the weekly benchmarks until then so we catch these as fast as possible. |
I did some benchmarks with 1.2.9 and develop, with one partition and only the simple starter (which starts a process with a start and an end event). In both benchmarks we can see that the throughput is almost the same. It seems that the issue might be more related to tasks or jobs, or at least more visible with them? We see the same pattern in both: the throughput going up and down, and quite high commit latency (with really high spikes). The backlog is also really high most of the time.
1.2.9
Develop |
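For reference, a minimal sketch of what such a simple starter could look like with the Zeebe Java client and BPMN model API: a process with only a start and an end event, deployed once and then started in a loop. The process id, resource name, and client setup are assumptions for illustration, not the benchmark project's actual code.

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.model.bpmn.Bpmn;
import io.camunda.zeebe.model.bpmn.BpmnModelInstance;

public final class SimpleStarter {

  public static void main(final String[] args) {
    // a process that completes immediately: start event -> end event, no tasks or jobs
    final BpmnModelInstance simpleProcess =
        Bpmn.createExecutableProcess("simple-process") // assumed process id
            .startEvent()
            .endEvent()
            .done();

    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500") // assumed local gateway
            .usePlaintext()
            .build()) {

      client.newDeployCommand().addProcessModel(simpleProcess, "simple-process.bpmn").send().join();

      // start instances back to back; since the process has no jobs, the broker only
      // has to process the create, activate, and complete events for each instance
      while (!Thread.currentThread().isInterrupted()) {
        client
            .newCreateInstanceCommand()
            .bpmnProcessId("simple-process")
            .latestVersion()
            .send()
            .join();
      }
    }
  }
}
```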
Describe the bug
It looks like we have a performance regression in our latest alpha release; at least, this can be seen in the long-running benchmarks.
Alpha 2
We can see that the throughput is much lower than on alpha1, and there is high fluctuation in the current events and handled requests.
Latency
The latency seems to be much higher and unstable as well.
Leader
I can't see any related leader changes or similar things
Previous alpha version:
Latency
To Reproduce
Run a long-running benchmark?
Expected behavior
Similar throughput and latency as in the previous version, e.g. ~200 PI/s.
Environment: