Weird throughput pattern #8244

Closed
Zelldon opened this issue Nov 19, 2021 · 7 comments
Labels
area/performance: Marks an issue as performance related
blocker/info: Marks an issue as blocked, awaiting more information from the author
kind/bug: Categorizes an issue or PR as a bug
severity/low: Marks a bug as having little to no noticeable impact for the user

Comments

@Zelldon (Member) commented Nov 19, 2021

Describe the bug

Due to recent discussions I ran a benchmark with different process models; in fact, I just started a benchmark with make all. This runs a process model with only a start and end event, a process model with an intermediate timer catch event (PT1S), and a process model with a single task.

Models

one_task
simpleProcess
timerProcess

Observation

We can observe a weird throughput pattern, which recurs every ~20 minutes.

[Screenshots: general, proc-queue]

If we take a look at what happens during the drops/spikes, we can see that a lot of jobs are completed at once.
[Screenshots: throughput-rate, grpc]

Interestingly, the state seems to grow and then shrink again after all jobs are completed.

[Screenshot: snapshot]

If we take a look at the created vs. completed instances, we can see that we accumulate instances/jobs, which are then at some point released/completed again.

[Screenshot: instances]

This sounds like it is related to issues such as #7955 and #8132.

To Reproduce

Just run a benchmark with make all, but make sure to configure the starters to a rate of 100.
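
For illustration, here is a minimal sketch of what such a starter does, assuming the Zeebe Java client; the gateway address and the process id one_task are placeholders, and the real benchmark starters are configured through the benchmark project, so this is only an approximation of the setup:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class Starter {
  public static void main(String[] args) {
    // Assumption: gateway address and process id are placeholders for the benchmark setup.
    final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500")
            .usePlaintext()
            .build();

    final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    // Create one instance every 10 ms, i.e. roughly 100 instances per second per starter.
    executor.scheduleAtFixedRate(
        () ->
            client
                .newCreateInstanceCommand()
                .bpmnProcessId("one_task")
                .latestVersion()
                .send(),
        0,
        10,
        TimeUnit.MILLISECONDS);
  }
}
```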

Expected behavior

We expect to complete 300 process instances per second and the throughput graph to be stable.

Log/Stacktrace

Nothing

Environment:

  • OS:
  • Zeebe Version: develop
  • Configuration: benchmark
@Zelldon added the kind/bug, area/performance, and severity/low labels on Nov 19, 2021
@Zelldon (Member Author) commented Nov 22, 2021

After scaling the workers down, it looks much better.

[Screenshot: general]

It seems that a large number of workers can disrupt the throughput. The question for me is whether we want to investigate this further or just document it better for users who will stumble over it.
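
As a rough sketch of the knobs involved (assuming the Zeebe Java client; the job type and the values are placeholders, not the benchmark's actual configuration): the number of open workers and the maxJobsActive per worker together determine how many jobs can be activated, and later completed, at once.

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.worker.JobWorker;
import java.time.Duration;

public class Worker {
  public static void main(String[] args) {
    final ZeebeClient client =
        ZeebeClient.newClientBuilder().gatewayAddress("localhost:26500").usePlaintext().build();

    // Each open worker polls for jobs independently; with many replicas the activations
    // (and the later completions) pile up into the bursts visible in the graphs above.
    // Lowering the replica count and/or maxJobsActive smooths the activation pattern.
    final JobWorker worker =
        client
            .newWorker()
            .jobType("benchmark-task") // assumption: placeholder job type
            .handler((jobClient, job) -> jobClient.newCompleteCommand(job.getKey()).send().join())
            .maxJobsActive(32)
            .timeout(Duration.ofSeconds(10))
            .pollInterval(Duration.ofMillis(100))
            .open();
  }
}
```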

@menski added the blocker/info label on Nov 22, 2021
@menski (Contributor) commented Nov 22, 2021

I think we need more investigation to better understand the client behavior in this case, i.e. what the client metrics show, why there are bursts of activation and completion, and whether the client's back-off has an impact.
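
To illustrate the back-off concern (this is not the Zeebe client's actual implementation, just a sketch of the suspected mechanism): if many workers back off after empty polls and then retry on roughly the same schedule, their activations and completions arrive in bursts rather than as a steady stream.

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Illustrative only: exponential back-off between ActivateJobs polls. Without jitter,
 * a fleet of workers that went idle together wakes up in lock-step and polls at once,
 * producing bursts of activation and completion.
 */
public class PollBackoff {
  private final long minDelayMs = 50;
  private final long maxDelayMs = 5_000;
  private long currentDelayMs = minDelayMs;

  /** Called after an empty or failed poll. */
  long nextDelay() {
    // Exponential growth capped at maxDelayMs, with a little jitter to de-synchronize workers.
    currentDelayMs = Math.min(maxDelayMs, currentDelayMs * 2);
    final long jitter = ThreadLocalRandom.current().nextLong(currentDelayMs / 10 + 1);
    return currentDelayMs - jitter;
  }

  /** Called after a successful poll. */
  void reset() {
    currentDelayMs = minDelayMs;
  }
}
```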

@menski (Contributor) commented Nov 22, 2021

Maybe a potential topic for chaos days, i.e. the number of workers should not impact the throughput of the cluster. /cc @Zelldon

@deepthidevaki (Contributor) commented

I see similar behaviours in our weekly benchmark.

[Screenshot]

It is not consistent. Sometimes it occurs frequently, other times the throughput is flat.

@pihme (Contributor) commented May 31, 2022

Is the timer set to 20 minutes by any chance? Then it might be related to the blocking loop over all timers.
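
As an illustration of that concern (a simplified sketch, not the broker's actual due-date checker): if the checker walks over every stored timer in one pass on the processing thread, a large timer backlog blocks processing until the loop finishes, which could show up as the periodic dips followed by completion bursts seen above.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Illustrative only: a naive due-date check that triggers all due timers in one blocking pass. */
public class DueDateChecker {
  // due date (epoch millis) -> timer key; a simplification of the broker's timer state
  private final NavigableMap<Long, Long> timersByDueDate = new TreeMap<>();

  void trigger(long now) {
    // Blocking loop: triggers every due timer before returning control to the processor.
    for (final var entry : timersByDueDate.headMap(now, true).entrySet()) {
      triggerTimer(entry.getValue());
    }
    timersByDueDate.headMap(now, true).clear();
  }

  private void triggerTimer(long timerKey) {
    // write the timer-triggered command for this key (omitted)
  }
}
```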

@Zelldon (Member Author) commented Jun 10, 2022

Let's try to reproduce this again. If it still happens, let's try with our new shiny due-date feature flag and document the result here. If it doesn't fail, let's just close it.

@Zelldon (Member Author) commented Jun 21, 2022

I'm not able to reproduce this. I ran make all, which deploys the timer, simple, and normal starters.

[Screenshot: load]

Throughput looks stable. I will close this.

@Zelldon closed this as completed on Jun 21, 2022