Triggering due timer events causes periodic latency spikes #11594

Closed
romansmirnov opened this issue Feb 10, 2023 · 6 comments · Fixed by #12403
Labels
  • area/performance (Marks an issue as performance related)
  • component/engine
  • kind/bug (Categorizes an issue or PR as a bug)
  • pm-eye (The issue that PM should keep an eye on. Issue with this label usually doesn't have high priority.)
  • support (Marks an issue as related to a customer support request)
  • version:8.1.11 (Marks an issue as being completely or in parts released in 8.1.11)
  • version:8.2.3 (Marks an issue as being completely or in parts released in 8.2.3)
  • version:8.3.0-alpha1 (Marks an issue as being completely or in parts released in 8.3.0-alpha1)
  • version:8.3.0 (Marks an issue as being completely or in parts released in 8.3.0)

Comments

@romansmirnov
Member

romansmirnov commented Feb 10, 2023

Describe the bug

This is the same problem as #11591, but for timer events rather than messages. Zeebe regularly checks for due timer events so that it can trigger them and continue the process flow. The corresponding checker shares the actor of the Stream Processor, so the Stream Processor is blocked while the checker runs. In addition, the checker might submit a batch of commands to trigger timer events, which the Stream Processor then executes back to back with nothing else in between.
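To make the effect concrete, here is a minimal toy model of a single-threaded actor as a FIFO queue. It is not Zeebe code: `SharedActorModel`, `jobsAheadOfUserCommand`, and the command names are hypothetical and only illustrate why a client command that arrives right after the checker's batch has to wait for the whole batch.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model (assumption, not Zeebe code): an actor runs its jobs one at a time in FIFO order. */
final class SharedActorModel {

  /** Counts the jobs that run before a user command that arrives right after the checker's batch. */
  static int jobsAheadOfUserCommand(int dueTimers) {
    Queue<String> actorQueue = new ArrayDeque<>();
    // The checker runs on the shared actor and submits one trigger command per due timer.
    for (int i = 0; i < dueTimers; i++) {
      actorQueue.add("TRIGGER_TIMER-" + i);
    }
    // The client command can only be enqueued behind the complete batch.
    actorQueue.add("USER_COMMAND");

    int ahead = 0;
    while (!"USER_COMMAND".equals(actorQueue.poll())) {
      ahead++;
    }
    return ahead;
  }

  public static void main(String[] args) {
    System.out.println(jobsAheadOfUserCommand(1_000));
  }
}
```

In this model the wait grows linearly with the number of due timers, which matches the periodic spikes described above.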

Expected behavior

  • The Stream Processor and the checker do not share an actor so the Stream Processor continues processing while the checker collects timers to trigger.
  • (This might already be partially the case and needs to be checked.) The checker only submits batches with a limited number of commands. For example, when the checker runs, it collects the first 10 due timer events and submits them as a batch to the log stream. It then collects the next 10 due timer events, and so on until there are no more due timer events. That way, triggering the timer events would interleave with any incoming commands from users/clients.
  • Instead of writing 10 commands, it could write just one command containing the 10 due timer events to trigger. (This might not be easy with rolling upgrades, since an old version of Zeebe would not be able to process such a command.)
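The bounded batching described in the second bullet could look roughly like the sketch below. All names (`TimerBatching`, `toBatches`, `logOrder`) are hypothetical and not part of Zeebe. The interleaving rule, at most one pending user command appended after each batch, is a simplifying assumption chosen only to show the resulting log order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not Zeebe code: split due timers into bounded batches. */
final class TimerBatching {

  /** Splits the due timer keys into batches of at most maxBatchSize, preserving order. */
  static List<List<Long>> toBatches(List<Long> dueTimerKeys, int maxBatchSize) {
    if (maxBatchSize <= 0) {
      throw new IllegalArgumentException("maxBatchSize must be positive");
    }
    List<List<Long>> batches = new ArrayList<>();
    for (int from = 0; from < dueTimerKeys.size(); from += maxBatchSize) {
      int to = Math.min(from + maxBatchSize, dueTimerKeys.size());
      batches.add(new ArrayList<>(dueTimerKeys.subList(from, to)));
    }
    return batches;
  }

  /** Simulated log order, assuming one waiting user command gets written after each batch. */
  static List<String> logOrder(List<List<Long>> batches, List<String> userCommands) {
    List<String> log = new ArrayList<>();
    int nextUser = 0;
    for (List<Long> batch : batches) {
      for (long key : batch) {
        log.add("TRIGGER_TIMER-" + key);
      }
      if (nextUser < userCommands.size()) {
        log.add(userCommands.get(nextUser++));
      }
    }
    // Any user commands still pending after the last batch are appended at the end.
    while (nextUser < userCommands.size()) {
      log.add(userCommands.get(nextUser++));
    }
    return log;
  }

  public static void main(String[] args) {
    List<Long> due = new ArrayList<>();
    for (long key = 0; key < 25; key++) {
      due.add(key);
    }
    List<List<Long>> batches = toBatches(due, 10);
    List<String> log = logOrder(batches, List.of("USER_1", "USER_2"));
    System.out.println(batches.size() + " batches, USER_1 at log index " + log.indexOf("USER_1"));
  }
}
```

With 25 due timers and a batch size of 10, the first user command waits behind 10 trigger commands instead of 25. The single-command variant from the third bullet would correspond to writing each inner list as one record, which is where the rolling-upgrade concern comes in.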

Hints

  • A simple prototype to avoid sharing the actor can be found here (in case of expired messages): poc: async message time to live checker #11550
  • Things to consider: When the checker and the Stream Processor do not share an actor, they may run concurrently. That means the checker reads from RocksDB while the Stream Processor (mostly) writes to RocksDB. While RocksDB itself is thread-safe, the Zeebe layer may not be (like TransactionContext, ...).
  • Also, the state might change while the checker is reading it.
  • The same pattern should be applied to the job's timeline and backoff checker.
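One way to reason about the concurrency hints above is the sketch below. It is not Zeebe's implementation: `TimerState` and its methods are hypothetical, and a `ConcurrentHashMap` stands in for the RocksDB-backed state. The assumption it illustrates is that the checker may read a stale view on its own thread, so the processor validates each trigger command again when it processes it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the timer state; not Zeebe's actual state classes. */
final class TimerState {
  // timerKey -> dueDate; safe to iterate on one thread while another thread writes.
  private final Map<Long, Long> dueDateByTimerKey = new ConcurrentHashMap<>();

  /** Called by the processor when a timer is created. */
  void schedule(long timerKey, long dueDate) {
    dueDateByTimerKey.put(timerKey, dueDate);
  }

  /** Called by the processor when a timer is canceled. */
  void cancel(long timerKey) {
    dueDateByTimerKey.remove(timerKey);
  }

  /** Called by the checker thread; the result is a weakly consistent view that may be stale. */
  List<Long> findDueTimers(long now, int limit) {
    List<Long> due = new ArrayList<>();
    for (Map.Entry<Long, Long> entry : dueDateByTimerKey.entrySet()) {
      if (due.size() >= limit) {
        break;
      }
      if (entry.getValue() <= now) {
        due.add(entry.getKey());
      }
    }
    return due;
  }

  /** Called by the processor; re-checks the timer and returns false if the command must be rejected. */
  boolean trigger(long timerKey, long now) {
    Long dueDate = dueDateByTimerKey.get(timerKey);
    if (dueDate == null || dueDate > now) {
      return false; // timer was canceled, already triggered, or is not due
    }
    return dueDateByTimerKey.remove(timerKey, dueDate);
  }

  public static void main(String[] args) {
    TimerState state = new TimerState();
    state.schedule(1L, 100L);
    List<Long> due = state.findDueTimers(150L, 10); // the checker sees timer 1 as due
    state.cancel(1L);                               // the processor cancels it in the meantime
    System.out.println(due + " -> triggered: " + state.trigger(1L, 150L));
  }
}
```

The design choice here is to treat the checker's result as a hint and keep the processor as the only writer, so a stale read results in a rejected command rather than a timer that is triggered twice.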

Environment:

  • Zeebe Version: 8.1

@romansmirnov romansmirnov added kind/bug Categorizes an issue or PR as a bug support Marks an issue as related to a customer support request component/engine labels Feb 10, 2023
@Zelldon
Member

Zelldon commented Feb 10, 2023

Related to #8991

@megglos
Contributor

megglos commented Mar 3, 2023

related to #11591

@megglos
Contributor

megglos commented Mar 3, 2023

Sync with @abbasadel :

@megglos megglos added the planning/discuss To be discussed at the next planning. label Mar 3, 2023
@korthout
Member

korthout commented Mar 8, 2023

Discussed this issue in the ZPA triage:

  • not urgent for the 8.2 release (other issues have priority)
  • it is an important issue and should be fixed
  • marking it as later, as we will first focus on the 8.2 release

@korthout korthout added the area/performance Marks an issue as performance related label Mar 8, 2023
@aleksander-dytko aleksander-dytko added the pm-eye The issue that PM should keep an eye on. Issue with this label usually doesn't have high priority. label Mar 10, 2023
@megglos
Contributor

megglos commented Mar 20, 2023

As this issue strongly relates to the mission of the ZDP team, we will take it over to drive it forward.

@megglos
Contributor

megglos commented Apr 4, 2023

@abbasadel Ole might need support from a ZPA engineer, at least for alignment and discussion

@megglos megglos removed their assignment Apr 4, 2023
@megglos megglos removed the planning/discuss To be discussed at the next planning. label Apr 4, 2023
zeebe-bors-camunda bot added a commit that referenced this issue Apr 17, 2023
12454: [Backport stable/8.2] feat: run timer due date checker concurrently to processing r=oleschoenburg a=backport-action

# Description
Backport of #12403 to `stable/8.2`.

relates to #11594

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
@oleschoenburg oleschoenburg added the version:8.2.3 Marks an issue as being completely or in parts released in 8.2.3 label Apr 21, 2023
@megglos megglos added the version:8.1.11 Marks an issue as being completely or in parts released in 8.1.11 label Apr 26, 2023
@remcowesterhoud remcowesterhoud added version:8.3.0-alpha1 Marks an issue as being completely or in parts released in 8.3.0-alpha1 and removed version:8.3.0-alpha1 Marks an issue as being completely or in parts released in 8.3.0-alpha1 labels May 3, 2023
@megglos megglos added the version:8.3.0 Marks an issue as being completely or in parts released in 8.3.0 label Oct 5, 2023
7 participants