Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discarding expired messages causes periodic latency spikes #11591

Closed
romansmirnov opened this issue Feb 9, 2023 · 8 comments
Closed

Discarding expired messages causes periodic latency spikes #11591

romansmirnov opened this issue Feb 9, 2023 · 8 comments
Assignees
Labels
area/performance Marks an issue as performance related component/engine kind/feature Categorizes an issue or PR as a feature, i.e. new behavior severity/high Marks a bug as having a noticeable impact on the user with no known workaround support Marks an issue as related to a customer support request version:8.1.9 Marks an issue as being completely or in parts released in 8.1.9

Comments

@romansmirnov
Copy link
Member

romansmirnov commented Feb 9, 2023

Describe the bug

Every 60s, the MessageTimeToLiveChecker runs to collect expired (published) messages to eventually discard them from the RocksDB state. While running, it iterates the deadline column family as long as there are expired messages, and for each expired message it appends an expire command to a batch. Once all expired messages are collected, it eventually will submit this batch of expire commands to the log stream.

Now, assuming that 1500 messages per second with a TTL > 0 are published, and Zeebe is configured with 3 partitions, that means

  • Every 60s, 90k messages must be deleted from the partitions.
  • Broken down by partition (assuming an even distribution), on every partition the checker needs to collect 30k messages every 60s.
  • Once collected a batch of 30k expire commands to discard these messages is submitted to the log stream (to be executed by the stream processor eventually).

This approach may cause latency spikes in two ways:

  • The MessageTimeToLiveChecker shares the same actor with the Stream Processor. Basically, while the checker runs the Stream Processor does not do any processing in the meantime. Meaning, if collecting the 30k expired messages by the checker takes e.g. ~100ms, the Stream Processor does not process anything during that time. Due to it, the Stream Processor's backlog increases, and everything in the backlog gets delayed by at least 100ms.
  • Once the Stream Processor reaches the batch of expire commands during processing, it will execute the 30k expire commands one by one and in a row without anything else in between. (Just as an example:) Assuming executing on expire command takes 0.1ms, it will take up to 3s to execute the batch of 30k expire commands. And in the meantime, there will be no progress in any process instances on that specific partition.

Expected behavior

  • The Stream Processor and the MessageTimeToLiveChecker does not share an actor so that the Stream Processor continues processing while the checker collects expired messages.
  • The checker only submits a batch with a limited number of expire commands. For example, when the checker runs it will collect the first 10 expired messages and submit them as a batch to the log stream. And then continues with collecting the next 10 expired messages, and so on until there are no expired messages. That way, discarding the message would interleave with any incoming commands from users/clients.
  • Instead of writing 10 expire commands, it could write just one command containing the 10 message keys to expire. (Might not be easily possible in terms of rolling upgrades, an old version of Zeebe would not be able to process such a command.)

Hints

  • A simple prototype to avoid sharing the actor can be found here: poc: async message time to live checker #11550
  • Things to consider: When the checker and the Stream Processor do not share an actor, they may run concurrently. That means, the checker reads from RocksDB, the Stream Processor (mostly) writes to RocksDB. While RocksDB itself is thread-safe, the Zeebe layer is not, for example, the DbMessageState or the TransactionContext, etc.
  • Also, while reading the state by the checker, the state might change.

Environment:

  • Zeebe Version: 8.1

related to SUPPORT-15892

@romansmirnov romansmirnov added kind/bug Categorizes an issue or PR as a bug support Marks an issue as related to a customer support request component/engine labels Feb 9, 2023
@romansmirnov romansmirnov changed the title Discarding expired messages causes latency spikes periodically. Discarding expired messages causes periodic latency spikes Feb 9, 2023
@remcowesterhoud remcowesterhoud added kind/feature Categorizes an issue or PR as a feature, i.e. new behavior and removed kind/bug Categorizes an issue or PR as a bug labels Feb 15, 2023
@korthout korthout added the area/performance Marks an issue as performance related label Feb 15, 2023
@remcowesterhoud
Copy link
Contributor

I have changed this to a feature instead of a bug. Message expiry works as expected. We consider improving the performance of it to be a feature.

As it's a behavioural change we wouldn't want to make this part of a patch release. @romansmirnov you mentioned in the support case you'd like to see this as a patch release. Is this blocking the customer?

@felix-mueller Could you say something about the priority of this issue?

@menski
Copy link
Contributor

menski commented Feb 15, 2023

From my discussion with @romansmirnov I can share that this is a blocking issue for the customer as they have a large amount of messages, which have to be processed without impacting the performance to reach their targets. Therefore the request for a patch release.

@felix-mueller should decide in the priority in the planning meeting, but based on my conversation with Roman this should have a high priority to resolve open issues with the customers production use case.

@felix-mueller
Copy link
Member

@menski @remcowesterhoud this has very high priority and ideally should be done as soon as we have time. I would also vote for a patch release.

@remcowesterhoud
Copy link
Contributor

Let's discuss it in the planning Friday 👍

@remcowesterhoud remcowesterhoud added the severity/high Marks a bug as having a noticeable impact on the user with no known workaround label Feb 15, 2023
@menski
Copy link
Contributor

menski commented Feb 17, 2023

We discussed in our planning meeting 2023-02-17 that we will aim to work on this issue in the next iteration. I will inform support about this.

@korthout korthout self-assigned this Feb 17, 2023
@romansmirnov
Copy link
Member Author

@remcowesterhoud, as mentioned by @felix-mueller, the issue is critical for the customer. They need an 8.1 patch with the upcoming March release to hold their timelines.

@korthout, please feel free to approach me, if you need any further context. Also, we can discuss together the breakdown to decide on what we need to deliver at its core so that the customer can move on.

@korthout
Copy link
Member

korthout commented Feb 20, 2023

Discussed about this issue with @romansmirnov (and others).

We will aim to resolve the following tasks (in order of priority):

This issue also discussed the following, but we will not aim to achieve this in this iteration:

@korthout
Copy link
Member

korthout commented Mar 8, 2023

Closing this issue as the main parts are implemented in the patch release and forwarded to main. The other issues can be tracked individually

@korthout korthout closed this as completed Mar 8, 2023
@megglos megglos added the version:8.1.9 Marks an issue as being completely or in parts released in 8.1.9 label Mar 14, 2023
zeebe-bors-camunda bot added a commit that referenced this issue Jul 24, 2023
13400: feat: expire messages in TTL checker in batches r=korthout a=abbasadel

## Description

<!-- Please explain the changes you made here. -->

To further improve the performance of message expiration, the Message TTL Checker writes a single `MessageBatch:EXPIRE` command (instead of individual `Message:EXPIRE` commands) to expire a batch of messages simultaneously. 

## Related issues

<!-- Which issues are closed by this PR or are related -->
related: #11591
closes: #11953


Co-authored-by: Abbas Ibrahim <abbas.adel.ibrahim@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/performance Marks an issue as performance related component/engine kind/feature Categorizes an issue or PR as a feature, i.e. new behavior severity/high Marks a bug as having a noticeable impact on the user with no known workaround support Marks an issue as related to a customer support request version:8.1.9 Marks an issue as being completely or in parts released in 8.1.9
Projects
None yet
Development

No branches or pull requests

6 participants