Discarding expired messages causes periodic latency spikes #11591

romansmirnov · 2023-02-09T17:01:40Z

Describe the bug

Every 60s, the MessageTimeToLiveChecker runs to collect expired (published) messages to eventually discard them from the RocksDB state. While running, it iterates the deadline column family as long as there are expired messages, and for each expired message it appends an expire command to a batch. Once all expired messages are collected, it eventually will submit this batch of expire commands to the log stream.

Now, assuming that 1500 messages per second with a TTL > 0 are published, and Zeebe is configured with 3 partitions, that means

Every 60s, 90k messages must be deleted from the partitions.
Broken down by partition (assuming an even distribution), on every partition the checker needs to collect 30k messages every 60s.
Once collected a batch of 30k expire commands to discard these messages is submitted to the log stream (to be executed by the stream processor eventually).

This approach may cause latency spikes in two ways:

The MessageTimeToLiveChecker shares the same actor with the Stream Processor. Basically, while the checker runs the Stream Processor does not do any processing in the meantime. Meaning, if collecting the 30k expired messages by the checker takes e.g. ~100ms, the Stream Processor does not process anything during that time. Due to it, the Stream Processor's backlog increases, and everything in the backlog gets delayed by at least 100ms.
Once the Stream Processor reaches the batch of expire commands during processing, it will execute the 30k expire commands one by one and in a row without anything else in between. (Just as an example:) Assuming executing on expire command takes 0.1ms, it will take up to 3s to execute the batch of 30k expire commands. And in the meantime, there will be no progress in any process instances on that specific partition.

Expected behavior

The Stream Processor and the MessageTimeToLiveChecker does not share an actor so that the Stream Processor continues processing while the checker collects expired messages.
The checker only submits a batch with a limited number of expire commands. For example, when the checker runs it will collect the first 10 expired messages and submit them as a batch to the log stream. And then continues with collecting the next 10 expired messages, and so on until there are no expired messages. That way, discarding the message would interleave with any incoming commands from users/clients.
Instead of writing 10 expire commands, it could write just one command containing the 10 message keys to expire. (Might not be easily possible in terms of rolling upgrades, an old version of Zeebe would not be able to process such a command.)

Hints

A simple prototype to avoid sharing the actor can be found here: poc: async message time to live checker #11550
Things to consider: When the checker and the Stream Processor do not share an actor, they may run concurrently. That means, the checker reads from RocksDB, the Stream Processor (mostly) writes to RocksDB. While RocksDB itself is thread-safe, the Zeebe layer is not, for example, the DbMessageState or the TransactionContext, etc.
Also, while reading the state by the checker, the state might change.

Environment:

Zeebe Version: 8.1

related to SUPPORT-15892

The text was updated successfully, but these errors were encountered:

remcowesterhoud · 2023-02-15T09:08:29Z

I have changed this to a feature instead of a bug. Message expiry works as expected. We consider improving the performance of it to be a feature.

As it's a behavioural change we wouldn't want to make this part of a patch release. @romansmirnov you mentioned in the support case you'd like to see this as a patch release. Is this blocking the customer?

@felix-mueller Could you say something about the priority of this issue?

menski · 2023-02-15T09:32:35Z

From my discussion with @romansmirnov I can share that this is a blocking issue for the customer as they have a large amount of messages, which have to be processed without impacting the performance to reach their targets. Therefore the request for a patch release.

@felix-mueller should decide in the priority in the planning meeting, but based on my conversation with Roman this should have a high priority to resolve open issues with the customers production use case.

felix-mueller · 2023-02-15T09:50:32Z

@menski @remcowesterhoud this has very high priority and ideally should be done as soon as we have time. I would also vote for a patch release.

remcowesterhoud · 2023-02-15T10:03:10Z

Let's discuss it in the planning Friday 👍

menski · 2023-02-17T09:41:29Z

We discussed in our planning meeting 2023-02-17 that we will aim to work on this issue in the next iteration. I will inform support about this.

romansmirnov · 2023-02-17T10:31:15Z

@remcowesterhoud, as mentioned by @felix-mueller, the issue is critical for the customer. They need an 8.1 patch with the upcoming March release to hold their timelines.

@korthout, please feel free to approach me, if you need any further context. Also, we can discuss together the breakdown to decide on what we need to deliver at its core so that the customer can move on.

korthout · 2023-02-20T17:39:04Z

Discussed about this issue with @romansmirnov (and others).

We will aim to resolve the following tasks (in order of priority):

This issue also discussed the following, but we will not aim to achieve this in this iteration:

korthout · 2023-03-08T16:35:00Z

Closing this issue as the main parts are implemented in the patch release and forwarded to main. The other issues can be tracked individually

13400: feat: expire messages in TTL checker in batches r=korthout a=abbasadel ## Description  To further improve the performance of message expiration, the Message TTL Checker writes a single `MessageBatch:EXPIRE` command (instead of individual `Message:EXPIRE` commands) to expire a batch of messages simultaneously. ## Related issues  related: #11591 closes: #11953 Co-authored-by: Abbas Ibrahim <abbas.adel.ibrahim@gmail.com>

romansmirnov added kind/bug Categorizes an issue or PR as a bug support Marks an issue as related to a customer support request component/engine labels Feb 9, 2023

romansmirnov changed the title ~~Discarding expired messages causes latency spikes periodically.~~ Discarding expired messages causes periodic latency spikes Feb 9, 2023

romansmirnov mentioned this issue Feb 10, 2023

Triggering due timer events causes periodic latency spikes #11594

Closed

remcowesterhoud added kind/feature Categorizes an issue or PR as a feature, i.e. new behavior and removed kind/bug Categorizes an issue or PR as a bug labels Feb 15, 2023

korthout added the area/performance Marks an issue as performance related label Feb 15, 2023

remcowesterhoud added the severity/high Marks a bug as having a noticeable impact on the user with no known workaround label Feb 15, 2023

korthout self-assigned this Feb 17, 2023

This was referenced Feb 20, 2023

MessageTimeToLiveChecker blocks Stream Processor #11761

Closed

MessageTimeToLiveChecker clogs the log stream with commands #11762

Closed

This was referenced Mar 3, 2023

Add section about message expiration camunda/camunda-docs#1764

Closed

Expire multiple messages in single command #11953

Closed

korthout closed this as completed Mar 8, 2023

megglos added the version:8.1.9 Marks an issue as being completely or in parts released in 8.1.9 label Mar 14, 2023

megglos mentioned this issue Apr 3, 2023

poc: async message time to live checker #11550

Closed

14 tasks

lenaschoenburg mentioned this issue Apr 6, 2023

Scheduled tasks should not block processing #12302

Closed

korthout mentioned this issue Apr 6, 2023

Enable running Message TTL Checker async to stream processor by default #12307

Closed

abbasadel mentioned this issue Jul 19, 2023

feat: expire messages in TTL checker in batches #13400

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discarding expired messages causes periodic latency spikes #11591

Discarding expired messages causes periodic latency spikes #11591

romansmirnov commented Feb 9, 2023 •

edited

remcowesterhoud commented Feb 15, 2023

menski commented Feb 15, 2023

felix-mueller commented Feb 15, 2023

remcowesterhoud commented Feb 15, 2023

menski commented Feb 17, 2023

romansmirnov commented Feb 17, 2023

korthout commented Feb 20, 2023 •

edited

korthout commented Mar 8, 2023

Discarding expired messages causes periodic latency spikes #11591

Discarding expired messages causes periodic latency spikes #11591

Comments

romansmirnov commented Feb 9, 2023 • edited

remcowesterhoud commented Feb 15, 2023

menski commented Feb 15, 2023

felix-mueller commented Feb 15, 2023

remcowesterhoud commented Feb 15, 2023

menski commented Feb 17, 2023

romansmirnov commented Feb 17, 2023

korthout commented Feb 20, 2023 • edited

korthout commented Mar 8, 2023

romansmirnov commented Feb 9, 2023 •

edited

korthout commented Feb 20, 2023 •

edited