
LogStorageAppender Actor occupies an Actor Thread forever due to a full backpressure queue #8540

Closed
romansmirnov opened this issue Jan 6, 2022 · 1 comment · Fixed by #8582
Labels: area/performance, kind/bug, scope/broker, support, version:1.3.2

romansmirnov (Member) commented Jan 6, 2022

Describe the bug

The LogStorageAppender subscribes to the write buffer (i.e., Dispatcher) to read from it and to append the available fragments to the LogStorage:

https://github.com/camunda-cloud/zeebe/blob/73e5c7be9f453e30b5aebc07aae322ee5f82b11e/logstreams/src/main/java/io/camunda/zeebe/logstreams/impl/log/LogStorageAppender.java#L151-L154
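For orientation, a minimal sketch of that wiring (paraphrased; the method names are illustrative, not the exact code behind the permalink):

```java
// Paraphrased sketch of the subscription wiring (see permalink above).
// The appender opens a subscription on the write buffer (Dispatcher) and
// registers a consumer, so the actor gets scheduled whenever new
// fragments become available.
final Subscription writeBufferSubscription = writeBuffer.openSubscription(getName());
actor.consume(writeBufferSubscription, this::onWriteBufferAvailable);
```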

However, before appending to the LogStorage, the LogStorageAppender tries to acquire a "token" from a limiter (i.e., the backpressure queue), and only if a token is acquired does it append to the LogStorage (marking the read fragments as read in the write buffer). Otherwise, the fragments are not appended to the LogStorage and remain marked as unread in the write buffer:

https://github.com/camunda-cloud/zeebe/blob/73e5c7be9f453e30b5aebc07aae322ee5f82b11e/logstreams/src/main/java/io/camunda/zeebe/logstreams/impl/log/LogStorageAppender.java#L119-L136
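Conceptually, the guarded append looks like this (a simplified paraphrase of the linked code; helper names like `extractPositions` are illustrative):

```java
// Simplified paraphrase of the guarded append (see permalink above).
private void appendBlock(final BlockPeek blockPeek) {
  final ByteBuffer buffer = blockPeek.getRawBuffer();
  final Positions positions = extractPositions(buffer); // illustrative helper

  if (appendEntryLimiter.tryAcquire(positions.highest)) {
    logStorage.append(positions.lowest, positions.highest, buffer, listener);
    blockPeek.markCompleted(); // fragments are marked as read
  } else {
    blockPeek.markFailed();    // fragments stay unread and will be seen again
  }
}
```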

So whenever fragments are available on the write buffer, the LogStorageAppender Actor is submitted to the broker thread group's task queue, so that a broker thread can eventually execute it. When executed, the Actor checks whether any fragments are available on the write buffer:

https://github.com/camunda-cloud/zeebe/blob/73e5c7be9f453e30b5aebc07aae322ee5f82b11e/util/src/main/java/io/camunda/zeebe/util/sched/channel/ChannelConsumerCondition.java#L43-L48

and if fragments are available, it executes the corresponding Actor Job (i.e., reads from the write buffer and tries to append to the LogStorage).
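In essence, the condition only asks the channel whether data is pending; the appender's backpressure limiter is never consulted (paraphrased from the linked lines):

```java
// Paraphrased from ChannelConsumerCondition (see permalink above):
// "work available" is decided purely by the channel, independent of
// whether the appender could actually acquire a backpressure token.
@Override
public boolean poll() {
  return channel.hasAvailable();
}
```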

The actor job is executed as long as ChannelConsumerCondition#poll() returns true.

In a scenario where the appender's backpressure queue is full (i.e., no token can be acquired from the limiter), the Actor Thread keeps executing the actor job because ChannelConsumerCondition#poll() still returns true. As long as the backpressure queue is not drained, the Actor Thread will continue executing that job forever.
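Condensed to pseudocode, the resulting busy loop looks like this (hypothetical; not the actual scheduler source):

```java
// Hypothetical condensed view of the busy loop: poll() keeps returning
// true because the fragments were never marked as read, and tryAcquire()
// keeps failing because the backpressure queue is full, so the actor
// thread never returns to its task queue.
while (condition.poll()) {  // true: unread fragments still in the write buffer
  appendBlock(blockPeek);   // tryAcquire() fails -> markFailed() -> still unread
}
```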

For example, this can happen in the following case:

  1. For a certain partition where the broker is the leader, the appender's backpressure queue is full (so that no token can be acquired from the limiter).
  2. The Actor Thread executes the LogStorageAppender's actor job, i.e., it reads from the write buffer and tries to append to the LogStorage (ChannelConsumerCondition#poll() always returns true, but no token can be acquired).
  3. For the same partition, the Raft layer transitions from leader to follower. This submits an Actor Task to ensure that the Zeebe application layer transitions to follower as well (i.e., that the corresponding services are stopped and started).
  4. The submitted Actor Task gets assigned to the task queue of the same Actor Thread that is already executing the LogStorageAppender.

As a consequence, the Actor Task submitted to transition to follower on the Zeebe application layer is never executed, because the Actor Thread is occupied by the LogStorageAppender Actor, which never releases it.

Note: This may also happen in other scenarios in which no leader change happens.

What is the impact of that issue?

In the worst case, all Actor Threads end up occupied by such Actors (for different partitions). This results in:

  • The broker being stuck in those Actor Jobs, never able to make progress with any other Actor
  • Growing Actor Job Queues, as long as others (e.g., Raft) are still able to submit jobs to them, which may result in OOMs
  • ...

Expected behavior
The LogStorageAppender Actor releases the Actor Thread so that other Actor Jobs can be executed.

Possible Solutions

  • When no token can be acquired because the backpressure queue is full, resubmit the Actor Task to the actor thread's task queue (see the sketch below).
  • When executing ChannelConsumerCondition#poll(), take the backpressure queue into account.
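A minimal sketch of the first option, assuming the actor API exposes a yield primitive (the name `actor.yieldThread()` is an assumption here, not a confirmed method):

```java
// Sketch of option 1: on backpressure, give up the thread instead of looping.
if (appendEntryLimiter.tryAcquire(positions.highest)) {
  logStorage.append(positions.lowest, positions.highest, buffer, listener);
  blockPeek.markCompleted();
} else {
  blockPeek.markFailed();
  actor.yieldThread(); // assumed API: resubmit this task and release the thread
}
```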

Environment:

  • Zeebe Version: 1.3.0

related to https://jira.camunda.com/browse/SUPPORT-11966

@romansmirnov romansmirnov added the kind/bug label Jan 6, 2022
@romansmirnov romansmirnov added this to Ready in Zeebe Jan 6, 2022
@romansmirnov romansmirnov self-assigned this Jan 6, 2022
@romansmirnov romansmirnov added the area/performance, scope/broker, and support labels Jan 6, 2022
@romansmirnov romansmirnov moved this from Ready to In progress in Zeebe Jan 6, 2022
@romansmirnov romansmirnov moved this from In progress to Ready in Zeebe Jan 6, 2022
Zelldon (Member) commented Jan 7, 2022

I played around a bit with metrics and actors today; if you know how to reproduce the issue, this could help make it visible. Currently it doesn't look like this happens in normal runs.

[Graph: actor invocations per second]

This graph shows how often each actor is called per second.

@romansmirnov romansmirnov moved this from Ready to In progress in Zeebe Jan 13, 2022
@romansmirnov romansmirnov moved this from In progress to Review in progress in Zeebe Jan 18, 2022
ghost pushed a commit that referenced this issue Jan 19, 2022
8582: fix(log/appender): yield thread when experiencing backpressure r=romansmirnov a=romansmirnov

## Description

Yield the thread when the log storage appender experiences backpressure while trying to append fragments to the log storage. That way, the actor task (log storage appender) is resubmitted to the work queue, and the actor thread is released to execute other actor tasks.
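Conceptually, the yield turns the in-place loop into cooperative rescheduling (hypothetical pseudocode of the scheduler side, not the actual implementation):

```java
// Hypothetical view of the actor thread's run loop after the fix
// (idle/park handling omitted):
for (;;) {
  final ActorTask task = workQueue.poll();
  if (task == null) {
    continue; // the real scheduler would park here; omitted for brevity
  }
  task.execute();          // appender hits backpressure and yields
  if (task.hasYielded()) { // assumed flag set by the yield primitive
    workQueue.offer(task); // re-queued behind other pending tasks, so e.g.
  }                        // a leader-to-follower transition task can now run
}
```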

## Related issues

closes #8540 



8605: fix(log/stream): ensure the appender future always gets completed r=romansmirnov a=romansmirnov

## Description

* Handles any `Throwable` thrown in the `LogStream` actor, so that the appender future is completed exceptionally.
* Handles the situation where the `LogStream` actor is supposed to close while the appender is being opened. In this situation, the appender future is completed exceptionally as well.
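A hedged sketch of what this amounts to (names such as `createAndOpenAppender` and `isClosing` are illustrative, not the exact `LogStream` code):

```java
// Illustrative sketch: ensure the appender future can never stay pending.
private void openAppender(final ActorFuture<LogStorageAppender> future) {
  if (isClosing) {
    // the actor is supposed to close -> fail fast instead of leaving
    // the future uncompleted
    future.completeExceptionally(
        new IllegalStateException("log stream is closing"));
    return;
  }
  try {
    future.complete(createAndOpenAppender()); // illustrative helper
  } catch (final Throwable t) { // any failure completes the future exceptionally
    future.completeExceptionally(t);
  }
}
```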

## Related issues

closes #7992



8615: deps(maven): bump value from 2.8.9-ea-1 to 2.9.0 r=npepinpe a=dependabot[bot]

Bumps [value](https://github.com/immutables/immutables) from 2.8.9-ea-1 to 2.9.0.

Co-authored-by: Roman <roman.smirnov@camunda.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@ghost ghost closed this as completed in 6ebfa79 Jan 19, 2022
Zeebe automation moved this from Review in progress to Done Jan 19, 2022
ghost pushed a commit that referenced this issue Jan 20, 2022
8617: [Backport stable/1.2] fix(log/appender): yield thread when experiencing backpressure r=romansmirnov a=github-actions[bot]

# Description
Backport of #8582 to `stable/1.2`.

relates to #8540

Co-authored-by: Roman <roman.smirnov@camunda.com>
ghost pushed a commit that referenced this issue Jan 20, 2022
8618: [Backport stable/1.3] fix(log/appender): yield thread when experiencing backpressure r=romansmirnov a=github-actions[bot]

# Description
Backport of #8582 to `stable/1.3`.

relates to #8540

Co-authored-by: Roman <roman.smirnov@camunda.com>
@npepinpe npepinpe added the version:1.3.2 and Release: 1.2.10 labels Jan 28, 2022
@KerstinHebel KerstinHebel removed this from Done in Zeebe Mar 23, 2022
This issue was closed.