
LogStorageAppender Actor occupies an Actor Thread forever due to a full backpressure queue #8540

Closed
romansmirnov opened this issue Jan 6, 2022 · 1 comment · Fixed by #8582
Labels: area/performance, kind/bug, scope/broker, support, version:1.3.2

romansmirnov (Member) commented Jan 6, 2022

Describe the bug

The LogStorageAppender subscribes to the write buffer (i.e., Dispatcher) to read from it and to append the available fragments to the LogStorage:

https://github.com/camunda-cloud/zeebe/blob/73e5c7be9f453e30b5aebc07aae322ee5f82b11e/logstreams/src/main/java/io/camunda/zeebe/logstreams/impl/log/LogStorageAppender.java#L151-L154
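For orientation, a minimal sketch of that wiring (paraphrased; the method names are illustrative, not the exact code behind the permalink):

```java
// Paraphrased sketch of the subscription wiring (see permalink above).
// The appender opens a subscription on the write buffer (Dispatcher) and
// registers a consumer, so the actor gets scheduled whenever new
// fragments become available.
final Subscription writeBufferSubscription = writeBuffer.openSubscription(getName());
actor.consume(writeBufferSubscription, this::onWriteBufferAvailable);
```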

However, before appending to the LogStorage, the LogStorageAppender tries to acquire a "token" from a limiter (i.e., the backpressure queue), and only if a token is acquired does it append to the LogStorage (marking the read fragments as read in the write buffer). Otherwise, the fragments are not appended to the LogStorage and remain marked as unread in the write buffer:

https://github.com/camunda-cloud/zeebe/blob/73e5c7be9f453e30b5aebc07aae322ee5f82b11e/logstreams/src/main/java/io/camunda/zeebe/logstreams/impl/log/LogStorageAppender.java#L119-L136
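Conceptually, the guarded append looks like this (a simplified paraphrase of the linked code; helper names like `extractPositions` are illustrative):

```java
// Simplified paraphrase of the guarded append (see permalink above).
private void appendBlock(final BlockPeek blockPeek) {
  final ByteBuffer buffer = blockPeek.getRawBuffer();
  final Positions positions = extractPositions(buffer); // illustrative helper

  if (appendEntryLimiter.tryAcquire(positions.highest)) {
    logStorage.append(positions.lowest, positions.highest, buffer, listener);
    blockPeek.markCompleted(); // fragments are marked as read
  } else {
    blockPeek.markFailed();    // fragments stay unread and will be seen again
  }
}
```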

So whenever fragments are available on the write buffer, the LogStorageAppender Actor is submitted to the broker thread group's task queue, so that a broker thread can eventually execute it. When executed, the Actor checks whether any fragments are available on the write buffer:

https://github.com/camunda-cloud/zeebe/blob/73e5c7be9f453e30b5aebc07aae322ee5f82b11e/util/src/main/java/io/camunda/zeebe/util/sched/channel/ChannelConsumerCondition.java#L43-L48

and if fragments are available, it executes the corresponding Actor Job (i.e., reads from the write buffer and tries to append to the LogStorage).
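In essence, the condition only asks the channel whether data is pending; the appender's backpressure limiter is never consulted (paraphrased from the linked lines):

```java
// Paraphrased from ChannelConsumerCondition (see permalink above):
// "work available" is decided purely by the channel, independent of
// whether the appender could actually acquire a backpressure token.
@Override
public boolean poll() {
  return channel.hasAvailable();
}
```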

The actor job is executed as long as ChannelConsumerCondition#poll() returns true.

In a scenario where the appender's backpressure queue is full (i.e., no token can be acquired from the limiter), the Actor Thread keeps executing the actor job because ChannelConsumerCondition#poll() still returns true. As long as the backpressure queue is not drained, the Actor Thread will continue executing that job forever.
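Condensed to pseudocode, the resulting busy loop looks like this (hypothetical; not the actual scheduler source):

```java
// Hypothetical condensed view of the busy loop: poll() keeps returning
// true because the fragments were never marked as read, and tryAcquire()
// keeps failing because the backpressure queue is full, so the actor
// thread never returns to its task queue.
while (condition.poll()) {  // true: unread fragments still in the write buffer
  appendBlock(blockPeek);   // tryAcquire() fails -> markFailed() -> still unread
}
```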

For example, this can happen in the following case:

  1. For a certain partition where the broker is the leader, the appender's backpressure queue is full (so that no token can be acquired from the limiter).
  2. The Actor Thread executes the LogStorageAppender's actor job, i.e., it reads from the write buffer and tries to append to the LogStorage (ChannelConsumerCondition#poll() always returns true, but no token can be acquired).
  3. For the same partition, the Raft layer transitions from leader to follower. This submits an Actor Task to ensure that the Zeebe application layer transitions to follower as well (i.e., that the corresponding services are stopped and started).
  4. The submitted Actor Task gets assigned to the task queue of the same Actor Thread that is already executing the LogStorageAppender.

As a consequence, the Actor Task submitted to transition to follower on the Zeebe application layer is never executed, because the Actor Thread is occupied by the LogStorageAppender Actor, which never releases it.

Note: This may also happen in other scenarios in which no leader change happens.

What is the impact of that issue?

In the worst case, all Actor Threads end up occupied by such Actors (for different partitions). This results in:

  • The broker being stuck in those Actor Jobs, never able to make progress with any other Actor
  • Growing Actor Job Queues, as long as others (e.g., Raft) are still able to submit jobs to them, which may result in OOMs
  • ...

Expected behavior
The LogStorageAppender Actor releases the Actor Thread so that other Actor Jobs can be executed.

Possible Solutions

  • When no token can be acquired because the backpressure queue is full, resubmit the Actor Task to the actor thread's task queue (see the sketch below).
  • When executing ChannelConsumerCondition#poll(), take the backpressure queue into account.
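A minimal sketch of the first option, assuming the actor API exposes a yield primitive (the name `actor.yieldThread()` is an assumption here, not a confirmed method):

```java
// Sketch of option 1: on backpressure, give up the thread instead of looping.
if (appendEntryLimiter.tryAcquire(positions.highest)) {
  logStorage.append(positions.lowest, positions.highest, buffer, listener);
  blockPeek.markCompleted();
} else {
  blockPeek.markFailed();
  actor.yieldThread(); // assumed API: resubmit this task and release the thread
}
```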

Environment:

  • Zeebe Version: 1.3.0

related to https://jira.camunda.com/browse/SUPPORT-11966

@romansmirnov romansmirnov added the kind/bug label Jan 6, 2022
@romansmirnov romansmirnov added this to Ready in Zeebe Jan 6, 2022
@romansmirnov romansmirnov self-assigned this Jan 6, 2022
@romansmirnov romansmirnov added the area/performance, scope/broker, and support labels Jan 6, 2022
@romansmirnov romansmirnov moved this from Ready to In progress in Zeebe Jan 6, 2022
@romansmirnov romansmirnov moved this from In progress to Ready in Zeebe Jan 6, 2022
Zelldon (Member) commented Jan 7, 2022

I played around a bit with metrics and actors today; if you know how to reproduce the issue, this could help make it visible. Currently it doesn't look like this happens in normal runs.

[Graph: actor invocations per second]

This graph shows how often each actor is called per second.

@romansmirnov romansmirnov moved this from Ready to In progress in Zeebe Jan 13, 2022
@romansmirnov romansmirnov moved this from In progress to Review in progress in Zeebe Jan 18, 2022
ghost pushed a commit that referenced this issue Jan 19, 2022
8582: fix(log/appender): yield thread when experiencing backpressure r=romansmirnov a=romansmirnov

## Description

Yield the thread when the log storage appender experiences backpressure while trying to append fragments to the log storage. That way, the actor task (log storage appender) is resubmitted to the work queue, and the actor thread is released to execute other actor tasks.
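Conceptually, the yield turns the in-place loop into cooperative rescheduling (hypothetical pseudocode of the scheduler side, not the actual implementation):

```java
// Hypothetical view of the actor thread's run loop after the fix
// (idle/park handling omitted):
for (;;) {
  final ActorTask task = workQueue.poll();
  if (task == null) {
    continue; // the real scheduler would park here; omitted for brevity
  }
  task.execute();          // appender hits backpressure and yields
  if (task.hasYielded()) { // assumed flag set by the yield primitive
    workQueue.offer(task); // re-queued behind other pending tasks, so e.g.
  }                        // a leader-to-follower transition task can now run
}
```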

## Related issues

closes #8540 



8605: fix(log/stream): ensure the appender future always gets completed r=romansmirnov a=romansmirnov

## Description

* Handles any `Throwable` thrown in the `LogStream` actor, so that the appender future is completed exceptionally.
* Handles the situation where the `LogStream` actor is supposed to close while the appender is being opened. In this situation, the appender future is completed exceptionally as well.
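A hedged sketch of what this amounts to (names such as `createAndOpenAppender` and `isClosing` are illustrative, not the exact `LogStream` code):

```java
// Illustrative sketch: ensure the appender future can never stay pending.
private void openAppender(final ActorFuture<LogStorageAppender> future) {
  if (isClosing) {
    // the actor is supposed to close -> fail fast instead of leaving
    // the future uncompleted
    future.completeExceptionally(
        new IllegalStateException("log stream is closing"));
    return;
  }
  try {
    future.complete(createAndOpenAppender()); // illustrative helper
  } catch (final Throwable t) { // any failure completes the future exceptionally
    future.completeExceptionally(t);
  }
}
```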

## Related issues

closes #7992



8615: deps(maven): bump value from 2.8.9-ea-1 to 2.9.0 r=npepinpe a=dependabot[bot]

Bumps [value](https://github.com/immutables/immutables) from 2.8.9-ea-1 to 2.9.0.

Co-authored-by: Roman <roman.smirnov@camunda.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@ghost ghost closed this as completed in 6ebfa79 Jan 19, 2022
Zeebe automation moved this from Review in progress to Done Jan 19, 2022
ghost pushed a commit that referenced this issue Jan 20, 2022
8617: [Backport stable/1.2] fix(log/appender): yield thread when experiencing backpressure r=romansmirnov a=github-actions[bot]

# Description
Backport of #8582 to `stable/1.2`.

relates to #8540

Co-authored-by: Roman <roman.smirnov@camunda.com>
ghost pushed a commit that referenced this issue Jan 20, 2022
8618: [Backport stable/1.3] fix(log/appender): yield thread when experiencing backpressure r=romansmirnov a=github-actions[bot]

# Description
Backport of #8582 to `stable/1.3`.

relates to #8540

Co-authored-by: Roman <roman.smirnov@camunda.com>
@npepinpe npepinpe added the version:1.3.2 and Release: 1.2.10 labels Jan 28, 2022
@KerstinHebel KerstinHebel removed this from Done in Zeebe Mar 23, 2022
This issue was closed.