Document `ActorScheduler` Capabilities and Limitations #9142

lenaschoenburg · 2022-04-14T11:23:15Z

Description

Zeebe currently implements and uses the ActorScheduler, a system that coordinates work between different components that we call actors. There is a growing list of issues and missing features that we'd like to address by replacing the ActorScheduler with something else. To help us make an informed decision on what a replacement should look like, we should first properly document these issues and missing features.

Goals

When picking a replacement system, the capability overview should allow us to judge how much we need to re-implement or re-design.
The limitation overview can act as a wish-list of new features.

Results

Overall, the actor scheduler is used to serialize work and thus prevent data races.
Since we have many concerns regarding usability, testability, maintainability and observability, we'd like to migrate away from the actor scheduler and replace it with something else.
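
To illustrate what "serializing work" means here, the following is a minimal sketch that uses only java.util.concurrent. It is not Zeebe code: `SerializedCounter` and its methods are hypothetical names chosen for the example. All mutations of a plain, non-thread-safe field are funneled through a single-threaded executor, so they run one after another and cannot race even though several threads submit work concurrently.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical example class, not part of Zeebe.
public class SerializedCounter {
  // Plain field without locks or atomics: only the executor thread touches it.
  private long count = 0;
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  // May be called from any thread; the increment itself runs on the executor thread.
  public void increment() {
    executor.execute(() -> count++);
  }

  // Reads the value on the executor thread, after all previously submitted work.
  public long closeAndGet() {
    try {
      return executor.submit(() -> count).get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for the result", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("reading the counter failed", e);
    } finally {
      executor.shutdown();
    }
  }

  // Starts several producer threads that all submit increments concurrently.
  static long runDemo(int producers, int incrementsPerProducer) {
    SerializedCounter counter = new SerializedCounter();
    Thread[] threads = new Thread[producers];
    for (int i = 0; i < producers; i++) {
      threads[i] = new Thread(() -> {
        for (int j = 0; j < incrementsPerProducer; j++) {
          counter.increment();
        }
      });
      threads[i].start();
    }
    try {
      for (Thread thread : threads) {
        thread.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for producers", e);
    }
    return counter.closeAndGet();
  }

  public static void main(String[] args) {
    System.out.println(runDemo(4, 1000)); // prints 4000
  }
}
```

The same idea, one logical owner thread per piece of state, is what an actor provides; the sketch only shows that the serialization property itself does not require a custom scheduler.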

Usage of the actor scheduler is fairly limited: in production code we count 27 different actors. Nevertheless, replacing its usage will be difficult, since that usage is not well isolated and it is unclear how much we rely on some of the more subtle semantics such as ordering.

There are two different approaches we can take. The first option, and originally the motivation to look at the existing actor scheduler, is to switch to a full actor framework. In practice that'd mean Akka since there appears to be no good alternative. The second option is that we could go the same route that other similar projects such as Kafka are going and try to rely solely on java.util.concurrent.

Obviously, these two approaches are very different and they come with different advantages and drawbacks.
Switching to Akka would give us access to many desirable properties such as easier observability, potential for further usage of the Akka ecosystem and a programming model that is already familiar to some of us and that has proven useful for distributed systems. Testability should also improve dramatically, both through Akka's dedicated testing libraries and by allowing randomized property tests of actors and of interactions between actors.

Switching to a more manual approach of using java.util.concurrent has the advantage that our code could end up being more idiomatic, which makes onboarding easier and potentially unlocks future improvements in the Java world such as Project Loom. It'd also align Zeebe with other projects in a similar space, such as Kafka. To quote from their coding guide:

Prefer the java.util.concurrent packages to either low-level wait-notify, custom locking/synchronization, or higher level scala-specific primitives. The util.concurrent stuff is well thought out and actually works correctly. There is a generally feeling that threads and locking are not going to be the concurrency primitives of the future because of a variety of well-known weaknesses they have. This is probably true, but they have the advantage of actually being mature enough to use for high-performance software right now; their well-known deficiencies are easily worked around by equally well known best-practices. So avoid actors, software transactional memory, tuple spaces, or anything else not written by Doug Lea and used by at least a million other productions systems. :-)

We should take some time to validate both approaches and see how they'd work for us in practice.

pihme · 2022-04-14T14:39:33Z

@oleschoenburg When you create the issue for ActorScheduler limitations, remind me to rant about

- ActorFuture
  - join()
  - runOnCompletion(), but only when called from within actor
  - why these two methods make the code hard to maintain (aka: should EmbeddedGatewayService.close() be called from an actor thread or from a non-actor thread? Answer: It must be called from a non-actor thread because of doAndLogException(topologyManager::close); in BrokerClientImpl.close() ).
  - lack of composability with ordinary futures (io.camunda.zeebe.broker.bootstrap.EmbeddedGatewayServiceStep#shutdownInternal)
- Naming of classes
- Lack of visibility/inspection
- boundary between actor world and non-actor world, and lack of boundary markers
- lifecycle
- no supervisor chain and exception handling (io.camunda.zeebe.broker.bootstrap.AbstractBrokerStartupStep#forwardExceptions, io.camunda.zeebe.broker.bootstrap.BrokerAdminServiceStep#startupInternal)
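
For comparison, here is a rough sketch of what composability and exception forwarding look like with ordinary futures. It uses only `java.util.concurrent.CompletableFuture` and does not reproduce `ActorFuture` behavior; `ComposableShutdown`, `closeAsync` and the component names are hypothetical stand-ins. Two asynchronous close steps are chained with `thenCompose`, a failure in either step is handled in one place with `exceptionally`, and the only blocking `join()` happens on a thread that does not execute the work itself.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical example class, not part of Zeebe.
public class ComposableShutdown {
  // Stand-in for a component that closes asynchronously on the given executor.
  static CompletableFuture<String> closeAsync(String name, ExecutorService executor) {
    return CompletableFuture.supplyAsync(() -> name + " closed", executor);
  }

  static String shutdownAll() {
    // A single worker thread, loosely analogous to one thread owning the work.
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      CompletableFuture<String> result =
          closeAsync("gateway", executor)
              // Chain the second step without blocking the worker thread.
              .thenCompose(first ->
                  closeAsync("topology", executor)
                      .thenApply(second -> first + ", " + second))
              // A failure in either step ends up here.
              .exceptionally(error -> "shutdown failed: " + error.getMessage());

      // Blocking is safe here only because the caller is NOT the worker thread.
      // Calling join() from inside a task on the single worker thread, on a
      // future that needs that same thread to complete, would never return.
      return result.join();
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(shutdownAll()); // prints "gateway closed, topology closed"
  }
}
```

The comment next to `join()` is the same "which thread am I on?" question raised above: with a single thread owning the work, blocking is only correct from outside that thread, and nothing in the types marks that boundary.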

lenaschoenburg added the kind/epic Categorizes an issue as an umbrella issue (e.g. OKR) which references other, smaller issues label Apr 14, 2022

lenaschoenburg self-assigned this Apr 14, 2022

lenaschoenburg mentioned this issue Apr 20, 2022

Overview of ActorScheduler limitations #9183

Closed

pihme mentioned this issue May 4, 2022

Implement some means to figure out where Zeebe's time is spent #9282

Closed

lenaschoenburg closed this as completed May 11, 2022
