Document ActorScheduler Capabilities and Limitations #9142

Closed
2 tasks done
lenaschoenburg opened this issue Apr 14, 2022 · 1 comment
Assignees
Labels
kind/epic Categorizes an issue as an umbrella issue (e.g. OKR) which references other, smaller issues

Comments

@lenaschoenburg
Member

lenaschoenburg commented Apr 14, 2022

Description

Zeebe currently implements and uses the ActorScheduler, a system that coordinates work between different components that we call actors. There is a growing list of issues and missing features that we'd like to address by replacing the ActorScheduler with something else. To help us make an informed decision on what a replacement should look like, we should first properly document these issues and missing features.

Goals

When picking a replacement system, the capability overview should allow us to judge how much we need to re-implement or re-design.
The limitation overview can act as a wish-list of new features.

Results

Overall, the actor scheduler is used to serialize work and thus prevent data races.
Since we have many concerns regarding usability, testability, maintainability and observability, we'd like to migrate away from the actor scheduler and replace it with something else.

Usage of the actor scheduler is fairly limited: in production code we count 27 different actors. Nevertheless, replacing it will be difficult, since its usage is not well isolated and it is unclear how much we rely on some of its more subtle semantics, such as ordering.

There are two different approaches we can take. The first option, and originally the motivation to look at the existing actor scheduler, is to switch to a full actor framework. In practice that'd mean Akka since there appears to be no good alternative. The second option is that we could go the same route that other similar projects such as Kafka are going and try to rely solely on java.util.concurrent.

Obviously, these two approaches are very different and they come with different advantages and drawbacks.
Switching to Akka would give us access to many desirable properties: easier observability, the potential to use more of the Akka ecosystem, and a programming model that is already familiar to some of us and has proven useful for distributed systems. Testability should also improve dramatically, both through Akka's dedicated testing libraries and by allowing randomized property tests of actors and of the interactions between actors.

Switching to a more manual approach built on java.util.concurrent has the advantage that our code could end up being more idiomatic, which makes onboarding easier and potentially unlocks future improvements in the Java world such as Project Loom. It'd also align Zeebe with other projects in a similar space, such as Kafka. To quote from their coding guide:

Prefer the java.util.concurrent packages to either low-level wait-notify, custom locking/synchronization, or higher level scala-specific primitives. The util.concurrent stuff is well thought out and actually works correctly. There is a generally feeling that threads and locking are not going to be the concurrency primitives of the future because of a variety of well-known weaknesses they have. This is probably true, but they have the advantage of actually being mature enough to use for high-performance software right now; their well-known deficiencies are easily worked around by equally well known best-practices. So avoid actors, software transactional memory, tuple spaces, or anything else not written by Doug Lea and used by at least a million other productions systems. :-)

We should take some time to validate both approaches and see how they'd work for us in practice.
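As a starting point for validating the java.util.concurrent route, here is a minimal sketch of how work can be serialized without an actor framework. It is not taken from Zeebe: the class `SerializedCounter` and its methods are hypothetical, and the only assumption is that a component's state is touched exclusively by tasks running on one single-threaded executor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical component whose state is only ever touched on one thread. */
public final class SerializedCounter implements AutoCloseable {
  // Single-threaded executor: tasks run one at a time, in submission order.
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  // Plain, non-thread-safe state; safe because only the executor thread accesses it.
  private long count = 0;

  /** Enqueues an increment and returns a future holding the new value. */
  public CompletableFuture<Long> increment() {
    return CompletableFuture.supplyAsync(() -> ++count, executor);
  }

  /** Enqueues a read; it runs after every task submitted before it. */
  public CompletableFuture<Long> get() {
    return CompletableFuture.supplyAsync(() -> count, executor);
  }

  @Override
  public void close() {
    // Already submitted tasks still run; new submissions are rejected.
    executor.shutdown();
  }

  /** Four caller threads each enqueue 1000 increments; returns the final count. */
  public static long runDemo() throws Exception {
    try (SerializedCounter counter = new SerializedCounter()) {
      List<Thread> callers = new ArrayList<>();
      for (int t = 0; t < 4; t++) {
        Thread caller = new Thread(() -> {
          for (int i = 0; i < 1000; i++) {
            counter.increment();
          }
        });
        callers.add(caller);
        caller.start();
      }
      for (Thread caller : callers) {
        caller.join();
      }
      // Submitted after all increments were enqueued, so it observes all of them.
      return counter.get().get(5, TimeUnit.SECONDS);
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(runDemo());
  }
}
```

Callers never touch `count` directly, so no locks are needed, and tasks run in the order the queue received them. Whether that ordering is enough to replace what we currently rely on in the ActorScheduler is one of the open questions above.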

lenaschoenburg added the kind/epic label Apr 14, 2022
lenaschoenburg self-assigned this Apr 14, 2022
@pihme
Contributor

pihme commented Apr 14, 2022

@oleschoenburg When you create the issue for ActorScheduler limitations, remind me to rant about

  • ActorFuture
    • join()
    • runOnCompletion(), but only when called from within actor
    • why these two methods make the code hard to maintain (aka: should EmbeddedGatewayService.close() be called from an actor thread or from a non-actor thread? Answer: It must be called from a non-actor thread because of doAndLogException(topologyManager::close); in BrokerClientImpl.close() ).
    • lack of composability with ordinary futures (io.camunda.zeebe.broker.bootstrap.EmbeddedGatewayServiceStep#shutdownInternal)
  • Naming of classes
  • Lack of visibility/inspection
  • boundary between actor world and non-actor world, and lack of boundary markers
  • lifecycle
  • no supervisor chain and exception handling (io.camunda.zeebe.broker.bootstrap.AbstractBrokerStartupStep#forwardExceptions, io.camunda.zeebe.broker.bootstrap.BrokerAdminServiceStep#startupInternal)
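
To illustrate why it matters which thread a blocking wait is called from, here is a rough analogy using only java.util.concurrent. It does not use ActorFuture or any Zeebe API, and the class `BlockingVsComposing` is hypothetical. It assumes a single-threaded executor standing in for an actor thread: blocking on a future from inside a task can never succeed when the result has to be produced by that same thread, whereas composing futures avoids the wait entirely.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical analogy; a single-threaded executor stands in for an actor thread. */
public final class BlockingVsComposing {

  /** Blocks inside a task on work queued behind it; returns true if the wait timed out. */
  public static boolean blockingInsideTaskTimesOut() throws Exception {
    ExecutorService single = Executors.newSingleThreadExecutor();
    try {
      CompletableFuture<Boolean> outer = CompletableFuture.supplyAsync(() -> {
        // The inner task is queued behind this one on the only available thread.
        CompletableFuture<String> inner =
            CompletableFuture.supplyAsync(() -> "inner", single);
        try {
          // Without the timeout this wait would never return.
          inner.get(200, TimeUnit.MILLISECONDS);
          return false;
        } catch (TimeoutException e) {
          return true;
        } catch (Exception e) {
          throw new IllegalStateException(e);
        }
      }, single);
      return outer.get(5, TimeUnit.SECONDS);
    } finally {
      single.shutdownNow();
    }
  }

  /** Chains the second step instead of blocking, so the single thread stays free. */
  public static String composing() throws Exception {
    ExecutorService single = Executors.newSingleThreadExecutor();
    try {
      CompletableFuture<String> result =
          CompletableFuture.supplyAsync(() -> "outer", single)
              .thenComposeAsync(
                  first -> CompletableFuture.supplyAsync(() -> first + "+inner", single),
                  single);
      return result.get(5, TimeUnit.SECONDS);
    } finally {
      single.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(blockingInsideTaskTimesOut());
    System.out.println(composing());
  }
}
```

Nothing in the types tells a caller which of the two situations they are in, which is the same problem as the missing boundary markers between the actor world and the non-actor world.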
