Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overview of ActorScheduler limitations #9183

Closed
Tracked by #9142
lenaschoenburg opened this issue Apr 20, 2022 · 8 comments
Closed
Tracked by #9142

Overview of ActorScheduler limitations #9183

lenaschoenburg opened this issue Apr 20, 2022 · 8 comments
Assignees
Labels
area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) kind/documentation Categorizes an issue or PR as documentation related kind/research Marks an issue as part of a research or investigation

Comments

@lenaschoenburg
Copy link
Member

lenaschoenburg commented Apr 20, 2022

Let's go through the *bilities:

Usability

Perhaps the most painful limitation of the actor scheduler is it's usability. For new contributors, using the actor scheduler correctly is difficult. Contributing factors are confusing naming of classes1 and methods, unclear semantics and confusing interactions between different components.

Even for more experienced contributors it is often difficult to use the actor scheduler. Here, the lacking interoperability with regular Java futures can be tricky to work around. Since code that runs within an actor thread is not clearly distinguishable from non-actor code, it is often unclear which methods should or must be used.2

Actors should implement a public interface that only schedules work that is then implemented in private methods 3. However, there is no way to enforce that design and it makes testing difficult since tests then either have to interact with the actor scheduler or have access to private methods. Test interactions with the actor scheduler are difficult since the only control is a workUntilDone method.

Some, perhaps most, actors are conceptually state machines. This is not something that is easily expressible when using the actor scheduler. Instead, actors usually rely on asynchronous callbacks to serialize the units of work that are necessary to handle a request. When other requests come in, the actor must make sure that it doesn't start handling this request if it could cause issues.
An example for this is the snapshot director which must only take one concurrent snapshot at a time. As a state machine this would be easily implemented by not accepting new snapshot requests in certain states. Instead, the snapshot director has to maintain local state and, for every unit of work, check the local state and decide if and how it can progress. If we factor in error handling, the number of conditions explode and it becomes easy to introduce bugs.

Related to state machines are lifecycles. The actor scheduler allows actors to implement hooks. These are most commonly used to acquire resources during startup and release resources when shutting down. What is missing, is a standardized way to query or listen to other actor's lifecycle. This is usually implemented ad-hoc via listeners. 4

There is more to be said with regards to error handling and supervision, already mentioned here: #9183 (comment)

The actor scheduler is also a one-size-fits-all solution. Unlike full actor frameworks or more custom approaches, there is little room for configuring aspects such as job ordering, latency demands, priorities etc. on a per-actor basis.

Maintainability

Broadly speaking, there is consensus that the actor scheduler in it's current state is not maintainable. We do have some evidence of that, both in bugs that have been unnoticed for multiple years, as well as the number of contributions made. Of course both can be attributed to the apparent reliability of the actor scheduler. After all, if it works reliably, why make changes? To some extend this might hold true but seeing as there are so many pain points relating to usability that haven't been fixed, at least part of the reason must be that no one feels comfortable making changes in the actor scheduler.

I believe there are three contributing factors that make the actor scheduler difficult to maintain.

First, there are many interactions between different parts of the actor scheduler. Perhaps this is best shown by this diagram, first produced by @Zelldon :
ActorControl

Second, some parts of the actor scheduler seem inherently complex. Good examples for that are the pervasive use of unsafe and low-level details such as padding.

Third, the sometimes confusing naming and lack of documentation, which already contributes to usability issues, also complicates the maintenance.

Observability

The actor scheduler is largely a black box with little to no visibility. As @saig0 rightfully pointed out, this is often not an issue as the actor scheduler seems to work reliably. However, when we encounter issues, this lack of visibility hinders troubleshooting.
Potentially the most useful feature, distributed tracing, is not included in the actor scheduler and would have to be implemented ourselves. Of course this play right into the maintainability issues.

Reliability

As mentioned before, reliability of the actor scheduler itself is apparently quite high. Problems arise mainly due to "user errors" but rather than giving up we should thrive for a system that makes user errors less likely to occur.
There are multiple footguns, all described well by @pihme already2.
Error handling deserves special mention here 5. Since exceptions are often handled ad-hoc and inconsistently between each actor, there is always the risk of swallowing important exceptions.
Health monitoring, let alone a full supervision, is difficult at best. Currently this is implemented by listeners. There is no designated mechanism for cascading failures.
While individual actors can be named, there is nothing that prevents from having multiple instances of the same actor alive. In the past, this has led to serious issues 6.

Footnotes

  1. https://github.com/camunda/zeebe/issues/9183#issuecomment-1110068698

  2. https://github.com/camunda/zeebe/issues/9183#issuecomment-1110051630 2

  3. https://github.com/camunda/zeebe/issues/9183#issuecomment-1110123588

  4. https://github.com/camunda/zeebe/issues/9183#issuecomment-1110133575

  5. https://github.com/camunda/zeebe/issues/9183#issuecomment-1110142104

  6. https://github.com/camunda/zeebe/issues/8044

@lenaschoenburg lenaschoenburg self-assigned this Apr 20, 2022
@lenaschoenburg lenaschoenburg added team/distributed kind/research Marks an issue as part of a research or investigation area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) kind/documentation Categorizes an issue or PR as documentation related labels Apr 20, 2022
@lenaschoenburg
Copy link
Member Author

@pihme This is your reminder to rant 😄

@pihme
Copy link
Contributor

pihme commented Apr 26, 2022

So here my comments. These comments are mostly from the perspective of someone trying to use the actor scheduler, or to test code using the actor scheduler. I might also be throwing actor scheduler, actors, actor futures and other elements into one big basket.

Also many of the complaints may become totally irrelevant if we move to a third party framework like Akka or something else.

Anything in italic is strongly opinionated.

ActorFuture

Let's start with a quiz question. If you want to call io.camunda.zeebe.broker.system.EmbeddedGatewayService#close which thread do you need to use? a) an actor thread, b) a non-actor thread c) any thread will do.

Answer is b. It must be called from a non-actor thread.

This is, because close calls

  • gateway.stop(), which calls
  • brokerClient.close() which calls
  • doAndLogException() which calls
  • topolgyManager.close() which calls
  • ActorFuture.join() which must not be called from an actor thread

There are at least two methods on ActorFuture which are problematic:

  • join() - which must not be called from an actor thread, and
  • runOnCompletion(...) which must be called from an actor thread, actually it must be called from the same thread that the actor lives on (io.camunda.zeebe.util.sched.ActorControl#isCalledFromWithinActor)

This splits the code base into three categories:

  1. Code that must be called from an actor thread
  2. Code that must not be called from an actor thread
  3. Other code

The categories 1 and 2 are transitive. They apply up and down the entire call chain. So the one call to ActorFuture.join() in the call tree of EmbeddedGatewayService#close means that you cannot call call from close a method which needs to be on an actor thread.

This makes code incredibly hard to change. Because whenever you make any change, you need to check whether you don't change the category of the call hierarchies that are effected by the change. If you want to join on an actor, then suddenly the whole call hierachy falls into category 1. If you want to call another mehod, you first need to check in which category the code you are calling is in. And there is no naming convention, no automatic checks that can do it for you. All you can do is run the code and wait for the exceptions.

Personally, I think the ActorFuture is a huge trap. I think it should not implement Future, because it behaves so very differently. You cannot compose it with ordinary futures. I also think it should be two classes. I.e. a composable future which has runOnCompletion, but no join. And a joinable future which has just join, but no compose. And then the composable future could have a method toJoinableFuture to convert to it. Then, I think you at least have a chance by looking at the signature to see what the code expects in terms of runtime thread.

The exception on the join method makes sense, because we don't want to block an actor thread. The exception on the runOnCompletion method makes no sense to me at all. The future could remember the actor which created it, and the thread the actor runs on. And then it could schedule the follow up task with that actor on the right thread. This way you could use runOnCompletion from any thread. Then you would be left with two categories of code a) code that must not be called from an actor thread, and b) code that doesn't care

@pihme
Copy link
Contributor

pihme commented Apr 26, 2022

Naming of Classes

This is where I always get confused. Which is the actor? The class Actor and its subclasses? The field actor in class Actor which is actually an instance of ActorControl? The lambda which is scheduled by the Actor subclass on the ActorControl instance? The ActorJob that is wrapped around the lambda?

I never got an understanding what these things do and mean.

In particular the subclasses of Actor are kind of suspect to me. I would like to see them split up into an actual actor class, which is a proxy and then forwards everything to the actor implementation class.

public class FooActor implements Actor {
   FooImpl delegate;

   public foo() {
     actor.submit( delegate::foo );
  }

Then ti would be easier to either have a code checklist, or a test to make sure that all calls are scheduled onto the actor, that field modification only happen inside the impl calss and so on. Or you could also use dynamic proxies and reflection, or code generation to generate the actor proxies and thus ensure that this is always the case.

@pihme
Copy link
Contributor

pihme commented Apr 26, 2022

Lack of visibility / introspection

When you submit a task, most often you submit it into a black box. You know nothing about it. I once spent 4hours debugging why the task I submitted didn't get executed. Turned out, the actor was not started. An exception would have been helpful. But usually you don't get any response. The best way to get any response out of it is to use io.camunda.zeebe.util.sched.ActorControl#call(java.util.concurrent.Callable<T>) which gives you a future. And this future might be completed exceptionally by the scheduler io.camunda.zeebe.util.sched.ActorJob#failFuture(java.lang.String) And this exception could tell you then something helpful, e.g. that the actor was closed.

But if you use call(Runnable r) or submit(Runnable r) you are out of luck. There is no way to get hold of the exception by the caller.

_ I think the same is true when using runOnCompletion. There is a future passed around, but I doubt there is a guarantee that the follow up task will be called, e.g. when the actor is shutting down._

And I am talking only about the minimum here. What I would really like is to get back a handle. So that I can cancel the task, or inquire whether it has already run - stuff like that. For example when we change RAFT role, I would like to be able to figure out whether any tasks that I submitted for the old role in the past have run, or can be cancelled.

This might also help in making the whole thing more testable. One of the most frustrating things if something does not happen. And the you need to figure out why the thing is not happening, and at what point the chain of - essentially callbacks - is broken.

@pihme
Copy link
Contributor

pihme commented Apr 26, 2022

Boundary between actor and non-actor world

One thing I glossed over when discussing the perils of some methods in ActorFuture is the boundary between the different categories.

There is a bit of an undocumented convention that if a class extends Actor, then all public void methods essentially schedule some task on the actor control.

I am talking about this code:

  public final void onRequest(
      final ServerOutput serverOutput,
      final int partitionId,
      final long requestId,
      final DirectBuffer buffer,
      final int offset,
      final int length) {
    actor.submit(() -> handleRequest(serverOutput, partitionId, requestId, buffer, offset, length));
  }

So this is literally the boundary between the actor world and either the non actor world, or the world of another actor. But just looking at the methods there is no way to see this. I also don't know how we could ever check this automatically. So it really is down to all developers doing the right thing.

@pihme
Copy link
Contributor

pihme commented Apr 26, 2022

Lifecycle

The lifecycle of an actor is only visible to the actor itself, but not to the outside word. From the outside I cannot ask it, whether it has started, or whether it is shutting down.

Even the actor itself has little control over its own lifecycle. The actor can schedule tasks in a given phase. But it cannot pause /resume its own lifecycle.

Let's say you wanted to implement an actor, which loads some data from an URL during startup. Once the data is loaded, the actor is started. I would assume one could implement something like this:

public void onActorStarting() {
   httpClient.getAsync(url, 
                         response -> {
                            if (response.statusCode== 200) {
                                this.switchToStarted();
...

But this is not possible. You have to implement this kind of logic as a blocking call. Which deceives the concept of actor framework.

I also wonder about the distinction of IO vs. CPU threads and why we are not using non-blocking IO throughout the application. But that is another matter entirely

@pihme
Copy link
Contributor

pihme commented Apr 26, 2022

Supervisor chain and exception handling

What happens when an exception is thrown? In some cases it bubbles up to the Actor and the actor just stops. And then you might find a log message somewhere. In the worst case nothing happens, follow up tasks are not called and so futures don't get completed and then the application is just stuck waiting for a future, which will never be called.

It is hard to put my finger to when what happens. Sometimes exceptions are caught, sometimes they are forwarded to follow up tasks, sometimes they bubble up to the actor scheduler. But I also had instances where an exception was thrown somewhere and there was no way to catch the exception from the application.

I get a bit paranoid and created io.camunda.zeebe.broker.bootstrap.AbstractBrokerStartupStep#forwardExceptions and io.camunda.zeebe.broker.bootstrap.AbstractBrokerStartupStep#createFutureAndRun. And if you use these helper methods consistently, then you can make sure that you will receive the exception. But it is so easy to miss a code block and whoops your exception goes somewhere else.

And then you write code like this:

final var closeFuture = diskSpaceUsageMonitor.closeAsync();
      concurrencyControl.runOnCompletion(
          closeFuture,
          (ok, error) -> {
            if (error != null) {
              shutdownFuture.completeExceptionally(error);
              return;
            }

            forwardExceptions(
                () ->
                    concurrencyControl.run(
                        () ->
                            forwardExceptions(
                                () -> {
                                  brokerShutdownContext.setDiskSpaceUsageMonitor(null);
                                  shutdownFuture.complete(brokerShutdownContext);
                                },
                                shutdownFuture)),
                shutdownFuture);
          });

@pihme
Copy link
Contributor

pihme commented Apr 26, 2022

@oleschoenburg End of rant. Let me know if you have questions. And good luck!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) kind/documentation Categorizes an issue or PR as documentation related kind/research Marks an issue as part of a research or investigation
Projects
None yet
Development

No branches or pull requests

2 participants