-
Notifications
You must be signed in to change notification settings - Fork 562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Overview of ActorScheduler
limitations
#9183
Comments
So here my comments. These comments are mostly from the perspective of someone trying to use the actor scheduler, or to test code using the actor scheduler. I might also be throwing actor scheduler, actors, actor futures and other elements into one big basket. Also many of the complaints may become totally irrelevant if we move to a third party framework like Akka or something else. Anything in italic is strongly opinionated. ActorFutureLet's start with a quiz question. If you want to call Answer is b. It must be called from a non-actor thread. This is, because close calls
There are at least two methods on ActorFuture which are problematic:
This splits the code base into three categories:
The categories 1 and 2 are transitive. They apply up and down the entire call chain. So the one call to ActorFuture.join() in the call tree of EmbeddedGatewayService#close means that you cannot call call from close a method which needs to be on an actor thread. This makes code incredibly hard to change. Because whenever you make any change, you need to check whether you don't change the category of the call hierarchies that are effected by the change. If you want to join on an actor, then suddenly the whole call hierachy falls into category 1. If you want to call another mehod, you first need to check in which category the code you are calling is in. And there is no naming convention, no automatic checks that can do it for you. All you can do is run the code and wait for the exceptions. Personally, I think the ActorFuture is a huge trap. I think it should not implement Future, because it behaves so very differently. You cannot compose it with ordinary futures. I also think it should be two classes. I.e. a composable future which has runOnCompletion, but no join. And a joinable future which has just join, but no compose. And then the composable future could have a method toJoinableFuture to convert to it. Then, I think you at least have a chance by looking at the signature to see what the code expects in terms of runtime thread. The exception on the join method makes sense, because we don't want to block an actor thread. The exception on the runOnCompletion method makes no sense to me at all. The future could remember the actor which created it, and the thread the actor runs on. And then it could schedule the follow up task with that actor on the right thread. This way you could use runOnCompletion from any thread. Then you would be left with two categories of code a) code that must not be called from an actor thread, and b) code that doesn't care |
Naming of ClassesThis is where I always get confused. Which is the actor? The class Actor and its subclasses? The field actor in class Actor which is actually an instance of ActorControl? The lambda which is scheduled by the Actor subclass on the ActorControl instance? The ActorJob that is wrapped around the lambda? I never got an understanding what these things do and mean. In particular the subclasses of Actor are kind of suspect to me. I would like to see them split up into an actual actor class, which is a proxy and then forwards everything to the actor implementation class.
Then ti would be easier to either have a code checklist, or a test to make sure that all calls are scheduled onto the actor, that field modification only happen inside the impl calss and so on. Or you could also use dynamic proxies and reflection, or code generation to generate the actor proxies and thus ensure that this is always the case. |
Lack of visibility / introspectionWhen you submit a task, most often you submit it into a black box. You know nothing about it. I once spent 4hours debugging why the task I submitted didn't get executed. Turned out, the actor was not started. An exception would have been helpful. But usually you don't get any response. The best way to get any response out of it is to use But if you use _ I think the same is true when using And I am talking only about the minimum here. What I would really like is to get back a handle. So that I can cancel the task, or inquire whether it has already run - stuff like that. For example when we change RAFT role, I would like to be able to figure out whether any tasks that I submitted for the old role in the past have run, or can be cancelled. This might also help in making the whole thing more testable. One of the most frustrating things if something does not happen. And the you need to figure out why the thing is not happening, and at what point the chain of - essentially callbacks - is broken. |
Boundary between actor and non-actor worldOne thing I glossed over when discussing the perils of some methods in ActorFuture is the boundary between the different categories. There is a bit of an undocumented convention that if a class extends Actor, then all public void methods essentially schedule some task on the actor control. I am talking about this code:
So this is literally the boundary between the actor world and either the non actor world, or the world of another actor. But just looking at the methods there is no way to see this. I also don't know how we could ever check this automatically. So it really is down to all developers doing the right thing. |
LifecycleThe lifecycle of an actor is only visible to the actor itself, but not to the outside word. From the outside I cannot ask it, whether it has started, or whether it is shutting down. Even the actor itself has little control over its own lifecycle. The actor can schedule tasks in a given phase. But it cannot pause /resume its own lifecycle. Let's say you wanted to implement an actor, which loads some data from an URL during startup. Once the data is loaded, the actor is started. I would assume one could implement something like this:
But this is not possible. You have to implement this kind of logic as a blocking call. Which deceives the concept of actor framework. I also wonder about the distinction of IO vs. CPU threads and why we are not using non-blocking IO throughout the application. But that is another matter entirely |
Supervisor chain and exception handlingWhat happens when an exception is thrown? In some cases it bubbles up to the Actor and the actor just stops. And then you might find a log message somewhere. In the worst case nothing happens, follow up tasks are not called and so futures don't get completed and then the application is just stuck waiting for a future, which will never be called. It is hard to put my finger to when what happens. Sometimes exceptions are caught, sometimes they are forwarded to follow up tasks, sometimes they bubble up to the actor scheduler. But I also had instances where an exception was thrown somewhere and there was no way to catch the exception from the application. I get a bit paranoid and created And then you write code like this:
|
@oleschoenburg End of rant. Let me know if you have questions. And good luck! |
Let's go through the *bilities:
Usability
Perhaps the most painful limitation of the actor scheduler is it's usability. For new contributors, using the actor scheduler correctly is difficult. Contributing factors are confusing naming of classes1 and methods, unclear semantics and confusing interactions between different components.
Even for more experienced contributors it is often difficult to use the actor scheduler. Here, the lacking interoperability with regular Java futures can be tricky to work around. Since code that runs within an actor thread is not clearly distinguishable from non-actor code, it is often unclear which methods should or must be used.2
Actors should implement a public interface that only schedules work that is then implemented in private methods 3. However, there is no way to enforce that design and it makes testing difficult since tests then either have to interact with the actor scheduler or have access to private methods. Test interactions with the actor scheduler are difficult since the only control is a
workUntilDone
method.Some, perhaps most, actors are conceptually state machines. This is not something that is easily expressible when using the actor scheduler. Instead, actors usually rely on asynchronous callbacks to serialize the units of work that are necessary to handle a request. When other requests come in, the actor must make sure that it doesn't start handling this request if it could cause issues.
An example for this is the snapshot director which must only take one concurrent snapshot at a time. As a state machine this would be easily implemented by not accepting new snapshot requests in certain states. Instead, the snapshot director has to maintain local state and, for every unit of work, check the local state and decide if and how it can progress. If we factor in error handling, the number of conditions explode and it becomes easy to introduce bugs.
Related to state machines are lifecycles. The actor scheduler allows actors to implement hooks. These are most commonly used to acquire resources during startup and release resources when shutting down. What is missing, is a standardized way to query or listen to other actor's lifecycle. This is usually implemented ad-hoc via listeners. 4
There is more to be said with regards to error handling and supervision, already mentioned here: #9183 (comment)
The actor scheduler is also a one-size-fits-all solution. Unlike full actor frameworks or more custom approaches, there is little room for configuring aspects such as job ordering, latency demands, priorities etc. on a per-actor basis.
Maintainability
Broadly speaking, there is consensus that the actor scheduler in it's current state is not maintainable. We do have some evidence of that, both in bugs that have been unnoticed for multiple years, as well as the number of contributions made. Of course both can be attributed to the apparent reliability of the actor scheduler. After all, if it works reliably, why make changes? To some extend this might hold true but seeing as there are so many pain points relating to usability that haven't been fixed, at least part of the reason must be that no one feels comfortable making changes in the actor scheduler.
I believe there are three contributing factors that make the actor scheduler difficult to maintain.
First, there are many interactions between different parts of the actor scheduler. Perhaps this is best shown by this diagram, first produced by @Zelldon :
Second, some parts of the actor scheduler seem inherently complex. Good examples for that are the pervasive use of unsafe and low-level details such as padding.
Third, the sometimes confusing naming and lack of documentation, which already contributes to usability issues, also complicates the maintenance.
Observability
The actor scheduler is largely a black box with little to no visibility. As @saig0 rightfully pointed out, this is often not an issue as the actor scheduler seems to work reliably. However, when we encounter issues, this lack of visibility hinders troubleshooting.
Potentially the most useful feature, distributed tracing, is not included in the actor scheduler and would have to be implemented ourselves. Of course this play right into the maintainability issues.
Reliability
As mentioned before, reliability of the actor scheduler itself is apparently quite high. Problems arise mainly due to "user errors" but rather than giving up we should thrive for a system that makes user errors less likely to occur.
There are multiple footguns, all described well by @pihme already2.
Error handling deserves special mention here 5. Since exceptions are often handled ad-hoc and inconsistently between each actor, there is always the risk of swallowing important exceptions.
Health monitoring, let alone a full supervision, is difficult at best. Currently this is implemented by listeners. There is no designated mechanism for cascading failures.
While individual actors can be named, there is nothing that prevents from having multiple instances of the same actor alive. In the past, this has led to serious issues 6.
Footnotes
https://github.com/camunda/zeebe/issues/9183#issuecomment-1110068698 ↩
https://github.com/camunda/zeebe/issues/9183#issuecomment-1110051630 ↩ ↩2
https://github.com/camunda/zeebe/issues/9183#issuecomment-1110123588 ↩
https://github.com/camunda/zeebe/issues/9183#issuecomment-1110133575 ↩
https://github.com/camunda/zeebe/issues/9183#issuecomment-1110142104 ↩
https://github.com/camunda/zeebe/issues/8044 ↩
The text was updated successfully, but these errors were encountered: