Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Typed child actors are not stopped when parent actor is restarted by supervision strategy #25556

Closed
svroonland opened this issue Aug 31, 2018 · 22 comments
Assignees
Labels
Milestone

Comments

@svroonland
Copy link

I'm spawning a child in a typed actor and using a restart supervision strategy on the parent. When the parent fails and is restarted, an exception "not unique actor name [child actor name here]" is thrown.

Apparently the child is not stopped as part of the restart procedure. This is unlike with untyped actors where, when an actor is restarted, its children are stopped and recreated. For typed actors I would also expect behavior as documented here for untyped actors: https://doc.akka.io/docs/akka/2.5/general/supervision.html#what-restarting-means

(I can provide some test cases for untyped and typed on request)

@patriknw
Copy link
Member

Thanks for reporting

@patriknw patriknw added 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted t:typed 3 - in progress Someone is working on this ticket and removed 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted labels Aug 31, 2018
@patriknw patriknw self-assigned this Sep 5, 2018
@patriknw
Copy link
Member

patriknw commented Sep 5, 2018

@svroonland I think this is actually expected behavior. I found this comment in a test:

// TODO document this difference compared to classic actors, and that
// children can be stopped if needed in PreRestart

Unless there are objections I will continue this by adding that documentation, including an example.

@svroonland
Copy link
Author

Not sure I find it intuitive behavior to not restart the children, but I trust you have broader considerations of course.

@patriknw
Copy link
Member

patriknw commented Sep 6, 2018

One drawback with always stopping the children is that any direct communication with the children from the outside is stopped. The actor references are obsolete and anyone using them must watch and re-establish new ones, or all such interaction must go via the parent.

The ActorRef of the parent is still valid after a restart, but not the ActorRefs of the children. Might be a surprise for many?

@michkhol
Copy link
Contributor

michkhol commented Sep 12, 2018

I see another problem: if the child actor is stopped in the PreStart message handler, the parent actor apparently does not wait for the stop to finish. If upon restart the parent tries to recreate the child actor with the same name fast enough, you still get the exception.

@patriknw
Copy link
Member

Have you verified that or is it a theory?
Stopping from parent should free the name immediately and new child with same name should be possible to create.

@michkhol
Copy link
Contributor

I have this problem in my code currently and I believe I can come up with a code sample shortly.

@michkhol
Copy link
Contributor

michkhol commented Sep 12, 2018

The supervisor is supposed to do 10 retries before stopping, but I'm getting an exception at the first try.

import akka.actor.ActorSystem
import akka.actor.typed.{ActorRef, Behavior, PostStop, PreRestart, SupervisorStrategy, Terminated}
import akka.actor.typed.scaladsl.Behaviors

import scala.concurrent.duration._
import akka.actor.typed.scaladsl.adapter._

object TypedSupervisedBug {
  import BadActor._

  implicit val actorSystem = ActorSystem("actorsystem")
  implicit val ec = actorSystem.dispatcher

  val start: Behavior[Msg] = Behaviors.setup { ctx =>
    val supervised: Behavior[Msg] = Behaviors.supervise(BadActor.start)
      .onFailure[IndexOutOfBoundsException](SupervisorStrategy.restartWithLimit(maxNrOfRetries = 10, withinTimeRange = 10.minutes))
    val badActor = ctx.spawn(supervised, "badActor")
    ctx.watch(badActor)
    active
  }

  val active: Behavior[Msg] = Behaviors.receive[Msg] { (ctx, msg) =>
    Behaviors.same
  }.receiveSignal {
    case (_, Terminated(w)) =>
      Behaviors.stopped
    case (_, PostStop) =>
      System.err.println("System has been terminated.")
      actorSystem.terminate().foreach(_ => System.exit(1))
      Behaviors.same
  }

  def main(args: Array[String]): Unit = {
     actorSystem.spawn(TypedSupervisedBug.start, "supevisor")
  }
}

object BadActor {
  trait Msg
  case object TestMsg extends Msg

  val start: Behavior[Msg] = Behaviors.setup { ctx =>
    val child = ctx.spawn(PoorChild.active, "poorchild")
//    ctx.watch(child)
    ctx.self ! TestMsg
    active(child)
  }

  def active(childRef: ActorRef[Msg]): Behavior[Msg] = Behaviors.receive[Msg] { (ctx, msg) =>
    throw new IndexOutOfBoundsException("Test exception")
  }.receiveSignal {
    case (ctx, PreRestart) =>
      System.err.println("BadActor is being restarted.")
      ctx.stop(childRef)
      Behaviors.same
  }
}

object PoorChild {
  import BadActor._

  val active: Behavior[Msg] = Behaviors.receive[Msg] { (ctx, msg) =>
    Behaviors.same
  }
}

@patriknw
Copy link
Member

Ah, I mixed up things. It's when Terminated is received in the parent that the name is free to be used again. Thanks.

Hmm, stopping children and then starting new with same name for a restart might still be something we want to support. I have to think about this.

@michkhol
Copy link
Contributor

Thanks! I would really like to see it implemented as in untyped actors, perhaps with an option not to stop children on parent restart. Otherwise while migrating code to typed actors, one would have an upleasant surprise.

@patriknw
Copy link
Member

I have analyzed how it's implemented in untyped. It's documented here: https://doc.akka.io/docs/akka/current/general/supervision.html#what-restarting-means

Here is a summary of the interaction. Given an actor hierarchy of grandparent, parent, child where an exception is thrown in parent and parent is restarted.

  1. parent: FaultHandling.handleInvokeFailure
    • suspend
    • parent ! Failed(self)
  2. grandparent: FaultHandling.handleFailure
    • restartChild
  3. parent: receive Recreate sys msg => FaultHandling.faultRecreate
    • preRestart
    • unwatch and stop children
  4. child: receive Terminate
  5. parent: DeathWatch.watchedActorTerminated
    • FaultHandling.handleChildTerminated
    • FaultHandling.finishRecreate
    • create freshActor instance
    • restart surviving children

In Typed the grandparent isn't involved since the supervision strategy is taking care of exception in the failing actor itself. We must probably hook into some of these steps from the supervision strategy (internal.Restarter), such as suspend and faultRecreate. Then the other steps will follow automatically.

We also have to override preRestart to not stop children by default , since that is supposed to be done explicitly from PreRestart signal if desired.

One question is if we want to keep the restart of surviving children or if we should keep them untouched?

I'd like @rkuhn 's advice on this issue so that we don't have an accidental misunderstanding of desired semantics.

@michkhol
Copy link
Contributor

michkhol commented Sep 14, 2018

We also have to override preRestart to not stop children by default , since that is supposed to be done explicitly from PreRestart signal if desired.

Why change the already expected behavior (from untyped)? If the user does nothing, the children should be stopped by default. I presume, in most cases the parent fully controls the life-cycle of its children, the explicit stop would be semantically redundant. Optionally the user could signal that the children should not be stopped, but in my opinion it is really optional.

@patriknw
Copy link
Member

I tried a few things. It's not that easy to hook into untyped stuff since we implement supervision and restart quite differently.

Then I started to think about implementing the "wait for children to stop" with ordinary typed things in the Restarter. Would have to use StashBuffer instead of suspend / resume.

An annoying detail is that I can't know which children were stopped in the PreRestart and which were kept so can't know when all expected Terminated signals have been received. One way around that could be to add extra bookkeeping for this specific thing in the typed.ActorContextImpl

@chbatey wdyt?

@rkuhn
Copy link
Contributor

rkuhn commented Sep 18, 2018

Instead of implementing the same semantics, we could also think about whether something else would be more appropriate, also given the theme of simplifying the actor lifecycle by switching to Behavior decorators and making the local decision without inspecting the exception itself.

One thing to keep in mind is that this is a library facility that could as well live in user code, no special magic shall be included—otherwise we break the premise of Behaviors being pure functions of the incoming message and the context (declaring method invocations on the context as permitted effects).

So we could add a second Restarter (or differentiate by a construction parameter) that terminates all child actors, switches to stashing behavior, awaits all Terminated messages, then injects the stash into the initial behavior. The simplification here being that either all or none of the child actors are kept. Users who want something else can copy the code and modify it according to their needs.

@patriknw
Copy link
Member

I agree that it's probably better to do this in Typed land, and starting with an option to stop all children is probably covering what most will ever need.

@patriknw
Copy link
Member

Closed the docs PR. When implementing this for real some of the samples from #25570 might be used and updated.

@chbatey
Copy link
Member

chbatey commented Sep 19, 2018

Sorry this notification got lost in the PR rebuild storm.
+1 to all or nothing to keep it simple.
I've nearly rewritten the Restarters as BehaviorInterceptors so we may be in for some fun conflicts.

@ktoso
Copy link
Member

ktoso commented Sep 19, 2018

For the record: also +1 for keeping this in "typed land" :)

@patriknw
Copy link
Member

rewritten the Restarters as BehaviorInterceptors

I know. I was only trying things out. We'll do the real thing after your refactoring

@patriknw patriknw removed their assignment Nov 26, 2018
@patriknw patriknw added the 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted label Nov 26, 2018
@patriknw patriknw removed the 3 - in progress Someone is working on this ticket label Nov 26, 2018
@helena helena self-assigned this Nov 26, 2018
@helena helena added 3 - in progress Someone is working on this ticket and removed 3 - in progress Someone is working on this ticket labels Nov 26, 2018
@helena helena removed their assignment Nov 28, 2018
@patriknw patriknw self-assigned this Dec 12, 2018
@patriknw patriknw added 3 - in progress Someone is working on this ticket and removed 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted labels Dec 12, 2018
patriknw added a commit that referenced this issue Dec 19, 2018
* stash messages and signals while waiting for children to be stopped
* handle watch of other actors
* exception from unstash
* exception from first setup
* merge RestartSupervisor and BackoffSupervisor
* remove unused PreStart
* docs
patriknw added a commit that referenced this issue Dec 19, 2018
* stash messages and signals while waiting for children to be stopped
* handle watch of other actors
* exception from unstash
* exception from first setup
* merge RestartSupervisor and BackoffSupervisor
* API change: restartWithLimit => restart.withLimit
* remove unused PreStart
* docs
patriknw added a commit that referenced this issue Dec 20, 2018
* stash messages and signals while waiting for children to be stopped
* handle watch of other actors
* exception from unstash
* exception from first setup
* merge RestartSupervisor and BackoffSupervisor
* API change: restartWithLimit => restart.withLimit
* remove unused PreStart
* docs
patriknw added a commit that referenced this issue Jan 15, 2019
* stash messages and signals while waiting for children to be stopped
* handle watch of other actors
* exception from unstash
* exception from first setup
* merge RestartSupervisor and BackoffSupervisor
* API change: restartWithLimit => restart.withLimit
* remove unused PreStart
* docs
patriknw added a commit that referenced this issue Jan 17, 2019
* stash messages and signals while waiting for children to be stopped
* handle watch of other actors
* exception from unstash
* exception from first setup
* merge RestartSupervisor and BackoffSupervisor
* API change: restartWithLimit => restart.withLimit
* remove unused PreStart
* docs
* move BubblingSample to separate class
* fix: fail after more than limit in restart.withLimit when deferred factory throws
* match case RestartOrBackoff instead
@patriknw patriknw removed the 3 - in progress Someone is working on this ticket label Jan 17, 2019
@patriknw patriknw added this to the 2.5.20 milestone Jan 17, 2019
@olijaun
Copy link

olijaun commented Oct 24, 2019

I'm a newbie to AKKA and I just investigated a problem that is related to this issue. I finally solved it by using SupervisorStrategy.restart().withStopChildren(true)

I think you really should update the docs: The "What Restarting Means" section (https://doc.akka.io/docs/akka/current/general/supervision.html) is in the Chapter "General Concepts". People starting with Akka will probably read this (regardless of whether they use "untyped" or "typed"). The documentation says: "call the old instance’s preRestart hook (defaults to sending termination requests to all children and calling postStop)"

Regards

@patriknw
Copy link
Member

Do you see the same problem in the latest documentation: https://doc.akka.io/docs/akka/2.6/general/supervision.html

Since it seems that you are developing something new I'd recommend that you use version 2.6.0-RC1.

@olijaun
Copy link

olijaun commented Oct 24, 2019

I won't switch to RC1 yet because I'm not sure if we will go into production before akka 2.6 :-) But I will at least check the Doc for 2.6 from now on. The respective chapter is not misleading anymore. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants