Coordinated Shutdown (improve cluster leaving) #21537

Closed
patriknw opened this issue Sep 23, 2016 · 3 comments


patriknw commented Sep 23, 2016

There are a lot of things that can be improved around graceful cluster leaving.

  • the process of leaving and then shutting down the actor system and the JVM is complicated
  • it becomes even more complicated when cluster sharding and cluster singleton should also be stopped gracefully
  • it's rather slow
  • the use of the failure detector's unreachability marker for the final steps before removal is questionable

Somewhat related tickets:
#18373
#21298
#21521

patriknw added the "1 - triaged" and "t:cluster" labels Sep 23, 2016
patriknw changed the title from "improve cluster leaving" to "Coordinated Shutdown (improve cluster leaving)" Nov 25, 2016

patriknw commented Nov 25, 2016

We discussed a solution for this. Introduce some kind of CoordinatedShutdown manager in akka-actor where different modules can register their "shutdown hook" (might need both graceful and forceful hooks). We need dependencies between modules so that the sequential shutdown steps run in the right order (probably just with string identifiers in config). Might also need different phases?
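To make the idea concrete, here is a minimal sketch of such a manager in plain Java, assuming phases are identified by strings and each phase lists the phases it depends on. The class `ShutdownPhases`, its methods, and the phase names used with it are hypothetical illustrations and not the Akka API; tasks here are plain synchronous `Runnable`s.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a phase registry; not the Akka API. */
final class ShutdownPhases {
    // phase name -> names of the phases that must run before it
    private final Map<String, List<String>> dependsOn = new LinkedHashMap<>();
    // phase name -> tasks registered by the different modules
    private final Map<String, List<Runnable>> tasks = new LinkedHashMap<>();

    void addPhase(String name, String... after) {
        dependsOn.put(name, List.of(after));
        tasks.putIfAbsent(name, new ArrayList<>());
    }

    void addTask(String phase, Runnable task) {
        List<Runnable> registered = tasks.get(phase);
        if (registered == null) {
            throw new IllegalArgumentException("Unknown phase: " + phase);
        }
        registered.add(task);
    }

    /** Depth-first topological sort; rejects cycles and unknown dependencies. */
    List<String> orderedPhases() {
        List<String> result = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String phase : dependsOn.keySet()) {
            visit(phase, done, inProgress, result);
        }
        return result;
    }

    private void visit(String phase, Set<String> done, Set<String> inProgress,
                       List<String> result) {
        if (done.contains(phase)) return;
        if (!dependsOn.containsKey(phase)) {
            throw new IllegalArgumentException("Unknown phase: " + phase);
        }
        if (!inProgress.add(phase)) {
            throw new IllegalStateException("Cycle at phase: " + phase);
        }
        // Every dependency is emitted before the phase itself.
        for (String dependency : dependsOn.get(phase)) {
            visit(dependency, done, inProgress, result);
        }
        inProgress.remove(phase);
        done.add(phase);
        result.add(phase);
    }

    /** Runs all tasks, phase by phase, in dependency order. */
    void run() {
        for (String phase : orderedPhases()) {
            for (Runnable task : tasks.get(phase)) {
                task.run();
            }
        }
    }
}
```

A module would call something like `addTask("cluster-leave", ...)` at startup, and the order in which phases were declared would not matter because the dependency list alone decides the run order.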


marekzebrowski commented Nov 28, 2016

Leaving the cluster is indeed a complex process. One issue is when to trigger leaving; the second is what needs to be done and in what order.
Our approach so far:
When - Runtime.getRuntime.addShutdownHook
Order:

  1. akkaHttp server
  2. custom long-running flows
  3. shardRegions + other singletons
  4. leave cluster - waiting for a successful MemberExited(addr) message
  5. shutdown actor system

We also have a special case that is useful for development and can be considered a hack: if there is one server or none on the seeds list, we don't wait for anything and just shut down immediately. The rationale is that a one-node setup means development or a test machine, so we don't need graceful shutdown in that case. As we routinely do rolling restarts, steps like gracefully shutting down shard regions are important - if they fail, we need to restart the whole cluster.

Unfortunately, Runtime.getRuntime.addShutdownHook stopped working when we tried to use Artery remoting in Akka 2.4.14, because Artery shuts down before the other shutdown hooks have completed.
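The ordered steps above can be sketched as a single JVM shutdown hook that runs named steps one after another. This is a hypothetical illustration in plain Java, not our actual code: `OrderedShutdown` and the step names are made up, and the per-step timeout is an added assumption so that a stuck step cannot block the rest. Note that the JVM starts all registered shutdown hooks concurrently and in unspecified order, so a hook like this only orders its own steps; it cannot make them finish before a hook registered by another library, which is the problem described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: ordered shutdown steps with a per-step timeout. */
final class OrderedShutdown {
    // LinkedHashMap keeps the steps in registration order.
    private final Map<String, Runnable> steps = new LinkedHashMap<>();
    private final long stepTimeoutMillis;

    OrderedShutdown(long stepTimeoutMillis) {
        this.stepTimeoutMillis = stepTimeoutMillis;
    }

    OrderedShutdown step(String name, Runnable action) {
        steps.put(name, action);
        return this;
    }

    /** Runs the steps in order and returns the names of those that completed in time. */
    List<String> runAll() {
        // Daemon threads, one per step, so a stuck step cannot keep the JVM alive
        // or delay the steps that follow it.
        ExecutorService executor = Executors.newCachedThreadPool(runnable -> {
            Thread thread = new Thread(runnable, "shutdown-step");
            thread.setDaemon(true);
            return thread;
        });
        List<String> completed = new ArrayList<>();
        try {
            for (Map.Entry<String, Runnable> step : steps.entrySet()) {
                Future<?> result = executor.submit(step.getValue());
                try {
                    result.get(stepTimeoutMillis, TimeUnit.MILLISECONDS);
                    completed.add(step.getKey());
                } catch (TimeoutException | ExecutionException e) {
                    // Give up on this step and continue with the next one.
                    result.cancel(true);
                    System.err.println("Shutdown step '" + step.getKey() + "' failed: " + e);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        } finally {
            executor.shutdownNow();
        }
        return completed;
    }

    /** Registers the whole sequence as one JVM shutdown hook. */
    void installAsJvmHook() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::runAll, "ordered-shutdown"));
    }
}
```

Usage would look like `new OrderedShutdown(10_000).step("http", ...).step("leave-cluster", ...).installAsJvmHook()`, with each step blocking until its part of the shutdown is finished.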

patriknw self-assigned this Dec 1, 2016
patriknw added the "3 - in progress" label and removed the "1 - triaged" label Dec 1, 2016
patriknw added a commit that referenced this issue Dec 2, 2016
* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown
patriknw added a commit that referenced this issue Dec 7, 2016
* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown
patriknw added a commit that referenced this issue Dec 9, 2016
* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown

patriknw commented Dec 9, 2016

@marekzebrowski @guidomedina I have been working on PR #21930. Most things are in place (a few things are not done yet). I think it will handle all of the things you have had problems with automatically, and you can hook in your own tasks in the shutdown phases.

It also has support for adding JVM shutdown hooks that are run before the Artery shutdown hook. The coordinated shutdown is started automatically by such a shutdown hook, so you probably don't even have to add one yourself.

patriknw added a commit that referenced this issue Dec 15, 2016
* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown
patriknw added a commit that referenced this issue Jan 4, 2017
* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown
patriknw added a commit that referenced this issue Jan 6, 2017
* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown
patriknw added a commit that referenced this issue Jan 16, 2017
* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown
* phase config obj with depends-on list
* integrate graceful leaving of sharding in coordinated shutdown
* add timeout and recover
* add some missing artery ports to tests
* leave via CoordinatedShutdown.run
* optionally exit-jvm in last phase
* run via jvm shutdown hook
* send ExitingConfirmed to leader before shutdown of Exiting
  to not have to wait for failure detector to mark it as
  unreachable before removing
* the unreachable signal is still kept as a safeguard if
  message is lost or leader dies
* PhaseClusterExiting vs MemberExited in ClusterSingletonManager
* terminate ActorSystem when cluster shutdown (via Down)
* add more predefined and custom phases
* reference documentation
* migration guide
* problem when the leader order was sys2, sys1, sys3,
  then sys3 could not perform its duties and move Leaving sys1 to
  Exiting because it was observing sys1 as unreachable
* exclude Leaving with exitingConfirmed from convergence condition
patriknw added a commit that referenced this issue Jan 16, 2017
patriknw removed the "3 - in progress" label Jan 16, 2017
patriknw added this to the 2.5.0 milestone Jan 16, 2017