Coordinated Shutdown (improve cluster leaving) #21537

patriknw · 2016-09-23T14:09:04Z

There are a lot of things that can be improved around graceful cluster leaving.

* the process of leaving and finally shutting down the actor system and JVM is complicated
* it becomes even more complicated when cluster sharding and cluster singleton should also be gracefully stopped
* it's rather slow
* the usage of the failure detector unreachability marker for the final steps before removing is questionable

Somewhat related tickets:
#18373
#21298
#21521

patriknw · 2016-11-25T16:31:55Z

We discussed a solution for this. Introduce some kind of CoordinatedShutdown manager in akka-actor where different modules can register their "shutdown hook" (might need both graceful and forceful hooks). We need dependencies between modules so that sequential shutdown steps run in the right order (probably just with string identifiers in config). Might also need different phases?
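To make the idea concrete, here is a minimal sketch in plain Java of a registry where modules attach tasks to named phases and phases declare which other phases they depend on. The class name `ShutdownRegistry`, its methods, and the phase names used below are all hypothetical and only illustrate the proposal; this is not the API from any Akka release.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a shutdown manager with phase dependencies. */
public class ShutdownRegistry {
    // phase name -> names of the phases that must run before it
    private final Map<String, List<String>> dependsOn = new LinkedHashMap<>();
    // phase name -> tasks ("shutdown hooks") registered by modules
    private final Map<String, List<Runnable>> tasks = new LinkedHashMap<>();

    /** Declares a phase identified by a string, plus the phases it runs after. */
    public void addPhase(String phase, String... after) {
        dependsOn.put(phase, List.of(after));
        tasks.putIfAbsent(phase, new ArrayList<>());
    }

    /** Registers a task in an already declared phase. */
    public void addTask(String phase, Runnable task) {
        if (!dependsOn.containsKey(phase)) {
            throw new IllegalArgumentException("Unknown phase: " + phase);
        }
        tasks.get(phase).add(task);
    }

    /** Returns the phases ordered so that each one follows its dependencies. */
    public List<String> orderedPhases() {
        List<String> result = new ArrayList<>();
        Set<String> done = new LinkedHashSet<>();
        Set<String> inProgress = new LinkedHashSet<>();
        for (String phase : dependsOn.keySet()) {
            visit(phase, done, inProgress, result);
        }
        return result;
    }

    // Depth-first topological sort; fails on unknown phases and on cycles.
    private void visit(String phase, Set<String> done, Set<String> inProgress,
                       List<String> result) {
        if (done.contains(phase)) return;
        if (!dependsOn.containsKey(phase)) {
            throw new IllegalArgumentException("Unknown phase: " + phase);
        }
        if (!inProgress.add(phase)) {
            throw new IllegalStateException("Cycle at phase: " + phase);
        }
        for (String dep : dependsOn.get(phase)) {
            visit(dep, done, inProgress, result);
        }
        inProgress.remove(phase);
        done.add(phase);
        result.add(phase);
    }

    /** Runs every task, phase by phase, in dependency order. */
    public void run() {
        for (String phase : orderedPhases()) {
            for (Runnable task : tasks.get(phase)) {
                task.run();
            }
        }
    }
}
```

With phases declared as `service-stop`, then `cluster-leave` depending on it, then `actor-system-terminate` depending on `cluster-leave`, the tasks run in that order regardless of the order in which the phases were declared. A real implementation would need asynchronous tasks; this sketch runs them synchronously to keep the ordering logic visible.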

marekzebrowski · 2016-11-28T14:54:51Z

Leaving the cluster is indeed a complex process. One issue is when to trigger leaving, and the second is what needs to be done and in what order.
Our approach so far:
When: Runtime.getRuntime.addShutdownHook
Order:

1. akkaHttp server
2. custom long-running flows
3. shardRegions + other singletons
4. leave cluster, waiting for a successful MemberExited(addr) message
5. shutdown actor system

We also have a special case useful for development, which can be considered a hack: if there is one or fewer servers on the seeds list, we don't wait for anything and just shut down immediately. The rationale is that a one-node setup means development or some test machines, so we don't need a graceful shutdown in that case. As we routinely do rolling restarts, steps like gracefully shutting down shard regions are important: if they fail, we need to restart the whole cluster.

Unfortunately, Runtime.getRuntime.addShutdownHook stopped working when we tried to use Artery remoting in Akka 2.4.14, because Artery shuts down before the other shutdown hooks have completed.
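The approach described in this comment could look roughly like the following plain-Java sketch. The class `OrderedLeave`, its methods, and the step names are hypothetical; the real steps (stopping the HTTP server, waiting for MemberExited, and so on) are represented by arbitrary `Runnable`s, and the single-seed shortcut is implemented as "skip every step", which is only one reading of the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: ordered shutdown steps triggered from a JVM shutdown hook. */
public class OrderedLeave {
    // LinkedHashMap keeps the steps in registration order.
    private final Map<String, Runnable> steps = new LinkedHashMap<>();

    public OrderedLeave step(String name, Runnable action) {
        steps.put(name, action);
        return this;
    }

    /**
     * Runs the steps in registration order and returns the names of the steps that ran.
     * With one or fewer seed nodes (assumed to be a development setup), nothing is awaited.
     */
    public List<String> shutdown(int seedNodeCount) {
        List<String> ran = new ArrayList<>();
        if (seedNodeCount <= 1) {
            return ran; // development shortcut: no graceful steps
        }
        for (Map.Entry<String, Runnable> entry : steps.entrySet()) {
            entry.getValue().run();
            ran.add(entry.getKey());
        }
        return ran;
    }

    /** Registers the ordered shutdown as a JVM shutdown hook and returns the hook thread. */
    public Thread installHook(int seedNodeCount) {
        Thread hook = new Thread(() -> shutdown(seedNodeCount), "ordered-leave");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

The weakness reported above follows from how the JVM treats shutdown hooks: registered hooks are started in an unspecified order and run concurrently, so a hook like this one has no guarantee of finishing before a hook owned by another component.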


patriknw · 2016-12-09T10:42:26Z

@marekzebrowski @guidomedina I have been working on PR #21930. Most things are in place (a few things are not done yet). I think it will handle all of the things you have had problems with automatically, and you can hook in your own tasks in the shutdown phases.

It also has support for adding JVM shutdown hooks that are run before the Artery shutdown hook. The coordinated shutdown is started automatically by such a shutdown hook, so you probably don't even have to add one yourself.
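One way to get that ordering guarantee is to register a single JVM shutdown hook that runs user hooks first and the transport shutdown last, and to make the whole sequence run at most once. The sketch below shows that idea in plain Java; `SingleJvmHook` and its method names are hypothetical and are not taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: one JVM hook that runs user hooks before the transport shutdown. */
public class SingleJvmHook {
    private final List<Runnable> userHooks = new ArrayList<>();
    private final Runnable transportShutdown;
    private final AtomicBoolean started = new AtomicBoolean(false);

    public SingleJvmHook(Runnable transportShutdown) {
        this.transportShutdown = transportShutdown;
    }

    /** Adds a hook that is guaranteed to run before the transport is shut down. */
    public synchronized void addJvmShutdownHook(Runnable hook) {
        userHooks.add(hook);
    }

    /**
     * Runs the user hooks in registration order, then the transport shutdown.
     * Returns true if this call did the work, false if shutdown had already started.
     */
    public boolean run() {
        if (!started.compareAndSet(false, true)) {
            return false; // already triggered, for example by an explicit call
        }
        List<Runnable> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(userHooks);
        }
        for (Runnable hook : snapshot) {
            hook.run();
        }
        transportShutdown.run();
        return true;
    }

    /** Registers this sequence as the only hook the JVM needs to know about. */
    public void install() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::run, "coordinated-shutdown"));
    }
}
```

Because the JVM only sees one hook, the relative order of the user hooks and the transport shutdown no longer depends on how the JVM schedules hooks, and an application that triggers `run()` explicitly does not repeat the work when the JVM exits.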

add CoordinatedShutdown, #21537

* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown
* phase config obj with depends-on list
* integrate graceful leaving of sharding in coordinated shutdown
* add timeout and recover
* add some missing artery ports to tests
* leave via CoordinatedShutdown.run
* optionally exit-jvm in last phase
* run via jvm shutdown hook
* send ExitingConfirmed to leader before shutdown of Exiting to not have to wait for failure detector to mark it as unreachable before removing
* the unreachable signal is still kept as a safeguard if the message is lost or the leader dies
* PhaseClusterExiting vs MemberExited in ClusterSingletonManager
* terminate ActorSystem when cluster shutdown (via Down)
* add more predefined and custom phases
* reference documentation
* migration guide
* problem when the leader order was sys2, sys1, sys3, then sys3 could not perform its duties and move Leaving sys1 to Exiting because it was observing sys1 as unreachable
* exclude Leaving with exitingConfirmed from convergence condition
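The "add timeout and recover" item can be illustrated with a small plain-Java sketch: each phase gets a timeout, and a `recover` flag decides whether a failed or timed-out phase aborts the remaining phases or lets them continue. `PhaseRunner`, `Phase`, and the field names are hypothetical, and the exact semantics in the PR may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical sketch of per-phase timeout and recover handling. */
public class PhaseRunner {

    public static final class Phase {
        final String name;
        final long timeoutMillis;
        final boolean recover; // continue with later phases after failure or timeout?
        final List<Supplier<CompletableFuture<Void>>> tasks;

        public Phase(String name, long timeoutMillis, boolean recover,
                     List<Supplier<CompletableFuture<Void>>> tasks) {
            this.name = name;
            this.timeoutMillis = timeoutMillis;
            this.recover = recover;
            this.tasks = tasks;
        }
    }

    /**
     * Runs the phases in the given order. Tasks within a phase are started together
     * and awaited up to the phase timeout. Returns the names of the phases that
     * either completed or were recovered from.
     */
    public static List<String> run(List<Phase> phases) {
        List<String> finished = new ArrayList<>();
        for (Phase phase : phases) {
            CompletableFuture<?>[] running = phase.tasks.stream()
                    .map(Supplier::get)
                    .toArray(CompletableFuture[]::new);
            try {
                CompletableFuture.allOf(running)
                        .get(phase.timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } catch (ExecutionException | TimeoutException e) {
                if (!phase.recover) {
                    break; // abort: later phases are not run
                }
                // recover: fall through and continue with the next phase
            }
            finished.add(phase.name);
        }
        return finished;
    }
}
```

With `recover` enabled, a phase whose task never completes only delays shutdown by its timeout and the later phases still run; with it disabled, the run stops at that phase.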

patriknw added the "1 - triaged" and "t:cluster" labels Sep 23, 2016

patriknw mentioned this issue Sep 23, 2016

WIP Quarantine gracefully downed node after some time #21534

Merged

patriknw changed the title ~~improve cluster leaving~~ Coordinated Shutdown (improve cluster leaving) Nov 25, 2016

patriknw self-assigned this Dec 1, 2016

patriknw added the "3 - in progress" label and removed the "1 - triaged" label Dec 1, 2016

patriknw added a commit that referenced this issue Dec 2, 2016

add CoordinatedShutdown, #21537

a2e0311

* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown

patriknw added a commit that referenced this issue Dec 7, 2016

add CoordinatedShutdown, #21537

2fd9ed8


patriknw added a commit that referenced this issue Dec 9, 2016

add CoordinatedShutdown, #21537

2788751


patriknw mentioned this issue Dec 9, 2016

add CoordinatedShutdown, #21537 #21930

Merged

patriknw added a commit that referenced this issue Dec 15, 2016

add CoordinatedShutdown, #21537

e5ab4bd


patriknw added a commit that referenced this issue Jan 4, 2017

add CoordinatedShutdown, #21537

ad1aa28


patriknw added a commit that referenced this issue Jan 6, 2017

add CoordinatedShutdown, #21537

88f3059


patriknw mentioned this issue Jan 11, 2017

Ease graceful leaving from Cluster (for Cluster Singleton) #18373

Closed

patriknw added a commit that referenced this issue Jan 16, 2017

Merge pull request #21930 from akka/wip-21537-coordinated-patriknw

4b6a650

add CoordinatedShutdown, #21537

patriknw removed the "3 - in progress" label Jan 16, 2017

patriknw added this to the 2.5.0 milestone Jan 16, 2017

patriknw closed this as completed Jan 16, 2017

patriknw added a commit that referenced this issue Jan 20, 2017

add note about CoordinatedShutdown and tests, #21537

611dc93

Aaronontheweb mentioned this issue Feb 14, 2017

Akka.Cluster: Add CoordinatedShutdown akkadotnet/akka.net#2516

Closed
