
Static quorum ring distribution strategy #38

Merged: 20 commits from feature/split-brain into bitwalker:master on Aug 23, 2017

Conversation

@slashdotdash (Collaborator) commented Aug 21, 2017:

Adds a new strategy module, Swarm.Distribution.StaticQuorumRing, used to provide consistency during a network partition.

The strategy is configured with a quorum size, which defines the minimum number of nodes that must be connected in the cluster to allow process registration and distribution. If fewer nodes are available than the quorum size, any call to Swarm.register_name/5 will block until enough nodes have started.
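For illustration, a blocking registration with a timeout could look like this (the MyApp.WorkerSup module, its :register function, and the 5 second timeout are placeholders, not part of this PR):

# Blocks until a quorum of nodes is connected, giving up after 5 seconds
{:ok, pid} = Swarm.register_name({:worker, 1}, MyApp.WorkerSup, :register, [{:worker, 1}], 5_000)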

The Swarm.Distribution.Strategy.key_to_node/2 function may return :undefined to indicate that there is no node available to start the process. In this case the tracker will record the registration as pending and will attempt to start the process whenever the network topology changes. Should a node go down, or during a net split, any processes that are currently running but are determined to have no node available will be stopped. They will be restarted whenever a node becomes available, as determined by the same consistent name-based hash ring distribution provided by libring. Additional ring distribution strategies can now be written that favour consistency over availability by returning :undefined as the node.
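As a rough sketch of the pattern such a strategy follows (the struct fields below are assumptions for illustration; HashRing is libring's module):

defmodule MyApp.QuorumAwareStrategy do
  # Hypothetical sketch: favour consistency by returning :undefined when fewer
  # nodes than the configured quorum are present in the ring.
  def key_to_node(%{static_quorum_size: quorum, ring: ring}, key) do
    if length(HashRing.nodes(ring)) < quorum do
      :undefined
    else
      HashRing.key_to_node(ring, key)
    end
  end
end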

A full suite of tests for the new functionality is included in test/quorum_test.exs.

Please let me know of any issues or changes you recommend.

Spawned test cluster nodes require the `MyApp.WorkerSup` module to be
started.

Only registered processes may join groups (and be assigned metadata).
When the ring maps a name to an `:undefined` node, the tracker will
record the pending request.

On subsequent topology changes it will retry the pending registrations.
The `:timeout` value can be used to limit the duration of blocking name
registration calls.
Provide alternate strategy options: availability, consistency.
@Papipo commented Aug 21, 2017:

👏

@slashdotdash (Collaborator, Author) commented:

The motivation for this pull request is to support running Commanded on a cluster of nodes (#39).

@@ -698,11 +725,9 @@ defmodule Swarm.Tracker do
debug "#{inspect name} has requested to be restarted"
{:ok, new_state} = remove_registration(obj, %{state | clock: lclock})
send(pid, {:swarm, :die})
case handle_call({:track, name, m, f, a}, nil, %{state | clock: lclock}) do
@slashdotdash (Collaborator, Author):
@bitwalker I was unsure whether the state here should be the new_state returned from the remove_registration/2 call two lines above. Does it make a difference if the clock is incremented only once for the two operations (remove, then add) rather than twice?

@bitwalker (Owner):
We should use the new state returned from the last call that manipulates the state (i.e. the clock should be incremented twice).
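A minimal sketch of that suggestion, reusing the names from the hunk above (only the state argument passed to handle_call/3 changes; the surrounding case clause stays as in the diff):

{:ok, new_state} = remove_registration(obj, %{state | clock: lclock})
send(pid, {:swarm, :die})
# re-track using the state returned by remove_registration/2, so the clock is
# incremented once for the remove and again for the subsequent add
handle_call({:track, name, m, f, a}, nil, new_state)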

This ensures that all processes in the registry are correctly
redistributed (or stopped), not only the process that has triggered the
monitor.
@bitwalker (Owner) left a review:
This looks great! I have some comments and thoughts, but nothing major; my initial reading of this looks good to me. Very impressive that you pretty much nailed it in your first pass! Nice work :)

lib/swarm.ex Outdated
"""
@spec register_name(term, atom(), atom(), [term]) :: {:ok, pid} | {:error, term}
defdelegate register_name(name, m, f, a), to: Swarm.Registry, as: :register
@spec register_name(term, atom(), atom(), [term], non_neg_integer() | :infinity) :: {:ok, pid} | {:error, term}
@bitwalker (Owner):
For optional parameters, you should provide a typespec for both arities; in this case, simply omitting the timeout parameter is enough. Otherwise someone can't do `iex> s Swarm.register_name/4` and get a typespec; they have to know to do `iex> s Swarm.register_name/5`.
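A sketch of that suggestion, with one @spec per arity (the 5-arity delegate is assumed to mirror the 4-arity one):

@spec register_name(term, atom(), atom(), [term]) :: {:ok, pid} | {:error, term}
defdelegate register_name(name, m, f, a), to: Swarm.Registry, as: :register

@spec register_name(term, atom(), atom(), [term], non_neg_integer() | :infinity) :: {:ok, pid} | {:error, term}
defdelegate register_name(name, m, f, a, timeout), to: Swarm.Registry, as: :register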


You must configure the quorum size using the `:static_quorum_size` setting:

config :swarm,
@bitwalker (Owner):
It may be worth including the :distribution_strategy option here for clarity, and to re-emphasize that both options are required to use this strategy correctly.
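For example, something along these lines (the quorum size of 3 is illustrative):

config :swarm,
  distribution_strategy: Swarm.Distribution.StaticQuorumRing,
  static_quorum_size: 3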

It defines the minimum number of nodes that must be connected in the cluster to allow process
registration and distribution.

If there are fewer nodes currently available than the quorum size, any calls to
@bitwalker (Owner):
It may be worth discussing here the use of the kernel config options :sync_nodes_mandatory, :sync_nodes_optional, and :sync_nodes_timeout. These ensure the required and optional members of the cluster are connected when the runtime boots, before any applications start; that's particularly useful for the use cases this strategy is designed around (i.e. the cluster members are known in advance). The mandatory and optional settings take a list of nodes, and the timeout setting takes an integer or :infinity. You can configure it like any other app, e.g.:

config :kernel,
  sync_nodes_mandatory: [:"node1@192.168.1.1", :"node2@192.168.1.2"],
  sync_nodes_timeout: 60_000

@slashdotdash (Collaborator, Author):
That's a useful feature I was unaware of.

@bitwalker (Owner):
It is for sure :). The only caveat to the above is that the configuration needs to be present when the VM boots, so when running under mix you need to pass --erl "-config path/to/sys.config" and convert the configuration I mentioned to Erlang terms, e.g.:

[{kernel, [{sync_nodes_mandatory, ['node1@192.168.1.1', ...]},
           {sync_nodes_timeout, 60000}]}].

Using the Mix config files works for releases though.

mix.exs Outdated
@@ -14,7 +14,7 @@ defmodule Swarm.Mixfile do
def project do
[app: :swarm,
version: "3.0.5",
elixir: "~> 1.3",
elixir: "~> 1.5",
@bitwalker (Owner):
Is there a reason we're relying on 1.5 here? I don't like breaking backwards compatibility unless we have a good reason to do so.

@@ -17,22 +17,39 @@ defmodule Swarm.Tracker do
alias Swarm.Registry
alias Swarm.Distribution.Strategy

defmodule Tracking do
@bitwalker (Owner):
Can you add @moduledoc false to this as well?

a: list(),
from: {pid, tag :: term},
}
defstruct name: nil,
@bitwalker (Owner):
Since all of the default values are nil, let's make this defstruct [:name, :m, :f, :a, :from]


@@ -719,14 +744,19 @@ defmodule Swarm.Tracker do
:else ->
# pid is dead, we're going to restart it
case Strategy.key_to_node(state.strategy, name) do
:undefined ->
# No node available to restart process on, so remove registration
debug "no node available to restart #{inspect name}"
@bitwalker (Owner):
My instinct is that this should be logged as a warning; it seems like a condition that we would definitely want logged when it occurs.
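A possible shape of that change, falling back to Elixir's Logger (whether the tracker's own logging helpers expose a warn variant alongside debug is an assumption I haven't verified):

require Logger
# emit at warning level so operators notice when no node is available
Logger.warn "no node available to restart #{inspect name}"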

end
defp handle_cast({:retry_pending_trackings}, %{pending_trackings: pending_trackings} = state) do
debug "retry pending trackings: #{inspect state.pending_trackings}"
state = Enum.reduce(pending_trackings, %TrackerState{state | pending_trackings: []}, fn (tracking, state) ->
@bitwalker (Owner):
Minor style change here: my preference in new/modified code is to break assignments that exceed 80 characters across multiple lines, using pipes where applicable to shorten them up; I also prefer to avoid parens around anonymous function arguments, e.g.:

state =
  pending_trackings
  |> Enum.reduce(%TrackerState{state | pending_trackings: []}, fn tracking, state ->
    ...
  end)

@slashdotdash (Collaborator, Author) commented:

@bitwalker Thanks for the positive feedback. I'll go ahead and make the changes you've outlined and update the pull request.

@slashdotdash (Collaborator, Author) commented:

@bitwalker I've pushed additional commits to this pull request containing the changes you outlined.

One issue that I've spotted today is that joined groups are not rejoined when a process gets restarted. I can include a fix for that too, if you want?

@bitwalker merged commit 6d13987 into bitwalker:master on Aug 23, 2017
@bitwalker (Owner) commented:

Let's fix that issue in a separate PR to keep things easier to review :). I've merged this for now; thanks for all the hard work!

@slashdotdash deleted the feature/split-brain branch on August 23, 2017 at 19:27