Function clause error in Swarm.IntervalTreeClock.fill on topology change #126

devonestes opened this issue Mar 8, 2019 · 4 comments


@devonestes

Howdy!

Our error tracker picked up the following error that happened when we deployed our application yesterday.

We're using Swarm (version 3.4.0) to manage a two node cluster on Kubernetes. When the first node was shut down as part of the rolling deploy, the following ErlangError was raised (I've formatted it a bit to make it easier to read):

{{:function_clause,
  [
    {Swarm.IntervalTreeClock, :fill, [{1, 0}, {1, 81714}],
     [file: 'lib/swarm/tracker/crdt.ex', line: 170]},
    {Swarm.IntervalTreeClock, :event, 1, [file: 'lib/swarm/tracker/crdt.ex', line: 52]},
    {Swarm.Tracker, :remove_registration, 2, [file: 'lib/swarm/tracker/tracker.ex', line: 1494]},
    {Swarm.Tracker, :"-handle_topology_change/2-fun-0-", 3,
     [file: 'lib/swarm/tracker/tracker.ex', line: 911]},
    {:lists, :foldl, 3, [file: 'lists.erl', line: 1263]},
    {:ets, :do_foldl, 4, [file: 'ets.erl', line: 638]},
    {:ets, :foldl, 3, [file: 'ets.erl', line: 627]},
    {Swarm.Tracker, :handle_topology_change, 2, [file: 'lib/swarm/tracker/tracker.ex', line: 840]}
  ]},
 {:gen_statem, :call,
  [
    Swarm.Tracker,
    {:track, #Reference<0.2697759696.1380188162.160667>,
     %{
       mfa:
         {GenServer, :start_link,
          [
            PotionWeb.Devices.SendAfter,
            {"device::id",
             {PotionWeb.Devices.Worker, :device_offline, []}, 30000}
          ]}
     }},
    :infinity
  ]}}

It looks like a function clause error where Swarm.IntervalTreeClock.fill({1, 0}, {1, 81714}) was called, but none of the function clauses match:

  defp fill(0, e), do: e
  defp fill(1, {_, _, _} = e), do: height(e)
  defp fill(_, n) when is_integer(n), do: n

  defp fill({1, r}, {n, el, er}) do
    er1 = fill(r, er)
    d = max(height(el), base(er1))
    norm_ev({n, d, er1})
  end

  defp fill({l, 1}, {n, el, er}) do
    el1 = fill(l, el)
    d = max(height(er), base(el1))
    norm_ev({n, el1, d})
  end

  defp fill({l, r}, {n, el, er}) do
    norm_ev({n, fill(l, el), fill(r, er)})
  end
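
Looking at those clauses, the reason nothing matches seems to be that the id {1, 0} falls past the first three clauses, and every remaining clause expects the event to be a three-element tuple {n, el, er}, while the event here is the two-element {1, 81714}. A tiny sketch (not Swarm's code, with the height/base/norm_ev helpers replaced by placeholder return values) that reproduces the same FunctionClauseError:

  # Sketch only: the same clause heads as above, just to show which clause
  # would have to match and why none of them do for these arguments.
  defmodule FillSketch do
    def fill(0, e), do: e
    def fill(1, {_, _, _} = e), do: {:height, e}
    def fill(_, n) when is_integer(n), do: n
    def fill({1, _r}, {_n, _el, _er}), do: :right_fill
    def fill({_l, 1}, {_n, _el, _er}), do: :left_fill
    def fill({_l, _r}, {_n, _el, _er}), do: :both_fill
  end

  FillSketch.fill({1, 0}, {1, 81714})
  #=> ** (FunctionClauseError) no function clause matching in FillSketch.fill/2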

This is pretty deep in the CRDT weeds, so I don't think I'm going to be able to help much in figuring out how that function got called with those arguments, but I'd be happy to provide any other information about the application that might help in fixing this.

@spunkedy

I have this issue as well, but on 3.3.1.
Hope this extra info helps whoever picks this up.

Process #PID<0.9499.2> terminating:
{{:function_clause,
  [
    {Swarm.IntervalTreeClock, :fill, [{0, 18}, {1, 18}],
     [file: 'lib/swarm/tracker/crdt.ex', line: 164]},
    {Swarm.IntervalTreeClock, :event, 1, [file: 'lib/swarm/tracker/crdt.ex', line: 53]},
    {Swarm.Tracker, :"-sync_registry/4-fun-0-", 4, [file: 'lib/swarm/tracker/tracker.ex', line: 371]},
    {Enum, :"-reduce/3-lists^foldl/2-0-", 3, [file: 'lib/enum.ex', line: 1925]},
    {Swarm.Tracker, :awaiting_sync_ack, 3, [file: 'lib/swarm/tracker/tracker.ex', line: 488]},
    {:gen_statem, :call_state_function, 5, [file: 'gen_statem.erl', line: 1660]},
    {:gen_statem, :loop_event_state_function, 6, [file: 'gen_statem.erl', line: 1023]},
    {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 249]}
  ]},
 {:gen_statem, :call,
  [
    Swarm.Tracker,
    {:track, Backend.Scheduler.ExecutionBroadcaster,
     %{
       mfa:
         {Quantum.ExecutionBroadcaster, :start_link,
          [
            %Quantum.ExecutionBroadcaster.StartOpts{
              debug_logging: true,
              job_broadcaster_reference: {:via, :swarm, Backend.Scheduler.JobBroadcaster},
              name: Backend.Scheduler.ExecutionBroadcaster,
              scheduler: Backend.Scheduler,
              storage: Quantum.Storage.Noop
            }
          ]}
     }},
    :infinity
  ]}}

My configuration

  backend: [
    strategy: Cluster.Strategy.Kubernetes.DNS,
    config: [
      service: System.get_env("K8S_CLUSTER_SERVICE"),
      application_name: System.get_env("K8S_CLUSTER_APP_NAME"),
      polling_interval: 10_000
    ]
  ]
]
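
(The dangling bracket is because this is an excerpt. Assuming it sits inside a standard libcluster topologies config, the full shape would be roughly the following sketch, with the rest copied from above:)

  # Sketch of the surrounding libcluster config this excerpt would typically live in.
  config :libcluster,
    topologies: [
      backend: [
        strategy: Cluster.Strategy.Kubernetes.DNS,
        config: [
          service: System.get_env("K8S_CLUSTER_SERVICE"),
          application_name: System.get_env("K8S_CLUSTER_APP_NAME"),
          polling_interval: 10_000
        ]
      ]
    ]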

@markmeeus

We are also having this issue on 3.3.1.
Running 2 nodes on OpenShift (Kubernetes) with the libcluster Kubernetes.DNS (headless service) strategy.
In our case, the Swarm.IntervalTreeClock.fill errors didn't go away until we restarted the services.

Probably related: we experienced the following crash right before the Swarm.IntervalTreeClock.fill error.

{{{:badmatch, false},
  [
    {Swarm.Registry, :new!, 1, [file: 'lib/swarm/registry.ex', line: 99]},
    {Swarm.Tracker, :handle_replica_event, 4, [file: 'lib/swarm/tracker/tracker.ex', line: 814]},
    {:gen_statem, :call_state_function, 5, [file: 'gen_statem.erl', line: 1240]},
    {:gen_statem, :loop_event, 6, [file: 'gen_statem.erl', line: 1012]},
    {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 247]}
  ]},
 {:gen_statem, :call,
  [
    Swarm.Tracker,
    {:track, "<our_id>",
     %{mfa: {Cadex.Swarm.Supervisor, :start_event_processor, ["PH06407805_2019"]}}},
    :infinity
  ]}}

Seems like Registry tries to insert a record in ets with a duplicate key?
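
That would fit the {:badmatch, false} above: if Swarm.Registry.new!/1 does something along the lines of matching true against :ets.insert_new/2 (an assumption on my part, I haven't checked the source), a second insert under the same key returns false and crashes with exactly this shape:

  # Sketch, not Swarm's code: :ets.insert_new/2 returns false when the key is
  # already present, so a `true = ...` match on the result raises {:badmatch, false}.
  table = :ets.new(:registry_sketch, [:set, :public])
  true = :ets.insert_new(table, {"some key", self()})
  true = :ets.insert_new(table, {"some key", self()})
  # ** (MatchError) no match of right hand side value: false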

Some of our processes exit immediately after being started because they determine there isn't anything left to do.
But in some edge cases, new data may come in and the process is recreated.
I did notice that this specific process seems to have stopped immediately after being started. Maybe it didn't get cleaned up on all nodes before it got recreated?
Also, a network split seems to have occurred, because I found this in the logs at the time the error started:

[swarm on cadex_service@server2] [tracker:handle_cast] received sync request from cadex_service@server1
[swarm on cadex_service@server1] [tracker:cluster_wait] selected sync node: cadex_service@server2
[swarm on cadex_service@server1] [tracker:cluster_wait] found connected nodes: [:"cadex_service@server2"]
[swarm on cadex_service@server1] [tracker:cluster_wait] joining cluster..

@markmeeus

markmeeus commented Jun 24, 2019

We have upgraded to 3.4, and now it seems to derail even more quickly.
These issues occur on 2 fixed nodes (not K8s).

Besides the IntervalTreeClock errors (343,584 log entries in 6 hours),
we are also getting lots of these:

mismatched pids, local clock conflicts with remote clock, event unhandled

and these:

received track event for "<some-id>", but local clock conflicts with remote clock, event unhandled

and these

failed to start "some-id" on cadex_service@13.41.2.18: nodedown, retrying operation..

And also, some tracked processes are actually gone:

(EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started

I'll try to get the states from the system

I am also wondering what would happen if a node restarts and starts creating processes before it has joined the cluster.
When it rejoins, I can imagine the other nodes not being aware that it restarted. Could there be some sort of race condition between rejoining after a restart and registering new processes that may have been tracked before?

@markmeeus

Sorry for diluting this ticket.
It looks like the IntervalTreeClock errors on fill and sum are gone after upgrading to 3.4.
We did some refactoring in our code so that we don't join a group when a process has already joined it (see the sketch below), and we fixed some processes crashing on the :die event.
That seems to solve most of our issues.
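
Roughly what the join guard looks like now (a simplified sketch, not our exact code; Swarm.members/1 and Swarm.join/2 are the public Swarm API):

  defmodule GroupJoinSketch do
    # Only join the group when this pid isn't already a member, so repeated
    # registrations don't emit duplicate join events.
    def ensure_joined(group, pid) do
      if pid in Swarm.members(group) do
        :ok
      else
        Swarm.join(group, pid)
      end
    end
  end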

When we have bursts of data, we still get these warnings:

received track event for "<some-id>", but local clock conflicts with remote clock, event unhandled

Not sure how to interpret these, since everything seems to work out fine.
