Function clause error in Swarm.IntervalTreeClock.fill on topology change #126

devonestes opened this issue Mar 8, 2019 · 4 comments


@devonestes

Howdy!

Our error tracker picked up the following error that happened when we deployed our application yesterday.

We're using Swarm (version 3.4.0) to manage a two node cluster on Kubernetes. When the first node was shut down as part of the rolling deploy, the following ErlangError was raised (I've formatted it a bit to make it easier to read):

{{:function_clause,
  [
    {Swarm.IntervalTreeClock, :fill, [{1, 0}, {1, 81714}],
     [file: 'lib/swarm/tracker/crdt.ex', line: 170]},
    {Swarm.IntervalTreeClock, :event, 1, [file: 'lib/swarm/tracker/crdt.ex', line: 52]},
    {Swarm.Tracker, :remove_registration, 2, [file: 'lib/swarm/tracker/tracker.ex', line: 1494]},
    {Swarm.Tracker, :"-handle_topology_change/2-fun-0-", 3,
     [file: 'lib/swarm/tracker/tracker.ex', line: 911]},
    {:lists, :foldl, 3, [file: 'lists.erl', line: 1263]},
    {:ets, :do_foldl, 4, [file: 'ets.erl', line: 638]},
    {:ets, :foldl, 3, [file: 'ets.erl', line: 627]},
    {Swarm.Tracker, :handle_topology_change, 2, [file: 'lib/swarm/tracker/tracker.ex', line: 840]}
  ]},
 {:gen_statem, :call,
  [
    Swarm.Tracker,
    {:track, #Reference<0.2697759696.1380188162.160667>,
     %{
       mfa:
         {GenServer, :start_link,
          [
            PotionWeb.Devices.SendAfter,
            {"device::id",
             {PotionWeb.Devices.Worker, :device_offline, []}, 30000}
          ]}
     }},
    :infinity
  ]}}

It looks like a function clause error where Swarm.IntervalTreeClock.fill({1, 0}, {1, 81714}) was called, but none of the function clauses match:

  defp fill(0, e), do: e
  defp fill(1, {_, _, _} = e), do: height(e)
  defp fill(_, n) when is_integer(n), do: n

  defp fill({1, r}, {n, el, er}) do
    er1 = fill(r, er)
    d = max(height(el), base(er1))
    norm_ev({n, d, er1})
  end

  defp fill({l, 1}, {n, el, er}) do
    el1 = fill(l, el)
    d = max(height(er), base(el1))
    norm_ev({n, el1, d})
  end

  defp fill({l, r}, {n, el, er}) do
    norm_ev({n, fill(l, el), fill(r, er)})
  end
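
Looking at those clauses, the reason nothing matches seems to be that the id {1, 0} falls past the first three clauses, and every remaining clause expects the event to be a three-element tuple {n, el, er}, while the event here is the two-element {1, 81714}. A tiny sketch (not Swarm's code, with the height/base/norm_ev helpers replaced by placeholder return values) that reproduces the same FunctionClauseError:

  # Sketch only: the same clause heads as above, just to show which clause
  # would have to match and why none of them do for these arguments.
  defmodule FillSketch do
    def fill(0, e), do: e
    def fill(1, {_, _, _} = e), do: {:height, e}
    def fill(_, n) when is_integer(n), do: n
    def fill({1, _r}, {_n, _el, _er}), do: :right_fill
    def fill({_l, 1}, {_n, _el, _er}), do: :left_fill
    def fill({_l, _r}, {_n, _el, _er}), do: :both_fill
  end

  FillSketch.fill({1, 0}, {1, 81714})
  #=> ** (FunctionClauseError) no function clause matching in FillSketch.fill/2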

This is pretty deep in the CRDT weeds, so I don't think I'm going to be able to help much in figuring out how that function got called with those arguments, but I'd be happy to provide any other information about the application that might help in fixing this.

@spunkedy

I have this issue as well, but on 3.3.1.
Hope this extra info helps whoever picks this up.

Process #PID<0.9499.2> terminating:
{{:function_clause,
  [
    {Swarm.IntervalTreeClock, :fill, [{0, 18}, {1, 18}],
     [file: 'lib/swarm/tracker/crdt.ex', line: 164]},
    {Swarm.IntervalTreeClock, :event, 1, [file: 'lib/swarm/tracker/crdt.ex', line: 53]},
    {Swarm.Tracker, :"-sync_registry/4-fun-0-", 4, [file: 'lib/swarm/tracker/tracker.ex', line: 371]},
    {Enum, :"-reduce/3-lists^foldl/2-0-", 3, [file: 'lib/enum.ex', line: 1925]},
    {Swarm.Tracker, :awaiting_sync_ack, 3, [file: 'lib/swarm/tracker/tracker.ex', line: 488]},
    {:gen_statem, :call_state_function, 5, [file: 'gen_statem.erl', line: 1660]},
    {:gen_statem, :loop_event_state_function, 6, [file: 'gen_statem.erl', line: 1023]},
    {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 249]}
  ]},
 {:gen_statem, :call,
  [
    Swarm.Tracker,
    {:track, Backend.Scheduler.ExecutionBroadcaster,
     %{
       mfa:
         {Quantum.ExecutionBroadcaster, :start_link,
          [
            %Quantum.ExecutionBroadcaster.StartOpts{
              debug_logging: true,
              job_broadcaster_reference: {:via, :swarm, Backend.Scheduler.JobBroadcaster},
              name: Backend.Scheduler.ExecutionBroadcaster,
              scheduler: Backend.Scheduler,
              storage: Quantum.Storage.Noop
            }
          ]}
     }},
    :infinity
  ]}}

My configuration

  backend: [
    strategy: Cluster.Strategy.Kubernetes.DNS,
    config: [
      service: System.get_env("K8S_CLUSTER_SERVICE"),
      application_name: System.get_env("K8S_CLUSTER_APP_NAME"),
      polling_interval: 10_000
    ]
  ]
]
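
(The dangling bracket is because this is an excerpt. Assuming it sits inside a standard libcluster topologies config, the full shape would be roughly the following sketch, with the rest copied from above:)

  # Sketch of the surrounding libcluster config this excerpt would typically live in.
  config :libcluster,
    topologies: [
      backend: [
        strategy: Cluster.Strategy.Kubernetes.DNS,
        config: [
          service: System.get_env("K8S_CLUSTER_SERVICE"),
          application_name: System.get_env("K8S_CLUSTER_APP_NAME"),
          polling_interval: 10_000
        ]
      ]
    ]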

@markmeeus

We are also having this issue on 3.3.1.
Running 2 nodes on OpenShift (Kubernetes) with the libcluster Kubernetes.DNS (headless service) strategy.
In our case, the Swarm.IntervalTreeClock.fill errors didn't go away until we restarted the services.

Probably related: we experienced the following crash right before the Swarm.IntervalTreeClock.fill error.

{{{:badmatch, false},
  [
    {Swarm.Registry, :new!, 1, [file: 'lib/swarm/registry.ex', line: 99]},
    {Swarm.Tracker, :handle_replica_event, 4, [file: 'lib/swarm/tracker/tracker.ex', line: 814]},
    {:gen_statem, :call_state_function, 5, [file: 'gen_statem.erl', line: 1240]},
    {:gen_statem, :loop_event, 6, [file: 'gen_statem.erl', line: 1012]},
    {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 247]}
  ]},
 {:gen_statem, :call,
  [
    Swarm.Tracker,
    {:track, "<our_id>",
     %{mfa: {Cadex.Swarm.Supervisor, :start_event_processor, ["PH06407805_2019"]}}},
    :infinity
  ]}}

Seems like Registry tries to insert a record in ets with a duplicate key?
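
That would fit the {:badmatch, false} above: if Swarm.Registry.new!/1 does something along the lines of matching true against :ets.insert_new/2 (an assumption on my part, I haven't checked the source), a second insert under the same key returns false and crashes with exactly this shape:

  # Sketch, not Swarm's code: :ets.insert_new/2 returns false when the key is
  # already present, so a `true = ...` match on the result raises {:badmatch, false}.
  table = :ets.new(:registry_sketch, [:set, :public])
  true = :ets.insert_new(table, {"some key", self()})
  true = :ets.insert_new(table, {"some key", self()})
  # ** (MatchError) no match of right hand side value: false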

Some of our processes exit immediately after being started because they determine there isn't anything left to do.
But in some edge cases, new data may come in and the process is recreated.
I did notice that this specific process seems to have stopped immediately after being started. Maybe it didn't get cleaned up on all nodes before it got recreated?
Also, a network split seems to have occurred, because I found this in the logs at the time the error started:

[swarm on cadex_service@server2] [tracker:handle_cast] received sync request from cadex_service@server1
[swarm on cadex_service@server1] [tracker:cluster_wait] selected sync node: cadex_service@server2
[swarm on cadex_service@server1] [tracker:cluster_wait] found connected nodes: [:"cadex_service@server2"]
[swarm on cadex_service@server1] [tracker:cluster_wait] joining cluster..

@markmeeus

markmeeus commented Jun 24, 2019

We have upgraded to 3.4, and now it seems to derail even more quickly.
These issues occur on 2 fixed nodes (not K8s).

Besides the IntervalTreeClock errors (343,584 log entries in 6 hours),
we are also getting lots of these:

mismatched pids, local clock conflicts with remote clock, event unhandled

and these:

received track event for "<some-id>", but local clock conflicts with remote clock, event unhandled

and these

failed to start "some-id" on cadex_service@13.41.2.18: nodedown, retrying operation..

And also, some tracked processes are actually gone:

(EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started

I'll try to get the states from the system

I am also wondering what would happen if a node restarts and starts creating processes before it has joined the cluster.
When it rejoins, I can imagine the other nodes not being aware that it restarted. Could there be some sort of race condition between rejoining after a restart and registering new processes that may have been tracked before?

@markmeeus

Sorry for diluting this ticket.
It looks like the IntervalTreeClock errors on fill and sum are gone after upgrading to 3.4.
We did some refactoring in our code so that we don't join a group when a process has already joined it (see the sketch below), and we fixed some processes crashing on the :die event.
That seems to solve most of our issues.
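
Roughly what the join guard looks like now (a simplified sketch, not our exact code; Swarm.members/1 and Swarm.join/2 are the public Swarm API):

  defmodule GroupJoinSketch do
    # Only join the group when this pid isn't already a member, so repeated
    # registrations don't emit duplicate join events.
    def ensure_joined(group, pid) do
      if pid in Swarm.members(group) do
        :ok
      else
        Swarm.join(group, pid)
      end
    end
  end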

When we have bursts of data, we still get these warnings:

received track event for "<some-id>", but local clock conflicts with remote clock, event unhandled

Not sure how to interpret these, since everything seems to work out fine.
