Cannot get connection id for node in OTP 21.0 #70

Closed
beardedeagle opened this issue Jun 20, 2018 · 19 comments

@beardedeagle

Creating issue for tracking.

Given OTP 21.0, Elixir 1.6.6 and libcluster 3.0.1:

iex --name vanguard@vanguard01.cloud.phx3.gdg --cookie test -S mix phx.server
Erlang/OTP 21 [erts-10.0] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]

[info] [swarm on vanguard@vanguard01.cloud.phx3.gdg] [tracker:init] started
[error]
** Cannot get connection id for node :"vanguard@vanguard01.cloud.phx3.gdg"

[warn] [libcluster:db] unable to connect to :"vanguard@vanguard01.cloud.phx3.gdg"
[warn] [libcluster:db] unable to connect to :"vanguard@vanguard02.cloud.phx3.gdg"
[info] Running VanguardWeb.Endpoint with Cowboy using http://0.0.0.0:4000
Interactive Elixir (1.6.6) - press Ctrl+C to exit (type h() ENTER for help)
iex(vanguard@vanguard01.cloud.phx3.gdg)1> 21:00:06 - info: compiled 6 files into 2 files, copied 3 in 739 ms
[info] [swarm on vanguard@vanguard01.cloud.phx3.gdg] [tracker:cluster_wait] joining cluster..
[info] [swarm on vanguard@vanguard01.cloud.phx3.gdg] [tracker:cluster_wait] no connected nodes, proceeding without sync

Given OTP 20.3.8, Elixir 1.6.6 and libcluster 3.0.1:

iex --name vanguard@vanguard01.cloud.phx3.gdg --cookie test -S mix phx.server
Erlang/OTP 20 [erts-9.3.3] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false]

[info] [swarm on vanguard@vanguard01.cloud.phx3.gdg] [tracker:init] started
[info] [libcluster:db] connected to :"vanguard@vanguard01.cloud.phx3.gdg"
[warn] [libcluster:db] unable to connect to :"vanguard@vanguard02.cloud.phx3.gdg"
[info] Running VanguardWeb.Endpoint with Cowboy using http://0.0.0.0:4000
Interactive Elixir (1.6.6) - press Ctrl+C to exit (type h() ENTER for help)
iex(vanguard@vanguard01.cloud.phx3.gdg)1> 21:47:42 - info: compiled 6 files into 2 files, copied 3 in 730 ms
[info] [swarm on vanguard@vanguard01.cloud.phx3.gdg] [tracker:cluster_wait] joining cluster..
[info] [swarm on vanguard@vanguard01.cloud.phx3.gdg] [tracker:cluster_wait] no connected nodes, proceeding without sync
beardedeagle changed the title from "Cannot get connection id for node" to "Cannot get connection id for node in OTP 21.0" on Jun 20, 2018
@starbelly commented Jun 20, 2018

This seems like suspect number one to me:

OTP-14370    Application(s): erts

               *** POTENTIAL INCOMPATIBILITY ***

               Truly asynchronous auto-connect. Earlier, when
               erlang:send was aimed toward an unconnected node, the
               function would not return until the connection setup
               had completed (or failed). Now the function returns
               directly after the message has been enqueued and the
               connection setup started.

               The same applies to all distributed operations that may
               trigger auto-connect, i.e. '!', send, link, monitor,
               monitor_node, exit/2 and group_leader.

               The interface for all these functions are unchanged as
               they do not return connection failures. The only
               exception is erlang:monitor where a *possible
               incompatibility* is introduced: An attempt to monitor a
               process on a primitive node (such as erl_interface or
               jinterface), where remote process monitoring is not
               implemented, will no longer fail with badarg exception.
               Instead a monitor will be created, but it will only
               supervise the connection to the node.

Relevant areas of code:

Erlang OTP/21 :
erlang/otp@f89fb92#diff-190f9182a0dcf029f6e8810c78e2fd1dR467

Libcluster:

def connect_nodes(topology, {_, _, _} = connect, {_, _, _} = list_nodes, nodes)
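
For reference, connect_nodes/4 is driven by MFA tuples. A simplified sketch of how a strategy invokes it, assuming this is Cluster.Strategy.connect_nodes/4 and using what I believe are the default MFAs (they may differ by version); node names are taken from the logs above:

# The strategy passes an MFA used to connect, an MFA used to list
# already-connected nodes, and the list of nodes it discovered.
topology   = :db
connect    = {:net_kernel, :connect_node, []}
list_nodes = {:erlang, :nodes, [:connected]}
nodes      = [:"vanguard@vanguard01.cloud.phx3.gdg", :"vanguard@vanguard02.cloud.phx3.gdg"]

Cluster.Strategy.connect_nodes(topology, connect, list_nodes, nodes)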

@bitwalker (Owner)

For those tuning in, looks like this is due to a regression of some kind in OTP 21 with the new auto-connect behavior. I need to do some investigating to see if I can get a repro case to open a bug, or figure out what, if anything, we need to do differently in libcluster.

@bitwalker (Owner)

I'm not able to reproduce this locally, @beardedeagle do you have a minimal working example I can try to reproduce with? I was going to try and open a bug today, but I haven't been able to find a way to trigger the error yet.

@beardedeagle (Author)

Let me get something thrown together

@beardedeagle (Author) commented Jun 20, 2018

Got busy at work, so I'll have to make the example tonight. The app I am seeing this error in uses Swarm and Mnesiam as well, which both require libcluster to be started before them. They are part of :extra_applications, so I wonder if the change to make libcluster start as part of the application's supervision tree may be part of the issue. Either way, I'll get to this tonight.

@starbelly commented Jun 20, 2018

Yeah, now that I've had some sleep, I'm gonna go with ... this is not a problem in libcluster or swarm. That error log message comes directly from net_kernel, and looking back over the code there it's pretty blatant: it just can't connect to the node, in this case itself. As to why, there could be a lot of reasons, but I don't think they are going to be related to libcluster and friends.

Auto-connect:
https://github.com/erlang/otp/blob/f89fb92384280e2939414287a2ecb8f86a199318/lib/kernel/src/net_kernel.erl#L453

Explicit connect:
https://github.com/erlang/otp/blob/f89fb92384280e2939414287a2ecb8f86a199318/lib/kernel/src/net_kernel.erl#L474

Of course, it's always possible that libcluster or swarm is prematurely triggering an auto-connect. But I would rule out other things first... Try firing up a plain node first (no libcluster, no swarm).
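
Something like this would be a minimal repro of the self-connect case, assuming that is indeed the trigger (exact behavior and return value may vary by OTP version):

# Start a plain named node with no libcluster or swarm, e.g.:
#   iex --name plain@127.0.0.1 --cookie test
# then ask it to connect to itself. On OTP 21.0 this appears to be the
# case that produces the "Cannot get connection id for node" log.
Node.connect(node())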

@beardedeagle (Author) commented Jun 21, 2018

https://github.com/beardedeagle/test_app

^ I am still, personally, reproducing the issue locally. Curiously, a separate issue popped up when using the Dockerfile; libcluster just plain fails to connect:

docker run --rm -p 4000:4000 -i -t 1e9199de90b1
Erlang/OTP 21 [erts-10.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]

01:02:49.273 [warn] [libcluster:test_app] unable to connect to :"test_app@test_app01.com"
01:02:49.287 [warn] [libcluster:test_app] unable to connect to :"test_app@test_app02.com"
01:02:49.290 [info] Running TestAppWeb.Endpoint with Cowboy using http://0.0.0.0:4000
Interactive Elixir (1.6.6) - press Ctrl+C to exit (type h() ENTER for help)
iex(test_app@127.0.0.1)1> node
:"test_app@127.0.0.1"

@starbelly

This is mostly an undocumented breaking change regarding the new auto_connect behavior:

https://bugs.erlang.org/browse/ERL-643

@flowerett (Contributor) commented Jun 21, 2018

Hey guys,
I just checked this out with my basic Phoenix app, and everything works:
https://github.com/flowerett/cluster21

From @beardedeagle's message above I can see that the node name is not properly configured; maybe that is the problem?
In the dummy app I use env vars and a docker-entrypoint for that, but it should be fine with vm.args as well.

=====
As to https://bugs.erlang.org/browse/ERL-643:

In OTP versions prior to 21.0, a node could "connect" to itself; in OTP 21, this results in an exception.

I assume that some strategies can end up trying to connect to the local node itself. In the DNSPoll strategy, node() is explicitly removed:
https://github.com/bitwalker/libcluster/blob/master/lib/strategy/dns_poll.ex#L127

@starbelly

I dunno... but I'm curious: what's the actual use case for connecting to yourself via Node.connect/1?

@bitwalker (Owner)

Ah, good catch on the self-connect thing. @starbelly this is usually unintentional, but results from sharing a config with a list of nodes which all need to connect to each other; if you don't explicitly remove Node.self from the list, then you will attempt to self-connect, which I guess prior to 21 was just ignored, but now is an error. We can fix this in libcluster.
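
Conceptually the fix is along these lines (a simplified sketch, not the actual diff; module and function names here are illustrative):

defmodule MyCluster.Connect do
  # Drop the local node from a shared node list before connecting, so we
  # never ask a node to connect to itself (which OTP 21 now treats as an error).
  def connect_all(nodes) do
    nodes
    |> Enum.reject(&(&1 == Node.self()))
    |> Enum.each(fn n ->
      unless Node.connect(n) == true do
        IO.puts("unable to connect to #{inspect(n)}")
      end
    end)
  end
end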

@bitwalker (Owner)

I've pushed a change which will prevent libcluster from ever connecting/disconnecting the current node, so hopefully that addresses the exception being raised in OTP 21. There is still the issue of Swarm needing an update to allow starting in your own supervision tree the same way libcluster does, but that's more an issue for Swarm than libcluster.

@beardedeagle I'll hold off on closing until you've had a chance to test

@beardedeagle (Author)

I'll pull it here shortly and test @bitwalker

@beardedeagle (Author)

Output from original app experiencing issue:

iex --name vanguard@vanguard01.gdg --cookie test -S mix phx.server
Erlang/OTP 21 [erts-10.0] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]

[info] [swarm on vanguard@vanguard01.gdg] [tracker:init] started
[warn] [libcluster:vanguard] unable to connect to :"vanguard@vanguard02.gdg"
[info] Running VanguardWeb.Endpoint with Cowboy using http://0.0.0.0:4000
Interactive Elixir (1.6.6) - press Ctrl+C to exit (type h() ENTER for help)
iex(vanguard@vanguard01.gdg)1> 15:04:49 - info: compiled 6 files into 2 files, copied 3 in 791 ms
[info] [swarm on vanguard@vanguard01.gdg] [tracker:cluster_wait] joining cluster..
[info] [swarm on vanguard@vanguard01.gdg] [tracker:cluster_wait] no connected nodes, proceeding without sync

Looks good to me. I'd say it's safe to close this issue @bitwalker and I'll get back to testing it some more.

@gilons commented Apr 28, 2019

I got the same error when I had a node with a hostname that is not routable, e.g. helper@192.168.45.169 where 192.168.45.169 is an IP that cannot be pinged. When I changed the host to a correct IP, I stopped getting the error.

@gilons commented Apr 29, 2019

But with an arbitrary node name like helper@machinename, everything works fine.

@webdeb commented May 8, 2019

I guess I am getting the same error here.

shutdown: failed to start child: Cluster.Strategy.Kubernetes
** (EXIT) exited in: :gen_server.call(:net_kernel, {:connect, :normal, :"karta@10-42-5-58.karta-dev.pod.cluster.local"}, :infinity)
** (EXIT) bad return value: {#PID<0.2154.0>, {:accept_pending, :nok_pending}}

These logs are from the 10.42.5.58 node, so libcluster is trying to connect to self().

Update: this was my fault; the POD_IP was not exported to the container.

@sescobb27 commented May 20, 2021

Hi there, I'm seeing a similar error in DataDog logs. Everything works as expected and the nodes are connected, but it logs the following message. I also noted that it is trying to connect to itself, because the log comes from the same node, cluster-1. I'm not sure how to get rid of it, as it is producing too much noise. Thank you.

Elixir: 1.11.3
Erlang: OTP 23
Libcluster: 3.2.2 (I upgraded to 3.3.0 but it's the same, though not as frequent)

config :libcluster,
  topologies: [
    k8s: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        mode: :ip,
        # elixir node name
        kubernetes_node_basename: "cluster",
        # k8s label
        kubernetes_selector: "app=cluster",
        polling_interval: 10_000,
        # which API to query to get the running pods
        kubernetes_ip_lookup_mode: :pods,
        kubernetes_service_name: "cluster"
      ]
    ]
  ]
** Cannot get connection id for node :"cluster@cluster-1.cluster.staging.svc.cluster.local"

@iamMrGaurav

@sescobb27 can you tell me how you resolved your issue?
