Cannot get connection id for node in OTP 21.0 #70

Closed
beardedeagle opened this issue Jun 20, 2018 · 19 comments

@beardedeagle

Creating issue for tracking.

Given OTP 21.0, Elixir 1.6.6 and libcluster 3.0.1:

iex --name vanguard@vanguard01.cloud.phx3.gdg --cookie test -S mix phx.server
Erlang/OTP 21 [erts-10.0] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]

[info] [swarm on vanguard@vanguard01.cloud.phx3.gdg] [tracker:init] started
[error]
** Cannot get connection id for node :"vanguard@vanguard01.cloud.phx3.gdg"

[warn] [libcluster:db] unable to connect to :"vanguard@vanguard01.cloud.phx3.gdg"
[warn] [libcluster:db] unable to connect to :"vanguard@vanguard02.cloud.phx3.gdg"
[info] Running VanguardWeb.Endpoint with Cowboy using http://0.0.0.0:4000
Interactive Elixir (1.6.6) - press Ctrl+C to exit (type h() ENTER for help)
iex(vanguard@vanguard01.cloud.phx3.gdg)1> 21:00:06 - info: compiled 6 files into 2 files, copied 3 in 739 ms
[info] [swarm on vanguard@vanguard01.cloud.phx3.gdg] [tracker:cluster_wait] joining cluster..
[info] [swarm on vanguard@vanguard01.cloud.phx3.gdg] [tracker:cluster_wait] no connected nodes, proceeding without sync

Given OTP 20.3.8, Elixir 1.6.6 and libcluster 3.0.1:

iex --name vanguard@vanguard01.cloud.phx3.gdg --cookie test -S mix phx.server
Erlang/OTP 20 [erts-9.3.3] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false]

[info] [swarm on vanguard@vanguard01.cloud.phx3.gdg] [tracker:init] started
[info] [libcluster:db] connected to :"vanguard@vanguard01.cloud.phx3.gdg"
[warn] [libcluster:db] unable to connect to :"vanguard@vanguard02.cloud.phx3.gdg"
[info] Running VanguardWeb.Endpoint with Cowboy using http://0.0.0.0:4000
Interactive Elixir (1.6.6) - press Ctrl+C to exit (type h() ENTER for help)
iex(vanguard@vanguard01.cloud.phx3.gdg)1> 21:47:42 - info: compiled 6 files into 2 files, copied 3 in 730 ms
[info] [swarm on vanguard@vanguard01.cloud.phx3.gdg] [tracker:cluster_wait] joining cluster..
[info] [swarm on vanguard@vanguard01.cloud.phx3.gdg] [tracker:cluster_wait] no connected nodes, proceeding without sync
beardedeagle changed the title from "Cannot get connection id for node" to "Cannot get connection id for node in OTP 21.0" on Jun 20, 2018
@starbelly commented Jun 20, 2018

This seems like suspect number one to me:

OTP-14370    Application(s): erts

               *** POTENTIAL INCOMPATIBILITY ***

               Truly asynchronous auto-connect. Earlier, when
               erlang:send was aimed toward an unconnected node, the
               function would not return until the connection setup
               had completed (or failed). Now the function returns
               directly after the message has been enqueued and the
               connection setup started.

               The same applies to all distributed operations that may
               trigger auto-connect, i.e. '!', send, link, monitor,
               monitor_node, exit/2 and group_leader.

               The interface for all these functions are unchanged as
               they do not return connection failures. The only
               exception is erlang:monitor where a *possible
               incompatibility* is introduced: An attempt to monitor a
               process on a primitive node (such as erl_interface or
               jinterface), where remote process monitoring is not
               implemented, will no longer fail with badarg exception.
               Instead a monitor will be created, but it will only
               supervise the connection to the node.

Relevant areas of code:

Erlang OTP/21 :
erlang/otp@f89fb92#diff-190f9182a0dcf029f6e8810c78e2fd1dR467

Libcluster:

def connect_nodes(topology, {_, _, _} = connect, {_, _, _} = list_nodes, nodes)
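
For reference, connect_nodes/4 is driven by MFA tuples. A simplified sketch of how a strategy invokes it, assuming this is Cluster.Strategy.connect_nodes/4 and using what I believe are the default MFAs (they may differ by version); node names are taken from the logs above:

# The strategy passes an MFA used to connect, an MFA used to list
# already-connected nodes, and the list of nodes it discovered.
topology   = :db
connect    = {:net_kernel, :connect_node, []}
list_nodes = {:erlang, :nodes, [:connected]}
nodes      = [:"vanguard@vanguard01.cloud.phx3.gdg", :"vanguard@vanguard02.cloud.phx3.gdg"]

Cluster.Strategy.connect_nodes(topology, connect, list_nodes, nodes)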

@bitwalker (Owner)

For those tuning in, looks like this is due to a regression of some kind in OTP 21 with the new auto-connect behavior. I need to do some investigating to see if I can get a repro case to open a bug, or figure out what, if anything, we need to do differently in libcluster.

@bitwalker (Owner)

I'm not able to reproduce this locally, @beardedeagle do you have a minimal working example I can try to reproduce with? I was going to try and open a bug today, but I haven't been able to find a way to trigger the error yet.

@beardedeagle (Author)

Let me get something thrown together

@beardedeagle (Author) commented Jun 20, 2018

Got busy at work, so I'll have to make the example tonight. The app I am seeing this error in uses Swarm and Mnesiam as well, which both require libcluster to be started before them. They are part of :extra_applications, so I wonder if the change to make libcluster start as part of the application's supervision tree may be part of the issue. Either way, I'll get to this tonight.

@starbelly commented Jun 20, 2018

Yeah, now that I've had some sleep, I'm gonna go with ... this is not a problem in libcluster or swarm. That error log message comes directly from net_kernel, and looking back over the code there it's pretty blatant: it just can't connect to the node, in this case itself. As to why, there could be a lot of reasons, but I don't think they are going to be related to libcluster and friends.

Auto-connect:
https://github.com/erlang/otp/blob/f89fb92384280e2939414287a2ecb8f86a199318/lib/kernel/src/net_kernel.erl#L453

Explicit connect:
https://github.com/erlang/otp/blob/f89fb92384280e2939414287a2ecb8f86a199318/lib/kernel/src/net_kernel.erl#L474

Of course, it's always possible that libcluster or swarm is prematurely triggering an auto-connect. But I would rule out other things first... Try firing up a plain node first (no libcluster, no swarm).
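
Something like this would be a minimal repro of the self-connect case, assuming that is indeed the trigger (exact behavior and return value may vary by OTP version):

# Start a plain named node with no libcluster or swarm, e.g.:
#   iex --name plain@127.0.0.1 --cookie test
# then ask it to connect to itself. On OTP 21.0 this appears to be the
# case that produces the "Cannot get connection id for node" log.
Node.connect(node())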

@beardedeagle (Author) commented Jun 21, 2018

https://github.com/beardedeagle/test_app

^ I am still, personally, reproducing the issue locally. Curiously, a separate issue popped up when using the Dockerfile; libcluster just plain fails to connect:

docker run --rm -p 4000:4000 -i -t 1e9199de90b1
Erlang/OTP 21 [erts-10.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]

01:02:49.273 [warn] [libcluster:test_app] unable to connect to :"test_app@test_app01.com"
01:02:49.287 [warn] [libcluster:test_app] unable to connect to :"test_app@test_app02.com"
01:02:49.290 [info] Running TestAppWeb.Endpoint with Cowboy using http://0.0.0.0:4000
Interactive Elixir (1.6.6) - press Ctrl+C to exit (type h() ENTER for help)
iex(test_app@127.0.0.1)1> node
:"test_app@127.0.0.1"

@starbelly

This is mostly an undocumented breaking change regarding the new auto_connect behavior:

https://bugs.erlang.org/browse/ERL-643

@flowerett (Contributor) commented Jun 21, 2018

Hey guys,
I just checked this out with my basic Phoenix app, and everything works:
https://github.com/flowerett/cluster21

From @beardedeagle's message above I can see that the node name is not properly configured; maybe that is the problem?
In the dummy app I use env vars and a docker-entrypoint for that, but it should be fine with vm.args as well.

=====
As to https://bugs.erlang.org/browse/ERL-643:

In OTP versions prior to 21.0, a node could "connect" to itself; in OTP 21, this results in an exception.

I assume that some strategies can end up trying to connect to the local node itself. In the DNSPoll strategy, node() is explicitly removed:
https://github.com/bitwalker/libcluster/blob/master/lib/strategy/dns_poll.ex#L127

@starbelly

I dunno... but I'm curious: what's the actual use case for connecting to yourself via Node.connect/1?

@bitwalker (Owner)

Ah, good catch on the self-connect thing. @starbelly this is usually unintentional, but results from sharing a config with a list of nodes which all need to connect to each other; if you don't explicitly remove Node.self from the list, then you will attempt to self-connect, which I guess prior to 21 was just ignored, but now is an error. We can fix this in libcluster.
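
Conceptually the fix is along these lines (a simplified sketch, not the actual diff; module and function names here are illustrative):

defmodule MyCluster.Connect do
  # Drop the local node from a shared node list before connecting, so we
  # never ask a node to connect to itself (which OTP 21 now treats as an error).
  def connect_all(nodes) do
    nodes
    |> Enum.reject(&(&1 == Node.self()))
    |> Enum.each(fn n ->
      unless Node.connect(n) == true do
        IO.puts("unable to connect to #{inspect(n)}")
      end
    end)
  end
end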

@bitwalker (Owner)

I've pushed a change which will prevent libcluster from ever connecting/disconnecting the current node, so hopefully that addresses the exception being raised in OTP 21. There is still the issue of Swarm needing an update to allow starting in your own supervision tree the same way libcluster does, but that's more an issue for Swarm than libcluster.

@beardedeagle I'll hold off on closing until you've had a chance to test

@beardedeagle (Author)

I'll pull it here shortly and test @bitwalker

@beardedeagle (Author)

Output from original app experiencing issue:

iex --name vanguard@vanguard01.gdg --cookie test -S mix phx.server
Erlang/OTP 21 [erts-10.0] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]

[info] [swarm on vanguard@vanguard01.gdg] [tracker:init] started
[warn] [libcluster:vanguard] unable to connect to :"vanguard@vanguard02.gdg"
[info] Running VanguardWeb.Endpoint with Cowboy using http://0.0.0.0:4000
Interactive Elixir (1.6.6) - press Ctrl+C to exit (type h() ENTER for help)
iex(vanguard@vanguard01.gdg)1> 15:04:49 - info: compiled 6 files into 2 files, copied 3 in 791 ms
[info] [swarm on vanguard@vanguard01.gdg] [tracker:cluster_wait] joining cluster..
[info] [swarm on vanguard@vanguard01.gdg] [tracker:cluster_wait] no connected nodes, proceeding without sync

Looks good to me. I'd say it's safe to close this issue @bitwalker and I'll get back to testing it some more.

@gilons commented Apr 28, 2019

I got the same error when I had a node with a hostname that is not routable, e.g. helper@192.168.45.169 where 192.168.45.169 is an IP that cannot be pinged. When I changed the host to a correct IP, I stopped getting the error.

@gilons commented Apr 29, 2019

But with an arbitrary node name like helper@machinename, everything works fine.

@webdeb commented May 8, 2019

I guess I am getting the same error here.

shutdown: failed to start child: Cluster.Strategy.Kubernetes
** (EXIT) exited in: :gen_server.call(:net_kernel, {:connect, :normal, :"karta@10-42-5-58.karta-dev.pod.cluster.local"}, :infinity)
** (EXIT) bad return value: {#PID<0.2154.0>, {:accept_pending, :nok_pending}}

These logs are from the 10.42.5.58 node, so libcluster is trying to connect to self().

Update: this was my fault; the POD_IP was not exported to the container.

@sescobb27 commented May 20, 2021

Hi there, I'm seeing a similar error in DataDog logs. Everything works as expected and the nodes are connected, but it logs the following message. I also noted that it is trying to connect to itself, because the log comes from the same node, cluster-1. I'm not sure how to get rid of it, as it is producing too much noise. Thank you.

Elixir: 1.11.3
Erlang: OTP 23
Libcluster: 3.2.2 (I upgraded to 3.3.0 but it's the same, though not as frequent)

config :libcluster,
  topologies: [
    k8s: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        mode: :ip,
        # elixir node name
        kubernetes_node_basename: "cluster",
        # k8s label
        kubernetes_selector: "app=cluster",
        polling_interval: 10_000,
        # which API to query to get the running pods
        kubernetes_ip_lookup_mode: :pods,
        kubernetes_service_name: "cluster"
      ]
    ]
  ]
** Cannot get connection id for node :"cluster@cluster-1.cluster.staging.svc.cluster.local"

@iamMrGaurav

@sescobb27 can you tell me how you resolved your issue?
