
Nodes can't communicate in static cluster #12484

Closed
Rotario opened this issue Feb 6, 2024 · 32 comments · Fixed by #12541

@Rotario

Rotario commented Feb 6, 2024

What happened?

I'm running a 2 core node cluster on fly.io
my cluster settings are:

  EMQX_CLUSTER__PROTO_DIST = "inet6_tcp"
  EMQX_CLUSTER__DISCOVERY_STRATEGY = "static"
  EMQX_CLUSTER__STATIC__SEEDS = "[emqx@xxxxx.vm.emqx.internal,emqx@xxxxxx.vm.emqx.internal]"
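
With the static strategy, each seed must exactly match a peer's full node name (EMQX_NODE__NAME), host part included. A minimal sketch with hypothetical hostnames:

```
  EMQX_NODE__NAME = "emqx@node1.vm.emqx.internal"   # hypothetical; must appear verbatim in every node's seeds list
  EMQX_CLUSTER__DISCOVERY_STRATEGY = "static"
  EMQX_CLUSTER__STATIC__SEEDS = "[emqx@node1.vm.emqx.internal,emqx@node2.vm.emqx.internal]"
```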

They're running fine most of the time (screenshot attached).

But when testing I've found that sometimes they get stuck and can't communicate and spit out the following logs:

2024-02-06T00:35:56Z app[6e82e56f2e3038] lhr [info]2024-02-06T00:35:56.990082+00:00 [error] event=connect_to_remote_server, peer=emqx@2874de3c1675d8.vm.emqx.internal, port=5369, reason=nxdomain
2024-02-06T00:35:56Z app[6e82e56f2e3038] lhr [info]2024-02-06T00:35:56.990243+00:00 [error] crasher: initial call: gen_rpc_client:init/1, pid: <0.7306.0>, registered_name: [], exit: {{badrpc,nxdomain},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,835}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}, ancestors: [gen_rpc_client_sup,gen_rpc_sup,<0.2123.0>], message_queue_len: 0, messages: [], links: [<0.2129.0>], dictionary: [], trap_exit: true, status: running, heap_size: 2586, stack_size: 28, reductions: 2710; neighbours:

As a result, if I subscribe on one node, I don't receive messages sent to the other node.
I've found a restart fixes this, but I can't deploy EMQX into production if this occurs.
What could be causing this, please? Is it the DNS resolution of the domain names from the cloud provider (docs here: https://fly.io/docs/networking/private-networking/)? If so, I'm waiting for #12467 to get merged so I can move to AAAA discovery, which might fix the issue.

What did you expect to happen?

I expected two nodes to communicate correctly using my cloud provider's private IP address resolution mechanisms

How can we reproduce it (as minimally and precisely as possible)?

Use static discovery and 2 seeds

Anything else we need to know?

No response

EMQX version

$ ./bin/emqx_ctl broker
sysdescr  : EMQX
version   : 5.4.1
datetime  : 2024-02-06T00:52:37.516464153+00:00
uptime    : 14 minutes, 54 seconds

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ uname -a
Linux 2874de3c1675d8 5.15.98-fly #g2626e66887 SMP Mon Jan 29 07:34:38 UTC 2024 x86_64 GNU/Linux

Log files

2024-02-06T00:35:56Z app[6e82e56f2e3038] lhr [info]2024-02-06T00:35:56.990082+00:00 [error] event=connect_to_remote_server, peer=emqx@2874de3c1675d8.vm.emqx.internal, port=5369, reason=nxdomain
2024-02-06T00:35:56Z app[6e82e56f2e3038] lhr [info]2024-02-06T00:35:56.990243+00:00 [error] crasher: initial call: gen_rpc_client:init/1, pid: <0.7306.0>, registered_name: [], exit: {{badrpc,nxdomain},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,835}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}, ancestors: [gen_rpc_client_sup,gen_rpc_sup,<0.2123.0>], message_queue_len: 0, messages: [], links: [<0.2129.0>], dictionary: [], trap_exit: true, status: running, heap_size: 2586, stack_size: 28, reductions: 2710; neighbours:
@Rotario Rotario added the BUG label Feb 6, 2024
@id
Collaborator

id commented Feb 6, 2024

Yes, nxdomain in the logs indicates that it's a DNS issue.
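
For anyone reproducing this: nxdomain surfaces as a resolver failure on the client side. A minimal Python sketch (stdlib only, hostname hypothetical) of what a failed AAAA lookup looks like:

```python
import socket

# "example.invalid" is reserved (RFC 2606) and never resolves, so this
# reliably reproduces an nxdomain-style failure for an AAAA (IPv6) lookup.
try:
    socket.getaddrinfo("example.invalid", 5369, family=socket.AF_INET6)
    print("resolved unexpectedly")
except socket.gaierror as err:
    print("resolution failed:", err)
```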

@id id added #triage/wait and removed BUG labels Feb 6, 2024
@Rotario
Author

Rotario commented Feb 6, 2024 via email

@Rotario Rotario closed this as completed Feb 6, 2024
@Rotario
Author

Rotario commented Feb 6, 2024

Hi - I've still got this issue. I've installed dnsutils on the production server (fly.io), and when I run
nslookup xxxx
the IP is resolved correctly.
However, I'm still sometimes getting this issue. I don't think EMQX supports AAAA records properly somehow.

Fly is pretty good at resolving these hostnames correctly. I think it might be an issue with EMQX

2024-02-06T16:56:21Z app[2874de3c1675d8] lhr [info]2024-02-06T16:56:21.032192+00:00 [error] event=connect_to_remote_server, peer=emqx@6e82e56f2e3038.vm.emqx.internal, port=5369, reason=nxdomain
2024-02-06T16:56:21Z app[2874de3c1675d8] lhr [info]2024-02-06T16:56:21.032304+00:00 [error] crasher: initial call: gen_rpc_client:init/1, pid: <0.11531.1>, registered_name: [], exit: {{badrpc,nxdomain},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,835}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}, ancestors: [gen_rpc_client_sup,gen_rpc_sup,<0.2123.0>], message_queue_len: 0, messages: [], links: [<0.2129.0>], dictionary: [], trap_exit: true, status: running, heap_size: 2586, stack_size: 28, reductions: 2712; neighbours:

@Rotario Rotario reopened this Feb 6, 2024
@kjellwinblad
Contributor

Hi,

Did you test with a version that includes the AAAA record fix (#12467)? I don't think that fix is included in version 5.4.1, as it only recently got merged into the master branch.

@Rotario
Author

Rotario commented Feb 7, 2024 via email

@ieQu1
Member

ieQu1 commented Feb 7, 2024

Not sure if the above PR will help:

  • It only affects DNS discovery strategy
  • It only adds a parameter to the configuration schema, and doesn't change anything in the underlying code.

@ieQu1
Member

ieQu1 commented Feb 7, 2024

Can you post full response of nslookup? My GPG key is https://keyserver.ubuntu.com/pks/lookup?search=488654DF3FED6FDE&fingerprint=on&op=index , if you don't want to expose the data publicly.

@Rotario
Author

Rotario commented Feb 7, 2024

Sure - from VM1

root@6e82e56f2e3038:/opt/emqx# nslookup 2874de3c1675d8.vm.emqx.internal
Server:		fdaa::3
Address:	fdaa::3#53

Name:	2874de3c1675d8.vm.emqx.internal
Address: fdaa:0:xxxx:xxxx:xxxx:4935:2

from VM2

root@2874de3c1675d8:/opt/emqx# nslookup 6e82e56f2e3038.vm.emqx.internal
Server:		fdaa::3
Address:	fdaa::3#53

Name:	6e82e56f2e3038.vm.emqx.internal
Address: fdaa:0:xxxx:xxx:xx:xxxx:e24:2

@ieQu1
Member

ieQu1 commented Feb 7, 2024

If I understand correctly, the problem is intermittent, since the nodes can communicate after restart. I assume that the IP addresses don't change. This suggests a temporary problem with the name resolution.

What distro are you running? Does it have a local DNS cache, like systemd-resolved? Also, what is the TTL for the AAAA record?

@Rotario
Author

Rotario commented Feb 8, 2024

Hi, thanks for your reply. It seems to be erroring pretty consistently now - I don't know if maybe a config change killed it?
I've not changed any of the cluster settings from those above.

The distro is the Fly machine Firecracker image - I don't know how to check the DNS cache. nslookup seems to resolve it fine. I'll check the TTL.

@Rotario
Author

Rotario commented Feb 8, 2024

I'm going to wait for the DNS discovery AAAA feature to be released and try that

@Rotario Rotario closed this as completed Feb 8, 2024
@Rotario
Author

Rotario commented Feb 9, 2024

Hi - I've got the master branch working now on the cloud provider

  EMQX_CLUSTER__DNS__NAME = "emqx.internal"
  EMQX_CLUSTER__DNS__RECORD_TYPE = "aaaa"
  EMQX_CLUSTER__PROTO_DIST = "inet6_tcp"
  EMQX_CLUSTER__DISCOVERY_STRATEGY = "dns"
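
(For reference, assuming EMQX's standard mapping of `EMQX_`-prefixed env vars onto emqx.conf, the settings above should correspond to this HOCON block:)

```
cluster {
  discovery_strategy = dns
  proto_dist = inet6_tcp
  dns {
    name = "emqx.internal"
    record_type = aaaa
  }
}
```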

And I'm still getting nxdomain issues - I think it's now interpreting this IPv6 address as a domain name? Does it need to be bracketed?

2024-02-09T15:36:12Z app[6e82e56f2e3038] lhr [info]2024-02-09T15:36:12.952226+00:00 [error] event=connect_to_remote_server, peer=emqx@fdaa:0:xxxxxxxxxxx:2, port=5369, reason=nxdomain
2024-02-09T15:36:12Z app[6e82e56f2e3038] lhr [info]2024-02-09T15:36:12.952315+00:00 [error] crasher: initial call: gen_rpc_client:init/1, pid: <0.12092.0>, registered_name: [], exit: {{badrpc,nxdomain},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,961}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]}, ancestors: [gen_rpc_client_sup,gen_rpc_sup,<0.2158.0>], message_queue_len: 0, messages: [], links: [<0.2164.0>], dictionary: [], trap_exit: true, status: running, heap_size: 2586, stack_size: 28, reductions: 2615; neighbours:

It did seem to be working fine

@Rotario Rotario reopened this Feb 9, 2024
@Rotario
Author

Rotario commented Feb 9, 2024

I've changed the node name from emqx@<ipv6 of machine> to emqx@<FQDN of machine>,
and now I'm getting

Cannot get connection id for node 'emqx@2874de3c1675d8.vm.emqx.internal'

@Rotario
Author

Rotario commented Feb 9, 2024

Update:

I found the RPC settings and set them to listen only on IPv6 (::).

Now it seems like the nxdomain issues are fixed. The error I get now is below.
Any ideas, please? It seems connected to the settings EMQX_HOST and EMQX_NODE__NAME, but I've played around and can't get this error to stop.

2024-02-09T17:03:56Z app[6e82e56f2e3038] lhr [info]2024-02-09T17:03:56.166004+00:00 [error] ** Cannot get connection id for node 'emqx@6e82e56f2e3038.vm.emqx.internal'
2024-02-09T17:04:02Z app[2874de3c1675d8] lhr [info]2024-02-09T17:04:02.986175+00:00 [error] ** Cannot get connection id for node 'emqx@2874de3c1675d8.vm.emqx.internal'

@ieQu1 ieQu1 self-assigned this Feb 9, 2024
@Rotario
Author

Rotario commented Feb 12, 2024

The cluster seems to be working fine, with messages and subscriptions passing through transparently, though "Cannot get connection id" is still being logged.

@ieQu1
Member

ieQu1 commented Feb 15, 2024

I recall that the "Cannot get connection id" message originates inside the Erlang runtime... I'll have to take a deep dive into the Erlang code to find the precise conditions that trigger it, and what the implications of that error are.

@Rotario
Author

Rotario commented Feb 18, 2024

It seems related to nodes trying to connect to themselves, which makes sense from the logs.
Maybe the auto DNS discovery mechanism with IPv6 makes a node try to connect to itself? Maybe some compare function that should return true (so that a node doesn't connect to itself) doesn't compare IPv6 addresses properly?

Searching on GitHub I can't find Node.connect anywhere, so I can't find the place where nodes connect to each other.

2024-02-18T18:09:01Z app[6e82e56f2e3038] lhr [info]2024-02-18T18:09:01.339236+00:00 [error] ** Cannot get connection id for node 'emqx@6e82e56f2e3038.vm.emqx.internal'
2024-02-18T18:09:08Z app[2874de3c1675d8] lhr [info]2024-02-18T18:09:08.114719+00:00 [error] ** Cannot get connection id for node 'emqx@2874de3c1675d8.vm.emqx.internal'

bitwalker/libcluster#70
https://github.com/mrluc/peerage/pull/17/files
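
The comparison hypothesis above is easy to illustrate: two textually different spellings of the same IPv6 address compare unequal as raw strings but equal once parsed. A Python sketch (addresses hypothetical):

```python
import ipaddress

a = "fdaa:0:e26a:a7b:8f:ff70:e24:2"      # compressed spelling
b = "fdaa:0:e26a:a7b:8f:ff70:0e24:0002"  # same address, zero-padded

print(a == b)                                              # naive string compare: False
print(ipaddress.ip_address(a) == ipaddress.ip_address(b))  # parsed compare: True
```

If a node-identity check compares addresses as strings, a self-connection can slip through.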

@zmstone
Member

zmstone commented Feb 19, 2024

Found this fix (in the links @Rotario shared above): erlang/otp#1870, but it was merged long ago, so EMQX 5.4.1 should already have it in place.
Maybe there are other call paths not covered.

@zmstone
Member

zmstone commented Feb 19, 2024

Hi again @Rotario
"Cannot get connection id" has happened before in IPv4 networks too, but I never managed to find the cause.
If you are open to running some debug tests, I can provide a patch plus steps to install it to help the investigation.
The patch will try to print the stacktrace when a node tries to self-connect.

@Rotario
Author

Rotario commented Feb 19, 2024

Yeah, of course - send it over and I can try to run it. It's the last hurdle before I start trying it in production.

@zmstone
Member

zmstone commented Feb 19, 2024

Hi @Rotario

Here is the beam file in a zipped dir: net_kernel.zip
SHA-256 sum of the beam file (not the zip): 44551fcb70c1d58a9e2ab8430a968d65de37552e1e5d7012dcf0cf6ddbefba0a
Code diff: zmstone/otp@41dcd06

Extract the file net_kernel.beam, and mount it to replace /opt/emqx/lib/kernel-8.5.4/ebin/net_kernel.beam
Or commit a new docker tag with this file replaced in the container.

This patch adds another log line after "Cannot get connection id" which should include more error context as well as the stacktrace.

@Rotario
Author

Rotario commented Feb 19, 2024

Thanks for your help @zmstone !

** Cannot get connection id for node 'emqx@6e82e56f2e3038.vm.emqx.internal'
error:badarg, [{erts_internal,new_connection,['emqx@6e82e56f2e3038.vm.emqx.internal'],[{error_info,#{module => erl_erts_errors}}]},{net_kernel,handle_info,2,[{file,"net_kernel.erl"},{line,1098}]},{gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]

@zmstone
Member

zmstone commented Feb 20, 2024

Thank you @Rotario.
The log narrowed it down a bit.
Now we know that it's logged when accepting a new connection on the Erlang distribution listener.
We might need to trace/debug the connection-initiator side, but I need to dig more into the code to find where to trace.

In the meantime, could you help test with this patch? dist-debug.zip
It includes more debug logs; see the commits here: https://github.com/zmstone/otp/commits/log-stacktrace-when-cannot-get-connection-id-happens/

Also would like to ask:

  • Does it happen if you start a single node?
  • How often do you see this logged?

Example (normal) logs from the patch:

v5.4.1(emqx@192.168.31.216)1> 110989263 'emqx@192.168.31.216':recv_name: 'N' name="emqx2@192.168.31.216" creation=4
{"MD5 connection from ~p~n",['emqx2@192.168.31.216']}
{net_kernel,accept_pending,'emqx@192.168.31.216','emqx2@192.168.31.216',undefined,[]}
110990781 'emqx@192.168.31.216':do_mark_pending(<0.2016.0>,'emqx@192.168.31.216','emqx2@192.168.31.216',55966662589) -> ok
{dist_util,<0.3250.0>,send_status,'emqx2@192.168.31.216',ok}
110991845 'emqx@192.168.31.216':send: 'N' challenge=780365816 creation=4
110994512 'emqx@192.168.31.216':recv_reply: challenge=631707742 digest=[6,93,82,221,42,134,231,21,140,166,246,
                                        161,49,183,233,112]
110994717 'emqx@192.168.31.216':sum = <<6,93,82,221,42,134,231,21,140,166,246,161,49,183,233,112>>
110994893 'emqx@192.168.31.216':send_ack: digest=<<25,210,44,53,117,211,215,113,115,210,118,160,255,201,32,218>>
{dist_util,<0.3250.0>,accept_connection,'emqx2@192.168.31.216'}
110995605 'emqx@192.168.31.216':setnode: node='emqx2@192.168.31.216' port=#Port<0.18> flags=55966662589(normal) creation=4

@zmstone
Member

zmstone commented Feb 20, 2024

Some more information for you to troubleshoot the network:
An EMQX node resolves the peer's port from the node name; in your case, both nodes should be listening on port 4370.
(A node named emqx1@hostname listens on port 4371, and so on.)
If node emqx@2874de3c1675d8.vm.emqx.internal tries to connect to 6e82e56f2e3038.vm.emqx.internal:4370 but ends up resolving 6e82e56f2e3038 to its own IP, or if the connection is looped back to itself due to misconfigured TCP routing/port forwarding, this might happen.
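
The port-from-name convention described above can be sketched like this (hypothetical helper, not EMQX's actual code; base port 4370 as stated):

```python
def dist_port(node_name: str, base: int = 4370) -> int:
    """Derive the expected distribution port from an emqxN@host node name."""
    name_part = node_name.split("@", 1)[0]   # e.g. "emqx" or "emqx1"
    suffix = name_part[len("emqx"):]         # "" for emqx, "1" for emqx1, ...
    return base + (int(suffix) if suffix else 0)

print(dist_port("emqx@6e82e56f2e3038.vm.emqx.internal"))  # 4370
print(dist_port("emqx1@hostname"))                        # 4371
```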

@Rotario
Author

Rotario commented Feb 20, 2024

Thanks Zaiming @zmstone
Yeah, I read about the port discovery - I hope the resolved IPs are consistently correct; whenever I've manually done nslookup on each node, the IPs resolve to the correct ones.

**Does it happen if you start a single node?**

I just ran one node. That single node still complains that it can't get a connection id for itself. Could this be due to the new IPv6 autodiscovery?

**How often do you see this logged?**

~Every 14s

Here's the log

135174674 'emqx@6e82e56f2e3038.vm.emqx.internal':send_name: 'N' node='emqx@6e82e56f2e3038.vm.emqx.internal' creation=4
135175366 'emqx@6e82e56f2e3038.vm.emqx.internal':recv_name: 'N' name="emqx@6e82e56f2e3038.vm.emqx.internal" creation=4
{"MD5 connection from ~p~n",['emqx@6e82e56f2e3038.vm.emqx.internal']}
{net_kernel,accept_pending,'emqx@6e82e56f2e3038.vm.emqx.internal','emqx@6e82e56f2e3038.vm.emqx.internal',undefined,[]}
135180949 'emqx@6e82e56f2e3038.vm.emqx.internal':do_mark_pending(<0.2062.0>,'emqx@6e82e56f2e3038.vm.emqx.internal','emqx@6e82e56f2e3038.vm.emqx.internal',55966662589) -> nok_pending
{dist_util,<0.3087.0>,send_status,'emqx@6e82e56f2e3038.vm.emqx.internal',nok}
{dist_util,<0.3085.0>,recv_status,'emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2',"nok"}
2024-02-20T10:07:19.818709+00:00 [error] ** Cannot get connection id for node 'emqx@6e82e56f2e3038.vm.emqx.internal'
2024-02-20T10:07:19.818859+00:00 [error] error:badarg, [{erts_internal,new_connection,['emqx@6e82e56f2e3038.vm.emqx.internal'],[{error_info,#{module => erl_erts_errors}}]},{net_kernel,handle_info,2,[{file,"net_kernel.erl"},{line,1099}]},{gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]

@zmstone
Member

zmstone commented Feb 20, 2024

For reference. This is what happens if I self-connect with a different name:

v5.4.1(emqx@192.168.31.216)1>  net_kernel:connect_node(node()). % self node is `emqx@192.168.31.216` so it should always return `true` -- as expected.
true
v5.4.1(emqx@192.168.31.216)2>  net_kernel:connect_node('emqx@local.host'). % local.host is added as a loopback address
20025380 'emqx@192.168.31.216':send_name: 'N' node='emqx@192.168.31.216' creation=4
20025933 'emqx@192.168.31.216':recv_name: 'N' name="emqx@192.168.31.216" creation=4
{"MD5 connection from ~p~n",['emqx@192.168.31.216']}
{net_kernel,accept_pending,'emqx@192.168.31.216','emqx@192.168.31.216',undefined,[]}
2024-02-20T10:20:05.500277+00:00 [error] ** Cannot get connection id for node 'emqx@192.168.31.216'
2024-02-20T10:20:05.500549+00:00 [error] error:badarg, [{erts_internal,new_connection,['emqx@192.168.31.216'],[{error_info,#{module => erl_erts_errors}}]},{net_kernel,handle_info,2,[{file,"net_kernel.erl"},{line,1099}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,1123}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1200}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]
20027524 'emqx@192.168.31.216':do_mark_pending(<0.2016.0>,'emqx@192.168.31.216','emqx@192.168.31.216',55966662589) -> nok_pending
{dist_util,<0.3196.0>,send_status,'emqx@192.168.31.216',nok}
{dist_util,<0.3194.0>,recv_status,'emqx@local.host',"nok"}

this line looks suspicious in your logs: {dist_util,<0.3085.0>,recv_status,'emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2',"nok"}

Some place in the code is trying to connect to this name 'emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2' every 14/15 seconds.

@zmstone
Member

zmstone commented Feb 20, 2024

@Rotario I guess fdaa:0:e26a:a7b:8f:ff70:e24:2 is 6e82e56f2e3038's address?
Is emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2 by any chance configured in the static discovery node list ?
Could you share the output of this command (in the container): emqx eval 'application:get_all_env(ekka)'

@Rotario
Author

Rotario commented Feb 20, 2024

OK, could it maybe just be a hangover from switching discovery mechanisms?

emqx eval 'application:get_all_env(ekka)' gives

root@6e82e56f2e3038:/opt/emqx# emqx eval 'application:get_all_env(ekka)'
643702 'remsh_maint_emqx763@6e82e56f2e3038.vm.emqx.internal':send_name: 'N' node='remsh_maint_emqx763@6e82e56f2e3038.vm.emqx.internal' creation=4
{dist_util,<0.93.0>,recv_status,'emqx@6e82e56f2e3038.vm.emqx.internal',"ok"}
656404 'remsh_maint_emqx763@6e82e56f2e3038.vm.emqx.internal':recv: 'N' node="emqx@6e82e56f2e3038.vm.emqx.internal", challenge=3526690536 creation=4
657095 'remsh_maint_emqx763@6e82e56f2e3038.vm.emqx.internal':send_reply: challenge=785686415 digest=<<190,232,253,141,252,169,249,220,242,
                                         38,215,173,131,214,231,93>>
659050 'remsh_maint_emqx763@6e82e56f2e3038.vm.emqx.internal':recv_ack: digest=[19,143,161,52,113,63,105,18,229,8,223,64,237,51,219,50]
659608 'remsh_maint_emqx763@6e82e56f2e3038.vm.emqx.internal':sum = <<19,143,161,52,113,63,105,18,229,8,223,64,237,51,219,50>>
660181 'remsh_maint_emqx763@6e82e56f2e3038.vm.emqx.internal':setnode: node='emqx@6e82e56f2e3038.vm.emqx.internal' port=#Port<0.6> flags=55966662588(hidden) creation=4
[{proto_dist,inet6_tcp},
 {cluster_discovery,{dns,[{name,"emqx.internal"},{type,aaaa}]}},
 {{callback,stop},fun emqx_machine_boot:stop_apps/0},
 {cluster_name,emqxcl},
 {{callback,start},fun emqx_machine_boot:ensure_apps_started/0}]

I'll destroy the instances and volumes and recreate from scratch

@Rotario
Author

Rotario commented Feb 20, 2024

I've noticed the new nodes aren't discovering each other either - I have to run emqx ctl cluster join <node>. This isn't an issue right now; I can explore it later.

NOTE: There's no static discovery list - I think maybe the hostname is being resolved to an IP and that IP is being used as the node name? Which obviously isn't how the nodes are configured.

Yeah you're right fdaa:0:e26a:a7b:8f:ff70:e24:2 is 6e82e56f2e3038

More logs from new clean nodes

2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]321599835 'emqx@784e679bd37638.vm.emqx.internal':send_name: 'N' node='emqx@784e679bd37638.vm.emqx.internal' creation=4
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]321600674 'emqx@784e679bd37638.vm.emqx.internal':recv_name: 'N' name="emqx@784e679bd37638.vm.emqx.internal" creation=4
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]{"MD5 connection from ~p~n",['emqx@784e679bd37638.vm.emqx.internal']}
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]{net_kernel,accept_pending,'emqx@784e679bd37638.vm.emqx.internal','emqx@784e679bd37638.vm.emqx.internal',undefined,[]}
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]321606087 'emqx@784e679bd37638.vm.emqx.internal':do_mark_pending(<0.2062.0>,'emqx@784e679bd37638.vm.emqx.internal','emqx@784e679bd37638.vm.emqx.internal',55966662589) -> nok_pending
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]{dist_util,<0.3602.0>,send_status,'emqx@784e679bd37638.vm.emqx.internal',nok}
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]{dist_util,<0.3600.0>,recv_status,'emqx@fdaa:0:e26a:a7b:172:ce9:3967:2',"nok"}
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]2024-02-20T11:10:24.196529+00:00 [error] ** Cannot get connection id for node 'emqx@784e679bd37638.vm.emqx.internal'
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]2024-02-20T11:10:24.196691+00:00 [error] error:badarg, [{erts_internal,new_connection,['emqx@784e679bd37638.vm.emqx.internal'],[{error_info,#{module => erl_erts_errors}}]},{net_kernel,handle_info,2,[{file,"net_kernel.erl"},{line,1099}]},{gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]

@zmstone
Member

zmstone commented Feb 20, 2024

Ah, OK. Now I get it.
The node name should not use an FQDN as the host part if you use the DNS discovery strategy.
This is because DNS resolution returns a list of IP addresses, so the nodes will always connect to peers as emqx@IP.

You'll need to either use the static/manual strategy for node discovery,
or use DNS but assign static IPs to the containers (so the nodes get a stable name like emqx@IPv6).
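
One way to apply this on a provider like Fly.io - a hedged sketch, where <static-ipv6-of-this-vm> is the VM's own stable IPv6 address (on Fly this could come from the FLY_PRIVATE_IP runtime variable; verify for your setup):

```
  EMQX_NODE__NAME = "emqx@<static-ipv6-of-this-vm>"
  EMQX_CLUSTER__DISCOVERY_STRATEGY = "dns"
  EMQX_CLUSTER__DNS__NAME = "emqx.internal"
  EMQX_CLUSTER__DNS__RECORD_TYPE = "aaaa"
  EMQX_CLUSTER__PROTO_DIST = "inet6_tcp"
```

The key point is that the host part of the node name matches the addresses the DNS query returns.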

@Rotario
Author

Rotario commented Feb 20, 2024

Ah OK, brill - thank you! I'll set node names to the IP, not the FQDN.

@Rotario Rotario closed this as completed Feb 20, 2024
@zmstone
Member

zmstone commented Feb 20, 2024

@Rotario Thank you.
This debug session resolved a long-standing mystery for us.
