
Nodes can't communicate in static cluster #12484

Closed
Rotario opened this issue Feb 6, 2024 · 32 comments · Fixed by #12541

@Rotario

Rotario commented Feb 6, 2024

What happened?

I'm running a 2 core node cluster on fly.io
my cluster settings are:

  EMQX_CLUSTER__PROTO_DIST = "inet6_tcp"
  EMQX_CLUSTER__DISCOVERY_STRATEGY = "static"
  EMQX_CLUSTER__STATIC__SEEDS = "[emqx@xxxxx.vm.emqx.internal,emqx@xxxxxx.vm.emqx.internal]"
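
With the static strategy, each seed must exactly match a peer's full node name (EMQX_NODE__NAME), host part included. A minimal sketch with hypothetical hostnames:

```
  EMQX_NODE__NAME = "emqx@node1.vm.emqx.internal"   # hypothetical; must appear verbatim in every node's seeds list
  EMQX_CLUSTER__DISCOVERY_STRATEGY = "static"
  EMQX_CLUSTER__STATIC__SEEDS = "[emqx@node1.vm.emqx.internal,emqx@node2.vm.emqx.internal]"
```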

They're running fine most of the time (screenshot attached).

But when testing I've found that sometimes they get stuck and can't communicate and spit out the following logs:

2024-02-06T00:35:56Z app[6e82e56f2e3038] lhr [info]2024-02-06T00:35:56.990082+00:00 [error] event=connect_to_remote_server, peer=emqx@2874de3c1675d8.vm.emqx.internal, port=5369, reason=nxdomain
2024-02-06T00:35:56Z app[6e82e56f2e3038] lhr [info]2024-02-06T00:35:56.990243+00:00 [error] crasher: initial call: gen_rpc_client:init/1, pid: <0.7306.0>, registered_name: [], exit: {{badrpc,nxdomain},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,835}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}, ancestors: [gen_rpc_client_sup,gen_rpc_sup,<0.2123.0>], message_queue_len: 0, messages: [], links: [<0.2129.0>], dictionary: [], trap_exit: true, status: running, heap_size: 2586, stack_size: 28, reductions: 2710; neighbours:

As a result, if I subscribe on one node, I don't receive messages sent to the other node.
I've found a restart fixes this, but I can't deploy EMQX into production if this occurs.
What could be causing this, please? Is it the DNS resolution of the domain names from the cloud provider (docs here: https://fly.io/docs/networking/private-networking/)? If so, I'm waiting for #12467 to get merged so I can move to AAAA discovery, which might fix the issue.

What did you expect to happen?

I expected two nodes to communicate correctly using my cloud provider's private IP address resolution mechanisms

How can we reproduce it (as minimally and precisely as possible)?

Use static discovery and 2 seeds

Anything else we need to know?

No response

EMQX version

$ ./bin/emqx_ctl broker
sysdescr  : EMQX
version   : 5.4.1
datetime  : 2024-02-06T00:52:37.516464153+00:00
uptime    : 14 minutes, 54 seconds

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ uname -a
Linux 2874de3c1675d8 5.15.98-fly #g2626e66887 SMP Mon Jan 29 07:34:38 UTC 2024 x86_64 GNU/Linux

Log files

2024-02-06T00:35:56Z app[6e82e56f2e3038] lhr [info]2024-02-06T00:35:56.990082+00:00 [error] event=connect_to_remote_server, peer=emqx@2874de3c1675d8.vm.emqx.internal, port=5369, reason=nxdomain
2024-02-06T00:35:56Z app[6e82e56f2e3038] lhr [info]2024-02-06T00:35:56.990243+00:00 [error] crasher: initial call: gen_rpc_client:init/1, pid: <0.7306.0>, registered_name: [], exit: {{badrpc,nxdomain},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,835}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}, ancestors: [gen_rpc_client_sup,gen_rpc_sup,<0.2123.0>], message_queue_len: 0, messages: [], links: [<0.2129.0>], dictionary: [], trap_exit: true, status: running, heap_size: 2586, stack_size: 28, reductions: 2710; neighbours:
@Rotario Rotario added the BUG label Feb 6, 2024
@id
Collaborator

id commented Feb 6, 2024

Yes, nxdomain in the logs indicates that it's a DNS issue.
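
For anyone reproducing this: nxdomain surfaces as a resolver failure on the client side. A minimal Python sketch (stdlib only, hostname hypothetical) of what a failed AAAA lookup looks like:

```python
import socket

# "example.invalid" is reserved (RFC 2606) and never resolves, so this
# reliably reproduces an nxdomain-style failure for an AAAA (IPv6) lookup.
try:
    socket.getaddrinfo("example.invalid", 5369, family=socket.AF_INET6)
    print("resolved unexpectedly")
except socket.gaierror as err:
    print("resolution failed:", err)
```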

@id id added #triage/wait and removed BUG labels Feb 6, 2024
@Rotario
Author

Rotario commented Feb 6, 2024 via email

@Rotario Rotario closed this as completed Feb 6, 2024
@Rotario
Author

Rotario commented Feb 6, 2024

Hi - I've still got this issue. I've installed dnsutils on the production server (fly.io), and when I run
nslookup xxxx
the IP is resolved correctly.
However, I'm still sometimes getting this issue. I don't think EMQX supports AAAA records properly somehow.

Fly is pretty good at resolving these hostnames correctly. I think it might be an issue with EMQX

2024-02-06T16:56:21Z app[2874de3c1675d8] lhr [info]2024-02-06T16:56:21.032192+00:00 [error] event=connect_to_remote_server, peer=emqx@6e82e56f2e3038.vm.emqx.internal, port=5369, reason=nxdomain
2024-02-06T16:56:21Z app[2874de3c1675d8] lhr [info]2024-02-06T16:56:21.032304+00:00 [error] crasher: initial call: gen_rpc_client:init/1, pid: <0.11531.1>, registered_name: [], exit: {{badrpc,nxdomain},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,835}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}, ancestors: [gen_rpc_client_sup,gen_rpc_sup,<0.2123.0>], message_queue_len: 0, messages: [], links: [<0.2129.0>], dictionary: [], trap_exit: true, status: running, heap_size: 2586, stack_size: 28, reductions: 2712; neighbours:

@Rotario Rotario reopened this Feb 6, 2024
@kjellwinblad
Contributor

Hi,

Did you test with a version that includes the AAAA record fix (#12467)? I don't think that fix is included in version 5.4.1, as it only recently got merged into the master branch.

@Rotario
Author

Rotario commented Feb 7, 2024 via email

@ieQu1
Member

ieQu1 commented Feb 7, 2024

Not sure if the above PR will help:

  • It only affects DNS discovery strategy
  • It only adds a parameter to the configuration schema, and doesn't change anything in the underlying code.

@ieQu1
Member

ieQu1 commented Feb 7, 2024

Can you post full response of nslookup? My GPG key is https://keyserver.ubuntu.com/pks/lookup?search=488654DF3FED6FDE&fingerprint=on&op=index , if you don't want to expose the data publicly.

@Rotario
Author

Rotario commented Feb 7, 2024

Sure - from VM1

root@6e82e56f2e3038:/opt/emqx# nslookup 2874de3c1675d8.vm.emqx.internal
Server:		fdaa::3
Address:	fdaa::3#53

Name:	2874de3c1675d8.vm.emqx.internal
Address: fdaa:0:xxxx:xxxx:xxxx:4935:2

from VM2

root@2874de3c1675d8:/opt/emqx# nslookup 6e82e56f2e3038.vm.emqx.internal
Server:		fdaa::3
Address:	fdaa::3#53

Name:	6e82e56f2e3038.vm.emqx.internal
Address: fdaa:0:xxxx:xxx:xx:xxxx:e24:2

@ieQu1
Member

ieQu1 commented Feb 7, 2024

If I understand correctly, the problem is intermittent, since the nodes can communicate after restart. I assume that the IP addresses don't change. This suggests a temporary problem with the name resolution.

What distro are you running? Does it have a local DNS cache, like systemd-resolved? Also, what is the TTL for the AAAA record?

@Rotario
Author

Rotario commented Feb 8, 2024

Hi, thanks for your reply. It seems to be erroring pretty consistently now - I don't know if maybe a config change killed it?
I've not changed any of the cluster settings from those above.

The distro is the Fly machine Firecracker image - I don't know how to check the DNS cache. nslookup seems to resolve it fine. I'll check the TTL.

@Rotario
Author

Rotario commented Feb 8, 2024

I'm going to wait for the DNS discovery AAAA feature to be released and try that

@Rotario Rotario closed this as completed Feb 8, 2024
@Rotario
Author

Rotario commented Feb 9, 2024

Hi - I've got the master branch working now on the cloud provider

  EMQX_CLUSTER__DNS__NAME = "emqx.internal"
  EMQX_CLUSTER__DNS__RECORD_TYPE = "aaaa"
  EMQX_CLUSTER__PROTO_DIST = "inet6_tcp"
  EMQX_CLUSTER__DISCOVERY_STRATEGY = "dns"
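
(For reference, assuming EMQX's standard mapping of `EMQX_`-prefixed env vars onto emqx.conf, the settings above should correspond to this HOCON block:)

```
cluster {
  discovery_strategy = dns
  proto_dist = inet6_tcp
  dns {
    name = "emqx.internal"
    record_type = aaaa
  }
}
```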

And I'm still getting nxdomain issues - I think it's now interpreting this IPv6 address as a domain name? Does it need to be bracketed?

2024-02-09T15:36:12Z app[6e82e56f2e3038] lhr [info]2024-02-09T15:36:12.952226+00:00 [error] event=connect_to_remote_server, peer=emqx@fdaa:0:xxxxxxxxxxx:2, port=5369, reason=nxdomain
2024-02-09T15:36:12Z app[6e82e56f2e3038] lhr [info]2024-02-09T15:36:12.952315+00:00 [error] crasher: initial call: gen_rpc_client:init/1, pid: <0.12092.0>, registered_name: [], exit: {{badrpc,nxdomain},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,961}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]}, ancestors: [gen_rpc_client_sup,gen_rpc_sup,<0.2158.0>], message_queue_len: 0, messages: [], links: [<0.2164.0>], dictionary: [], trap_exit: true, status: running, heap_size: 2586, stack_size: 28, reductions: 2615; neighbours:

It did seem to be working fine

@Rotario Rotario reopened this Feb 9, 2024
@Rotario
Author

Rotario commented Feb 9, 2024

I've changed the node name from emqx@<ipv6 of machine> to emqx@<FQDN of machine>,
and now I'm getting

Cannot get connection id for node 'emqx@2874de3c1675d8.vm.emqx.internal'

@Rotario
Author

Rotario commented Feb 9, 2024

Update:

I found the RPC settings and set them to listen only on IPv6 (::).

Now it seems like the nxdomain issues are fixed. The error I get now is below.
Any ideas, please? It seems connected to the settings EMQX_HOST and EMQX_NODE__NAME, but I've played around and can't get this error to stop.

2024-02-09T17:03:56Z app[6e82e56f2e3038] lhr [info]2024-02-09T17:03:56.166004+00:00 [error] ** Cannot get connection id for node 'emqx@6e82e56f2e3038.vm.emqx.internal'
2024-02-09T17:04:02Z app[2874de3c1675d8] lhr [info]2024-02-09T17:04:02.986175+00:00 [error] ** Cannot get connection id for node 'emqx@2874de3c1675d8.vm.emqx.internal'

@ieQu1 ieQu1 self-assigned this Feb 9, 2024
@Rotario
Author

Rotario commented Feb 12, 2024

The cluster seems to be working fine, with messages and subscriptions passing through transparently, though "Cannot get connection id" is still being logged.

@ieQu1
Member

ieQu1 commented Feb 15, 2024

I recall that the "Cannot get connection id" message originates inside the Erlang runtime... I'll have to take a deep dive into the Erlang code to find the precise conditions that trigger it, and what the implications of that error are.

@Rotario
Author

Rotario commented Feb 18, 2024

It seems related to nodes trying to connect to themselves, which makes sense from the logs.
Maybe the auto DNS discovery mechanism with IPv6 makes a node try to connect to itself? Maybe some compare function that should return true (so that a node doesn't connect to itself) doesn't compare IPv6 addresses properly?

Searching on GitHub I can't find Node.connect anywhere, so I can't find the place where nodes connect to each other.

2024-02-18T18:09:01Z app[6e82e56f2e3038] lhr [info]2024-02-18T18:09:01.339236+00:00 [error] ** Cannot get connection id for node 'emqx@6e82e56f2e3038.vm.emqx.internal'
2024-02-18T18:09:08Z app[2874de3c1675d8] lhr [info]2024-02-18T18:09:08.114719+00:00 [error] ** Cannot get connection id for node 'emqx@2874de3c1675d8.vm.emqx.internal'

bitwalker/libcluster#70
https://github.com/mrluc/peerage/pull/17/files
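
The comparison hypothesis above is easy to illustrate: two textually different spellings of the same IPv6 address compare unequal as raw strings but equal once parsed. A Python sketch (addresses hypothetical):

```python
import ipaddress

a = "fdaa:0:e26a:a7b:8f:ff70:e24:2"      # compressed spelling
b = "fdaa:0:e26a:a7b:8f:ff70:0e24:0002"  # same address, zero-padded

print(a == b)                                              # naive string compare: False
print(ipaddress.ip_address(a) == ipaddress.ip_address(b))  # parsed compare: True
```

If a node-identity check compares addresses as strings, a self-connection can slip through.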

@zmstone
Member

zmstone commented Feb 19, 2024

Found this fix (in the links @Rotario shared above): erlang/otp#1870, but it was merged long ago, so EMQX 5.4.1 should already have it in place.
Maybe there are other call paths not covered.

@zmstone
Member

zmstone commented Feb 19, 2024

Hi again @Rotario
"Cannot get connection id" has happened before in IPv4 networks too, but I never managed to find the cause.
If you are open to running some debug tests, I can provide a patch plus steps to install it to help the investigation.
The patch will try to print the stacktrace when a node tries to self-connect.

@Rotario
Author

Rotario commented Feb 19, 2024

Yeah, of course - send it over and I can try to run it. It's the last hurdle before I start trying it in production.

@zmstone
Member

zmstone commented Feb 19, 2024

Hi @Rotario

Here is the beam file in a zipped dir: net_kernel.zip
SHA-256 sum of the beam file (not the zip): 44551fcb70c1d58a9e2ab8430a968d65de37552e1e5d7012dcf0cf6ddbefba0a
Code diff: zmstone/otp@41dcd06

Extract the file net_kernel.beam, and mount it to replace /opt/emqx/lib/kernel-8.5.4/ebin/net_kernel.beam
Or commit a new docker tag with this file replaced in the container.

This patch adds another log line after "Cannot get connection id" which should include more error context as well as the stacktrace.

@Rotario
Author

Rotario commented Feb 19, 2024

Thanks for your help @zmstone !

** Cannot get connection id for node 'emqx@6e82e56f2e3038.vm.emqx.internal'
error:badarg, [{erts_internal,new_connection,['emqx@6e82e56f2e3038.vm.emqx.internal'],[{error_info,#{module => erl_erts_errors}}]},{net_kernel,handle_info,2,[{file,"net_kernel.erl"},{line,1098}]},{gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]

@zmstone
Member

zmstone commented Feb 20, 2024

Thank you @Rotario.
The log narrowed it down a bit.
Now we know that it's logged when accepting a new connection on the Erlang distribution listener.
We might need to trace/debug the connection-initiator side, but I need to dig more into the code to find where to trace.

In the meantime, could you help test with this patch? dist-debug.zip
It includes more debug logs; see the commits here: https://github.com/zmstone/otp/commits/log-stacktrace-when-cannot-get-connection-id-happens/

Also would like to ask:

  • Does it happen if you start a single node?
  • How often do you see this logged?

Example (normal) logs from the patch:

v5.4.1(emqx@192.168.31.216)1> 110989263 'emqx@192.168.31.216':recv_name: 'N' name="emqx2@192.168.31.216" creation=4
{"MD5 connection from ~p~n",['emqx2@192.168.31.216']}
{net_kernel,accept_pending,'emqx@192.168.31.216','emqx2@192.168.31.216',undefined,[]}
110990781 'emqx@192.168.31.216':do_mark_pending(<0.2016.0>,'emqx@192.168.31.216','emqx2@192.168.31.216',55966662589) -> ok
{dist_util,<0.3250.0>,send_status,'emqx2@192.168.31.216',ok}
110991845 'emqx@192.168.31.216':send: 'N' challenge=780365816 creation=4
110994512 'emqx@192.168.31.216':recv_reply: challenge=631707742 digest=[6,93,82,221,42,134,231,21,140,166,246,
                                        161,49,183,233,112]
110994717 'emqx@192.168.31.216':sum = <<6,93,82,221,42,134,231,21,140,166,246,161,49,183,233,112>>
110994893 'emqx@192.168.31.216':send_ack: digest=<<25,210,44,53,117,211,215,113,115,210,118,160,255,201,32,218>>
{dist_util,<0.3250.0>,accept_connection,'emqx2@192.168.31.216'}
110995605 'emqx@192.168.31.216':setnode: node='emqx2@192.168.31.216' port=#Port<0.18> flags=55966662589(normal) creation=4

@zmstone
Member

zmstone commented Feb 20, 2024

Some more information for you to troubleshoot the network:
An EMQX node resolves the peer's port from the node name; in your case, both nodes should be listening on port 4370.
(A node named emqx1@hostname listens on port 4371, and so on.)
If node emqx@2874de3c1675d8.vm.emqx.internal tries to connect to 6e82e56f2e3038.vm.emqx.internal:4370 but ends up resolving 6e82e56f2e3038 to its own IP, or if the connection is looped back to itself due to misconfigured TCP routing/port forwarding, this might happen.
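
The port-from-name convention described above can be sketched like this (hypothetical helper, not EMQX's actual code; base port 4370 as stated):

```python
def dist_port(node_name: str, base: int = 4370) -> int:
    """Derive the expected distribution port from an emqxN@host node name."""
    name_part = node_name.split("@", 1)[0]   # e.g. "emqx" or "emqx1"
    suffix = name_part[len("emqx"):]         # "" for emqx, "1" for emqx1, ...
    return base + (int(suffix) if suffix else 0)

print(dist_port("emqx@6e82e56f2e3038.vm.emqx.internal"))  # 4370
print(dist_port("emqx1@hostname"))                        # 4371
```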

@Rotario
Author

Rotario commented Feb 20, 2024

Thanks Zaiming @zmstone
Yeah, I read about the port discovery - I hope the resolved IPs are consistently correct; whenever I've manually done nslookup on each node, the IPs resolve to the correct ones.

**Does it happen if you start a single node?**

I just ran one node. That single node still complains that it can't get a connection id for itself. Could this be due to the new IPv6 autodiscovery?

**How often do you see this logged?**

~Every 14s

Here's the log

135174674 'emqx@6e82e56f2e3038.vm.emqx.internal':send_name: 'N' node='emqx@6e82e56f2e3038.vm.emqx.internal' creation=4
135175366 'emqx@6e82e56f2e3038.vm.emqx.internal':recv_name: 'N' name="emqx@6e82e56f2e3038.vm.emqx.internal" creation=4
{"MD5 connection from ~p~n",['emqx@6e82e56f2e3038.vm.emqx.internal']}
{net_kernel,accept_pending,'emqx@6e82e56f2e3038.vm.emqx.internal','emqx@6e82e56f2e3038.vm.emqx.internal',undefined,[]}
135180949 'emqx@6e82e56f2e3038.vm.emqx.internal':do_mark_pending(<0.2062.0>,'emqx@6e82e56f2e3038.vm.emqx.internal','emqx@6e82e56f2e3038.vm.emqx.internal',55966662589) -> nok_pending
{dist_util,<0.3087.0>,send_status,'emqx@6e82e56f2e3038.vm.emqx.internal',nok}
{dist_util,<0.3085.0>,recv_status,'emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2',"nok"}
2024-02-20T10:07:19.818709+00:00 [error] ** Cannot get connection id for node 'emqx@6e82e56f2e3038.vm.emqx.internal'
2024-02-20T10:07:19.818859+00:00 [error] error:badarg, [{erts_internal,new_connection,['emqx@6e82e56f2e3038.vm.emqx.internal'],[{error_info,#{module => erl_erts_errors}}]},{net_kernel,handle_info,2,[{file,"net_kernel.erl"},{line,1099}]},{gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]

@zmstone
Member

zmstone commented Feb 20, 2024

For reference. This is what happens if I self-connect with a different name:

v5.4.1(emqx@192.168.31.216)1>  net_kernel:connect_node(node()). % self node is `emqx@192.168.31.216` so it should always return `true` -- as expected.
true
v5.4.1(emqx@192.168.31.216)2>  net_kernel:connect_node('emqx@local.host'). % local.host is added as a loopback address
20025380 'emqx@192.168.31.216':send_name: 'N' node='emqx@192.168.31.216' creation=4
20025933 'emqx@192.168.31.216':recv_name: 'N' name="emqx@192.168.31.216" creation=4
{"MD5 connection from ~p~n",['emqx@192.168.31.216']}
{net_kernel,accept_pending,'emqx@192.168.31.216','emqx@192.168.31.216',undefined,[]}
2024-02-20T10:20:05.500277+00:00 [error] ** Cannot get connection id for node 'emqx@192.168.31.216'
2024-02-20T10:20:05.500549+00:00 [error] error:badarg, [{erts_internal,new_connection,['emqx@192.168.31.216'],[{error_info,#{module => erl_erts_errors}}]},{net_kernel,handle_info,2,[{file,"net_kernel.erl"},{line,1099}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,1123}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1200}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]
20027524 'emqx@192.168.31.216':do_mark_pending(<0.2016.0>,'emqx@192.168.31.216','emqx@192.168.31.216',55966662589) -> nok_pending
{dist_util,<0.3196.0>,send_status,'emqx@192.168.31.216',nok}
{dist_util,<0.3194.0>,recv_status,'emqx@local.host',"nok"}

this line looks suspicious in your logs: {dist_util,<0.3085.0>,recv_status,'emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2',"nok"}

Some place in the code is trying to connect to this name 'emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2' every 14/15 seconds.

@zmstone
Member

zmstone commented Feb 20, 2024

@Rotario I guess fdaa:0:e26a:a7b:8f:ff70:e24:2 is 6e82e56f2e3038's address?
Is emqx@fdaa:0:e26a:a7b:8f:ff70:e24:2 by any chance configured in the static discovery node list ?
Could you share the output of this command (in the container): emqx eval 'application:get_all_env(ekka)'

@Rotario
Author

Rotario commented Feb 20, 2024

OK, could it maybe just be a hangover from switching discovery mechanisms?

emqx eval 'application:get_all_env(ekka)' gives

root@6e82e56f2e3038:/opt/emqx# emqx eval 'application:get_all_env(ekka)'
643702 'remsh_maint_emqx763@6e82e56f2e3038.vm.emqx.internal':send_name: 'N' node='remsh_maint_emqx763@6e82e56f2e3038.vm.emqx.internal' creation=4
{dist_util,<0.93.0>,recv_status,'emqx@6e82e56f2e3038.vm.emqx.internal',"ok"}
656404 'remsh_maint_emqx763@6e82e56f2e3038.vm.emqx.internal':recv: 'N' node="emqx@6e82e56f2e3038.vm.emqx.internal", challenge=3526690536 creation=4
657095 'remsh_maint_emqx763@6e82e56f2e3038.vm.emqx.internal':send_reply: challenge=785686415 digest=<<190,232,253,141,252,169,249,220,242,
                                         38,215,173,131,214,231,93>>
659050 'remsh_maint_emqx763@6e82e56f2e3038.vm.emqx.internal':recv_ack: digest=[19,143,161,52,113,63,105,18,229,8,223,64,237,51,219,50]
659608 'remsh_maint_emqx763@6e82e56f2e3038.vm.emqx.internal':sum = <<19,143,161,52,113,63,105,18,229,8,223,64,237,51,219,50>>
660181 'remsh_maint_emqx763@6e82e56f2e3038.vm.emqx.internal':setnode: node='emqx@6e82e56f2e3038.vm.emqx.internal' port=#Port<0.6> flags=55966662588(hidden) creation=4
[{proto_dist,inet6_tcp},
 {cluster_discovery,{dns,[{name,"emqx.internal"},{type,aaaa}]}},
 {{callback,stop},fun emqx_machine_boot:stop_apps/0},
 {cluster_name,emqxcl},
 {{callback,start},fun emqx_machine_boot:ensure_apps_started/0}]

I'll destroy the instances and volumes and recreate from scratch

@Rotario
Author

Rotario commented Feb 20, 2024

I've noticed the new nodes aren't discovering each other either - I have to run emqx ctl cluster join <node>. This isn't an issue right now; I can explore it later.

NOTE: There's no static discovery list - I think maybe the hostname is being resolved to an IP and that IP is being used as the node name? Which obviously isn't how the nodes are configured.

Yeah you're right fdaa:0:e26a:a7b:8f:ff70:e24:2 is 6e82e56f2e3038

More logs from new clean nodes

2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]321599835 'emqx@784e679bd37638.vm.emqx.internal':send_name: 'N' node='emqx@784e679bd37638.vm.emqx.internal' creation=4
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]321600674 'emqx@784e679bd37638.vm.emqx.internal':recv_name: 'N' name="emqx@784e679bd37638.vm.emqx.internal" creation=4
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]{"MD5 connection from ~p~n",['emqx@784e679bd37638.vm.emqx.internal']}
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]{net_kernel,accept_pending,'emqx@784e679bd37638.vm.emqx.internal','emqx@784e679bd37638.vm.emqx.internal',undefined,[]}
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]321606087 'emqx@784e679bd37638.vm.emqx.internal':do_mark_pending(<0.2062.0>,'emqx@784e679bd37638.vm.emqx.internal','emqx@784e679bd37638.vm.emqx.internal',55966662589) -> nok_pending
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]{dist_util,<0.3602.0>,send_status,'emqx@784e679bd37638.vm.emqx.internal',nok}
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]{dist_util,<0.3600.0>,recv_status,'emqx@fdaa:0:e26a:a7b:172:ce9:3967:2',"nok"}
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]2024-02-20T11:10:24.196529+00:00 [error] ** Cannot get connection id for node 'emqx@784e679bd37638.vm.emqx.internal'
2024-02-20T11:10:24Z app[784e679bd37638] lhr [info]2024-02-20T11:10:24.196691+00:00 [error] error:badarg, [{erts_internal,new_connection,['emqx@784e679bd37638.vm.emqx.internal'],[{error_info,#{module => erl_erts_errors}}]},{net_kernel,handle_info,2,[{file,"net_kernel.erl"},{line,1099}]},{gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]

@zmstone
Member

zmstone commented Feb 20, 2024

Ah, OK. Now I get it.
The node name should not use an FQDN as the host part if you use the DNS discovery strategy.
This is because DNS resolution returns a list of IP addresses, so the nodes will always connect to peers as emqx@IP.

You'll need to either use the static/manual strategy for node discovery,
or use DNS but assign static IPs to the containers (so the nodes get a stable name like emqx@IPv6).
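
One way to apply this on a provider like Fly.io - a hedged sketch, where <static-ipv6-of-this-vm> is the VM's own stable IPv6 address (on Fly this could come from the FLY_PRIVATE_IP runtime variable; verify for your setup):

```
  EMQX_NODE__NAME = "emqx@<static-ipv6-of-this-vm>"
  EMQX_CLUSTER__DISCOVERY_STRATEGY = "dns"
  EMQX_CLUSTER__DNS__NAME = "emqx.internal"
  EMQX_CLUSTER__DNS__RECORD_TYPE = "aaaa"
  EMQX_CLUSTER__PROTO_DIST = "inet6_tcp"
```

The key point is that the host part of the node name matches the addresses the DNS query returns.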

@Rotario
Author

Rotario commented Feb 20, 2024

Ah OK, brill - thank you! I'll set node names to the IP, not the FQDN.

@Rotario Rotario closed this as completed Feb 20, 2024
@zmstone
Member

zmstone commented Feb 20, 2024

@Rotario Thank you.
This debug session resolved a long-standing mystery for us.
