Nodes can't communicate in static cluster #12484
Comments
Yes, nxdomain in the logs indicates that it's a DNS issue. |
Cool thanks I'll close this and investigate
…On Tue, 6 Feb 2024, 07:07 Ivan Dyachkov, ***@***.***> wrote:
Yes, nxdomain in the logs indicates that it's a DNS issue.
|
Hi - I've still got this issue, I've installed dnsutils on the production server (fly.io) and when I run Fly is pretty good at resolving these hostnames correctly. I think it might be an issue with EMQX
|
Hi, did you test with a version that includes the AAAA record fix: #12467? I don't think that fix is included in version 5.4.1, as it was only recently merged into the master branch. |
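For anyone following along, a quick way to see whether a hostname has A records, AAAA records, or neither is sketched below. It is only an illustration using Python's standard `socket.getaddrinfo`; the function names `resolve` and `record_summary` are made up for this example and have nothing to do with how EMQX performs discovery.

```python
import socket


def resolve(host, family, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated addresses of `host` for one address family.

    An empty list means the lookup failed (for example with NXDOMAIN) or
    returned no addresses of that family.
    """
    try:
        infos = resolver(host, None, family)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def record_summary(host, resolver=socket.getaddrinfo):
    """Summarise what resolves for `host` over IPv4 (A) and IPv6 (AAAA)."""
    return {
        "A": resolve(host, socket.AF_INET, resolver),
        "AAAA": resolve(host, socket.AF_INET6, resolver),
    }
```

Calling `record_summary("<your node hostname>")` on each machine shows whether the name only resolves over IPv6, which is the case the AAAA fix is meant to cover. The `resolver` parameter exists only so the lookup can be swapped out in tests.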
I'm running from a docker image. Can I just copy the 5.5 Docker file and
replace the emqx version with master?
…On Wed, 7 Feb 2024, 10:06 Kjell Winblad, ***@***.***> wrote:
Hi,
Did you test with a version that includes the AAAA record fix: #12467? I don't think that fix is
included in version 5.4.1 as it just recently got merged into the master
branch.
|
Not sure if the above PR will help:
|
Can you post full response of nslookup? My GPG key is https://keyserver.ubuntu.com/pks/lookup?search=488654DF3FED6FDE&fingerprint=on&op=index , if you don't want to expose the data publicly. |
Sure - from VM1
from VM2
|
If I understand correctly, the problem is intermittent, since the nodes can communicate after restart. I assume that the IP addresses don't change. This suggests a temporary problem with the name resolution. What distro are you running? Does it have a local DNS cache, like |
Hi, thanks for your reply. It seems to be erroring pretty consistently now. I don't know if maybe a config change killed it? The distro is the Fly machine (Firecracker), and I don't know how to check the DNS cache. nslookup seems to resolve it fine. I'll check the TTL. |
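To tell an intermittent resolution problem from a consistent one, the lookup can be repeated a few times and the failures recorded. The sketch below is an assumption-laden illustration: `probe_resolution` is a hypothetical helper, it uses the system resolver through Python's `socket.getaddrinfo`, and it may not take the same resolution path as the Erlang VM inside EMQX.

```python
import socket
import time


def probe_resolution(host, attempts=5, delay=1.0,
                     resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve `host` several times and return the failed attempts.

    The result is a list of (attempt_number, error_text) tuples; an empty
    list means every attempt resolved.
    """
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            resolver(host, None)
        except socket.gaierror as exc:
            failures.append((attempt, str(exc)))
        # Pause between attempts, but not after the last one.
        if attempt < attempts:
            sleep(delay)
    return failures
```

If some attempts fail and others succeed, that points at a flaky resolver or cache; if every attempt succeeds while EMQX still logs nxdomain, the name EMQX is looking up is probably not the one being tested here.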
I'm going to wait for the DNS discovery AAAA feature to be released and try that |
Hi - I've got the master branch working now on the cloud provider
And I'm still getting nxdomain issues. I think it's now interpreting this IPv6 address as a domain name? Does it need to be bracketed?
It did seem to be working fine |
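As a rough way to check what kind of host a node name carries, here is a small sketch. It relies on Python's `ipaddress` module purely for illustration; `classify_node_host` is a hypothetical helper and says nothing about how EMQX or the Erlang distribution layer actually parse node names.

```python
import ipaddress


def classify_node_host(node_name):
    """Split 'name@host' and report whether host is an IP literal or a hostname."""
    name, sep, host = node_name.partition("@")
    if not sep or not name or not host:
        raise ValueError(f"expected name@host, got {node_name!r}")
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not a valid IPv4 or IPv6 literal, so it would need a DNS lookup.
        return "hostname"
    return "ipv6" if ip.version == 6 else "ipv4"
```

Anything classified as `hostname` has to go through name resolution, so an address that is not recognised as an IP literal would end up being looked up in DNS, which matches the nxdomain symptom described above.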
I've changed the node name from emqx@ipv6_of_machine to emqx@FQDN of machine
|
Update: I found the RPC settings and set it to listen only on IPv6, and now it seems the nxdomain issues are fixed. The error I get now is this:
|
The cluster seems to be working fine, with messages and subscriptions passing through transparently. Though the |
I recall that |
It seems related to nodes trying to connect to themselves, which makes sense from the logs. Searching on GitHub I can't find Node.connect anywhere, so I can't find the place where nodes connect to each other.
bitwalker/libcluster#70 |
Found this fix (in the links @Rotario shared above): erlang/otp#1870. It was merged long ago, though, so EMQX 5.4.1 should already have the fix in place. |
Hi again @Rotario |
Yeah of course, send it over and I can try to run it. It's the last thing I hope before I start trying it in production |
Hi @Rotario Here is the beam file in a zipped dir: net_kernel.zip Extract the file This patch adds another log line after "Cannot get connection id" which should include more error context as well as the stacktrace. |
Thanks for your help @zmstone !
|
Thank you @Rotario. In the meantime, could you help to test with this patch? dist-debug.zip. Also would like to ask:
Example (normal) logs from the patch:
|
Some more information for you to troubleshoot the network: |
Thanks Zaiming @zmstone
I just ran one node. That single node still complains that it can't get a connection id for itself. Could this be due to the new IPv6 autodiscovery? Here's the log
|
For reference. This is what happens if I self-connect with a different name:
this line looks suspicious in your logs: Some place in the code is trying to connect to this name |
@Rotario I guess |
Ok could it maybe just be a hangover from switching discovery mechanisms? emqx eval 'application:get_all_env(ekka)' gives
I'll destroy the instances and volumes and recreate from scratch |
I've noticed the new nodes aren't discovering each other either. I have to run NOTE: There's no static discovery list - I think maybe the hostname is being resolved to an IP and that's being used as the node name? Which obviously the nodes aren't configured as. Yeah you're right More logs from new clean nodes
|
Ah. ok. You'll need to either use static/manual strategy for node discovery, |
Ah OK, brill, thank you! So: set the node names to the IP, not the FQDN. |
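The mismatch discussed above can be sketched as follows, under the assumption (taken from this thread, not from the EMQX source) that DNS discovery turns each resolved address into a node name of the form `emqx@<ip>`. The helper names are hypothetical.

```python
import ipaddress


def node_names_from_ips(ips, name="emqx"):
    """Build the node names discovery would try, assuming the form name@ip."""
    # Normalise each address so equivalent IPv6 spellings compare equal.
    return {f"{name}@{ipaddress.ip_address(ip).compressed}" for ip in ips}


def is_discoverable(local_node, ips, name="emqx"):
    """Return True if the node's own configured name is among the discovered names."""
    return local_node in node_names_from_ips(ips, name)
```

With a node configured as `emqx@<FQDN>`, `is_discoverable` returns `False` for any list of resolved addresses, which is consistent with the new nodes not finding each other until the node names were switched to the IP.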
@Rotario Thank you. |
What happened?
I'm running a 2 core node cluster on fly.io
my cluster settings are:
They're running fine most of the time:
But when testing I've found that sometimes they get stuck and can't communicate and spit out the following logs:
As a result, if I subscribe to one node, I don't receive messages sent to the other node.
I've found a restart fixes this, but I can't deploy EMQX into production if this occurs.
What could potentially be causing this please? Is it the DNS resolution of the domain names from the cloud provider (docs here: https://fly.io/docs/networking/private-networking/)? If so I'm waiting for #12467 to get merged so I can move to AAAA discovery. This might fix the issue.
What did you expect to happen?
I expected two nodes to communicate correctly using my cloud provider's private IP address resolution mechanisms
How can we reproduce it (as minimally and precisely as possible)?
Use static discovery and 2 seeds
Anything else we need to know?
No response
EMQX version
OS version
Log files