intermittent failure to resolve DNS records for services #4222

Open
javacruft opened this issue Sep 25, 2023 · 7 comments

@javacruft
Contributor

Summary

During testing of OpenStack Sunbeam we occasionally see an issue where pods are unable to resolve the hostname associated with another service in the same deployment:

https://bugs.launchpad.net/snap-openstack/+bug/2033680

This typically materialises when something in the pod tries to connect to the remote service:

(2003, "Can't connect to MySQL server on 'nova-api-mysql-router.openstack.svc.cluster.local' ([Errno -2] Name or service not known)")

DNS addon is enabled.

What Should Happen Instead?

Hostname of service should be resolvable.

Reproduction Steps

We're not able to reproduce this error consistently.

Introspection Report

Not currently collected - have requested.

Can you suggest a fix?

Not based on current information - debugging or additional log collection would be great.

Are you interested in contributing with a fix?

no

@ktsakalozos
Member

Thank you for reporting this, @javacruft. Is it possible to get an inspection report?

@gnuoy

gnuoy commented Oct 24, 2023

I was looking into a failed deployment and seemed to hit this issue:

inspection-report-20231024_124905.tar.gz

@neoaggelos
Contributor

@gnuoy thanks for the report, we'll have a look!

@gnuoy

gnuoy commented Oct 26, 2023

I have another deployment with this symptom and I've narrowed it down a bit. I've taken a look at the DNS query OpenStack uses, and the interesting bits are that it uses UDP and enables the search option.

On a working system:

# dig +notcp +short +search nova-api-mysql-router.openstack.svc.cluster.local
10.152.183.186
# dig +notcp +short +search nova-api-mysql-router.openstack.svc.cluster.local.
10.152.183.186

On a broken system:

# dig +notcp +short +search nova-api-mysql-router.openstack.svc.cluster.local
# dig +notcp +short +search nova-api-mysql-router.openstack.svc.cluster.local.
10.152.183.210

(Note the trailing '.' on the fqdn in the second query)
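The same with/without trailing dot comparison can be sketched from inside a pod with Python's standard resolver call. This is only a rough equivalent of the dig commands above: `socket.getaddrinfo` goes through the system stub resolver and does not let you force UDP, and the helper names `resolves` and `compare_trailing_dot` are made up for illustration.

```python
import socket


def resolves(name, lookup=socket.getaddrinfo):
    """Return True if `name` resolves; `lookup` is injectable for testing."""
    try:
        lookup(name, None)
        return True
    except socket.gaierror:
        # Raised for errors such as "[Errno -2] Name or service not known".
        return False


def compare_trailing_dot(name, lookup=socket.getaddrinfo):
    """Check the relative form and the absolute (trailing dot) form of a name."""
    relative = name.rstrip(".")
    absolute = relative + "."
    return {relative: resolves(relative, lookup), absolute: resolves(absolute, lookup)}


if __name__ == "__main__":
    # Service name taken from the error message in this issue.
    print(compare_trailing_dot("nova-api-mysql-router.openstack.svc.cluster.local"))
```

If the relative form fails while the absolute form succeeds, that matches the broken system shown above.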

I can also 'fix' the broken system by updating the ndots option in /etc/resolv.conf from ndots:5 to ndots:4
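For anyone following along, here is a minimal sketch of why ndots matters for this name. It models the ordering a glibc-style stub resolver uses when applying the search list; other resolvers (musl, for example) behave differently. The search list below is an assumption about what a pod's /etc/resolv.conf might contain, with "maas" standing in for a domain inherited from the host. It is not taken from an inspection report.

```python
def candidate_queries(name, search, ndots):
    """Return the names a glibc-style stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: the search list is skipped.
        return [name]
    as_is = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [as_is] + expanded
    # Too few dots: walk the search list first, try the name as given last.
    return expanded + [as_is]


# Hypothetical search list, for illustration only.
SEARCH = ["openstack.svc.cluster.local", "svc.cluster.local", "cluster.local", "maas"]
NAME = "nova-api-mysql-router.openstack.svc.cluster.local"  # contains 4 dots

for ndots in (5, 4):
    print(f"ndots:{ndots} first query -> {candidate_queries(NAME, SEARCH, ndots)[0]}")
```

The service name has four dots, so with ndots:5 every search domain is tried before the name itself, while with ndots:4 the name is tried as-is first. That would be consistent with the ndots change making the symptom go away, though it does not explain why the search-list queries fail outright on the broken system.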

@gnuoy

gnuoy commented Oct 26, 2023

Inspection report from broken deployment mentioned in previous comment:
inspection-report-20231026_153632.tar.gz

@gnuoy

gnuoy commented Oct 26, 2023

Using microk8s.kubectl get configmap -n kube-system coredns -o yaml to dump the coredns config shows another difference, on the working system:

forward . 8.8.8.8 8.8.4.4

On the broken system:

forward . /etc/resolv.conf

On the broken system:

$ grep -vE '^$|^#' /etc/resolv.conf
nameserver 127.0.0.53
options edns0 trust-ad
search maas

I wonder if .maas is getting appended to the queries that don't have a trailing dot (so they aren't FQDNs), and that then causes (frantically waves hands) an issue with the ndots setting.

On both systems the dns plugin was enabled without specifying any nameservers, e.g. sudo microk8s enable dns
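To compare machines quickly, a small parser for resolv.conf-style text can pull out the nameserver, search and options entries. This is a sketch that only handles those three keywords; the sample text is the grep output from the broken system above, and `parse_resolv_conf` is a made-up helper name.

```python
def parse_resolv_conf(text):
    """Extract nameserver, search and options entries from resolv.conf text."""
    config = {"nameserver": [], "search": [], "options": []}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        keyword, *values = line.split()
        if keyword == "nameserver":
            config["nameserver"].extend(values)
        elif keyword == "search":
            # Per resolv.conf(5), only the last search line is used.
            config["search"] = values
        elif keyword == "options":
            config["options"].extend(values)
    return config


# Contents reported for the broken system in this comment.
BROKEN_RESOLV_CONF = """\
nameserver 127.0.0.53
options edns0 trust-ad
search maas
"""

cfg = parse_resolv_conf(BROKEN_RESOLV_CONF)
print("search domains:", cfg["search"])
```

A non-empty search list is the thing to look for when comparing a working deployment with a broken one.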

@air-awan

Thanks for sharing, @gnuoy. I have a similar issue: deploying microstack on a freshly installed Ubuntu 22.04 by following the instructions at https://microstack.run/docs/multi-node always fails with "Error: Timed out while waiting for model 'openstack' to be ready".
This happens during the "Deploying OpenStack Control Plane" step, when running the "sunbeam cluster bootstrap --role control --role compute --role storage" command.

Running "juju status -m openstack" shows that neutron and nova are stuck at "blocked":
neutron/0* blocked idle 10.1.169.227 (container:neutron-server) healthcheck failed: online
nova/0* blocked idle 10.1.169.206 (workload) DB sync failed

The nova-conductor logs show this error:
2003, "Can't connect to MySQL server on 'nova-api-mysql-router.openstack.svc.cluster.local' ([Errno -2] No address found)"

neutron-server shows a similar error:
2003, "Can't connect to MySQL server on 'neutron-mysql-router.openstack.svc.cluster.local' ([Errno -2] No address found)"

Editing the ndots option in /etc/resolv.conf inside the neutron-server and nova-conductor pods gets me past that error. However, nova-conductor then hits another error:
Unhandled error: sqlalchemy.exc.ProgrammingError: (pymysql.err.ProgrammingError) (1146, "Table 'nova_api.cell_mappings' doesn't exist")

After reinstalling Ubuntu several times, and then finding the microstack tear-down procedure (https://discourse.ubuntu.com/t/tear-down-your-openstack-lab-environment/25078/11), I finally traced the issue to the "search" option in /etc/resolv.conf (which comes from the netplan config). After setting "search" to blank ("[]" in netplan or "." in resolv.conf), the sunbeam bootstrap process finishes successfully.

Summary:
On a microstack node, setting the search option in netplan or resolv.conf causes DNS problems in sunbeam/microstack.
