intermittent failure to resolve DNS records for services #4222

javacruft · 2023-09-25T11:11:57Z

Summary

During testing of OpenStack Sunbeam we occasionally see an issue where pods are unable to resolve the hostname associated with another service in the same deployment:

https://bugs.launchpad.net/snap-openstack/+bug/2033680

This typically materialises when something in the pod tries to connect to the remote service:

(2003, "Can't connect to MySQL server on 'nova-api-mysql-router.openstack.svc.cluster.local' ([Errno -2] Name or service not known)")

DNS addon is enabled.

What Should Happen Instead?

Hostname of service should be resolvable.

Reproduction Steps

We're not able to reproduce this error consistently.

Introspection Report

Not currently collected - have requested.

Can you suggest a fix?

Not based on current information - debugging or additional log collection would be great.

Are you interested in contributing with a fix?

no

ktsakalozos · 2023-09-26T06:39:34Z

Thank you for reporting this, @javacruft. Is it possible to have an inspection report?

gnuoy · 2023-10-24T12:52:02Z

I was looking into a failed deployment and seemed to hit this issue:

inspection-report-20231024_124905.tar.gz

neoaggelos · 2023-10-24T13:19:52Z

@gnuoy thanks for the report, we'll have a look!

gnuoy · 2023-10-26T15:26:25Z

I have another deployment with this symptom and I've narrowed it down a bit. I took a look at the DNS query OpenStack uses, and the interesting bits are that it uses UDP and enables the search option.

On a working system:

# dig +notcp +short +search nova-api-mysql-router.openstack.svc.cluster.local
10.152.183.186
# dig +notcp +short +search nova-api-mysql-router.openstack.svc.cluster.local.
10.152.183.186

On a broken system:

# dig +notcp +short +search nova-api-mysql-router.openstack.svc.cluster.local
# dig +notcp +short +search nova-api-mysql-router.openstack.svc.cluster.local.
10.152.183.210

(Note the trailing '.' on the FQDN in the second query.)

I can also 'fix' the broken system by changing the ndots option in /etc/resolv.conf from ndots:5 to ndots:4.

gnuoy · 2023-10-26T15:39:44Z

Inspection report from broken deployment mentioned in previous comment:
inspection-report-20231026_153632.tar.gz

gnuoy · 2023-10-26T15:50:48Z

Dumping the CoreDNS config with microk8s.kubectl get configmap -n kube-system coredns -o yaml shows another difference. On the working system:

forward . 8.8.8.8 8.8.4.4

On the broken system:

forward . /etc/resolv.conf

Also on the broken system:

$ grep -vE '^$|^#' /etc/resolv.conf
nameserver 127.0.0.53
options edns0 trust-ad
search maas

I wonder if .maas is getting appended to the queries which don't have a trailing dot (so they aren't FQDNs) and that then causes (me frantically waving hands) an issue with the ndots setting.
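To make the interaction between ndots and the search list concrete, here is a minimal sketch of the order in which a glibc-style stub resolver tries names. It is an illustration, not the resolver's real code. The search list below is an assumption: it is what a pod in the openstack namespace would typically get, with the host's "maas" domain appended; it was not captured from the broken system.

```python
def candidate_queries(name, search, ndots):
    """Return the names a glibc-style stub resolver would try, in order.

    A name with a trailing dot is absolute and is tried as-is only.
    Otherwise, if the name has at least `ndots` dots it is tried as-is
    first and the search list afterwards; with fewer dots the search
    list is tried first and the name as-is last.
    """
    if name.endswith("."):
        return [name]
    absolute = name + "."
    with_search = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + with_search
    return with_search + [absolute]


# ASSUMPTION: hypothetical pod search list, not taken from the report.
SEARCH = ["openstack.svc.cluster.local", "svc.cluster.local",
          "cluster.local", "maas"]
NAME = "nova-api-mysql-router.openstack.svc.cluster.local"  # 4 dots

for ndots in (5, 4):
    print(f"ndots:{ndots}")
    for query in candidate_queries(NAME, SEARCH, ndots):
        print("   ", query)
```

The service name has four dots, so with ndots:5 every search domain (including "maas") is tried before the name itself, while with ndots:4 the name is tried as-is first. That matches the observation that the trailing dot or ndots:4 both avoid the failure, but it does not by itself explain why the final as-is query fails on the broken system.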

On both systems the dns plugin was enabled without specifying any nameservers, e.g. sudo microk8s enable dns
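For comparing the forward setting across several deployments, a small helper along these lines could pull the upstreams out of the Corefile text stored in the coredns configmap. This is only a sketch: the sample Corefile below is hypothetical (just the forward line comes from the broken system), and the parsing handles only the simple "forward FROM TO..." form.

```python
def forward_upstreams(corefile):
    """Return the upstream list of every `forward` directive in a Corefile."""
    upstreams = []
    for line in corefile.splitlines():
        tokens = line.split("#", 1)[0].split()  # drop comments, then tokenize
        if len(tokens) >= 3 and tokens[0] == "forward":
            # tokens[1] is the zone (FROM); the rest are upstreams (TO...).
            # A trailing "{" opens an options block and is not an upstream.
            upstreams.append([t for t in tokens[2:] if t != "{"])
    return upstreams


def uses_host_resolver(corefile):
    """True if any forward directive points at a resolv.conf-style file."""
    return any(target.startswith("/")
               for targets in forward_upstreams(corefile)
               for target in targets)


# ASSUMPTION: hypothetical Corefile; only the forward line is from the thread.
BROKEN = """
.:53 {
    errors
    forward . /etc/resolv.conf
    cache 30
}
"""
print(forward_upstreams(BROKEN), uses_host_resolver(BROKEN))
```

When forward points at /etc/resolv.conf, CoreDNS takes its upstream nameservers from that file, so a deployment's behaviour then depends on the host's resolver configuration rather than on fixed public resolvers.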

air-awan · 2023-11-16T11:41:26Z

Thanks for sharing, @gnuoy. I have a similar issue: deploying microstack on a newly installed Ubuntu 22.04 by following the instructions from https://microstack.run/docs/multi-node always fails with "Error: Timed out while waiting for model 'openstack' to be ready".
This happens during the "Deploying OpenStack Control Plane" step, when executing the "sunbeam cluster bootstrap --role control --role compute --role storage" command.

Running "juju status -m openstack" shows that the neutron and nova units are stuck at "blocked":
neutron/0* blocked idle 10.1.169.227 (container:neutron-server) healthcheck failed: online
nova/0* blocked idle 10.1.169.206 (workload) DB sync failed

The logs from nova-conductor show this error:
2003, "Can't connect to MySQL server on 'nova-api-mysql-router.openstack.svc.cluster.local' ([Errno -2] No address found)"

neutron-server shows a similar error:
2003, "Can't connect to MySQL server on 'neutron-mysql-router.openstack.svc.cluster.local' ([Errno -2] No address found)"

Editing the ndots option in /etc/resolv.conf inside the neutron-server and nova-conductor pods gets me past that error. However, nova-conductor then encountered another error:
Unhandled error: sqlalchemy.exc.ProgrammingError: (pymysql.err.ProgrammingError) (1146, "Table 'nova_api.cell_mappings' doesn't exist")

After reinstalling Ubuntu several times, and then finding out about the microstack tear-down procedure (https://discourse.ubuntu.com/t/tear-down-your-openstack-lab-environment/25078/11), I finally traced the issue to the "search" option in /etc/resolv.conf (coming from the netplan config). After setting "search" to blank ("[]" in netplan or "." in resolv.conf), the sunbeam bootstrap process finishes successfully.

Summary:
On a microstack node, using the search configuration option in netplan or resolv.conf causes DNS problems in sunbeam/microstack.
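As a quick pre-flight check for this condition, something like the following sketch could flag a host whose resolv.conf carries a real search domain before bootstrapping. The function names are hypothetical, and it reads only the search and ndots settings; which file sunbeam or microk8s actually consults on a given host is not confirmed here.

```python
def parse_resolv_conf(text):
    """Return (search_domains, ndots) from resolv.conf-style text."""
    search, ndots = [], 1  # the resolver's default for ndots is 1
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # blank lines and comments
            continue
        fields = line.split()
        if fields[0] == "search":
            search = fields[1:]  # a later search line replaces an earlier one
        elif fields[0] == "options":
            for opt in fields[1:]:
                if opt.startswith("ndots:"):
                    ndots = int(opt.split(":", 1)[1])
    return search, ndots


def has_search_domains(text):
    """True if a search domain other than the root '.' is configured."""
    search, _ = parse_resolv_conf(text)
    return any(domain != "." for domain in search)


# Content of /etc/resolv.conf as reported on the broken system above.
HOST = """nameserver 127.0.0.53
options edns0 trust-ad
search maas
"""
print(parse_resolv_conf(HOST), has_search_domains(HOST))
```

With the workaround applied ("search ." in resolv.conf), has_search_domains returns False.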
