
no_healthy_upstream possibly caused by strange envoy DNS timeout #5564

Open

arista-marcin opened this issue Feb 14, 2024 · 1 comment

@arista-marcin
Describe the bug
Our k8s cluster (emissary-ingress running on bare metal as a DaemonSet on 3 dedicated nodes) started to behave very strangely: for particular services only, requests were returning:

no_healthy_upstream

The upstream services were fine and fully reachable from all pods (verified by running curl manually).
We restarted one of the pods, and after that it started returning the expected status, but the others still did not.
After further investigation we noticed the following in the logs:

emissary-ingress-l55kb emissary-ingress [2024-02-14 10:57:03.261][85][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:152] dns resolution for bug-service-fastapi.bug-service failed with c-ares status 12
emissary-ingress-l55kb emissary-ingress [2024-02-14 10:57:03.261][85][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:245] DNS request timed out 4 times
emissary-ingress-l55kb emissary-ingress [2024-02-14 10:57:03.261][85][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:278] dns resolution for bug-service-fastapi.bug-service completed with status 1
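
For reference, c-ares status 12 should be ARES_ETIMEOUT (if I'm reading the c-ares error codes correctly), i.e. the query timed out rather than getting an NXDOMAIN or SERVFAIL. The dns debug lines above only show up with the dns logger at debug level; a minimal sketch of how we raised it, assuming the Envoy admin interface is reachable on localhost:8001 inside the emissary pod (the "emissary" namespace here is an example, and the admin port may differ in your install):

# exec into one of the affected emissary-ingress pods
# (pod/container names taken from our logs; namespace is an example)
kubectl -n emissary exec -it emissary-ingress-l55kb -c emissary-ingress -- sh

# inside the pod: raise only the "dns" logger to debug via the Envoy admin API
# (assumption: admin interface on localhost:8001 - adjust if yours differs)
curl -s -X POST 'localhost:8001/logging?dns=debug'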

This made us suspect the problem might be related to Envoy DNS.
Funnily enough, when checking DNS resolution on each particular pod it works fine for the given host (sketch below), which again points to something inside Envoy.
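
To illustrate, this is roughly the check we ran against each pod (just a sketch; the exact lookup tool depends on what the image actually ships, and the namespace is again an example):

# check resolution from inside the pod's own network namespace / resolv.conf
kubectl -n emissary exec emissary-ingress-l55kb -c emissary-ingress -- \
  getent hosts bug-service-fastapi.bug-service

This resolved correctly on every pod, while Envoy inside the same pods kept timing out.
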
We also noticed that when running netstat -anp | grep :53 | grep ^udp | grep ESTABLISHED on the affected pods, we constantly see the same ESTABLISHED connection, similar to:

# netstat -anp | grep :53 | grep ^udp | grep ESTABLISHED
udp        0      0 10.243.192.5:34923      192.168.0.10:53         ESTABLISHED -

and when repeating that command multiple times we got the same output (same source port), whereas on working pods the source port was changing.
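
In case it is useful for comparison, the loop we used to watch that socket (a sketch; pod name from the logs, namespace is an example):

# repeat the check a few times and compare the local source port
for i in $(seq 1 5); do
  kubectl -n emissary exec emissary-ingress-l55kb -c emissary-ingress -- \
    sh -c 'netstat -anp | grep :53 | grep ^udp | grep ESTABLISHED'
  sleep 10
done

On the broken pods the local source port stayed identical across iterations; on healthy pods it changed between runs.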

The issue was also visible in the envoy_dns_cares_timeouts metric.
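
That counter can also be read directly from the Envoy admin stats inside the pod; the raw stat name behind the envoy_dns_cares_timeouts Prometheus metric should be dns.cares.timeouts. A sketch, again assuming the admin interface on localhost:8001 and an example namespace:

# dump the raw c-ares resolver stats from the Envoy admin endpoint
kubectl -n emissary exec emissary-ingress-l55kb -c emissary-ingress -- \
  sh -c 'curl -s localhost:8001/stats | grep dns.cares'

On the affected pods, dns.cares.timeouts was the counter that kept increasing.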

Is this something known?

Unfortunately, after restarting the remaining pods, everything returned to normal.
At the moment we cannot reproduce this, but I'm afraid (per Murphy's law) it will happen again sooner rather than later.

To Reproduce
Not deterministic, unfortunately.

Expected behavior
emissary-ingress (and the underlying Envoy) should be able to resolve DNS and therefore return the expected status from the upstream service (instead of no_healthy_upstream).

Versions (please complete the following information):

  • emissary-ingress 3.9.1
  • kubernetes baremetal v1.24.17
  • envoy version: 6637fd1bab315774420f3c3d97488fedb7fc710f/1.27.2/Clean/RELEASE/BoringSSL
@cindymullins-dw
Contributor

Hi @arista-marcin, to my knowledge we've not seen anything like that, and I'm not finding any references to this error. Thanks for reporting. If you do see it again, please let us know.
