
no_healthy_upstream possibly caused by strange envoy DNS timeout #5564

Open

arista-marcin opened this issue Feb 14, 2024 · 1 comment

@arista-marcin
Describe the bug
Our k8s cluster (emissary-ingress running on bare metal as a DaemonSet on 3 dedicated nodes) started to behave very strangely: for particular services only, requests were returning:

no_healthy_upstream

The upstream services were fine and fully reachable from all pods (verified by running curl manually).
We restarted one of the pods, and after that it started returning the expected status, but the others still did not.
After further investigation we noticed the following in the logs:

emissary-ingress-l55kb emissary-ingress [2024-02-14 10:57:03.261][85][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:152] dns resolution for bug-service-fastapi.bug-service failed with c-ares status 12
emissary-ingress-l55kb emissary-ingress [2024-02-14 10:57:03.261][85][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:245] DNS request timed out 4 times
emissary-ingress-l55kb emissary-ingress [2024-02-14 10:57:03.261][85][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:278] dns resolution for bug-service-fastapi.bug-service completed with status 1
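
For reference, c-ares status 12 should be ARES_ETIMEOUT (if I'm reading the c-ares error codes correctly), i.e. the query timed out rather than getting an NXDOMAIN or SERVFAIL. The dns debug lines above only show up with the dns logger at debug level; a minimal sketch of how we raised it, assuming the Envoy admin interface is reachable on localhost:8001 inside the emissary pod (the "emissary" namespace here is an example, and the admin port may differ in your install):

# exec into one of the affected emissary-ingress pods
# (pod/container names taken from our logs; namespace is an example)
kubectl -n emissary exec -it emissary-ingress-l55kb -c emissary-ingress -- sh

# inside the pod: raise only the "dns" logger to debug via the Envoy admin API
# (assumption: admin interface on localhost:8001 - adjust if yours differs)
curl -s -X POST 'localhost:8001/logging?dns=debug'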

This made us suspect the problem might be related to Envoy DNS.
Funnily enough, when checking DNS resolution on each particular pod it works fine for the given host (sketch below), which again points to something inside Envoy.
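
To illustrate, this is roughly the check we ran against each pod (just a sketch; the exact lookup tool depends on what the image actually ships, and the namespace is again an example):

# check resolution from inside the pod's own network namespace / resolv.conf
kubectl -n emissary exec emissary-ingress-l55kb -c emissary-ingress -- \
  getent hosts bug-service-fastapi.bug-service

This resolved correctly on every pod, while Envoy inside the same pods kept timing out.
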
We also noticed that when running netstat -anp | grep :53 | grep ^udp | grep ESTABLISHED on the affected pods, we constantly see the same ESTABLISHED connection, similar to:

# netstat -anp | grep :53 | grep ^udp | grep ESTABLISHED
udp        0      0 10.243.192.5:34923      192.168.0.10:53         ESTABLISHED -

and when repeating that command multiple times we got the same output (same source port), whereas on working pods the source port was changing.
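
In case it is useful for comparison, the loop we used to watch that socket (a sketch; pod name from the logs, namespace is an example):

# repeat the check a few times and compare the local source port
for i in $(seq 1 5); do
  kubectl -n emissary exec emissary-ingress-l55kb -c emissary-ingress -- \
    sh -c 'netstat -anp | grep :53 | grep ^udp | grep ESTABLISHED'
  sleep 10
done

On the broken pods the local source port stayed identical across iterations; on healthy pods it changed between runs.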

The issue was also visible in the envoy_dns_cares_timeouts metric.
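
That counter can also be read directly from the Envoy admin stats inside the pod; the raw stat name behind the envoy_dns_cares_timeouts Prometheus metric should be dns.cares.timeouts. A sketch, again assuming the admin interface on localhost:8001 and an example namespace:

# dump the raw c-ares resolver stats from the Envoy admin endpoint
kubectl -n emissary exec emissary-ingress-l55kb -c emissary-ingress -- \
  sh -c 'curl -s localhost:8001/stats | grep dns.cares'

On the affected pods, dns.cares.timeouts was the counter that kept increasing.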

Is this something known?

Unfortunately, after restarting the remaining pods, everything returned to normal.
At the moment we cannot reproduce this, but I'm afraid (per Murphy's law) it will happen again sooner rather than later.

To Reproduce
Not deterministic, unfortunately.

Expected behavior
emissary-ingress (and the underlying Envoy) should be able to resolve DNS and therefore return the expected status from the upstream service (instead of no_healthy_upstream).

Versions (please complete the following information):

  • emissary-ingress 3.9.1
  • kubernetes baremetal v1.24.17
  • envoy version: 6637fd1bab315774420f3c3d97488fedb7fc710f/1.27.2/Clean/RELEASE/BoringSSL
@cindymullins-dw
Contributor

Hi @arista-marcin, to my knowledge we've not seen anything like that, and I'm not finding any references to this error. Thanks for reporting. If you do see it again, please let us know.
