STRICT_DNS drops cluster members on lookup failure #2691

jasonmartens · 2018-03-01T04:59:37Z

Title: STRICT_DNS drops cluster members on lookup failure

Description:
We are using Envoy in a Consul environment, and we would like to use DNS lookups to configure our clusters. For our particular use case, we need Envoy instances in datacenters around the world to locate a set of hosts in one datacenter. To do this, we are using a Consul prepared query, which lets us do a global lookup of the set of hosts we need and query it over DNS.

However, when network lag is too great, the DNS query occasionally returns NXDOMAIN instead of the set of IPs it normally returns. With a STRICT_DNS cluster this is catastrophic: all hosts are removed from the cluster, causing downtime until the next successful DNS query.

Instead, I would like Envoy to consider the DNS entries as advisory, and keep using the last known set until lookups recover.
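To make the requested behavior concrete, here is a minimal Python sketch of an "advisory" host set. It is not Envoy code: the class name `AdvisoryHostSet`, the `resolve` callback, and the convention that a failed lookup is reported as `None` are all assumptions made for illustration.

```python
from typing import Callable, FrozenSet, Optional

# Hypothetical resolver signature: returns the resolved IPs, or None when
# the lookup failed (for example NXDOMAIN or a timeout).
Resolver = Callable[[str], Optional[FrozenSet[str]]]


class AdvisoryHostSet:
    """Illustrative only: keeps the last good answer when a lookup fails."""

    def __init__(self, resolve: Resolver) -> None:
        self._resolve = resolve
        self.hosts: FrozenSet[str] = frozenset()

    def refresh(self, name: str) -> FrozenSet[str]:
        answer = self._resolve(name)
        if answer is not None:
            # Successful lookup: replace the membership with the new answer.
            self.hosts = answer
        # Failed lookup: fall through and keep the last known set.
        return self.hosts
```

With this shape, a single failed refresh leaves the cluster membership unchanged, and the next successful refresh replaces it.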

Workarounds
We are trying out LOGICAL_DNS instead, which seems to have the DNS lookup properties that we want. However, the lookup returns a set of Envoy sidecars, and it would be better if the downstream Envoy could maintain connections to each of those upstream Envoy instances. From what I can tell, LOGICAL_DNS also does not use HTTP/2?

We are just getting started with Envoy, so maybe there is something obvious I'm missing. But from what I can tell, the behavior of STRICT_DNS is more what we want than LOGICAL_DNS.

mattklein123 · 2018-03-01T21:19:16Z

@jasonmartens the history here is that when we used to use getaddrinfo_a() there was basically no good way to differentiate an error from an empty response (terrible API). With c-ares there might be. I'm not sure. If there are clear errors that we should be ignoring, we can make the DNS resolver not consider it an empty response. I would have a look at the code.
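The distinction described here, between a resolver error and a genuinely empty answer, can be sketched as follows. This is a hedged Python illustration, not the Envoy or c-ares API: `ResolutionStatus` and `next_hosts` are hypothetical names, and which resolver errors should map to `FAILURE` is left open, as it is in the comment above.

```python
from enum import Enum
from typing import List, Sequence


class ResolutionStatus(Enum):
    # Hypothetical status reported alongside the resolved addresses.
    SUCCESS = "success"
    FAILURE = "failure"


def next_hosts(
    current: Sequence[str],
    status: ResolutionStatus,
    addresses: Sequence[str],
) -> List[str]:
    """Return the host list to use after one resolution attempt."""
    if status is ResolutionStatus.FAILURE:
        # An error is not an answer: keep the current hosts.
        return list(current)
    # A successful answer is authoritative, even when it is empty.
    return list(addresses)
```

The design choice is that only an explicit success may shrink the cluster, so an empty successful answer still drains it while an error does not.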

Your other option is to enable active health checking against the endpoints. This will stabilize the endpoints since Envoy will trust active HC over discovery.

mattklein123 · 2020-04-14T23:55:58Z

@junr03 fixed this recently.

ggreenway added the "enhancement" label (Feature requests. Not bugs or questions.) on Mar 1, 2018

mattklein123 added the "help wanted" label (Needs help!) on Mar 1, 2018

mattklein123 closed this as completed Apr 14, 2020

andrewjjenkins mentioned this issue on Jan 18, 2022, in "DNS improvements #16314" (open)