STRICT_DNS drops cluster members on lookup failure #2691

Closed
jasonmartens opened this issue Mar 1, 2018 · 2 comments
Labels
enhancement Feature requests. Not bugs or questions. help wanted Needs help!

Comments

@jasonmartens

Title: STRICT_DNS drops cluster members on lookup failure

Description:
We are using Envoy in a Consul environment. We would like to use DNS lookups to configure our clusters. For our particular use case, we need Envoy instances in DCs around the world to locate a set of hosts in one datacenter. To do this, we are using a prepared query. In short, this allows us to do a global lookup of the set of hosts we need and query it using DNS.

However, when network lag is too great, the DNS lookup occasionally returns NXDOMAIN instead of the set of IPs it normally returns. With STRICT_DNS on the cluster this is catastrophic: all hosts are removed from the cluster, causing downtime until the next successful DNS query.

Instead, I would like Envoy to consider the DNS entries as advisory, and keep using the last known set until lookups recover.
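
To make the requested behavior concrete, here is a minimal sketch in Python, not Envoy code. The `AdvisoryDnsMembers` class, the `LookupFailed` exception, and the `resolve` callable are all hypothetical names made up for illustration; the only point is that a failed lookup leaves the previous member set in place.

```python
from typing import Callable, FrozenSet, Iterable


class LookupFailed(Exception):
    """Hypothetical error for a failed DNS query (e.g. NXDOMAIN or a timeout)."""


class AdvisoryDnsMembers:
    """Keeps the last successfully resolved host set across failed lookups."""

    def __init__(self, resolve: Callable[[], Iterable[str]]):
        # `resolve` is an assumed callable that returns IPs or raises LookupFailed.
        self._resolve = resolve
        self._members: FrozenSet[str] = frozenset()

    @property
    def members(self) -> FrozenSet[str]:
        return self._members

    def refresh(self) -> bool:
        """Run one lookup; return True if the lookup succeeded."""
        try:
            resolved = frozenset(self._resolve())
        except LookupFailed:
            # Treat the failure as advisory: keep the last known set.
            return False
        self._members = resolved
        return True
```

With this shape, a transient NXDOMAIN between two good responses would not empty the cluster; the set only changes when a lookup actually succeeds.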

Workarounds
We are trying out LOGICAL_DNS instead, which seems to have the DNS lookup properties that we want. However, the lookup result is itself a set of Envoy sidecars, and it would be better if the downstream Envoy could maintain connections to each of those upstream Envoy instances. From what I can tell, LOGICAL_DNS also does not use HTTP/2?

We are just getting started with Envoy, so maybe there is something obvious I'm missing. But from what I can tell, the behavior of STRICT_DNS is more what we want than LOGICAL_DNS.

@ggreenway ggreenway added the enhancement Feature requests. Not bugs or questions. label Mar 1, 2018
@mattklein123
Member

@jasonmartens the history here is that when we used to use getaddrinfo_a() there was basically no good way to differentiate an error from an empty response (terrible API). With c-ares there might be. I'm not sure. If there are clear errors that we should be ignoring, we can make the DNS resolver not consider it an empty response. I would have a look at the code.
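
As a rough illustration of the distinction being discussed, the sketch below separates "the query failed" from "the query succeeded and returned nothing". It is written in Python and is not the Envoy resolver interface: `ResolutionStatus` and `apply_dns_response` are hypothetical names, and deciding which resolver errors should map to `FAILURE` is exactly the open question, so that mapping is left out.

```python
from enum import Enum
from typing import FrozenSet, Iterable


class ResolutionStatus(Enum):
    """Hypothetical outcome reported by the resolver alongside the addresses."""
    SUCCESS = "success"
    FAILURE = "failure"


def apply_dns_response(
    current: FrozenSet[str],
    status: ResolutionStatus,
    addresses: Iterable[str],
) -> FrozenSet[str]:
    """Return the member set to use after one DNS response."""
    if status is ResolutionStatus.FAILURE:
        # A failed query says nothing about membership, so keep what we have.
        return current
    # A successful query is authoritative, even when it is empty.
    return frozenset(addresses)
```

The design choice here is that an empty successful answer still clears the set, so a service that really has gone away is removed, while a lookup error is ignored.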

Your other option is to enable active health checking against the endpoints. This will stabilize the endpoints since Envoy will trust active HC over discovery.
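
A simplified model of "trust active HC over discovery", sketched in Python under assumptions: `next_members` and its arguments are hypothetical, and the real logic in Envoy has more states than this. The idea shown is only that a host dropped by discovery stays in the set while it is still passing health checks.

```python
from typing import Dict, Iterable, Set


def next_members(
    current: Set[str],
    discovered: Iterable[str],
    healthy: Dict[str, bool],
) -> Set[str]:
    """Combine a discovery result with active health check state.

    `healthy` maps host -> latest health check result (assumed input);
    hosts missing from the map are treated as unhealthy.
    """
    discovered_set = set(discovered)
    # Hosts that discovery no longer reports are kept only while healthy.
    kept = {
        host
        for host in current
        if host not in discovered_set and healthy.get(host, False)
    }
    return discovered_set | kept
```

Under this model, an empty DNS answer would remove only the hosts that are also failing health checks, which is why health checking stabilizes the endpoints.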

@mattklein123 mattklein123 added the help wanted Needs help! label Mar 1, 2018
Shikugawa pushed a commit to Shikugawa/envoy that referenced this issue Mar 28, 2020
Enable TCP Client Metrics

Signed-off-by: gargnupur <gargnupur@google.com>

Enable TCP Client Metrics

Signed-off-by: gargnupur <gargnupur@google.com>

Remove extra line

Signed-off-by: gargnupur <gargnupur@google.com>

Regenerate wasm files

Signed-off-by: gargnupur <gargnupur@google.com>
@mattklein123
Member

@junr03 fixed this recently.

jpsim pushed a commit that referenced this issue Nov 28, 2022
Fix a possible use-after-free with platform cert verification by using a unique_ptr in the flat_hash_set of pending validations. The flat_hash_set does not ensure pointer stability, but the validation thread holds a pointer to the PendingVerification, which is problematic. This PR makes PendingVerification non-moveable and non-copyable which avoids this problem.

There is also another potential use-after-free in that the task posted to the dispatcher deletes the PendingValidation, but the PendingValidation touches member variables after the call to post. Reordered the call to post to avoid this.

Fixes #2691

Signed-off-by: Ryan Hamilton <rch@google.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit that referenced this issue Nov 29, 2022

3 participants