DNS issue resolving external domains #18

Closed
Mistic92 opened this issue Feb 5, 2021 · 27 comments

Comments

@Mistic92
Contributor

Mistic92 commented Feb 5, 2021

I just tried to use runsd with GCP tracing and got this error:

@google-cloud/trace-agent ERROR TraceWriter#publish: Received error  while publishing traces to cloudtrace.googleapis.com: FetchError: request to https://cloudtrace.googleapis.com/v1/projects/yosh-dev/traces failed, reason: getaddrinfo EAI_AGAIN cloudtrace.googleapis.com

I assume the reason is that it's using https, while for runsd it should be http. Is there a way to make it work without changing the URLs in the client object?

@Mistic92
Contributor Author

Mistic92 commented Feb 5, 2021

GCP Logging has the same issue.

@ahmetb
Owner

ahmetb commented Feb 5, 2021

HTTP vs. HTTPS shouldn't be the issue, as we don't hijack those hostnames.

Can you run "dig cloudtrace.googleapis.com" or see what IP it resolves to in the container on Cloud Run?

@Mistic92
Contributor Author

Mistic92 commented Feb 5, 2021

I used your example code; the results are below.
Region: europe-west1

$ dig +search A cloudtrace.googleapis.com


; <<>> DiG 9.16.11 <<>> +search A cloudtrace.googleapis.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 33282
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;cloudtrace.googleapis.com.google.internal. IN A

From the logs:

I0205 19:22:15.638194 1 dns.go:42] [dns] > Q0: type=A name=cloudtrace.googleapis.com.europe-west1.run.internal.
I0205 19:22:15.638336 1 dns.go:66] [dns] < type=A name=cloudtrace.googleapis.com.europe-west1.run.internal. is too short or long (need ndots=4; got=6), nxdomain
I0205 19:22:15.638900 1 dns.go:42] [dns] > Q0: type=A name=cloudtrace.googleapis.com.run.internal.
I0205 19:22:15.638920 1 dns.go:66] [dns] < type=A name=cloudtrace.googleapis.com.run.internal. is too short or long (need ndots=4; got=5), nxdomain
I0205 19:22:15.639314 1 dns.go:42] [dns] > Q0: type=A name=cloudtrace.googleapis.com.google.internal.
I0205 19:22:15.639333 1 dns.go:120] [dns] >> recursing type=A name=cloudtrace.googleapis.com.google.internal.
I0205 19:22:15.642319 1 dns.go:127] [dns] << recursed type=A name=cloudtrace.googleapis.com.google.internal. rcode=SERVFAIL answers=0 rtt=2.865112ms

@ahmetb
Owner

ahmetb commented Feb 5, 2021

Looks like ndots config is a bit weird.

  1. Are you on the latest version?
  2. Any chance you can print out /etc/resolv.conf at the runtime of the container?

@ahmetb
Owner

ahmetb commented Feb 5, 2021

I'm asking because I have an example container that runs runsd and has a curl form like this:
image

and when I run it, curl works fine:

image

@Mistic92
Contributor Author

Mistic92 commented Feb 5, 2021

  1. Yes, I'm using the link to the latest release:
ADD https://github.com/ahmetb/runsd/releases/latest/download/runsd /runsd
RUN chmod +x /runsd
ENTRYPOINT ["/runsd", "-v=5", "--", "/app"]
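  2. I have added a print endpoint to the example code from this repo:
// print writes the contents of /etc/resolv.conf to the HTTP response.
func print(w http.ResponseWriter, req *http.Request) {
	file, err := os.Open("/etc/resolv.conf")
	if err != nil {
		fmt.Fprintf(w, "opening /etc/resolv.conf failed: %v\n", err)
		return // do not continue with an unusable file handle
	}
	// The file is only read, so a close error is not actionable here.
	defer file.Close()
	b, err := ioutil.ReadAll(file)
	if err != nil {
		fmt.Fprintf(w, "reading /etc/resolv.conf failed: %v\n", err)
		return
	}
	fmt.Fprint(w, string(b))
}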

Output of the file is:

nameserver 127.0.0.1
nameserver ::1
search europe-west1.run.internal. run.internal. google.internal.
options ndots:4

dig output from the example in this repo:

$ dig +search A cloudtrace.googleapis.com


; <<>> DiG 9.16.11 <<>> +search A cloudtrace.googleapis.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 15816
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;cloudtrace.googleapis.com.google.internal. IN A

Curl output

$ curl -sSLNv --http2 http://cloudtrace.googleapis.com

* Could not resolve host: cloudtrace.googleapis.com
* Closing connection 0
curl: (6) Could not resolve host: cloudtrace.googleapis.com
curl failed: exit status 6

@ahmetb
Owner

ahmetb commented Feb 5, 2021

Weirdly enough, I have the exact same example/ application working on my side:

/dig?domain=cloudtrace.googleapis.com shows:

$ dig +search A cloudtrace.googleapis.com
; <<>> DiG 9.16.11 <<>> +search A cloudtrace.googleapis.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59541
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;cloudtrace.googleapis.com.	IN	A

;; ANSWER SECTION:
cloudtrace.googleapis.com. 253	IN	A	172.217.214.95

;; Query time: 1 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Feb 05 21:32:02 UTC 2021
;; MSG SIZE  rcvd: 84

and /resolvconf shows:

nameserver 127.0.0.1
nameserver ::1
search us-central1.run.internal. run.internal. google.internal.
options ndots:4

Do you spot any differences?

@Mistic92
Contributor Author

Mistic92 commented Feb 5, 2021

The only difference I see is the region (us-central1 vs. europe-west1), and in your screenshot there is no --http2 param in curl.

@ahmetb
Owner

ahmetb commented Feb 5, 2021

I updated my image, works with --http2 as well:

image

@ahmetb
Owner

ahmetb commented Feb 5, 2021

Do you mind trying another region?

@Mistic92
Contributor Author

Mistic92 commented Feb 5, 2021

Sure. It doesn't work on us-central1 either. The results are the same; only /etc/resolv.conf is different (as expected).

nameserver 127.0.0.1
nameserver ::1
search us-central1.run.internal. run.internal. google.internal.
options ndots:4

I'll try running this on my private account instead of the corporate one, which is in an organization, but at this moment I don't see the reason for this issue.

@Mistic92
Contributor Author

Mistic92 commented Feb 5, 2021

I deployed it on my private account, in a private project, and got the same result: not working.

I was thinking that maybe it's a Docker image issue, but now I'm not sure. I'm building the example image locally with docker-compose on WSL2.
The first case, where we got the errors from trace and logging, was built on a GitLab runner on GKE, and it was completely different code (Node.js, slim image).

@Mistic92
Contributor Author

Mistic92 commented Feb 5, 2021

When I used dig with @8.8.8.8 I got some results:

/dig?domain=cloudtrace.googleapis.com@8.8.8.8

$ dig +search A cloudtrace.googleapis.com@8.8.8.8


; <<>> DiG 9.16.11 <<>> +search A cloudtrace.googleapis.com@8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 49689
;; flags: qr aa; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;cloudtrace.googleapis.com\@8.8.8.8. IN	A

;; AUTHORITY SECTION:
.			86399	IN	SOA	a.root-servers.net. nstld.verisign-grs.com. 2021020502 1800 900 604800 86400

;; Query time: 15 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Feb 05 22:00:35 UTC 2021
;; MSG SIZE  rcvd: 126


@ahmetb
Owner

ahmetb commented Feb 5, 2021

That's because it wasn't parsed correctly: @8.8.8.8 has to be a separate argument. :)

@Mistic92
Contributor Author

Mistic92 commented Feb 5, 2021

Oh, I was hoping that we had moved closer to the cause :D For now I have no idea what's going on.

@ahmetb
Owner

ahmetb commented Feb 5, 2021

Can you perhaps try deploying to us-central1?
I can't imagine what would cause this.

@Mistic92
Contributor Author

Mistic92 commented Feb 6, 2021

I did previously, and on my private account it was us-central1 as well.
I'll try building via Cloud Build and see if there is any difference.

@Mistic92
Contributor Author

Mistic92 commented Feb 6, 2021

Building on Cloud Build does not change anything.
Startup
image

Dig for google.com
image

@Mistic92
Contributor Author

Mistic92 commented Feb 6, 2021

I cloned the repo into Cloud Shell, slightly modified the regions file so I could run it there, and built the Docker image; dig and curl from the example were working. I also tried it with the runsc runtime to test under gVisor, and it works there too. Then I pushed this example to Container Registry and created a revision on Cloud Run, and nope, not working.

@Mistic92
Contributor Author

Mistic92 commented Mar 25, 2021

Hi, do you have any update regarding the DNS fix on Google's side?

edit:
Looks like it's working now. I have added this to our services and requests are passed correctly <3
Can you confirm that this fix on Google's side is stable?

@ahmetb
Owner

ahmetb commented Mar 25, 2021

Sorry, I've been debating this internally, arguing that it is a bug. So far I have not convinced people to fix it, and it required me to go low-level to prove that this is how DNS works and that our bug makes it incompatible.

My only other option is to never proxy queries for the "google.internal." zone and to resolve the metadata.google.internal lookup myself in runsd. I am not sure what the other side effects would be, and it wouldn't be a long-term solution.

Maybe we can try.

@ahmetb ahmetb changed the title from "GCP tracking failing" to "DNS issue resolving external" Mar 25, 2021
@ahmetb ahmetb changed the title from "DNS issue resolving external" to "DNS issue resolving external domains" Mar 25, 2021
@Mistic92
Contributor Author

At this moment it works on 2 of my projects: one with the hello example and one in my dev environment, so maybe something has changed. But until it's confirmed internally I won't use it.

I was considering using the serverless connector as you mentioned earlier, but it costs at least $16 per month.

image

@ahmetb
Owner

ahmetb commented Mar 25, 2021

The problem wasn't resolving internal domains, was it? It was resolving external domains, I think? example.com shouldn't be working?

@Mistic92
Contributor Author

Hm, internally it was not working either, but maybe that was for another reason and I just don't remember.
But yes, queries to external domains are still not working :(

Do you think that https://dns.google might help? runsd could query it and cache the result when it gets a SERVFAIL or when the name does not match the run.app pattern. Just a loose idea.
https://dns.google/resolve?name=cloudtrace.googleapis.com&type=A

@ahmetb
Owner

ahmetb commented Mar 25, 2021

We can't blindly replace the nameserver we recurse into. People who use VPC Connector actually use internal DNS names of VMs, and it works from Cloud Run. That's why we have to proxy to 169.254.169.254. However, the "google.internal." search domain is giving us problems when iterating over candidates. That's the problem.

ahmetb added a commit that referenced this issue Apr 1, 2021
If we let "{nonexisting}.google.internal." get handled by the Cloud Run host
nameserver, it returns a SERVFAIL, which prevents other "search" domains
from being tried.

Adding a temporary workaround that _only_ handles "metadata.google.internal."
for A questions (ignoring other question types) and properly NXDOMAINs
the non-existing domains.

This is to address #18.

Signed-off-by: Ahmet Alp Balkan <ahmetb@google.com>
@ahmetb
Owner

ahmetb commented Apr 1, 2021

I've just pushed a tag v0.0.0-rc.10 that I think fixes the problem. Give it a try @Mistic92 and close if it works.

@ahmetb
Owner

ahmetb commented Apr 13, 2021

It seems like the fix solved it.

@ahmetb ahmetb closed this as completed Apr 13, 2021