Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: Use longer timeout for ginkgo DNS lookups #12062

Merged
merged 1 commit into from Jun 21, 2020

Conversation

jrajahalme
Copy link
Member

FQDN tests have been failing due to DNS lookups from ginkgo not
succeeding in 30 seconds. Use the longer HelperTimeout (4 minutes)
instead.

Also split the two lookups into two separate WithTimeout()
invocations, so that we do not need to repeat the 1st if the 2nd
fails.

Related: #11848
Related: #10538
Signed-off-by: Jarno Rajahalme jarno@covalent.io

FQDN tests have been failing due to DNS lookups from ginkgo not
succeeding in 30 seconds. Use the longer HelperTimeout (4 minutes)
instead.

Also split the two lookups into two separate WithTimeout()
invocations, so that we do not need to repeat the 1st if the 2nd
fails.

Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
@jrajahalme jrajahalme added area/CI-improvement Topic or proposal to improve the Continuous Integration workflow release-note/ci This PR makes changes to the CI. ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Jun 12, 2020
@jrajahalme jrajahalme requested a review from a team as a code owner June 12, 2020 19:27
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.8.0 Jun 12, 2020
@jrajahalme
Copy link
Member Author

test-me-please

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.001%) to 37.035% when pulling ae31e75 on pr/jrajahalme/test-ginkgo-longer-dns-timeout into a8066ce on master.

@pchaigno
Copy link
Member

Do we really expect a DNS request to take more than 30 seconds? Maybe we could add retries instead?

What's the risk of letting the tests proceed if the domains could not be resolved? We already have the IPs above and they don't seem that unstable. Would we risk missing actual test failures if we issue a warning instead of an error when the preparatory DNS request fails?

@jrajahalme
Copy link
Member Author

jrajahalme commented Jun 13, 2020

@pchaigno WithTimeout() retries the command by default every 5 seconds already.

Based on the test run of this PR vagrant-cache.ci.cilium.io is still 147.75.38.95, but http://jenkins.cilium.io is 54.148.123.155 rather than 104.198.14.52.

addrs, lookupErr = net.LookupHost("vagrant-cache.ci.cilium.io")
if lookupErr != nil {
lookupErr = fmt.Errorf("error looking up vagrant-cache.ci.cilium.io: %s", lookupErr)
addrs, err2 := net.LookupHost("vagrant-cache.ci.cilium.io")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use google.com or something similar with high availability guarantees?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think changing the domain name we request will have any impact on the success of DNS requests. We already use Google's DNS, if I recall correctly, to resolve domain names.
The problem is likely Packet's connection to the outside, so any DNS resolver will have the same issues.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hence a longer timeout may help to carry over intermittent connectivity issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That DNS resolution relies on getaddrinfo(3) under the hood. I don't think getaddrinfo(3) implements any retry. I couldn't find any retry mechanism at the Golang layer either.

In case of intermittent connectivity issues, the request packet won't stay around, it will be dropped. So I don't think waiting longer for an answer is going to solve this.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 15, 2020
@pchaigno pchaigno removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 15, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. and removed ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Jun 15, 2020
@jrajahalme jrajahalme requested a review from pchaigno June 16, 2020 18:25
@joestringer
Copy link
Member

@pchaigno WithTimeout() retries the command by default every 5 seconds already.

How many attempts are actually made? The Go net library doesn't document how long the net.LookupHost() request actually takes so probably needs a bit more digging to find out. Basically I'm wondering how many requests the 30s timeout corresponds to and why we think that number of requests are failing (and hence why a longer timeout would improve the behaviour)

@jrajahalme
Copy link
Member Author

@pchaigno @joestringer The default call interval of helpers.RepeatUntilTrue() that helpers.WithTimeout() uses, is 5 seconds, so you would get 5 or 6 repeated calls to net.LookupHost() during the 30 second timeout.

Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I keep forgetting that WithTimeout implements retries :/

LGTM!

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 19, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.0 Jun 19, 2020
@joestringer
Copy link
Member

@pchaigno @joestringer The default call interval of helpers.RepeatUntilTrue() that helpers.WithTimeout() uses, is 5 seconds, so you would get 5 or 6 repeated calls to net.LookupHost() during the 30 second timeout.

This assumes that net.LookupHost() returns immediately. Is that true?

@jrajahalme
Copy link
Member Author

This assumes that net.LookupHost() returns immediately. Is that true?

Good question! According to Go docs net.LookupHost() can be backed by a cgo call to getaddrinfo() or by an internal Go resolver, depending on many different considerations. Tried looking into potential latency of getaddrinfo(), and found mentions of multiple seconds at the worst, so I don't know if it is bounded at all. Don't know the internals of the Go implementation.

@aanm aanm merged commit 05cdda5 into master Jun 21, 2020
1.8.0 automation moved this from In progress to Merged Jun 21, 2020
@aanm aanm deleted the pr/jrajahalme/test-ginkgo-longer-dns-timeout branch June 21, 2020 12:12
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.0 Jun 21, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.0 Jun 21, 2020
@aanm aanm mentioned this pull request Jun 21, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.0 Jun 22, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.0 Jun 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI-improvement Topic or proposal to improve the Continuous Integration workflow ci/flake This is a known failure that occurs in the tree. Please investigate me! ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
No open projects
1.8.0
  
Merged
1.8.0
Backport done to v1.8
Development

Successfully merging this pull request may close these issues.

None yet

6 participants