-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachprod: use gcloud CLI instead of net.LookupSRV #113934
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TC Run (0.3 prob): https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/12555726
Reviewable status: complete! 0 of 0 LGTMs obtained
Previously `net.LookupSRV` with a custom resolver was used to lookup DNS records. This approach resulted in several flakes and required waiting on DNS servers to have the records available. The CLI is more stable, but has a greater call overhead. Fixes cockroachdb#111269 Epic: None Release Note: None
Previously we used net lookups to retrieve DNS entries and discover services. This is a call that can happen frequently, but the overhead of a DNS lookup is low. After the change to rather use the CLI to retrieve DNS entries there is a bigger cost associated with the call. This change introduces a cache to reduce the cost of the `LookupSRVRecords` call which could be called frequently depending on the origin of use. The cache is updated for any CRUD operations on the DNS entries, and a call to the CLI will not occur if any entry exists for the name the lookup is attempting. The names are also normalised to remove a trailing dot in order to make matching against the cache work correctly. There is a small risk that the cache could go out of sync if any other roachprod process manipulates the records with a create, update or destroy operation, while a continuous roachprod process is interacting with the entries. This risk is relatively small and usually applies to roachtest rather than everyday use of roachprod. Epic: None Release Note: None
ece4c15
to
f7224af
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ran all seed-multi-region
tests as well as these are the ones always present on the DNS failures: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/12675094?buildTab=tests
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @renatolabs, and @srosenberg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of questions and a suggestion about use of mutex. Excited to finally fix that roachtest! 😆
Reviewed 2 of 2 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, and @srosenberg)
pkg/roachprod/vm/gce/dns.go
line 171 at r1 (raw file):
target := name if service != "" || proto != "" { target = "_" + service + "._" + proto + "." + name
Not immediately obvious to me what this is doing, especially the predicate above.
pkg/roachprod/vm/gce/dns.go
line 43 at r2 (raw file):
type dnsProvider struct { recordsCache struct { mu syncutil.RWMutex
I think we should make our lives easier and just use a regular Mutex
. RWMutex
's performance gains are questionable [1], and we are just opening the door for a class of bugs that doesn't exist with Mutex
. Also, this is not going to be ultra performance sensitive anyway.
pkg/roachprod/vm/gce/dns.go
line 277 at r2 (raw file):
depending on where the name originates from
Curious to understand this better.
Just so I understand the risk: it would have to involve two roachprod instances (command line or through roachtest) changing the same cluster at the same time, yes? |
Previously the service domain name was constructed in a few places. This change unifies passing the full service name from one location.
The current `RWMutex` adds unnecessary complexity, and as this is an area that doesn't require high levels of performance it's preferable to use a regular mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's the primary case. The other risk is that any long running process holding onto roachprod, for instance roachtest, would be unaware of external changes to the entries due to the cache (for the same cluster). For instance, I start a long running test. Then via command line start more virtual clusters via a new roachprod process. This would result in the long running test not being aware of those new services. If that test were to in a subsequent later step try to do something similar it might fail.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @renatolabs, and @srosenberg)
pkg/roachprod/vm/gce/dns.go
line 171 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Not immediately obvious to me what this is doing, especially the predicate above.
Originates from the go code logic to choose between name or constructing the name: https://github.com/golang/go/blob/master/src/net/lookup.go#L723
But I agree, it's convoluted and unclear. I'll clean this up a bit.
pkg/roachprod/vm/gce/dns.go
line 43 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
I think we should make our lives easier and just use a regular
Mutex
.RWMutex
's performance gains are questionable [1], and we are just opening the door for a class of bugs that doesn't exist withMutex
. Also, this is not going to be ultra performance sensitive anyway.
Fair enough, I'll change to a regular mutex.
pkg/roachprod/vm/gce/dns.go
line 277 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
depending on where the name originates from
Curious to understand this better.
gcloud
cli and lookups will allow passing a FQDN with or without a trailing dot (https://en.wikipedia.org/wiki/Fully_qualified_domain_name). This allows you to create an entry without a trailing dot, but get one back if you query the same record. dig
also allows you to pass this optionally. It's possible to enforce the convention here, or just handle both cases.
My wording on the comment could be improved to rather mention gcloud
will always return the domain name ending with a dot, but you can optionally specify it during creation, updates, or lookups.
14b2f24
to
de5d867
Compare
TFTR! bors r=renatolabs |
Build succeeded: |
Previously
net.LookupSRV
with a custom resolver was used to lookup DNSrecords. This approach resulted in several flakes and required waiting on DNS
servers to have the records available. The CLI is more stable, but has a greater
call overhead.
This PR also introduces a cache to reduce the cost of the
LookupSRVRecords
call which could be called frequently depending on the origin of use. The cache
is updated for any CRUD operations on the DNS entries, and a call to the CLI
will not occur if any entry exists for the name the lookup is attempting. The
names are also normalised to remove a trailing dot in order to make matching
against the cache work correctly.
There is a small risk that the cache could go out of sync if any other roachprod
process manipulates the records with a create, update or destroy operation,
while a continuous roachprod process is interacting with the entries. This risk
is relatively small and usually applies to roachtest rather than everyday use of
roachprod.
Fixes #111269
Epic: None
Release Note: None