Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachprod: use gcloud CLI instead of net.LookupSRV #113934

Merged
merged 4 commits into from
Nov 17, 2023

Conversation

herkolategan
Copy link
Collaborator

@herkolategan herkolategan commented Nov 7, 2023

Previously net.LookupSRV with a custom resolver was used to lookup DNS
records. This approach resulted in several flakes and required waiting on DNS
servers to have the records available. The CLI is more stable, but has a greater
call overhead.

This PR also introduces a cache to reduce the cost of the LookupSRVRecords
call which could be called frequently depending on the origin of use. The cache
is updated for any CRUD operations on the DNS entries, and a call to the CLI
will not occur if any entry exists for the name the lookup is attempting. The
names are also normalised to remove a trailing dot in order to make matching
against the cache work correctly.

There is a small risk that the cache could go out of sync if any other roachprod
process manipulates the records with a create, update or destroy operation,
while a continuous roachprod process is interacting with the entries. This risk
is relatively small and usually applies to roachtest rather than everyday use of
roachprod.

Fixes #111269

Epic: None
Release Note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator Author

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously `net.LookupSRV` with a custom resolver was used to lookup DNS
records. This approach resulted in several flakes and required waiting on DNS
servers to have the records available. The CLI is more stable, but has a greater
call overhead.

Fixes cockroachdb#111269

Epic: None
Release Note: None
Previously we used net lookups to retrieve DNS entries and discover services.
This is a call that can happen frequently, but the overhead of a DNS lookup is
low. After the change to rather use the CLI to retrieve DNS entries there is a
bigger cost associated with the call.

This change introduces a cache to reduce the cost of the `LookupSRVRecords` call
which could be called frequently depending on the origin of use. The cache is
updated for any CRUD operations on the DNS entries, and a call to the CLI will
not occur if any entry exists for the name the lookup is attempting. The names
are also normalised to remove a trailing dot in order to make matching against
the cache work correctly.

There is a small risk that the cache could go out of sync if any other roachprod
process manipulates the records with a create, update or destroy operation,
while a continuous roachprod process is interacting with the entries. This risk
is relatively small and usually applies to roachtest rather than everyday use of
roachprod.

Epic: None
Release Note: None
@herkolategan herkolategan marked this pull request as ready for review November 10, 2023 11:22
@herkolategan herkolategan requested a review from a team as a code owner November 10, 2023 11:22
@herkolategan herkolategan requested review from srosenberg, DarrylWong and renatolabs and removed request for a team November 10, 2023 11:22
Copy link
Collaborator Author

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran all seed-multi-region tests as well as these are the ones always present on the DNS failures: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/12675094?buildTab=tests

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @renatolabs, and @srosenberg)

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Nice!

A couple of questions and a suggestion about use of mutex. Excited to finally fix that roachtest! 😆

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, and @srosenberg)


pkg/roachprod/vm/gce/dns.go line 171 at r1 (raw file):

	target := name
	if service != "" || proto != "" {
		target = "_" + service + "._" + proto + "." + name

Not immediately obvious to me what this is doing, especially the predicate above.


pkg/roachprod/vm/gce/dns.go line 43 at r2 (raw file):

type dnsProvider struct {
	recordsCache struct {
		mu      syncutil.RWMutex

I think we should make our lives easier and just use a regular Mutex. RWMutex's performance gains are questionable [1], and we are just opening the door for a class of bugs that doesn't exist with Mutex. Also, this is not going to be ultra performance sensitive anyway.

[1] https://cockroachlabs.slack.com/archives/C4X2J0RH6/p1693183691487509?thread_ts=1692899413.532049&cid=C4X2J0RH6


pkg/roachprod/vm/gce/dns.go line 277 at r2 (raw file):

depending on where the name originates from

Curious to understand this better.

@renatolabs
Copy link
Contributor

There is a small risk that the cache could go out of sync

Just so I understand the risk: it would have to involve two roachprod instances (command line or through roachtest) changing the same cluster at the same time, yes?

Previously the service domain name was constructed in a few places. This change
unifies passing the full service name from one location.
The current `RWMutex` adds unnecessary complexity, and as this is an area that
doesn't require high levels of performance it's preferable to use a regular
mutex.
Copy link
Collaborator Author

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's the primary case. The other risk is that any long running process holding onto roachprod, for instance roachtest, would be unaware of external changes to the entries due to the cache (for the same cluster). For instance, I start a long running test. Then via command line start more virtual clusters via a new roachprod process. This would result in the long running test not being aware of those new services. If that test were to in a subsequent later step try to do something similar it might fail.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @renatolabs, and @srosenberg)


pkg/roachprod/vm/gce/dns.go line 171 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Not immediately obvious to me what this is doing, especially the predicate above.

Originates from the go code logic to choose between name or constructing the name: https://github.com/golang/go/blob/master/src/net/lookup.go#L723
But I agree, it's convoluted and unclear. I'll clean this up a bit.


pkg/roachprod/vm/gce/dns.go line 43 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

I think we should make our lives easier and just use a regular Mutex. RWMutex's performance gains are questionable [1], and we are just opening the door for a class of bugs that doesn't exist with Mutex. Also, this is not going to be ultra performance sensitive anyway.

[1] https://cockroachlabs.slack.com/archives/C4X2J0RH6/p1693183691487509?thread_ts=1692899413.532049&cid=C4X2J0RH6

Fair enough, I'll change to a regular mutex.


pkg/roachprod/vm/gce/dns.go line 277 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

depending on where the name originates from

Curious to understand this better.

gcloud cli and lookups will allow passing a FQDN with or without a trailing dot (https://en.wikipedia.org/wiki/Fully_qualified_domain_name). This allows you to create an entry without a trailing dot, but get one back if you query the same record. dig also allows you to pass this optionally. It's possible to enforce the convention here, or just handle both cases.

My wording on the comment could be improved to rather mention gcloud will always return the domain name ending with a dot, but you can optionally specify it during creation, updates, or lookups.

@herkolategan
Copy link
Collaborator Author

TFTR!

bors r=renatolabs

@craig
Copy link
Contributor

craig bot commented Nov 17, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachprod: ns-cloud-a1.googledomains.com lookup failure
3 participants