roachprod: ns-cloud-a1.googledomains.com lookup failure #111269

Closed
herkolategan opened this issue Sep 26, 2023 · 1 comment · Fixed by #113934
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-testeng TestEng Team

Comments

@herkolategan
Collaborator

herkolategan commented Sep 26, 2023

We observed a DNS flake on a custom run of roachtest [1].
The run was unable to look up ns-cloud-a1.googledomains.com, the DNS server against which follow-up DNS requests would have been made.

lookup _system-sql._tcp.teamcity-11905017-1695525840-95-n9cpu4-geo.roachprod-managed.crdb.io on 169.254.169.254:53: dial tcp: lookup ns-cloud-a1.googledomains.com: i/o timeout

[1] https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/11905017
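
For context, here is a minimal Go sketch of how a resolver pinned to a named DNS server can fail in two stages, as in the error above. It is not the roachprod code: `newResolver`, the timeouts, and the cluster name in `main` are assumptions for illustration. The point is that dialing the DNS server by host name forces an extra lookup of that host name before the SRV query can be sent.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// dnsServer is the name server from the error above. Because it is a host
// name rather than an IP, it must itself be resolved before it can be dialed.
const dnsServer = "ns-cloud-a1.googledomains.com:53"

// newResolver (hypothetical helper) returns a resolver that sends every query
// to server instead of the system-configured DNS server.
func newResolver(server string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true, // use Go's built-in resolver so that Dial is honoured
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			// Dialing a host name triggers the additional lookup that timed out.
			return d.DialContext(ctx, network, server)
		},
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// The cluster name below is a placeholder, not a real roachprod cluster.
	_, srvs, err := newResolver(dnsServer).LookupSRV(
		ctx, "system-sql", "tcp", "example-cluster.roachprod-managed.crdb.io")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, s := range srvs {
		fmt.Printf("%s:%d\n", s.Target, s.Port)
	}
}
```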

Jira issue: CRDB-31847

@herkolategan herkolategan added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure T-testeng TestEng Team labels Sep 26, 2023
@blathers-crl

blathers-crl bot commented Sep 26, 2023

cc @cockroachdb/test-eng

@blathers-crl blathers-crl bot added this to Triage in Test Engineering Sep 26, 2023
herkolategan added a commit to herkolategan/cockroach that referenced this issue Sep 26, 2023
Previously we observed a flake while trying to look up the IP of the DNS server
`ns-cloud-a1.googledomains.com`. This change replaces the host name with its
static IP (216.239.32.106) in order to reduce flakes.

Fixes: cockroachdb#111269

Epic: None
Release Note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Oct 25, 2023
Previously we observed a flake while trying to look up the IP of the DNS server
`ns-cloud-a1.googledomains.com`. This change falls back to a known static IP
(216.239.32.106) if the lookup fails, in order to reduce flakes.

Fixes: cockroachdb#111269

Epic: None
Release Note: None
herkolategan added a commit to herkolategan/cockroach that referenced this issue Nov 2, 2023
Previously we observed flakes while trying to look up the IP of the DNS server
`ns-cloud-a1.googledomains.com`. This change adds a new function that uses
multiple resolvers to resolve the IP. In the worst case it falls back to a known
static IP (216.239.32.106), although this IP might not remain correct.

Fixes cockroachdb#111269

Epic: None
Release Note: None
@herkolategan herkolategan self-assigned this Nov 10, 2023
@herkolategan herkolategan moved this from Triage to Backlog in Test Engineering Nov 10, 2023
@herkolategan herkolategan moved this from Backlog to In Progress in Test Engineering Nov 10, 2023
herkolategan added a commit to herkolategan/cockroach that referenced this issue Nov 17, 2023
Previously `net.LookupSRV` with a custom resolver was used to look up DNS
records. This approach resulted in several flakes and required waiting for DNS
servers to have the records available. The gcloud CLI is more stable, but has a
greater per-call overhead.

Fixes cockroachdb#111269

Epic: None
Release Note: None
craig bot pushed a commit that referenced this issue Nov 17, 2023
113934: roachprod: use gcloud CLI instead of net.LookupSRV r=renatolabs a=herkolategan

Previously `net.LookupSRV` with a custom resolver was used to look up DNS
records. This approach resulted in several flakes and required waiting for DNS
servers to have the records available. The gcloud CLI is more stable, but has a
greater per-call overhead.
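
As a rough illustration of the CLI-based approach, the Go sketch below shells out to `gcloud dns record-sets list` and parses its JSON output. It is not the code from this PR: the helper names, the project, zone, and record names in `main`, and the assumption that each SRV `rrdatas` entry has the form `priority weight port target` are all illustrative.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// recordSet mirrors the fields of the gcloud JSON output used in this sketch.
type recordSet struct {
	Name    string   `json:"name"`
	Type    string   `json:"type"`
	RRDatas []string `json:"rrdatas"`
}

// srvTarget is a host and port extracted from one SRV record.
type srvTarget struct {
	Host string
	Port uint16
}

// listArgs builds the gcloud arguments for listing SRV records for name.
func listArgs(project, zone, name string) []string {
	return []string{"dns", "record-sets", "list",
		"--project", project, "--zone", zone,
		"--name", name, "--type", "SRV", "--format", "json"}
}

// parseSRV decodes gcloud's JSON output into SRV targets, assuming each
// rrdatas entry looks like "priority weight port target".
func parseSRV(out []byte) ([]srvTarget, error) {
	var sets []recordSet
	if err := json.Unmarshal(out, &sets); err != nil {
		return nil, err
	}
	var targets []srvTarget
	for _, rs := range sets {
		if rs.Type != "SRV" {
			continue
		}
		for _, data := range rs.RRDatas {
			f := strings.Fields(data)
			if len(f) != 4 {
				return nil, fmt.Errorf("unexpected SRV data %q", data)
			}
			port, err := strconv.ParseUint(f[2], 10, 16)
			if err != nil {
				return nil, err
			}
			targets = append(targets, srvTarget{Host: strings.TrimSuffix(f[3], "."), Port: uint16(port)})
		}
	}
	return targets, nil
}

// lookupSRV runs gcloud and parses the result; it requires an installed and
// authenticated gcloud CLI.
func lookupSRV(ctx context.Context, project, zone, name string) ([]srvTarget, error) {
	out, err := exec.CommandContext(ctx, "gcloud", listArgs(project, zone, name)...).Output()
	if err != nil {
		return nil, err
	}
	return parseSRV(out)
}

func main() {
	// Project, zone, and record name are placeholders.
	targets, err := lookupSRV(context.Background(), "example-project", "example-zone",
		"_system-sql._tcp.example-cluster.example.com.")
	if err != nil {
		fmt.Println("gcloud lookup failed:", err)
		return
	}
	fmt.Println(targets)
}
```

Each call spawns a process, which is the per-call overhead the description mentions.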

This PR also introduces a cache to reduce the cost of the `LookupSRVRecords`
call, which may be invoked frequently depending on the caller. The cache is
updated on every CRUD operation on the DNS entries, and the CLI is not called
if the cache already holds an entry for the name being looked up. Names are
also normalised by removing any trailing dot so that they match the cache keys
correctly.
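
The caching behaviour described above can be sketched in Go as follows. This is an illustrative assumption rather than the PR's implementation: `srvCache`, its methods, and the `fetch` callback (standing in for the CLI call) are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// srvCache holds SRV targets keyed by normalised record name.
type srvCache struct {
	mu      sync.Mutex
	records map[string][]string
}

func newSRVCache() *srvCache {
	return &srvCache{records: make(map[string][]string)}
}

// normalize strips a trailing dot so "a.example." and "a.example" share a key.
func normalize(name string) string {
	return strings.TrimSuffix(name, ".")
}

// lookup returns the cached targets for name. On a miss it calls fetch (a
// stand-in for the expensive CLI call) and stores the result.
func (c *srvCache) lookup(name string, fetch func(string) ([]string, error)) ([]string, error) {
	key := normalize(name)
	c.mu.Lock()
	targets, ok := c.records[key]
	c.mu.Unlock()
	if ok {
		return targets, nil
	}
	targets, err := fetch(name)
	if err != nil {
		return nil, err
	}
	c.put(name, targets)
	return targets, nil
}

// put records the targets for name; call it after a create or update.
func (c *srvCache) put(name string, targets []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.records[normalize(name)] = targets
}

// remove drops the entry for name; call it after a delete.
func (c *srvCache) remove(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.records, normalize(name))
}

func main() {
	cache := newSRVCache()
	fetch := func(name string) ([]string, error) {
		fmt.Println("fetching", name) // printed only on a cache miss
		return []string{"node1.example:26257"}, nil
	}
	cache.lookup("_system-sql._tcp.example.", fetch) // miss: calls fetch
	cache.lookup("_system-sql._tcp.example", fetch)  // hit: same normalised key
}
```

The cache only sees changes made through this process, so records changed by another process would not be reflected until the entry is replaced or removed.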

There is a small risk that the cache could go out of sync if another roachprod
process creates, updates, or destroys records while a long-running roachprod
process is interacting with the same entries. This risk is relatively small and
usually applies to roachtest rather than everyday use of roachprod.

Fixes #111269

Epic: None
Release Note: None


113996: upgrade: use high priority txn's to update the cluster version r=fqazi a=fqazi

Previously, it was possible for the leasing subsystem to starve out attempts to set the cluster version during upgrades, since the leasing subsystem uses high-priority transactions for renewals. To address this, this patch makes the logic that sets the cluster version high priority as well, so it can't be pushed out by lease renewals.

Fixes: #113908

Release note (bug fix): Addressed a bug that could cause cluster version finalization to get starved out by descriptor lease renewals on larger clusters.

Co-authored-by: Herko Lategan <herko@cockroachlabs.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
@craig craig bot closed this as completed in f06db11 Nov 17, 2023
Test Engineering automation moved this from In Progress to Done Nov 17, 2023
cockroach-teamcity pushed a commit to cockroach-teamcity/cockroach that referenced this issue Nov 27, 2023
annrpom pushed a commit to annrpom/cockroach that referenced this issue Nov 29, 2023