upgrades: Setting the cluster version can get stuck behind leasing #113908

Closed
fqazi opened this issue Nov 6, 2023 · 0 comments · Fixed by #113996
Assignees
Labels
branch-release-23.1: Used to mark GA and release blockers, technical advisories, and bugs for 23.1
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
release-blocker: Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
T-sql-foundations: SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@fqazi
Collaborator

fqazi commented Nov 6, 2023

On clusters with a large number of nodes and descriptors, it's possible for leasing traffic to be continuous. As part of 23.1, we added version checks during lease renewal and deletion to detect whether the lease format should be regional by row or the old non-multi-region format. Unfortunately, the version guards in the leasing manager query KV in a high-priority transaction, which prevents us from bumping the cluster version.
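
To make the version check above concrete, here is a minimal sketch in Go of choosing a lease format from the active cluster version. It is not the leasing manager's code: `Version`, `LeaseFormat`, `ChooseLeaseFormat`, and the gate value in `regionalByRowGate` are all hypothetical names and values chosen for illustration, and the real check reads the version inside a KV transaction rather than taking it as an argument.

```go
package main

import "fmt"

// Version is a simplified stand-in for a cluster version (hypothetical type).
type Version struct{ Major, Minor int }

// AtLeast reports whether v is the same as or newer than other.
func (v Version) AtLeast(other Version) bool {
	if v.Major != other.Major {
		return v.Major > other.Major
	}
	return v.Minor >= other.Minor
}

// LeaseFormat names the layout used to store a lease row (hypothetical type).
type LeaseFormat string

const (
	LegacyFormat        LeaseFormat = "legacy"
	RegionalByRowFormat LeaseFormat = "regional-by-row"
)

// regionalByRowGate is an assumed gate version, used only for this sketch.
var regionalByRowGate = Version{Major: 23, Minor: 1}

// ChooseLeaseFormat picks the lease format for the given active version.
// In the real system the version would be read in the same transaction
// that renews or deletes the lease.
func ChooseLeaseFormat(active Version) LeaseFormat {
	if active.AtLeast(regionalByRowGate) {
		return RegionalByRowFormat
	}
	return LegacyFormat
}

func main() {
	fmt.Println(ChooseLeaseFormat(Version{Major: 22, Minor: 2}))
	fmt.Println(ChooseLeaseFormat(Version{Major: 23, Minor: 1}))
}
```

Because every renewal and deletion performs this kind of check, the key holding the version is read continuously on a busy cluster.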

We need to know the version transactionally so that we can determine whether the new regional by row table should be used, so skipping transactional version checks is not an option. Instead, the alternative we are going to pursue is to raise the priority of the upgrade transaction when setting the cluster version.
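
The effect of raising the priority can be sketched with a toy model in Go. This is a deliberate simplification and not how KV resolves conflicts in detail: `Priority`, `tryCommit`, and `AttemptsToCommit` are hypothetical names, and the only rule modeled is that a writer is pushed (and must retry) when it conflicts with a reader of strictly higher priority.

```go
package main

import "fmt"

// Priority is a toy transaction priority; the larger value wins a conflict.
type Priority int

const (
	NormalPriority Priority = 1
	HighPriority   Priority = 2
)

// tryCommit models one attempt by the version-bump writer while a
// lease-renewal reader of priority readerPri touches the same key.
// In this toy model the writer is pushed only when its priority is
// strictly lower than the reader's.
func tryCommit(writerPri, readerPri Priority) bool {
	return writerPri >= readerPri
}

// AttemptsToCommit returns how many attempts the writer needs when every
// attempt overlaps a high-priority lease read, or -1 if it never commits
// within maxAttempts.
func AttemptsToCommit(writerPri Priority, maxAttempts int) int {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if tryCommit(writerPri, HighPriority) {
			return attempt
		}
	}
	return -1
}

func main() {
	fmt.Println("normal-priority version bump:", AttemptsToCommit(NormalPriority, 1000))
	fmt.Println("high-priority version bump:", AttemptsToCommit(HighPriority, 1000))
}
```

In this model, `AttemptsToCommit(NormalPriority, 1000)` returns -1 because continuous high-priority reads always push the writer, while `AttemptsToCommit(HighPriority, 1000)` returns 1.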

Jira issue: CRDB-33252

gz#19331

Epic CRDB-35306

fqazi added the C-bug and T-sql-foundations labels Nov 6, 2023
fqazi self-assigned this Nov 6, 2023
nvanbenschoten added the O-support label Nov 7, 2023
rafiss added the branch-release-23.1 and release-blocker labels Nov 7, 2023
fqazi added a commit to fqazi/cockroach that referenced this issue Nov 8, 2023
Previously, it was possible for the leasing subsystem to starve out
attempts to set the cluster version during upgrades, since the leasing
subsystem uses high priority txn for renewals. To address this, this
patch makes the logic to set the cluster version high priority so
it can't be pushed out by lease renewals.

Fixes: cockroachdb#113908

Release note (bug fix): Addressed a bug that could cause cluster version
finalization to get starved out by descriptor lease renewals on larger
clusters.
fqazi added a commit to fqazi/cockroach that referenced this issue Nov 9, 2023
fqazi added a commit to fqazi/cockroach that referenced this issue Nov 9, 2023
fqazi added a commit to fqazi/cockroach that referenced this issue Nov 13, 2023
fqazi added a commit to fqazi/cockroach that referenced this issue Nov 13, 2023
fqazi added a commit to fqazi/cockroach that referenced this issue Nov 15, 2023
fqazi added a commit to fqazi/cockroach that referenced this issue Nov 16, 2023
craig bot pushed a commit that referenced this issue Nov 17, 2023
113934: roachprod: use gcloud CLI instead of net.LookupSRV r=renatolabs a=herkolategan

Previously `net.LookupSRV` with a custom resolver was used to lookup DNS
records. This approach resulted in several flakes and required waiting on DNS
servers to have the records available. The CLI is more stable, but has a greater
call overhead.

This PR also introduces a cache to reduce the cost of the `LookupSRVRecords`
call which could be called frequently depending on the origin of use. The cache
is updated for any CRUD operations on the DNS entries, and a call to the CLI
will not occur if any entry exists for the name the lookup is attempting. The
names are also normalised to remove a trailing dot in order to make matching
against the cache work correctly.
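
The caching behaviour described above can be sketched roughly as follows in Go. This is an illustration under stated assumptions, not the roachprod implementation: `RecordCache`, `SRVRecord`, `normalize`, and the `fetch` callback (standing in for the slower CLI call) are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// SRVRecord is a simplified stand-in for a DNS SRV record.
type SRVRecord struct {
	Target string
	Port   uint16
}

// normalize strips a trailing dot so "a.example." and "a.example" match.
func normalize(name string) string { return strings.TrimSuffix(name, ".") }

// RecordCache keeps SRV records keyed by normalized name.
type RecordCache struct {
	mu      sync.Mutex
	entries map[string][]SRVRecord
}

// NewRecordCache returns an empty cache.
func NewRecordCache() *RecordCache {
	return &RecordCache{entries: make(map[string][]SRVRecord)}
}

// Put records a create or update of the entries for name.
func (c *RecordCache) Put(name string, recs []SRVRecord) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[normalize(name)] = recs
}

// Delete records the removal of all entries for name.
func (c *RecordCache) Delete(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, normalize(name))
}

// Lookup returns cached records if any exist for name; otherwise it calls
// fetch and caches a non-empty result.
func (c *RecordCache) Lookup(name string, fetch func(string) ([]SRVRecord, error)) ([]SRVRecord, error) {
	key := normalize(name)
	c.mu.Lock()
	defer c.mu.Unlock()
	if recs := c.entries[key]; len(recs) > 0 {
		return recs, nil
	}
	recs, err := fetch(key)
	if err != nil {
		return nil, err
	}
	if len(recs) > 0 {
		c.entries[key] = recs
	}
	return recs, nil
}

func main() {
	cache := NewRecordCache()
	calls := 0
	fetch := func(name string) ([]SRVRecord, error) {
		calls++
		return []SRVRecord{{Target: "node1.example", Port: 26257}}, nil
	}
	cache.Lookup("_svc._tcp.example.", fetch) // miss: calls fetch
	cache.Lookup("_svc._tcp.example", fetch)  // hit: same normalized name
	fmt.Println("fetch calls:", calls)
}
```

The second lookup is served from the cache even though it omits the trailing dot, which is the matching problem the normalisation is meant to solve.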

There is a small risk that the cache could go out of sync if any other roachprod
process manipulates the records with a create, update or destroy operation,
while a continuous roachprod process is interacting with the entries. This risk
is relatively small and usually applies to roachtest rather than everyday use of
roachprod.

Fixes #111269

Epic: None
Release Note: None


113996: upgrade: use high priority txn's to update the cluster version r=fqazi a=fqazi

Previously, it was possible for the leasing subsystem to starve out attempts to set the cluster version during upgrades, since the leasing subsystem uses high priority txn for renewals. To address this, this patch makes the logic to set the cluster version high priority so it can't be pushed out by lease renewals.

Fixes: #113908

Release note (bug fix): Addressed a bug that could cause cluster version finalization to get starved out by descriptor lease renewals on larger clusters.

Co-authored-by: Herko Lategan <herko@cockroachlabs.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
craig bot closed this as completed in 07f3628 Nov 17, 2023
yuzefovich pushed a commit to yuzefovich/cockroach that referenced this issue Nov 28, 2023
fqazi added a commit that referenced this issue Nov 29, 2023