Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regionliveness: add support for sqlinstances and recovery #116784

Merged

Conversation

fqazi
Copy link
Collaborator

@fqazi fqazi commented Dec 19, 2023

This PR completes the regionliveness survivability goal work by implementing the following:

  1. Logic to bound the sqlliveness reviewals based on the unavailable_at time set on a region
  2. Updating the sqlinstance allocation logic to take into account regionliveness by working on live regions only.
  3. Updating the change feed initial scan on sqlinstances to work on a per-region basis and consider region liveness.
  4. Updating the sqlinstance allocation logic to recover from region failures by cleaning up system.leases, system.sqlinstances, and system.sqllivness after a failure.
  5. A new roachtest focused on setting up a physical cluster and simulating failure scenarios for nightly builds.
  6. A synthetic test for simulating region failures and recovery from them

This PR is stacked on top of: #115568, so the first 4 commits should be ignored.

informs: #103727
EPIC: CC-24173

@fqazi fqazi requested review from a team as code owners December 19, 2023 18:43
@fqazi fqazi requested review from herkolategan, DarrylWong, yuzefovich and adityamaru and removed request for a team December 19, 2023 18:43
Copy link

blathers-crl bot commented Dec 19, 2023

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@fqazi fqazi force-pushed the regionLivenessForSqlSessionHeartbeat branch from ba5c268 to caccf15 Compare December 19, 2023 19:08
@yuzefovich yuzefovich removed their request for review December 20, 2023 01:08
@fqazi fqazi force-pushed the regionLivenessForSqlSessionHeartbeat branch 2 times, most recently from 0ddfd7e to 1a3848a Compare January 9, 2024 00:50
@fqazi fqazi requested review from JeffSwenson and removed request for adityamaru January 12, 2024 21:27
pkg/sql/sqlliveness/slinstance/slinstance.go Outdated Show resolved Hide resolved
pkg/sql/sqlliveness/slstorage/slstorage.go Outdated Show resolved Hide resolved
pkg/sql/sqlliveness/slstorage/slstorage.go Outdated Show resolved Hide resolved
pkg/sql/sqlliveness/slstorage/slstorage.go Outdated Show resolved Hide resolved
pkg/sql/sqlinstance/instancestorage/instancestorage.go Outdated Show resolved Hide resolved
pkg/sql/sqlinstance/instancestorage/instancestorage.go Outdated Show resolved Hide resolved
@fqazi fqazi force-pushed the regionLivenessForSqlSessionHeartbeat branch from 1a3848a to 49c5303 Compare January 25, 2024 03:59
Copy link
Collaborator Author

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, and @JeffSwenson)


pkg/sql/sqlinstance/instancestorage/instancestorage.go line 292 at r7 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

Instead of saving the sql server's own session can we run the clean up operation before allocating a session ID?

Done.


pkg/sql/sqlliveness/slstorage/slstorage.go line 543 at r5 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

nit: this block can be simplified by removing the delta calculation. If the region is unavailable before the calculated expiration timestamp, then we can use that timestamp as the new expiration.

regionUnavailableAt := unavailableRegions[string(region)].GoTime()
if regionUnavailableAt.Less(exp) {
    exp = regionUnavailableAt
}

The problem that we have is that the timestamp is stored as DTimestamp, and the expiration is a hlc.Timestamp. The delta gives an approximate conversion.


pkg/sql/sqlliveness/slstorage/slstorage.go line 546 at r5 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

Is nil the right value to return if the expiration timestamp is in the past? I think Update is only called when heart beating a server's own session. So this case indicates the server's own session has expired.

Done.


pkg/sql/sqlliveness/slstorage/slstorage.go line 556 at r5 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

I don't think we need to set the transaction deadline here. It is valid to write a liveness row in an unavailable region as long as the expires_at timestamp is in the past, which would be true if the transaction gets pushed.

Done.

@fqazi fqazi force-pushed the regionLivenessForSqlSessionHeartbeat branch from 49c5303 to 068dff2 Compare January 25, 2024 15:17
@fqazi fqazi force-pushed the regionLivenessForSqlSessionHeartbeat branch 2 times, most recently from 2a8bffd to 1039d7c Compare January 25, 2024 16:36
@fqazi fqazi force-pushed the regionLivenessForSqlSessionHeartbeat branch from 1039d7c to b8d3e6c Compare January 25, 2024 22:07
Copy link
Collaborator Author

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, and @JeffSwenson)


pkg/sql/sqlliveness/slstorage/slstorage.go line 517 at r17 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

The implementation of the recovery protocol looks correct to me. But there is one potential availability issue in the implementation I noticed while reviewing the RFC.

The RFC contains this line:

If unvailable_at is not null, but set to a timestamp after the transaction timestamp, a SQL server can attempt to set it to null in a transaction modifying only the region_liveness table.

I think the reason I wrote that is I was assuming sql liveness would attempt to clear an unavailable_at set for its region before the region is treated as unavailable. Currently, if any server sets unavailable_at the sql servers will all restart even if their region becomes available before the sessions would expire.

I believe it is safe to clear an unavailable_at without deleting any of the survive zone rows as long as the transaction deadline comes before the unavailable_at timestamp.

This logic probably belongs in Storage.Update and should run whenever an unavailable at timestamp is observed and the sql server was able to read from the sqlliveness table.

Done.

@fqazi fqazi force-pushed the regionLivenessForSqlSessionHeartbeat branch 5 times, most recently from 890e3c3 to 99502e4 Compare January 30, 2024 04:59
@fqazi fqazi force-pushed the regionLivenessForSqlSessionHeartbeat branch from 99502e4 to daf362a Compare February 1, 2024 19:10
Copy link
Collaborator Author

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, and @JeffSwenson)


pkg/sql/sqlliveness/slstorage/slstorage.go line 545 at r36 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

I think the exp := expiration.GoTime(); exp.After(ts) check is a little too aggressive. The error case should only occur if the unavailable at timestamp comes before the transaction timestamp. If the requested expiration time comes after the unavailable time, then we can clamp the expiration time to the unavailable time.

Done.


pkg/sql/sqlliveness/slstorage/slstorage.go line 596 at r38 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

I think this operation doesn't actually care about the expiration timestamp. It only cares about the unavailable at instance. As long as the unavailable at instance is still in the future it can clear the row.

That also makes it easier to extract the logic into a helper in the region liveness package because using the unavailable_at instant as the deadline allows the reset logic to only cares about the state of the region liveness table.

Done.

To add support for being able to clamp SQL session renewals to the the
lifespan of a region, the region liveness prober interface needs to be
updated to support querying timestamps. This patch updates the region
liveness prober to support querying unavailable_at times add the ability
to retrieve timestamps.

Release note: None
Previously, sessions could be renewed beyond the regionliveness,
allowing them to stay alive during the renewal process. This meant these
sql sessions would not be brought down when the region was declared
dead. This patch adds code to clamp the renewal time to the
unavailable_at time when one is set.

Fixes: cockroachdb#113163
Release note: None
Previously, we didn't have a generic way to retrieve spans related to
regions in a system table. Instead of hand rolling or repeating code
across multiple tables, let's provide a utility for doing this, since
after recovering from a dead region, we will need to clean system tables
to remove existing rows.

Release note: None
Previously, once a region was marked as dead, we did not handle recovery
properly inside the SQL instance allocation logic. As a result, we would
attempt to bring up the region even though the session will be blocked
from renewing any further because of the unavailable_at timestamp in the
regionliveness table. This patch adds logic for cleaning up information
from when the region died to allow recovery by clearing sqlliveness,
sql_instances and regionliveness when a region is ready to recover.

Fixes: cockroachdb#115564, cockroachdb#115565
Release note: None
Previously, the initial scan used by the SQL instance range feed to
monitor changes was not region-liveness aware. As a result, if a region
were to have gone down, we could hang waiting for a reply from a region.
This patch makes the initial scan region aware and probes regions before
querying them.

Release note: None
Previously, we could not specify regions when creating tenants
inside roachtest. This patch adds a new option for selecting a region
during tenant creation.

Release note: None
Previously, we had no coverage for multi-region multi-tenant scenarios
on real clusters inside this roach test. This patch adds a new test with
this scenario, focusing on region failure and recovery to ensure that
region-liveness works correctly.

Release note: None
@fqazi fqazi force-pushed the regionLivenessForSqlSessionHeartbeat branch from daf362a to bacfa57 Compare February 1, 2024 20:16
Copy link
Collaborator

@JeffSwenson JeffSwenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 💯

@fqazi
Copy link
Collaborator Author

fqazi commented Feb 1, 2024

@JeffSwenson TFTR!

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 1, 2024

Build succeeded:

@craig craig bot merged commit 7b66dc4 into cockroachdb:master Feb 1, 2024
8 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants