sql: make sql liveness and descriptor leasing able to survive region failure while only SURVIVE ZONE. #103727

Closed
chengxiong-ruan opened this issue May 22, 2023 · 3 comments
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@chengxiong-ruan
Contributor

chengxiong-ruan commented May 22, 2023

Is your feature request related to a problem? Please describe.
Currently, Serverless needs to choose between configuring the system database as SURVIVE REGION or SURVIVE ZONE. Configuring the database as SURVIVE REGION is optimal for availability, whereas configuring the database as SURVIVE ZONE is optimal for cold start performance.

For most tables in the system database, write performance is not important for cold start performance. The only tables that need fast writes are system.sqlliveness, system.sql_instances, and system.lease. These tables are already partitioned as RBR tables. The high level idea is to configure these tables as SURVIVE ZONE and configure all other tables as SURVIVE REGION.

Actually making this work will require changes to the liveness and lease subsystems to handle region failure. One possible approach is to introduce a global “region_liveness” table. If a region is unable to read from another region’s RBR data, it would mark that region as unreachable in the region_liveness table. After enough time has passed to ensure all leases in the region have expired, the region is marked as down. If a SQL server wants to extend leases, it would read from the region_liveness table in the transaction that extends the lease. The server is only allowed to extend its leases if its region is marked as available.
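To make the proposal above concrete, here is a minimal in-memory sketch of the three region states (available, unreachable, down) and of a lease extension that checks region liveness first. It is only an illustration of the idea as described in this issue, not the actual implementation: `RegionLivenessTable`, `max_lease_duration`, `unreachable_since`, and `try_extend_lease` are hypothetical names, and a real version would do the read and the extension inside one SQL transaction.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RegionLivenessTable:
    """In-memory stand-in for the proposed global region_liveness table."""

    # Hypothetical upper bound (in seconds) on how long any lease in a region can live.
    max_lease_duration: float
    # Region name -> time at which the region was first marked unreachable.
    unreachable_since: Dict[str, float] = field(default_factory=dict)

    def mark_unreachable(self, region: str, now: float) -> None:
        # Keep the earliest mark so repeated reports do not push the deadline out.
        self.unreachable_since.setdefault(region, now)

    def state(self, region: str, now: float) -> str:
        since = self.unreachable_since.get(region)
        if since is None:
            return "available"
        # Once every lease that could have existed at marking time has
        # expired, the region is treated as down.
        if now >= since + self.max_lease_duration:
            return "down"
        return "unreachable"


def try_extend_lease(
    table: RegionLivenessTable,
    region: str,
    now: float,
    current_expiration: float,
    extension: float,
) -> float:
    """Model of the lease-extension step: read region liveness, then extend.

    Returns the new expiration, or the unchanged one if the extension is refused.
    """
    if table.state(region, now) != "available":
        return current_expiration
    return now + extension
```

Because servers in an unreachable region can no longer extend leases, waiting `max_lease_duration` after the mark is enough, in this model, to know that no lease from that region is still valid.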

Jira issue: CRDB-28158

Epic CC-24173

@chengxiong-ruan chengxiong-ruan added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label May 22, 2023
@chengxiong-ruan chengxiong-ruan self-assigned this May 22, 2023
@chengxiong-ruan chengxiong-ruan added this to Triage in SQL Foundations via automation May 22, 2023
@blathers-crl blathers-crl bot added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label May 22, 2023
@chengxiong-ruan chengxiong-ruan moved this from Triage to Backlog in SQL Foundations May 26, 2023
fqazi added a commit to fqazi/cockroach that referenced this issue Aug 3, 2023
Previously, we had no way of tracking region liveness,
which can be used to improve behaviour on region
failures. To address this, this patch introduces the
system.region_liveness table.

Epic: CRDB-28158

Informs: cockroachdb#103727

Release note: None
fqazi added a commit to fqazi/cockroach that referenced this issue Aug 9, 2023
Previously, we had no way of tracking region liveness,
which can be used to improve behaviour on region
failures. To address this, this patch introduces the
system.region_liveness table.

Epic: CRDB-28158

Informs: cockroachdb#103727

Release note: None
fqazi added a commit to fqazi/cockroach that referenced this issue Aug 17, 2023
Previously, we had no way of tracking region liveness,
which can be used to improve behaviour on region
failures. To address this, this patch introduces the
system.region_liveness table.

Epic: CRDB-28158

Informs: cockroachdb#103727

Release note: None
@rmloveland
Collaborator

@fqazi I'm assigned to work on docs for this epic (via DOC-8234)

I built cockroach from the master branch and am seeing the system.public.region_liveness table which was added in your commit here

show columns from system.public.region_liveness;
   column_name   | data_type | is_nullable | column_default | generation_expression |        indices         | is_hidden
-----------------+-----------+-------------+----------------+-----------------------+------------------------+------------
  crdb_region    | BYTES     |      f      | NULL           |                       | {region_liveness_pkey} |     f
  unavailable_at | TIMESTAMP |      t      | NULL           |                       | {region_liveness_pkey} |     f
(2 rows)

Based on reading the issue description above and looking at some of the other tables in system.public, it seems like this new region_liveness system table is not a user-facing feature and will only be used as part of the setup when other system tables are set as REGIONAL BY ROW during some of our multi-region cloud deployments. Is that correct?

(I also assumed this is not user-facing because in your commit you wrote "Release note: None", but I am confirming since I'm assigned a doc task for the epic)

@fqazi
Collaborator

fqazi commented Sep 26, 2023

@rmloveland Right, this PR does not introduce any user-facing change yet.

@rmloveland
Collaborator

@fqazi thanks for confirming! I will likely close DOC-8234 then

craig bot pushed a commit that referenced this issue Feb 1, 2024
116784: regionliveness: add support for sqlinstances and recovery r=fqazi a=fqazi

This PR completes the regionliveness survivability goal work by implementing the following:

1. Logic to bound the sqlliveness renewals based on the unavailable_at time set on a region.
2. Updating the sqlinstance allocation logic to take into account regionliveness by working on live regions only.
3. Updating the change feed initial scan on sqlinstances to work on a per-region basis and consider region liveness.
4. Updating the sqlinstance allocation logic to recover from region failures by cleaning up system.lease, system.sql_instances, and system.sqlliveness after a failure.
5. A new roachtest focused on setting up a physical cluster and simulating failure scenarios for nightly builds.
6. A synthetic test for simulating region failures and recovery from them.
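One way to read item 1 is that a session renewal may never push the expiration past the region's unavailable_at timestamp, when one is set. The sketch below shows that reading only; it is an assumption, not code from the PR. `bounded_expiration` and its parameters are hypothetical names, and `None` stands in for the NULL value that the nullable unavailable_at column allows.

```python
from typing import Optional


def bounded_expiration(
    now: float, ttl: float, unavailable_at: Optional[float]
) -> float:
    """Return the expiration for a renewed session, capped by unavailable_at.

    Assumption: unavailable_at is None while the region is considered live.
    """
    proposed = now + ttl
    if unavailable_at is None:
        # Region is live: a normal renewal for the full TTL.
        return proposed
    # Region has a pending unavailable_at: never extend beyond it.
    return min(proposed, unavailable_at)
```

Under this reading, every session in a region is guaranteed to have expired by unavailable_at, which is what would let other regions safely clean up after it.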

This PR is stacked on top of #115568, so the first 4 commits should be ignored.

informs: #103727
EPIC: CC-24173

118578: roachtest: fix typo in query_comparison_util.go r=mgartner a=mgartner

Epic: None

Release note: None

Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
SQL Foundations automation moved this from Backlog to Done Feb 5, 2024