sql: make sql liveness and descriptor leasing able to survive region failure while only SURVIVE ZONE. #103727
Comments
Previously, we had no way of tracking region liveness, which can be used to improve behaviour on region failures. To address this, this patch introduces the system.region_liveness table. Epic: CRDB-28158 Informs: cockroachdb#103727 Release note: None
@fqazi I'm assigned to work on docs for this epic (via DOC-8234) I built
Based on reading the issue description above and looking at some of the other tables in (I also assumed this is not user-facing because in your commit you wrote "Release note: None", but I am confirming since I'm assigned a doc task for the epic)
@rmloveland Right, no user-facing change is visible by this PR yet.
116784: regionliveness: add support for sqlinstances and recovery r=fqazi a=fqazi
This PR completes the regionliveness survivability goal work by implementing the following:
1. Logic to bound the sqlliveness renewals based on the unavailable_at time set on a region.
2. Updating the sqlinstance allocation logic to take regionliveness into account by working on live regions only.
3. Updating the changefeed initial scan on sqlinstances to work on a per-region basis and consider region liveness.
4. Updating the sqlinstance allocation logic to recover from region failures by cleaning up system.leases, system.sqlinstances, and system.sqlliveness after a failure.
5. A new roachtest focused on setting up a physical cluster and simulating failure scenarios for nightly builds.
6. A synthetic test for simulating region failures and recovery from them.
This PR is stacked on top of: #115568, so the first 4 commits should be ignored. informs: #103727 EPIC: CC-24173
118578: roachtest: fix typo in query_comparison_util.go r=mgartner a=mgartner
Epic: None Release note: None
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com> Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Is your feature request related to a problem? Please describe.
Currently, Serverless needs to choose between configuring the system database as SURVIVE REGION or SURVIVE ZONE. Configuring the database as SURVIVE REGION is optimal for availability, whereas configuring the database as SURVIVE ZONE is optimal for cold start performance.
For most tables in the system database, write performance is not important for cold start performance. The only tables that need fast writes are system.sqlliveness, system.sql_instances, and system.lease. These tables are already partitioned as RBR tables. The high-level idea is to configure these tables as SURVIVE ZONE and configure all other tables as SURVIVE REGION.
Actually making this work will require changes to the liveness and lease subsystems to handle region failure. One possible approach is to introduce a global “region_liveness” table. If a region is unable to read from another region’s RBR data, it would mark that region as unreachable in the region_liveness table. After enough time has passed to ensure all leases in the region have expired, the region is marked as down. If a SQL server wants to extend leases, it would read from the region_liveness table in the same transaction that extends the lease. The server is only allowed to extend its leases if its region is marked as available.
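The protocol described above can be sketched with a small in-memory model. This is only an illustration of the idea, not CockroachDB code: the `RegionLiveness` class, its method names, and the single `lease_ttl` bound are all assumptions made for the example, and the real design would perform these checks inside SQL transactions against the global table.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RegionLiveness:
    """Hypothetical in-memory stand-in for the proposed global region_liveness table."""

    # Assumed upper bound (seconds) on how long any lease in a region stays valid.
    lease_ttl: float
    # region -> time after which the region may be treated as down.
    # A region with no entry is considered available.
    unavailable_at: Dict[str, float] = field(default_factory=dict)

    def mark_unreachable(self, region: str, now: float) -> None:
        # Called by a server that failed to read the region's RBR data.
        # Keep the earliest deadline so repeated reports never postpone it.
        deadline = now + self.lease_ttl
        current = self.unavailable_at.get(region, deadline)
        self.unavailable_at[region] = min(current, deadline)

    def is_down(self, region: str, now: float) -> bool:
        # A region counts as down only once every lease it could have
        # extended before being marked unreachable has expired.
        deadline = self.unavailable_at.get(region)
        return deadline is not None and now >= deadline

    def try_extend_lease(self, region: str, now: float) -> Optional[float]:
        # Models reading region_liveness in the same transaction that
        # extends the lease: extension is refused unless the region is
        # still marked as available.
        if region in self.unavailable_at:
            return None
        return now + self.lease_ttl


table = RegionLiveness(lease_ttl=30.0)
first_expiry = table.try_extend_lease("us-east1", now=0.0)   # allowed
table.mark_unreachable("us-east1", now=10.0)
refused = table.try_extend_lease("us-east1", now=11.0)       # refused
down_early = table.is_down("us-east1", now=39.0)
down_late = table.is_down("us-east1", now=40.0)
```

In this model, a lease extended just before the region is marked unreachable expires no later than the recorded deadline, which is why waiting until that deadline is enough before treating the region as down.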
Jira issue: CRDB-28158
Epic CC-24173