Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlliveness: adopt regionliveness when querying if a session is alive #115568

Merged
merged 4 commits into from
Jan 30, 2024

Conversation

fqazi
Copy link
Collaborator

@fqazi fqazi commented Dec 4, 2023

This PR will start query the regionliveness table as a part of the protocol to confirm if a given sqlliveness session IsAlive. As a side effect of this change the jobs subsystem and other infrastructure will be able take into account regionliveness. To allow this changed to happen the following commits are included in this PR

  1. Update sqlliveness so that the heartbeat / TTL settings are moved into their out subpackage, avoiding circular dependencies with regionliveness
  2. regionliveness is refactored to issue direct KV calls so that lower level code can utilize this infrastructure
  3. The sqlliveness logic for IsAlive now takes into account information from the regionliveness table for both probing and detecting if a session should be dead.
  4. Some of our tests for regionliveness did not configure the system database appropriately, so the region information was not correctly propagated to the sqlliveness table.

Informs: #115563

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@fqazi fqazi force-pushed the regionLivenessForSqlSessions branch 3 times, most recently from 4c50f01 to 1c485f7 Compare December 5, 2023 05:05
@fqazi fqazi marked this pull request as ready for review December 5, 2023 05:05
@fqazi fqazi requested review from a team as code owners December 5, 2023 05:05
@fqazi fqazi requested review from mgartner and msbutler and removed request for a team December 5, 2023 05:05
@fqazi fqazi force-pushed the regionLivenessForSqlSessions branch from 1c485f7 to fffb1a0 Compare December 5, 2023 16:55
@fqazi fqazi force-pushed the regionLivenessForSqlSessions branch from fffb1a0 to e66131a Compare December 18, 2023 21:38
@fqazi fqazi force-pushed the regionLivenessForSqlSessions branch from e66131a to a7e6f27 Compare December 19, 2023 19:00
defaultTTL := slbase.DefaultTTL.Get(&l.settings.SV)
defaultHeartbeat := slbase.DefaultHeartBeat.Get(&l.settings.SV)
// Get the read timestamp and pick a commit deadline.
readTS := txn.KV().ReadTimestamp().AddDuration(defaultHeartbeat)
readTS := txn.ReadTimestamp().AddDuration(defaultHeartbeat)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think you should rename readTS to commitDeadline. Since the readTS is variable but is allowed to be anything less than the commit deadline.

}
return txn.KV().UpdateDeadline(ctx, readTS)
return txn.UpdateDeadline(ctx, readTS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to defer setting UpdateDealdine to the end of the transaction? Since we are computing the commitDeadline it seems like we should be able to specify the deadline earlier in the transaction and remove the explicit ReadTimestamp.After() guard.

@@ -37,21 +37,21 @@ type rbrEncoder struct {
rbrIndex roachpb.Key
}

func (e *rbrEncoder) encode(session sqlliveness.SessionID) (roachpb.Key, error) {
func (e *rbrEncoder) encode(session sqlliveness.SessionID) (roachpb.Key, []byte, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encode should not return the []byte slice representing the region. The region bytes are pointing at memory that is owned by the string and since []byte is mutable it would be easy to accidentally violate the go memory model.

Since you want the value as a string I think it is possible to avoid allocation here by making a safe version of DecodeSessionID that extracts a substring from the session ID. The encode function could use the UnsafeBytes helper to avoid allocating memory for the region.


if err != nil &&
regionliveness.IsQueryTimeoutErr(err) {
probeErr := livenessProber.ProbeLivenessWithPhysicalRegion(ctx, regionPhysicalRep)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to call ProbeLivenessWithPhysicalRegions inside of a txn callback?

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi, @mgartner, and @msbutler)


pkg/ccl/multiregionccl/regionliveness_test.go line 280 at r4 (raw file):

		// will be aware (i.e. session ID and SQL instance will not be MR aware).
		if i == 0 {
			require.NoError(t, err)

nit: what error is this asserting on?

For the sqlliveness package to adopt the regionliveness package, we need
to eliminate a circular dependency between the two.  Currently, the
region liveness requires to access the default TTL and heartbeat
settings inside sqlliveness. To address this, this patch will eliminate
this cycle by moving these settings into a subpackage called slbase.

Release note: None
@fqazi fqazi force-pushed the regionLivenessForSqlSessions branch from a7e6f27 to 59b9e00 Compare January 26, 2024 03:11
Copy link
Collaborator Author

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @JeffSwenson, @mgartner, and @msbutler)


pkg/sql/regionliveness/prober.go line 201 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

Do we need to defer setting UpdateDealdine to the end of the transaction? Since we are computing the commitDeadline it seems like we should be able to specify the deadline earlier in the transaction and remove the explicit ReadTimestamp.After() guard.

Done.


pkg/sql/sqlliveness/slstorage/key_encoder.go line 40 at r3 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

encode should not return the []byte slice representing the region. The region bytes are pointing at memory that is owned by the string and since []byte is mutable it would be easy to accidentally violate the go memory model.

Since you want the value as a string I think it is possible to avoid allocation here by making a safe version of DecodeSessionID that extracts a substring from the session ID. The encode function could use the UnsafeBytes helper to avoid allocating memory for the region.

Done.


pkg/sql/sqlliveness/slstorage/slstorage.go line 338 at r3 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

Is it safe to call ProbeLivenessWithPhysicalRegions inside of a txn callback?

Done.

Good point, and I moved this outside the transaction after the error hit.

@fqazi fqazi force-pushed the regionLivenessForSqlSessions branch 2 times, most recently from d4c6829 to 3d9a772 Compare January 26, 2024 04:32
Previously, the region liveness interfaces used internal executors for
reading and writing to the region_livness table. This was effective in
cases where we had access to internal executors like leasing but broke
down for lower level code that needs to adopt these concepts like
sqlliveness. This patch, moves the region liveness logic to use KV API
to encode / decode rows as required.

Release note: None
Previously, sqlliveness would probe the system.sqlliveness table, but
had no logic to check for region liveness. Which meant it ran the risk
of potentially getting stuck on dead regions if the system database is
moved to SURVIVE REGION. This patch adopts the regionliveness subsystem
so that these queries have timeouts and probe for dead regions. This
allows subsystems that rely on IsAlive like jobs to take advantage of
region liveness.

Fixes: cockroachdb#115563
Release note: None
Previously, the regionliveness_test set up all the tenants and then
added the regions on the system database. Unfortuantely, specific
subsystems like sqlliveness need the region enum to be configured
*before* a new tenant is added. This patch, changes the order of
operations so that the enum is updated first.

Release note: None
@fqazi fqazi force-pushed the regionLivenessForSqlSessions branch from 3d9a772 to e53bda2 Compare January 26, 2024 05:30
@mgartner mgartner removed their request for review January 26, 2024 22:13
Copy link
Collaborator

@JeffSwenson JeffSwenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @msbutler and @rafiss)

@fqazi
Copy link
Collaborator Author

fqazi commented Jan 29, 2024

@JeffSwenson TFTR!

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 30, 2024

Build succeeded:

@craig craig bot merged commit 70150a6 into cockroachdb:master Jan 30, 2024
8 of 9 checks passed
craig bot pushed a commit that referenced this pull request Feb 1, 2024
116784: regionliveness: add support for sqlinstances and recovery r=fqazi a=fqazi

This PR completes the regionliveness survivability goal work by implementing the following:

1. Logic to bound the sqlliveness reviewals based on the unavailable_at time set on a region
2. Updating the sqlinstance allocation logic to take into account regionliveness by working on live regions only.
3. Updating the change feed initial scan on sqlinstances to work on a per-region basis and consider region liveness.
4. Updating the sqlinstance allocation logic to recover from region failures by cleaning up system.leases, system.sqlinstances, and system.sqllivness after a failure.
5. A new roachtest focused on setting up a physical cluster and simulating failure scenarios for nightly builds.
6. A synthetic test for simulating region failures and recovery from them

This PR is stacked on top of:  #115568, so the first 4 commits should be ignored.

informs: #103727
EPIC: CC-24173

118578: roachtest: fix typo in query_comparison_util.go r=mgartner a=mgartner

Epic: None

Release note: None

Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants