sqlliveness: adopt regionliveness when querying if a session is alive #115568
pkg/sql/regionliveness/prober.go
  defaultTTL := slbase.DefaultTTL.Get(&l.settings.SV)
  defaultHeartbeat := slbase.DefaultHeartBeat.Get(&l.settings.SV)
  // Get the read timestamp and pick a commit deadline.
- readTS := txn.KV().ReadTimestamp().AddDuration(defaultHeartbeat)
+ readTS := txn.ReadTimestamp().AddDuration(defaultHeartbeat)
nit: I think you should rename readTS to commitDeadline, since the read timestamp can vary but is allowed to be anything less than the commit deadline.
pkg/sql/regionliveness/prober.go
  }
- return txn.KV().UpdateDeadline(ctx, readTS)
+ return txn.UpdateDeadline(ctx, readTS)
Do we need to defer the UpdateDeadline call to the end of the transaction? Since we are computing the commitDeadline ourselves, it seems like we should be able to specify the deadline earlier in the transaction and remove the explicit ReadTimestamp.After() guard.
@@ -37,21 +37,21 @@ type rbrEncoder struct {
  rbrIndex roachpb.Key
  }

- func (e *rbrEncoder) encode(session sqlliveness.SessionID) (roachpb.Key, error) {
+ func (e *rbrEncoder) encode(session sqlliveness.SessionID) (roachpb.Key, []byte, error) {
encode should not return the []byte slice representing the region. The region bytes point at memory that is owned by the string, and since []byte is mutable it would be easy to accidentally violate the Go memory model.
Since you want the value as a string I think it is possible to avoid allocation here by making a safe version of DecodeSessionID that extracts a substring from the session ID. The encode function could use the UnsafeBytes helper to avoid allocating memory for the region.
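A minimal sketch of the substring approach, under stated assumptions: the session ID layout used here (`[version byte][region length byte][region bytes][16-byte UUID]`) is assumed for illustration, and `SessionID` and `regionOf` are hypothetical stand-ins rather than the actual sqlliveness API. Slicing a string yields another immutable string that shares the same backing memory, so no allocation is needed and callers cannot mutate the bytes.

```go
package main

import (
	"errors"
	"fmt"
)

// SessionID mirrors the idea of sqlliveness.SessionID: an immutable string
// holding encoded bytes. The layout is an assumption made for this sketch.
type SessionID string

const uuidLen = 16

// regionOf returns the region portion of the session ID as a string.
// The result is a substring, so it shares memory with id but stays immutable,
// unlike a []byte view over the same memory.
func regionOf(id SessionID) (string, error) {
	if len(id) < 2 {
		return "", errors.New("session ID too short")
	}
	regionLen := int(id[1])
	if len(id) != 2+regionLen+uuidLen {
		return "", errors.New("session ID has unexpected length")
	}
	// Converting between string types does not copy the bytes.
	return string(id[2 : 2+regionLen]), nil
}

func main() {
	// Version 1, a 3-byte region, then 16 placeholder UUID bytes.
	id := SessionID(string([]byte{1, 3}) + "abc" + "0123456789abcdef")
	region, err := regionOf(id)
	fmt.Printf("%q %v\n", region, err)
}
```

Returning a string keeps the immutability guarantee at the type level; a caller that needs bytes can copy explicitly and pay for the allocation knowingly.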
if err != nil &&
regionliveness.IsQueryTimeoutErr(err) {
probeErr := livenessProber.ProbeLivenessWithPhysicalRegion(ctx, regionPhysicalRep)
Is it safe to call ProbeLivenessWithPhysicalRegion inside of a txn callback?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @fqazi, @mgartner, and @msbutler)
pkg/ccl/multiregionccl/regionliveness_test.go
line 280 at r4 (raw file):
// will be aware (i.e. session ID and SQL instance will not be MR aware).
if i == 0 {
require.NoError(t, err)
nit: what error is this asserting on?
For the sqlliveness package to adopt the regionliveness package, we need to eliminate a circular dependency between the two. Currently, regionliveness needs access to the default TTL and heartbeat settings inside sqlliveness. This patch eliminates the cycle by moving these settings into a subpackage called slbase.
Release note: None
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @JeffSwenson, @mgartner, and @msbutler)
pkg/sql/regionliveness/prober.go
line 201 at r2 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
Do we need to defer the UpdateDeadline call to the end of the transaction? Since we are computing the commitDeadline ourselves, it seems like we should be able to specify the deadline earlier in the transaction and remove the explicit ReadTimestamp.After() guard.
Done.
pkg/sql/sqlliveness/slstorage/key_encoder.go
line 40 at r3 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
encode should not return the []byte slice representing the region. The region bytes point at memory that is owned by the string, and since []byte is mutable it would be easy to accidentally violate the Go memory model.
Since you want the value as a string I think it is possible to avoid allocation here by making a safe version of DecodeSessionID that extracts a substring from the session ID. The encode function could use the UnsafeBytes helper to avoid allocating memory for the region.
Done.
pkg/sql/sqlliveness/slstorage/slstorage.go
line 338 at r3 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
Is it safe to call ProbeLivenessWithPhysicalRegion inside of a txn callback?
Done.
Good point. I moved the probe outside the transaction, so it now runs after the transaction has returned the error.
Previously, the region liveness interfaces used internal executors for reading and writing to the region_liveness table. This was effective in cases where we had access to internal executors, like leasing, but broke down for lower-level code that needs to adopt these concepts, like sqlliveness. This patch moves the region liveness logic to use the KV API to encode and decode rows as required.
Release note: None
Previously, sqlliveness would probe the system.sqlliveness table but had no logic to check for region liveness, which meant it risked getting stuck on dead regions if the system database was moved to SURVIVE REGION FAILURE. This patch adopts the regionliveness subsystem so that these queries have timeouts and probe for dead regions. This allows subsystems that rely on IsAlive, like jobs, to take advantage of region liveness.
Fixes: cockroachdb#115563
Release note: None
Previously, the regionliveness_test set up all the tenants and then added the regions on the system database. Unfortunately, specific subsystems like sqlliveness need the region enum to be configured *before* a new tenant is added. This patch changes the order of operations so that the enum is updated first.
Release note: None
LGTM
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @msbutler and @rafiss)
@JeffSwenson TFTR!
bors r+
Build succeeded:
116784: regionliveness: add support for sqlinstances and recovery r=fqazi a=fqazi

This PR completes the regionliveness survivability goal work by implementing the following:
1. Logic to bound the sqlliveness renewals based on the unavailable_at time set on a region.
2. Updating the sqlinstance allocation logic to take regionliveness into account by working on live regions only.
3. Updating the change feed initial scan on sqlinstances to work on a per-region basis and consider region liveness.
4. Updating the sqlinstance allocation logic to recover from region failures by cleaning up system.leases, system.sqlinstances, and system.sqlliveness after a failure.
5. A new roachtest focused on setting up a physical cluster and simulating failure scenarios for nightly builds.
6. A synthetic test for simulating region failures and recovery from them.

This PR is stacked on top of #115568, so the first 4 commits should be ignored.

informs: #103727
EPIC: CC-24173

118578: roachtest: fix typo in query_comparison_util.go r=mgartner a=mgartner

Epic: None
Release note: None

Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
This PR starts querying the regionliveness table as part of the protocol that confirms whether a given sqlliveness session IsAlive. As a side effect of this change, the jobs subsystem and other infrastructure will be able to take regionliveness into account. To allow this change to happen, the following commits are included in this PR:
Informs: #115563