
sql: rollback to savepoint broken in mixed-version (19.2/20.1) clusters #46372

Closed
andreimatei opened this issue Mar 20, 2020 · 1 comment · Fixed by #46415
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

andreimatei commented Mar 20, 2020

A 20.1 node rolling back to a savepoint doesn't affect request evaluation on 19.2 nodes, since 19.2 doesn't know about ignored sequence number ranges. This means that reads served by old nodes will still return values that should have been rolled back, and those values will also be committed.
It's not the highest-impact bug in the world, since it seems dubious to start using savepoints in a mixed-version cluster in the first place (some nodes would error out), but if you do start using them, the results are bad.

I guess we should gate execution of savepoints on a cluster version. One problem is with rolling back to the cockroach_restart savepoint, which, on 20.1 nodes, uses the same ignored-seqnum mechanism when the rollback happens from the Open state. That's a bit of an edge case, as one generally rolls back to cockroach_restart after a retriable error (so, not from the Open state).
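A minimal sketch of the gating idea, with hypothetical names (this is not CockroachDB's actual cluster-version API): rollback-to-savepoint is rejected until every node has been upgraded to a version that understands ignored seqnum ranges.

```go
package main

import (
	"errors"
	"fmt"
)

// clusterVersion models the minimum version all nodes have finalized.
// Names here are illustrative only.
type clusterVersion int

const (
	v192 clusterVersion = iota
	// v201Savepoints is the hypothetical version at which every node
	// understands ignored sequence number ranges.
	v201Savepoints
)

type cluster struct{ active clusterVersion }

// IsActive reports whether the whole cluster has upgraded to at least v.
func (c *cluster) IsActive(v clusterVersion) bool { return c.active >= v }

var errSavepointsUnsupported = errors.New(
	"ROLLBACK TO SAVEPOINT requires all nodes to be upgraded to 20.1")

// execRollbackToSavepoint gates the savepoint machinery on the cluster
// version, so a 19.2 node is never asked to honor ignored seqnum ranges.
func execRollbackToSavepoint(c *cluster, name string) error {
	if !c.IsActive(v201Savepoints) {
		return errSavepointsUnsupported
	}
	// ...record the ignored seqnum range and rewind the txn state...
	return nil
}

func main() {
	mixed := &cluster{active: v192}
	fmt.Println(execRollbackToSavepoint(mixed, "a")) // rejected in mixed cluster

	upgraded := &cluster{active: v201Savepoints}
	fmt.Println(execRollbackToSavepoint(upgraded, "a")) // allowed once finalized
}
```

With this shape, a mixed-version cluster errors out loudly at the savepoint statement instead of silently serving stale reads.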

Repro below. t is a table with its leaseholder on a 19.2 node; t2 has its leaseholder on a 20.1 node. After the rollback, writes to the latter go away, but writes to the former do not.

root@localhost:26259/defaultdb> begin;
BEGIN
root@localhost:26259/defaultdb  OPEN> savepoint a;
SAVEPOINT
root@localhost:26259/defaultdb  OPEN> insert into t(x) values (1);
INSERT 1
root@localhost:26259/defaultdb  OPEN> insert into t2(x) values (1);
INSERT 1
root@localhost:26259/defaultdb  OPEN> select * from t;
  x
-----
  1
(1 row)
root@localhost:26259/defaultdb  OPEN> select * from t2;
  x
-----
  1
(1 row)
root@localhost:26259/defaultdb  OPEN> rollback to savepoint a;
ROLLBACK
root@localhost:26259/defaultdb  OPEN> select * from t;
  x
-----
  1
(1 row)
root@localhost:26259/defaultdb  OPEN> select * from t2;
  x
-----
(0 rows)
@andreimatei andreimatei added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 20, 2020
@andreimatei andreimatei self-assigned this Mar 20, 2020
knz commented Mar 20, 2020

oh I know what to do with this

@knz knz self-assigned this Mar 21, 2020
craig bot pushed a commit that referenced this issue Mar 23, 2020
46391: kv/concurrency: never regress timestamp in lockTable on lock acquisition r=sumeerbhola a=nvanbenschoten

Fixes #46290.
Fixes #43735.
Fixes #41425.
Fixes #43809.

This change adjusts the lockTable to be more lenient to the timestamp of new lock acquisitions. Specifically, it lifts the restriction that all calls to AcquireLock use monotonically increasing timestamps. Instead, it properly handles apparent timestamp regressions by ratcheting lock timestamps instead of replacing them directly.

This matches the corresponding behavior in MVCC: https://github.com/cockroachdb/cockroach/blob/92107b551bbafe54fddb496442c590cb6feb5d65/pkg/storage/mvcc.go#L1631
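A minimal sketch of the ratcheting behavior, with illustrative types (not the actual lockTable code): on re-acquisition, the lock's timestamp only ever advances, so an apparent regression is a no-op rather than an assertion failure.

```go
package main

import "fmt"

// timestamp is a simplified stand-in for an MVCC timestamp.
type timestamp struct{ wallTime int64 }

func (t timestamp) Less(o timestamp) bool { return t.wallTime < o.wallTime }

type lock struct{ ts timestamp }

// acquire ratchets the lock timestamp forward instead of replacing it,
// so re-acquiring at an apparently regressed timestamp (e.g. epoch 2 at
// ts 15 after a push to ts 20) never moves the lock backwards.
func (l *lock) acquire(ts timestamp) {
	if l.ts.Less(ts) {
		l.ts = ts
	}
	// Otherwise keep the existing, higher timestamp: no regression,
	// no assertion.
}

func main() {
	l := &lock{}
	l.acquire(timestamp{10}) // txn A acquires at epoch 1, ts 10
	l.acquire(timestamp{20}) // txn B pushes A and updates the lock to ts 20
	l.acquire(timestamp{15}) // txn A re-acquires at epoch 2, ts 15
	fmt.Println(l.ts.wallTime) // 20: the timestamp stayed ratcheted
}
```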

This leniency is needed for sequences of events like the following:
- txn A acquires lock at epoch 1, ts 10
- txn B pushes txn A to ts 20
- txn B updates lock to ts 20
- txn A restarts at ts 15 without noticing that it has been pushed
- txn A re-acquires lock at epoch 2, ts 15
- we hit the lock timestamp regression assertion

We see this frequently in CDC roachtests because the rangefeed processor performs high-priority timestamp pushes on long-running transactions. Outside of CDC, this is rare in our system.

Release note (bug fix): CDC no longer combines with long running transactions to trigger an assertion.

Release justification: fixes a high-priority bug in existing functionality. The bug could crash a server if the right sequence of events occurred. This was typically rare, but was much more common when CDC was in use. This is a fairly large PR, but there's only a single line of non-testing code touched.

46415: sql: misc savepoint fixes r=andreimatei a=knz

Fixes #46372 

Release justification: bug fixes and low-risk updates to new functionality

These commits correct a few gaps in the recent implementation.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@craig craig bot closed this as completed in #46415 Mar 23, 2020