SELECT FOR UPDATE SKIP LOCKED skips unlocked rows sometimes? #121917

Open
Tracked by #114737
asg0451 opened this issue Apr 8, 2024 · 27 comments · Fixed by #127718
Labels
A-read-committed, A-sql-optimizer, branch-release-23.2, branch-release-24.1, branch-release-24.2, C-performance, O-community, T-sql-queries, X-blathers-triaged

Comments

@asg0451
Contributor

asg0451 commented Apr 8, 2024

Describe the problem

I'm working on a job queue system backed by CRDB and I've run into a situation I don't understand. I'm hoping someone can shed some light on it for me.

To Reproduce

  1. Set up a CRDB Serverless cluster.
  2. Clone the repro code in this repo.
  3. Run with DATABASE_URL="<db conn str>" NUM_WORKERS=2 go run .

The program sets up a jobs table (see schema.sql), seeds it with a few jobs, then runs some workers that poll for work using SELECT FOR UPDATE SKIP LOCKED (the query is linked in the repo; a sketch follows below).
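
For reference, the polling query (as quoted later in this thread by @michae2) is essentially the following. This is a sketch; the column names come from schema.sql:

SELECT
  id, name, created_at, completed_at, completed_by
FROM
  jobs
WHERE
  completed_at IS NULL
LIMIT
  1
FOR UPDATE SKIP LOCKED;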

Expected behavior

I expect each worker to grab a job, "process" it, then grab a new one, until no jobs remain. However, observing the log output, you can see that workers frequently fail to find jobs. In fact, it seems like after the first job each worker gets, it fails to see a new job while the other is working on one. Why is this?

Example run (annotated):

time=2024-04-08T00:29:46.725-04:00 level=DEBUG msg="running schema.sql"
time=2024-04-08T00:29:47.175-04:00 level=DEBUG msg="resetting jobs"
time=2024-04-08T00:29:48.629-04:00 level=INFO msg="job found" worker=1 job="{String:a Valid:true}"
time=2024-04-08T00:29:48.673-04:00 level=INFO msg="job found" worker=0 job="{String:b Valid:true}" // See here they both got jobs
time=2024-04-08T00:29:53.888-04:00 level=INFO msg="job finished" worker=0 job="{ID:{Bytes:[42 4 203 147 205 42 74 144 146 202 156 73 71 160 90 207] Valid:true} Name:{String:b Valid:true} CreatedAt:{Time:2024-04-08 04:29:47.29262 +0000 UTC InfinityModifier:finite Valid:true} CompletedAt:{Time:2024-04-08 04:29:48.443138 +0000 UTC InfinityModifier:finite Valid:true} CompletedBy:{String:cc/crdbtestjobs 0 Valid:true}}"
time=2024-04-08T00:29:53.888-04:00 level=INFO msg="job finished" worker=1 job="{ID:{Bytes:[24 96 114 86 121 83 76 129 168 179 161 17 215 75 124 18] Valid:true} Name:{String:a Valid:true} CreatedAt:{Time:2024-04-08 04:29:47.29262 +0000 UTC InfinityModifier:finite Valid:true} CompletedAt:{Time:2024-04-08 04:29:48.422372 +0000 UTC InfinityModifier:finite Valid:true} CompletedBy:{String:cc/crdbtestjobs 1 Valid:true}}"
time=2024-04-08T00:29:54.371-04:00 level=INFO msg="no job found" worker=0  // but now worker 0 can't find any jobs while 1 is working on its job
time=2024-04-08T00:29:54.371-04:00 level=INFO msg="job found" worker=1 job="{String:d Valid:true}"
time=2024-04-08T00:29:54.848-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:29:55.349-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:29:55.800-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:29:56.251-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:29:56.739-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:29:57.176-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:29:57.656-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:29:58.117-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:29:58.586-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:29:59.029-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:29:59.475-04:00 level=INFO msg="job finished" worker=1 job="{ID:{Bytes:[98 32 63 71 154 255 64 96 191 82 227 157 177 117 4 16] Valid:true} Name:{String:d Valid:true} CreatedAt:{Time:2024-04-08 04:29:47.29262 +0000 UTC InfinityModifier:finite Valid:true} CompletedAt:{Time:2024-04-08 04:29:54.199079 +0000 UTC InfinityModifier:finite Valid:true} CompletedBy:{String:cc/crdbtestjobs 1 Valid:true}}"
time=2024-04-08T00:29:59.484-04:00 level=INFO msg="no job found" worker=0 // ditto
time=2024-04-08T00:29:59.938-04:00 level=INFO msg="job found" worker=1 job="{String:c Valid:true}"
time=2024-04-08T00:29:59.939-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:30:00.412-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:30:00.875-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:30:01.346-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:30:01.799-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:30:02.235-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:30:02.745-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:30:03.181-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:30:03.655-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:30:04.124-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:30:04.538-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:30:04.969-04:00 level=INFO msg="no job found" worker=0
time=2024-04-08T00:30:05.029-04:00 level=INFO msg="job finished" worker=1 job="{ID:{Bytes:[143 160 240 141 130 52 77 159 151 19 223 198 14 146 153 33] Valid:true} Name:{String:c Valid:true} CreatedAt:{Time:2024-04-08 04:29:47.29262 +0000 UTC InfinityModifier:finite Valid:true} CompletedAt:{Time:2024-04-08 04:29:59.789133 +0000 UTC InfinityModifier:finite Valid:true} CompletedBy:{String:cc/crdbtestjobs 1 Valid:true}}"
time=2024-04-08T00:30:05.400-04:00 level=INFO msg="job found" worker=0 job="{String:f Valid:true}" // worker 0 wins the race this time and now worker 1 can't find a job!
time=2024-04-08T00:30:05.431-04:00 level=INFO msg="no job found" worker=1
time=2024-04-08T00:30:05.850-04:00 level=INFO msg="no job found" worker=1
time=2024-04-08T00:30:06.246-04:00 level=INFO msg="no job found" worker=1
time=2024-04-08T00:30:06.655-04:00 level=INFO msg="no job found" worker=1
(ctrl-c for brevity)

Contrast this with postgres's behaviour under the same code:

time=2024-04-08T02:42:40.602-04:00 level=DEBUG msg="running schema.sql"
time=2024-04-08T02:42:41.026-04:00 level=DEBUG msg="resetting jobs"
time=2024-04-08T02:42:42.392-04:00 level=INFO msg="job found" worker=2 job="{String:a Valid:true}"
time=2024-04-08T02:42:42.392-04:00 level=INFO msg="job found" worker=0 job="{String:c Valid:true}"
time=2024-04-08T02:42:42.392-04:00 level=INFO msg="job found" worker=1 job="{String:b Valid:true}"
time=2024-04-08T02:42:47.547-04:00 level=INFO msg="job finished" worker=2 job="{ID:{Bytes:[123 153 228 234 250 118 67 31 188 172 62 47 39 114 235 127] Valid:true} Name:{String:a Valid:true} CreatedAt:{Time:2024-04-07 23:42:41.218606 +0000 UTC InfinityModifier:finite Valid:true} CompletedAt:{Time:2024-04-07 23:42:42.258571 +0000 UTC InfinityModifier:finite Valid:true} CompletedBy:{String:cc/crdbtestjobs 2 Valid:true}}"
time=2024-04-08T02:42:47.547-04:00 level=INFO msg="job finished" worker=0 job="{ID:{Bytes:[55 177 105 201 234 84 68 211 134 237 144 62 45 182 176 121] Valid:true} Name:{String:c Valid:true} CreatedAt:{Time:2024-04-07 23:42:41.218606 +0000 UTC InfinityModifier:finite Valid:true} CompletedAt:{Time:2024-04-07 23:42:42.258597 +0000 UTC InfinityModifier:finite Valid:true} CompletedBy:{String:cc/crdbtestjobs 0 Valid:true}}"
time=2024-04-08T02:42:47.620-04:00 level=INFO msg="job finished" worker=1 job="{ID:{Bytes:[217 121 73 159 166 146 79 234 140 118 76 52 225 118 36 173] Valid:true} Name:{String:b Valid:true} CreatedAt:{Time:2024-04-07 23:42:41.218606 +0000 UTC InfinityModifier:finite Valid:true} CompletedAt:{Time:2024-04-07 23:42:42.258517 +0000 UTC InfinityModifier:finite Valid:true} CompletedBy:{String:cc/crdbtestjobs 1 Valid:true}}"
time=2024-04-08T02:42:47.982-04:00 level=INFO msg="job found" worker=0 job="{String:d Valid:true}"
time=2024-04-08T02:42:47.987-04:00 level=INFO msg="job found" worker=2 job="{String:e Valid:true}"
time=2024-04-08T02:42:48.034-04:00 level=INFO msg="job found" worker=1 job="{String:f Valid:true}"
time=2024-04-08T02:42:53.091-04:00 level=INFO msg="job finished" worker=0 job="{ID:{Bytes:[174 45 184 28 42 93 76 98 145 129 73 191 190 117 226 255] Valid:true} Name:{String:d Valid:true} CreatedAt:{Time:2024-04-07 23:42:41.218606 +0000 UTC InfinityModifier:finite Valid:true} CompletedAt:{Time:2024-04-07 23:42:47.938799 +0000 UTC InfinityModifier:finite Valid:true} CompletedBy:{String:cc/crdbtestjobs 0 Valid:true}}"
time=2024-04-08T02:42:53.091-04:00 level=INFO msg="job finished" worker=2 job="{ID:{Bytes:[76 229 28 127 149 214 68 210 174 220 73 184 30 206 11 105] Valid:true} Name:{String:e Valid:true} CreatedAt:{Time:2024-04-07 23:42:41.218606 +0000 UTC InfinityModifier:finite Valid:true} CompletedAt:{Time:2024-04-07 23:42:47.935604 +0000 UTC InfinityModifier:finite Valid:true} CompletedBy:{String:cc/crdbtestjobs 2 Valid:true}}"
time=2024-04-08T02:42:53.111-04:00 level=INFO msg="job finished" worker=1 job="{ID:{Bytes:[151 115 250 132 130 210 75 164 130 102 233 183 252 28 183 50] Valid:true} Name:{String:f Valid:true} CreatedAt:{Time:2024-04-07 23:42:41.218606 +0000 UTC InfinityModifier:finite Valid:true} CompletedAt:{Time:2024-04-07 23:42:48.009649 +0000 UTC InfinityModifier:finite Valid:true} CompletedBy:{String:cc/crdbtestjobs 1 Valid:true}}"
time=2024-04-08T02:42:53.540-04:00 level=INFO msg="no job found" worker=1
time=2024-04-08T02:42:53.540-04:00 level=INFO msg="no job found" worker=2
time=2024-04-08T02:42:53.551-04:00 level=INFO msg="no job found" worker=0
(they actually did them all)

As you can see, postgres does not have this issue at all.

Additional data / screenshots
Everything is in the linked repo

Environment:

  • CockroachDB version: Serverless, AWS v23.2.3
  • Server OS: ?
  • Client app: golang crdbpgxv5

Jira issue: CRDB-37619

EDIT: All of a sudden this stopped happening for NUM_WORKERS=2 -- both of them reliably got jobs. Running with NUM_WORKERS=3 exhibits the issue again. Idk what changed.

@asg0451 asg0451 added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 8, 2024

blathers-crl bot commented Apr 8, 2024

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

  • @cockroachdb/sql-foundations (found keywords: pgx)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Apr 8, 2024
@asg0451
Contributor Author

asg0451 commented Apr 14, 2024

I was able to reproduce this on a local crdb node (with start-single-node) with NUM_WORKERS=10. Still confused by this behaviour. Anyone have an idea?

@rafiss rafiss added the T-kv KV Team label Apr 15, 2024
@michae2
Collaborator

michae2 commented Apr 16, 2024

@asg0451 are you using read committed isolation? If so, this might be because of #115057 which is fixed in v24.1. It might be worth trying with v24.1.0-alpha.5 to see if this works correctly there.

@asg0451
Contributor Author

asg0451 commented Apr 16, 2024

hey @michae2, thanks for the reply. It's definitely reproducible on SERIALIZABLE, and IIRC CRDB might not have let me use SKIP LOCKED on READ COMMITTED at all, though I may be misremembering that.

@michae2
Collaborator

michae2 commented Apr 16, 2024

@asg0451 if you're using serializable, it would be worth trying again with optimizer_use_lock_op_for_serializable and enable_durable_locking_for_serializable turned on. One way to do that for all connections using a specific role / user is:

ALTER ROLE <rolename> SET optimizer_use_lock_op_for_serializable = on;
ALTER ROLE <rolename> SET enable_durable_locking_for_serializable = on;

In v23.2 there is a new implementation of SELECT FOR UPDATE that fixes some problems, which is only enabled for read committed isolation. Turning on these settings enables it for serializable isolation as well.
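
If changing role defaults isn't convenient, the same two variables can presumably also be set per session (the SET form appears later in this thread); a minimal sketch:

SET optimizer_use_lock_op_for_serializable = on;
SET enable_durable_locking_for_serializable = on;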

@asg0451
Contributor Author

asg0451 commented Apr 17, 2024

Hey @michae2 , I can confirm that the issue is still present with read committed isolation level.

Trying those settings results in this error when running the query: ERROR: usage of replicated locks in conjunction with skip locked wait policy is currently unsupported (SQLSTATE 0A000)

build: CCL v23.2.4 @ 2024/04/08 21:52:09 (go1.21.8 X:nocoverageredesign)

@ajwerner
Contributor

I met Miles (@asg0451) this morning, and I happen to have a debugging tool I've been pointing at cockroach. I noticed when I captured a snapshot that the initial scan was not using skip locked and was using a hard limit. That led me to take a statement bundle, which I'll paste here. I think it's just a straight-up bug in how cockroach is applying SKIP LOCKED in the context of index joins.

The statement is:

SELECT
  id, name, created_at, completed_at, completed_by
FROM
  jobs
WHERE
  completed_at IS NULL
LIMIT
  1
FOR UPDATE SKIP LOCKED

The schema is:

CREATE TABLE defaultdb.public.jobs (
	id UUID NOT NULL DEFAULT gen_random_uuid(),
	name STRING NULL,
	created_at TIMESTAMP NOT NULL DEFAULT current_timestamp():::TIMESTAMP,
	completed_at TIMESTAMP NULL,
	completed_by STRING NULL,
	CONSTRAINT jobs_pkey PRIMARY KEY (id ASC),
	INDEX pending_jobs_idx (id ASC, name ASC, created_at ASC, completed_at ASC) WHERE completed_at IS NULL
);

The plan ends up being:

• index join
│ nodes: n1
│ regions: local
│ actual row count: 0
│ KV time: 414µs
│ KV contention time: 0µs
│ KV rows decoded: 0
│ KV bytes read: 0 B
│ KV gRPC calls: 1
│ estimated max memory allocated: 20 KiB
│ estimated max sql temp disk usage: 0 B
│ estimated row count: 1
│ table: jobs@jobs_pkey
│ locking strength: for update
│ locking wait policy: skip locked
│ locking durability: guaranteed
│
└── • scan
      nodes: n1
      regions: local
      actual row count: 1
      KV time: 2ms
      KV contention time: 0µs
      KV rows decoded: 1
      KV bytes read: 67 B
      KV gRPC calls: 1
      estimated max memory allocated: 20 KiB
      estimated row count: 1 (0.99% of the table; stats collected 8 minutes ago; using stats forecast for 2 minutes ago)
      table: jobs@pending_jobs_idx (partial index)
      spans: LIMITED SCAN
      limit: 1

A problem with this plan is that the index read does not use SKIP LOCKED, so it'll return the first row in that index. Then, in the index join we'll scan the primary index, but we will use locking wait policy: skip locked. The end result is that, if that first row is locked, the primary key lookup will return zero rows. I feel like there are two problems:

  1. The limit being pushed down to the index scan probably needs to be a soft limit -- though I can see that one being controversial in its own right because it'll expand the contention footprint of the scan. The join reader should be the one enforcing a hard limit of 1.
  2. The read from the index should use skip locked

@michae2
Collaborator

michae2 commented May 17, 2024

Great point @ajwerner! Looks like the plan is correct with optimizer_use_lock_op_for_serializable = off:

  • limit
  │ count: 1
  │
  └── • index join
      │ table: jobs@jobs_pkey
      │ locking strength: for update
      │ locking wait policy: skip locked
      │
      └── • scan
            missing stats
            table: jobs@pending_jobs_idx (partial index)
            spans: FULL SCAN (SOFT LIMIT)
            locking strength: for update
            locking wait policy: skip locked

and incorrect with optimizer_use_lock_op_for_serializable = on:

  • index join
  │ table: jobs@jobs_pkey
  │ locking strength: for update
  │ locking wait policy: skip locked
  │
  └── • scan
        missing stats
        table: jobs@pending_jobs_idx (partial index)
        spans: LIMITED SCAN
        limit: 1

@michae2 michae2 added T-sql-queries SQL Queries Team A-read-committed Related to the introduction of Read Committed A-sql-optimizer SQL logical planning and optimizations. labels May 17, 2024
@ajwerner
Contributor

One more note is that the bad plan also happens with read committed without those session variables.

@ajwerner
Contributor

I think the plan is also wrong even without the index join. Consider the plan below. I think the scan at the bottom needs to use SKIP LOCKED (otherwise it could block, no?). Then, having the limit before the locking lookup join means that you can again fail to find the row because it is indeed locked. I don't know enough to know whether this is the same problem.

planning time: 158µs
execution time: 1ms
distribution: local
vectorized: true
rows decoded from KV: 100 (6.5 KiB, 2 gRPC calls)
cumulative time spent in KV: 1ms
maximum memory usage: 70 KiB
network usage: 0 B (0 messages)
regions: local
isolation level: read committed
priority: normal
quality of service: regular

• lookup join (semi)
│ nodes: n1
│ regions: local
│ actual row count: 0
│ KV time: 510µs
│ KV contention time: 0µs
│ KV rows decoded: 0
│ KV bytes read: 0 B
│ KV gRPC calls: 1
│ estimated max memory allocated: 20 KiB
│ table: jobs@jobs_pkey
│ equality: (id) = (id)
│ equality cols are key
│ locking strength: for update
│ locking wait policy: skip locked
│ locking durability: guaranteed
│
└── • limit
    │ count: 1
    │
    └── • filter
        │ nodes: n1
        │ regions: local
        │ actual row count: 1
        │ filter: completed_at IS NULL
        │
        └── • scan
              nodes: n1
              regions: local
              actual row count: 100
              KV time: 633µs
              KV contention time: 0µs
              KV rows decoded: 100
              KV bytes read: 6.5 KiB
              KV gRPC calls: 1
              estimated max memory allocated: 30 KiB
              missing stats
              table: jobs@jobs_pkey
              spans: FULL SCAN (SOFT LIMIT)

@michae2
Collaborator

michae2 commented May 17, 2024

Under read committed (or with optimizer_use_lock_op_for_serializable = on) we're performing unlocked initial scans, and then only locking the PK. So I don't think this could block during the initial scan, since it's an unlocked read. But I agree that the limit there is wrong; it needs to be above the locking lookup join.

We could do a skip-locked locking read during the initial scan, to skip rows already locked in the secondary index by any optimizer_use_lock_op_for_serializable = off transactions. I think we were trying to avoid locking all indexes, but it could help with contention in some cases.

@ajwerner
Contributor

Under read committed (or with optimizer_use_lock_op_for_serializable = on) we're performing unlocked initial scans, and then only locking the PK.

I don't think I have the mental model down for what happens when a READ COMMITTED unlocked scan arrives at an intent. I was thinking with my old serializable hat when I wrote the above comment. I guess the answer is that it pushes the lock holder's timestamp above its read time and moves on with its life. So yeah, in that case it is just the limit being below the locking read that's the bug.

I think we were trying to avoid locking all indexes, but it could help with contention in some cases.

Interesting.

@asg0451
Contributor Author

asg0451 commented May 17, 2024

Thanks for taking a look, guys! This was one of the first things I tried to build with CRDB when exploring it, so it was sad when it didn't work. Much of the above discussion is a bit above my head, but I'll revisit it as I learn more about the system.

@mgartner
Collaborator

To summarize the above:

  1. There is a bug in the locking of SKIP LOCKED when using serializable isolation: #121917 (comment)
  2. There may be a bug where a LIMIT can be incorrectly pushed below a lookup-join: #121917 (comment)
     • I'm not convinced there is a bug here. I'd need to see the query used to generate the query plan to confirm.

@ajwerner
Contributor

I'd need to see the query used to generate the query plan to confirm.

The query and the schema are in the comment above (#121917 (comment)).

@michae2
Collaborator

michae2 commented Jul 18, 2024

We worked on this during the collab session, PR coming up!

@michae2 michae2 self-assigned this Jul 18, 2024
@mgartner
Collaborator

I'd need to see the query used to generate the query plan to confirm.

The query and the schema are in the comment above (#121917 (comment)).

I was referring to the query that leads to the lookup semi-join. Looks like this does it:

CREATE TABLE defaultdb.public.jobs (
  id UUID NOT NULL DEFAULT gen_random_uuid(),
  name STRING NULL,
  created_at TIMESTAMP NOT NULL DEFAULT current_timestamp():::TIMESTAMP,
  completed_at TIMESTAMP NULL,
  completed_by STRING NULL,
  CONSTRAINT jobs_pkey PRIMARY KEY (id ASC),
  INDEX pending_jobs_idx (id ASC, name ASC, created_at ASC, completed_at ASC) WHERE completed_at IS NULL
);

SET optimizer_use_lock_op_for_serializable=on;
SET enable_durable_locking_for_serializable=on;

EXPLAIN
SELECT
  id
FROM
  jobs
WHERE
  completed_at IS NULL
LIMIT
  1
FOR UPDATE SKIP LOCKED;
--                            info
-- ------------------------------------------------------
--   distribution: local
--   vectorized: true
-- 
--   • lookup join (semi)
--   │ table: jobs@jobs_pkey
--   │ equality: (id) = (id)
--   │ equality cols are key
--   │ locking strength: for update
--   │ locking wait policy: skip locked
--   │ locking durability: guaranteed
--
--   └── • scan
--         missing stats
--         table: jobs@pending_jobs_idx (partial index)
--         spans: LIMITED SCAN
--         limit: 1
-- (16 rows)

michae2 added a commit to michae2/cockroach that referenced this issue Jul 26, 2024
We were always building the new Lock operator after building Limit
and/or Offset operators in the same SELECT. This is incorrect when the
Lock operator uses SKIP LOCKED, as the Lock might filter out rows that
should not count toward a limit or offset.

Instead, build the new Lock operator as input to Limit and/or Offset,
and then use normalization rules to push the Limit and/or Offset below
the Lock if it does not use SKIP LOCKED.

Fixes: cockroachdb#121917

Release note (bug fix): Fix a bug in which SELECT FOR UPDATE or SELECT
FOR SHARE queries using SKIP LOCKED and a LIMIT and/or an OFFSET could
return incorrect results under Read Committed isolation.
michae2 added a commit to michae2/cockroach that referenced this issue Jul 26, 2024
When using the new Lock operator to implement FOR UPDATE and FOR SHARE,
we build unlocked scans that are then followed by a final Lock
operation. If the Lock operator uses SKIP LOCKED, however, these
unlocked scans still need to be flagged with the skip-locked wait policy
to avoid blocking, even if they do not take locks themselves.

Fixes: cockroachdb#121917

Release note (bug fix): Fix a bug in which some SELECT FOR UPDATE or
SELECT FOR SHARE queries using SKIP LOCKED could still block on locked
rows under Read Committed isolation.
craig bot pushed a commit that referenced this issue Aug 1, 2024
127718: opt: fix SKIP LOCKED under Read Committed isolation r=DrewKimball a=michae2

**opt: do not push LIMIT or OFFSET below locking with SKIP LOCKED**

We were always building the new Lock operator after building Limit
and/or Offset operators in the same SELECT. This is incorrect when the
Lock operator uses SKIP LOCKED, as the Lock might filter out locked rows
that should not count toward a limit or offset.

Instead, build the new Lock operator as input to Limit and/or Offset,
and then use normalization rules to push the Limit and/or Offset below
the Lock if it does not use SKIP LOCKED.

Fixes: #121917

Release note (bug fix): Fix a bug in which SELECT FOR UPDATE or SELECT
FOR SHARE queries using SKIP LOCKED and a LIMIT and/or an OFFSET could
return incorrect results under Read Committed isolation. This bug was
present when support for SKIP LOCKED under Read Committed isolation was
introduced in v24.1.0.

---

**opt: add skip-locked to unlocked scans below SKIP LOCKED**

When using the new Lock operator to implement FOR UPDATE and FOR SHARE,
we build unlocked scans that are then followed by a final Lock
operation. If the Lock operator uses SKIP LOCKED, however, these
unlocked scans still need to be flagged with the skip-locked wait policy
to avoid blocking, even if they do not take locks themselves.

Fixes: #121917

Release note (bug fix): Fix a bug in which some SELECT FOR UPDATE or
SELECT FOR SHARE queries using SKIP LOCKED could still block on locked
rows when using optimizer_use_lock_op_for_serializable under
Serializable isolation. This bug was present when
optimizer_use_lock_op_for_serializable was introduced in v23.2.0.

Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
@craig craig bot closed this as completed in 30a3c8f Aug 1, 2024
blathers-crl bot pushed commits that referenced this issue Aug 1, 2024 (the same two fixes quoted above).
@asg0451
Contributor Author

asg0451 commented Aug 6, 2024

This still appears to be broken using my original repro script.

  • with default isolation and no SETs, eventually workers fail to find jobs to pick up, though it works for a little while
  • with read committed and no SETs, the same happens but fails faster
  • with read committed and SET optimizer_use_lock_op_for_serializable='on'; SET enable_durable_locking_for_serializable = true, the same
  • with default isolation and SET optimizer_use_lock_op_for_serializable='on'; SET enable_durable_locking_for_serializable = true, the same

tested on master @ 7ea67cc

My repro script is in the original issue body up there ^ but I'd be happy to demonstrate the problem personally if anyone would like.

@asg0451 asg0451 reopened this Aug 6, 2024
@michae2
Collaborator

michae2 commented Aug 6, 2024

Thanks for trying it out @asg0451! I will try your repro script this afternoon.

@asg0451
Contributor Author

asg0451 commented Aug 6, 2024

Sounds good. I set up a local single-node cluster and ran NUM_WORKERS=10 DATABASE_URL=postgresql://..... go run main.go

lmk if you have any questions

@michae2
Collaborator

michae2 commented Aug 14, 2024

TLDR: I think changing some cfetcher batch size logic will make a big difference.


Finally got a chance to play with this using the tip of master (37a37b0). For reference, the worker queries are as follows (correct me if this is wrong, @asg0451):

CREATE TABLE public.jobs (
    id UUID NOT NULL DEFAULT gen_random_uuid(),
    name STRING NULL,
    created_at TIMESTAMP NOT NULL DEFAULT current_timestamp():::TIMESTAMP,
    completed_at TIMESTAMP NULL,
    completed_by STRING NULL,
    CONSTRAINT jobs_pkey PRIMARY KEY (id ASC),
    INDEX pending_jobs_idx (id ASC, name ASC, created_at ASC, completed_at ASC) WHERE completed_at IS NULL
);

BEGIN;
-- GetJob
SELECT id, name, created_at, completed_at, completed_by FROM jobs where completed_at IS NULL limit 1 for update skip locked;
-- FinishJob
UPDATE jobs SET completed_at = CURRENT_TIMESTAMP, completed_by = $2 WHERE id = $1 RETURNING id, name, created_at, completed_at, completed_by;
COMMIT;

Behavior is a little different for the three modes this can run in (serializable, serializable with lock op, and read committed). In all three modes the script appears to make progress, but sometimes progress is very slow, depending on the plan used for GetJob.

For all three modes, the FinishJob plan is exactly the same, and locks a single row in the primary index:

  distribution: local
  vectorized: true

  • update
  │ estimated row count: 1
  │ table: jobs
  │ set: completed_at, completed_by
  │
  └── • render
      │
      └── • scan
            estimated row count: 1 (0.99% of the table; stats collected 3 minutes ago)
            table: jobs@jobs_pkey
            spans: [/'ff7da26e-5089-4a23-bdc4-dc19031c1434' - /'ff7da26e-5089-4a23-bdc4-dc19031c1434']
            locking strength: for update

Here's what I believe is happening with the GetJob plan in each of the three modes. (Much depends on the cfetcher's batching behavior when it has a soft limit, which controls the number of rows read (and locked) by a scan in a single batch request.)

SERIALIZABLE

The GetJob plan starts out using pending_jobs_idx:

  distribution: local
  vectorized: true

  • limit
  │ count: 1
  │
  └── • index join
      │ table: jobs@jobs_pkey
      │ locking strength: for update
      │ locking wait policy: skip locked
      │
      └── • scan
            missing stats
            table: jobs@pending_jobs_idx (partial index)
            spans: FULL SCAN (SOFT LIMIT)
            locking strength: for update
            locking wait policy: skip locked

and then after a minute (after a stats collection) the plan switches to using jobs_pkey with a filter:

  distribution: local
  vectorized: true

  • limit
  │ count: 1
  │
  └── • filter
      │ estimated row count: 85
      │ filter: completed_at IS NULL
      │
      └── • scan
            estimated row count: 2 - 101 (100% of the table; stats collected 7 seconds ago)
            table: jobs@jobs_pkey
            spans: FULL SCAN (SOFT LIMIT)
            locking strength: for update
            locking wait policy: skip locked

All workers make progress when using the first plan, which usually only reads and locks a single row in both indexes.

When using the second plan, throughput drops to just a single worker making progress. All other workers consistently get 0 rows for GetJob. This is because the second plan locks all 101 rows in the primary index, due to (a) locking before filtering out completed rows, which means (b) the query is not satisfied by the initial 1-KV batch of the scan, which means (c) we continue the scan with a second, larger batch which locks the rest of the rows in the primary index.

After a long time the script successfully finishes.

SERIALIZABLE, with optimizer_use_lock_op_for_serializable = on

Again, the plan starts out using pending_jobs_idx, but now we're only locking the primary index:

  distribution: local
  vectorized: true

  • limit
  │ count: 1
  │
  └── • index join
      │ table: jobs@jobs_pkey
      │ locking strength: for update
      │ locking wait policy: skip locked
      │
      └── • scan
            missing stats
            table: jobs@pending_jobs_idx (partial index)
            spans: FULL SCAN (SOFT LIMIT)
            locking wait policy: skip locked

Again, after stats collection, the plan switches to jobs_pkey with a filter, but now the lock is after the filter:

  distribution: local
  vectorized: true

  • limit
  │ count: 1
  │
  └── • lookup join (semi)
      │ estimated row count: 100
      │ table: jobs@jobs_pkey
      │ equality: (id) = (id)
      │ equality cols are key
      │ locking strength: for update
      │ locking wait policy: skip locked
      │
      └── • filter
          │ estimated row count: 100
          │ filter: completed_at IS NULL
          │
          └── • scan
                estimated row count: 2 - 101 (100% of the table; stats collected 5 seconds ago)
                table: jobs@jobs_pkey
                spans: FULL SCAN (SOFT LIMIT)
                locking wait policy: skip locked

Progress is uneven when using the first plan. Usually one worker can make progress, occasionally two, and the others get 0 rows for GetJob. This is because the first execution of the plan is satisfied by the initial 1-KV batch of the scan, and thus only locks one row, and then the second execution cannot be satisfied by the initial 1-KV batch and ends up locking the rest of the rows due to a larger second batch.

When using the second plan all workers make progress, except for occasional brief moments of all workers except one getting 0 rows for GetJob. This is because in the second plan we can usually satisfy the entire query in the initial 1-KV batch of the lookup join, unless two workers race to lock the same row during the lookup join. The losing worker will then read another batch of rows and will then lock all the remaining uncompleted rows.

Eventually, about 30 seconds after one of these moments, the script fails with a transaction timeout:

time=2024-08-13T15:15:37.502-07:00 level=ERROR msg="run error" err="errgroup.Wait: crdbpgx.ExecuteTx: queries.FinishJob: ERROR: query execution canceled due to transaction timeout (SQLSTATE 57014)"

I'm not sure why this is yet.

READ COMMITTED

Again, starts out using pending_jobs_idx, but with guaranteed locking durability:

  distribution: local
  vectorized: true

  • limit
  │ count: 1
  │
  └── • index join
      │ table: jobs@jobs_pkey
      │ locking strength: for update
      │ locking wait policy: skip locked
      │ locking durability: guaranteed
      │
      └── • scan
            missing stats
            table: jobs@pending_jobs_idx (partial index)
            spans: FULL SCAN (SOFT LIMIT)
            locking wait policy: skip locked

and again, after a stats collection, it switches to jobs_pkey with a filter, with locking after the filter:

  distribution: local
  vectorized: true

  • limit
  │ count: 1
  │
  └── • lookup join (semi)
      │ estimated row count: 83
      │ table: jobs@jobs_pkey
      │ equality: (id) = (id)
      │ equality cols are key
      │ locking strength: for update
      │ locking wait policy: skip locked
      │ locking durability: guaranteed
      │
      └── • filter
          │ estimated row count: 83
          │ filter: completed_at IS NULL
          │
          └── • scan
                estimated row count: 2 - 101 (100% of the table; stats collected 3 seconds ago)
                table: jobs@jobs_pkey
                spans: FULL SCAN (SOFT LIMIT)
                locking wait policy: skip locked

Again, progress is uneven when using the first plan. Usually one or two workers can make progress, and the others get 0 rows for GetJob. This is due to the same explanation as above.

When using the second plan all workers make progress. Unlike the serializable with lock op mode, however, under this mode the script successfully finishes. I think this probably has something to do with readers not waiting for writers under RC.


So I think a couple more fixes will help:

  • cfetcher should be less aggressive about growing batch sizes for these locking scans with a soft limit
  • maybe we should be locking more than just the primary index when using the new lock op, to prevent some races

@asg0451
Contributor Author

asg0451 commented Aug 14, 2024

Thanks @michae2, those queries look correct, as does your explanation of the symptoms. I just tried running the script with select .. from jobs@pending_jobs_idx .. (sketched below) and that works perfectly (under serializable), backing up what you said above.

(Unfortunately, since that's not valid postgres syntax, I can't use this as a workaround in my program, but anyway.)
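
For concreteness, a sketch of that workaround query with the CockroachDB index hint (the column list is assumed from schema.sql; as noted above, the @index syntax is CockroachDB-specific and not valid PostgreSQL):

SELECT id, name, created_at, completed_at, completed_by
FROM jobs@pending_jobs_idx
WHERE completed_at IS NULL
LIMIT 1
FOR UPDATE SKIP LOCKED;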

Why does locking before/after the filter matter? Naively I'd think we'd want it to lock after the limit so we only lock one row.

@michae2
Collaborator

michae2 commented Aug 14, 2024

Why does locking before/after the filter matter?

Locking before the filter has two effects: (1) it causes us to lock extra rows that are rejected by the filter, and (2) because of the batching logic it causes us to not answer the query using the first 1-KV batch, requiring a second batch which then locks the rest of the rows in the table.

Naively I'd think we'd want it to lock after the limit so we only lock one row.

This is what we do for normal SELECT FOR UPDATE. For SKIP LOCKED, however, we must check for skip locked before applying the limit; otherwise we might miss an unlocked row that could be returned.

We could do something like (lock (limit (skip-locked read))) but right now the locking and skip-locked read are the same operation, so it has to be (limit (lock skip-locked read)).
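
As a concrete (and hypothetical) two-session illustration of why the skip-locked check has to happen before the limit, using the jobs schema above:

-- session 1: grab and hold a lock on some uncompleted row X
BEGIN;
SELECT id FROM jobs WHERE completed_at IS NULL LIMIT 1 FOR UPDATE SKIP LOCKED;

-- session 2, while session 1 is still open:
-- If the LIMIT were applied below the skip-locked check, the plan could pick
-- exactly row X, find it locked, skip it, and return zero rows even though
-- other unlocked rows exist. With the limit applied above the locking
-- operation, the scan keeps reading until it finds an unlocked row.
BEGIN;
SELECT id FROM jobs WHERE completed_at IS NULL LIMIT 1 FOR UPDATE SKIP LOCKED;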

@michae2
Collaborator

michae2 commented Aug 14, 2024

I talked to @rytaft about this, and we think the proper way to improve this workload is to push the limit down into index join and lookup join, so that we use hard-limit behavior instead of soft-limit behavior.

Unfortunately, JoinReader doesn't yet support limits. But we need to do this for #128704 anyway. So after #128704 is fixed, we can come back to this and add a rule to push the limit into index join and lookup join.

@michae2 michae2 removed their assignment Aug 14, 2024
@michae2 michae2 added C-performance Perf of queries or internals. Solution not expected to change functional behavior. and removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Aug 14, 2024
@asg0451
Contributor Author

asg0451 commented Aug 15, 2024

Thanks @michae2. Can you expand on this a bit: "because of the batching logic it causes us to not answer the query using the first 1-KV batch, requiring a second batch which then locks the rest of the rows in the table"? I haven't dug into KV/SQL internals enough to understand this quite yet. Maybe you could point me at some resources?

@michae2
Collaborator

michae2 commented Aug 15, 2024

Can you expand on this a bit: "because of the batching logic it causes us to not answer the query using the first 1-KV batch, requiring a second batch which then locks the rest of the rows in the table"? I haven't dug into KV/SQL internals enough to understand this quite yet. Maybe you could point me at some resources?

Ah yes, I did not explain this well; let me try again. For a query plan like this (the second plan under serializable):

  distribution: local
  vectorized: true

  • limit
  │ count: 1
  │
  └── • filter
      │ estimated row count: 85
      │ filter: completed_at IS NULL
      │
      └── • scan
            estimated row count: 2 - 101 (100% of the table; stats collected 7 seconds ago)
            table: jobs@jobs_pkey
            spans: FULL SCAN (SOFT LIMIT)
            locking strength: for update
            locking wait policy: skip locked

the execution engine has to make some choices about how many rows to process with each operator at once. It could process a single row with each operator all the way up the plan (depth-first execution), or it could process all possible rows with the bottommost operator, then all possible rows with the next operator, etc (breadth-first execution). The former ensures we process the fewest number of rows, but has high overhead. The latter has the lowest overhead but also processes the max number of rows.

The strategy currently used by the execution engine is to process a batch of rows at once in each operator, in the hopes of balancing tradeoffs. The default batch size is 10k rows I think. But when there is a "soft limit", meaning a limit somewhere up above in the plan, the execution engine uses the soft limit as the size of the first batch. So for the locking scan, the execution engine first asks for 1 KV from kvserver, and passes this all the way up the plan. If the query has not finished after this first batch, the locking scan then asks for 10k KVs from kvserver.
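
If it helps to see this in practice: one way to check how many rows a given plan ends up reading (and, for a locking scan, locking) is to run the GetJob query under EXPLAIN ANALYZE and compare the per-operator "KV rows decoded" counters, as in the plans quoted earlier in this thread:

EXPLAIN ANALYZE
SELECT id, name, created_at, completed_at, completed_by
FROM jobs
WHERE completed_at IS NULL
LIMIT 1
FOR UPDATE SKIP LOCKED;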

@asg0451
Contributor Author

asg0451 commented Aug 15, 2024

I see: so because that first batch doesn't match the filter (completed_at is not null), the query isn't satisfied by that batch. Then, instead of trying one more, it tries 10k. And since it is doing the locking at the lowest level, it locks the whole table (or up to 10k rows) as a result of the query.

Thanks for the explanation!

michae2 added commits that referenced this issue Aug 16, 2024 (the same two fixes quoted above).