SELECT FOR UPDATE SKIP LOCKED skips unlocked rows sometimes? #121917
Hello, I am Blathers. I am here to help you get the issue triaged. Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here. I have CC'd a few people who may be able to assist you:
If we have not gotten back to your issue within a few business days, you can try the following:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
I was able to reproduce this on a local crdb node (with
Hey @michae2, thanks for the reply. It's definitely reproducible on SERIALIZABLE, and iirc crdb might not have let me use SKIP LOCKED on READ COMMITTED at all, though I may be misremembering that.
@asg0451 if you're using serializable, it would be worth trying again with
In v23.2 there is a new implementation of SELECT FOR UPDATE that fixes some problems, which is only enabled for read committed isolation. Turning on these settings enables it for serializable isolation as well.
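For reference, the settings in question are (quoted verbatim from the repro later in this thread):

SET optimizer_use_lock_op_for_serializable = on;
SET enable_durable_locking_for_serializable = on;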
Hey @michae2, I can confirm that the issue is still present with read committed isolation level. Trying those settings results in this error when running the query:

build: CCL v23.2.4 @ 2024/04/08 21:52:09 (go1.21.8 X:nocoverageredesign)
I met Miles (@asg0451) this morning, and I happen to have a debugging tool I've been pointing at cockroach. I noticed when I captured a snapshot that the initial scan was not using skip locked and was using a hard limit. That led me to take a statement bundle, which I'll paste here. I think it's just a straight-up bug in how cockroach is applying

The statement is:

SELECT
  id, name, created_at, completed_at, completed_by
FROM
  jobs
WHERE
  completed_at IS NULL
LIMIT
  1
FOR UPDATE SKIP LOCKED

The schema is:

CREATE TABLE defaultdb.public.jobs (
  id UUID NOT NULL DEFAULT gen_random_uuid(),
  name STRING NULL,
  created_at TIMESTAMP NOT NULL DEFAULT current_timestamp():::TIMESTAMP,
  completed_at TIMESTAMP NULL,
  completed_by STRING NULL,
  CONSTRAINT jobs_pkey PRIMARY KEY (id ASC),
  INDEX pending_jobs_idx (id ASC, name ASC, created_at ASC, completed_at ASC) WHERE completed_at IS NULL
);

The plan ends up being:
A problem with this plan is that the index read does not use
Great point @ajwerner! Looks like the plan is correct with
and incorrect with
One more note is that the bad plan also happens with read committed without those session variables.
I think the plan is also wrong even without the index join. Consider the below plan. I think the scan at the bottom needs to have a
Under read committed (or with

We could do a skip-locked locking read during the initial scan, to avoid any
I don't think I have the mental model down for what happens when a
Interesting.
Thanks for taking a look, guys! This was one of the first things I tried to build with CRDB when exploring it, so it was sad when it didn't work. Much of the above discussion is a bit above my head, but I'll revisit it as I learn more about the system.
To summarize the above:
The query and the schema are in the comment above (#121917 (comment)).
We worked on this during the collab session, PR coming up!
I was referring to the query that leads to the lookup semi-join. Looks like this does it:

CREATE TABLE defaultdb.public.jobs (
id UUID NOT NULL DEFAULT gen_random_uuid(),
name STRING NULL,
created_at TIMESTAMP NOT NULL DEFAULT current_timestamp():::TIMESTAMP,
completed_at TIMESTAMP NULL,
completed_by STRING NULL,
CONSTRAINT jobs_pkey PRIMARY KEY (id ASC),
INDEX pending_jobs_idx (id ASC, name ASC, created_at ASC, completed_at ASC) WHERE completed_at IS NULL
);
SET optimizer_use_lock_op_for_serializable=on;
SET enable_durable_locking_for_serializable=on;
EXPLAIN
SELECT
id
FROM
jobs
WHERE
completed_at IS NULL
LIMIT
1
FOR UPDATE SKIP LOCKED;
-- info
-- ------------------------------------------------------
-- distribution: local
-- vectorized: true
--
-- • lookup join (semi)
-- │ table: jobs@jobs_pkey
-- │ equality: (id) = (id)
-- │ equality cols are key
-- │ locking strength: for update
-- │ locking wait policy: skip locked
-- │ locking durability: guaranteed
-- │
-- └── • scan
-- missing stats
-- table: jobs@pending_jobs_idx (partial index)
-- spans: LIMITED SCAN
-- limit: 1
-- (16 rows)
127718: opt: fix SKIP LOCKED under Read Committed isolation r=DrewKimball a=michae2

**opt: do not push LIMIT or OFFSET below locking with SKIP LOCKED**

We were always building the new Lock operator after building Limit and/or Offset operators in the same SELECT. This is incorrect when the Lock operator uses SKIP LOCKED, as the Lock might filter out locked rows that should not count toward a limit or offset. Instead, build the new Lock operator as input to Limit and/or Offset, and then use normalization rules to push the Limit and/or Offset below the Lock if it does not use SKIP LOCKED.

Fixes: #121917

Release note (bug fix): Fix a bug in which SELECT FOR UPDATE or SELECT FOR SHARE queries using SKIP LOCKED and a LIMIT and/or an OFFSET could return incorrect results under Read Committed isolation. This bug was present when support for SKIP LOCKED under Read Committed isolation was introduced in v24.1.0.

**opt: add skip-locked to unlocked scans below SKIP LOCKED**

When using the new Lock operator to implement FOR UPDATE and FOR SHARE, we build unlocked scans that are then followed by a final Lock operation. If the Lock operator uses SKIP LOCKED, however, these unlocked scans still need to be flagged with the skip-locked wait policy to avoid blocking, even if they do not take locks themselves.

Fixes: #121917

Release note (bug fix): Fix a bug in which some SELECT FOR UPDATE or SELECT FOR SHARE queries using SKIP LOCKED could still block on locked rows when using optimizer_use_lock_op_for_serializable under Serializable isolation. This bug was present when optimizer_use_lock_op_for_serializable was introduced in v23.2.0.

Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
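In plan-shape terms (a schematic of the fix described above, not actual optimizer output):

-- Before: Lock is built on top of Limit. With SKIP LOCKED, rows that Lock
-- skips have already been counted by Limit, so the query can come up short
-- even when unlocked rows exist.
Lock(skip-locked, Limit(1, Scan(jobs)))

-- After: Lock is built below Limit, so Limit counts only rows that survive
-- the skip-locked check. A normalization rule pushes Limit back below Lock
-- when SKIP LOCKED is not used.
Limit(1, Lock(skip-locked, Scan(jobs)))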
This still appears to be broken using my original repro script.
Tested on master @ 7ea67cc. My repro script is in the original issue body up there ^ but I'd be happy to demonstrate the problem personally if anyone would like.
Thanks for trying it out @asg0451! I will try your repro script this afternoon.
Sounds good. I set up a local single-node cluster and ran

lmk if you have any questions.
TLDR: I think changing some cfetcher batch size logic will make a big difference.

Finally got a chance to play with this using the tip of master (37a37b0). For reference, the worker queries are as follows (correct me if this is wrong, @asg0451):

CREATE TABLE public.jobs (
id UUID NOT NULL DEFAULT gen_random_uuid(),
name STRING NULL,
created_at TIMESTAMP NOT NULL DEFAULT current_timestamp():::TIMESTAMP,
completed_at TIMESTAMP NULL,
completed_by STRING NULL,
CONSTRAINT jobs_pkey PRIMARY KEY (id ASC),
INDEX pending_jobs_idx (id ASC, name ASC, created_at ASC, completed_at ASC) WHERE completed_at IS NULL
);
BEGIN;
-- GetJob
SELECT id, name, created_at, completed_at, completed_by FROM jobs where completed_at IS NULL limit 1 for update skip locked;
-- FinishJob
UPDATE jobs SET completed_at = CURRENT_TIMESTAMP, completed_by = $2 WHERE id = $1 RETURNING id, name, created_at, completed_at, completed_by;
COMMIT;

Behavior is a little different for the three modes this can run in (serializable, serializable with lock op, and read committed). In all three modes the script appears to make progress, but sometimes it is very slow progress depending on the plan used for GetJob. For all three modes, the FinishJob plan is exactly the same, and locks a single row in the primary index:
Here's what I believe is happening with the GetJob plan in each of the three modes. (Much depends on the cfetcher's batching behavior when it has a soft limit, which controls the number of rows read (and locked) by a

SERIALIZABLE

The GetJob plan starts out using
and then after a minute (after a stats collection) the plan switches to using
All workers make progress when using the first plan, which usually only reads and locks a single row in both indexes. When using the second plan, throughput drops to just a single worker making progress. All other workers consistently get 0 rows for GetJob. This is because the second plan locks all 101 rows in the primary index, due to (a) locking before filtering out completed rows, which means (b) we can't satisfy the query after the initial 1-KV batch of the scan, which means (c) we continue the scan with a second, larger batch which locks the rest of the rows in the primary index. After a long time the script successfully finishes.

SERIALIZABLE, with optimizer_use_lock_op_for_serializable = on

Again, the plan starts out using
Again, after stats collection, the plan switches to
Progress is uneven when using the first plan. Usually one worker can make progress, occasionally two, and the others get 0 rows for GetJob. This is because the first execution of the plan satisfies the query using the initial 1-KV batch of the scan, and thus only locks one row, while the second execution cannot be satisfied by the initial 1-KV batch and ends up locking the rest of the rows due to a larger second batch. When using the second plan all workers make progress, except for occasional brief moments when all workers but one get 0 rows for GetJob. This is because in the second plan we can usually satisfy the entire query in the initial 1-KV batch of the lookup join, unless two workers race to lock the same row during the lookup join. The losing worker will then read another batch of rows and will then lock all the remaining uncompleted rows. Eventually, about 30 seconds after one of these moments, the script fails with a transaction timeout:
I'm not sure why this is yet.

READ COMMITTED

Again, the plan starts out using
and again, after a stats collection, it switches to
Again, progress is uneven when using the first plan. Usually one or two workers can make progress, and the others get 0 rows for GetJob, for the same reason as above. When using the second plan all workers make progress. Unlike the serializable-with-lock-op mode, however, under this mode the script successfully finishes. I think this probably has something to do with readers not waiting for writers under RC. So I think a couple more fixes will help:
Thanks @michae2, those queries look correct, as does your explanation of the symptoms. I just tried running the script with

(unfortunately, since that's not valid postgres syntax, I can't use this as a workaround in my program, but anyway)

Why does locking before/after the filter matter? Naively I'd think we'd want it to lock after the limit so we only lock one row.
Locking before the filter has two effects: (1) it causes us to lock extra rows that are rejected by the filter, and (2) because of the batching logic it causes us to not answer the query using the first 1-KV batch, requiring a second batch which then locks the rest of the rows in the table.
This is what we do for normal SELECT FOR UPDATE. For SKIP LOCKED, however, we must apply the skip-locked check before applying the limit, otherwise we might miss an unlocked row that could be returned. We could do something like (lock (limit (skip-locked read))) but right now the locking and the skip-locked read are the same operation, so it has to be (limit (lock skip-locked read)).
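As a concrete illustration (a hypothetical two-session run against the jobs schema above):

-- Session 1: lock one pending job and hold the transaction open.
BEGIN;
SELECT id FROM jobs WHERE completed_at IS NULL LIMIT 1 FOR UPDATE;

-- Session 2: the skip-locked check must run before the limit. If the limit
-- were applied first, this query could pick exactly the row session 1
-- holds, skip it, and return 0 rows even though other unlocked pending
-- jobs exist.
BEGIN;
SELECT id FROM jobs WHERE completed_at IS NULL LIMIT 1 FOR UPDATE SKIP LOCKED;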
I talked to @rytaft about this, and we think the proper way to improve this workload is to push the limit down into index join and lookup join, so that we use hard-limit behavior instead of soft-limit behavior. Unfortunately, JoinReader doesn't yet support limits. But we need to do this for #128704 anyway. So after #128704 is fixed, we can come back to this and add a rule to push the limit into index join and lookup join.
Thanks @michae2. Can you expand on this a bit?
Ah yes, I did not explain this well, let me try again. For a query plan like this (the second plan under serializable):
the execution engine has to make some choices about how many rows to process with each operator at once. It could process a single row with each operator all the way up the plan (depth-first execution), or it could process all possible rows with the bottommost operator, then all possible rows with the next operator, and so on (breadth-first execution). The former ensures we process the fewest number of rows, but has high overhead. The latter has the lowest overhead but also processes the maximum number of rows. The strategy currently used by the execution engine is to process a batch of rows at once in each operator, in the hopes of balancing these tradeoffs. The default batch size is 10k rows, I think. But when there is a "soft limit", meaning a limit somewhere up above in the plan, the execution engine uses the soft limit as the size of the first batch. So for the locking scan, the execution engine first asks for 1 KV from kvserver, and passes this all the way up the plan. If the query has not finished after this first batch, the locking scan then asks for 10k KVs from kvserver.
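A rough sketch of that batch-sizing policy (hypothetical names; a simplification of the real cfetcher logic, assuming the 10k default mentioned above):

package main

import "fmt"

// batchSize returns how many KVs to request from kvserver for the next
// batch of a scan. Simplified model of the behavior described above: the
// first batch is sized by the soft limit (a LIMIT somewhere above in the
// plan); any later batch falls back to a large default, which is how a
// LIMIT 1 locking scan can end up reading (and locking) far more rows
// than it returns.
func batchSize(softLimit, batchIdx int) int {
	const defaultBatchSize = 10000
	if batchIdx == 0 && softLimit > 0 {
		return softLimit
	}
	return defaultBatchSize
}

func main() {
	fmt.Println(batchSize(1, 0)) // first batch: 1 KV
	fmt.Println(batchSize(1, 1)) // second batch: 10000 KVs
}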
I see, so because that first batch doesn't match the filter (completed_at is not null), the query isn't satisfied by that batch. Then instead of trying one more, it tries 10k. And it is doing the locking at the lowest level, so it locks the whole table (or up to 10k rows) as a result of the query. Thanks for the explanation!
Describe the problem
I'm working on a job queue system backed by CRDB and I've run into a situation I don't understand. I'm hoping someone can shed some light on it for me.
To Reproduce
DATABASE_URL="<db conn str>" NUM_WORKERS=2 go run .
The program sets up a jobs table (see schema.sql), seeds it with a few jobs, then runs some workers that poll for work using SELECT FOR UPDATE SKIP LOCKED (code here; a rough sketch of the worker loop follows the next paragraph).

Expected behavior
I expect each worker to grab a job, "process" it, then grab a new one, until no jobs remain. However, observing the log output, you can see that frequently workers fail to find jobs. In fact, it seems like after the first job each gets, it fails to see a new job while the other is working on one. Why is this?
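For reference, a minimal sketch of such a worker loop (hypothetical code, not the actual repro script; assumes the jobs schema above and a Postgres-wire driver for database/sql):

package main

import (
	"database/sql"
	"fmt"
	"log"
	"os"
	"time"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks the same protocol.
)

func worker(db *sql.DB, name string) error {
	for {
		tx, err := db.Begin()
		if err != nil {
			return err
		}
		// GetJob: claim one pending job, skipping rows locked by other workers.
		var id string
		err = tx.QueryRow(
			`SELECT id FROM jobs WHERE completed_at IS NULL LIMIT 1 FOR UPDATE SKIP LOCKED`,
		).Scan(&id)
		if err == sql.ErrNoRows {
			// Expected only when the queue is drained; per this issue, it
			// also happens while other workers each hold just one row.
			tx.Rollback()
			time.Sleep(100 * time.Millisecond)
			continue
		}
		if err != nil {
			tx.Rollback()
			return err
		}
		time.Sleep(50 * time.Millisecond) // "process" the job
		// FinishJob: mark the claimed job as completed.
		if _, err := tx.Exec(
			`UPDATE jobs SET completed_at = current_timestamp, completed_by = $2 WHERE id = $1`,
			id, name,
		); err != nil {
			tx.Rollback()
			return err
		}
		if err := tx.Commit(); err != nil {
			return err
		}
		fmt.Printf("%s finished job %s\n", name, id)
	}
}

func main() {
	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	if err := worker(db, "worker-1"); err != nil {
		log.Fatal(err)
	}
}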
Example run (annotated):
Contrast this with postgres's behaviour under the same code:
As you can see, postgres does not have this issue at all.
Additional data / screenshots
Everything is in the linked repo
Environment:
Jira issue: CRDB-37619
EDIT: All of a sudden this stopped happening for NUM_WORKERS=2 -- both of them reliably got jobs. Running with NUM_WORKERS=3 exhibits the issue again. Idk what changed.