
Multi-record UPSERT inserts duplicate values in PRIMARY KEY, resulting in inconsistent results #44466

Closed
mrigger opened this issue Jan 28, 2020 · 12 comments · Fixed by #45372
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

mrigger commented Jan 28, 2020

Consider the following testcase:

CREATE TABLE t0(c0 INT PRIMARY KEY, c1 BOOL, c2 INT UNIQUE);
INSERT INTO t0(c0) VALUES (0);
UPSERT INTO t0(c2, c0) VALUES (0, 0), (1, 0);
SELECT * FROM t0; -- {0 | NULL | 1}

A single row is fetched, while the following query fetches two records with a duplicate value for c0:

SELECT c0 FROM t0; -- {0, 0}

I found this issue in CockroachDB based on commit 1917a12003daad100b491e2b0195d7df5909e63f.

@yuzefovich yuzefovich added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jan 29, 2020
rohany (Contributor) commented Jan 30, 2020

I found the cause of this; I'll have to dig in more to see why it happens:

  output row: ['movr']
  rows affected: 1
  Scan /Table/59/1/0{-/#}
  querying next range at /Table/59/1/0
  r59: sending batch 1 Scan to (n1,s1):1
  fetched: /t0/primary/0 -> NULL
  Put /Table/59/1/0/0 -> /TUPLE/3:3:Int/0
  Del /Table/59/2/NULL/0/0
  CPut /Table/59/2/0/0 -> /BYTES/0x88 (expecting does not exist)
  Put /Table/59/1/0/0 -> /TUPLE/3:3:Int/1
  Del /Table/59/2/NULL/0/0
  CPut /Table/59/2/1/0 -> /BYTES/0x88 (expecting does not exist)
  querying next range at /Table/59/1/0/0
  r59: sending batch 2 Put, 2 CPut, 2 Del, 1 EndTxn to (n1,s1):1
  fast path completed
  rows affected: 2
  output row: ['movr']
  rows affected: 1
  output row: ['movr']
  rows affected: 1

The upsert issues Del /Table/59/2/NULL/0/0 twice rather than deleting /Table/59/2/0/0 the second time. A query that then does an index scan on index 2 sees 2 rows.

rohany (Contributor) commented Feb 4, 2020

I investigated this more, and it is because the upserter cannot see the writes that it makes. It first updates the row from 0, NULL, NULL to 0, NULL, 0, but then cannot see that update. So it next deletes 0, NULL, NULL and inserts 0, NULL, 1 instead of deleting 0, NULL, 0. Based on an offline conversation with @RaduBerinde, this seems like something that was intended in the optimizer?
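To make the failure mode concrete, here is a minimal hypothetical sketch (plain Python, not CockroachDB code) of an upserter that reads the table once up front and then applies every input row against that stale snapshot, reproducing the corrupted secondary index from the testcase above:

```python
# Hypothetical simulation: the upsert reads existing rows once, then applies
# each VALUES row against that stale snapshot, never seeing its own writes.

primary = {0: {"c0": 0, "c1": None, "c2": None}}    # PK -> row
secondary = {None: 0}                                # c2 -> c0 (unique index)

snapshot = {k: dict(v) for k, v in primary.items()}  # the single up-front read

for c2, c0 in [(0, 0), (1, 0)]:                      # UPSERT ... VALUES (0,0),(1,0)
    old = snapshot.get(c0)                           # stale: never sees prior writes
    primary[c0] = {"c0": c0, "c1": old["c1"], "c2": c2}
    secondary.pop(old["c2"], None)                   # deletes the NULL entry twice,
    secondary[c2] = c0                               # never deletes the c2=0 entry

print(primary)    # one row via the primary index
print(secondary)  # two entries via the secondary index: inconsistent
```

This mirrors the trace: Del /Table/59/2/NULL/0/0 is issued twice, and the c2=0 index entry written by the first input row is never cleaned up.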

RaduBerinde (Member):
The optimizer's intended behavior w.r.t. duplicate values in UPSERT is described in #37880. In this case it looks like the PK and the index have become inconsistent (the first query returns one row and the second returns two). Querying each index directly makes this clearer:

root@127.149.115.8:37621/movr> SELECT c0, c2 FROM t0@primary;
  c0 | c2  
+----+----+
   0 |  1  
(1 row)

root@127.149.115.8:37621/movr> SELECT c0, c2 FROM t0@t0_c2_key;
  c0 | c2  
+----+----+
   0 |  0  
   0 |  1  
(2 rows)

Regardless of the semantics of UPSERT, this is definitely a bug. It seems like the "cannot affect row a second time" error should have been triggered.
CC @andy-kimball

andy-kimball (Contributor):

My first look suggests that we'd need to add a DISTINCT ON clause on the primary key in order to fix this. However, there are downsides:

  1. In cases where we can't optimize it away, it'd be a non-negligible drag on performance.
  2. We'd be less compatible with PG, since PG errors when the same upsert statement attempts to touch the same row twice. Instead, we'd silently pick one of the duplicate values to apply.

I may be missing another option, but I think avoiding problems #1 and #2 above would require KV support. We'd need a way to tell KV to raise an error if a write tries to place an intent on a row that already has an intent from a previous write with seqNum >= X, where X = first seqNum used by the current statement.

CC @knz, @andreimatei, @nvanbenschoten, in case they have opinions on:

  1. How much of a perf concern is adding DISTINCT ON to every UPSERT statement (assuming we optimize it away in the common case where the input is a constant VALUES clause)?
  2. How difficult/disruptive would a solution at the KV level be? Does my suggestion above have fatal flaws? Is there an alternate better way?

knz (Contributor) commented Feb 5, 2020

Generally:

  1. I think the DISTINCT ON on the input will make more pg users happy? I vaguely recall there was a discussion about this before. No comments on performance.

  2. Regarding a KV solution: it's not unreasonable, because it's closer to what pg does, but there's a caveat (see below).

We'd need a way to tell KV to raise an error if a write tries to place an intent on a row that already has an intent from a previous write with seqNum >= X, where X = first seqNum used by the current statement.

That sounds complicated. I'd instead use the new machinery introduced to fix the Halloween problem: seqnums that mark statement boundaries.

Then the rule is simpler: we can ask MVCC to raise an error when creating an intent if there's already one past the current read seqnum. That's more or less a direct translation of pg's rule "no touching the same row twice under the same sub-transaction".
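As a hypothetical sketch of that rule (invented names; not KV code): a write errors out when the key already carries an intent written at or after the seqnum where the current statement began reading.

```python
# Sketch of the proposed MVCC-level check: reject a write to a key that was
# already written by the current statement (intent seqnum >= read seqnum).

class DuplicateWriteError(Exception):
    pass

def write_intent(intents, key, seqnum, read_seqnum):
    """intents maps key -> seqnum of the latest intent on that key."""
    existing = intents.get(key)
    if existing is not None and existing >= read_seqnum:
        raise DuplicateWriteError(f"key {key} already written by this statement")
    intents[key] = seqnum

intents = {}
write_intent(intents, "/t0/primary/0", seqnum=5, read_seqnum=5)      # first write: ok
try:
    write_intent(intents, "/t0/primary/0", seqnum=6, read_seqnum=5)  # same statement
except DuplicateWriteError as e:
    print(e)  # second write to the same key within the statement is rejected
```

Per the caveat that follows, such a check operates on KV keys, i.e. on individual column families, not on logical rows.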

The caveat: all these KV-based solutions can only detect duplicate writes when there is a single column family:

  1. If there are 2+ column families, it's possible for a complex INSERT ON CONFLICT clause to alternately touch different families of the same row without detecting a duplicate write.
  2. There's a weird special case when an UPDATE or INSERT ON CONFLICT updates a column that's part of the PK; in that case we delete the current row and create a new one. I believe (but am not sure) it's also possible for this to touch one column family without checking the others, in which case an earlier write may not be detected as a duplicate.

What I am trying to convey with these pitfalls is that we need a test suite that exercises all the combinations of code paths, and then to evaluate the solution using that test suite. If we don't, we'll get weird problems (like the one motivating this discussion).

nvanbenschoten (Member):

I agree with @knz - pushing this complexity into KV is possible, but before doing so, we'd need to make sure that the semantics we want can be defined at the per-column-family level rather than the per-row level. Verifying that seems like a non-trivial task, and I don't know that any one person has a good enough sense of when we read and write (and omit reads and writes of) partial rows to be able to say a per-column-family check would be sufficient.

I'd like to hear a bit more about how we could optimize the DISTINCT ON away for constant VALUES clauses. That is by far the most common use case of UPSERT, so if we could make this behavior free in that case and only marginally slow down performance in the others in order to fully mirror PG's behavior, that would be a compelling option.

RaduBerinde (Member):

I'd like to hear a bit more about how we could optimize the DISTINCT ON away for constant VALUES clauses

Essentially, it would be the optimizer running the DISTINCT ON logic during planning. So it won't be free, but the optimizer already interns Datums, so it could compare pointers instead of encoding Datums.
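A hypothetical sketch of that plan-time check (invented names; not optimizer code): with interned datums, equal values share one canonical object, so distinctness of a constant VALUES clause can be decided by comparing object identities rather than re-encoding each datum.

```python
# Sketch: decide at plan time whether constant VALUES rows are provably
# distinct on the conflict columns, so the runtime distinct check can be
# dropped. `interner` maps a value to its canonical (interned) object.

def values_provably_distinct(rows, key_indices, interner):
    seen = set()
    for row in rows:
        key_vals = [row[i] for i in key_indices]
        if any(v is None for v in key_vals):
            continue  # NULLs are never equal under a unique index
        key = tuple(id(interner.setdefault(v, v)) for v in key_vals)  # pointer compare
        if key in seen:
            return False
        seen.add(key)
    return True

print(values_provably_distinct([(0, 0), (1, 0)], [0], {}))  # True: c2 values differ
print(values_provably_distinct([(0, 0), (0, 1)], [0], {}))  # False: duplicate c2=0
```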

knz (Contributor) commented Feb 5, 2020

Heads up: we found this blog post, and point 6 in it seems related.

I'd like to invite @karlseguin to post the example here that triggered the duplicate error on INSERT ON CONFLICT.

andy-kimball (Contributor):

@knz, I think you're confusing the INSERT ... DO NOTHING variation with the simple UPSERT case. That blog post and much of the discussion on #37880 are about INSERT ... DO NOTHING (i.e. we shouldn't raise any error on conflicts caused by input rows with duplicate key values). This issue is only about the UPSERT case. While the two are related, they differ in important ways, and fixing one does not fix the other.

andy-kimball (Contributor):

I tried to implement the DistinctOn solution, but ran into several issues:

  1. PG (usually, but not always) raises an error when the same row is updated twice. However, if I use the stock DistinctOn operator, CRDB would just arbitrarily apply one of the operations and ignore the others. Is that difference in behavior OK? We'd have to document it as undefined which operation CRDB applies when there are duplicates.

  2. It turns out that the DistinctOn operator doesn't have the right NULL behavior. It treats NULL values as equal to one another, whereas the Insert operator treats NULL values as not equal to one another. I don't see any reasonable way around this, short of implementing another variation of the DistinctOn operator. Ideas are welcome.

andy-kimball (Contributor):

Re: #2, one janky solution I thought about was creating an expression similar to this for nullable columns in a conflict index:

CREATE TABLE t (x INT PRIMARY KEY, y INT, UNIQUE (y));
SELECT DISTINCT ON (y, CASE WHEN y IS NULL THEN x ELSE NULL END) * FROM t

andy-kimball (Contributor):

I decided to go ahead with a new variation of the DistinctOn operator that has the right NULL behavior and that raises an error when it detects duplicate input rows.

andy-kimball added a commit to andy-kimball/cockroach that referenced this issue Feb 22, 2020
Previously, there were cases where an UPSERT or INSERT..ON CONFLICT
statement could operate on the same row more than once. This is a
problem because these statements are designed to read once to get
existing rows, and then to insert or update based on that information.
They don't expect that previous writes for the same statement will
affect the correctness of subsequent writes. This can lead to index
corruption in cases where a row "moves" from one location to another
when one of its index keys changes. See issue cockroachdb#44466 for an example.

This commit fixes the problem by introducing a new variation of the
DistinctOn operator that ensures that the input to the Upsert
operator never has duplicates. It differs from the regular DistinctOn
operator in two ways:

  1. Null behavior: UpsertDistinctOn treats NULL values as not equal
     to one another for purposes of grouping. Two rows having a NULL-
     valued grouping column will be placed in different groups. This
     differs from DistinctOn behavior, where the two rows would be
     grouped together. This behavior difference reflects SQL semantics,
     in which a unique index key still allows multiple NULL values.

  2. Duplicate behavior: UpsertDistinctOn raises an error if any
     distinct grouping contains more than one row. It has "input must
     be distinct" semantics rather than "make the input distinct"
     semantics. This is used to ensure that no row will be updated
     more than once.

The optbuilder now wraps the input to the Upsert operator with this
new UpsertDistinctOn operator.

Fixes cockroachdb#44466

Release note (sql change): UPSERT and INSERT..ON CONFLICT statements
will sometimes need to do an extra check to ensure that they never
update the same row twice. This may adversely affect performance in
cases where the optimizer cannot statically prove the extra check is
unnecessary.
andy-kimball added further commits to andy-kimball/cockroach that referenced this issue with the same commit message on Feb 25, 2020
craig bot pushed a commit that referenced this issue Feb 26, 2020
45372: sql: ensure that Upsert will never update the same row twice r=andy-kimball a=andy-kimball

Previously, there were cases where an UPSERT or INSERT..ON CONFLICT
statement could operate on the same row more than once. This is a
problem because these statements are designed to read once to get
existing rows, and then to insert or update based on that information.
They don't expect that previous writes for the same statement will
affect the correctness of subsequent writes. This can lead to index
corruption in cases where a row "moves" from one location to another
when one of its index keys changes. See issue #44466 for an example.

This PR fixes the problem by introducing a new variation of the
DistinctOn operator that ensures that the input to the Upsert
operator never has duplicates. It differs from the regular DistinctOn
operator in two ways:

  1. Null behavior: UpsertDistinctOn treats NULL values as not equal
     to one another for purposes of grouping. Two rows having a NULL-
     valued grouping column will be placed in different groups. This
     differs from DistinctOn behavior, where the two rows would be
     grouped together. This behavior difference reflects SQL semantics,
     in which a unique index key still allows multiple NULL values.

  2. Duplicate behavior: UpsertDistinctOn raises an error if any
     distinct grouping contains more than one row. It has "input must
     be distinct" semantics rather than "make the input distinct"
     semantics. This is used to ensure that no row will be updated
     more than once.

The optbuilder now wraps the input to the Upsert operator with this
new UpsertDistinctOn operator. In addition, there are several commits
that add optimization rules designed to remove these distinct operators
when they can be proven to not be necessary.

Fixes #44466


Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com>
@craig craig bot closed this as completed in 166f98e Feb 26, 2020
SQL Features (Deprecated - use SQL Experience board) automation moved this from Next up to Done Feb 26, 2020
bobvawter added a commit to cockroachdb/replicator that referenced this issue May 31, 2022
This change works around a CockroachDB implementation quirk wherein an UPSERT
is unable to read its own writes, which prevents the same row from being
upserted twice within the same statement. This also corrects a potential source
of data inconsistency in the applier code, if a single call to Apply contains
both upserts and deletes of the same row.

Both of the above issues can be addressed by sorting the mutations according to
their HLC time and then de-duplicating them by primary key.

The newly-added test in the apply package checks for the specific upsert
behavior, so that this workaround can be reconsidered if a future version of
CockroachDB allows duplicate rows within an UPSERT or CTE.

X-Ref: cockroachdb/cockroach#44466
X-Ref: cockroachdb/cockroach#70731
X-Ref: cockroachdb/cockroach#45372
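The workaround above can be sketched hypothetically (invented names; not the actual replicator code): sort the batch by HLC timestamp, then keep only the newest mutation per primary key, so no row is upserted twice in one statement.

```python
# Sketch: de-duplicate a batch of mutations by primary key, keeping the
# mutation with the latest HLC timestamp for each key.
from collections import namedtuple

Mutation = namedtuple("Mutation", ["hlc", "pk", "payload"])

def dedupe_mutations(muts):
    latest = {}
    for m in sorted(muts, key=lambda m: m.hlc):  # oldest first
        latest[m.pk] = m                         # a newer HLC overwrites an older one
    return sorted(latest.values(), key=lambda m: m.hlc)

batch = [Mutation(3, "a", "v3"), Mutation(1, "a", "v1"), Mutation(2, "b", "v2")]
print([m.payload for m in dedupe_mutations(batch)])  # ['v2', 'v3']
```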
bobvawter added further commits to cockroachdb/replicator that referenced this issue with the same commit message on May 31 and Jun 7, 2022
8 participants