Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash when deleting a row on a table with a unique index and similar PK #54629

Closed
dpkirchner opened this issue Sep 21, 2020 · 25 comments · Fixed by #58153
Closed

Crash when deleting a row on a table with a unique index and similar PK #54629

dpkirchner opened this issue Sep 21, 2020 · 25 comments · Fixed by #58153
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community X-blathers-triaged blathers was able to find an owner

Comments

@dpkirchner
Copy link

Describe the problem

Deleting a row in one of our tables causes the node to instantly crash and the row to remain. I believe this is the same issue as in #47541 although I can't prove it.

This is likely a rare circumstance due to the poor indexing (see below).

To Reproduce

  • Create a cockroachdb node (standalone) or a cluster, here's a handy docker-compose.yaml: https://gist.github.com/dpkirchner/1ab4d1e0450a2d381c762837711cbb5f
  • docker exec -it crdb-1 cockroach sql --insecure -e 'create database test'
  • docker exec -it crdb-1 cockroach sql --insecure -d test
  • Execute the following SQL
  • Execute DELETE FROM foo WHERE ref1_name = 'a' AND ref2_id = 1;
  • Witness crash (docker exits with no output or possibly ERROR: driver: bad connection)
CREATE TABLE ref1 (
  name VARCHAR(50) NOT NULL,
  CONSTRAINT "primary" PRIMARY KEY (name ASC),
  FAMILY "primary" (name)
);

CREATE TABLE ref2 (
  id INT8 NOT NULL DEFAULT unique_rowid(),
  CONSTRAINT "primary" PRIMARY KEY (id ASC),
  FAMILY "primary" (id)
);

CREATE TABLE foo (
  ref1_name VARCHAR(50) NOT NULL,
  FOREIGN KEY (ref1_name) REFERENCES ref1(name),
  ref2_id INT8 NOT NULL,
  FOREIGN KEY (ref2_id) REFERENCES ref2(id)
);

-- v The crash only occurs when *both* this index and the PK exist
CREATE UNIQUE INDEX unique_ref1_ref2 ON foo (ref1_name, ref2_id);
-- BTW: I didn't mean for this unique index to exist any more, it was a holdover from other work

ALTER TABLE foo ALTER PRIMARY KEY USING COLUMNS (ref1_name, ref2_id);
-- ^

INSERT INTO ref1 VALUES ('a');
INSERT INTO ref2 VALUES (1);
INSERT INTO foo VALUES ('a', 1);

Expected behavior
I expected the row to be deleted and for the node not to crash.

Additional data / screenshots
Log of the full panic for Google-bait:

E200921 20:00:14.902436 1152 sql/conn_executor.go:812  [n1,client=127.0.0.1:57510,hostnossl,user=root] a SQL panic has occurred while executing "DELETE FROM foo WHERE (ref1_name = 'a') AND (ref2_id = 1)": runtime error: index out of range [2] with length 2
E200921 20:00:14.902535 1152 util/log/crash_reporting.go:208  [n1,client=127.0.0.1:57510,hostnossl,user=root] a panic has occurred!
I200921 20:00:15.252503 2710 kv/kvserver/replica_consistency.go:246  [n1,consistencyChecker,s1,r4/1:/System{/tsd-tse}] triggering stats recomputation to resolve delta of {ContainsEstimates:2876 LastUpdateNanos:1600718413272249000 IntentAge:0 GCBytesAge:0 LiveBytes:-8340 LiveCount:-692 KeyBytes:-33660 KeyCount:-692 ValBytes:25320 ValCount:-692 IntentBytes:0 IntentCount:0 SysBytes:0 SysCount:0}
E200921 20:00:15.583352 1152 util/log/crash_reporting.go:533  [n1,client=127.0.0.1:57510,hostnossl,user=root] Reported as error 94db42a1eef6457db0086ca49f89ec55
panic: runtime error: index out of range [2] with length 2 [recovered]
	panic: panic while executing 1 statements: DELETE FROM _ WHERE (_ = _) AND (_ = _); caused by runtime error: index out of range [2] with length 2

goroutine 1152 [running]:
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).closeWrapper(0xc003eaf000, 0x4be8060, 0xc004b27580, 0x4108040, 0xc00400b900)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:826 +0x3f9
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn.func1(0xc003eaf000, 0x4be8060, 0xc004b27580)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:477 +0x61
panic(0x4108040, 0xc00400b900)
	/usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.findColumnValue(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/index_encoding.go:472
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.EncodeColumns(0xc003f78c28, 0x1, 0x2, 0x0, 0x0, 0x0, 0xc004936480, 0xc004e1fb40, 0x2, 0x2, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/table.go:247 +0x347
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.EncodeSecondaryIndex(0xc00380b200, 0xc0038d0000, 0xc004936480, 0xc004e1fb40, 0x2, 0x2, 0x4be8001, 0xc003f15740, 0xc0042b0900, 0x3, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/index_encoding.go:1029 +0x1a6
github.com/cockroachdb/cockroach/pkg/sql/row.(*Deleter).DeleteRow(0xc00521ac48, 0x4be8060, 0xc003f15740, 0xc003eb1600, 0xc004e1fb40, 0x2, 0x2, 0xc004000001, 0x1, 0x11)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/deleter.go:161 +0x105
github.com/cockroachdb/cockroach/pkg/sql.(*tableDeleter).row(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/tablewriter_delete.go:66
github.com/cockroachdb/cockroach/pkg/sql.(*deleteNode).processSourceRow(0xc00521ac00, 0x4be8060, 0xc003f15740, 0xc0039f6780, 0xc003eaf3f0, 0xc004e1fb40, 0x2, 0x2, 0x40b8780, 0x3f38001)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/delete.go:173 +0x8e
github.com/cockroachdb/cockroach/pkg/sql.(*deleteNode).BatchedNext(0xc00521ac00, 0x4be8060, 0xc003f15740, 0xc0039f6780, 0xc003eaf3f0, 0x3f38000, 0xc0051d3401, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/delete.go:126 +0x150
github.com/cockroachdb/cockroach/pkg/sql.(*rowCountNode).startExec(0xc004e1fa80, 0x4be8060, 0xc003f15740, 0xc0039f6780, 0xc003eaf3f0, 0xc0051d3400, 0x2d66db5)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_batch.go:159 +0xcd
github.com/cockroachdb/cockroach/pkg/sql.startExec.func2(0x4314820, 0x5, 0x4bea6a0, 0xc004e1fa80, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan.go:395 +0xb9
github.com/cockroachdb/cockroach/pkg/sql.(*planVisitor).visitInternal.func1(0xc0051d3b88, 0x4314820, 0x5, 0x4bea6a0, 0xc004e1fa80)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:144 +0x5d
github.com/cockroachdb/cockroach/pkg/sql.(*planVisitor).visitInternal(0xc0051d3b88, 0x4bea6a0, 0xc004e1fa80, 0x4314820, 0x5)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:721 +0x2a6
github.com/cockroachdb/cockroach/pkg/sql.(*planVisitor).visit(0xc0051d3b88, 0x4bea6a0, 0xc004e1fa80, 0x759c038, 0x759c038)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:111 +0x8d
github.com/cockroachdb/cockroach/pkg/sql.walkPlan(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:75
github.com/cockroachdb/cockroach/pkg/sql.startExec(0x4be8060, 0xc003f15740, 0xc0039f6780, 0xc003eaf3f0, 0x4bea6a0, 0xc004e1fa80, 0x200000000, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan.go:398 +0x1bb
github.com/cockroachdb/cockroach/pkg/sql.(*planNodeToRowSource).Start(0xc00493e000, 0x4be8060, 0xc003f15740, 0x759e9c0, 0x3cce6c0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_node_to_row_source.go:117 +0xd6
github.com/cockroachdb/cockroach/pkg/sql/execinfra.(*ProcessorBase).Run(0xc00493e000, 0x4be8060, 0xc003f15740)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfra/processorsbase.go:748 +0x52
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).Run(0xc0042b3560, 0x4be8060, 0xc003f15740, 0x44d3218, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:370 +0x247
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc00011ca00, 0xc0045085b0, 0xc0042b0510, 0xc0051d4458, 0xc000749b80, 0xc003eaf4d0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:405 +0x5f1
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRun(0xc00011ca00, 0x4be8120, 0xc0042abf80, 0xc003eaf4d0, 0xc0045085b0, 0xc0042b0510, 0x4bea6a0, 0xc004e1fa80, 0xc000749b80, 0xc003eaf1c8)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:1019 +0x225
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithDistSQLEngine(0xc003eaf000, 0x4be8120, 0xc0042abf80, 0xc003eaf3f0, 0x2, 0x7f26638adda8, 0xc004b54f80, 0xc0051d4700, 0xc00479d438, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:938 +0x3f0
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(0xc003eaf000, 0x4be8120, 0xc0042abf80, 0xc003eaf3f0, 0x7f26638adda8, 0xc004b54f80, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:831 +0x6cd
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc003eaf000, 0x4be8120, 0xc0042abf80, 0x4c00da0, 0xc004efbb30, 0xc004ca8ab3, 0x35, 0x0, 0x1, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:536 +0xb06
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc003eaf000, 0x4be8120, 0xc0042abf80, 0x4c00da0, 0xc004efbb30, 0xc004ca8ab3, 0x35, 0x0, 0x1, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:96 +0x4f6
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd(0xc003eaf000, 0x4be8060, 0xc003f15640, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1406 +0x1b77
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc003eaf000, 0x4be8060, 0xc004b27580, 0xc0003c07c0, 0x5400, 0x15000, 0xc0003c0858, 0xc003ed8c70, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1335 +0x1f2
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc000c4c580, 0x4be8060, 0xc004b27580, 0xc003eaf000, 0x5400, 0x15000, 0xc0003c0858, 0xc003ed8c70, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:479 +0x104
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1(0xc004ae9b6f, 0xc004c371a0, 0x4be8060, 0xc004b27580, 0xc003ed8c70, 0xc000c4c580, 0xc004b54900, 0x4c0bd20, 0xc004c5d260, 0xc004c3d200, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:582 +0x317
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:510 +0x17e

I'll send the logs to support@ after getting the issue id but I can also share them with whoever wants 'em, nothing proprietary here. The above repro works 100% of the time.

Environment:

  • CockroachDB version [e.g. 2.0.x]: 20.1.5
  • Server OS: [e.g. Linux/Distrib]: Docker Desktop for Mac, but it also happens on CoreOS Linux in GKE
  • Client app [e.g. cockroach sql, JDBC, ...]: cockroach sql

Additional context
What was the impact?

The node crashes causing query failures throughout the cluster.

Add any other context about the problem here.

Deleting either the PK or the unique index resolves the issue.

@blathers-crl blathers-crl bot added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Sep 21, 2020
@blathers-crl blathers-crl bot added this to Incoming in KV Sep 21, 2020
@blathers-crl blathers-crl bot added this to Triage in SQL Foundations Sep 21, 2020
@dpkirchner dpkirchner changed the title Crash when deleting a row on a table with a unique index Crash when deleting a row on a table with a unique index and similar PK Sep 21, 2020
@yuzefovich yuzefovich added this to Triage in BACKLOG, NO NEW ISSUES: SQL Execution via automation Sep 21, 2020
@yuzefovich yuzefovich removed this from Incoming in KV Sep 21, 2020
@yuzefovich yuzefovich removed this from Triage in SQL Foundations Sep 21, 2020
@cockroachdb cockroachdb deleted a comment from blathers-crl bot Sep 21, 2020
@yuzefovich
Copy link
Member

Thanks for the report and for the reproduction steps. We'll look into this.

@jordanlewis
Copy link
Member

Note that if you create the foo table with explicit primary key, this issue doesn't manifest. It needs the primary key change to happen to trigger the bug.

@dpkirchner
Copy link
Author

Also, if you drop the unique index, then drop the hidden rowid column, and then recreate the unique index, you can delete rows without crashes. I don't know the cockroachdb internals but given this:

panic: runtime error: index out of range [2] with length 2 [recovered]

I wonder if the issue is that rowid column, which would be the 3rd out of what appears -- in show create table, anyway -- to be 2 columns. Pure speculation.

@yuzefovich
Copy link
Member

I looked into this, and I believe that @dpkirchner's hypothesis is correct - the hidden rowid column from the original primary key is listed as ExtraColumnID on the updated primary key index descriptor, and that somehow messes up the deletion later.

However, I'm not sure where the problem (or the fix) lies - the comments on ExtraColumnID were written when that field would only be used on secondary indexes, but now we've made a previously secondary our primary index, so on the one hand, the index descriptor might be invalid after Alter PK operation. On the other hand, it might be the case that the optimizer should be including the hidden column into the set of the columns to be fetched when performing a delete (we roughly have table reader -> filter -> deleter, and the hidden rowid column is projected out).

I decided to try a different approach of updating the deleter itself to ignore ExtraColumnID, and it appears that the logic tests in delete pass. I'll open up a PR and see what happens. I have no confidence whatsoever in the fix though (whether it introduces a problem elsewhere or not).

@RaduBerinde
Copy link
Member

I looked into this, and I believe that @dpkirchner's hypothesis is correct - the hidden rowid column from the original primary key is listed as ExtraColumnID on the updated primary key index descriptor, and that somehow messes up the deletion later.

This sounds like the problem. ExtraColumnID should never be set for the primary index. In particular, in this case the optimizer determines that it only needs to produce a value for column c, and the deleter assumes that it also gets values for any ExtraColumns.

@RaduBerinde
Copy link
Member

On the other hand, it might be the case that the optimizer should be including the hidden column into the set of the columns to be fetched when performing a delete

The optimizer can't tell (through the catalog) whether the primary index has a column in ExtraColumnIDs or not, so it can't be expected to figure out that it needs to produce a value for that column.

Perhaps the right solution here is to always have empty ExtraColumnIDs for all unique indexes on non-nullable columns (these indexes never need extra columns in the key, and only such indexes can become PKs).

@asubiotto
Copy link
Contributor

Moving over to SQL Experience because of their experience with primary key changes and we've reached the end of what the execution team can do.

cpporhair added a commit to cpporhair/cockroach that referenced this issue Oct 18, 2020
to fix
Crash when deleting a row on a table with a unique index and similar PK cockroachdb#54629
cpporhair added a commit to cpporhair/cockroach that referenced this issue Oct 18, 2020
to fix
Crash when deleting a row on a table with a unique index and similar PK cockroachdb#54629
This was referenced Oct 18, 2020
@rafiss rafiss moved this from Triage to Shorter term backlog in SQL Sessions - Deprecated Nov 10, 2020
@ajwerner
Copy link
Contributor

@lucy-zhang this feels related to some of the things we were talking about recently related to special properties of primary indexes.

@thoszhang
Copy link
Contributor

Summarizing some discussion today: This seems to be a problem with the secondary index on c, not with the new primary index descriptor. This is a shorter repro of the crash, thanks to @yuzefovich:

CREATE TABLE t54629 (c INT NOT NULL, UNIQUE INDEX (c));
ALTER TABLE t54629 ALTER PRIMARY KEY USING COLUMNS (c);
INSERT INTO t54629 VALUES (1);
DELETE FROM t54629 WHERE c = 1;

Looking at the descriptors using

SELECT crdb_internal.pb_to_json('cockroach.sql.sqlbase.Descriptor', descriptor) FROM system.descriptor WHERE id = 't54629'::regclass::int

it's apparent that the new primary index (ID 3) on c doesn't have any ExtraColumnIDs, but the preexisting unique secondary index (ID 2) on c does have ExtraColumnIDs containing the rowid column. So it seems that we can't handle secondary indexes with the hidden rowid column if rowid is no longer the primary key, which is why we crash inside EncodeSecondaryIndex.

Note that in the implementation of primary key changes, we always generate a fresh primary index descriptor, which never has ExtraColumnIDs. The existing unique index (ID 2) on c is not modified or replaced.

@yuzefovich or anyone else, do you have any idea why we might be handling hidden columns incorrectly in the deleter specifically for secondary indexes?

As for the primary key change itself, I think the right thing to do for rowid is to make it non-hidden when it stops being part of the PK. We can make this change, but it would still be good to understand what's broken in the deleter presently, IMO.

@thoszhang
Copy link
Contributor

As for the primary key change itself, I think the right thing to do for rowid is to make it non-hidden when it stops being part of the PK. We can make this change, but it would still be good to understand what's broken in the deleter presently, IMO.

Thinking about this again, we probably shouldn't do this, since it'll break existing queries. So rowid would have to stay hidden. But the concept of a hidden column with no special constraints/properties is strange from a UI perspective. Could we just drop rowid in situations like this, when we're specifying a new PK?

@thoszhang
Copy link
Contributor

One more observation: I put

statement ok
CREATE TABLE t (c INT NOT NULL, UNIQUE INDEX (c))

statement ok
ALTER TABLE t ALTER PRIMARY KEY USING COLUMNS (c)

statement ok
INSERT INTO t(c) VALUES (1)

statement ok
DELETE FROM t WHERE c = 1

in a logic test, and found that it sometimes passes without crashing. It's only when I add SELECT pg_sleep(1) after the ALTER PRIMARY KEY statement that I can get this to reliably fail. Since we drop the old primary index asynchronously after ALTER PRIMARY KEY returns results, this indicates that the presence of the old primary index in the delete-and-write-only state must be having some effect, which is disturbing.

@solongordon
Copy link
Contributor

Could we just drop rowid in situations like this, when we're specifying a new PK?

From a UX perspective this makes sense to me. I suppose it's possible that the user could have queries which depend on rowid but it feels unlikely to me. We should also consider the fact that other objects could have dependencies on the rowid column. For instance there's nothing stopping a user from creating a secondary index which includes rowid in the key. Again, unlikely, but possible.

@yuzefovich
Copy link
Member

I think that the "alter primary key" operation does something wrong in this edge case because the comment on ExtraColumnIDs says they are used only for secondary indexes, yet - if my previous comments are correct - the new primary index will have rowid in there which goes against the comment. I think we might need to rewrite all of the indexes in this edge case, and maybe the fix is something like https://github.com/123youyouer/cockroach/pull/1/files.

@thoszhang
Copy link
Contributor

if my previous comments are correct - the new primary index will have rowid in there

I'm not seeing how this is true. When I tried setting up the conditions to reproduce the crash, with the following:

CREATE TABLE t54629 (c INT NOT NULL, UNIQUE INDEX (c));
ALTER TABLE t54629 ALTER PRIMARY KEY USING COLUMNS (c);

I see an empty ExtraColumnIDs list in the primary index. Here is the exact descriptor I get (on master): https://gist.github.com/lucy-zhang/cc4f5b8f51b047f352f2f002fdab0734

Were you seeing something different?

@yuzefovich
Copy link
Member

yuzefovich commented Dec 21, 2020

Hm, I don't remember - I was just looking at my old comment

the hidden rowid column from the original primary key is listed as ExtraColumnID on the updated primary key index descriptor, and that somehow messes up the deletion later.

It's quite possible that I wrote up my observations incorrectly.

@thoszhang
Copy link
Contributor

@yuzefovich would you or anyone else on the Execution team be able to take a look at what's going on with the deleter for secondary indexes, specifically, then?

From my end, I don't see anything invalid about the primary index descriptor after the PK change, and while the behavior of a non-PK rowid in the secondary index's ExtraColumnIDs is somewhat undefined, I don't see any obvious reason why it just wouldn't work at all. We can work on making any fixes to the descriptor as necessary on our end, but at this point I can't extract any information from following that stack trace and I think we need someone with execution knowledge.

@thoszhang
Copy link
Contributor

Could we just drop rowid in situations like this, when we're specifying a new PK?

From a UX perspective this makes sense to me. I suppose it's possible that the user could have queries which depend on rowid but it feels unlikely to me. We should also consider the fact that other objects could have dependencies on the rowid column. For instance there's nothing stopping a user from creating a secondary index which includes rowid in the key. Again, unlikely, but possible.

Thinking about this again, I'm no longer entirely convinced that it's a good idea to just drop rowid for the same reasons that you mentioned, because we might break existing queries that do use it.

One argument in favor of dropping rowid is that we do allow ADD CONSTRAINT PRIMARY KEY (without a prior DROP CONSTRAINT PRIMARY KEY in the same transaction) if the existing PK is rowid, so we do have some precedent for treating rowid as something other than a first-class column when it comes to PK changes. I don't have a strong opinion either way. We just need to decide on some reasonable behavior here, and then fix what we have.

This, by the way, is also broken and crashes for the same reason:

CREATE TABLE t54629 (c INT NOT NULL, UNIQUE INDEX (c));
ALTER TABLE t54629 ADD CONSTRAINT pk PRIMARY KEY (c);
INSERT INTO t54629 VALUES (1);
DELETE FROM t54629 WHERE c = 1;

@thoszhang
Copy link
Contributor

I continued experimenting with this. It turns out that manually creating an index on rowid after the PK change will cause the crash to no longer happen:

CREATE TABLE t54629 (c INT NOT NULL, UNIQUE INDEX (c));
ALTER TABLE t54629 ALTER PRIMARY KEY USING COLUMNS (c);
CREATE INDEX ON t54629 (rowid);
INSERT INTO t54629 VALUES (1);
DELETE FROM t54629 WHERE c = 1;

It doesn't even have to be a unique index, though that works too. I don't know if this actually "fixes" the crash, or if this is some sort of coincidental alignment of slice elements. But this all seems to suggest that the problem isn't with hidden columns, per se, but something to do with the other indexes.

One special thing that's going on here is that when we perform a PK swap and the prior PK was rowid, we do not create a replacement unique index:

func shouldCopyPrimaryKey(desc *tabledesc.Mutable, newPK *descpb.IndexDescriptor) bool {
oldPK := desc.PrimaryIndex
if !desc.HasPrimaryKey() {
return false
}
if desc.IsPrimaryIndexDefaultRowID() {
return false
}

@yuzefovich
Copy link
Member

Here is a simple diff that appears to fix the problem too:

diff --git a/pkg/sql/opt_exec_factory.go b/pkg/sql/opt_exec_factory.go
index 6ee3253455..82a94d6da8 100644
--- a/pkg/sql/opt_exec_factory.go
+++ b/pkg/sql/opt_exec_factory.go
@@ -1523,6 +1523,7 @@ func (ef *execFactory) ConstructDelete(
        // Truncate any FetchCols added by MakeUpdater. The optimizer has already
        // computed a correct set that can sometimes be smaller.
        rd.FetchCols = rd.FetchCols[:len(fetchColDescs)]
+       rd.FetchColIDtoRowIndex = row.ColIDtoRowIndexFromCols(rd.FetchCols)
 
        // Now make a delete node. We use a pool.
        del := deleteNodePool.Get().(*deleteNode)

Overall, it seems very suspicious to me that the exec factory modifies FetchCols field on the Deleter but doesn't adjust FetchColIDtoRowIndex accordingly - I believe those two structs should be kept in sync. @RaduBerinde do you know whether there is an explanation for this?

@thoszhang
Copy link
Contributor

Also, for what it's worth, this doesn't seem have anything to do with rowid specifically. It seems to be about the existence of indexes on the former PK, which still might be present in the ExtraColumnIDs of other secondary indexes. For example, this causes the same crash:

create table t (a int primary key, c int not null, unique index (c)); -- note the explicit PK
alter table t alter primary key using columns (c);
drop index t_a_key cascade; -- this was the unique index on (a) auto-created as part of the PK swap
insert into t values (1, 1);
delete from t where c = 1;

@yuzefovich
Copy link
Member

yuzefovich commented Dec 21, 2020

I am now pretty confident that #58153 is the right fix.

@thoszhang
Copy link
Contributor

Thanks Yahor. Do you mind briefly filling me in on what the bug was? We have an impending rewrite of primary key changes and I want to make sure we keep doing the right thing (even though it seems like the descriptors are presently fine as they are).

@yuzefovich
Copy link
Member

yuzefovich commented Dec 21, 2020

I believe the bug is as follows: Deleter internally keeps track of two things FetchCols (the column descriptors to fetch) and FetchColIDtoRowIndex (a mapping from columnID to some kind of ordinal), those two things need to stay in sync. MakeDeleter was written long time ago when we didn't have the optimizer, and that method made sure all columns that are part of the key of any index are included to be fetched. However, the optimizer is smart enough that it computes the set of columns to fetch itself, so in execFactory.ConstructDelete it overwrites FetchCols to be the prefix of the desired length (stripping away some columns that MakeDeleter might have included). And this is where the bug is - FetchColIDtoRowIndex is not updated accordingly. This results in the optimizer passing in a single value (for column c in our examples) whereas the deleter thinks that two values are necessary when deleting from the secondary index, so we get an out of bounds. The solution is simple - make sure that two internal things stay in sync with each other.

Unfortunately, I don't think we have a person who really knows how mutations work (I guess I might be the person with most knowledge, but my knowledge is very limited :/ ) and who is familiar with the primary key change, but I don't see anything wrong with the current approach.

@thoszhang
Copy link
Contributor

Thanks for the explanation.

At this point, I'm still not 100% convinced that it's safe to have non-PK columns in ExtraColumnIDs in the limited set of cases where we make it possible, even for secondary indexes. If we do something like this:

create table t (a int primary key, c int not null, unique index (c));
alter table t alter primary key using columns (c);

This is the table schema we get:

demo@127.0.0.1:26257/defaultdb> show create t;
  table_name |               create_statement
-------------+------------------------------------------------
  t          | CREATE TABLE public.t (
             |     a INT8 NOT NULL,
             |     c INT8 NOT NULL,
             |     CONSTRAINT "primary" PRIMARY KEY (c ASC),
             |     UNIQUE INDEX t_c_key (c ASC),
             |     UNIQUE INDEX t_a_key (a ASC),
             |     FAMILY "primary" (a, c)
             | )
(1 row)

Here, t_c_key has a in its ExtraColumnIDs, but there's nothing preventing us from removing either the unique or the non-null constraint from a, so we can't count on a being in the key to provide uniqueness. But this might still be safe, since in this case c itself is the primary key and t_c_key is entirely redundant. Dropping a will also cause t_c_key to be dropped, which is unpredictable behavior from a UI perspective, but at least avoids dangling references in ExtraColumnIDs.

This seems like an unpleasant case to have to account for, and it'd be better if we could somehow rewrite the metadata to avoid non-PK ExtraColumnIDs while not rewriting the index data itself, but I have no idea whether that's possible.

In any case, this problem is probably beyond the scope of this particular issue. I'll investigate more and open a separate issue if necessary.

@thoszhang
Copy link
Contributor

Talked to @jordanlewis about this offline and I'm persuaded now that we don't have a real problem here. The impact of the ExtraColumnIDs is that we can sometimes pointlessly write those values into the key instead of the value of the KV, for an index whose uniqueness is already guaranteed by the primary index. We could potentially turn those ExtraColumnIDs into StoreColumnIDs when swapping the indexes, which would make the UI clearer when dropping the columns, but it's not a correctness problem to leave it as is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community X-blathers-triaged blathers was able to find an owner
Projects
No open projects
8 participants