roachtest: schemachange/random-load failed [revert stuck forever] #122659

Closed
cockroach-teamcity opened this issue Apr 18, 2024 · 3 comments · Fixed by #123268
Labels:

  • branch-master: Failures on the master branch.
  • C-test-failure: Broken test (automatically or manually discovered).
  • O-roachtest
  • O-robot: Originated from a bot.
  • P-1: Issues/test failures with a fix SLA of 1 month
  • T-sql-foundations: SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Apr 18, 2024

roachtest.schemachange/random-load failed with artifacts on master @ 2d67111f0db7bec9bfed537542c60f37f8340f69:

(test_runner.go:1198).runTest: test timed out (3h0m0s)
test artifacts and logs in: /artifacts/schemachange/random-load/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/sql-foundations

This test on roachdash | Improve this report!

Jira issue: CRDB-38016

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, release-blocker, and T-sql-foundations labels Apr 18, 2024
cockroach-teamcity added this to the 24.1 milestone Apr 18, 2024
rafiss removed the release-blocker label Apr 22, 2024
@rafiss
Collaborator

rafiss commented Apr 22, 2024

It looks like these client-side goroutines were stuck waiting for a response:


goroutine 689 gp=0xc000f8c540 m=nil [IO wait, 169 minutes]:
runtime.gopark(0xc0039a5080?, 0x0?, 0x0?, 0x0?, 0xb?)
	GOROOT/src/runtime/proc.go:402 +0xce fp=0xc001484da8 sp=0xc001484d88 pc=0x46836e
runtime.netpollblock(0x4b6958?, 0x42e626?, 0x0?)
	GOROOT/src/runtime/netpoll.go:573 +0xf7 fp=0xc001484de0 sp=0xc001484da8 pc=0x461017
internal/poll.runtime_pollWait(0x7fd9ec1e37d0, 0x72)
	GOROOT/src/runtime/netpoll.go:345 +0x85 fp=0xc001484e00 sp=0xc001484de0 pc=0x49a5a5
internal/poll.(*pollDesc).wait(0xc002396580?, 0xc001533000?, 0x0)
	GOROOT/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc001484e28 sp=0xc001484e00 pc=0x4deda7
internal/poll.(*pollDesc).waitRead(...)
	GOROOT/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc002396580, {0xc001533000, 0x1000, 0x1000})
	GOROOT/src/internal/poll/fd_unix.go:164 +0x27a fp=0xc001484ec0 sp=0xc001484e28 pc=0x4e009a
net.(*netFD).Read(0xc002396580, {0xc001533000?, 0x7fd9e467f0e8?, 0xc003c88330?})
	GOROOT/src/net/fd_posix.go:55 +0x25 fp=0xc001484f08 sp=0xc001484ec0 pc=0x7660a5
net.(*conn).Read(0xc002392058, {0xc001533000?, 0xc001738fe8?, 0x4388db?})
	GOROOT/src/net/net.go:179 +0x45 fp=0xc001484f50 sp=0xc001484f08 pc=0x776d85
net.(*TCPConn).Read(0x44?, {0xc001533000?, 0xc001738fb0?, 0x4deb1c?})
	<autogenerated>:1 +0x25 fp=0xc001484f80 sp=0xc001484f50 pc=0x7892c5
crypto/tls.(*atLeastReader).Read(0xc003c88330, {0xc001533000?, 0x0?, 0xc003c88330?})
	GOROOT/src/crypto/tls/conn.go:806 +0x3b fp=0xc001484fc8 sp=0xc001484f80 pc=0x7cbb3b
bytes.(*Buffer).ReadFrom(0xc0025750b0, {0x4a0bb40, 0xc003c88330})
	GOROOT/src/bytes/buffer.go:211 +0x98 fp=0xc001485020 sp=0xc001484fc8 pc=0x4f9738
crypto/tls.(*Conn).readFromUntil(0xc002574e08, {0x4a0b8e0, 0xc002392058}, 0xc001739030?)
	GOROOT/src/crypto/tls/conn.go:828 +0xde fp=0xc001485058 sp=0xc001485020 pc=0x7cbd1e
crypto/tls.(*Conn).readRecordOrCCS(0xc002574e08, 0x0)
	GOROOT/src/crypto/tls/conn.go:626 +0x3cf fp=0xc0014852d8 sp=0xc001485058 pc=0x7c8e2f
crypto/tls.(*Conn).readRecord(...)
	GOROOT/src/crypto/tls/conn.go:588
crypto/tls.(*Conn).Read(0xc002574e08, {0xc0025cc000, 0x2000, 0x4ac46a?})
	GOROOT/src/crypto/tls/conn.go:1370 +0x156 fp=0xc001485348 sp=0xc0014852d8 pc=0x7cf6d6
github.com/jackc/pgx/v5/pgconn/internal/bgreader.(*BGReader).Read(0xc002320940, {0xc0025cc000, 0x2000, 0x2000})
	github.com/jackc/pgx/v5/pgconn/internal/bgreader/external/com_github_jackc_pgx_v5/pgconn/internal/bgreader/bgreader.go:100 +0xd7 fp=0xc0014853b8 sp=0xc001485348 pc=0xcf3a17
io.ReadAtLeast({0x4a0a6a0, 0xc002320940}, {0xc0025cc000, 0x2000, 0x2000}, 0x5)
	GOROOT/src/io/io.go:335 +0x90 fp=0xc001485400 sp=0xc0014853b8 pc=0x4d9430
github.com/jackc/pgx/v5/pgproto3.(*chunkReader).Next(0xc0008cba70, 0x5)
	github.com/jackc/pgx/v5/pgproto3/external/com_github_jackc_pgx_v5/pgproto3/chunkreader.go:80 +0x291 fp=0xc001485458 sp=0xc001485400 pc=0xcd8711
github.com/jackc/pgx/v5/pgproto3.(*Frontend).Receive(0xc002573688)
	github.com/jackc/pgx/v5/pgproto3/external/com_github_jackc_pgx_v5/pgproto3/frontend.go:220 +0x3c fp=0xc0014854e8 sp=0xc001485458 pc=0xcdfc3c
github.com/jackc/pgx/v5/pgconn.(*PgConn).peekMessage(0xc0023e0d88)
	github.com/jackc/pgx/v5/pgconn/external/com_github_jackc_pgx_v5/pgconn/pgconn.go:508 +0x138 fp=0xc001485550 sp=0xc0014854e8 pc=0xd00eb8
github.com/jackc/pgx/v5/pgconn.(*PgConn).receiveMessage(0xc0023e0d88)
	github.com/jackc/pgx/v5/pgconn/external/com_github_jackc_pgx_v5/pgconn/pgconn.go:528 +0x1c fp=0xc0014855a0 sp=0xc001485550 pc=0xd00fbc
github.com/jackc/pgx/v5/pgconn.(*MultiResultReader).receiveMessage(0xc0023e0e90)
	github.com/jackc/pgx/v5/pgconn/external/com_github_jackc_pgx_v5/pgconn/pgconn.go:1342 +0x25 fp=0xc0014855f0 sp=0xc0014855a0 pc=0xd07225
github.com/jackc/pgx/v5/pgconn.(*MultiResultReader).NextResult(0xc0023e0e90)
	github.com/jackc/pgx/v5/pgconn/external/com_github_jackc_pgx_v5/pgconn/pgconn.go:1370 +0x4b fp=0xc0014856c8 sp=0xc0014855f0 pc=0xd0766b
github.com/jackc/pgx/v5.(*Conn).execSimpleProtocol(0x2acb520?, {0x4a3a790?, 0xc0033acea0?}, {0x2fb8149?, 0x0?}, {0x0?, 0x0?, 0xc0011eeb90?})
	github.com/jackc/pgx/v5/external/com_github_jackc_pgx_v5/conn.go:509 +0xb0 fp=0xc001485720 sp=0xc0014856c8 pc=0xe28410
github.com/jackc/pgx/v5.(*Conn).exec(0xc0023c10e0, {0x4a3a790, 0xc0033acea0}, {0x2fb8149, 0x6}, {0x0?, 0x0?, 0x0?})
	github.com/jackc/pgx/v5/external/com_github_jackc_pgx_v5/conn.go:494 +0x54f fp=0xc0014857d8 sp=0xc001485720 pc=0xe281af
github.com/jackc/pgx/v5.(*Conn).Exec(0xc0023c10e0, {0x4a3a758?, 0xc000f93950?}, {0x2fb8149, 0x6}, {0x0, 0x0, 0x0})
	github.com/jackc/pgx/v5/external/com_github_jackc_pgx_v5/conn.go:414 +0x12f fp=0xc001485888 sp=0xc0014857d8 pc=0xe27b6f
github.com/jackc/pgx/v5.(*dbTx).Commit(0xc003951380, {0x4a3a758, 0xc000f93950})
	github.com/jackc/pgx/v5/external/com_github_jackc_pgx_v5/tx.go:181 +0x55 fp=0xc0014858e8 sp=0xc001485888 pc=0xe34f75
github.com/cockroachdb/cockroach/pkg/workload/schemachange.(*schemaChangeWorker).run(0xc001279730, {0x4a3a758, 0xc000f93950})
	github.com/cockroachdb/cockroach/pkg/workload/schemachange/schemachange.go:624 +0xc2e fp=0xc001485ec0 sp=0xc0014858e8 pc=0x1f2f50e
github.com/cockroachdb/cockroach/pkg/workload/schemachange.(*schemaChangeWorker).run-fm({0x4a3a758?, 0xc000f93950?})
	<autogenerated>:1 +0x33 fp=0xc001485ee8 sp=0xc001485ec0 pc=0x1f3d813
github.com/cockroachdb/cockroach/pkg/workload/cli.workerRun({0x4a3a758, 0xc000f93950}, 0xc001d7e1e0, 0xd81?, 0x0, 0xc000bfe770)
	github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:278 +0xf4 fp=0xc001485f58 sp=0xc001485ee8 pc=0x1104bf4
github.com/cockroachdb/cockroach/pkg/workload/cli.runRun.func5.1(0x4100ad2b2bdbea12?, 0x412cce97fca22d84?)
	github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:514 +0xa8 fp=0xc001485fc0 sp=0xc001485f58 pc=0x1107a08
github.com/cockroachdb/cockroach/pkg/workload/cli.runRun.func5.gowrap1()
	github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:515 +0x28 fp=0xc001485fe0 sp=0xc001485fc0 pc=0x1107928
runtime.goexit({})
	src/runtime/asm_amd64.s:1695 +0x1 fp=0xc001485fe8 sp=0xc001485fe0 pc=0x4a0341
created by github.com/cockroachdb/cockroach/pkg/workload/cli.runRun.func5 in goroutine 653
	github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:507 +0x125
goroutine 96151 gp=0xc002451c00 m=nil [IO wait]:
runtime.gopark(0x378f04a27f3eed7?, 0x0?, 0x0?, 0x0?, 0xb?)
	GOROOT/src/runtime/proc.go:402 +0xce fp=0xc000fe9268 sp=0xc000fe9248 pc=0x46836e
runtime.netpollblock(0x4b6958?, 0x42e626?, 0x0?)
	GOROOT/src/runtime/netpoll.go:573 +0xf7 fp=0xc000fe92a0 sp=0xc000fe9268 pc=0x461017
internal/poll.runtime_pollWait(0x7fd9ec01f5b8, 0x72)
	GOROOT/src/runtime/netpoll.go:345 +0x85 fp=0xc000fe92c0 sp=0xc000fe92a0 pc=0x49a5a5
internal/poll.(*pollDesc).wait(0xc0026b2480?, 0xc000659000?, 0x0)
	GOROOT/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000fe92e8 sp=0xc000fe92c0 pc=0x4deda7
internal/poll.(*pollDesc).waitRead(...)
	GOROOT/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0026b2480, {0xc000659000, 0x800, 0x800})
	GOROOT/src/internal/poll/fd_unix.go:164 +0x27a fp=0xc000fe9380 sp=0xc000fe92e8 pc=0x4e009a
net.(*netFD).Read(0xc0026b2480, {0xc000659000?, 0x7fd9e46458b0?, 0xc0044cdc80?})
	GOROOT/src/net/fd_posix.go:55 +0x25 fp=0xc000fe93c8 sp=0xc000fe9380 pc=0x7660a5
net.(*conn).Read(0xc00466a2e0, {0xc000659000?, 0xc000fe94a8?, 0x4388db?})
	GOROOT/src/net/net.go:179 +0x45 fp=0xc000fe9410 sp=0xc000fe93c8 pc=0x776d85
net.(*TCPConn).Read(0x0?, {0xc000659000?, 0xc002451c00?, 0x0?})
	<autogenerated>:1 +0x25 fp=0xc000fe9440 sp=0xc000fe9410 pc=0x7892c5
crypto/tls.(*atLeastReader).Read(0xc0044cdc80, {0xc000659000?, 0x0?, 0xc0044cdc80?})
	GOROOT/src/crypto/tls/conn.go:806 +0x3b fp=0xc000fe9488 sp=0xc000fe9440 pc=0x7cbb3b
bytes.(*Buffer).ReadFrom(0xc002512630, {0x4a0bb40, 0xc0044cdc80})
	GOROOT/src/bytes/buffer.go:211 +0x98 fp=0xc000fe94e0 sp=0xc000fe9488 pc=0x4f9738
crypto/tls.(*Conn).readFromUntil(0xc002512388, {0x4a0b8e0, 0xc00466a2e0}, 0xc000fe94f0?)
	GOROOT/src/crypto/tls/conn.go:828 +0xde fp=0xc000fe9518 sp=0xc000fe94e0 pc=0x7cbd1e
crypto/tls.(*Conn).readRecordOrCCS(0xc002512388, 0x0)
	GOROOT/src/crypto/tls/conn.go:626 +0x3cf fp=0xc000fe9798 sp=0xc000fe9518 pc=0x7c8e2f
crypto/tls.(*Conn).readRecord(...)
	GOROOT/src/crypto/tls/conn.go:588
crypto/tls.(*Conn).Read(0xc002512388, {0xc000c38000, 0x2000, 0x73?})
	GOROOT/src/crypto/tls/conn.go:1370 +0x156 fp=0xc000fe9808 sp=0xc000fe9798 pc=0x7cf6d6
github.com/jackc/pgx/v5/pgconn/internal/bgreader.(*BGReader).Read(0xc0039b9340, {0xc000c38000, 0x2000, 0x2000})
	github.com/jackc/pgx/v5/pgconn/internal/bgreader/external/com_github_jackc_pgx_v5/pgconn/internal/bgreader/bgreader.go:100 +0xd7 fp=0xc000fe9878 sp=0xc000fe9808 pc=0xcf3a17
io.ReadAtLeast({0x4a0a6a0, 0xc0039b9340}, {0xc000c38000, 0x2000, 0x2000}, 0x5)
	GOROOT/src/io/io.go:335 +0x90 fp=0xc000fe98c0 sp=0xc000fe9878 pc=0x4d9430
github.com/jackc/pgx/v5/pgproto3.(*chunkReader).Next(0xc0046e4e40, 0x5)
	github.com/jackc/pgx/v5/pgproto3/external/com_github_jackc_pgx_v5/pgproto3/chunkreader.go:80 +0x291 fp=0xc000fe9918 sp=0xc000fe98c0 pc=0xcd8711
github.com/jackc/pgx/v5/pgproto3.(*Frontend).Receive(0xc00245a008)
	github.com/jackc/pgx/v5/pgproto3/external/com_github_jackc_pgx_v5/pgproto3/frontend.go:220 +0x3c fp=0xc000fe99a8 sp=0xc000fe9918 pc=0xcdfc3c
github.com/jackc/pgx/v5/pgconn.(*PgConn).peekMessage(0xc0023e0488)
	github.com/jackc/pgx/v5/pgconn/external/com_github_jackc_pgx_v5/pgconn/pgconn.go:508 +0x138 fp=0xc000fe9a10 sp=0xc000fe99a8 pc=0xd00eb8
github.com/jackc/pgx/v5/pgconn.(*ResultReader).readUntilRowDescription(0xc0023e0510)
	github.com/jackc/pgx/v5/pgconn/external/com_github_jackc_pgx_v5/pgconn/pgconn.go:1541 +0x2d fp=0xc000fe9a28 sp=0xc000fe9a10 pc=0xd0824d
github.com/jackc/pgx/v5/pgconn.(*PgConn).execExtendedSuffix(0xc0023e0488, 0xc0023e0510)
	github.com/jackc/pgx/v5/pgconn/external/com_github_jackc_pgx_v5/pgconn/pgconn.go:1128 +0x18b fp=0xc000fe9a88 sp=0xc000fe9a28 pc=0xd04deb
github.com/jackc/pgx/v5/pgconn.(*PgConn).ExecPrepared(0xc0023e0488, {0x4a3a790?, 0xc00480d1d0?}, {0xc005124770, 0xe}, {0xc003f60618, 0x1, 0x1}, {0xc004dd4418, 0x1, ...}, ...)
	github.com/jackc/pgx/v5/pgconn/external/com_github_jackc_pgx_v5/pgconn/pgconn.go:1073 +0x165 fp=0xc000fe9b38 sp=0xc000fe9a88 pc=0xd04625
github.com/jackc/pgx/v5.(*Conn).Query(0xc0023b2000, {0x4a3a758?, 0xc000f93950?}, {0x30f4f47, 0x5f}, {0xc004cb1330, 0x1, 0x1})
	github.com/jackc/pgx/v5/external/com_github_jackc_pgx_v5/conn.go:761 +0x1449 fp=0xc000fe9d30 sp=0xc000fe9b38 pc=0xe2a649
github.com/jackc/pgx/v5.(*Conn).QueryRow(...)
	github.com/jackc/pgx/v5/external/com_github_jackc_pgx_v5/conn.go:846
github.com/jackc/pgx/v5/pgxpool.(*Conn).QueryRow(0xc002445680?, {0x4a3a758?, 0xc000f93950?}, {0x30f4f47?, 0xc0004fd808?}, {0xc004cb1330?, 0x0?, 0x0?})
	github.com/jackc/pgx/v5/pgxpool/external/com_github_jackc_pgx_v5/pgxpool/conn.go:91 +0x3c fp=0xc000fe9d80 sp=0xc000fe9d30 pc=0xe3839c
github.com/jackc/pgx/v5/pgxpool.(*Pool).QueryRow(0xc000b10920?, {0x4a3a758, 0xc000f93950}, {0x30f4f47, 0x5f}, {0xc004cb1330, 0x1, 0x1})
	github.com/jackc/pgx/v5/pgxpool/external/com_github_jackc_pgx_v5/pgxpool/pool.go:631 +0x9a fp=0xc000fe9e00 sp=0xc000fe9d80 pc=0xe3af7a
github.com/cockroachdb/cockroach/pkg/workload/schemachange.(*schemaChangeWatchDog).isConnectionActive(0xc0024920c0, {0x4a3a758, 0xc000f93950})
	github.com/cockroachdb/cockroach/pkg/workload/schemachange/watch_dog.go:59 +0xdd fp=0xc000fe9ec8 sp=0xc000fe9e00 pc=0x1f3377d
github.com/cockroachdb/cockroach/pkg/workload/schemachange.(*schemaChangeWatchDog).watchLoop(0xc0024920c0, {0x4a3a758, 0xc000f93950})
	github.com/cockroachdb/cockroach/pkg/workload/schemachange/watch_dog.go:113 +0x118 fp=0xc000fe9fb8 sp=0xc000fe9ec8 pc=0x1f33c58
github.com/cockroachdb/cockroach/pkg/workload/schemachange.(*schemaChangeWatchDog).Start.gowrap1()
	github.com/cockroachdb/cockroach/pkg/workload/schemachange/watch_dog.go:136 +0x28 fp=0xc000fe9fe0 sp=0xc000fe9fb8 pc=0x1f33fe8
runtime.goexit({})
	src/runtime/asm_amd64.s:1695 +0x1 fp=0xc000fe9fe8 sp=0xc000fe9fe0 pc=0x4a0341
created by github.com/cockroachdb/cockroach/pkg/workload/schemachange.(*schemaChangeWatchDog).Start in goroutine 689
	github.com/cockroachdb/cockroach/pkg/workload/schemachange/watch_dog.go:136 +0x111
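
When triaging a dump like the one above, it can help to pull out only the goroutines that have been parked for a long time. The following is a minimal sketch, not part of roachtest or the workload tool: `findBlocked`, `blockedGoroutine`, and `headerRE` are hypothetical names, and the pattern only covers header lines of the shape seen in this dump (a state, optionally followed by a wait time in minutes).

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// blockedGoroutine describes one goroutine header from a Go stack dump.
// (Hypothetical helper type, introduced only for this sketch.)
type blockedGoroutine struct {
	ID      int
	State   string
	Minutes int // 0 when the header carries no wait duration
}

// headerRE matches header lines such as
// "goroutine 689 gp=0xc000f8c540 m=nil [IO wait, 169 minutes]:".
var headerRE = regexp.MustCompile(`^goroutine (\d+) .*\[([^,\]]+)(?:, (\d+) minutes)?\]:$`)

// findBlocked returns the goroutines in dump whose state equals state and
// whose reported wait time is at least minMinutes.
func findBlocked(dump, state string, minMinutes int) []blockedGoroutine {
	var out []blockedGoroutine
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		m := headerRE.FindStringSubmatch(strings.TrimSpace(sc.Text()))
		if m == nil {
			continue // a stack frame line, not a goroutine header
		}
		id, _ := strconv.Atoi(m[1])
		mins := 0
		if m[3] != "" {
			mins, _ = strconv.Atoi(m[3])
		}
		if m[2] == state && mins >= minMinutes {
			out = append(out, blockedGoroutine{ID: id, State: m[2], Minutes: mins})
		}
	}
	return out
}

func main() {
	// Two header lines copied from the dump above, plus one frame line.
	dump := "goroutine 689 gp=0xc000f8c540 m=nil [IO wait, 169 minutes]:\n" +
		"runtime.gopark(0xc0039a5080?, 0x0?, 0x0?, 0x0?, 0xb?)\n" +
		"goroutine 96151 gp=0xc002451c00 m=nil [IO wait]:\n"
	for _, g := range findBlocked(dump, "IO wait", 60) {
		fmt.Printf("goroutine %d: %s for %d minutes\n", g.ID, g.State, g.Minutes)
	}
}
```

With a 60-minute threshold this reports only goroutine 689, the worker blocked in `Commit`. The watchdog goroutine 96151 has no wait duration in its header, so it is filtered out.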

rafiss changed the title from "roachtest: schemachange/random-load failed" to "roachtest: schemachange/random-load failed [revert stuck forever]" Apr 22, 2024
@rafiss
Collaborator

rafiss commented Apr 22, 2024

This is the main thing that's suspicious from the CRDB logs:

I240418 18:23:04.168720 138718 jobs/adopt.go:252 ⋮ [T1,Vsystem,n2] 508  job 961393542022889473: resuming execution
I240418 18:23:04.173197 138771 jobs/registry.go:1553 ⋮ [T1,Vsystem,n2] 509  TYPEDESC SCHEMA CHANGE job 961393542022889473: stepping through state reverting with error: type with ID 1113 does not exist
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510  job 961393542022889473: reverting execution encountered retriable error: type with ID 1112 does not exist
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +(1) candidate pg code: 42704
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +Wraps: (2) attached stack trace
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  -- stack trace:
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.ByIDGetter.Type
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/getters.go:109
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.MutableByIDGetter.Type
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/getters.go:226
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/sql.(*typeSchemaChanger).cleanupEnumValues.func1
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/sql/type_change.go:554
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/sql.(*typeSchemaChanger).cleanupEnumValues.(*InternalDB).DescsTxn.func2
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:1753
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).txn.func4
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:1850
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/kv/txn.go:1049
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/kv.runTxn
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:1089
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).TxnWithAdmissionControl
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:1052
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/kv/db.go:1027
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).txn
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:1837
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).DescsTxn
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:1751
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/sql.(*typeSchemaChanger).cleanupEnumValues
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/sql/type_change.go:604
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/sql.(*typeChangeResumer).OnFailOrCancel.func1
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/sql/type_change.go:1387
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/sql.(*typeChangeResumer).OnFailOrCancel
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/sql/type_change.go:1396
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine.func3
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1703
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1704
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:456
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resumeJob.func1
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:290
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:485
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | runtime.goexit
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +  | 	src/runtime/asm_amd64.s:1695
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +Wraps: (3) type with ID 1112 does not exist
E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +Error types: (1) *pgerror.withCandidateCode (2) *withstack.withStack (3) *errutil.leafError
E240418 18:23:04.375345 138771 jobs/adopt.go:461 ⋮ [T1,Vsystem,n2] 511  job 961393542022889473: adoption completed with error type with ID 1112 does not exist

That's logged at 18:23 on n2, which is when operations on the cluster stopped. It happens again at 18:50 and 20:25, which I think is the schema changer retrying the revert and failing in the same way each time.

@fqazi do you think we should make "type with ID 1112 does not exist" a non-retriable error?
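
One way to confirm how often a job re-enters the failing revert is to scan the node log for the first line of each "retriable error" entry. The sketch below is an illustration only: `revertRetries`, `retryEvent`, and `retryRE` are hypothetical names, and the pattern is derived solely from the log line format quoted above, not from any documented log schema.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// retryRE matches the first line of an entry such as
// "E240418 18:23:04.375227 ... job <id>: reverting execution encountered retriable error: <msg>".
// Continuation lines (those carrying a "+" after the entry counter) do not match.
var retryRE = regexp.MustCompile(
	`^E\d{6} (\d{2}:\d{2}:\d{2})\.\d+ .* job (\d+): reverting execution encountered retriable error: (.*)$`)

// retryEvent is one occurrence of a revert attempt that failed with a retriable error.
// (Hypothetical type, introduced only for this sketch.)
type retryEvent struct {
	Time  string
	JobID string
	Msg   string
}

// revertRetries returns one event per matching log entry, in log order.
func revertRetries(log string) []retryEvent {
	var events []retryEvent
	for _, line := range strings.Split(log, "\n") {
		if m := retryRE.FindStringSubmatch(line); m != nil {
			events = append(events, retryEvent{Time: m[1], JobID: m[2], Msg: m[3]})
		}
	}
	return events
}

func main() {
	// Two lines copied from the log excerpt above: the entry header and one continuation line.
	log := "E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510  job 961393542022889473: reverting execution encountered retriable error: type with ID 1112 does not exist\n" +
		"E240418 18:23:04.375227 138771 jobs/registry.go:1721 ⋮ [T1,Vsystem,n2] 510 +(1) candidate pg code: 42704\n"
	for _, e := range revertRetries(log) {
		fmt.Printf("%s job %s: %s\n", e.Time, e.JobID, e.Msg)
	}
}
```

Run against the full n2 log, a scan like this would be expected to list one event per retry, which makes a repeating revert failure easy to spot.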

@cockroach-teamcity
Member Author

roachtest.schemachange/random-load failed with artifacts on master @ 5d02bd9ff6b2bccecf6d43fc6cd647167b91f782:

(test_runner.go:1224).runTest: test timed out (3h0m0s)
test artifacts and logs in: /artifacts/schemachange/random-load/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!

fqazi added a commit to fqazi/cockroach that referenced this issue Apr 30, 2024
Previously, when rolling back a type descriptor schema change whose
descriptor had already been dropped, we would keep retrying the schema
change. This happened because of a regression in which the internal
structured error was replaced with a user-facing, pgerror-based error.
To address this, this patch properly handles the UndefinedObject pgcode
and avoids retrying during a rollback of a typedesc schema change.

Fixes: cockroachdb#122958, cockroachdb#122659
Release note (bug fix): TYPEDESC SCHEMA CHANGE jobs could end up retrying forever
if the descriptor targeted by them was already dropped.
@craig craig bot closed this as completed in be54692 Apr 30, 2024
SQL Foundations automation moved this from Triage to Done Apr 30, 2024
blathers-crl bot pushed a commit that referenced this issue Apr 30, 2024