roachtest: acceptance/version-upgrade failed #47024

Closed
cockroach-teamcity opened this issue Apr 4, 2020 · 11 comments · Fixed by #49662

@cockroach-teamcity
Member

(roachtest).acceptance/version-upgrade failed on release-19.2@ee759892738f7f203ff95ec7627b90d7c47b4350:

The test failed on branch=release-19.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200404-1851713/acceptance/version-upgrade/run_1
	upgrade.go:457,upgrade.go:536,upgrade.go:381,upgrade.go:349,acceptance.go:84,test_runner.go:753: dial tcp 34.71.161.233:26257: connect: connection refused

	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1851713-1585981228-10-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 3: dead
		4: 4137
		2: 4114
		1: 4586
		Error: UNCLASSIFIED_PROBLEM:
		  - 3: dead
		    main.glob..func13
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		    main.wrap.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		    github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		    github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		    github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		    main.main
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1793
		    runtime.main
		    	/usr/local/go/src/runtime/proc.go:203
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357

Artifacts: /acceptance/version-upgrade
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-release-19.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 4, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.1 milestone Apr 4, 2020
@cockroach-teamcity
Member Author

(roachtest).acceptance/version-upgrade failed on release-19.2@07fd3bde66232e2e65ea6c9d7c3e8edf44442646:

The test failed on branch=release-19.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20200407-1857147/acceptance/version-upgrade/run_1
	upgrade.go:457,upgrade.go:536,upgrade.go:381,upgrade.go:349,acceptance.go:84,test_runner.go:753: dial tcp 35.188.82.118:26257: connect: connection refused

	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1857147-1586240558-14-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 1: 4534
		4: dead
		2: 4264
		3: 4404
		Error: UNCLASSIFIED_PROBLEM:
		  - 4: dead
		    main.glob..func13
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		    main.wrap.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		    github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		    github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		    github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		    main.main
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1793
		    runtime.main
		    	/usr/local/go/src/runtime/proc.go:203
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357

Artifacts: /acceptance/version-upgrade
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@tbg
Member

tbg commented Apr 8, 2020

cc @jordanlewis and/or @asubiotto: this looks like a DistSQL-turned-migration-framework issue. It is the same problem as #44732 (comment), but with a worse outcome:

The cluster starts out at 19.1, then rolls into 19.2, then rolls one node into the HEAD (=release-19.2) binary and then we see the crash. Full log at https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_WorkloadNightly/1857147:id/acceptance/version-upgrade/run_1/test.log

  - no inbound stream connection
    github.com/cockroachdb/cockroach/pkg/sql/flowinfra.init.ializers
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow_registry.go:30
    runtime.main
    	/usr/local/go/src/runtime/proc.go:188
    runtime.goexit
    	/usr/local/go/src/runtime/asm_amd64.s:1337
failed to run migration "remove public permissions on system.comments"

@tbg tbg added this to Triage in BACKLOG, NO NEW ISSUES: SQL Optimizer via automation Apr 8, 2020
@tbg tbg added this to Triage in BACKLOG, NO NEW ISSUES: SQL Execution via automation Apr 8, 2020
@tbg tbg assigned asubiotto and unassigned andreimatei Apr 8, 2020
@asubiotto
Contributor

From node 1:

I200407 06:26:49.966649 5815 sql/flowinfra/outbox.go:230  [n1] outbox: connection dial error: initial connection heartbeat failed:
  - rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.128.0.162:26257: connect: connection refused"
failed to connect to n4 at 10.128.0.162:26257

The "no inbound stream connection" error is a symptom of n4 planning a migration query and telling n1 to send data back, while n1 for some reason can't connect to n4. I'm not sure why. I assume it's not a circuit breaker issue, since I would expect the connection error to mention that. I don't think this is an execution issue.

@andreimatei
Contributor

Your sentence triggered a cache hit for #44102
Is that relevant?

@andreimatei
Contributor

And generally #44101

@asubiotto
Contributor

#44102 makes the fact that we have an outbox/inbox pair unexpected since those only occur in a distributed flow. It looks like the actual query that failed was trying to read some users when running depublicizeSystemComments based on these log messages on n4:

E200407 06:26:59.970729 684 sql/flowinfra/flow_registry.go:234  [n4,intExec=read-users] flow id:27e602e5-80a2-4fa1-9c73-33dff849e2e1 : 1 inbound streams timed out after 10s; propagated error throughout flow
F200407 06:26:59.973520 15 server/server.go:1592  [n4] error with attached stack trace:
    github.com/cockroachdb/cockroach/pkg/sql.(*internalExecutorImpl).execInternal.func1
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:472
    github.com/cockroachdb/cockroach/pkg/sql.(*internalExecutorImpl).execInternal
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:569
    github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).Exec
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:301
    github.com/cockroachdb/cockroach/pkg/sqlmigrations.depublicizeSystemComments
        /go/src/github.com/cockroachdb/cockroach/pkg/sqlmigrations/migrations.go:910
    github.com/cockroachdb/cockroach/pkg/sqlmigrations.(*Manager).EnsureMigrations
        /go/src/github.com/cockroachdb/cockroach/pkg/sqlmigrations/migrations.go:552
    github.com/cockroachdb/cockroach/pkg/server.(*Server).Start
        /go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1586

Looking at the code (GetAllRoles), it looks like that method uses a different executor, so I think #44102 doesn't do enough in this case. We should probably disable all distributed queries until we implement a workaround for #44101.

tbg added a commit to tbg/cockroach that referenced this issue Apr 20, 2020
Remove a workaround that wasn't necessary any more (since I regenerated
the fixtures last week) but which caused flakes of its own because it
was re-uploading the binary for predecessorVersion while processes were
running using that binary (resulting in occasional 'text file busy' on
linux).

Closes cockroachdb#44732.
Touches cockroachdb#47024.

Release note: None
@cockroach-teamcity
Member Author

(roachtest).acceptance/version-upgrade failed on release-19.2@3b556da231bb5ac15ac6ff678d61cb1ba516e6f7:

The test failed on branch=release-19.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/version-upgrade/run_1
	cluster.go:1766,versionupgrade.go:225,versionupgrade.go:304,versionupgrade.go:166,versionupgrade.go:155,acceptance.go:55,acceptance.go:91,test_runner.go:753: cluster.Put: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod put teamcity-1888224-1587449532-07-n4cpu4:1-4 /home/agent/temp/buildTmp/cockroach-v19.1.5.linux-amd64 ./cockroach-19.1.5 returned: exit status 1
		(1) attached stack trace
		  | main.(*cluster).PutE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1781
		  | main.(*cluster).Put
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1765
		  | main.(*versionUpgradeTest).uploadVersion
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/versionupgrade.go:225
		  | main.binaryUpgradeStep.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/versionupgrade.go:304
		  | main.(*versionUpgradeTest).run
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/versionupgrade.go:166
		  | main.runVersionUpgrade
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/versionupgrade.go:155
		  | main.registerAcceptance.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/acceptance.go:55
		  | main.registerAcceptance.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/acceptance.go:91
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) cluster.Put
		Wraps: (3) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod put teamcity-1888224-1587449532-07-n4cpu4:1-4 /home/agent/temp/buildTmp/cockroach-v19.1.5.linux-amd64 ./cockroach-19.1.5 returned
		  | stderr:
		  |
		  | stdout:
		  | teamcity-1888224-1587449532-07-n4cpu4: putting (dist) /home/agent/temp/buildTmp/cockroach-v19.1.5.linux-amd64 ./cockroach-19.1.5
		  | ...
		  |    1: ~ scp -r -C -o StrictHostKeyChecking=no -i /root/.ssh/id_rsa -i /root/.ssh/google_compute_engine /home/agent/temp/buildTmp/cockroach-v19.1.5.linux-amd64 ubuntu@35.226.52.127:./cockroach-19.1.5
		  | scp: ./cockroach-19.1.5: Text file busy
		  | : exit status 1
		  |    2: ~ scp -r -C -o StrictHostKeyChecking=no -i /root/.ssh/id_rsa -i /root/.ssh/google_compute_engine /home/agent/temp/buildTmp/cockroach-v19.1.5.linux-amd64 ubuntu@34.69.65.241:./cockroach-19.1.5
		  | scp: ./cockroach-19.1.5: Text file busy
		  | : exit status 1
		  |    3: ~ scp -r -C -o StrictHostKeyChecking=no -i /root/.ssh/id_rsa -i /root/.ssh/google_compute_engine /home/agent/temp/buildTmp/cockroach-v19.1.5.linux-amd64 ubuntu@35.188.88.165:./cockroach-19.1.5
		  | scp: ./cockroach-19.1.5: Text file busy
		  | : exit status 1
		  |    4: ~ scp -r -C -o StrictHostKeyChecking=no -i /root/.ssh/id_rsa -i /root/.ssh/google_compute_engine /home/agent/temp/buildTmp/cockroach-v19.1.5.linux-amd64 ubuntu@35.232.136.100:./cockroach-19.1.5
		  | scp: ./cockroach-19.1.5: Text file busy
		  | : exit status 1
		  | I200421 06:14:40.076860 1 cluster_synced.go:1178  put /home/agent/temp/buildTmp/cockroach-v19.1.5.linux-amd64 failed
		Wraps: (4) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withMessage (3) *main.withCommandDetails (4) *exec.ExitError

Artifacts: /acceptance/version-upgrade
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

craig bot pushed a commit that referenced this issue Apr 21, 2020
47690: roachtest: deflake acceptance/version-upgrade r=spaskob a=tbg

Remove a workaround that wasn't necessary any more (since I regenerated
the fixtures last week) but which caused flakes of its own because it
was re-uploading the binary for predecessorVersion while processes were
running using that binary (resulting in occasional 'text file busy' on
linux).

Closes #44732.
Touches #47024.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@asubiotto asubiotto moved this from Triage to [ROWEXEC BACKLOG] Bugs/Test Failures in BACKLOG, NO NEW ISSUES: SQL Execution Apr 27, 2020
@petermattis
Collaborator

@cockroach-teamcity
Member Author

(roachtest).acceptance/version-upgrade failed on release-19.2@b3ccc2171c393e7c43015d090a8dacdbec499e8c:

		  |  1701.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 cmt_err
		  |  1701.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 err
		  |  1701.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 ok
		  |  1702.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 cmt_err
		  |  1702.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 err
		  |  1702.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 ok
		  |  1703.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 cmt_err
		  |  1703.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 err
		  |  1703.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 ok
		  |  1704.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 cmt_err
		  |  1704.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 err
		  |  1704.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 ok
		  |  1705.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 cmt_err
		  |  1705.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 err
		  |  1705.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 ok
		  |  1706.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 cmt_err
		  |  1706.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 err
		  |  1706.0s        0            0.0            0.0      0.0      0.0      0.0      0.0 ok
		Wraps: (5) secondary error attachment
		  | signal: killed
		  | (1) signal: killed
		  | Error types: (1) *exec.ExitError
		Wraps: (6) context canceled
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *secondary.withSecondaryError (6) *errors.errorString

	cluster.go:1481,context.go:135,cluster.go:1470,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1962586-1590216092-16-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 3: 4810
		4: 4568
		2: dead
		1: dead
		Error: UNCLASSIFIED_PROBLEM: 2: dead, 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 2: dead, 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errors.errorString

Artifacts: /acceptance/version-upgrade
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@tbg
Member

tbg commented May 27, 2020

@spaskob it looks like the test gets stuck in

06:44:54 mixed_version_schemachange.go:54: Workload step run: 2

runCmd := []string{
	"./workload run",
	fmt.Sprintf("schemachange --concurrency 2 --max-ops %d --verbose=1", maxOps),
	fmt.Sprintf("{pgurl:1-%d}", u.c.spec.NodeCount),
}

I see this from the log output and also from the stack trace

https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_WorkloadNightly/1962586:id/acceptance/version-upgrade/run_1/__stacks.log

Could you take a look?

@tbg tbg added this to Triage in SQL Foundations via automation May 27, 2020
spaskob pushed a commit to spaskob/cockroach that referenced this issue May 28, 2020
…eleases

Fixes cockroachdb#47024.

Release note (bug fix):
The schema change workload is meant for testing the behavior of schema
changes on clusters with nodes with min version 19.0. It will deadlock
on earlier versions.
craig bot pushed a commit that referenced this issue May 28, 2020
49475: opt: create library that determines how joins affect input rows r=andy-kimball a=DrewKimball

Previously, there was no simple way to determine whether all rows from
a join input will be included in its output, nor whether input rows will
be duplicated by the join.

This patch adds a library that constructs a Multiplicity struct for join
operators. The Multiplicity can be queried for information about how a
join will affect its input rows (e.g. duplicated, filtered and/or
null-extended). The existing SimplifyLeftJoinWithFilters rule has been
refactored to use this library. The Multiplicity library will also be
useful for future join elimination and limit pushdown rules.

Release note: None

49662: roachtest: don't run schema change workload on 19.2 releases r=spaskob a=spaskob

Fixes #47024.

Release note (bug fix):
The schema change workload is meant for testing the behavior of schema
changes on clusters with nodes with min version 19.2. It will deadlock
on earlier versions.

Co-authored-by: Drew Kimball <andrewekimball@gmail.com>
Co-authored-by: Spas Bojanov <spas@cockroachlabs.com>
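
The release note for 49662 describes gating the schema change workload on a minimum cluster version. A minimal Go sketch of such a gate follows. It is not the code from the PR: `parseMajorMinor` and `atLeast` are hypothetical helpers, and the "19.2" floor used in `main` is only copied from the wording of the release note.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMajorMinor (hypothetical) extracts the major and minor numbers from
// strings such as "v19.1.5" or "19.2". Anything after the minor is ignored.
func parseMajorMinor(v string) (major, minor int, err error) {
	parts := strings.SplitN(strings.TrimPrefix(v, "v"), ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("malformed version %q", v)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// atLeast (hypothetical) reports whether version v is at or above floor,
// comparing only the major and minor components.
func atLeast(v, floor string) (bool, error) {
	vMajor, vMinor, err := parseMajorMinor(v)
	if err != nil {
		return false, err
	}
	fMajor, fMinor, err := parseMajorMinor(floor)
	if err != nil {
		return false, err
	}
	if vMajor != fMajor {
		return vMajor > fMajor, nil
	}
	return vMinor >= fMinor, nil
}

func main() {
	// The floor is the caller's choice; "19.2" is illustrative only.
	const floor = "19.2"
	for _, v := range []string{"v19.1.5", "v19.2.6", "v20.1.0"} {
		ok, err := atLeast(v, floor)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: run schema change workload = %t\n", v, ok)
	}
}
```

In a mixed-version test the check would presumably apply to the lowest version present in the cluster, since a single old node is enough to trigger the problem.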
@craig craig bot closed this as completed in 21d66b4 May 29, 2020
SQL Foundations automation moved this from Triage to Done May 29, 2020
jbowens pushed a commit to jbowens/cockroach that referenced this issue Jun 1, 2020
…eleases

Fixes cockroachdb#47024.

Release note (bug fix):
The schema change workload is meant for testing the behavior of schema
changes on clusters with nodes with min version 19.2. It will deadlock
on earlier versions.