roachtest: schemachange/bulkingest failed #82778

Closed
cockroach-teamcity opened this issue Jun 11, 2022 · 9 comments
Labels
branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jun 11, 2022

roachtest.schemachange/bulkingest failed with artifacts on release-22.1 @ 3181b7faa2f3b41d6a15ab4b74d2c60bcfe5132d:

		  |   552.0s        0         1156.1         1628.4      6.6     13.1     17.8     25.2 update-payload
		  |   553.0s        0         1220.3         1627.7      5.8     14.2     21.0     35.7 update-payload
		  |   554.0s        0         1309.5         1627.1      5.8     11.5     14.7     21.0 update-payload
		  |   555.0s        0         1350.2         1626.6      5.5     10.5     14.2     18.9 update-payload
		  |   556.0s        0         1266.3         1625.9      5.8     11.5     16.3     31.5 update-payload
		  |   557.0s        0         1303.2         1625.4      5.8     11.0     15.2     22.0 update-payload
		  |   558.0s        0         1231.3         1624.7      5.8     12.6     18.9     31.5 update-payload
		  |   559.0s        0         1358.1         1624.2      5.8     10.0     13.1     28.3 update-payload
		  |   560.0s        0         1262.3         1623.5      5.8     12.1     16.3     31.5 update-payload
		  | _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  |   561.0s        0         1304.5         1623.0      5.8     11.0     16.3     24.1 update-payload
		  |   562.0s        0         1092.2         1622.0      6.0     16.3     27.3     56.6 update-payload
		  |   563.0s        0         1012.8         1620.9      6.3     18.9     30.4     41.9 update-payload
		Wraps: (4) COMMAND_PROBLEM
		Wraps: (5) Node 5. Command with error:
		  | ``````
		  | ./workload run bulkingest --duration 20m0s {pgurl:1-4} --a 100000000 --b 1 --c 1 --payload-bytes 4
		  | ``````
		Wraps: (6) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError

	monitor.go:127,schemachange.go:413,test_runner.go:883: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.makeSchemaChangeBulkIngestTest.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/schemachange.go:413
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:883
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	main/pkg/cmd/roachtest/monitor.go:80
		  | runtime.doInit
		  | 	GOROOT/src/runtime/proc.go:6498
		  | runtime.main
		  | 	GOROOT/src/runtime/proc.go:238
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/sql-schema

This test on roachdash | Improve this report!

Jira issue: CRDB-16653

@cockroach-teamcity cockroach-teamcity added branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jun 11, 2022
@blathers-crl blathers-crl bot added the T-sql-schema-deprecated Use T-sql-foundations instead label Jun 11, 2022
@aliher1911
Contributor

@Xiang-Gu is it a release blocker?

@ajwerner
Contributor

This one is interesting. We see more or less complete system unavailability for around 10s. This corresponds with a flurry of snapshots going to n1, seemingly from all of the other nodes. I wonder if this is a case of a leader in need of a snapshot? I don't have much else in the way of explanation. What's interesting is that it clearly does recover. There were rebalances, lease transfers, and snapshots around the time things went bad. We don't see any heartbeat failures.

		  |   539.0s        0          411.1         1661.4     11.5     56.6    100.7    130.0 update-payload
		  |   540.0s        0          169.9         1658.6     23.1     75.5     83.9     96.5 update-payload
		  | _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  |   541.0s        0            0.0         1655.6      0.0      0.0      0.0      0.0 update-payload
		  |   542.0s        0            0.0         1652.5      0.0      0.0      0.0      0.0 update-payload
		  |   543.0s        0            0.0         1649.5      0.0      0.0      0.0      0.0 update-payload
		  |   544.0s        0            0.0         1646.4      0.0      0.0      0.0      0.0 update-payload
		  |   545.0s        0            0.0         1643.4      0.0      0.0      0.0      0.0 update-payload
		  |   546.0s        0            0.0         1640.4      0.0      0.0      0.0      0.0 update-payload
		  |   547.0s        0            0.0         1637.4      0.0      0.0      0.0      0.0 update-payload
		  |   548.0s        0            0.0         1634.4      0.0      0.0      0.0      0.0 update-payload
		  |   549.0s        0            0.0         1631.4      0.0      0.0      0.0      0.0 update-payload
		  |   550.0s        0          824.1         1630.0      6.3     14.7     75.5  10200.5 update-payload
		  |   551.0s        0         1233.8         1629.3      5.8     11.0     21.0     79.7 update-payload
		  |   552.0s        0         1156.1         1628.4      6.6     13.1     17.8     25.2 update-payload
		  |   553.0s        0         1220.3         1627.7      5.8     14.2     21.0     35.7 update-payload

We definitely see the 10s transaction durations in SQL. The deadline being exceeded seems bad, but it's not new and not a release blocker. I think in that case we want to return a restart error; I'll track that separately. For now, this issue is mostly about the stalled range, which, if I understand correctly, comes down to a snapshot. We see a burst of snapshots and then everything recovers. That sounds like KV and nothing new. I'm removing the label.
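As background on the restart-error point: CockroachDB surfaces retryable transaction errors with SQLSTATE 40001, and a client that wraps its statements in a retry loop absorbs them instead of failing the run. Below is a minimal sketch of that pattern using cockroach-go's `crdb.ExecuteTx` helper; the table and column names are placeholders for illustration, not the actual bulkingest workload code.

```go
package main

import (
	"context"
	"database/sql"
	"log"

	"github.com/cockroachdb/cockroach-go/v2/crdb"
	_ "github.com/lib/pq"
)

// updatePayload performs one payload update inside crdb.ExecuteTx, which
// re-runs the closure whenever the database returns a retryable error
// (SQLSTATE 40001) -- the shape a "restart error" would take on the client.
func updatePayload(ctx context.Context, db *sql.DB, a int64, payload []byte) error {
	return crdb.ExecuteTx(ctx, db, nil, func(tx *sql.Tx) error {
		// Hypothetical statement; the real schema lives in the bulkingest
		// workload package.
		_, err := tx.ExecContext(ctx,
			`UPDATE bulkingest SET payload = $1 WHERE a = $2`, payload, a)
		return err
	})
}

func main() {
	ctx := context.Background()
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := updatePayload(ctx, db, 42, []byte{0xca, 0xfe}); err != nil {
		log.Fatal(err)
	}
}
```

The point is only that a retryable error is something a client loop can absorb transparently, which is presumably why a restart error is the preferable shape here.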

@ajwerner ajwerner removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jun 15, 2022
@ajwerner
Contributor

@ajwerner to find the KV issue for the unavailability. We'll let this pop back up before doing anything.

@ajwerner
Contributor

I want to say that the unavailability was #81561 and should have been fixed by #81763.

@postamar
Contributor

Let's sit on this one for a couple of weeks and check back in mid-July.

@nvanbenschoten
Member

@ajwerner if we can dig up metrics or evidence of a NotLeaseholderError redirection loop during the unavailability period, that would provide some evidence that this is related to #81561 and fixed by #81763.

@ajwerner
Contributor

We do see some evidence of the backoff loop running. We see it get incremented on two different nodes aligned with the outage.
[Screenshot, 2022-06-28: charts of NLEs and backoff loop increments during the outage window]

That's some evidence. The top chart shows NLEs and backoff loop increments. We see a total of 8 increments of that metric over the time interval of the outage. I'm not sure that's enough to call this decisive.
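For anyone retracing the metrics later: cumulative per-store counters can be read from a live node via the `crdb_internal.node_metrics` virtual table. A rough sketch is below; the LIKE filters are assumptions rather than the exact metric names behind the chart, and this only shows point-in-time totals, not the time series from the screenshot (that came from the cluster's internal timeseries).

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

// Dumps counters whose names look related to NotLeaseHolderError redirects
// and the backoff loop. The LIKE patterns are guesses; confirm the actual
// metric names on the cluster (e.g. via the /_status/vars endpoint).
func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(
		`SELECT store_id, name, value
		   FROM crdb_internal.node_metrics
		  WHERE name LIKE '%leaseholder%' OR name LIKE '%backoff%'
		  ORDER BY name`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var storeID sql.NullInt64
		var name string
		var value float64
		if err := rows.Scan(&storeID, &name, &value); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("store=%v %s=%v\n", storeID.Int64, name, value)
	}
}
```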

@postamar
Contributor

We've sat on it and we're now declaring victory. Closing.

@ajwerner
Contributor

I'm inclined to close this and give #81561 as the alibi.

@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023