roachtest: disk-stalled/wal-failover/among-stores failed #124977

Closed
cockroach-teamcity opened this issue Jun 3, 2024 · 3 comments · Fixed by #125707
Labels
branch-release-24.1.1-rc Used to mark GA and release blockers and technical advisories for 24.1.1-rc C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team


cockroach-teamcity commented Jun 3, 2024

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1.1-rc @ 7d95120d7ad6f1e3f1b0f1c997e1ce0eaada24f9:

(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.359946586s at 2024-06-03T07:58:00Z
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-39180

cockroach-teamcity added the branch-release-24.1.1-rc, C-test-failure, O-roachtest, O-robot, release-blocker, and T-storage labels Jun 3, 2024
cockroach-teamcity added this to the 24.1 milestone Jun 3, 2024

jbowens commented Jun 4, 2024

rc branch version of #124399

jbowens removed the release-blocker label Jun 4, 2024
jbowens moved this from Incoming to Tests (failures, skipped, flakes) in (Deprecated) Storage Jun 4, 2024
sumeerbhola commented

The first disk stall is at

08:07:52 cluster.go:2418: running cmd `sudo dmsetup suspend --nofl...` on nodes [:1]; details in run_080752.752020842_n1_sudo-dmsetup-suspend.log
08:08:23 cluster.go:2418: running cmd `sudo dmsetup resume data1` on nodes [:1]; details in run_080823.401193971_n1_sudo-dmsetup-resume-.log

But the failure is from around the time the workload started, well before the stall.

07:57:52 disk_stall.go:102: test status: starting workload
07:57:52 disk_stall.go:115: test status: pausing 10m0s before simulated disk stall on n1
<snip>
09:02:55 test_impl.go:414: test failure #1: full stack retained in failure_1.log: (disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.359946586s at 2024-06-03T07:58:00Z

We should change this test to ignore samples for a couple of minutes after starting the workload.

cockroach-teamcity commented

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1.1-rc @ b7a5b158354408939cb3d680aca4305c91b415af:

(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.004392872s at 2024-06-10T09:19:00Z
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=2

itsbilal self-assigned this Jun 14, 2024
craig bot pushed a commit that referenced this issue Jun 15, 2024
125707: roachtest: ignore workload for 5 mins after start in wal failover r=RaduBerinde a=itsbilal

Previously, we'd look at p99 latencies for the workload since its very start, in the disk-stall/wal-failover roachtest. This was relatively ambitious as the workload is a high-concurrency kv workload with no ramping period at the start, so the chance of high p99 latency even under normal performance is high.

This change ignores the workload's metrics from the first 5 mins of the workload (as opposed to just the first minute), and explicitly adds a 1min ramp period to the workload where concurrency is gradually increased.

Fixes #124977.

Epic: none

Release note: None

Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
craig bot closed this as completed in bc1504a Jun 15, 2024