roachtest: clearrange/zfs/checks=true failed #121935

Closed
cockroach-teamcity opened this issue Apr 8, 2024 · 4 comments
Labels: branch-master, C-test-failure, O-roachtest, O-robot, T-storage


cockroach-teamcity commented Apr 8, 2024

roachtest.clearrange/zfs/checks=true failed with artifacts on master @ 909e91dc1ea45e8223658c1144067332df7ed1ec:

(cluster.go:2347).Run: context canceled
(monitor.go:154).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/clearrange/zfs/checks=true/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=zfs
  • ROACHTEST_localSSD=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-37629

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-storage Storage Team labels Apr 8, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Apr 8, 2024
@nicktrav nicktrav moved this from Incoming to Tests (failures, skipped, flakes) in (Deprecated) Storage Apr 9, 2024

itsbilal commented Apr 9, 2024

This one is weird. n4 seems to have crashed inexplicably, causing the monitor to trip up. But the logs for n4 don't say anything descriptive; we see cockroach exited with code 10 but no panics or any fatal errors in the logs.

RaduBerinde commented

10 is "disk full":

func DiskFull() Code { return Code{10} }
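
As a rough illustration of how that exit status could be turned into a readable label when triaging a node crash, here is a minimal sketch. The `Code` struct and the `describe` helper are hypothetical stand-ins written for this example, not the actual CockroachDB definitions; only the value 10 for "disk full" comes from the comment above.

```go
package main

import "fmt"

// Code is a hypothetical stand-in for the exit-code type quoted above.
type Code struct{ code int }

// DiskFull mirrors the one-liner quoted in the comment above.
func DiskFull() Code { return Code{10} }

// describe is a hypothetical helper that labels a process exit status.
// Only the mapping 10 -> "disk full" is taken from this thread.
func describe(status int) string {
	if status == DiskFull().code {
		return "disk full"
	}
	return fmt.Sprintf("unrecognized exit code %d", status)
}

func main() {
	fmt.Println(describe(10)) // prints "disk full"
}
```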


jbowens commented Apr 10, 2024

I think n4 was likely seeing the effect of space amplification due to eventually file-only snapshots. The disks seem tiny so we don't have a ton of leeway. n4 had a large LSM relative to the other nodes, but not especially so. I think we should add a timeseries metric for the size of zombie sstables. I'll file an issue.

[Two screenshots, captured 2024-04-10 at 11:34 AM]

https://grafana.testeng.crdb.io/d/StorageAvKxELVz/storage?from=1712588792350&to=1712589191384&var-cluster=teamcity-14738640-1712555393-97-n10cpu16&orgId=1&var-instances=All

@jbowens jbowens removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Apr 10, 2024
@jbowens jbowens closed this as completed Apr 10, 2024
(Deprecated) Storage automation moved this from Tests (failures, skipped, flakes) to Done Apr 10, 2024

jbowens commented Apr 10, 2024

I'm going to close it out; we have a path forward to resolution (cockroachdb/pebble#3500) and better diagnostics for confirming the cause (#122110).

jbowens added a commit to jbowens/cockroach that referenced this issue Apr 10, 2024
Add a new timeseries metric that provides visibility into the volume of data
that exists in sstables that are not part of the most recent version of the
LSM.

Epic: none
Informs cockroachdb#121935.
Informs cockroachdb#122139.
Informs cockroachdb/pebble#3500.
Close cockroachdb#122110.
Release note (ops change): Adds a new timeseries metric
storage.sstable.zombie.bytes.
craig bot pushed a commit that referenced this issue Apr 11, 2024
122152: kvserver: add storage.sstable.zombie.bytes metric r=RaduBerinde a=jbowens

Add a new timeseries metric that provides visibility into the volume of data that exists in sstables that are not part of the most recent version of the LSM.

Epic: none
Informs #121935.
Informs #122139.
Informs cockroachdb/pebble#3500.
Close #122110.
Release note (ops change): Adds a new timeseries metric storage.sstable.zombie.bytes.

Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Apr 11, 2024