
roachtest: backup-restore/mixed-version failed #120178

Closed
cockroach-teamcity opened this issue Mar 10, 2024 · 12 comments · Fixed by #120462

cockroach-teamcity commented Mar 10, 2024

roachtest.backup-restore/mixed-version failed with artifacts on release-23.1.17-rc @ cc68ca88d04a7ca8e92beb51f3dbef6493fe175c:

(mixedversion.go:594).Run: mixed-version test failure while running step 28 (run "verify some backups"): mixed-version: error waiting for job to finish: job 950246478697365505 failed with error: importing 50 ranges: addsstable [/Table/186/1/0/0,/Table/186/1/1/0/NULL): split failed while applying backpressure to AddSSTable [/Table/186/1/0/0,/Table/186/1/1/0/NULL) on range r634:/Table/18{3/1/99-6/1/4} [(n2,s2):1, (n3,s3):5, (n1,s1):4, next=6, gen=105]: could not find valid split key
test artifacts and logs in: /artifacts/backup-restore/mixed-version/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

- #119856 roachtest: backup-restore/mixed-version failed [C-test-failure O-roachtest O-robot T-disaster-recovery branch-release-23.1 release-blocker]

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-36526

@cockroach-teamcity added the branch-release-23.1.17-rc, C-test-failure, O-roachtest, O-robot, release-blocker, and T-disaster-recovery labels on Mar 10, 2024
@cockroach-teamcity added this to the 23.1 milestone on Mar 10, 2024
@dt added this to Incoming in KV via automation on Mar 11, 2024
@blathers-crl (bot) added the T-kv label on Mar 11, 2024
@dt removed this from Backlog in Disaster Recovery Backlog on Mar 11, 2024

dt commented Mar 11, 2024

23.1: could not find valid split key

dt commented Mar 12, 2024

Looks like the replica that returned the error to AddSSTable was on n2, which was running 22.2 at the time. (This test stresses upgrades: since it was testing 23.1, the cluster started on 22.2 before upgrading, and the node that returned this error had not been upgraded yet.)

The range that returned it is in /Table/183 or so, which isn't the jobs table.

The log line "08:14:16 mixed_version_backup.go:1419: not setting any custom cluster settings (using defaults)" means this was an off-the-shelf configuration, not some randomly weird one.

dt commented Mar 12, 2024

r634 isn't looking pretty:

 "stats": {
          "contains_estimates": 0,
          "last_update_nanos": 1710062613391780670,
          "intent_age": 0,
          "gc_bytes_age": 426629484084,
          "live_bytes": 16415,
          "live_count": 1,
          "key_bytes": 98166,
          "key_count": 1,
          "val_bytes": 134125670,
          "val_count": 8180,

dt commented Mar 12, 2024

This is 22.2, which is EOL'd and is only being tested here because 23.1 needs to remain compatible with it during upgrades. We're not going to go dig into and fix 22.2 bugs, so I might suggest we just close this as unactionable? But I'll let KV weigh in too.

@nvanbenschoten

We're backpressuring an AddSSTable when the range is only 134 MB. Is this because 22.2 had a 64 MB default range_max_bytes? Assuming that's the case (I'll check), it doesn't look like anything is going wrong in KV.

The backpressure is throwing an error because there's only a single key in the range, and it has 8180 versions. @dt do you know what's in this /Table/186 and whether this is expected?
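
For readers following the mechanics here, a minimal sketch of that backpressure decision, assuming the default kv.range.backpressure_range_size_multiplier of 2.0. This is not the actual pkg/kv/kvserver code; the function and variable names are illustrative only. The point is that once the range is past the threshold, the write is held until the range can split, and a split key must lie strictly between the range bounds, so a range whose live data is a single user key has no candidate no matter how many MVCC versions that key carries.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative sketch only; not the real kvserver logic or its identifiers.
func maybeBackpressure(rangeBytes, rangeMaxBytes int64, userKeyCount int) error {
	// Assumed default of kv.range.backpressure_range_size_multiplier.
	const multiplier = 2.0
	if float64(rangeBytes) < multiplier*float64(rangeMaxBytes) {
		return nil // under the threshold: admit the write immediately
	}
	// Over the threshold: the write waits for the range to split, and a split
	// key must fall strictly between the range bounds. With a single user key
	// (however many MVCC versions it has) there is no candidate.
	if userKeyCount < 2 {
		return errors.New("split failed while applying backpressure: could not find valid split key")
	}
	return nil // a split key exists, so the split proceeds and the write is admitted
}

func main() {
	// Rough numbers from r634 above: ~134 MB of value bytes on one key, 64 MiB range_max_bytes.
	fmt.Println(maybeBackpressure(134_125_670, 67_108_864, 1))
}
```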

@nvanbenschoten

"Is this because 22.2 had a 64 MB default range_max_bytes?"

Yes, this is the case: range_max_bytes: 67108864.

dt commented Mar 13, 2024

I think we can close this as unactionable, since this is a known 22.2 behavior we won't be changing?

@nvanbenschoten

So the remaining question is whether the composition of r634 (/Table/18{3/1/99-6/1/4}) is expected:

          "key_bytes": 98166,
          "key_count": 1,
          "val_bytes": 134125670,
          "val_count": 8180,

Is the single key with many 16 KB versions surprising?
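
As a quick sanity check on those stats (my arithmetic, not from the test logs): 8180 versions × 16384 payload bytes ≈ 134,021,120 bytes, which lines up with the observed val_bytes of 134,125,670 once presumed row-encoding overhead is accounted for, so the range really does look like one 16 KB row plus its entire MVCC history.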

@nvanbenschoten

The init command is cockroach workload init bank {pgurl:1-4} --payload-bytes 16384 --ranges 0 --rows 100, so that explains the 16 KB version size.

Do we expect this to write 8180 versions to the same key?

nvanbenschoten commented Mar 13, 2024

Oh, I suspect this is because we ran the bank workload afterwards, which updates these rows and writes new versions. We see errors like:

E240310 09:23:29.618860 1 workload/cli/run.go:550  [-] 2614  pq: split failed while applying backpressure to Put [/Table/183/1/4/0,/Min), EndTxn(parallel commit) [/Table/183/1/4/0], [txn: 9904f87b] on range r690:/Table/183/1/{4-5} [(n3,s3):4, (n4,s4):2, (n1,s1):3, next=5, gen=94]: could not find valid split key

We have 16 KB rows (the maximum option in bankPossiblePayloadBytes), only 100 of them (the minimum option in bankPossibleRows), and then run a workload that continually updates these rows. I would expect us to hit this in newer versions of CRDB as well, as this will quickly fill up a range ((512MB*2) / (16KB * 3500qps / (100/2)) ≈ 15m).
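
Spelling that back-of-the-envelope estimate out as a quick sketch. The 3500 qps figure, the 512 MiB default range size, the 2x backpressure multiplier, and the reading that each bank transfer rewrites two of the 100 rows are assumptions taken from or inferred from this comment, not measured values.

```go
package main

import "fmt"

func main() {
	const (
		rangeMaxBytes  = 512 << 20         // assumed default range size on newer versions
		backpressureAt = 2 * rangeMaxBytes // assumed 2x backpressure multiplier
		payloadBytes   = 16 << 10          // --payload-bytes 16384
		qps            = 3500.0            // assumed aggregate workload throughput
		rows           = 100.0             // --rows 100
	)
	// Each transfer rewrites two rows, so any one row (and a range dominated
	// by that row's MVCC versions) accumulates bytes at roughly:
	bytesPerSecPerRow := payloadBytes * qps / (rows / 2)
	minutes := float64(backpressureAt) / bytesPerSecPerRow / 60
	fmt.Printf("~%.1f minutes until a single-row range hits the backpressure limit\n", minutes)
}
```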

Removing the release-blocker label and handing this to Disaster Recovery to determine how they want to deflake the test.

@nvanbenschoten added the T-disaster-recovery label and removed the release-blocker and T-kv labels on Mar 13, 2024
@blathers-crl (bot) added this to Backlog in Disaster Recovery Backlog on Mar 13, 2024
@nvanbenschoten removed this from Incoming in KV on Mar 13, 2024
@dt linked a pull request on Mar 14, 2024 that will close this issue

dt commented Mar 14, 2024

Oooff, that random combo of row count and size is sneaky. Thanks for the thorough investigation! I've opened #120462

@cockroach-teamcity

This comment was marked as off-topic.
