
SHOW BACKUP perf degrades severely when the cluster has a relatively large number of tables, even when a LIMIT is applied to the result set #88376

Closed
joshimhoff opened this issue Sep 21, 2022 · 4 comments · Fixed by #88293
Assignees: adityamaru
Labels: A-disaster-recovery, C-bug, O-sre, T-disaster-recovery

Comments

joshimhoff (Collaborator) commented Sep 21, 2022

Describe the problem

SHOW BACKUP performance degrades severely when the cluster has a relatively large number of tables, even when a LIMIT is applied to the result set.

To Reproduce

This can be reproduced locally using nodelocal storage as follows.

First, create a large number of tables by running something like this (I like fish shell lol):

$ for i in (seq 0 2500); echo $i; ./cockroachshort sql --insecure -e 'CREATE TABLE users'$i' (id UUID PRIMARY KEY, city STRING, name STRING);'; end

Then get a SQL shell and run:

root@localhost:26257/defaultdb> select count(*) from crdb_internal.tables;
  count
---------
   2261
(1 row)


Time: 96ms total (execution 95ms / network 1ms)

root@localhost:26257/defaultdb> BACKUP INTO 'nodelocal://1/test' AS OF SYSTEM TIME '-10s';
        job_id       |  status   | fraction_completed | rows | index_entries | bytes
---------------------+-----------+--------------------+------+---------------+--------
  798609029606146049 | succeeded |                  1 |   22 |             6 |  1489
(1 row)


Time: 2.955s total (execution 2.955s / network 0.000s)

root@localhost:26257/defaultdb> SELECT * FROM [SHOW BACKUP FROM LATEST IN 'nodelocal://1/test'] LIMIT 1;
  database_name | parent_schema_name | object_name | object_type | backup_type | start_time |          end_time          | size_bytes | rows | is_full_cluster
----------------+--------------------+-------------+-------------+-------------+------------+----------------------------+------------+------+------------------
  NULL          | NULL               | system      | database    | full        | NULL       | 2022-09-21 18:55:24.439891 |       NULL | NULL |        t
(1 row)


Time: 35.591s total (execution 35.591s / network 0.000s)

Note that I am running a custom build from @adityamaru:

$ ./cockroachshort version
Build Tag:        v22.2.0-alpha.3-393-gb911eecad4
Build Time:       
Distribution:     CCL
Platform:         darwin amd64 (x86_64-apple-darwin21.6.0)
Go Version:       go1.19
C Compiler:       Apple LLVM 14.0.0 (clang-1400.0.29.102)
Build Commit ID:  b911eecad4393e7af3ccba8cd0478377202ebd15
Build Type:       development
$ git log
commit b911eecad4393e7af3ccba8cd0478377202ebd15 (HEAD, aditya/speedup-show-backup)
Author: adityamaru <adityamaru@gmail.com>
Date:   Tue Sep 20 16:28:17 2022 -0400

    wip

Expected behavior

  • Faster response times in general!
  • Would be nice if limiting the result set reduced latency.

Additional context
Here is an internal link to a ticket involving a CC cluster with 10K tables, where SHOW JOBS is very slow: https://cockroachlabs.atlassian.net/browse/SREOPS-5221

Jira issue: CRDB-19792

joshimhoff added the C-bug, A-disaster-recovery, O-sre, and T-disaster-recovery labels on Sep 21, 2022
blathers-crl bot commented Sep 21, 2022

cc @cockroachdb/disaster-recovery

adityamaru self-assigned this on Sep 21, 2022

adityamaru (Contributor) commented

Thanks for the reproduction; the traces in #88293 tell a story!

The SHOW BACKUP on a cluster with 2.5K empty tables takes ~34s:

     0.000ms      0.000ms    [flow: {count: 1, duration 34.6s}]
     0.000ms      0.000ms    [columnarizer: {count: 1, duration 34.6s}]
     0.000ms      0.000ms    [batch flow coordinator: {count: 1, duration 34.6s}]
     0.000ms      0.000ms    [plugin: {count: 1, duration 34.6s}]
     0.000ms      0.000ms    [backupccl.showBackupPlanHook: {count: 1, duration 34.6s}]
     0.000ms      0.000ms    [SHOW BACKUP: {count: 1, duration 34.6s}]
     0.000ms      0.000ms    [backupccl.showBackup: {count: 1, duration 34.6s}]
     0.000ms      0.000ms    [backupccl.backupShowerDefault.fn: {count: 1, duration 34.5s}]
     0.000ms      0.000ms    [backupccl.backupShowerDefault.fn.0: {count: 1, duration 34.5s}]
     0.000ms      0.000ms    [backupccl.backupShowerDefault.fn.secondDescriptorLoop: {count: 1, duration 34.5s}]
     0.000ms      0.000ms    [sql.ShowCreate: {count: 2516, duration 34.5s}]
     0.000ms      0.000ms    [sql.newInternalLookupCtxFromDescriptorProtos: {count: 2515, duration 34.4s}]

As you can see, the secondDescriptorLoop, which is the span covering this iteration (https://github.com/cockroachdb/cockroach/blob/b911eecad4393e7af3ccba8cd0478377202ebd15/pkg/ccl/backupccl/show.go#L721-L827), accounts for nearly all of the time, and within it the call to ShowCreate dominates. That call constructs the SQL representation of each backed-up table's schema, but we only display that schema when running SHOW BACKUP SCHEMAS; in every other flavour of SHOW BACKUP it is a wasted cost because the result is discarded. Making the call to ShowCreate conditional brings SHOW BACKUP on 2.5K tables down to ~200ms:

     0.000ms      0.000ms    [flow: {count: 1, duration 109ms}]
     0.000ms      0.000ms    [columnarizer: {count: 1, duration 109ms}]
     0.000ms      0.000ms    [batch flow coordinator: {count: 1, duration 108ms}]
     0.000ms      0.000ms    [plugin: {count: 1, duration 108ms}]
     0.000ms      0.000ms    [backupccl.showBackupPlanHook: {count: 1, duration 108ms}]
     0.000ms      0.000ms    [SHOW BACKUP: {count: 1, duration 108ms}]
     0.000ms      0.000ms    [backupccl.showBackup: {count: 1, duration 76ms}]
     0.000ms      0.000ms    [backupdest.ResolveBackupManifests: {count: 1, duration 31ms}]
     0.000ms      0.000ms    [backupdest.FindPriorBackups: {count: 3, duration 21ms}]
     0.000ms      0.000ms    [backupccl.backupShowerDefault.fn: {count: 1, duration 21ms}]
     0.000ms      0.000ms    [backupccl.backupShowerDefault.fn.0: {count: 1, duration 21ms}]
     0.000ms      0.000ms    [backupccl.backupShowerDefault.fn.secondDescriptorLoop: {count: 1, duration 21ms}]
     0.000ms      0.000ms    [backupinfo.ReadBackupManifestFromStore: {count: 1, duration 10ms}]
     0.000ms      0.000ms    [backupinfo.ReadBackupManifest: {count: 1, duration 10ms}]
     0.000ms      0.000ms    [backupinfo.readManifest: {count: 1, duration 10ms}]
     0.000ms      0.000ms    [backupdest.ResolveIncrementalsBackupLocation: {count: 1, duration 466µs}]
     0.000ms      0.000ms    [backupdest.backupsFromLocation: {count: 2, duration 450µs}]
     0.000ms      0.000ms    [optimizer: {count: 1, duration 313µs}]
     0.000ms      0.000ms    [backupccl.backupShowerDefault.fn.firstDescriptorLoop: {count: 1, duration 113µs}]
     0.000ms      0.000ms    [backupccl.getTableSizes: {count: 1, duration 6µs}]
     0.000ms      0.000ms    [commit sql txn: {count: 1, duration 4µs}]
     0.000ms      0.000ms    [backupccl.getLogicalSSTSize: {count: 1, duration 2µs}]
     0.000ms      0.000ms    [backupinfo.GetLocalityInfo: {count: 1, duration 1µs}]

You can see the secondDescriptorLoop is only taking 21ms now.
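
For illustration, here is a minimal, self-contained Go sketch of the shape of that fix; the names (tableDesc, expensiveShowCreate, showBackupRows) are placeholders and not the actual backupccl code. The idea is simply that the expensive per-descriptor ShowCreate work is skipped unless the SCHEMAS flavour of SHOW BACKUP will actually display its result.

```go
// Illustrative sketch only; placeholder types and functions, not the
// cockroach source.
package main

import "fmt"

type tableDesc struct{ name string }

// expensiveShowCreate stands in for sql.ShowCreate, the call that dominated
// the ~34s trace above when invoked once per backed-up descriptor.
func expensiveShowCreate(d tableDesc) string {
	return fmt.Sprintf("CREATE TABLE %s (...)", d.name)
}

// showBackupRows renders one output row per descriptor, but only pays for
// schema rendering when the SCHEMAS flavour of SHOW BACKUP will display it.
func showBackupRows(descs []tableDesc, showSchemas bool) [][]string {
	rows := make([][]string, 0, len(descs))
	for _, d := range descs {
		row := []string{d.name}
		if showSchemas {
			row = append(row, expensiveShowCreate(d))
		}
		rows = append(rows, row)
	}
	return rows
}

func main() {
	descs := []tableDesc{{name: "users0"}, {name: "users1"}}
	fmt.Println(showBackupRows(descs, false)) // plain SHOW BACKUP: ShowCreate skipped
	fmt.Println(showBackupRows(descs, true))  // SHOW BACKUP SCHEMAS: schemas rendered
}
```

This matches the description above of gating the ShowCreate call inside secondDescriptorLoop, which is what #88293 does.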

joshimhoff (Collaborator, Author) commented

Woohoo!

postamar (Contributor) commented

Is this a regression compared to 22.1? If so, I suspect my recent changes to the descs.Collection might be responsible, in which case feel free to file an issue on the schema board and cc me on it.

craig bot pushed a commit that referenced this issue Sep 23, 2022
87533: sqlliveness: add timeouts to heartbeats r=ajwerner a=aadityasondhi

Previously, sqlliveness heartbeat operations could block on the transactions involved. This change introduces timeouts, equal in length to the heartbeat, on the create and refresh operations.

Resolves #85541

Release note: None

Release justification: low-risk bugfix to existing functionality

88293: backupccl: elide expensive ShowCreate call in SHOW BACKUP r=stevendanna a=adityamaru

In #88376 we see the call to `ShowCreate` taking ~all the time on a cluster with
2.5K empty tables. In all cases except `SHOW BACKUP SCHEMAS` we do not
need to construct the SQL representation of the table's schema. This
results in a marked improvement in the performance of `SHOW BACKUP`
as can be seen in #88376 (comment).

Fixes: #88376

Release note (performance improvement): `SHOW BACKUP` on a backup containing
several table descriptors is now more performant

88471: sql/schemachanger: plumb context, check for cancelation sometimes r=ajwerner a=ajwerner

Fixes #87246

This will also improve tracing.

Release note: None

88557: testserver: add ShareMostTestingKnobsWithTenant option r=msbutler a=stevendanna

The new ShareMostTestingKnobs copies nearly all of the testing knobs specified for a TestServer to any tenant started for that server.

The goal here is to make it easier to write tests that depend on testing hooks that work under probabilistic tenant testing.

Release justification: non-production code change

Release note: None

88562: upgrade grpc to v1.49.0 r=erikgrinaker a=pavelkalinnikov

Fixes #81881
Touches #72083

Release note: upgraded grpc to v1.49.0 to fix a few panics that the old version caused

88568: sql: fix panic due to missing schema r=ajwerner a=ajwerner

A schema might not exist because it has been dropped. We need to mark the lookup as required.

Fixes #87895

Release note (bug fix): Fixed a bug in pg_catalog tables which could result in an internal error if a schema is concurrently dropped.

Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Aaditya Sondhi <aadityas@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
craig bot closed this as completed in cc68174 on Sep 23, 2022
Disaster Recovery Backlog automation moved this from Triage to Done Sep 23, 2022
adityamaru added a commit to adityamaru/cockroach that referenced this issue Sep 26, 2022
In cockroachdb#88376 we see this call taking ~all the time on a cluster with
2.5K empty tables. In all cases except `SHOW BACKUP SCHEMAS` we do not
need to construct the SQL representation of the table's schema. This
results in a marked improvement in the performance of `SHOW BACKUP`
as can be seen in cockroachdb#88376 (comment).

Fixes: cockroachdb#88376

Release note (performance improvement): `SHOW BACKUP` on a backup containing
several table descriptors is now more performant
adityamaru added a commit to adityamaru/cockroach that referenced this issue Dec 6, 2022
Backport 1/1 commits from cockroachdb#88293.

/cc @cockroachdb/release

In cockroachdb#88376 we see the call to ShowCreate taking ~all the time on a cluster with
2.5K empty tables. In all cases except SHOW BACKUP SCHEMAS we do not
need to construct the SQL representation of the table's schema. This
results in a marked improvement in the performance of SHOW BACKUP
as can be seen in cockroachdb#88376 (comment).

Fixes: cockroachdb#88376

Release note (performance improvement): SHOW BACKUP on a backup containing
several table descriptors is now more performant

Release justification: low risk performance improvement required for the use of schedules in CockroachCloud