Zebra full sync is taking at least 50% longer over the last 3 weeks #7618

Closed
github-actions bot opened this issue Sep 25, 2023 · 29 comments · Fixed by #7663
Assignees: teor2345
Labels: A-network (Area: Network protocol updates or fixes), A-state (Area: State / database changes), I-slow (Problems with performance or responsiveness), S-needs-investigation (Status: Needs further investigation)

Comments

@github-actions

github-actions bot commented Sep 25, 2023

Motivation

The scheduled full sync test timed out after 3 days. It usually takes just over 2 days.

The last time it ran successfully was 3 weeks ago.

There are two possible causes:

  • mempool spam slowing down network communication
  • we merged a PR that slowed down syncing in the last 3 weeks

Logs

The failure is a timeout:
https://github.com/ZcashFoundation/zebra/actions/runs/6274009969/job/17038728970#step:9:21231

GitHub Actions workflow CI Docker #6194 failed.

Event: schedule
Branch: main
Commit: b737ccf5709b2905c7c53ce91eea17dd753a6c03

Created by jayqi/failed-build-issue-action

github-actions bot added the S-ci-fail-auto-issue (Status: Auto opened issue for a CI failure. This label is auto-assigned.) label on Sep 25, 2023
teor2345 changed the title from "main branch CI failed: schedule in CI Docker" to "Zebra full sync is taking at least 50% longer over the last 3 weeks" on Sep 25, 2023
teor2345 added the S-needs-investigation (Status: Needs further investigation), P-Medium ⚡, I-slow (Problems with performance or responsiveness), A-network (Area: Network protocol updates or fixes), and A-state (Area: State / database changes) labels, and removed the S-ci-fail-auto-issue label, on Sep 25, 2023
teor2345 self-assigned this on Sep 25, 2023
@teor2345
Contributor

Here is a full sync on the v1.2.0 release tag:
https://github.com/ZcashFoundation/zebra/actions/runs/6305243389

If it finishes within about 2 days and 5 hours, then there's a serious performance bug in the current main branch.
If it times out or takes 3 days, then there's a performance issue on the Zcash network, likely to do with mempool spam.

@teor2345
Contributor

Here is a testnet full sync on the main branch: ZcashFoundation/zebra/actions/runs/6305352427

This was very slow and I cancelled it after 24 hours so we could run the other job. It hadn't verified any new blocks for 90 minutes.

It's hard to conclude anything at this point, because testnet could be slow, or maybe our code could be slow.

And on the v1.2.0 tag: ZcashFoundation/zebra/actions/runs/6305365448

This is running now, let's see if it's any faster. If it's faster, then it's probably our code. If it's just as slow, then it's probably testnet that's slow.

@teor2345
Contributor

> And on the v1.2.0 tag: ZcashFoundation/zebra/actions/runs/6305365448
>
> This is running now, let's see if it's any faster. If it's faster, then it's probably our code. If it's just as slow, then it's probably testnet that's slow.

This appears to be a testnet issue, so the testnet performance jobs are not going to give us any useful information:

2023-09-27T00:31:55.911968Z WARN {zebrad="6f50304" net="Test"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 51 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=9.475% current_height=Height(204000) network_upgrade=BeforeOverwinter time_since_last_state_block=51m target_block_spacing=1m 15s max_block_spacing=7m 30s is_syncer_stopped=false

https://github.com/ZcashFoundation/zebra/actions/runs/6305365448/job/17158137748#step:8:2273

@teor2345
Contributor

> Here is a full sync on the v1.2.0 release tag: ZcashFoundation/zebra/actions/runs/6305243389
>
> If it finishes within about 2 days and 5 hours, then there's a serious performance bug in the current main branch. If it times out or takes 3 days, then there's a performance issue on the Zcash network, likely to do with mempool spam.

The job is still running, but it looks like performance of v1.2.0 is about the same as it was 3 weeks ago. So it doesn't seem to be a problem with mainnet.

So it seems to be a problem that was merged in the last 3 weeks, in either:

  • the new self-hosted runner machines
  • a CI workflow change
  • a Docker change
  • a Zebra Rust test harness change
  • a Zebra production Rust code change (network, syncer, parallel block/transaction validation, state block validation, or block writes)

If we can find a short test that reproduces the problem, it will be much easier to track down. I'll also check the logs to see if they narrow it down.

@teor2345
Contributor

Block timeouts seem to be much more frequent in the failing job logs:

2023-09-25T11:46:49.547990Z WARN {zebrad="b737ccf" net="Main"}:sync:try_to_sync: zebrad::components::sync: error downloading and verifying block e=ValidationRequestError { error: Elapsed(()), height: Height(1933729), hash: block::Hash("00000000013f070db690244f3e948c34b28d1de6a6b84e181150656e55462237") }

https://github.com/ZcashFoundation/zebra/actions/runs/6274009969/job/17038728970#step:9:21125

@teor2345
Contributor

The first block timeout is around 53%, so we might be able to use the checkpoint update sync job as a quick test for this issue:

2023-09-23T22:30:32.3700397Z 2023-09-23T22:30:32.368714Z INFO zebra_state::service::finalized_state::disk_format::upgrade: checking new blocks were written in current database format running_version=25.2.2
2023-09-23T22:30:32.3702560Z 2023-09-23T22:30:32.368740Z INFO zebra_state::service::finalized_state::disk_format::upgrade: checking database format produced by new blocks in this instance is valid running_version=25.2.2
2023-09-23T22:30:33.6743143Z 2023-09-23T22:30:33.673053Z INFO zebra_state::service::finalized_state::disk_format::upgrade: database format is valid running_version=25.2.2 inital_disk_version=25.2.2
2023-09-23T22:30:56.1333929Z 2023-09-23T22:30:56.131826Z WARN {zebrad="b737ccf" net="Main"}:sync:try_to_sync: zebrad::components::sync: error downloading and verifying block e=ValidationRequestError { error: Elapsed(()), height: Height(1200833), hash: block::Hash("00000000018529b73ef7c3f8fdd1edde2e31296d88b3a974af4eea789c278eaf") }
2023-09-23T22:30:56.1336562Z 2023-09-23T22:30:56.132577Z INFO {zebrad="b737ccf" net="Main"}:sync: zebrad::components::sync: waiting to restart sync timeout=67s state_tip=Some(Height(1200827))
2023-09-23T22:31:32.1775574Z 2023-09-23T22:31:32.176102Z INFO {zebrad="b737ccf" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=53.538% current_height=Height(1200827) network_upgrade=Canopy remaining_sync_blocks=1042134 time_since_last_state_block=8m

@teor2345
Contributor

> If we can find a short test that reproduces the problem, it will be much easier to track down. I'll also check the logs to see if they narrow it down.

I wasn't able to find an existing short test that narrows it down.

The logs suggest that it's an issue in Zebra's production code, or possibly the Docker build. We just compiled v1.2.0 with the latest Rust compiler, and ran it with the latest Google Cloud machines, so we know it's not that.

So we need to make a list of PRs that are after the v1.2.0 tag, up to and including commit b737ccf5709b2905c7c53ce91eea17dd753a6c03 last week, which change:

  • Zebra's production code (network, syncer, parallel block/transaction validation, state block validation, or block writes),
  • any dependencies they use, or
  • the Docker build.

@mpguerra
Contributor

@arya2 would you mind pairing with @teor2345 on this one? :)

@teor2345
Contributor

> Here is a full sync on the v1.2.0 release tag: ZcashFoundation/zebra/actions/runs/6305243389
>
> If it finishes within about 2 days and 5 hours, then there's a serious performance bug in the current main branch. If it times out or takes 3 days, then there's a performance issue on the Zcash network, likely to do with mempool spam.

It finished in 2 days 5.5 hours. I'd expect it to be slightly slower than main because it doesn't have as many checkpoints, but it's actually much faster.

@teor2345
Contributor

teor2345 commented Sep 29, 2023

> we need to make a list of PRs that are after the v1.2.0 tag, up to and including commit b737ccf5709b2905c7c53ce91eea17dd753a6c03 last week

I've pushed two branches to track the range of the performance issue. Here is the list of all the commits; please double-check I did it right:
known-good-perf...known-poor-perf

You'll need to click on "Load more commits" to get the commits after September 20.

> we need to make a list of PRs which change:
>
>   • Zebra's production code (network, syncer, parallel block/transaction validation, state block validation, or block writes),
>   • any dependencies they use, or
>   • the Docker build.

Here are the possible PRs I found, in order from v1.2.0:

I tried to get all the possible PRs, even if it was unlikely, because missing the bug would be worse than having extras.

Please add any PRs that I've missed, and check off any PRs you think are likely to have caused issues. Also feel free to add more comments. Then we can focus on those PRs.

@teor2345
Contributor

Another possible diagnostic method is performance analysis. We could compare flamegraphs from v1.2.0 and known-poor-perf:
https://github.com/ZcashFoundation/zebra/blob/main/book/src/user/tracing.md#flamegraphs

But we'd have to reproduce the issue on a smaller block range first. So we could try comparing the sync times of:

  • 100K blocks from genesis
  • 100K blocks from the checkpoint
  • the earliest cached state to the tip (currently 30K blocks)

Then if there's a significant difference, we could do a flamegraph to see where the problem is.

@teor2345
Contributor

There was a significant change in CI performance between:

  • 5-6 September (timezone unknown)
  • 20-21 September

So that's where I'd suggest we start looking.

[Screenshot 2023-09-29 at 15:59:17: Mergify CI time stats for the batched merge queue]

https://dashboard.mergify.com/github/ZcashFoundation/repo/zebra/stats?queue=batched

@teor2345
Contributor

teor2345 commented Sep 29, 2023

From ticket #7649:

> What do you think of the steps in this comment: #7618 (comment)

Comparing the sync times for 100K blocks from genesis is a great idea, though I'm not sure how to check the sync times closer to the tip.

Find the oldest zebrad tip cached disk:
https://console.cloud.google.com/compute/images?hl=en&project=zfnd-dev-zebra&tab=images

Open a PR that hard-codes that cached disk name just above this line:

echo "Selected Disk: $CACHED_DISK_NAME"

Then when CI runs on that branch it will sync all those blocks (currently 30K). You can also manually run a CI workflow, but you probably just want to restart the update sync job.

If you want to keep that disk, open a PR that skips deleting it:
https://github.com/ZcashFoundation/zebra/actions/workflows/delete-gcp-resources.yml
I have temporarily turned off that delete workflow. (It was scheduled to run in 50 minutes time.)

@arya2
Contributor

arya2 commented Oct 3, 2023

@teor2345 It looks like it's gotten far slower near the beginning of the initial sync.

Latest, https://github.com/ZcashFoundation/zebra/actions/runs/6351284888 (X-axis is height, Y-axis is timestamp_diff / height_diff):

[graph: sync rate (timestamp_diff / height_diff) vs. height for the latest main run]

v1.2.0, https://github.com/ZcashFoundation/zebra/actions/runs/6305243389/job/17119807358:

[graph: sync rate (timestamp_diff / height_diff) vs. height for the v1.2.0 run]

// Parses saved zebrad job logs (./logs.txt) and prints `height, sync_rate`
// pairs, where sync_rate is the time taken per block since the previous
// progress log line.
const fs = require("fs");

const parse_logs = async (file_name) =>
  new Promise((resolve, reject) => {
    fs.readFile(`./${file_name}`, "utf8", (err, data) => {
      if (err) {
        reject(err);
        return;
      }

      resolve(
        data
          .split("\n")
          .map((line) => ({
            timestamp: Date.parse(line.split(" ")[0]),
            line,
          }))
          // Keep only progress lines with a parseable leading timestamp.
          .filter(
            ({ timestamp, line }) =>
              !Number.isNaN(timestamp) && line.includes("current_height")
          )
          .map(({ timestamp, line }) => ({
            timestamp,
            height: Number(
              (
                line.split("current_height")[1].split("Height(")[1] || ""
              ).split(")")[0]
            ),
          }))
      );
    });
  });

(async () => {
  const logs = await parse_logs("logs.txt");
  // const logs = await parse_logs("logs-1.2.0.txt");

  let prev_timestamp = logs[0].timestamp;
  let prev_height = 0;
  const sync_rates = logs
    .slice(1)
    .map(({ timestamp, height }) => {
      const time_diff = timestamp - prev_timestamp;
      const height_diff = height - prev_height;
      const sync_rate = time_diff / height_diff;
      prev_timestamp = timestamp;
      prev_height = height;
      return {
        height,
        sync_rate,
        timestamp,
        time_diff,
        height_diff,
      };
    })
    // Drop entries where the height or timestamp didn't move forward.
    .filter(
      ({ height, sync_rate, time_diff, height_diff }) =>
        time_diff > 0 && height > 0 && sync_rate > 0 && height_diff > 0
    );

  for (const { height, sync_rate } of sync_rates) {
    console.log(`${height}, ${sync_rate}`);
  }
})();

@teor2345
Contributor

teor2345 commented Oct 3, 2023

> @teor2345 It looks like it's gotten far slower near the beginning of the initial sync.

That's interesting and not what I expected. But it's good news, because we only need to sync to 250,000 to see a significant difference.

What's in the logs around the heights where the rate changes? Can you link to the logs you used for each graph?

@upbqdn
Member

upbqdn commented Oct 3, 2023

It is slow at the start. It took 5 hours to get to 350_000 for main, but 50 minutes for v1.2.0.

@upbqdn
Member

upbqdn commented Oct 3, 2023

I'm now running a sync for #7437 (cc61bd5) to see how it behaves.

@upbqdn
Member

upbqdn commented Oct 3, 2023

#7437 (cc61bd5) looks good. Now I'm running #7566 7a7d79d.

@upbqdn
Member

upbqdn commented Oct 3, 2023

It looks like PR #7590 introduces the bug we're dealing with. The slowness shows within the first five minutes of syncing from scratch, and is introduced in 2dce686. At commit 2dce686 and four subsequent commits that I tried, Zebra syncs ~ 30_000 blocks in ~ 5 minutes. At commit 92d6da3 that directly precedes 2dce686 and at three previous commits that I tried, Zebra syncs ~ 100_000 blocks in ~ 5 minutes.

@teor2345
Contributor

teor2345 commented Oct 3, 2023

Yep, we've had problems with database iterator performance before. I'm going to look up the rocksdb docs to see if there's any way to configure the database differently to avoid that (or code the iterator differently).

PR #7531 also uses database iterators during an upgrade (but so did #7437), and #7602 runs some of that code every 5 minutes in CI, so we might need to fix that code as well.

I wonder if it's the existence of an iterator that's the problem, or the number of iterators we're creating?

@teor2345
Contributor

teor2345 commented Oct 3, 2023

This might be a tricky bug to diagnose, because we've been using zs_last_key_value() in the tip() method for a very long time (but maybe it's always been slowing Zebra down, and we've just accepted that performance):

pub fn tip(&self) -> Option<(block::Height, block::Hash)> {
    let hash_by_height = self.db.cf_handle("hash_by_height").unwrap();
    self.db.zs_last_key_value(&hash_by_height)
}

Edit: this is the only existing use of zs_last_key_value() in v1.2.0. It is also used in the recently merged PR #7590 and the open PR #7636.
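
For reference, a last-key read like this is typically built on a reverse-seeking RocksDB iterator. The sketch below uses the rocksdb crate directly and is not Zebra's actual zs_last_key_value() implementation (the helper name and types are illustrative only); it just shows why the pattern is sensitive to deletes: seeking to the last key has to step backwards over any not-yet-compacted tombstones at the end of the column family.

use rocksdb::{ColumnFamily, DB};

// Rough sketch of a "last key value" helper, not Zebra's real code.
fn last_key_value(db: &DB, cf: &ColumnFamily) -> Option<(Vec<u8>, Vec<u8>)> {
    let mut iter = db.raw_iterator_cf(cf);

    // Seeking to the end of the column family has to skip over any tombstones
    // (deleted keys that haven't been compacted away yet), which is where this
    // pattern becomes slow after lots of recent deletes.
    iter.seek_to_last();

    if iter.valid() {
        Some((iter.key()?.to_vec(), iter.value()?.to_vec()))
    } else {
        None
    }
}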

@teor2345
Contributor

teor2345 commented Oct 3, 2023

Here's one possible benchmark test for this code:

  • call a database read function that uses database iterators thousands of times, and see how long it takes

It's possible that the iterator only performs badly when we're creating, updating, or deleting column family entries. So we might need to do that in the test as well. I guess we'll find out when we write it.

It might also help to call all the database read functions that use iterators, not just one of them.
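
A minimal sketch of what that benchmark could look like, timing thousands of "read the last key" calls with std::time::Instant while interleaving inserts and deletes so the column family accumulates tombstones. It uses the rocksdb crate directly rather than Zebra's state service, and the column family name, key layout, and counts are made up for illustration:

use std::time::Instant;

use rocksdb::{ColumnFamilyDescriptor, DB, IteratorMode, Options};

fn main() {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);

    // Hypothetical column family standing in for something like `history_tree`.
    let cfs = vec![ColumnFamilyDescriptor::new("bench_cf", Options::default())];
    let db = DB::open_cf_descriptors(&opts, "/tmp/zebra-iterator-bench", cfs).unwrap();
    let cf = db.cf_handle("bench_cf").unwrap();

    // Simulate a delete-heavy workload: insert a new key, then delete the
    // previous one, leaving a growing trail of tombstones behind the live key.
    for i in 0u32..100_000 {
        db.put_cf(cf, i.to_be_bytes(), [0u8; 32]).unwrap();
        if i > 0 {
            db.delete_cf(cf, (i - 1).to_be_bytes()).unwrap();
        }
    }

    // Time thousands of "read the last key" calls, the pattern used by tip().
    let start = Instant::now();
    for _ in 0..10_000 {
        let _last = db.iterator_cf(cf, IteratorMode::End).next();
    }
    println!("10k reverse-seek reads took {:?}", start.elapsed());
}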

@teor2345
Contributor

teor2345 commented Oct 4, 2023

I opened PR #7663 to fix this issue, @arya2 or @upbqdn can you check if it fixes the performance issue?
(I'm not sure how to reproduce your tests or analysis.)

@teor2345
Contributor

teor2345 commented Oct 4, 2023

> This might be a tricky bug to diagnose, because we've been using zs_last_key_value() in the tip() method for a very long time

It looks like this bug only happens when:

  • there are a lot of deleted keys still marked in the database, and
  • we're iterating through a range that might include those keys (called "tombstones").

So most column families aren't impacted:
https://github.com/ZcashFoundation/zebra/blob/main/book/src/dev/state-db-upgrades.md#current-state-database-format

This is consistent with what we're seeing in these graphs:

> It looks like it's gotten far slower near the beginning of the initial sync.
>
> Latest, ZcashFoundation/zebra/actions/runs/6351284888 (X-axis is height, Y-axis is timestamp_diff / height_diff):

The performance gradually decreases as the number of tombstones increases, then the tombstones are periodically cleared, and the performance improves again.

Here's what I think we could do to fix it:

PRs #7663 and #7392 fix this bug for the history and sprout trees.
Ticket #7664 will fix it for the address UTXO index (we don't look it up except in RPCs, so it isn't impacting performance all the time).
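
For comparison, a point read at a known key avoids the problem entirely, because RocksDB can answer a get_cf() without walking over tombstones in a range. This is only a generic illustration of that difference, with a made-up fixed key; it is not how PR #7663 or Zebra's column families actually store the trees:

use rocksdb::{ColumnFamily, DB};

// Reads a value stored under a single fixed key instead of scanning for the
// highest key with a reverse iterator. Purely illustrative; the key name is
// invented for this sketch.
fn read_current_tree(db: &DB, cf: &ColumnFamily) -> Option<Vec<u8>> {
    const CURRENT_TREE_KEY: &[u8] = b"current";

    // A point lookup doesn't need to step over deleted-but-uncompacted keys,
    // so delete-heavy workloads don't slow it down the way range seeks do.
    db.get_cf(cf, CURRENT_TREE_KEY).expect("database read failed")
}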

@teor2345
Contributor

teor2345 commented Oct 4, 2023

> I opened PR #7663 to fix this issue, @arya2 or @upbqdn can you check if it fixes the performance issue? (I'm not sure how to reproduce your tests or analysis.)

I did a local test of this fix, and Zebra is up to 1.2 million blocks in under 5 hours. But I don't know how fast v1.2.0 was on my machine. So I'd appreciate someone else checking.

@upbqdn
Member

upbqdn commented Oct 4, 2023

I can confirm PR #7663 brings the sync performance back to where it was at v1.2.0.

@mpguerra
Contributor

mpguerra commented Oct 5, 2023

@teor2345 can I please get an estimate here?
