v1.18: Archives storages directly (backport of #503) #1430

Open · mergify[bot] wants to merge 2 commits into v1.18 from mergify/bp/v1.18/pr-503

Conversation

mergify[bot]

@mergify mergify bot commented May 20, 2024

Problem

When archiving a snapshot, we need to add all the storages to the archive. Currently, we flush each file and then have the archiver load the file from disk.

Since the storages are all resident and in sync with the validator process, we know they are all up to date. Also, we already flushed all the storages to disk in AccountsBackgroundService when calling add_bank_snapshot(), so flushing again is unnecessary and negatively impacts the rest of the system[1].

Summary of Changes

Instead of flushing to disk first, read each storage's data directly from its mmap and add it to the archive.
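
As a rough sketch of the idea (not the exact implementation in runtime/src/snapshot_utils.rs; the tar builder and byte-slice accessor here are assumed):

// Hypothetical sketch: append a storage's mmap-backed bytes straight into the
// tar archive instead of flushing the file and re-reading it from disk.
// `path_in_archive` and `storage_bytes` are assumed inputs.
use std::io::Write;
use std::path::Path;

fn append_storage<W: Write>(
    builder: &mut tar::Builder<W>,
    path_in_archive: &Path,
    storage_bytes: &[u8], // the mmap's valid slice, i.e. &mmap[..len]
) -> std::io::Result<()> {
    let mut header = tar::Header::new_gnu();
    header.set_size(storage_bytes.len() as u64);
    header.set_mode(0o644);
    header.set_cksum();
    // append_data reads from any Read impl; a byte slice works directly,
    // so no filesystem round trip is needed.
    builder.append_data(&mut header, path_in_archive, storage_bytes)
}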

Additional Testing

Running ledger-tool create-snapshot on mnb showed that this PR does not impact how long it takes to archive a snapshot.

master:

solana_runtime::snapshot_utils] Successfully created /home/sol/ledger/snapshot-257173898-5hquoxg3NyjYVnX86x79VeS5KHBAwW5viLBiVbjsYZXe.tar.zst. slot: 257173898, elapsed ms: 1181220, size: 67725864176

this pr:

solana_runtime::snapshot_utils] Successfully created /home/sol/ledger/snapshot-257173898-5hquoxg3NyjYVnX86x79VeS5KHBAwW5viLBiVbjsYZXe.tar.zst. slot: 257173898, elapsed ms: 1181327, size: 67722885806

I also tested loading from this new snapshot created with this PR to ensure it actually works properly. Happy to report that was successful too.


This is an automatic backport of pull request #503 done by Mergify.

Footnotes

  1. Quantifying the negative impacts is something that I believe @alessandrod has done/shared.

(cherry picked from commit 4247a8a)

# Conflicts:
#	accounts-db/src/accounts_file.rs
#	accounts-db/src/tiered_storage/hot.rs
#	accounts-db/src/tiered_storage/readable.rs
#	runtime/src/snapshot_utils.rs
@mergify mergify bot added the conflicts label May 20, 2024
@mergify mergify bot requested a review from a team as a code owner May 20, 2024 05:18
Author

mergify bot commented May 20, 2024

Cherry-pick of 4247a8a has failed:

On branch mergify/bp/v1.18/pr-503
Your branch is up to date with 'origin/v1.18'.

You are currently cherry-picking commit 4247a8a546.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   accounts-db/src/append_vec.rs

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   accounts-db/src/accounts_file.rs
	both modified:   accounts-db/src/tiered_storage/hot.rs
	both modified:   accounts-db/src/tiered_storage/readable.rs
	both modified:   runtime/src/snapshot_utils.rs

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@steviez

steviez commented May 20, 2024

The other day in Discord, we realized there is a significant mismatch between the snapshot sizes created by a v1.17/v1.18 validator and those created by a tip-of-master node. For example, on mainnet right now:

  • bv4 has 71 GB full snapshot (running v1.17.33)
  • bv1 has 70 GB full snapshot (running v1.18.13)
  • My test node has 53 GB full snapshot (running near tip of master)

This is roughly a 25% reduction in snapshot size. I was chatting with Brooks, and he mentioned that the commit this PR backports was a likely suspect for the change. So, I started up a fresh testnet node running v1.18 + this BP. I made sure to download a fresh snapshot from a v1.18 node, so that I started with a "large" snapshot.

This plot shows the full snapshot size; blue is tv and purple is my testnet node. Over time, we can see the snapshot size decrease, seemingly as we rewrite accounts throughout the epoch.
[plot: full snapshot archive size over time]

We also see that the size of incremental snapshots grows at a slower rate as the slot distance from full snapshot slot grows.
[plot: incremental snapshot archive size vs. slot distance from the full snapshot]

Writing snapshots is known to be stressful for the node, so the 25% drop in size seems like a pretty sizable win for a change that isn't too complex. Thus, I am in favor of BP'ing this change, but we should also get thoughts from @brooksprumo (who wrote the initial PR) and @jeffwashington as subject matter experts.

brooksprumo
brooksprumo previously approved these changes May 20, 2024
@brooksprumo brooksprumo left a comment

I reviewed the commit that resolves the merge conflicts, and it looks good.

I am fine with this PR getting backported to v1.18, but I'm not strongly pushing for it. I'll defer to the others on the question of whether we should backport or not. I sign off on the code changes themselves.

@alessandrod

Ship ittt

@codecov-commenter

Codecov Report

Attention: Patch coverage is 72.41379%, with 8 lines in your changes missing coverage. Please review.

Project coverage is 81.6%. Comparing base (1ce727d) to head (ee56213).

Additional details and impacted files
@@            Coverage Diff            @@
##            v1.18    #1430     +/-   ##
=========================================
- Coverage    81.6%    81.6%   -0.1%     
=========================================
  Files         828      828             
  Lines      225537   225532      -5     
=========================================
- Hits       184177   184165     -12     
- Misses      41360    41367      +7     

@CriesofCarrots CriesofCarrots left a comment

Seems like a reasonable "bug" fix to me. I approve bp-ing.

@jeffwashington

@brooksprumo why does this pr reduce snapshot size so drastically?

@brooksprumo

@brooksprumo why does this pr reduce snapshot size so drastically?

Not sure actually. This was an interesting discovery by @steviez. Maybe the end of storage files contains lots of bytes that are not actually in the mmap's internal slice? Could this be due to removing the recycler? Or maybe we were/are oversizing files still, and the mmap's slice only contains the valid data?

@HaoranYi

HaoranYi commented May 22, 2024

@brooksprumo why does this pr reduce snapshot size so drastically?

Not sure actually. This was an interesting discovery by @steviez. Maybe the end of storage files contains lots of bytes that are not actually in the mmap's internal slice? Could this be due to removing the recycler? Or maybe we were/are oversizing files still, and the mmap's slice only contains the valid data?

If we are snapshotting on the disk file, we may need to call sync_all to sync all the mmap data to disk.

pub fn sync_all(&self) -> Result<()>

@brooksprumo

If we are snapshotting on the disk file, we may need to call sync_all to sync all the mmap data to disk.

With this PR we now get the account storage data from the mmap directly and do not go to the filesystem. So my understanding is that we don't need to do this sync.

Note that before archiving, we have already flushed the mmaps to disk as part of serializing the bank to disk, which is needed for fastboot. This happens in AccountsBackgroundService when we call snapshot_bank_utils::add_bank_snapshot(). We flush all the storage files, which flushes all the mmaps and eventually calls libc::msync().
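
For context, a minimal sketch (assuming the memmap2 crate, which the mmap-backed storages use) of what that flush boils down to:

// Minimal sketch: flushing a writable memory map synchronously.
// memmap2's MmapMut::flush() issues msync(2) on the mapped range,
// which is the libc::msync() call mentioned above.
use memmap2::MmapMut;

fn flush_storage_mmap(mmap: &MmapMut) -> std::io::Result<()> {
    mmap.flush()
}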

@HaoranYi

If we are snapshotting on the disk file, we may need to call sync_all to sync all the mmap data to disk.

With this PR we now get the account storage data from the mmap directly and do not go to the filesystem. So my understanding is that we don't need to do this sync.

Note that before archiving, we already have flushed the mmaps to disk as part of serializing the bank to disk, which is needed for fastboot. This happens in AccountsBackgroundService when we call snapshot_bank_utils::add_bank_snapshot(). We flush all the storage files, which flushes all the mmaps, and eventually is calling libc::msync().

You are right. The mmap files are flushed. I verified that the sum of the mmap sizes is equal to the sum of the appendvec file sizes.

@brooksprumo brooksprumo requested a review from a team May 24, 2024 13:46
@CriesofCarrots

@brooksprumo, I already approved this on behalf of backport reviewers. It doesn't seem like there have been any changes pushed since then, have there?

@brooksprumo

It doesn't seem like there have been any changes pushed since then, have there?

Nope, no new changes.

I already approved this on behalf of backport reviewers.

Ah ok, sorry about re-adding backport reviewers. I wasn't sure why I didn't see them listed as a reviewer, and I missed the "CriesofCarrots approved these changes on behalf of https://github.com/orgs/anza-xyz/teams/backport-reviewers" line.

@brooksprumo brooksprumo removed the request for review from a team May 24, 2024 14:30
@steviez

steviez commented May 24, 2024

@brooksprumo @HaoranYi @jeffwashington - Any remaining concerns about merging this?

@HaoranYi

HaoranYi commented May 24, 2024

@brooksprumo @HaoranYi @jeffwashington - Any remaining concerns about merging this?

This PR looks more correct to me. :shipit:

In this PR, we choose exactly which account storages to put into the archive via an explicit loop, while the old code put everything inside staging_accounts_dir into the archive. I suspect there could have been other garbage files in staging_accounts_dir that were put into the archive, which would cause a larger snapshot size.

@brooksprumo

@brooksprumo - Any remaining concerns about merging this?

None from me

@jeffwashington

@brooksprumo why does this pr reduce snapshot size so drastically?

Not sure actually. This was an interesting discovery by @steviez. Maybe the end of storage files contains lots of bytes that are not actually in the mmap's internal slice? Could this be due to removing the recycler? Or maybe we were/are oversizing files still, and the mmap's slice only contains the valid data?

The map field is the full size of the file. We use len() to know the valid bytes within the file. So, we should be returning the entire file size, including bytes left over at the end that are unused/invalid.
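
As a simplified illustration of that shape (field and method names here are hypothetical, not the actual AppendVec definition):

// The mmap covers the entire file, padding included; a separate length tracks
// how many bytes are actually valid account data.
struct StorageLike {
    map: memmap2::Mmap, // maps the whole file, including the unused tail
    len: usize,         // number of valid bytes written so far
}

impl StorageLike {
    fn valid_bytes(&self) -> &[u8] {
        &self.map[..self.len]
    }
}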

I don't understand why this PR would cause a smaller snapshot file. That makes me nervous that we're skipping files or parts of files. Presumably the snapshots we create are correct.
Or maybe we are getting better compression.
It would be great to know the resulting on-disk size of all the append vecs after uncompressing both ways. That would tell us whether the savings come from better compression or from having fewer files.
I imagine running ledger-tool create-snapshot with and without this change, then untarring the snapshots (without fastboot) and seeing what differs would answer my questions, and then I could resolve my concerns if we like the answers.

@HaoranYi

HaoranYi commented May 28, 2024

@brooksprumo why does this pr reduce snapshot size so drastically?

Not sure actually. This was an interesting discovery by @steviez. Maybe the end of storage files contains lots of bytes that are not actually in the mmap's internal slice? Could this be due to removing the recycler? Or maybe we were/are oversizing files still, and the mmap's slice only contains the valid data?

The map field is the full size of the file. We use len() to know the valid bytes within the file. So, we should be returning the entire file size, including bytes left over at the end that are unused/invalid.

I don't understand why this PR would cause a smaller snapshot file. That makes me nervous that we're skipping files or parts of files. Presumably the snapshots we create are correct. Or maybe we are getting better compression. It would be great to know the resulting on-disk size of all the append vecs after uncompressing both ways. That would tell us whether the savings come from better compression or from having fewer files. I imagine running ledger-tool create-snapshot with and without this change, then untarring the snapshots (without fastboot) and seeing what differs would answer my questions, and then I could resolve my concerns if we like the answers.

I made an experiment branch to compare the size of mmap with the size of the actual file. The log shows that the sizes are actually the same.
https://github.com/anza-xyz/agave/compare/master...HaoranYi:solana:experiment/mmmap_file_size?expand=1

sol@shared-dev-am-11:~/logs$ grep haoran solana-validator.log.5 | grep "total mmap" | head -n 10
[2024-05-23T00:00:04.989674539Z INFO  solana_runtime::snapshot_utils] haoran total mmap 5928611736 file 5928611736
[2024-05-23T00:00:47.666132020Z INFO  solana_runtime::snapshot_utils] haoran total mmap 6001447880 file 6001447880
[2024-05-23T00:01:35.829938734Z INFO  solana_runtime::snapshot_utils] haoran total mmap 6063692864 file 6063692864
[2024-05-23T00:02:30.213866551Z INFO  solana_runtime::snapshot_utils] haoran total mmap 6191821384 file 6191821384
[2024-05-23T00:03:12.682075083Z INFO  solana_runtime::snapshot_utils] haoran total mmap 6229769688 file 6229769688
[2024-05-23T00:04:02.773589822Z INFO  solana_runtime::snapshot_utils] haoran total mmap 6271044256 file 6271044256
[2024-05-23T00:04:43.588315724Z INFO  solana_runtime::snapshot_utils] haoran total mmap 6376330712 file 6376330712
[2024-05-23T00:05:27.334345669Z INFO  solana_runtime::snapshot_utils] haoran total mmap 6415419112 file 6415419112
[2024-05-23T00:06:07.737018001Z INFO  solana_runtime::snapshot_utils] haoran total mmap 6507286328 file 6507286328
[2024-05-23T00:06:56.092658137Z INFO  solana_runtime::snapshot_utils] haoran total mmap 6603694896 file 6603694896

@HaoranYi

HaoranYi commented May 28, 2024

maybe we are getting better compression

Might it be the ordering of the files added to the archive? With the old code, we depend on the ordering of the files in the staging directory when they are added to the archive. Or could there be left-over garbage files in the staging folder?

@jeffwashington

I made an experiment branch to compare the size of mmap with the size of the actual file.

Yes, this confirms what I thought: mmap.len() == file size.
So, this is NOT the source of the smaller snapshot file.

@steviez

steviez commented May 28, 2024

I imagine running ledger-tool create-snapshot with and without this change, then untarring the snapshots (without fastboot) and seeing what differs would answer my questions, and then I could resolve my concerns if we like the answers.

FWIW, I initially attempted to test this PR just by running ledger-tool create-snapshot. I started with a v1.18 created snapshot and then replayed/created the snap with this commit after ~10 slots of replay. The size of the created snapshot archives was essentially the same.

Based on some observations made in this comment above, I'm inclined to think the drop in snapshot size is correlated to accounts getting rewritten. So, I wouldn't expect to see a drop in snapshot size with this branch with ledger-tool create-snapshot UNLESS a large number of slots are replayed first.

As it relates to the experiment Jeff proposed, we might need to look fairly closely at what differs instead of a broad "this directory is now 25% smaller".

@steviez

steviez commented May 28, 2024

I don't understand why this pr would cause a smaller snapshot file. That makes me nervous that we're skipping files or parts of files. Presumably the snapshots we create are correct.

I believe the snapshots we are creating are correct. Namely, my nodes (and all of the canaries and everyone else's tip-of-master dev nodes, for that matter) haven't diverged from the cluster. If we were dropping account(s) or account data, I would expect the epoch accounts hash to diverge.

@jeffwashington

I believe the snapshots we are creating are correct. Namely, my nodes (and all of the canaries and everyone else's tip-of-master dev nodes, for that matter) haven't diverged from the cluster. If we were dropping account(s) or account data, I would expect the epoch accounts hash to diverge.

We don't USE the snapshots we create in continuously running nodes, so the fact they are running well doesn't tell us anything about the quality of their snapshots, right? Something here has changed to make the snapshots smaller. I have not heard a good answer as to WHAT has changed that makes them smaller and I have concerns that we don't understand what was going on (that we accidentally found made things better) or worse, that what we're doing now (or were doing before) is somehow wrong.

@steviez

steviez commented May 28, 2024

We don't USE the snapshots we create in continuously running nodes, so the fact they are running well doesn't tell us anything about the quality of their snapshots, right?

Fair point about a node that is continuously running. However, I know that I have restarted my nodes a handful of times, and the canaries are restarted on set schedules. That being said, all of these nodes could be using fastboot.

As an easy experiment for the background, let me restart one of my nodes with fastboot disabled

I have not heard a good answer as to WHAT has changed that makes them smaller and I have concerns that we don't understand what was going on (that we accidentally found made things better) or worse, that what we're doing now (or were doing before) is somehow wrong.

Understood. I'm not an expert in this area of the codebase which is why I pinged y'all 😉 . With the exchange of comments this morning, it is now much more clear to me that we still have work to do in understanding the "why", and you spelling out the experiment that Haoran is going to run / I can tag in on is helpful.

@HaoranYi

HaoranYi commented May 28, 2024

I stopped my validator and used the ledger tool to create two snapshots, with and without this PR, right after the base slot.

$ cargo run --release --bin agave-ledger-tool --  create-snapshot 268484163 --ledger ~/ledger 

The snapshot created with this PR is 1,510,359 bytes (~1.5 MB) smaller.

sol@shared-dev-am-11:~/ledger$ ls -l snapshot-268484163-*
-rw-r--r-- 1 sol users 58605692407 May 28 20:41 snapshot-268484163-new-CEuTpjqPntsn4P9xxGFqx4KZ3owwQ93jWJ1H7j4WWGr.tar.zst
-rw-r--r-- 1 sol users 58607202766 May 28 19:55 snapshot-268484163-old-CEuTpjqPntsn4P9xxGFqx4KZ3owwQ93jWJ1H7j4WWGr.tar.zst

After untarring the zst files, the total number of files under accounts/ is the same, and the total size of the accounts dir with the new PR is a little smaller.

sol@shared-dev-am-11:~/ledger$ du   268484163-new/accounts
281052268       268484163-new/accounts
sol@shared-dev-am-11:~/ledger$ du   268484163-old/accounts
281058184       268484163-old/accounts

However, the filenames in the accounts folder are slightly different (probably due to different seeds when ledger-tool starts).

 $ diff -qr 268484163-old 268484163-new 
Only in 268484163-old/accounts: 268057299.81242
Only in 268484163-new/accounts: 268057299.81266
Only in 268484163-new/accounts: 268057300.61307
Only in 268484163-old/accounts: 268057300.61319

I wrote a simple Python script to check that the accounts files for the same slot are the same across both snapshots.

 sol@shared-dev-am-11:~/ledger$ python3 comp.py
build old files …
build new files …
compare files …

Here is the python script, which can be used to compare the accounts files extracted from different snapshots.

# comp.py
# Compare the accounts files extracted from two snapshots. Files are matched by
# slot (the filename prefix before the first '.'), since the append-vec id
# suffix can differ between the two snapshots.
import os

old_files = os.listdir('268484163-old/accounts/')
new_files = os.listdir('268484163-new/accounts/')

print('build old files ...')
s1 = dict()
for x in old_files:
    k = x.split('.')[0]
    s1[k] = x

print('build new files ...')
s2 = dict()
for x in new_files:
    k = x.split('.')[0]
    s2[k] = x

print('compare files ...')
for k in s1:
    old_file = s1[k]
    if k not in s2:
        print(k, 'not found in new')
        continue
    new_file = s2[k]
    # diff returns non-zero when the file contents differ
    cmd = 'sudo diff ./268484163-old/accounts/{0} ./268484163-new/accounts/{1}'.format(old_file, new_file)
    ret = os.system(cmd)
    if ret > 0:
        print('found diff in ', cmd)

for k in s2:
    if k not in s1:
        print(k, 'not found in old')

@jeffwashington

I stopped my validator and used the ledger tool to create two snapshots, with and without this PR, right after the base slot.

I'm sorry to go around. This can't be mnb, right? Seems way too small.

So snapshot size on your machine was ~1MB smaller out of 58G snapshot?

@brooksprumo

Another thought I had: if a node without this PR boots from the new small snapshot, will it start creating snapshots with gradually increasing size over the next epoch, back up to the larger 77 GB size?

@jeffwashington

Another thought I had: if a node without this PR boots from the new small snapshot, will it start creating snapshots with gradually increasing size over the next epoch, back up to the larger 77 GB size?

I suspect that the first snapshot a node without this PR creates will be the normal size. The original snapshot size appears to have nothing to do with the accounts files it creates.

I guess the improvements come from the ordering somehow, or maybe, because we're using buffers instead of file handles, the compression is working better across files?

@HaoranYi

I stopped my validator and used the ledger tool to create two snapshots, with and without this PR, right after the base slot.

I'm sorry to go around. This can't be mnb, right? Seems way too small.

This is for mnb.

So snapshot size on your machine was ~1MB smaller out of 58G snapshot?
Yes, only 1M difference.

I think @steviez has already pointed out in his comments that running ledger-tool create-snapshot won't show much of a snapshot size difference with this change.

My theory is that, when running ledger-tool, we always start from the same unpacked base snapshot. The locations of the accounts files on disk are the same, so archiving via the staging directory and archiving via the loop result in almost the same ordering of files being added to the archive. Hence the output file sizes won't differ much.

For long-running validators, however, the ordering of the accounts files coming from the staging directory could be quite different from the loop in this PR.

I am running another experiment to test this theory by randomizing the order of the accounts file in archiving.
I will report how this experiment goes.

@HaoranYi

HaoranYi commented May 29, 2024

@steviez @brooksprumo Is it possible to get the snapshots from two running validators against mnb for the same slot with and without this PR? If so, then we can use comp.py scripts to check the accounts files between these two snapshots.

@HaoranYi

HaoranYi commented May 29, 2024

I am running another experiment to test this theory by randomizing the order of the accounts file in archiving.
I will report how this experiment goes.

Randomizing the order of the account files doesn't make much difference (~58 GB).

diff --git a/runtime/src/snapshot_utils.rs b/runtime/src/snapshot_utils.rs
index e51db65b87..812a9de1dc 100644
--- a/runtime/src/snapshot_utils.rs
+++ b/runtime/src/snapshot_utils.rs
@@ -803,7 +803,13 @@ pub fn archive_snapshot_package(
                 .append_dir_all(SNAPSHOTS_DIR, &staging_snapshots_dir)
                 .map_err(E::ArchiveSnapshotsDir)?;

-            for storage in &snapshot_package.snapshot_storages {
+            let mut storages: Vec<_> = snapshot_package.snapshot_storages.iter().cloned().collect();
+            use rand::seq::SliceRandom;
+            use rand::thread_rng;
+            let mut rng = thread_rng();
+            storages.shuffle(&mut rng);
+
+            for storage in storages {
                 let path_in_archive = Path::new(ACCOUNTS_DIR).join(AccountsFile::file_name(
                     storage.slot(),
                     storage.append_vec_id(),
sol@shared-dev-am-11:~/ledger$ ls -l  snapshot-268484163*
-rw-r--r-- 1 sol users 58605692407 May 28 20:41 snapshot-268484163-new-CEuTpjqPntsn4P9xxGFqx4KZ3owwQ93jWJ1H7j4WWGr.tar.zst
-rw-r--r-- 1 sol users 58607202766 May 28 19:55 snapshot-268484163-old-CEuTpjqPntsn4P9xxGFqx4KZ3owwQ93jWJ1H7j4WWGr.tar.zst
-rw-r--r-- 1 sol users 58606321870 May 29 14:47 snapshot-268484163-rand-CEuTpjqPntsn4P9xxGFqx4KZ3owwQ93jWJ1H7j4WWGr.tar.zst

@steviez

steviez commented May 29, 2024

Is it possible to get the snapshots from two running validators against mnb for the same slot with and without this PR? If so, then we can use comp.py scripts to check the accounts files between these two snapshots.

@HaoranYi - I have snapshots with and without this PR for testnet. Testnet should be just fine for testing, and given that snapshots are an order of magnitude smaller on testnet, any analysis should go much quicker too. Per our DM's, I have copied them over to dev1:

$ ls -hlR gh1430_compare_snaps/
...
gh1430_compare_snaps/v1.18_modified:
total 4.1G
-rw-r--r-- 1 sol users 4.1G May 29 15:24 snapshot-273849800-8Mg9KiquqZNcwFU64DZuNWbMmBS8FqoYKTPTXygaZ7sQ.tar.zst

gh1430_compare_snaps/v1.18_original:
total 6.7G
-rw-r--r-- 1 sol users 6.7G May 29 15:32 snapshot-273849800-8Mg9KiquqZNcwFU64DZuNWbMmBS8FqoYKTPTXygaZ7sQ.tar.zst

Final note: I originally used this PR branch to "shrink" the snapshot, but the node has been running tip of master recently. However, I previously isolated the drop in snapshot size to this commit, so I think this is fine.

@HaoranYi

HaoranYi commented May 29, 2024

The testnet snapshots from @steviez DO show that the accounts files are actually smaller with this PR.

Taking slot 273663851 as an example, the sizes are 34240 (0x85C0) vs 36864 (0x9000).

sol@shared-dev-am-11:~/ledger$ sudo ls -l testnet_snap_orig/accounts/273663851.1297167 testnet_snap_modi/accounts/273663851.2923417
---------- 1 sol users 34240 Jan  1  1970 testnet_snap_modi/accounts/273663851.2923417
-rw-r--r-- 1 sol users 36864 May 28 15:13 testnet_snap_orig/accounts/273663851.1297167

Looking at the binary contents of the original accounts file shows that bytes [0x85C0..0x9000] are all zeros.

> 00008590: 0000 0000 0000 0000 473e 0aaf 55dd 49fc  ........G>..U.I.
> 000085a0: 772b 6806 6c57 cfaa 2937 2056 1540 358e  w+h.lW..)7 V.@5.
> 000085b0: c6bf cdea fa0d 9770 0000 0000 0000 0000  .......p........
> 000085c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> [... every 16-byte row from 0x85C0 through 0x8FF0 is all zeros ...]
> 00008ff0: 0000 0000 0000 0000 0000 0000 0000 0000  ................

It seems like the mmap file didn't get synced to disk in the staging directory?

@alessandrod

Taking slot 273663851 as an example, the sizes are 34240 (0x85C0) vs 36864 (0x9000).

note how one size is page aligned, the other is not

@HaoranYi

HaoranYi commented May 29, 2024

Taking slot 273663851 as an example, the sizes are 34240 (0x85C0) vs 36864 (0x9000).

note how one size is page aligned, the other is not

Yeah, exactly. All accounts files in the original snapshot are page-aligned.

sol@shared-dev-am-11:~/ledger$ ls -l testnet_snap_orig/accounts | head -n 10 | awk '{printf("%x\n", $5)}'
0
9000
a000
e000
a000
9000
a000
9000
b000
9000

sol@shared-dev-am-11:~/ledger$ ls -l testnet_snap_modi/accounts | head -n 10 | awk '{printf("%x\n", $5)}'
0
a5da0
9a1f8
9fcd8
974a8
a3220
99b48
9b470
a0760
96488

@HaoranYi

It seems like the on-disk representation of the mmap is page-aligned, while the in-memory representation is not.
That would explain why using the in-memory representation saves snapshot space.

@HaoranYi

  • total number of files
sol@shared-dev-am-11:~/ledger$ ls -l testnet_snap_modi/accounts/ | wc -l
421244
sol@shared-dev-am-11:~/ledger$ ls -l testnet_snap_orig/accounts/ | wc -l
421220
  • difference of total size of files (578M out of 30G)
sol@shared-dev-am-11:~/ledger$ ls -l testnet_snap_orig/accounts | awk 'BEGIN {total=0} {total += $5;} END { print total }'
30682248360
sol@shared-dev-am-11:~/ledger$ ls -l testnet_snap_modi/accounts | awk 'BEGIN {total=0} {total += $5;} END { print total }'
30075820832
  • snapshot size (difference 2.6G)
sol@shared-dev-am-11:~$ ls -lh gh1430_compare_snaps/v1.18_modified/
total 4.1G
-rw-r--r-- 1 sol users 4.1G May 29 15:24 snapshot-273849800-8Mg9KiquqZNcwFU64DZuNWbMmBS8FqoYKTPTXygaZ7sQ.tar.zst
sol@shared-dev-am-11:~$ ls -lh gh1430_compare_snaps/v1.18_original/
total 6.7G
-rw-r--r-- 1 sol users 6.7G May 29 15:32 snapshot-273849800-8Mg9KiquqZNcwFU64DZuNWbMmBS8FqoYKTPTXygaZ7sQ.tar.zst

One thing I find interesting about these two snapshots: the total accounts file size difference is only 578 MB, but the compressed tarball difference is 2.6 GB.

@alessandrod

It seems like the on-disk representation of the mmap is page-aligned, while the in-memory representation is not. That would explain why using the in-memory representation saves snapshot space.

Why are we writing extra data to disk tho? Do we have a bug? Is it in flush()?

@HaoranYi

HaoranYi commented May 30, 2024

Why are we writing extra data to disk tho? Do we have a bug? Is it in flush()?

Yes, it's in flush.

We call flush at the beginning of taking a snapshot, i.e. in add_bank_snapshot() in ABS, which calls into memmap2's flush().

for storage in snapshot_storages {

However, the memmap2 crate's flush adjusts the flushed range to page alignment when syncing to disk.

https://github.com/RazrFalcon/memmap2-rs/blob/f7c50830ed332d4a53dd948194c87bfd3b25855b/src/unix.rs#L306
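
To illustrate the arithmetic (a hypothetical sketch assuming a 4096-byte page size): rounding the flushed length up to a page boundary maps the two observed sizes onto each other.

// Page-align rounding: a 34240-byte (0x85C0) valid region flushed with
// page-aligned bounds covers 36864 bytes (0x9000) on disk, with the tail
// left as zeros.
const PAGE_SIZE: u64 = 4096;

fn align_up(len: u64) -> u64 {
    (len + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}

fn main() {
    assert_eq!(align_up(34240), 36864);
}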

@alessandrod

This looks like a bug; I think it should align all ranges except the last segment if it is shorter than the page size. Anyway, one more reason to stop flushing! 😏

@HaoranYi

Yeah. I have created PR #1543 to stop doing the flush. Looking at the metrics on mnb, this could save us about 2s for incremental snapshots and 5-7s for full snapshots.

@brooksprumo brooksprumo dismissed their stale review June 3, 2024 12:58

Now that v1.18 is rolling out to mainnet-beta, I don't feel confident backporting this PR until we fully understand why the snapshot archive size changed so much.

@brooksprumo

brooksprumo commented Jun 5, 2024

I have been running this branch (v1.18 + PR) for a bit now, and over seven full snapshots, the archive size does not change.

[screenshot: full snapshot archive sizes over seven snapshots]

It is my conclusion that archiving the storages directly does not change the archive size. There is something else happening.


While doing testing on this, I have found that for a storage file from the same slot[1] on master vs v1.18, the master one compresses more. Importantly, before compression, the storage files are exactly the same size.

I ran zstd on a storage file from master, and it compressed to approximately 100 KB smaller than v1.18.

Extrapolating, if every storage file was 100 KB smaller, that would be
= 100 KB/slot * 432,000 slots
= 43,200,000 KB
= 43.2 GB
smaller.

It's unlikely every storage compresses this much smaller, as older slots are smaller/have more dead stuff. But at a minimum, something is at play here.

I'm currently trying to actually root cause where in master this changed. Maybe flush changed the order of accounts being flushed? Maybe the rust data structure being flushed traverses the accounts differently?

Footnotes

  1. Using ledger-tool create-snapshot --incremental, I created a snapshot for one slot beyond the loaded snapshot using master, v1.18.14, and v1.17.34. This causes a flush to run prior to snapshotting. I then looked at this newly created highest storage file. This will have zero dead accounts in the storage (assuming no bugs). And the storage must have the same accounts for all the versions.

@brooksprumo brooksprumo left a comment

I'm back on board with this backport now.

By archiving the storages directly, we address a performance issue that occurs when archiving snapshots.

Since backports are meant for bug fixes only, should this qualify? I would suggest yes, it does qualify. If a node's performance is poor enough that it cannot operate, that has the same result as a logic bug that causes a panic; both result in a node that is inoperable.

I'd still like sign off from @jeffwashington on this before merging.

@HaoranYi

HaoranYi commented Jun 5, 2024

Follow up #1430 (comment)

I ran another experiment with one accounts file for slot 273663851.

sol@shared-dev-am-11:~/ledger$ ls -l testnet_snap_modi/accounts/273663851*
---------- 1 sol users 34240 Jan  1  1970 testnet_snap_modi/accounts/273663851.2923417
---------- 1 sol users  7458 Jan  1  1970 testnet_snap_modi/accounts/273663851.2923417.zst
sol@shared-dev-am-11:~/ledger$ ls -l testnet_snap_orig/accounts/273663851*
-rw-r--r-- 1 sol users 36864 May 28 15:13 testnet_snap_orig/accounts/273663851.1297167
-rw-r--r-- 1 sol users 13282 May 28 15:13 testnet_snap_orig/accounts/273663851.1297167.zst

              orig    new    diff
uncompressed  36864   34240   2624
compressed    13282    7458   5824

I removed the padding from the original file and used zstd to compress the truncated file. The resulting file is 13283 bytes, only one byte different.
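
A sketch of that comparison (assuming the zstd crate; the path and the 34240-byte valid length are taken from the listings above):

// Compress the original storage file with and without its trailing zero
// padding and compare the compressed sizes, mirroring the experiment above.
use std::fs;

fn main() -> std::io::Result<()> {
    let bytes = fs::read("testnet_snap_orig/accounts/273663851.1297167")?;
    let valid_len = 34240; // 0x85C0; everything past this offset is zero padding

    let full = zstd::encode_all(&bytes[..], 0)?;
    let truncated = zstd::encode_all(&bytes[..valid_len], 0)?;

    println!("compressed full file:      {} bytes", full.len());
    println!("compressed truncated file: {} bytes", truncated.len());
    Ok(())
}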

Therefore, padding can't be the cause of such a huge size difference between the snapshots.

I tend to agree with @brooksprumo's hunch that the order of the accounts in the files has changed, which makes them compress better.

@HaoranYi

HaoranYi commented Jun 5, 2024

The following two PRs will help make the account storage compress better:
#469
#476

sol@shared-dev-am-11:~/ledger$ grep 0000 -oh a.hex | wc -l
7386
sol@shared-dev-am-11:~/ledger$ grep 0000 -oh 273663851.mod.hex | wc -l
10597

The new account file produced seems to have more zeros.

@jeffwashington

I don't see demonstrated value in taking any chances by backporting this. The smaller snapshot size and the use of mmap will come along in the next release with a normal cycle of testing.

@brooksprumo

The following two PRs will help to make the account storage compress better. #469 #476

sol@shared-dev-am-11:~/ledger$ grep 0000 -oh a.hex | wc -l
7386
sol@shared-dev-am-11:~/ledger$ grep 0000 -oh 273663851.mod.hex | wc -l
10597

The new account file produced seems to have more zeros.

Nice find!

If I had just updated the store-tool to print out everything, we may've seen this right from the beginning!

a v1.18 stored account

AppendVec(AppendVecStoredAccountMeta { meta: StoredMeta { write_version_obsolete: 1239417977168, data_len: 0, pubkey: 8A4EdAdJ5NkEH9waDUEhKay8TBGHX3RTu3NEDXKCR9ib }, account_meta: AccountMeta { lamports: 32272557488, rent_epoch: 18446744073709551615, owner: 11111111111111111111111111111111, executable: false }, data: [], offset: 21076912, stored_size: 136, hash: AccountHash(AxUWPt4ZMntj3xQgk2pYHVDUqP1jQ3tjBpTCYiyVc4vV) })

a master stored account

AppendVec(AppendVecStoredAccountMeta { meta: StoredMeta { write_version_obsolete: 0, data_len: 0, pubkey: 8A4EdAdJ5NkEH9waDUEhKay8TBGHX3RTu3NEDXKCR9ib }, account_meta: AccountMeta { lamports: 32272557488, rent_epoch: 18446744073709551615, owner: 11111111111111111111111111111111, executable: false }, data: [], offset: 3816600, stored_size: 136, hash: AccountHash(11111111111111111111111111111111) })

The difference is the write version and account hash.
