v1.18: Archives storages directly (backport of #503) #1430
base: v1.18
Conversation
(cherry picked from commit 4247a8a)

# Conflicts:
#	accounts-db/src/accounts_file.rs
#	accounts-db/src/tiered_storage/hot.rs
#	accounts-db/src/tiered_storage/readable.rs
#	runtime/src/snapshot_utils.rs
Cherry-pick of 4247a8a has failed:
To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally |
The other day in Discord, we realized there was a significant mismatch between the snapshot sizes created by a v1.17/v1.18 validator and those created by a tip-of-master node. For example, on mainnet right now: [size comparison elided]

This is roughly a 25% reduction in snapshot size. I was chatting with Brooks and he mentioned that the commit this PR backports was a likely suspect for being the cause of the change. So, I started up a fresh testnet node running v1.18 + this BP. I made sure to download a fresh snapshot from a v1.18 node, so that I started with a "large" snapshot.

This plot shows the full snapshot size [plot and legend elided].

We also see that the size of incremental snapshots grows at a slower rate as the slot distance from the full snapshot slot grows. Writing snapshots is known to be stressful for the node, so the 25% drop in size seems like a pretty sizable win for a change that isn't too complex. Thus, I am in favor of BP'ing this change, but we should also get thoughts from @brooksprumo (who wrote the initial PR) and @jeffwashington as subject matter experts. |
I reviewed the commit that resolves the merge conflicts, and it looks good.
I am fine with this PR getting backported to v1.18, but I'm not strongly pushing for it. I'll defer to the others on the question of whether we should backport or not. I sign off on the code changes themselves.
Ship ittt |
Codecov Report
Attention: Patch coverage is [percentage elided].
Additional details and impacted files:

@@            Coverage Diff            @@
##            v1.18    #1430     +/-  ##
=========================================
- Coverage     81.6%    81.6%    -0.1%
=========================================
  Files          828      828
  Lines       225537   225532       -5
=========================================
- Hits        184177   184165      -12
- Misses       41360    41367       +7 |
Seems like a reasonable "bug" fix to me. I approve bp-ing.
@brooksprumo why does this PR reduce snapshot size so drastically? |
Not sure actually. This was an interesting discovery by @steviez. Maybe the ends of storage files contain lots of bytes that are not actually in the mmap's internal slice? Could this be due to removing the recycler? Or maybe we were/are oversizing files still, and the mmap's slice only contains the valid data? |
If we are snapshotting from the on-disk file, we may need to call a sync first. |
With this PR we now get the account storage data from the mmap directly and do not go to the filesystem, so my understanding is that we don't need to do this sync. Note that before archiving, we have already flushed the mmaps to disk as part of serializing the bank to disk, which is needed for fastboot. This happens in AccountsBackgroundService when we call add_bank_snapshot(). |
You are right, the mmap files are flushed. I verified that the sum of the mmap sizes is equal to the sum of the append vec file sizes. |
@brooksprumo, I already approved this on behalf of backport reviewers. It doesn't seem like there have been any changes pushed since then, have there? |
Nope, no new changes.
Ah ok, sorry about re-adding backport reviewers. I wasn't sure why I didn't see them listed as a reviewer, and I missed the "CriesofCarrots approved these changes on behalf of https://github.com/orgs/anza-xyz/teams/backport-reviewers" line. |
@brooksprumo @HaoranYi @jeffwashington - Any remaining concerns about merging this? |
I think this PR looks more correct to me. In this PR we choose exactly the account storages to put into the archive in a loop, while the old code put everything inside the staging directory into the archive. |
None from me |
I don't understand why this PR would cause a smaller snapshot file. That makes me nervous that we're skipping files or parts of files. Presumably the snapshots we create are correct. |
I made an experiment branch to compare the size of the mmap with the size of the actual file. The log shows that the sizes are actually the same.
|
Might it be the "ordering" of the files that are added to the archive? With the old code, we depend on the ordering of the files in the staging directory when they are added to the archive. Or would it be possible that there are left-over garbage files in the staging folder? |
Yes, this confirms what I thought: mmap.len() == file size. |
FWIW, I initially attempted to test this PR just by running ledger-tool create-snapshot. Based on some observations made in this comment above, I'm inclined to think the drop in snapshot size is correlated with accounts getting rewritten. So, I wouldn't expect to see a drop in snapshot size with this branch via ledger-tool either. As it relates to the experiment Jeff proposed, we might need to look fairly closely at what differs instead of a broad "this directory is now 25% smaller". |
I believe the snapshots we are creating are correct. Namely, my nodes (and all of the canaries and everyone else's tip-of-master dev nodes for that matter) haven't diverged from the cluster. If we were dropping account(s) or account data, I would expect the epoch accounts hash to diverge. |
We don't USE the snapshots we create in continuously running nodes, so the fact they are running well doesn't tell us anything about the quality of their snapshots, right? Something here has changed to make the snapshots smaller. I have not heard a good answer as to WHAT has changed that makes them smaller, and I have concerns that we don't understand what was going on (that we accidentally found made things better) or, worse, that what we're doing now (or were doing before) is somehow wrong. |
Fair point about a node that is continuously running. However, I know that I have restarted my nodes a handful of times, and the canaries are restarted on set schedules. That being said, all of these nodes could be using fastboot. As an easy experiment, let me restart one of my nodes with fastboot disabled in the background.
Understood. I'm not an expert in this area of the codebase which is why I pinged y'all 😉 . With the exchange of comments this morning, it is now much more clear to me that we still have work to do in understanding the "why", and you spelling out the experiment that Haoran is going to run / I can tag in on is helpful. |
I stopped my validator and used the ledger tool to create two snapshots, w/ and w/o this PR, right after the base slot.
The snapshot created with this PR is 1510359 bytes (~1.5 MB) smaller.
After untarring the zst file, the total number of files under accounts/ is the same, and the total size of the accounts dir with the new PR is a little smaller.
However, the filenames in the accounts folder are slightly different (probably due to different seeds when ledger-tool starts).
I wrote a simple python script to check that the accounts files for the same slot are the same across both snapshots.
Here is the python script, which can be used to compare the accounts files extracted from different snapshots.
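(The original script did not survive extraction; below is a minimal sketch of such a comparison, assuming accounts files are named <slot>.<id> and that matching slots should contain byte-identical files.)

```python
import hashlib
import sys
from pathlib import Path

def files_by_slot(accounts_dir):
    """Group accounts files by slot, assuming names of the form <slot>.<id>."""
    by_slot = {}
    for path in Path(accounts_dir).iterdir():
        if path.is_file():
            slot = path.name.split(".")[0]
            by_slot.setdefault(slot, []).append(path)
    return by_slot

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def compare(dir_a, dir_b):
    a, b = files_by_slot(dir_a), files_by_slot(dir_b)
    assert a.keys() == b.keys(), "snapshots cover different slots"
    for slot in sorted(a, key=int):
        # Compare content hashes, ignoring the (randomized) file ids.
        if sorted(map(sha256, a[slot])) != sorted(map(sha256, b[slot])):
            print(f"slot {slot}: DIFFERS")
    print("done")

if __name__ == "__main__":
    compare(sys.argv[1], sys.argv[2])
```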
|
Sorry to keep going around on this. This can't be mnb, right? Seems way too small. So the snapshot size on your machine was ~1MB smaller out of a 58G snapshot? |
Another thought I had, if a node without this PR boots from the new small snapshot, will it start creating snapshots with gradually increasing size over the next epoch, back up to the larger 77 GB size? |
I suspect that the first snapshot a node without this PR creates will be the normal size. The original snapshot size appears to have nothing to do with the accounts files it creates. I guess the improvement is because of ordering somehow, or maybe because we're using buffers instead of file handles, the compression is compressing across files better? |
This is for mnb.
I think @steviez has already pointed out in his comments that running ledger-tool to create a snapshot won't show much difference in snapshot size with this change. My theory is that, when running ledger-tool, we always start with the same base snapshot unpacked, so the locations of the accounts files on disk are the same. Therefore, archiving via the staging directory and archiving via the loop in this PR would add the files to the archive in almost the same order, and the output file size won't differ much. For long-running validators, though, the ordering of accounts files from the staging directory could be quite different from the loop in this PR. I am running another experiment to test this theory by randomizing the order of the accounts files during archiving. |
@steviez @brooksprumo Is it possible to get snapshots from two running validators against mnb for the same slot, with and without this PR? If so, then we can compare them directly. |
Randomizing the order of the account files doesn't make much difference (~58G).
|
@HaoranYi - I have snapshots with and without this PR for testnet. Testnet should be just fine for testing, and given that snapshots are an order of magnitude smaller on testnet, any analysis should go much quicker too. Per our DMs, I have copied them over to
Final note - I originally used this PR branch to "shrink" the snapshot, but the node has been running tip of master recently. But, I previously isolated the drop in snapshot size to this commit so I think this is fine |
The testnet snapshots from @steviez DO show a difference. Taking the accounts file for one slot as an example and looking at the binary of the original accounts file, it shows that [0x85c0..0x9000] are all zeros.
It seems like the mmap file didn't get synced to disk in the staging directory? (A small inspection sketch follows.)
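For reference, a sketch of this kind of inspection: check whether a file's size is page-aligned and count its trailing zero bytes (a 4096-byte page size is assumed):

```python
import sys

PAGE_SIZE = 4096  # assumed page size on the hosts in question

for path in sys.argv[1:]:
    with open(path, "rb") as f:
        data = f.read()
    # Trailing zeros at the end of the file are candidate alignment padding.
    trailing_zeros = len(data) - len(data.rstrip(b"\x00"))
    aligned = len(data) % PAGE_SIZE == 0
    print(f"{path}: size={len(data):#x} page_aligned={aligned} "
          f"trailing_zeros={trailing_zeros:#x}")
``` |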
note how one size is page aligned, the other is not |
Yeah, exactly. All accounts files in the original snapshot are page-aligned.
|
Seems like the on-disk representation of the mmap is page-aligned, while the in-memory representation is not. |
I find one thing interesting about these two snapshots: the total accounts file size difference is only 578M, but the compressed tarball difference is 2.6G. |
Why are we writing extra data to disk tho? Do we have a bug? Is it in flush()? |
Yes, it's in flush. We are calling flush at the beginning of taking a snapshot, i.e. add_bank_snapshot() in ABS, which calls into snapshot_utils (agave/runtime/src/snapshot_utils.rs, line 1262 at commit 2970654).
However, the memmap2 crate's flush adjusts the range to be page-aligned when syncing to disk. (A sketch of that rounding follows.)
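A sketch of the rounding behavior being described (illustrative Python, not the crate's actual code):

```python
PAGE_SIZE = 4096  # assumed page size

def aligned_flush_range(offset, length):
    """Round a flush request out to page boundaries.

    Rounding the end up past the true end of the data is what produces
    the zero padding observed at the tail of the on-disk files.
    """
    start = offset - offset % PAGE_SIZE  # round the start down to a page
    end = offset + length
    end += -end % PAGE_SIZE              # round the end up to a page
    return start, end - start

# Flushing 0x85c0 bytes from offset 0 syncs 0x9000 bytes to disk,
# matching the aligned vs unaligned sizes seen in the two snapshots.
start, length = aligned_flush_range(0, 0x85C0)
print(hex(start), hex(length))  # 0x0 0x9000
``` |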
This looks like a bug; I think it should align all ranges except the last segment if it's shorter than a page. Anyway, one more reason to stop flushing! 😏 |
Yeah. I have created PR #1543 to stop doing the flush. Looking at the metrics on mnb, this could save us about 2s per incremental snapshot and 5-7s per full snapshot. |
Now that v1.18 is rolling out to mainnet-beta, I don't feel confident backporting this PR until we fully understand why the snapshot archive size changed so much.
I have been running this branch (v1.18 + PR) for a bit now, and over seven full snapshots the archive size does not change. My conclusion is that archiving the storages directly does not change the archive size; there is something else happening. While testing this, I have found that for a storage file from the same slot on master vs v1.18, the master one compresses more. Importantly, before compression, the storage files are the exact same size. I ran the comparison [commands and output elided]. Extrapolating, if every storage file was 100 KB smaller, that would be [total elided]. It's unlikely every storage compresses this much smaller, as older slots are smaller/have more dead stuff, but minimally something is at play here. I'm currently trying to root cause where on master this changed. [Rest of comment and footnote elided.]
|
I'm back on board with this backport now.
By archiving the storages directly, we address a performance issue that occurs when archiving snapshots.
Since backports are meant for bug fixes only, should this qualify? I would suggest yes, it does. If a node's performance is poor enough that it cannot operate, that has the same result as a logic bug that causes a panic; both result in a node that is inoperable.
I'd still like sign off from @jeffwashington on this before merging.
Following up on #1430 (comment), I ran another experiment with one accounts file for slot 273663851.
I removed the padding from the original file and used zstd to compress the truncated file. The resulting file size was 13283 bytes, only one byte different. Therefore, padding wouldn't be the cause of such a huge file size difference between the snapshots. I tend to agree with @brooksprumo's hunch that the order of the accounts in the files has changed, which makes them compress better. (A sketch of the experiment follows.)
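A sketch of this experiment (requires the zstandard package; the accounts file name here is a placeholder):

```python
# pip install zstandard
import zstandard

def zstd_size(data):
    return len(zstandard.ZstdCompressor().compress(data))

with open("273663851.1234", "rb") as f:  # placeholder accounts file path
    data = f.read()

stripped = data.rstrip(b"\x00")  # drop the page-alignment padding
print(f"original : {len(data)} bytes -> {zstd_size(data)} compressed")
print(f"truncated: {len(stripped)} bytes -> {zstd_size(stripped)} compressed")
``` |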
I don't see demonstrated value in why we would take any chances backporting this. The smaller snapshot size and the use of mmap will come along in the next release with a normal cycle of testing. |
Nice find! If I had just updated the store-tool to print out everything, we may've seen this right from the beginning!

a v1.18 stored account: [store-tool output elided]

a master stored account: [store-tool output elided]

The difference is the write version and account hash. (A toy demonstration of why that matters for compression follows.)
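To illustrate the point (a toy demonstration, not the real storage layout; requires the zstandard package): two blobs of identical size, one carrying a varying write version and a random hash per record, one with those fields zeroed. The zeroed one compresses far smaller.

```python
import os
import zstandard

RECORDS = 1000
payload = os.urandom(128)  # same account data in every record, for simplicity

# v1.18-like: each record has a distinct write version and a random 32-byte hash.
v1_18 = b"".join(
    i.to_bytes(8, "little") + os.urandom(32) + payload for i in range(RECORDS)
)
# master-like: write version and account hash are zeroed.
master = b"".join(bytes(8) + bytes(32) + payload for _ in range(RECORDS))

assert len(v1_18) == len(master)  # identical uncompressed sizes
c = zstandard.ZstdCompressor()
print("v1.18-like :", len(c.compress(v1_18)))
print("master-like:", len(c.compress(master)))  # much smaller
``` |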
Problem
When archiving a snapshot, we need to add all the storages to the archive. Currently, we flush each file and then have the archiver load the file from disk.
Since the storages are all resident/in-sync with the validator process, we know they are all up to date. Also, we already flushed all the storages to disk in AccountsBackgroundService when calling add_bank_snapshot(), so flushing again is unnecessary and negatively impacts the rest of the system¹.

Summary of Changes
Instead of flushing to disk first, get the storages data directly from the mmaps and add it to the archive.
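The idea, sketched in Python with the tarfile module (the real implementation is in Rust; names here are illustrative): stream each storage's bytes into the archive from memory instead of flushing to disk and having the archiver re-read the file.

```python
import io
import tarfile

def archive_storages(archive_path, storages):
    """Add each storage's bytes to a tar archive straight from memory.

    `storages` is an iterable of (name, bytes) pairs, standing in for the
    validator's mmap-backed account storages.
    """
    with tarfile.open(archive_path, "w") as tar:
        for name, data in storages:
            info = tarfile.TarInfo(name=f"accounts/{name}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

archive_storages("snapshot.tar", [("123.0", b"\x00" * 64), ("124.0", b"\x01" * 32)])
```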
Additional Testing
Running ledger-tool create-snapshot on mnb showed that this PR does not impact how long it takes to archive a snapshot.

master: [timings elided]
this pr: [timings elided]
I also tested loading from this new snapshot created with this PR to ensure it actually works properly. Happy to report that was successful too.
This is an automatic backport of pull request #503 done by Mergify.
Footnotes

¹ Quantifying the negative impacts is something that I believe @alessandrod has done/shared.