
Reduce deserialization allocations/copies #1197

Merged: alessandrod merged 2 commits into anza-xyz:master on May 7, 2024

Conversation

@alessandrod:

This removes an allocation and a copy in AccountSharedData::reserve(). Calling data_mut().reserve(additional) used to result in two allocs and memcpys: the first to unshare the underlying vector, and the second upon calling reserve(), since Arc::make_mut clones the vector and the clone's capacity == its len. With this change we now manually "unshare", allocating with capacity = len + additional, thereby saving the extra alloc and memcpy.
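
To make the two-allocation problem concrete, here's a minimal sketch; reserve_naive and reserve_unshared are hypothetical stand-ins for the old and new AccountSharedData::reserve() behavior, not the actual agave code:

```rust
use std::sync::Arc;

// Old behavior: if `data` is shared, Arc::make_mut clones the Vec, and
// Vec's Clone allocates with capacity == len. The reserve() that follows
// must then allocate *again* and memcpy the bytes a second time.
fn reserve_naive(data: &mut Arc<Vec<u8>>, additional: usize) {
    Arc::make_mut(data).reserve(additional);
}

// New behavior: detect the shared case and "unshare" manually, sizing the
// new buffer to the final capacity up front, so there is exactly one
// alloc and one memcpy.
fn reserve_unshared(data: &mut Arc<Vec<u8>>, additional: usize) {
    if Arc::get_mut(data).is_none() {
        let mut unshared = Vec::with_capacity(data.len() + additional);
        unshared.extend_from_slice(data.as_slice());
        *data = Arc::new(unshared);
    } else {
        // Already unique: make_mut won't clone, so a plain reserve is fine.
        Arc::make_mut(data).reserve(additional);
    }
}
```

For a shared account buffer this halves the copy work: one allocation and one memcpy of the existing bytes instead of two of each.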

Additionally we skip another copy in AccountSharedData::set_data_from_slice(). We used to call make_data_mut() from set_data_from_slice() from the days when direct mapping couldn't deal with accounts getting shrunk. That changed in solana-labs#32649 (see the if callee_account.capacity() < min_capacity check in cpi.rs:update_caller_account()). We now don't call make_data_mut() before set_data_from_slice() anymore, saving the cost of a memcpy since set_data_from_slice() overwrites the whole account content anyway.
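
As a rough sketch of this second change (a simplified single-field struct, not the actual agave implementation), the new path overwrites or rebuilds the buffer without ever copying the old contents:

```rust
use std::sync::Arc;

// Minimal stand-in for AccountSharedData; the real struct has more
// fields, this keeps only the data buffer relevant to the change.
struct AccountSharedData {
    data: Arc<Vec<u8>>,
}

impl AccountSharedData {
    fn set_data_from_slice(&mut self, new: &[u8]) {
        // Previously make_data_mut() ran first: if `data` was shared it
        // allocated a new Vec and memcpy'd the *old* contents into it, a
        // copy that's wasted because everything is overwritten below.
        if let Some(vec) = Arc::get_mut(&mut self.data) {
            // Unique owner: overwrite in place, reusing the allocation.
            vec.clear();
            vec.extend_from_slice(new);
        } else {
            // Shared: build the new buffer directly from `new`, never
            // touching (or copying) the old bytes.
            self.data = Arc::new(new.to_vec());
        }
    }
}
```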

These two changes improve deserialization perf.

Before: [profiler screenshot, 2024-05-06 4:54 pm]

After: [profiler screenshot, 2024-05-06 4:55 pm]

On my mnb node this reduces the cost of deserialization from 16% to 5%. It improves overall replay perf by 6%, although I expect the savings to be even larger on nodes with a lot of stake, which seem to struggle more with memory allocations.

Commit 1: Calling data_mut().reserve(additional) used to result in _two_ allocs
and memcpys: the first to unshare the underlying vector, and the second
upon calling reserve(), since Arc::make_mut clones the vector and the
clone's capacity == its len.

With this fix we manually "unshare", allocating with capacity = len +
additional, thereby saving the extra alloc and memcpy.
Commit 2: We used to call make_data_mut() from set_data_from_slice() from the days
when direct mapping couldn't deal with accounts getting shrunk. That
changed in solana-labs#32649 (see the
if callee_account.capacity() < min_capacity check in
cpi.rs:update_caller_account()).

With this change we don't call make_data_mut() anymore before
set_data_from_slice(), saving the cost of a memcpy since
set_data_from_slice() overwrites the whole account content anyway.
Comment on lines +872 to +875
// Note that we intentionally don't call self.make_data_mut() here. make_data_mut() will
// allocate + memcpy the current data if self.account is shared. We don't need the memcpy
// here tho because account.set_data_from_slice(data) is going to replace the content
// anyway.
@seanyoung commented on May 7, 2024:


Very nice catch finding that self.make_data_mut() is redundant. Totally makes sense.

Suggested change:

- // Note that we intentionally don't call self.make_data_mut() here. make_data_mut() will
- // allocate + memcpy the current data if self.account is shared. We don't need the memcpy
- // here tho because account.set_data_from_slice(data) is going to replace the content
- // anyway.
+ // Note that we intentionally don't call self.make_data_mut() here, since we are replacing
+ // the contents transaction wide anyway with account.set_data_from_slice(data)

@alessandrod (Author) replied:

I'm +0 on the edit, so I think I'm going to keep my comment. I think it's worth being explicit about make_data_mut() doing a copy, since the method name itself doesn't immediately suggest that.

alessandrod merged commit f180b08 into anza-xyz:master on May 7, 2024
48 checks passed
@ryoqun (Member) commented on May 8, 2024:

Much like in #1192 (comment), I did similar benchmarking, and I confirmed that this favorably improves unified scheduler performance yet again.

before (at 206a87a) (love bureaucratic work? lol; this result is almost the same as the after result (at 10e5086) from the #1192 benchmark)

--block-verification-method blockstore-processor:
ledger processed in 32 seconds, 286 ms
ledger processed in 32 seconds, 470 ms
ledger processed in 32 seconds, 569 ms

--block-verification-method unified-scheduler:
ledger processed in 16 seconds, 847 ms
ledger processed in 17 seconds, 331 ms
ledger processed in 17 seconds, 310 ms

after (at f180b08)

--block-verification-method blockstore-processor:
ledger processed in 31 seconds, 335 ms
ledger processed in 31 seconds, 164 ms
ledger processed in 31 seconds, 473 ms

--block-verification-method unified-scheduler:
ledger processed in 16 seconds, 512 ms
ledger processed in 16 seconds, 601 ms
ledger processed in 16 seconds, 363 ms

The blockstore-processor sees a ~1 sec reduction (~3% faster), while the unified scheduler sees a ~0.8 sec reduction (~5% faster).

The unified scheduler is now basically twice as fast as the blockstore processor. :)
