
Reduce deserialization allocations/copies #1197

Merged: alessandrod merged 2 commits into anza-xyz:master on May 7, 2024

Conversation

@alessandrod:

This removes an allocation and a copy in AccountSharedData::reserve(). Calling data_mut().reserve(additional) used to result in two allocs and memcpys: the first to unshare the underlying vector, and the second upon calling reserve(), since Arc::make_mut clones the vector and the clone's capacity == its len. With this change we now manually "unshare", allocating with capacity = len + additional, thereby saving the extra alloc and memcpy.
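
To make the two-allocation problem concrete, here's a minimal sketch; reserve_naive and reserve_unshared are hypothetical stand-ins for the old and new AccountSharedData::reserve() behavior, not the actual agave code:

```rust
use std::sync::Arc;

// Old behavior: if `data` is shared, Arc::make_mut clones the Vec, and
// Vec's Clone allocates with capacity == len. The reserve() that follows
// must then allocate *again* and memcpy the bytes a second time.
fn reserve_naive(data: &mut Arc<Vec<u8>>, additional: usize) {
    Arc::make_mut(data).reserve(additional);
}

// New behavior: detect the shared case and "unshare" manually, sizing the
// new buffer to the final capacity up front, so there is exactly one
// alloc and one memcpy.
fn reserve_unshared(data: &mut Arc<Vec<u8>>, additional: usize) {
    if Arc::get_mut(data).is_none() {
        let mut unshared = Vec::with_capacity(data.len() + additional);
        unshared.extend_from_slice(data.as_slice());
        *data = Arc::new(unshared);
    } else {
        // Already unique: make_mut won't clone, so a plain reserve is fine.
        Arc::make_mut(data).reserve(additional);
    }
}
```

For a shared account buffer this halves the copy work: one allocation and one memcpy of the existing bytes instead of two of each.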

Additionally we skip another copy in AccountSharedData::set_data_from_slice(). We used to call make_data_mut() from set_data_from_slice() from the days when direct mapping couldn't deal with accounts getting shrunk. That changed in solana-labs#32649 (see the if callee_account.capacity() < min_capacity check in cpi.rs:update_caller_account()). We now don't call make_data_mut() before set_data_from_slice() anymore, saving the cost of a memcpy since set_data_from_slice() overwrites the whole account content anyway.
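
As a rough sketch of this second change (a simplified single-field struct, not the actual agave implementation), the new path overwrites or rebuilds the buffer without ever copying the old contents:

```rust
use std::sync::Arc;

// Minimal stand-in for AccountSharedData; the real struct has more
// fields, this keeps only the data buffer relevant to the change.
struct AccountSharedData {
    data: Arc<Vec<u8>>,
}

impl AccountSharedData {
    fn set_data_from_slice(&mut self, new: &[u8]) {
        // Previously make_data_mut() ran first: if `data` was shared it
        // allocated a new Vec and memcpy'd the *old* contents into it, a
        // copy that's wasted because everything is overwritten below.
        if let Some(vec) = Arc::get_mut(&mut self.data) {
            // Unique owner: overwrite in place, reusing the allocation.
            vec.clear();
            vec.extend_from_slice(new);
        } else {
            // Shared: build the new buffer directly from `new`, never
            // touching (or copying) the old bytes.
            self.data = Arc::new(new.to_vec());
        }
    }
}
```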

These two changes improve deserialization perf.

Before: [profiler screenshot, 2024-05-06 4:54 pm]

After: [profiler screenshot, 2024-05-06 4:55 pm]

On my mnb node this reduces the cost of deserialization from 16% to 5%. It improves overall replay perf by 6%, although I expect the savings to be even larger on nodes with a lot of stake, which seem to struggle more with memory allocations.

Commit 1: Calling data_mut().reserve(additional) used to result in _two_ allocs
and memcpys: the first to unshare the underlying vector, and the second
upon calling reserve(), since Arc::make_mut clones the vector and the
clone's capacity == its len.

With this fix we manually "unshare", allocating with capacity = len +
additional, thereby saving the extra alloc and memcpy.
Commit 2: We used to call make_data_mut() from set_data_from_slice() from the days
when direct mapping couldn't deal with accounts getting shrunk. That
changed in solana-labs#32649 (see the
if callee_account.capacity() < min_capacity check in
cpi.rs:update_caller_account()).

With this change we don't call make_data_mut() anymore before
set_data_from_slice(), saving the cost of a memcpy since
set_data_from_slice() overwrites the whole account content anyway.
Comment on lines +872 to +875
// Note that we intentionally don't call self.make_data_mut() here. make_data_mut() will
// allocate + memcpy the current data if self.account is shared. We don't need the memcpy
// here tho because account.set_data_from_slice(data) is going to replace the content
// anyway.
@seanyoung commented on May 7, 2024:


Very nice catch finding that self.make_data_mut() is redundant. Totally makes sense.

Suggested change:

- // Note that we intentionally don't call self.make_data_mut() here. make_data_mut() will
- // allocate + memcpy the current data if self.account is shared. We don't need the memcpy
- // here tho because account.set_data_from_slice(data) is going to replace the content
- // anyway.
+ // Note that we intentionally don't call self.make_data_mut() here, since we are replacing
+ // the contents transaction wide anyway with account.set_data_from_slice(data)

@alessandrod (Author) replied:

I'm +0 on the edit, so I think I'm going to keep my comment. I think it's worth being explicit about make_data_mut() doing a copy, since the method name itself doesn't immediately suggest that.

alessandrod merged commit f180b08 into anza-xyz:master on May 7, 2024
48 checks passed
@ryoqun (Member) commented on May 8, 2024:

Much like in #1192 (comment), I did similar benchmarking, and I confirmed that this favorably improves unified scheduler performance yet again.

before (at 206a87a) (love bureaucratic work? lol; this result is almost the same as the after result (at 10e5086) from the #1192 benchmark)

--block-verification-method blockstore-processor:
ledger processed in 32 seconds, 286 ms
ledger processed in 32 seconds, 470 ms
ledger processed in 32 seconds, 569 ms

--block-verification-method unified-scheduler:
ledger processed in 16 seconds, 847 ms
ledger processed in 17 seconds, 331 ms
ledger processed in 17 seconds, 310 ms

after (at f180b08)

--block-verification-method blockstore-processor:
ledger processed in 31 seconds, 335 ms
ledger processed in 31 seconds, 164 ms
ledger processed in 31 seconds, 473 ms

--block-verification-method unified-scheduler:
ledger processed in 16 seconds, 512 ms
ledger processed in 16 seconds, 601 ms
ledger processed in 16 seconds, 363 ms

The blockstore-processor sees a ~1 sec reduction (~3% faster), while the unified scheduler sees a ~0.8 sec reduction (~5% faster).

The unified scheduler is now basically twice as fast as the blockstore processor. :)
