
program cache: reduce contention #1192

Merged: 2 commits merged into anza-xyz:master on May 6, 2024

Conversation

alessandrod

Before this change, we took the write lock to call extract(). This meant that even in the ideal case (all programs already cached), the cache was contended by all batches and all operations were serialized.

With this change, we take the write lock only when we store a new entry in the cache, and take the read lock to call extract(). This means that in the common case where most or all programs are cached, there is no contention and all batches progress in parallel.

This improves node replay perf by 20-25% on current mainnet-beta (mnb) traffic.
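To illustrate the change, here is a minimal sketch of the before/after locking pattern using std::sync::RwLock, with placeholder key/value types and method names (not the actual ProgramCache API):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Placeholder cache; types and method names are illustrative only.
struct Cache {
    entries: RwLock<HashMap<u64, String>>,
}

impl Cache {
    // Before: every lookup took the write lock, so all batches serialized on it.
    fn extract_before(&self, key: &u64) -> Option<String> {
        let entries = self.entries.write().unwrap();
        entries.get(key).cloned()
    }

    // After: lookups take the read lock, so fully cached batches run in parallel.
    fn extract_after(&self, key: &u64) -> Option<String> {
        let entries = self.entries.read().unwrap();
        entries.get(key).cloned()
    }

    // The write lock is taken only when a newly loaded entry is stored.
    fn store(&self, key: u64, value: String) {
        self.entries.write().unwrap().insert(key, value);
    }
}
```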

Before:
[screenshot: profile, 2024-05-05 10:31 am]

After:

...can't show because there's nothing to show :P But I can show a zoomed-in profile of replenish.

Before:
[screenshot: replenish profile, 2024-05-05 6:26 pm]

After:
[screenshot: replenish profile, 2024-05-05 6:28 pm]

You can see that locking on write() goes from taking 80% of replenish to 0.4%.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 79.41176%, with 7 lines in your changes missing coverage. Please review.

Project coverage is 82.1%. Comparing base (b0cfffb) to head (b688cac).

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #1192     +/-   ##
=========================================
- Coverage    82.1%    82.1%   -0.1%     
=========================================
  Files         886      886             
  Lines      236372   236369      -3     
=========================================
- Hits       194204   194185     -19     
- Misses      42168    42184     +16     

/// It is possible that multiple TX batches from different slots need different versions of a
/// program. The deployment slot of a program is only known after load, though,
/// so all loads for a given program key are serialized.
loading_entries: Mutex<HashMap<Pubkey, (Slot, std::thread::ThreadId)>>,

In the replacement of the index structure I am working on, I also experimented with pulling cooperative_loading_lock out of the SecondLevel. We might be able to take this even further: when we use the account's last-modified slot, we could also parallelize the loading of missing programs. Basically, the "deployment slot of a program is only known after load" part could be improved then.
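To make the serialization described in the doc comment above concrete, here is a hedged sketch with placeholder types (Key standing in for Pubkey; not the actual agave implementation) of how a per-key in-flight entry lets only one thread load a given program at a time:

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread::{self, ThreadId};

// Hypothetical stand-ins for Pubkey and Slot.
type Key = [u8; 32];
type Slot = u64;

// At most one thread loads a given program key at a time; other threads see
// the in-flight entry and wait for (or retry after) that load instead of
// loading the same program again.
struct LoadingEntries {
    in_flight: Mutex<HashMap<Key, (Slot, ThreadId)>>,
}

impl LoadingEntries {
    /// Returns true if the caller won the right to load `key` for `slot`.
    fn try_start_load(&self, key: Key, slot: Slot) -> bool {
        let mut in_flight = self.in_flight.lock().unwrap();
        if in_flight.contains_key(&key) {
            false // another thread is already loading this program
        } else {
            in_flight.insert(key, (slot, thread::current().id()));
            true
        }
    }

    /// Called by the loading thread once the loaded program has been stored.
    fn finish_load(&self, key: &Key) {
        self.in_flight.lock().unwrap().remove(key);
    }
}
```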

@Lichtso

Lichtso commented May 5, 2024

Do you plan on backporting this to v1.18? Because I also need to add a parameter to finish_cooperative_loading_task() to fix the recompilation phase issue that happened last Friday. Thus, these will cause merge conflicts.

@alessandrod
Author

alessandrod commented May 6, 2024

> Do you plan on backporting this to v1.18? Because I also need to add a parameter to finish_cooperative_loading_task() to fix the recompilation phase issue that happened last Friday. Thus, these will cause merge conflicts.

This is patch 1 of N in my attempt to improve skip rate on one of our nodes, which struggles in replay during leader slots. I plan to try to backport some of the patches, but not this one, because of how tricky the program cache logic is, and also because we still don't have multi-threaded tests for the cache 😭. I've been running a node with this for 3 days and it is still making roots, but I def don't want to risk it.

If your fix is ready I don't mind merging after yours as I imagine the conflicts on this PR will be easy to reconcile. Or I'm happy to merge first, up to you really!

Lichtso merged commit 10e5086 into anza-xyz:master on May 6, 2024
38 checks passed
@ryoqun
Member

ryoqun commented May 8, 2024

As this PR is similar to mine (#1037), I confirmed that it favorably improves unified scheduler performance as well. Thanks for fixing the last write-lock contention source in the program cache. :)

I measured the perf before/after the merged commit (10e5086). As quick context, this benchmarking was conducted by replaying a short slice of mainnet-beta ledgers using the same minimized snapshot (repro steps: solana-labs#35286 (comment)).

As said above, I saw a consistent performance improvement in my testing as well, although I couldn't see 20-25%. I think this is due to it being solana-ledger-tool (offline; ample free CPU cores) vs solana-validator (online; the system is kind of overloaded... meh).

Before (at aa6c69a):

--block-verification-method blockstore-processor:

ledger processed in 32 seconds, 572 ms
ledger processed in 32 seconds, 747 ms
ledger processed in 32 seconds, 571 ms

--block-verification-method unified-scheduler:

ledger processed in 18 seconds, 164 ms
ledger processed in 17 seconds, 930 ms
ledger processed in 18 seconds, 111 ms

After (at 10e5086):

--block-verification-method blockstore-processor:

ledger processed in 32 seconds, 393 ms
ledger processed in 32 seconds, 333 ms
ledger processed in 32 seconds, 343 ms

--block-verification-method unified-scheduler:

ledger processed in 17 seconds, 187 ms
ledger processed in 17 seconds, 220 ms
ledger processed in 16 seconds, 871 ms

Similar to #1037, the unified scheduler saw a drastic improvement (5-6% faster), while the blockstore processor saw a small but consistent gain (<1% faster).

@@ -401,49 +400,51 @@ impl<FG: ForkGraph> TransactionBatchProcessor<FG> {
&program_cache,
));
}
// Submit our last completed loading task.
if let Some((key, program)) = program_to_store.take() {
loaded_programs_for_txs.as_mut().unwrap().loaded_missing = true;

Lichtso commented May 28, 2024


@alessandrod I think you accidentally deleted this line here while moving the code. It was added in #1037, which was open around the same time as this PR, so most likely it was a merge conflict that was not resolved properly.

When I investigated the validator node of @buffalojoec (which ran out of disk space because it was not configured to rotate log files), I discovered that the program cache was growing far beyond its capacity because it was no longer evicting.

Can you open a PR to add the line back in?

I will add a metric for the number of loaded entries (complementary to the evictions metric we already have) recorded in ProgramCache::evict_using_2s_random_selection() so that this doesn't happen again.
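For context, here is a minimal hedged sketch of the mechanism implied by this comment (placeholder types and names, not the actual agave code): eviction is only attempted when a batch reports having loaded missing programs, so losing the loaded_missing assignment means eviction never runs.

```rust
// Stand-in for the per-batch cache view that records whether this batch
// had to load any programs that were missing from the global cache.
struct CacheForTxBatchSketch {
    loaded_missing: bool,
}

// Stand-in for the global program cache.
struct CacheSketch {
    entries: Vec<u64>, // placeholder for cached program entries
    capacity: usize,
}

impl CacheSketch {
    fn maybe_evict(&mut self, batch: &CacheForTxBatchSketch) {
        // If `batch.loaded_missing` is never set to true (the deleted line
        // discussed above), this branch is never taken and the cache grows
        // without bound.
        if batch.loaded_missing && self.entries.len() > self.capacity {
            // Stand-in for the real random 2s-selection eviction.
            self.entries.truncate(self.capacity);
        }
    }
}
```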
