Reduce ProgramCache write lock contention #1037

Merged: ryoqun merged 3 commits into anza-xyz:master on Apr 27, 2024

Conversation

@ryoqun ryoqun commented Apr 25, 2024

Problem

ProgramCache's current locking behavior is needlessly hurting the unified scheduler's performance.

The unified scheduler is a new transaction scheduler for block verification (i.e. the replaying stage). As a design choice, it doesn't employ batching at all. Technically, the unified scheduler still uses TransactionBatches, but each contains only 1 transaction. That means it bears all the extra overhead that batching would otherwise amortize.

Sidenote: I believe these overheads are all solvable (but not SOON, except this one...). Also note that it's already 1.8x faster than the to-be-replaced blockstore_processor after this PR.

Among all those overhead sources, one of the most visible is ProgramCache's write-lock contention. Currently, ProgramCache is write-locked 3 times unconditionally per 1-tx batch (for loading by replenish_program_cache(), for evictions by load_and_execute_sanitized_transactions(), and for committing updated programs by commit_transactions()). So it acutely hampers the unified scheduler's concurrency.

Summary of Changes

Reduce write-locking of ProgramCache to the bare minimum. Now the usual case is 1 write lock per batch (for loading by replenish_program_cache()), while the worst case still remains at 3.

This gives a roughly ~5% consistent improvement. Also note that the blockstore-processor isn't affected, even though both take the same code path.
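
For illustration only, here is a minimal, self-contained sketch of the locking pattern (not the actual agave code; MockProgramCache and the function names are made up). Before, every 1-tx batch took three write locks unconditionally; after, only the loading lock is unconditional, and the eviction/commit locks are taken only when they can matter:

use std::sync::RwLock;

// Hypothetical stand-in for ProgramCache.
#[derive(Default)]
struct MockProgramCache {
    entries: Vec<String>,
}

// Before: three unconditional write locks per (1-tx) batch.
fn process_batch_before(cache: &RwLock<MockProgramCache>) {
    drop(cache.write().unwrap()); // loading: replenish_program_cache()
    drop(cache.write().unwrap()); // eviction: load_and_execute_sanitized_transactions()
    drop(cache.write().unwrap()); // commit: commit_transactions()
}

// After: one write lock for loading; eviction/commit locks only when needed.
fn process_batch_after(
    cache: &RwLock<MockProgramCache>,
    loaded_missing: bool,               // did loading pull anything new into the cache?
    programs_modified_by_tx: &[String], // did any tx deploy or modify a program?
) {
    drop(cache.write().unwrap()); // loading is still unconditional
    if loaded_missing {
        drop(cache.write().unwrap()); // evict only if the cache may have grown
    }
    if !programs_modified_by_tx.is_empty() {
        cache
            .write()
            .unwrap()
            .entries
            .extend_from_slice(programs_modified_by_tx); // commit only when needed
    }
}

fn main() {
    let cache = RwLock::new(MockProgramCache::default());
    process_batch_before(&cache);
    process_batch_after(&cache, false, &[]);
}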

perf improvements

before
# --block-verification-method unified-scheduler
ledger processed in 19 seconds, 182 ms
ledger processed in 19 seconds, 375 ms
ledger processed in 19 seconds, 460 ms

# --block-verification-method blockstore-processor
ledger processed in 32 seconds, 845 ms
ledger processed in 32 seconds, 786 ms
ledger processed in 32 seconds, 853 ms
after

After this PR, the unified scheduler's speedup factor over blockstore-processor is 1.8x:

irb(main):001:0> 32.691/17.970
=> 1.8191986644407347
# --block-verification-method unified-scheduler
ledger processed in 18 seconds, 77 ms
ledger processed in 17 seconds, 926 ms
ledger processed in 18 seconds, 126 ms

# --block-verification-method blockstore-processor
ledger processed in 32 seconds, 773 ms
ledger processed in 32 seconds, 691 ms
ledger processed in 32 seconds, 451 ms

(for the record) the merged commit

before(3f7b352):

ledger processed in 19 seconds, 347 ms
ledger processed in 19 seconds, 314 ms
ledger processed in 19 seconds, 278 ms

after(90bea33):

ledger processed in 18 seconds, 29 ms
ledger processed in 18 seconds, 78 ms
ledger processed in 18 seconds, 114 ms

context: extracted from #593

cc: @apfitzge

@@ -292,6 +292,7 @@ impl<FG: ForkGraph> TransactionBatchProcessor<FG> {

execution_time.stop();

/*
ryoqun (Member Author)

firstly, let's see which test should fail. :)

@ryoqun ryoqun (Member Author) Apr 25, 2024

hehe, turned out there's no test to test this code specifically...

@codecov-commenter commented Apr 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.0%. Comparing base (8e331e1) to head (cce3075).
Report is 9 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1037   +/-   ##
=======================================
  Coverage    82.0%    82.0%           
=======================================
  Files         860      860           
  Lines      232898   232911   +13     
=======================================
+ Hits       191071   191104   +33     
+ Misses      41827    41807   -20     

@ryoqun ryoqun force-pushed the program-cache-reduced-lock-contention branch from 86c433e to 1073cf7 on April 25, 2024 07:31
@ryoqun ryoqun force-pushed the program-cache-reduced-lock-contention branch from 1073cf7 to cc12279 on April 25, 2024 08:01
@ryoqun ryoqun changed the title (wip) Reduce ProgramCache write lock contention Reduce ProgramCache write lock contention Apr 25, 2024
@ryoqun ryoqun requested a review from Lichtso April 25, 2024 08:02
@ryoqun ryoqun marked this pull request as ready for review April 25, 2024 08:02
@@ -4162,7 +4162,7 @@ impl Bank {
programs_modified_by_tx,
} = execution_result
{
- if details.status.is_ok() {
+ if details.status.is_ok() && !programs_modified_by_tx.is_empty() {
@ryoqun ryoqun (Member Author) Apr 25, 2024

hope this one is fairly uncontroversial. haha

// ProgramCache entries. Note that this flag is deliberately defined, so that there's still
// at least one other batch, which will evict the program cache, even after the occurrences
// of cooperative loading.
if programs_loaded_for_tx_batch.borrow().loaded_missing {
@ryoqun ryoqun (Member Author) Apr 25, 2024

I guess this one isn't so straightforward. Better ideas are very welcome.


Global cache can also grow via cache.merge(programs_modified_by_tx); above, not just by loading missing.

ryoqun (Member Author)

how about this? d50c11c

@Lichtso Lichtso Apr 25, 2024

Better, but the number of insertions and evictions can still be unbalanced because it is only a boolean.
Also, maybe we should move eviction to the place where we merge in new deployments? That way they could share a write lock.

@ryoqun ryoqun (Member Author) Apr 25, 2024

Better, but the number of insertions and evictions can still be unbalanced because it is only a boolean.

I intentionally chose a boolean, thinking the number of insertions and evictions doesn't need to be balanced. That's because evict_using_2s_random_selection() keeps evicting entries until they're under 90% of MAX_LOADED_ENTRY_COUNT (=256) with just a single invocation. So, we just need to ensure it's called with sufficient frequency/timing to avoid a cache-bomb DoS attack.
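
As a rough, hypothetical illustration of this property (the constant mirrors the one mentioned above, but victim selection is simplified to popping arbitrary entries rather than the real 2s-random-selection):

const MAX_LOADED_ENTRY_COUNT: usize = 256;

// Simplified eviction: keep dropping entries until the cache is below 90% of capacity.
// The real evict_using_2s_random_selection() picks victims differently, but the key
// property is the same: one invocation restores the size bound no matter how many
// insertions happened since the last call.
fn evict_to_threshold(entries: &mut Vec<u64>) {
    let target = MAX_LOADED_ENTRY_COUNT * 90 / 100;
    while entries.len() > target {
        entries.pop(); // placeholder victim selection
    }
}

fn main() {
    // Even after an unbalanced burst of insertions...
    let mut entries: Vec<u64> = (0..1_000).collect();
    // ...a single eviction pass brings the cache back under the threshold.
    evict_to_threshold(&mut entries);
    assert!(entries.len() <= MAX_LOADED_ENTRY_COUNT * 90 / 100);
    println!("entries after eviction: {}", entries.len());
}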

Also, maybe we should move eviction to the place where we merge in new deployments? That way they could share a write lock.

This is possible and it looks appealing; however, it isn't trivial. Firstly, load_and_execute_sanitized_transactions can be entered via 3 code paths: replaying, banking, and rpc tx simulation. I guess that's the reason this eviction is placed here to begin with, as the most shared code path for all transaction execution?

The place where we merge in new deployments is commit_transactions(), which isn't touched by rpc tx simulation for obvious reasons. So, moving this eviction there would expose an unbounded program-cache-entry growth DoS (theoretically; assuming no new blocks for an extended duration). Also, replaying and banking take the commit code path with slightly different semantics. So it needs a bit of care to move this eviction nevertheless, even if we ignore the rpc concern...

All that said, I think the current code change should be good enough and safe enough?

@@ -4162,7 +4162,7 @@ impl Bank {
programs_modified_by_tx,
} = execution_result
{
- if details.status.is_ok() {
+ if details.status.is_ok() && !programs_modified_by_tx.is_empty() {
let mut cache = self.transaction_processor.program_cache.write().unwrap();
@ryoqun ryoqun (Member Author) Apr 25, 2024

actually, while writing this, I noticed that this write-lock is taken per tx if the batch contains 2 or more transactions: #1037 (comment)

@Lichtso Lichtso Apr 25, 2024

You are right. How about:

if execution_results.iter().any(|execution_result| {
    matches!(
        execution_result,
        TransactionExecutionResult::Executed { details, programs_modified_by_tx }
            if details.status.is_ok() && !programs_modified_by_tx.is_empty()
    )
}) {
    let mut cache = self.transaction_processor.program_cache.write().unwrap();
    for execution_result in &execution_results {
        if let TransactionExecutionResult::Executed { programs_modified_by_tx, .. } = execution_result {
            cache.merge(programs_modified_by_tx);
        }
    }
}

@ryoqun ryoqun (Member Author) Apr 26, 2024

hmm, that incurs 2-pass looping in the worst case (totaling O(2*N)), and also rather heavy code duplication.

Considering !programs_modified_by_tx.is_empty() should be rare (barring malice), I think a quick-and-dirty memoization like this will be enough (the worst case's overall cost is O(Cm*N), where Cm << 2 and Cm is the memoization overhead): cce3075
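
A minimal sketch of this kind of memoization (hypothetical types and names; this is not the actual cce3075 change): the write lock is acquired lazily, on the first executed tx that actually modified programs, and then reused for the rest of the batch, so the common case takes no commit lock at all and the worst case is still a single pass:

use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical stand-ins for the real types.
type ProgramCacheStub = HashMap<u64, String>;

struct ExecutedTx {
    status_is_ok: bool,
    programs_modified_by_tx: ProgramCacheStub,
}

fn commit_modified_programs(cache: &RwLock<ProgramCacheStub>, results: &[ExecutedTx]) {
    // Memoized write guard: acquired at most once per batch, and only when needed.
    let mut guard = None;
    for result in results {
        if result.status_is_ok && !result.programs_modified_by_tx.is_empty() {
            let cache_guard = guard.get_or_insert_with(|| cache.write().unwrap());
            for (id, program) in &result.programs_modified_by_tx {
                cache_guard.insert(*id, program.clone()); // stand-in for cache.merge(...)
            }
        }
    }
    // Common case: no tx modified programs, so the lock was never taken.
    // Worst case: the lock is taken once and the loop is still a single pass.
}

fn main() {
    let cache = RwLock::new(ProgramCacheStub::new());
    let results = vec![ExecutedTx {
        status_is_ok: true,
        programs_modified_by_tx: HashMap::from([(1u64, "program".to_string())]),
    }];
    commit_modified_programs(&cache, &results);
    assert_eq!(cache.read().unwrap().len(), 1);
}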

@Lichtso Lichtso left a comment

LGTM + some nits about the comments

@apfitzge

Changes seem fine to me, but I'm less familiar with program-cache stuff. In terms of lock contention, I just ran banking-bench as a sanity check - no programs should be compiled but if locks are grabbed less often might expect small boost in throughput.

Results seem to show around the same if not slightly better:

without:
{'name': 'banking_bench_total', 'median': '43963.39'}
{'name': 'banking_bench_tx_total', 'median': '45240.95'}
{'name': 'banking_bench_success_tx_total', 'median': '43159.22'}

{'name': 'banking_bench_total', 'median': '44160.32'}
{'name': 'banking_bench_tx_total', 'median': '45466.70'}
{'name': 'banking_bench_success_tx_total', 'median': '43331.09'}

{'name': 'banking_bench_total', 'median': '46828.73'}
{'name': 'banking_bench_tx_total', 'median': '48264.15'}
{'name': 'banking_bench_success_tx_total', 'median': '46307.88'}

with:
{'name': 'banking_bench_total', 'median': '41763.08'}
{'name': 'banking_bench_tx_total', 'median': '42942.16'}
{'name': 'banking_bench_success_tx_total', 'median': '40872.13'}

{'name': 'banking_bench_total', 'median': '48499.97'}
{'name': 'banking_bench_tx_total', 'median': '50037.59'}
{'name': 'banking_bench_success_tx_total', 'median': '48014.97'}

{'name': 'banking_bench_total', 'median': '46336.45'}
{'name': 'banking_bench_tx_total', 'median': '47713.88'}
{'name': 'banking_bench_success_tx_total', 'median': '45633.68'}

@ryoqun ryoqun merged commit 90bea33 into anza-xyz:master Apr 27, 2024
38 checks passed
@ryoqun ryoqun (Member Author) commented Apr 30, 2024

In terms of lock contention, I just ran banking-bench as a sanity check - no programs should be compiled but if locks are grabbed less often might expect small boost in throughput.

Results seem to show around the same if not slightly better:

this result isn't surprising considering:

  • transaction processing is actually batched for the bench? (iirc, 64); the unified scheduler benches above aren't batched
  • banking threads are 4 by default (iirc); the unified scheduler benches above are measured with 12 threads

So, banking stage is much like the blockstore processor in this regard...
