Reduce ProgramCache write lock contention #1037

Merged: ryoqun merged 3 commits into anza-xyz:master on Apr 27, 2024

Conversation

@ryoqun ryoqun commented Apr 25, 2024

Problem

ProgramCache's current locking behavior is needlessly hurting the unified scheduler's performance.

The unified scheduler is a new transaction scheduler for block verification (i.e. the replaying stage). As a design choice, it doesn't employ batching at all. Technically, the unified scheduler still uses TransactionBatches, but each contains only 1 transaction. That means it bears all the extra overhead that batching would otherwise amortize.

Sidenote: I believe these overheads are all solvable (but not SOON, except this one...). Also note that it's already 1.8x faster than the to-be-replaced blockstore_processor after this PR.

Among all those overhead sources, one of the most visible is ProgramCache's write-lock contention. Currently, ProgramCache is write-locked 3 times unconditionally per 1-tx batch (for loading by replenish_program_cache(), for evictions by load_and_execute_sanitized_transactions(), and for committing updated programs by commit_transactions()). So it acutely hampers the unified scheduler's concurrency.

Summary of Changes

Reduce write-locking of ProgramCache to the bare minimum. Now the usual case is 1 write lock per batch (for loading by replenish_program_cache()), while the worst case still remains at 3.

This gives a roughly ~5% consistent improvement. Also note that the blockstore-processor isn't affected, even though both take the same code path.
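
For illustration only, here is a minimal, self-contained sketch of the locking pattern (not the actual agave code; MockProgramCache and the function names are made up). Before, every 1-tx batch took three write locks unconditionally; after, only the loading lock is unconditional, and the eviction/commit locks are taken only when they can matter:

use std::sync::RwLock;

// Hypothetical stand-in for ProgramCache.
#[derive(Default)]
struct MockProgramCache {
    entries: Vec<String>,
}

// Before: three unconditional write locks per (1-tx) batch.
fn process_batch_before(cache: &RwLock<MockProgramCache>) {
    drop(cache.write().unwrap()); // loading: replenish_program_cache()
    drop(cache.write().unwrap()); // eviction: load_and_execute_sanitized_transactions()
    drop(cache.write().unwrap()); // commit: commit_transactions()
}

// After: one write lock for loading; eviction/commit locks only when needed.
fn process_batch_after(
    cache: &RwLock<MockProgramCache>,
    loaded_missing: bool,               // did loading pull anything new into the cache?
    programs_modified_by_tx: &[String], // did any tx deploy or modify a program?
) {
    drop(cache.write().unwrap()); // loading is still unconditional
    if loaded_missing {
        drop(cache.write().unwrap()); // evict only if the cache may have grown
    }
    if !programs_modified_by_tx.is_empty() {
        cache
            .write()
            .unwrap()
            .entries
            .extend_from_slice(programs_modified_by_tx); // commit only when needed
    }
}

fn main() {
    let cache = RwLock::new(MockProgramCache::default());
    process_batch_before(&cache);
    process_batch_after(&cache, false, &[]);
}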

perf improvements

before
# --block-verification-method unified-scheduler
ledger processed in 19 seconds, 182 ms
ledger processed in 19 seconds, 375 ms
ledger processed in 19 seconds, 460 ms

# --block-verification-method blockstore-processor
ledger processed in 32 seconds, 845 ms
ledger processed in 32 seconds, 786 ms
ledger processed in 32 seconds, 853 ms
after

After this PR, the unified scheduler's speedup factor over blockstore-processor is 1.8x:

irb(main):001:0> 32.691/17.970
=> 1.8191986644407347
# --block-verification-method unified-scheduler
ledger processed in 18 seconds, 77 ms
ledger processed in 17 seconds, 926 ms
ledger processed in 18 seconds, 126 ms

# --block-verification-method blockstore-processor
ledger processed in 32 seconds, 773 ms
ledger processed in 32 seconds, 691 ms
ledger processed in 32 seconds, 451 ms

(for the record) the merged commit

before(3f7b352):

ledger processed in 19 seconds, 347 ms
ledger processed in 19 seconds, 314 ms
ledger processed in 19 seconds, 278 ms

after(90bea33):

ledger processed in 18 seconds, 29 ms
ledger processed in 18 seconds, 78 ms
ledger processed in 18 seconds, 114 ms

context: extracted from #593

cc: @apfitzge

@@ -292,6 +292,7 @@ impl<FG: ForkGraph> TransactionBatchProcessor<FG> {

execution_time.stop();

/*
ryoqun (Member Author)

firstly, let's see which test should fail. :)

@ryoqun ryoqun (Member Author) Apr 25, 2024

hehe, turned out there's no test to test this code specifically...

@codecov-commenter commented Apr 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.0%. Comparing base (8e331e1) to head (cce3075).
Report is 9 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1037   +/-   ##
=======================================
  Coverage    82.0%    82.0%           
=======================================
  Files         860      860           
  Lines      232898   232911   +13     
=======================================
+ Hits       191071   191104   +33     
+ Misses      41827    41807   -20     

@ryoqun ryoqun force-pushed the program-cache-reduced-lock-contention branch from 86c433e to 1073cf7 on April 25, 2024 07:31
@ryoqun ryoqun force-pushed the program-cache-reduced-lock-contention branch from 1073cf7 to cc12279 on April 25, 2024 08:01
@ryoqun ryoqun changed the title (wip) Reduce ProgramCache write lock contention Reduce ProgramCache write lock contention Apr 25, 2024
@ryoqun ryoqun requested a review from Lichtso April 25, 2024 08:02
@ryoqun ryoqun marked this pull request as ready for review April 25, 2024 08:02
@@ -4162,7 +4162,7 @@ impl Bank {
programs_modified_by_tx,
} = execution_result
{
- if details.status.is_ok() {
+ if details.status.is_ok() && !programs_modified_by_tx.is_empty() {
@ryoqun ryoqun (Member Author) Apr 25, 2024

hope this one is fairly uncontroversial. haha

// ProgramCache entries. Note that this flag is deliberately defined, so that there's still
// at least one other batch, which will evict the program cache, even after the occurrences
// of cooperative loading.
if programs_loaded_for_tx_batch.borrow().loaded_missing {
@ryoqun ryoqun (Member Author) Apr 25, 2024

I guess this one isn't so straightforward. Better ideas are very welcome.


Global cache can also grow via cache.merge(programs_modified_by_tx); above, not just by loading missing.

ryoqun (Member Author)

how about this? d50c11c

@Lichtso Lichtso Apr 25, 2024

Better, but the number of insertions and evictions can still be unbalanced because it is only a boolean.
Also, maybe we should move eviction to the place where we merge in new deployments? That way they could share a write lock.

@ryoqun ryoqun (Member Author) Apr 25, 2024

Better, but the number of insertions and evictions can still be unbalanced because it is only a boolean.

I intentionally chose a boolean, thinking the number of insertions and evictions doesn't need to be balanced. That's because evict_using_2s_random_selection() keeps evicting entries until they're under 90% of MAX_LOADED_ENTRY_COUNT (=256) with just a single invocation. So, we just need to ensure it's called with sufficient frequency/timing to avoid a cache-bomb DoS attack.
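
As a rough, hypothetical illustration of this property (the constant mirrors the one mentioned above, but victim selection is simplified to popping arbitrary entries rather than the real 2s-random-selection):

const MAX_LOADED_ENTRY_COUNT: usize = 256;

// Simplified eviction: keep dropping entries until the cache is below 90% of capacity.
// The real evict_using_2s_random_selection() picks victims differently, but the key
// property is the same: one invocation restores the size bound no matter how many
// insertions happened since the last call.
fn evict_to_threshold(entries: &mut Vec<u64>) {
    let target = MAX_LOADED_ENTRY_COUNT * 90 / 100;
    while entries.len() > target {
        entries.pop(); // placeholder victim selection
    }
}

fn main() {
    // Even after an unbalanced burst of insertions...
    let mut entries: Vec<u64> = (0..1_000).collect();
    // ...a single eviction pass brings the cache back under the threshold.
    evict_to_threshold(&mut entries);
    assert!(entries.len() <= MAX_LOADED_ENTRY_COUNT * 90 / 100);
    println!("entries after eviction: {}", entries.len());
}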

Also, maybe we should move eviction to the place where we merge in new deployments? That way they could share a write lock.

This is possible and it looks appealing; however, it isn't trivial. Firstly, load_and_execute_sanitized_transactions can be entered via 3 code paths: replaying, banking, and rpc tx simulation. I guess that's the reason this eviction is placed here to begin with, as the most shared code path for all transaction execution?

The place where we merge in new deployments is commit_transactions(), which isn't touched by rpc tx simulation for obvious reasons. So, moving this eviction there would expose an unbounded program-cache-entry growth DoS (theoretically; assuming no new blocks for an extended duration). Also, replaying and banking take the commit code path with slightly different semantics. So it needs a bit of care to move this eviction nevertheless, even if we ignore the rpc concern...

All that said, I think the current code change should be good enough and safe enough?

@@ -4162,7 +4162,7 @@ impl Bank {
programs_modified_by_tx,
} = execution_result
{
- if details.status.is_ok() {
+ if details.status.is_ok() && !programs_modified_by_tx.is_empty() {
let mut cache = self.transaction_processor.program_cache.write().unwrap();
@ryoqun ryoqun (Member Author) Apr 25, 2024

actually, while writing this, I noticed that this write-lock is taken per tx if the batch contains 2 or more transactions: #1037 (comment)

@Lichtso Lichtso Apr 25, 2024

You are right. How about:

if execution_results.iter().any(|execution_result| {
    matches!(
        execution_result,
        TransactionExecutionResult::Executed { details, programs_modified_by_tx }
            if details.status.is_ok() && !programs_modified_by_tx.is_empty()
    )
}) {
    let mut cache = self.transaction_processor.program_cache.write().unwrap();
    for execution_result in &execution_results {
        if let TransactionExecutionResult::Executed { programs_modified_by_tx, .. } = execution_result {
            cache.merge(programs_modified_by_tx);
        }
    }
}

@ryoqun ryoqun (Member Author) Apr 26, 2024

hmm, that incurs 2-pass looping in the worst case (totaling O(2*N)), and also rather heavy code duplication.

Considering !programs_modified_by_tx.is_empty() should be rare (barring malice), I think a quick-and-dirty memoization like this will be enough (the worst case's overall cost is O(Cm*N), where Cm << 2 and Cm is the memoization overhead): cce3075
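
A minimal sketch of this kind of memoization (hypothetical types and names; this is not the actual cce3075 change): the write lock is acquired lazily, on the first executed tx that actually modified programs, and then reused for the rest of the batch, so the common case takes no commit lock at all and the worst case is still a single pass:

use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical stand-ins for the real types.
type ProgramCacheStub = HashMap<u64, String>;

struct ExecutedTx {
    status_is_ok: bool,
    programs_modified_by_tx: ProgramCacheStub,
}

fn commit_modified_programs(cache: &RwLock<ProgramCacheStub>, results: &[ExecutedTx]) {
    // Memoized write guard: acquired at most once per batch, and only when needed.
    let mut guard = None;
    for result in results {
        if result.status_is_ok && !result.programs_modified_by_tx.is_empty() {
            let cache_guard = guard.get_or_insert_with(|| cache.write().unwrap());
            for (id, program) in &result.programs_modified_by_tx {
                cache_guard.insert(*id, program.clone()); // stand-in for cache.merge(...)
            }
        }
    }
    // Common case: no tx modified programs, so the lock was never taken.
    // Worst case: the lock is taken once and the loop is still a single pass.
}

fn main() {
    let cache = RwLock::new(ProgramCacheStub::new());
    let results = vec![ExecutedTx {
        status_is_ok: true,
        programs_modified_by_tx: HashMap::from([(1u64, "program".to_string())]),
    }];
    commit_modified_programs(&cache, &results);
    assert_eq!(cache.read().unwrap().len(), 1);
}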

@Lichtso Lichtso left a comment

LGTM + some nits about the comments

@apfitzge

Changes seem fine to me, but I'm less familiar with program-cache stuff. In terms of lock contention, I just ran banking-bench as a sanity check - no programs should be compiled but if locks are grabbed less often might expect small boost in throughput.

Results seem to show around the same if not slightly better:

without:
{'name': 'banking_bench_total', 'median': '43963.39'}
{'name': 'banking_bench_tx_total', 'median': '45240.95'}
{'name': 'banking_bench_success_tx_total', 'median': '43159.22'}

{'name': 'banking_bench_total', 'median': '44160.32'}
{'name': 'banking_bench_tx_total', 'median': '45466.70'}
{'name': 'banking_bench_success_tx_total', 'median': '43331.09'}

{'name': 'banking_bench_total', 'median': '46828.73'}
{'name': 'banking_bench_tx_total', 'median': '48264.15'}
{'name': 'banking_bench_success_tx_total', 'median': '46307.88'}

with:
{'name': 'banking_bench_total', 'median': '41763.08'}
{'name': 'banking_bench_tx_total', 'median': '42942.16'}
{'name': 'banking_bench_success_tx_total', 'median': '40872.13'}

{'name': 'banking_bench_total', 'median': '48499.97'}
{'name': 'banking_bench_tx_total', 'median': '50037.59'}
{'name': 'banking_bench_success_tx_total', 'median': '48014.97'}

{'name': 'banking_bench_total', 'median': '46336.45'}
{'name': 'banking_bench_tx_total', 'median': '47713.88'}
{'name': 'banking_bench_success_tx_total', 'median': '45633.68'}

@ryoqun ryoqun merged commit 90bea33 into anza-xyz:master Apr 27, 2024
38 checks passed
@ryoqun ryoqun (Member Author) commented Apr 30, 2024

In terms of lock contention, I just ran banking-bench as a sanity check - no programs should be compiled but if locks are grabbed less often might expect small boost in throughput.

Results seem to show around the same if not slightly better:

this result isn't surprising considering:

  • transaction processing is actually batched for the bench? (iirc, 64); the unified scheduler benches above aren't batched
  • banking threads are 4 by default (iirc); the unified scheduler benches above are measured with 12 threads

So, banking stage is much like the blockstore processor in this regard...
