
[WIP] Implementation of Parallel EVM 2.0 (v1.1.16 rebased) #1130

Closed

Conversation

setunapo (Contributor)

Description

This is part 2 of the implementation of BEP-130: Parallel Transaction Execution.
Part 1 of the implementation can be found at: Parallel 1.0 Implementation

For the motivation and architecture design, refer to the BEP-130 document.
As noted in Parallel 1.0, Parallel 2.0 is a performance-enhancement version: it tries to improve performance on top of the Parallel 1.0 architecture by introducing more advanced methodologies.

Specification

Architecture

The architecture of Parallel 2.0 is based on Parallel 1.0. It only touches the execution layer, mainly state_processor.go, state_db.go and state_object.go. The architecture can again be briefly described with 3 diagrams:

  • Module
  • Pipeline
  • Lifecycle of transaction

Module

Here are the major components of Parallel 2.0:
[image: Parallel 2.0 module diagram]

Pipeline

Pipeline of Parallel 2.0

Take a concurrency of 8 as an example for a brief view.

[image: Parallel 2.0 pipeline diagram]

The pipeline of Parallel 1.0 is posted below for comparison.
[image: Parallel 1.0 pipeline diagram]
The pipelines of 1.0 and 2.0 are quite different. There are lots of changes; the most obvious include:

  • There is no waiting state; a transaction can be executed without waiting for its previous transaction.
  • A new shadow routine is created for each slot.
  • Stage1 & Stage2 are introduced.
  • A new routine is created: RT Conflict Check.
  • Conflict Check, Finalize, Merge are all moved into the main routine dispatcher.

Lifecycle of transaction

Lifecycle of Parallel 2.0
[image: Parallel 2.0 transaction lifecycle diagram]
The lifecycle of 1.0 is posted below for comparison.
[image: Parallel 1.0 transaction lifecycle diagram]
For the transaction lifecycle, the main differences are:

  • No dispatch IPC cost.
  • No waiting state.
  • UnconfirmedAccess.
  • LightCopy.
  • NewSlotDB is now moved to the execution routine, while conflict detection & finalize are now in the main dispatcher routine.

Features introduced in 2.0

Streaming Pipeline
Once a Tx's execution stage (EVM) is completed, it doesn't need to wait for its previous transaction's merge result. The transaction can queue its result to the shadow slot and move on to execute the next transaction in the pending queue.
ConflictDetect, Finalize and Tx Result Merge are all done by the main dispatcher. And each execution slot has a shadow slot: a backup slot that does exactly the same job as the primary slot. The shadow slot is used to make sure a redo can be scheduled ASAP.
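A minimal Go sketch of this streaming shape (the types and helpers here are illustrative stand-ins, not the PR's actual code): the slot goroutine only executes and queues results, while the dispatcher goroutine alone performs conflict check, finalize and merge.

```go
package main

import "fmt"

// Illustrative stand-ins; the PR's real TxReq/SlotDB carry much more state.
type txReq struct{ index int }
type txResult struct{ index int }

// runSlot models one execution slot: it executes transactions from its
// pending queue and queues each result without waiting for the previous
// transaction's merge (no "waiting" state).
func runSlot(pending <-chan txReq, results chan<- txResult) {
	for req := range pending {
		// the EVM execution stage would run here
		results <- txResult{index: req.index}
		// move straight on to the next pending transaction
	}
}

func main() {
	pending := make(chan txReq, 8)
	results := make(chan txResult, 8)
	go runSlot(pending, results)
	for i := 0; i < 4; i++ {
		pending <- txReq{index: i}
	}
	close(pending)
	// The dispatcher alone does ConflictDetect, Finalize and Merge,
	// so the main StateDB is only ever touched from this goroutine.
	for i := 0; i < 4; i++ {
		res := <-results
		fmt.Println("dispatcher merging result of tx", res.index)
	}
}
```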

Universal Unconfirmed State Access
This is quite complicated. With unconfirmed state access, there is a priority order for accessing a StateObject:
Self Dirty ➝ Unconfirmed Dirty ➝ Main StateDB (dirty, pending, snapshot and trie)
In short, it tries its best to get the desired information, in order to reduce the conflict rate.
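A minimal sketch of that lookup priority, assuming flattened hypothetical types (the real code lives in state_object.go/statedb.go and also tracks which version of the data was read):

```go
package main

import "fmt"

// Hypothetical, heavily simplified types; the real code works on state
// objects and records every read for the later conflict check.
type addrKey struct{ addr, key string }

type slotDB struct {
	selfDirty   map[addrKey]string   // this slot's own writes
	unconfirmed []map[addrKey]string // other slots' unmerged results, newest first
	mainState   map[addrKey]string   // stands in for the main StateDB
	readRecord  map[addrKey]string   // kept for Conflict Detect 2.0
}

// getState resolves a read with the documented priority:
// self dirty -> unconfirmed dirty -> main StateDB.
func (s *slotDB) getState(k addrKey) string {
	if v, ok := s.selfDirty[k]; ok {
		return v
	}
	for _, db := range s.unconfirmed {
		if v, ok := db[k]; ok {
			s.readRecord[k] = v // remember what we read for the conflict check
			return v
		}
	}
	v := s.mainState[k]
	s.readRecord[k] = v
	return v
}

func main() {
	db := &slotDB{
		selfDirty:  map[addrKey]string{},
		mainState:  map[addrKey]string{{"0xWBNB", "slot0"}: "100"},
		readRecord: map[addrKey]string{},
	}
	fmt.Println(db.getState(addrKey{"0xWBNB", "slot0"})) // "100" from main StateDB
}
```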

Conflict Detect 2.0
In Parallel 1.0, the conflict detector is a simple "double for loop" that checks whether two SlotDBs have overlapping state changes. We mark the execution result as conflicted if it read a state that was changed by another transaction within the conflict window.
In Parallel 2.0, we do the conflict check based on reads. We no longer care about what has been changed; the only thing we need to check is whether what we read was correct. We keep the detailed read record and compare it against the main StateDB during conflict detect. It is more straightforward and accurate.
Additionally, a new routine called Stage2ConfirmLoop is added to do conflict detection in advance, once most of the transactions have been executed at least once; it is configurable.
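A hedged sketch of the read-replay idea, with reads flattened to string keys (the actual record distinguishes balance, nonce, code, KV and address-state reads):

```go
package main

import "fmt"

// hasConflict replays every recorded read against the main state at merge
// time; any mismatch means the tx executed on stale data and must be redone.
func hasConflict(readRecord, mainState map[string]string) bool {
	for key, seen := range readRecord {
		if mainState[key] != seen {
			return true // stale read: schedule a redo for this tx
		}
	}
	return false
}

func main() {
	reads := map[string]string{"0xabc|balance": "10"}
	before := map[string]string{"0xabc|balance": "10"}
	after := map[string]string{"0xabc|balance": "11"}
	fmt.Println(hasConflict(reads, before)) // false: the reads are still valid
	fmt.Println(hasConflict(reads, after))  // true: balance changed meanwhile
}
```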

Parallel KV Conflict Detect
Conflict checking is CPU-consuming, especially the storage check. We have to go through all the read addresses, and each address can have lots of KV read records. It is one of the bottlenecks right now, so we do the KV conflict detect in parallel to speed it up.
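A sketch of how such a parallel KV check might be shaped; the chunking strategy and types are illustrative, not the PR's implementation:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// kvRead is one recorded storage read: which key, and what value we saw.
type kvRead struct{ key, value string }

// parallelKVCheck splits a large KV read record into chunks and verifies each
// chunk in its own goroutine. mainState is accessed read-only here, so
// concurrent map reads are safe without locking.
func parallelKVCheck(reads []kvRead, mainState map[string]string, workers int) bool {
	if len(reads) == 0 {
		return false
	}
	var wg sync.WaitGroup
	var conflict int32
	chunk := (len(reads) + workers - 1) / workers
	for start := 0; start < len(reads); start += chunk {
		end := start + chunk
		if end > len(reads) {
			end = len(reads)
		}
		wg.Add(1)
		go func(part []kvRead) {
			defer wg.Done()
			for _, r := range part {
				if mainState[r.key] != r.value {
					atomic.StoreInt32(&conflict, 1) // stale read found
					return
				}
			}
		}(reads[start:end])
	}
	wg.Wait()
	return atomic.LoadInt32(&conflict) == 1
}

func main() {
	state := map[string]string{"k1": "v1", "k2": "v2"}
	reads := []kvRead{{"k1", "v1"}, {"k2", "stale"}}
	fmt.Println(parallelKVCheck(reads, state, 4)) // true: k2 changed
}
```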

Memory: Shared Memory Pool & LightCopy & ZeroCopy
According to the memory analysis of Parallel 1.0, CopyForSlot allocates 62KB of memory every time. Since the memory is mostly consumed by maps, we can use sync.Pool to manage all the maps and recycle the maps used by the slot DB asynchronously while the block is committing.
Parallel 1.0 used DeepCopy for copy-on-write, which is CPU- and memory-consuming when the storage contains lots of KV elements. We replace it with LightCopy to avoid redundant memory copies of StateObject. With LightCopy we do not copy any of the storage; in fact this is not an option but a must once we use Unconfirmed Reference, since the storage can be accessed from different unconfirmed DBs and we cannot simply copy all the KV elements of a single state object.
And we use a plain map in sequential mode and sync.Map in parallel mode for concurrent StateObject access.
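A minimal sketch of the map-pooling idea with sync.Pool (which the text above names); the map shape and sizes are illustrative:

```go
package main

import "sync"

// Slot DBs take their bookkeeping maps from a sync.Pool and return them when
// the block commits, instead of paying fresh map allocations per CopyForSlot.
var mapPool = sync.Pool{
	New: func() interface{} {
		return make(map[string]string, 16)
	},
}

func newSlotMap() map[string]string {
	return mapPool.Get().(map[string]string)
}

// recycleSlotMap clears the map and hands it back to the pool; the PR does
// this asynchronously while the block is committing.
func recycleSlotMap(m map[string]string) {
	for k := range m {
		delete(m, k)
	}
	mapPool.Put(m)
}

func main() {
	m := newSlotMap()
	m["0xabc|slot0"] = "value"
	recycleSlotMap(m)
}
```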

Trie Prefetch In Advance
Trie prefetch is key to reducing the cost of validation. We now do trie prefetch even for unconfirmed results, to make sure the trie prefetch can be scheduled ASAP.
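A toy sketch of the "prefetch in advance" shape, assuming a channel-fed prefetch routine; the real code drives geth's trie prefetcher from the dispatcher:

```go
package main

import "fmt"

// Addresses touched by a result are queued to the prefetch routine as soon
// as the result exists, confirmed or not; a wasted prefetch on a redone tx
// only costs some extra I/O while the trie nodes are warmed early.
func main() {
	prefetchCh := make(chan string, 64)
	done := make(chan struct{})
	go func() { // stands in for the trie prefetch routine
		for addr := range prefetchCh {
			fmt.Println("warming trie nodes for", addr)
		}
		close(done)
	}()
	for _, addr := range []string{"0xabc...", "0xdef..."} { // unconfirmed result's touched addresses
		prefetchCh <- addr
	}
	close(prefetchCh)
	<-done
}
```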

Dispatcher 2.0
Parallel 2.0 actually removes the dispatch action; the dispatch channel IPC is no longer needed. Dispatch 2.0 has 2 parts: static dispatch & dynamic dispatch.
Static dispatch is done at the beginning of block processing. It is responsible for making sure potentially conflicting transactions are dispatched to the same slot, and it tries its best to balance the workload between slots.
Dynamic dispatch is for runtime: there is a stolen mode, so that when a slot has finished its statically dispatched tasks, it can steal a transaction from another busy slot.
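A hedged sketch of the two dispatch phases; staticDispatch and steal are hypothetical helpers, and the real stolen mode takes work from other slots' pending queues rather than from a flat cursor:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Illustrative shape; the PR's real TxReq carries much more information.
type txReq struct {
	index int
	from  string
}

// staticDispatch groups transactions by sender so that potentially
// conflicting txs land in the same slot; a real implementation also
// balances the load between slots.
func staticDispatch(txs []txReq, slots int) [][]txReq {
	queues := make([][]txReq, slots)
	bySender := make(map[string]int)
	next := 0
	for _, tx := range txs {
		slot, ok := bySender[tx.from]
		if !ok {
			slot = next % slots
			bySender[tx.from] = slot
			next++
		}
		queues[slot] = append(queues[slot], tx)
	}
	return queues
}

// steal models the dynamic "stolen" mode with a shared atomic cursor: a slot
// that finished its static queue claims the next unexecuted tx index.
func steal(cursor *int64, total int64) (int64, bool) {
	idx := atomic.AddInt64(cursor, 1) - 1
	return idx, idx < total
}

func main() {
	txs := []txReq{{0, "a"}, {1, "b"}, {2, "a"}, {3, "c"}}
	fmt.Println(staticDispatch(txs, 2)) // txs 0 and 2 share a slot (same sender)
}
```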

Corner Case

The behavior of parallel execution is somewhat different from sequential execution, and there are corner cases we have to handle specially.

  • don't panic if anything goes wrong when reading state
  • skip the system address's balance check
  • handle the WBNB contract to reduce its conflict rate by balance make-up; a new interface GetBalanceOpCode is added (a hedged sketch of its possible shape follows)
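Since the PR only names the interface, here is a hypothetical sketch of what such a GetBalanceOpCode hook could look like; the signature and surrounding interface are assumptions, not the PR's verbatim definition:

```go
package state

import (
	"math/big"

	"github.com/ethereum/go-ethereum/common"
)

// Hypothetical sketch only: a GetBalanceOpCode entry point alongside the
// normal GetBalance lets the slot DB tell a BALANCE-opcode read apart from
// other balance accesses, so the WBNB balance make-up can be applied (and
// conflict-checked) only where it is relevant.
type balanceReader interface {
	GetBalance(addr common.Address) *big.Int
	GetBalanceOpCode(addr common.Address) *big.Int // used by the BALANCE opcode path
}
```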

Performance Test

I set up 2 instances to test the performance benefit, with a parallel number of 8 and --pipecommit enabled.
The 2 instances use the same hardware configuration: 16 cores, 64GB memory, 7TB SSD.
They ran for ~50 hours. The total block processing cost (execution, validation, commit) was reduced by ~20% to ~50%; the benefit varies with the block pattern.
[image: block processing time comparison chart]

setunapo and others added 13 commits October 12, 2022 17:57
Add a new interface StateProcessor.ProcessParallel(...), it is a
copy of Process(...) right now.
This patch is a placeholder, we will implement BEP-130 based on it.
** modules of init, slot executor and dispatcher
BEP-130 parallel transaction execution will maintain a tx execution routine pool, with a configured number of slots (routines) to execute transactions.

Init is executed once on startup and will create the routine pool.
The slot executor is the place where transactions are executed.
The dispatcher is the module that dispatches transactions to the right slot.

** workflow: Stage Apply, Conflict Detector, Slot, Gas...
  > two stages of applyTransaction
  For sequential execution, applyTransaction will do transaction execution and
  result finalization.

  > Conflict detector
  We will check the parallel execution result for each transaction.
  If there is a conflict, the result cannot be committed; a redo will
  be scheduled to update its StateDB and re-run.
  For parallel execution, the execution result may not be reliable (conflict); with the
  try-rerun policy, a transaction could be executed more than once to get the correct result.
  Once the result is confirmed, we will finalize it to the StateDB.
  Balance, KV, account create & suicide... will all be checked.
  And the conflict window is important for the conflict check.

  > Slot StateDB
  Each slot has a StateDB to execute transactions in the slot.
  The world-state changes are stored in this StateDB and merged to the main StateDB
  when the transaction result is confirmed. SlotState.slotdbChan is the current executing TX's slotDB.
  Only dirty state objects are allowed to merge back; otherwise there is a race condition
  of merging outdated state objects back.

** others
gas pool, transaction gas, gas fee reward to system address
evm instance, receipt CumulativeGasUsed & Log Index,
contract creation, slot state,
parallel routine safety:
  1. only the dispatcher can access the main stateDB
  2. slotDB will be created and merged to the stateDB in the dispatch goroutine.

** workflow 2: CopyForSlot, redesigned dispatch, slot StateDB reuse & several bugfixes

  > simplify statedb copy with CopyForSlot
  only copy dirtied state objects
  delete prefetcher

** redesign dispatch, slot StateDB reuse...
  > dispatch enhancement
  remove atomic idle, curExec... replaced by pendingExec for the slot.

  > slot StateDB reuse
  It will try to reuse the latest merged slotDB in the same slot.
  If reuse fails (conflict), it will try to update to the latest world state and redo.
  The reused SlotDB will have the same BaseTxIndex, since its world state was synced when it was created based on that txIndex.
  Conflict check can skip the current slot now.
  Reuse of SlotDB for idle dispatch is more aggressive:
  not only pending Txs but also idle-dispatched Txs now try to reuse a SlotDB.

** others
state changes no longer need to store the value

add "--parallel" startup options
Parallel is not enabled by default.
To enable it, just add a simple flag to geth: --parallel
To config parallel execute parameter: --parallel.num 20 --parallel.queuesize 30
"--parallel.num" is the number of parallel slot to execute Tx, by default it is CPUNum-1
"--parallel.queuesize" is the maxpending queue size for each slot, by default it is 10

For example:
  ./build/bin/geth --parallel
  ./build/bin/geth --parallel --parallel.num 10
  ./build/bin/geth --parallel --parallel.num 20 --parallel.queuesize 30

** several BugFixes
1. system address balance conflict
  We treat the system address as a special address, since each transaction
  pays its gas fee to it.
  Parallel execution resets its balance in the slotDB; if a transaction tries to access
  its balance, it will receive 0. If the contract needs the real system address
  balance, we will schedule a redo with the real system address balance.

  One transaction that accessed the system address:
  https://bscscan.com/tx/0xcd69755be1d2f55af259441ff5ee2f312830b8539899e82488a21e85bc121a2a

2. fork caused by an address state being changed and read in the same block
3. test case error
4. statedb.Copy should initialize parallel elements
5. do merge for snapshot
** move .Process() close to .ProcessParallel()
** InitParallelOnce & preExec & postExec for code maintenance
** MergedTxInfo -> SlotChangeList & debug conflict ratio
** use ParallelState to keep all parallel statedb states.
** enable queue to same slot

** discard state changes of reverted transactions
   and refine debug logs

** add ut for statedb
…ch for parallel

this patch has the following changes:
1. change the default queuesize to 20, since 10 might not be enough and would cause more conflicts
2. enable slot DB trie prefetch, using the prefetcher of the main state DB.
3. disable transaction cache prefetch when parallel is enabled,
  since in parallel mode CPU resources could be limited, and parallel has its own piped transaction execution

4. change dispatch policy
  ** queue based on from address
  ** queue based on to address, trying the next slot if the current one is full
  Since the from address is used to make the dispatch policy,
  the pending transactions in a slot could have several different
  to addresses, so we compare the to address of every pending transaction.
** use sync.Map for the stateObjects in parallel
** others
  fix a SlotDB reuse bug & enable it
  delete unnecessary parallel initialization for non-slot DBs.
…t, prefetch, fork

This is a complicated patch with several fixups.

** fix MergeSlotDB
Since copy-on-write is used, a transaction will deep-copy the StateObject before it writes state;
all the dirty state changes are recorded in this copy first, and ownership is
transferred to the main StateDB on merge.
This has a potential race condition: the simple ownership transfer may discard state changes
made by other concurrent transactions.
When copy-on-write is used, we should do a StateObject merge instead.

** fix Suicide
Suicide has an address state read operation.
It also needs to do copy-on-write, to avoid damaging the main StateDB's state object.

** fix conflict detect
If the state read is not zero, we should do conflict detection against the address state change first.
Do conflict detection even within the current slot: if we use copy-on-write and slotDB reuse, the same
slot could have a conflict race condition.

** disable prefetch on slotDB
trie prefetch should be started on the main DB on merge

** Add/Sub zero balance, Set State
These are void operations, optimized to reduce the conflict rate.
A simple test shows the conflict rate dropped from ~25% to ~12%.

** fix a fork on block 15,338,563
It was a nonce conflict caused by the opcodes opCreate & opCreate2.
Generally, the nonce is advanced by 1 for the transaction sender;
but opCreate & opCreate2 will try to create a new contract, and the
caller will advance its nonce too.

This makes nonce conflict detection more complicated: since the nonce is a
fundamental part of an account, as long as it has been changed we mark
the address as StateChanged, and any concurrent access to it will be considered
a conflict.
** optimize conflict for AddBalance(0)
Adding a balance of 0 does nothing, but it will do an empty() check and add
a touch event. On transaction finalize, the touch event will check whether
the StateObject is empty and do an empty-delete if it is.

This patch treats the empty check as a state check: if the address state has
not been changed (create, suicide, empty delete), then the empty check is reliable.

** optimize conflict for system address

** some code improvement & lint fixup & refactor for params

** remove reuse SlotDB
SlotDB reuse was added to reduce copies of StateObject, in order to mitigate
the Go GC problem.
COW (copy-on-write) is used to address the GC problem too. With COW enabled,
reuse can be removed, as it now has limited benefits and adds more complexity.

** fix trie prefetch on dispatcher

Trie prefetch will be scheduled on object finalize.
With parallel, we should schedule trie prefetch on the dispatcher, since
the TriePrefetcher is not safe for concurrent access and it is created & stopped
on the dispatcher routine.

But object.finalize on the slot cleared its dirtyStorage, which broke the later trie
prefetch on the dispatcher during MergeSlotDB.
No fundamental changes; some improvements, including:
** Add a new type ParallelStateProcessor;
** move Parallel Config to BlockChain
** more precise ParallelNum setting
** Add EnableParallelProcessor()
** remove panic()
** remove the useless redo flag
** change waitChan from `chan int` to `chan struct {}` and communicate by close() (see the sketch after this list)
** dispatch policy: queue `from` ahead of `to`
** pre-allocate allLogs
** disable the parallel processor if snapshot is not enabled
** others: rename...
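For reference, the `close()` broadcast idiom mentioned in the waitChan item above, as a tiny runnable example:

```go
package main

import (
	"fmt"
	"sync"
)

// Closing a `chan struct{}` broadcasts "go" to every waiter at once, whereas
// sending a value on a `chan int` would wake only a single receiver.
func main() {
	waitCh := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			<-waitCh // every receiver unblocks when waitCh is closed
			fmt.Println("worker", id, "released")
		}(i)
	}
	close(waitCh)
	wg.Wait()
}
```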
1. features of 2.0:
  ** Streaming Pipeline
  ** Implement universal unconfirmed state DB reference; try best to get account object state.
  ** New conflict detection, checking based on what has been read.
  ** Do parallel KV conflict check for large KV reads
  ** new interfaces StateDBer and ParallelStateDB
  ** shared memory pool for parallel objects
  ** use map in sequential mode and sync.Map in parallel mode for concurrent StateObject access
  ** replace DeepCopy with LightCopy to avoid redundant memory copies of StateObject
  ** do trie prefetch in advance
  ** dispatcher 2.0
     Static Dispatch & Dynamic Dispatch
     Stolen mode for TxReq when a slot has finished its statically dispatched tasks
     Real-time result confirmation in Stage2, once most of the txs have been executed at least once
     Make it configurable

2. Handling of corner cases:
 ** don't panic if anything goes wrong when reading state
 ** handle the system address, skip its balance check
 ** handle the WBNB contract to reduce the conflict rate by balance make-up
    WBNB balance make-up via GetBalanceOpCode & depth
    add a lock to fix a WBNB make-up concurrency crash
    add a new interface GetBalanceOpCode
setunapo closed this Oct 12, 2022
DashYang commented Dec 7, 2022

Hi, in the figure "Pipeline of Parallel 2.0", I have 2 questions:

  1. Why does tx1 need a redo?
  2. Why is tx11 dispatched to both slot 3 and slot 6, when no stealing happened?
