Dynamic state snapshots #20152

Merged: karalabe merged 28 commits into ethereum:master from karalabe:snapshot-5 on Mar 23, 2020
Conversation

karalabe (Member) commented Oct 4, 2019:

Note, this PR is semi-experimental work. All code included has been extensively tested on live nodes, but it is very very very sensitive code. As such, the PR hides the included logic behind --snapshot. We've decided to merge it to get the code onto master as it's closing in on the 6 month development mark already.


This PR creates a secondary data structure for storing the Ethereum state, called a snapshot. This snapshot is special as it dynamically follows the chain and can also handle small-ish reorgs:

  • At the very bottom, the snapshot consists of a disk layer, which is essentially a semi-recent full flat dump of the account and storage contents. This is stored in LevelDB as a <hash> -> <account> mapping for the account trie and <account-hash><slot-hash> -> <slot-value> mapping for the storage tries. The layout permits fast iteration over the accounts and storage, which will be used for a new sync algorithm.
  • Above the disk layer there is a tree of in-memory diff layers that each represent one block's worth of state mutations. Every time a new block is processed, it is linked on top of the existing diff tree, and the bottom layers are flattened together to keep the maximum tree depth reasonable. At the very bottom, the first diff layer acts as an accumulator which only gets flattened into the disk layer when it outgrows its memory allowance. This is done mostly to avoid thrashing LevelDB. (A minimal lookup sketch over these two layer kinds follows this list.)
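
To make the two layer kinds concrete, here is a minimal, hedged sketch of how an account read walks the structure. The type shapes and field names are simplified stand-ins for the PR's actual core/state/snapshot types (which carry locks, caches and journalling on top of this); only rawdb.ReadAccountSnapshot and the common/ethdb packages are real go-ethereum APIs.

package snapsketch

import (
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/rawdb"
	"github.com/ethereum/go-ethereum/ethdb"
)

// reader is a stand-in for the internal snapshot interface.
type reader interface {
	AccountRLP(hash common.Hash) ([]byte, error)
}

// diskLayer models the flat, semi-recent dump of the whole state in LevelDB.
type diskLayer struct {
	db ethdb.KeyValueReader
}

// diffLayer models one block's worth of in-memory state mutations stacked on
// top of a parent layer (either another diff layer or the disk layer).
type diffLayer struct {
	parent   reader
	accounts map[common.Hash][]byte
}

// AccountRLP serves the account from this block's mutations if present and
// otherwise defers to the parent; the recursion bottoms out in the flat dump.
func (dl *diffLayer) AccountRLP(hash common.Hash) ([]byte, error) {
	if blob, ok := dl.accounts[hash]; ok {
		return blob, nil // a nil blob marks an account deleted in this block
	}
	return dl.parent.AccountRLP(hash)
}

// AccountRLP reads straight from the flat <hash> -> <account> table on disk.
func (dl *diskLayer) AccountRLP(hash common.Hash) ([]byte, error) {
	return rawdb.ReadAccountSnapshot(dl.db, hash), nil
}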

The snapshot can be built fully online, during the live operation of a Geth node. This is harder than it seems because rebuilding the snapshot for mainnet takes 9 hours, during which the in-memory garbage collection would long since have deleted the state needed for a single capture.

  • The PR achieves this by gradually iterating the state tries and maintaining a marker for the account/storage-slot position up to which the snapshot has already been generated. Every time a new block is executed, state mutations prior to the marker get applied directly (the ones afterwards get discarded) and the snapshot builder switches to iterating the new root hash (see the sketch after this list).
  • To handle reorgs, the builder operates on HEAD-128 and is capable of suspending/resuming if a state is missing (after a restart only some of the tries cached in memory get written out, not all).
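
A hedged sketch of the marker rule from the first bullet above: while generation is still running, only mutations at or before the generation marker are written into the flat tables, the rest are dropped for the background generator to pick up later. The helper name is made up; rawdb.WriteAccountSnapshot is the flat-table writer from go-ethereum's rawdb package, and deletions and storage slots are omitted for brevity.

package snapsketch

import (
	"bytes"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/rawdb"
	"github.com/ethereum/go-ethereum/ethdb"
)

// applyDuringGeneration flushes a block's account mutations into the flat
// snapshot table, but only for keys the generator has already passed. Keys
// beyond genMarker are skipped: the background generator will reach them as
// it iterates the (new) state trie, so writing them now would be redundant.
func applyDuringGeneration(db ethdb.KeyValueWriter, genMarker []byte, accounts map[common.Hash][]byte) {
	for hash, blob := range accounts {
		if genMarker != nil && bytes.Compare(hash[:], genMarker) > 0 {
			continue // not yet covered by the generator
		}
		rawdb.WriteAccountSnapshot(db, hash, blob)
	}
}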

The benefit of the snapshot is that it acts as an acceleration structure for state accesses:

  • Instead of doing O(log N) disk reads (+leveldb overhead) to access an account / storage slot, the snapshot can provide direct, O(1) access time. This should be a small improvement in block processing and a huge improvement in eth_call evaluations.
  • The snapshot supports account and storage iteration at O(1) complexity per entry plus sequential disk access, which should let remote nodes retrieve state data significantly more cheaply than before (the sort order is the state trie leaf order, so responses can be assembled directly into tries too; see the iteration sketch after this list).
  • The presence of the snapshot can also enable more exotic use cases such as deleting and rebuilding the entire state trie (guerrilla pruning) as well as building alternative state tries (e.g. binary instead of hexary), which might be needed in the future.
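
As a usage illustration of the iteration point above, a sketch of walking the flat account data in leaf order. It assumes the snapshot Tree exposes an AccountIterator(root, seek) constructor yielding slim-RLP account blobs, as released versions of the snapshot package do; treat the function itself as illustrative, not as code from this PR.

package snapsketch

import (
	"fmt"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/state/snapshot"
)

// dumpAccounts streams every account below the given state root in ascending
// hash order, which is exactly the state trie leaf order, so the output could
// be packaged directly into sync responses or trie range proofs.
func dumpAccounts(snaps *snapshot.Tree, root common.Hash) error {
	it, err := snaps.AccountIterator(root, common.Hash{}) // seek from the first account
	if err != nil {
		return err
	}
	defer it.Release()

	for it.Next() {
		fmt.Printf("%x -> %d bytes of slim account RLP\n", it.Hash(), len(it.Account()))
	}
	return it.Error()
}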

The downside of the snapshot is that the raw account and storage data is essentially duplicated. In the case of mainnet, this means an extra 15GB of SSD space used.

}
// Cache doesn't contain account, pull from disk and cache for later
blob := rawdb.ReadAccountSnapshot(dl.db, hash)
dl.cache.Set(key, blob)

holiman (Contributor) commented Oct 10, 2019:

I'm torn on whether we really should cache nil-items here...

karalabe (Member, Author) commented Oct 10, 2019 (comment minimized, body not shown)

holiman (Contributor) commented Oct 12, 2019:

Another couple of points:

  • Right now, if each block modifies roughly 2000 items (accounts + storage), and we have 128 layers,
  • and each layer represents one block, whereas the bottom layer (before disk) represents 200 blocks, that means
  • we'll have 128 layers of size N, and 1 layer of size 200 * N.
  • Eventually we'll flush the bottom-most layer, and at that point we have essentially 128 blocks in memory (going down from e.g. 350).

Now, if we have M bytes of memory available, it seems to me that it would make more sense to have a gradual slope of memory usage instead of N, N, ..., 200 * N. Instead have N, N, 2N, 2N, 4N, ..., 64N (totalling 254N in this example; see the small sketch after the list below). The consequence would be the following:

  • Upside: when flushing the lowest layer, we'd not lose the majority of blocks from memory, only a smaller portion. So smaller fluctuations in performance.
  • Upside: when accessing an item, we'd not iterate through 128 layers, only ~14.
  • Downside: when reorging, or accessing a particular block's state, we'd potentially have to re-execute some blocks (maybe up to 63). However, if tuned well, this would not happen on mainnet.
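
A tiny, purely illustrative helper (not part of the PR) to make the arithmetic above concrete: two layers per size step of N, 2N, 4N, ..., 64N gives 14 layers holding 254N in total, versus 128 flat layers of N plus a 200N accumulator.

package snapsketch

// slopeLayers returns the per-layer budgets for the gradual slope described
// above, plus their sum. With n as one block's worth of mutations, the result
// is 14 layers totalling 254*n.
func slopeLayers(n int) (sizes []int, total int) {
	for factor := 1; factor <= 64; factor *= 2 {
		sizes = append(sizes, factor*n, factor*n) // two layers per size step
		total += 2 * factor * n
	}
	return sizes, total
}
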
dl.lock.Lock()
defer dl.lock.Unlock()

dl.parent = parent.flatten()

karalabe (Member, Author) commented Oct 12, 2019:

We might need some smarter locking here. If the parents have side branches, those don't get locked by the child's lock.

// Snapshot represents the functionality supported by a snapshot storage layer.
type Snapshot interface {
// Info returns the block number and root hash for which this snapshot was made.
Info() (uint64, common.Hash)

holiman (Contributor) commented Oct 12, 2019:

It would be 'nicer' if Info returned the span of blocks that a snapshot represents, and not just the last block.

karalabe (Member, Author) commented Oct 23, 2019:

I'm actually thinking of nuking the whole number tracking. Clique gets messy because roots are not unique across blocks. We could make it root + number, but that would entail statedb needing to add block numbers to every single method, which gets messy fast.

Alternatively, we can just not care about block number and instead just track child -> parent hierarchies. This would mean that we could end up with a lot more state "cached" in memory than 128 if there are ranges of empty clique blocks, but those wouldn't add any new structs, just keep the existing ones around for longer, so should be fine.

func (dl *diffLayer) Journal() error {
dl.lock.RLock()
defer dl.lock.RUnlock()

holiman (Contributor) commented Oct 17, 2019:

Perhaps we should check dl.stale here and error out if set

holiman (Contributor) commented Oct 23, 2019:

Actually, it's probably better to do it in journal, since one of the parents might be stale, not just this level.

// If we still have diff layers below, recurse
if parent, ok := diff.parent.(*diffLayer); ok {
return st.cap(parent, layers-1, memory)
Comment on lines 232 to 234

holiman (Contributor) commented Oct 23, 2019:

Now that it's standalone and not internal to a diffLayer, it would be nicer to iterate instead of recursing, imo.

}
writer = file
}
// Everything below was journalled, persist this layer too

holiman (Contributor) commented Oct 23, 2019:

Suggested change:
-// Everything below was journalled, persist this layer too
+if dl.stale {
+	return nil, ErrSnapshotStale
+}
+// Everything below was journalled, persist this layer too
}
// If we haven't reached the bottom yet, journal the parent first
if writer == nil {
file, err := dl.parent.(*diffLayer).journal()

holiman (Contributor) commented Oct 23, 2019:

Right now, we obtain the read lock in Journal, but here we're calling the parent's journal directly, bypassing the lock-taking. We should remove the locking from Journal and do it in this method instead, right after the parent is done writing.

karalabe force-pushed the karalabe:snapshot-5 branch 4 times, most recently from 0b8d955 to 0f86812 on Nov 22, 2019

// If the layer is being generated, ensure the requested hash has already been
// covered by the generator.
if dl.genMarker != nil && bytes.Compare(key, dl.genMarker) > 0 {

holiman (Contributor) commented Nov 26, 2019:

This seems to work, but would be more obvious with the following change:

Suggested change:
-if dl.genMarker != nil && bytes.Compare(key, dl.genMarker) > 0 {
+if dl.genMarker != nil && bytes.Compare(accountHash[:], dl.genMarker) > 0 {
core/state/snapshot/disklayer_generate.go (review thread outdated, resolved)
holiman (Contributor) commented Nov 27, 2019:

I don't understand... At the end of this method, we return a new diskLayer. Who remembers these accounts that were left stranded in this layer and not copied to disk?

Answering myself: the new diskLayer doesn't contain them, so the caller will later have to resolve them from the trie, which is fine... I think?

karalabe force-pushed the karalabe:snapshot-5 branch from 9809fd1 to 4cdddb7 on Nov 27, 2019

speed := done/uint64(time.Since(gs.start)/time.Millisecond+1) + 1 // +1s to avoid division by zero
ctx = append(ctx, []interface{}{
"eta", common.PrettyDuration(time.Duration(left/speed) * time.Millisecond),

holiman (Contributor) commented Nov 27, 2019:

Could you add a percentage too? I think that's a pretty user-friendly thing to have. Basically 100 * binary.BigEndian.Uint32(marker[:4]) / uint32(-1).
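
A small sketch of that percentage, assuming the first four bytes of the generation marker are read as a position in the 2^32 account-hash space; math.MaxUint32 is used instead of uint32(-1), which would not compile as a constant conversion. Names are illustrative, not from the PR.

package snapsketch

import (
	"encoding/binary"
	"math"
)

// generationProgress turns the generator marker into a rough completion
// percentage by treating its four-byte prefix as a position in the key space.
func generationProgress(marker []byte) float64 {
	if len(marker) < 4 {
		return 0 // nothing generated yet (or the marker is still empty)
	}
	pos := binary.BigEndian.Uint32(marker[:4])
	return 100 * float64(pos) / float64(math.MaxUint32)
}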

holiman and others added 8 commits Dec 2, 2019
… lookup
karalabe force-pushed the karalabe:snapshot-5 branch from 586c663 to 06d4470 on Feb 25, 2020
@@ -468,6 +466,10 @@ func (s *StateDB) updateStateObject(obj *stateObject) {

// If state snapshotting is active, cache the data til commit
if s.snap != nil {
// If the account is an empty resurrection, unmark the storage nil-ness
if storage, ok := s.snapStorage[obj.addrHash]; storage == nil && ok {

holiman (Contributor) commented Mar 2, 2020:

I'm not convinced this is correct. There are a couple of things that can happen:

  • Old code (before CREATE2): a contract with storage is selfdestructed in tx n. In tx n+1, someone sends a wei to the address, and the account is recreated. The desired end state should be that the storage has become nil and the account exists.

  • New code, with CREATE2: a contract is killed in tx n. In tx n+1, the contract is recreated, and the initcode sets new storage slots. So the old storage slots are all cleared, and there are now new storage slots set. We need to handle this (we don't currently).

holiman and others added 3 commits Mar 2, 2020
karalabe force-pushed the karalabe:snapshot-5 branch from 4b257db to 6e05ccd on Mar 3, 2020
karalabe added 3 commits Mar 3, 2020
karalabe force-pushed the karalabe:snapshot-5 branch from 641e415 to 328de18 on Mar 4, 2020
holiman and others added 3 commits Mar 4, 2020
gballet (Member) left a comment:

LGTM, a couple of comments here and there, as well as a question about future-proofing the PR.

Account(hash common.Hash) (*Account, error)

// AccountRLP directly retrieves the account RLP associated with a particular
// hash in the snapshot slim data format.
AccountRLP(hash common.Hash) ([]byte, error)

// Storage directly retrieves the storage data associated with a particular hash,
// within a particular account.
Storage(accountHash, storageHash common.Hash) ([]byte, error)
Comment on lines +101 to +109

gballet (Member) commented Mar 9, 2020:

This isn't a showstopper for merging the PR, merely a question regarding future evolutions of Ethereum: there is an EIP (haven't found it yet) that suggests merging the account and storage tries. This kept recurring in stateless 1.x discussions. It would be a good idea to have a more generic method like GetBlobAtHash(f) interface{}, taking a function f to deserialize the blob.
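
A hedged sketch of the generic accessor gballet floats here: one lookup method that takes a decoder, so the snapshot stays agnostic to whether the blob is an account, a storage slot, or a leaf of a future merged trie. This interface is entirely hypothetical and not part of the PR; only common.Hash is a real go-ethereum type.

package snapsketch

import "github.com/ethereum/go-ethereum/common"

// blobSnapshot is a hypothetical generalisation of the Account/Storage pair
// above: the caller supplies the deserialisation function, and the snapshot
// only deals in opaque blobs keyed by hash.
type blobSnapshot interface {
	GetBlobAtHash(hash common.Hash, decode func(blob []byte) (interface{}, error)) (interface{}, error)
}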

//
// Note, the method is an internal helper to avoid type switching between the
// disk and diff layers. There is no locking involved.
Parent() snapshot

gballet (Member) commented Mar 9, 2020:

If it's an internal helper, then it shouldn't be public.

// The goal of a state snapshot is twofold: to allow direct access to account and
// storage data to avoid expensive multi-level trie lookups; and to allow sorted,
// cheap iteration of the account/storage tries for sync aid.
type Tree struct {

gballet (Member) commented Mar 9, 2020:

Change name to match comment, or vice-versa.

karalabe modified the milestones: 1.9.12, 1.9.13 on Mar 16, 2020
* core/state/snapshot/iterator: fix two disk iterator flaws

* core/rawdb: change SnapshotStoragePrefix to avoid prefix collision with preimagePrefix
karalabe force-pushed the karalabe:snapshot-5 branch from d645b8f to 074efe6 on Mar 23, 2020
karalabe merged commit 613af7c into ethereum:master on Mar 23, 2020
1 check was pending: continuous-integration/travis-ci/pr (The Travis CI build is in progress)