Store plain historical states to reduce the db size of archive node #13317

yihuang · 2022-09-16T07:25:49Z

Plain Historical State

Context

The archive node's db size increase very fast, akthough pruning is good, but many users still need to access the complete chain history, to keep it sustainable, we need a way to reduce the disk space requirement of archive nodes.

One observation is that most of the time, we don't need to generate Merkle proofs for historical states, so we can prune the iavl tree aggressively and keep the historical states in a compact format.

Proposal

At the end of the block, before committing the iavl tree, also save the change sets in another archive storage, the storage format is explained below.

Support querying from the archive storage transparently.

Nodes can use much more aggressive pruning settings for iavl tree, and still be able to query full chain history.

The total db size is expected to be much lower than an archived iavl tree.

Write Amplification

The write amplification is increased, but it should be ok if writing speed is not the bottleneck for the workload (need to verify), after all, we only do a batch of writes per several seconds.

Archive Format

Option 1

Store in kv db (inspired by erigon ¹).

History Index, key -> set of block numbers

The set of block numbers that modified the key, is encoded as a bitmap.
Change Sets, (block number, key) -> value

Record the prior value for each modification.

Option 2

Store the history index in kv db.

Store the change sets in plain files in an append-only way, together with some index files to record the offsets of blocks.

When querying, first locate the block number through the history index, then locate the index into the change set file, then find the key in the change set of the block with linear search.

If the change set of a block is big, it could be encoded as a mmap-able b-tree, for more efficient querying.

https://github.com/ledgerwatch/erigon/blob/devel/docs/programmers_guide/db_walkthrough.MD#table-history-of-accounts ↩

The text was updated successfully, but these errors were encountered:

tac0turtle · 2022-09-16T07:39:56Z

Big fan of changing how we do writing and handling of historical state. Right now we are writing a specification and better testing for the store package to better understand the semantics and then we plan on making changes. There are a ton of changes we can do to bring huge improvements in performance.

yihuang · 2022-09-17T16:38:29Z

If there are some nesserary hooks in SDK, it might be possible to develop this feature outside of SDK:

A hook for third-party to handle the change sets before committing to the iavl tree.
- I see there's already a WriteListener, may just extend that?
A hook to provide the external historical states to grpc query handlers.

tac0turtle · 2022-09-17T16:43:43Z

If there are some nesserary hooks in SDK, it might be possible to develop this feature outside of SDK:

A hook for third-party to handle the change sets before committing to the iavl tree.

I see there's already a WriteListener, may just extend that?

A hook to provide the external historical states to grpc query handlers.

this is adr-038, no?

yihuang · 2022-09-17T16:50:07Z

If there are some nesserary hooks in SDK, it might be possible to develop this feature outside of SDK:

A hook for third-party to handle the change sets before committing to the iavl tree.

I see there's already a WriteListener, may just extend that?

A hook to provide the external historical states to grpc query handlers.

this is adr-038, no?

yeah, didn't know that, looks like exactly what we need.

yihuang · 2022-09-17T17:23:01Z

it'd be great if adr-038 could be backported to as far as v0.44 though, so we can stream the chain since the genesis.

yihuang · 2022-09-17T17:37:20Z

~~I guess one way to handle the legacy blocks, is to recompute the changesets by doing a complete comparison between two versions in the iavl tree, I assume that'd be too slow to be practical.~~, let's just backport the streamer and dump the changesets.

yihuang · 2022-09-23T03:05:34Z

Implementation Plan

// VersionStore is a versioned storage of a flat key-value pairs.
// it don't need to support merkle proof, so could be implemented in a much more efficient way.
// `nil` version means the latest version.
type VersionStore interface {
	GetAtVersion(storeKey string, key []byte, version *int64) ([]byte, error)
	HasAtVersion(storeKey string, key []byte, version *int64) (bool, error)
	IteratorAtVersion(storeKey string, start, end []byte, version *int64) types.Iterator
	ReverseIteratorAtVersion(storeKey string, start, end []byte, version *int64) types.Iterator
	GetLatestVersion() int64

	// Persist the change set of a block,
	// the `changeSet` should be ordered by (storeKey, key),
	// the version should be latest version plus one.
	PutAtVersion(version int64, changeSet []types.StoreKVPair) error
}

Integrate VersionStore into rootmulti.Store

Use the optional VersionStore to serve reading calls if provided.
~~Call the VPut method in Commit~~, we can do the commit in end blocker with the state listener interface in adr-038.

VersionStore can be implemented on top of dbm.DB, we might also experiment with lmdb/mdbx, which may not fit into the current dbm interface.

tac0turtle · 2022-09-23T13:29:47Z

in this approach are you assuming iavl does not handle versions anymore?

I think if we fix iavl ky format queries will improve dramatically

alexanderbez · 2022-09-23T13:33:51Z

Sounds like you're describing a dedicated SS layer. I'm definitely in favor of splitting SS + SC responsibilities from the IAVL!

I would really really really love for all this to be consolidated into a single EPIC. Which we have here -> #12986

yihuang · 2022-10-06T03:24:29Z

in this approach are you assuming iavl does not handle versions anymore?

yes, only keep the recent ones necessary for the consensus state machine, pruning=everything.

I think if we fix iavl ky format queries will improve dramatically

We are mainly focus on db size right now, in local test, the new db format is 10x smaller than the iavl db.
I see great potential in the new iavl key format too, that should improve the db size of iavl too, but won't match the versiondb.

yihuang · 2022-10-06T03:27:05Z

Sounds like you're describing a dedicated SS layer. I'm definitely in favor of splitting SS + SC responsibilities from the IAVL!

The difference with the SS described in v2 store is it also support versioning, so it can support an archived grpc query service.
The problem with changing iavl itself is, it's impractical to migrate the whole state on production network.

I would really really really love for all this to be consolidated into a single EPIC. Which we have here -> #12986

happy to move the discussion there.
Currently the implementation can be done and experimented externally(crypto-org-chain/cronos#722), thanks to the adr-038 state streaming interface.
Now I need to hook the alternative db to the grpc query service, I'm trying to find a way that don't need to change too much to sdk internals.

tac0turtle · 2022-10-06T09:16:47Z

personally the design of v2 was poor. the idea that proof support for historical versions means you need to reconstruct the tree is a bad design. Proof support while not widely used today is at the core of blockchain and decentralising data access.

Secondly, after talking to dev from osmosis, there are designs where proof support will want to exist for previous heights. We should allow an alternative for chains that don't care, but I think we should not write it off entirely

yihuang · 2022-10-06T10:42:31Z

We should allow an alternative for chains that don't care, but I think we should not write it off entirely

Agree, that's why I try to keep the implementation outside of sdk for now. But will need to change the API a little bit to hook the db with the grpc query.

alexanderbez · 2022-10-07T20:50:51Z

@yihuang do you have a concrete proposal? I do agree that SS should not concern itself with versioning -- it has no need to. Seems like we're on the same page.

yihuang · 2022-10-08T02:34:25Z

@yihuang do you have a concrete proposal? I do agree that SS should not concern itself with versioning -- it has no need to. Seems like we're on the same page.

We have a prototype implementation already crypto-org-chain/cronos#722 .
this is only for reducing archive node db size by dropping historical merkle proofs, not a generic replacement for the iavl tree though.

alexanderbez · 2022-10-08T21:53:59Z

Amazing. Do you have an ADR, spec or doc detailing the idea?

yihuang · 2022-10-11T05:46:47Z

Amazing. Do you have an ADR, spec or doc detailing the idea?

currently the best description of the db design is here
then we use the state listener to observe the state changes and store the results in this db
then implement a read-only MultiStore with this db
hook it into the grpc query server, this part need an API change in cosmos-sdk.

alexanderbez · 2022-10-11T17:00:27Z

I see. We can consider and evaluate that approach as we refactor the store. I'm not sure we'll go exactly with that approach though.

alexanderbez · 2022-12-04T01:55:45Z

@yihuang so I am going to write up an ADR this week for a formal design proposal of concrete changes to state management based primarily on the work and approach designed here.

Am I correct in understanding that merkle proofs on historical data is not supported in the approach you've outlined? What if that is required or a desired property, e.g. relayers or IBC clients. What is your proposal or ideas around proof support for historical state?

yihuang · 2022-12-04T03:22:00Z

@yihuang so I am going to write up an ADR this week for a formal design proposal of concrete changes to state management based primarily on the work and approach designed here.

Am I correct in understanding that merkle proofs on historical data is not supported in the approach you've outlined? What if that is required or a desired property, e.g. relayers or IBC clients. What is your proposal or ideas around proof support for historical state?

yes, it don't support proof generation, for that case, we can probably keep an archive iavl tree with more compression for that special case, for example, omitting the value field from leaf nodes.
BTW, versiondb and iavl tree are interchangeable (assuming the writings are ordered by key in each version), in the sense that we can rebuild iavl tree from versiondb changesets, also can extract changesets from iavl tree versions, so in theory no information is lost, it's just too slow to rebuild iavl tree on the fly, I guess we can also keep some checkpoints to speedup iavl tree rebuild, which is more space efficient than a full archive.

yihuang · 2022-12-04T03:32:38Z

@alexanderbez recently I'm also thinking about a design choice regarding what value to store in changeset, there are two options:

store the before value, this is the current proposal, the advantage is it don't duplicate values of the latest version.
store the after value, the disadvantage is it duplicate the latest state, but the advantage is changeset is self-contained, which could is useful when we chunking the historical states, for example, a node can sync changesetes up to the version N, and can already serve queries up to version N. In the option 1, the changeset is only useful when combined with the latest states.

alexanderbez · 2022-12-04T14:38:32Z

On first glance, I think I prefer the current approach. Having the latest state + some subset (or all) of historical state seems like a pretty normal expectation to have.

yihuang · 2022-12-05T00:14:11Z

Another small issue compared with erigon is, we can't do chunking of the bitmap index, in erigon the chunking is done by append chunk id to the key, but since our keys have variable length, we can't do that, otherwise, it'll interleave with other keys.

alexanderbez · 2022-12-05T14:50:58Z

So I don't know what any of this is, so when I write the ADR, it would be helpful if you could link me to all relevant specification and documentation on the historicalDB approach (Sorry in advance if you've already linked it :p )

yihuang · 2022-12-06T04:14:20Z

So I don't know what any of this is, so when I write the ADR, it would be helpful if you could link me to all relevant specification and documentation on the historicalDB approach (Sorry in advance if you've already linked it :p )

The most complete description is still this doc¹, I just made some complements, describing the bitmap chunking issue, and more advantages of the alternative design to store the after value.

https://hackmd.io/@2O2cXDfdQpijemGc_vlHCA/SkXxuTAli ↩

yihuang · 2022-12-28T16:05:16Z

I'm trying the rocksdb user-defined timestamp¹ to implement versiondb, it seems pretty good.
I've implemented commands² to extract change set from iavl versions, and store the change set in plain file format, and write the change set into sst file, using the version as user-defined timestamp of keys.
The one downside is it bind with rocksdb backend, at least for the versiondb part, good-side is it's dead simple to implement and likely more performant than custom indexings.

alexanderbez · 2022-12-29T15:54:40Z

Something we'll discuss in the storage WG!

tac0turtle · 2023-05-15T08:34:37Z

closing this because of version db and the work that will commence with store v2

tac0turtle added the C:Store label Sep 16, 2022

This was referenced Sep 19, 2022

backport adr-038 streaming service to older versions #13324

Closed

Problem: there's no compact historical state storage crypto-org-chain/cronos#704

Closed

draft ADR-003 Contract Storage Backends evmos/ethermint#1304

Closed

yihuang mentioned this issue Oct 6, 2022

EPIC: Storage #12986

Closed

22 tasks

yihuang mentioned this issue Oct 13, 2022

feat: support alternative query multistore #13529

Merged

19 tasks

yihuang mentioned this issue Nov 11, 2022

Make the guarantee that the iavl tree is inserted with ordered keys in each version #13837

Open

tac0turtle closed this as completed May 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Store plain historical states to reduce the db size of archive node #13317

Store plain historical states to reduce the db size of archive node #13317

yihuang commented Sep 16, 2022 •

edited

tac0turtle commented Sep 16, 2022

yihuang commented Sep 17, 2022 •

edited

tac0turtle commented Sep 17, 2022 •

edited

yihuang commented Sep 17, 2022

yihuang commented Sep 17, 2022

yihuang commented Sep 17, 2022 •

edited

yihuang commented Sep 23, 2022 •

edited

tac0turtle commented Sep 23, 2022

alexanderbez commented Sep 23, 2022

yihuang commented Oct 6, 2022 •

edited

yihuang commented Oct 6, 2022 •

edited

tac0turtle commented Oct 6, 2022

yihuang commented Oct 6, 2022 •

edited

alexanderbez commented Oct 7, 2022

yihuang commented Oct 8, 2022

alexanderbez commented Oct 8, 2022

yihuang commented Oct 11, 2022

alexanderbez commented Oct 11, 2022

alexanderbez commented Dec 4, 2022

yihuang commented Dec 4, 2022 •

edited

yihuang commented Dec 4, 2022 •

edited

alexanderbez commented Dec 4, 2022 •

edited

yihuang commented Dec 5, 2022

alexanderbez commented Dec 5, 2022

yihuang commented Dec 6, 2022 •

edited

yihuang commented Dec 28, 2022 •

edited

alexanderbez commented Dec 29, 2022

tac0turtle commented May 15, 2023

Store plain historical states to reduce the db size of archive node #13317

Store plain historical states to reduce the db size of archive node #13317

Comments

yihuang commented Sep 16, 2022 • edited

Plain Historical State

Context

Proposal

Write Amplification

Archive Format

Option 1

Option 2

Footnotes

tac0turtle commented Sep 16, 2022

yihuang commented Sep 17, 2022 • edited

tac0turtle commented Sep 17, 2022 • edited

yihuang commented Sep 17, 2022

yihuang commented Sep 17, 2022

yihuang commented Sep 17, 2022 • edited

yihuang commented Sep 23, 2022 • edited

Implementation Plan

tac0turtle commented Sep 23, 2022

alexanderbez commented Sep 23, 2022

yihuang commented Oct 6, 2022 • edited

yihuang commented Oct 6, 2022 • edited

tac0turtle commented Oct 6, 2022

yihuang commented Oct 6, 2022 • edited

alexanderbez commented Oct 7, 2022

yihuang commented Oct 8, 2022

alexanderbez commented Oct 8, 2022

yihuang commented Oct 11, 2022

alexanderbez commented Oct 11, 2022

alexanderbez commented Dec 4, 2022

yihuang commented Dec 4, 2022 • edited

yihuang commented Dec 4, 2022 • edited

alexanderbez commented Dec 4, 2022 • edited

yihuang commented Dec 5, 2022

alexanderbez commented Dec 5, 2022

yihuang commented Dec 6, 2022 • edited

Footnotes

yihuang commented Dec 28, 2022 • edited

Footnotes

alexanderbez commented Dec 29, 2022

tac0turtle commented May 15, 2023

yihuang commented Sep 16, 2022 •

edited

yihuang commented Sep 17, 2022 •

edited

tac0turtle commented Sep 17, 2022 •

edited

yihuang commented Sep 17, 2022 •

edited

yihuang commented Sep 23, 2022 •

edited

yihuang commented Oct 6, 2022 •

edited

yihuang commented Oct 6, 2022 •

edited

yihuang commented Oct 6, 2022 •

edited

yihuang commented Dec 4, 2022 •

edited

yihuang commented Dec 4, 2022 •

edited

alexanderbez commented Dec 4, 2022 •

edited

yihuang commented Dec 6, 2022 •

edited

yihuang commented Dec 28, 2022 •

edited