Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store plain historical states to reduce the db size of archive node #13317

Closed
yihuang opened this issue Sep 16, 2022 · 28 comments
Closed

Store plain historical states to reduce the db size of archive node #13317

yihuang opened this issue Sep 16, 2022 · 28 comments
Labels

Comments

@yihuang
Copy link
Collaborator

yihuang commented Sep 16, 2022

Plain Historical State

Context

The archive node's db size increase very fast, akthough pruning is good, but many users still need to access the complete chain history, to keep it sustainable, we need a way to reduce the disk space requirement of archive nodes.

One observation is that most of the time, we don't need to generate Merkle proofs for historical states, so we can prune the iavl tree aggressively and keep the historical states in a compact format.

Proposal

At the end of the block, before committing the iavl tree, also save the change sets in another archive storage, the storage format is explained below.

Support querying from the archive storage transparently.

Nodes can use much more aggressive pruning settings for iavl tree, and still be able to query full chain history.

The total db size is expected to be much lower than an archived iavl tree.

Write Amplification

The write amplification is increased, but it should be ok if writing speed is not the bottleneck for the workload (need to verify), after all, we only do a batch of writes per several seconds.

Archive Format

Option 1

Store in kv db (inspired by erigon 1).

  • History Index, key -> set of block numbers

    The set of block numbers that modified the key, is encoded as a bitmap.

  • Change Sets, (block number, key) -> value

    Record the prior value for each modification.

Option 2

Store the history index in kv db.

Store the change sets in plain files in an append-only way, together with some index files to record the offsets of blocks.

When querying, first locate the block number through the history index, then locate the index into the change set file, then find the key in the change set of the block with linear search.

If the change set of a block is big, it could be encoded as a mmap-able b-tree, for more efficient querying.

Footnotes

  1.  https://github.com/ledgerwatch/erigon/blob/devel/docs/programmers_guide/db_walkthrough.MD#table-history-of-accounts

@tac0turtle
Copy link
Member

Big fan of changing how we do writing and handling of historical state. Right now we are writing a specification and better testing for the store package to better understand the semantics and then we plan on making changes. There are a ton of changes we can do to bring huge improvements in performance.

@yihuang
Copy link
Collaborator Author

yihuang commented Sep 17, 2022

If there are some nesserary hooks in SDK, it might be possible to develop this feature outside of SDK:

  • A hook for third-party to handle the change sets before committing to the iavl tree.
    • I see there's already a WriteListener, may just extend that?
  • A hook to provide the external historical states to grpc query handlers.

@tac0turtle
Copy link
Member

tac0turtle commented Sep 17, 2022

If there are some nesserary hooks in SDK, it might be possible to develop this feature outside of SDK:

  • A hook for third-party to handle the change sets before committing to the iavl tree.

    • I see there's already a WriteListener, may just extend that?
  • A hook to provide the external historical states to grpc query handlers.

this is adr-038, no?

@yihuang
Copy link
Collaborator Author

yihuang commented Sep 17, 2022

If there are some nesserary hooks in SDK, it might be possible to develop this feature outside of SDK:

  • A hook for third-party to handle the change sets before committing to the iavl tree.

    • I see there's already a WriteListener, may just extend that?
  • A hook to provide the external historical states to grpc query handlers.

this is adr-038, no?

yeah, didn't know that, looks like exactly what we need.

@yihuang
Copy link
Collaborator Author

yihuang commented Sep 17, 2022

it'd be great if adr-038 could be backported to as far as v0.44 though, so we can stream the chain since the genesis.

@yihuang
Copy link
Collaborator Author

yihuang commented Sep 17, 2022

I guess one way to handle the legacy blocks, is to recompute the changesets by doing a complete comparison between two versions in the iavl tree, I assume that'd be too slow to be practical., let's just backport the streamer and dump the changesets.

@yihuang
Copy link
Collaborator Author

yihuang commented Sep 23, 2022

Implementation Plan

// VersionStore is a versioned storage of a flat key-value pairs.
// it don't need to support merkle proof, so could be implemented in a much more efficient way.
// `nil` version means the latest version.
type VersionStore interface {
	GetAtVersion(storeKey string, key []byte, version *int64) ([]byte, error)
	HasAtVersion(storeKey string, key []byte, version *int64) (bool, error)
	IteratorAtVersion(storeKey string, start, end []byte, version *int64) types.Iterator
	ReverseIteratorAtVersion(storeKey string, start, end []byte, version *int64) types.Iterator
	GetLatestVersion() int64

	// Persist the change set of a block,
	// the `changeSet` should be ordered by (storeKey, key),
	// the version should be latest version plus one.
	PutAtVersion(version int64, changeSet []types.StoreKVPair) error
}

Integrate VersionStore into rootmulti.Store

  • Use the optional VersionStore to serve reading calls if provided.
  • Call the VPut method in Commit, we can do the commit in end blocker with the state listener interface in adr-038.

VersionStore can be implemented on top of dbm.DB, we might also experiment with lmdb/mdbx, which may not fit into the current dbm interface.

@tac0turtle
Copy link
Member

in this approach are you assuming iavl does not handle versions anymore?

I think if we fix iavl ky format queries will improve dramatically

@alexanderbez
Copy link
Contributor

Sounds like you're describing a dedicated SS layer. I'm definitely in favor of splitting SS + SC responsibilities from the IAVL!

I would really really really love for all this to be consolidated into a single EPIC. Which we have here -> #12986

@yihuang
Copy link
Collaborator Author

yihuang commented Oct 6, 2022

in this approach are you assuming iavl does not handle versions anymore?

yes, only keep the recent ones necessary for the consensus state machine, pruning=everything.

I think if we fix iavl ky format queries will improve dramatically

We are mainly focus on db size right now, in local test, the new db format is 10x smaller than the iavl db.
I see great potential in the new iavl key format too, that should improve the db size of iavl too, but won't match the versiondb.

@yihuang
Copy link
Collaborator Author

yihuang commented Oct 6, 2022

Sounds like you're describing a dedicated SS layer. I'm definitely in favor of splitting SS + SC responsibilities from the IAVL!

The difference with the SS described in v2 store is it also support versioning, so it can support an archived grpc query service.
The problem with changing iavl itself is, it's impractical to migrate the whole state on production network.

I would really really really love for all this to be consolidated into a single EPIC. Which we have here -> #12986

happy to move the discussion there.
Currently the implementation can be done and experimented externally(crypto-org-chain/cronos#722), thanks to the adr-038 state streaming interface.
Now I need to hook the alternative db to the grpc query service, I'm trying to find a way that don't need to change too much to sdk internals.

@yihuang yihuang mentioned this issue Oct 6, 2022
22 tasks
@tac0turtle
Copy link
Member

personally the design of v2 was poor. the idea that proof support for historical versions means you need to reconstruct the tree is a bad design. Proof support while not widely used today is at the core of blockchain and decentralising data access.

Secondly, after talking to dev from osmosis, there are designs where proof support will want to exist for previous heights. We should allow an alternative for chains that don't care, but I think we should not write it off entirely

@yihuang
Copy link
Collaborator Author

yihuang commented Oct 6, 2022

We should allow an alternative for chains that don't care, but I think we should not write it off entirely

Agree, that's why I try to keep the implementation outside of sdk for now. But will need to change the API a little bit to hook the db with the grpc query.

@alexanderbez
Copy link
Contributor

@yihuang do you have a concrete proposal? I do agree that SS should not concern itself with versioning -- it has no need to. Seems like we're on the same page.

@yihuang
Copy link
Collaborator Author

yihuang commented Oct 8, 2022

@yihuang do you have a concrete proposal? I do agree that SS should not concern itself with versioning -- it has no need to. Seems like we're on the same page.

We have a prototype implementation already crypto-org-chain/cronos#722 .
this is only for reducing archive node db size by dropping historical merkle proofs, not a generic replacement for the iavl tree though.

@alexanderbez
Copy link
Contributor

Amazing. Do you have an ADR, spec or doc detailing the idea?

@yihuang
Copy link
Collaborator Author

yihuang commented Oct 11, 2022

Amazing. Do you have an ADR, spec or doc detailing the idea?

@alexanderbez
Copy link
Contributor

I see. We can consider and evaluate that approach as we refactor the store. I'm not sure we'll go exactly with that approach though.

@alexanderbez
Copy link
Contributor

@yihuang so I am going to write up an ADR this week for a formal design proposal of concrete changes to state management based primarily on the work and approach designed here.

Am I correct in understanding that merkle proofs on historical data is not supported in the approach you've outlined? What if that is required or a desired property, e.g. relayers or IBC clients. What is your proposal or ideas around proof support for historical state?

@yihuang
Copy link
Collaborator Author

yihuang commented Dec 4, 2022

@yihuang so I am going to write up an ADR this week for a formal design proposal of concrete changes to state management based primarily on the work and approach designed here.

Am I correct in understanding that merkle proofs on historical data is not supported in the approach you've outlined? What if that is required or a desired property, e.g. relayers or IBC clients. What is your proposal or ideas around proof support for historical state?

yes, it don't support proof generation, for that case, we can probably keep an archive iavl tree with more compression for that special case, for example, omitting the value field from leaf nodes.
BTW, versiondb and iavl tree are interchangeable (assuming the writings are ordered by key in each version), in the sense that we can rebuild iavl tree from versiondb changesets, also can extract changesets from iavl tree versions, so in theory no information is lost, it's just too slow to rebuild iavl tree on the fly, I guess we can also keep some checkpoints to speedup iavl tree rebuild, which is more space efficient than a full archive.

@yihuang
Copy link
Collaborator Author

yihuang commented Dec 4, 2022

@alexanderbez recently I'm also thinking about a design choice regarding what value to store in changeset, there are two options:

  • store the before value, this is the current proposal, the advantage is it don't duplicate values of the latest version.
  • store the after value, the disadvantage is it duplicate the latest state, but the advantage is changeset is self-contained, which could is useful when we chunking the historical states, for example, a node can sync changesetes up to the version N, and can already serve queries up to version N. In the option 1, the changeset is only useful when combined with the latest states.

@alexanderbez
Copy link
Contributor

alexanderbez commented Dec 4, 2022

On first glance, I think I prefer the current approach. Having the latest state + some subset (or all) of historical state seems like a pretty normal expectation to have.

@yihuang
Copy link
Collaborator Author

yihuang commented Dec 5, 2022

Another small issue compared with erigon is, we can't do chunking of the bitmap index, in erigon the chunking is done by append chunk id to the key, but since our keys have variable length, we can't do that, otherwise, it'll interleave with other keys.

@alexanderbez
Copy link
Contributor

So I don't know what any of this is, so when I write the ADR, it would be helpful if you could link me to all relevant specification and documentation on the historicalDB approach (Sorry in advance if you've already linked it :p )

@yihuang
Copy link
Collaborator Author

yihuang commented Dec 6, 2022

So I don't know what any of this is, so when I write the ADR, it would be helpful if you could link me to all relevant specification and documentation on the historicalDB approach (Sorry in advance if you've already linked it :p )

The most complete description is still this doc1, I just made some complements, describing the bitmap chunking issue, and more advantages of the alternative design to store the after value.

Footnotes

  1. https://hackmd.io/@2O2cXDfdQpijemGc_vlHCA/SkXxuTAli

@yihuang
Copy link
Collaborator Author

yihuang commented Dec 28, 2022

I'm trying the rocksdb user-defined timestamp1 to implement versiondb, it seems pretty good.
I've implemented commands2 to extract change set from iavl versions, and store the change set in plain file format, and write the change set into sst file, using the version as user-defined timestamp of keys.
The one downside is it bind with rocksdb backend, at least for the versiondb part, good-side is it's dead simple to implement and likely more performant than custom indexings.

Footnotes

  1. https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-(Experimental)

  2. https://github.com/crypto-org-chain/cronos/pull/791

@alexanderbez
Copy link
Contributor

Something we'll discuss in the storage WG!

@tac0turtle
Copy link
Member

closing this because of version db and the work that will commence with store v2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants