
cmd/geth: implement data import and export #22931

Merged (20 commits) on Nov 2, 2021

Conversation

Member

@rjl493456442 rjl493456442 commented May 24, 2021

This PR offers two more database sub-commands for exporting and importing snapshot data.
These commands are useful in the following scenario: an archive node has just upgraded and started using snapshots. Regenerating the snapshot is very expensive, especially for an archive node with a huge database. Instead, node operators can snap-sync a fresh geth node (a few hours) and import all the snapshot data into the archive node, which can then pick it up and do the repair work (a few hours). Compared with the endless snapshot generation, this manual work is much faster.

These two commands can also be used more generally for importing/exporting arbitrary chain data.

Contributor

@holiman holiman left a comment


A few other thoughts:

  • It would be good to have a progress report every 8 seconds or so.
  • What happens if the user presses ctrl-c? Does it exit gracefully or just croak?
  • Since these are long-running operations, it might be neat if the user can press ctrl-c, exit orderly, and then restart from where it left off. Perhaps using a startKey or something.
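The interrupt handling asked for above can be sketched roughly like this. This is an illustrative stand-alone sketch, not the PR's code: `processWithInterrupt`, its signature, and the resume-by-count idea are assumptions; in real use the `interrupt` channel would be closed by a goroutine watching `signal.Notify(ch, os.Interrupt)`.

```go
package main

import (
	"fmt"
	"time"
)

// processWithInterrupt iterates over keys, checking the interrupt channel
// between items so that a Ctrl-C handler can stop the loop at a safe point.
// It returns how many items were processed; a future run could use that as
// a resume point (the hypothetical startKey idea above).
func processWithInterrupt(keys []string, interrupt <-chan struct{}) int {
	processed := 0
	logged := time.Now()
	for _, key := range keys {
		select {
		case <-interrupt:
			fmt.Printf("interrupted, processed %d items\n", processed)
			return processed
		default:
		}
		_ = key // the actual import/export work would happen here
		processed++
		if time.Since(logged) > 8*time.Second {
			// Progress report roughly every 8 seconds.
			fmt.Printf("progress: %d items\n", processed)
			logged = time.Now()
		}
	}
	return processed
}
```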

cmd/geth/dbcmd.go (outdated, resolved)
cmd/utils/cmd.go (outdated):
}
defer fh.Close()

var reader io.Reader = fh
Contributor


Wrapping it in a buffered reader might be a good idea

Contributor


^ still a good idea :)

cmd/utils/cmd.go (outdated, resolved)
Contributor

holiman commented May 25, 2021

Triage discussion: this could be 'generalized' so we have a generic importer which imports key/value pairs. Then we could have specific exporters which export preimages, snap data, or whatever it may be.
It does require a different binary format though, so we can distinguish between keys and values.

@rjl493456442
Member Author

@holiman I will not add the start option right now, since it's a little complicated and I'm not sure it's worth the complexity. For example, the snapshot data has two sub-types: account snapshot and storage snapshot. If we stopped somewhere in the last run, we would have to specify both the prefix and the start position.
Let's see how long it takes to import the snap data.

Comment on lines 610 to 612
// The prefix used to identify the snapshot data type,
// 0: account snapshot
// 1: storage snapshot
Contributor


If we do it this way, we're limiting imports to only the predefined types. If we instead encode it as [key, value], [key, value], [key, value], then the importer doesn't need to be datatype-aware, and we can use the same import function regardless of whether we're importing snapshot data or e.g. preimage data or trie data.

Contributor

holiman commented Aug 3, 2021

@rjl493456442 any thoughts on making it so that we only ever need one single generic importer -- even if we decide to have custom exporters for different data types? I think that would be pretty good, because then an older node could import data generated by a more recent one which is able to export the things it needs.

@rjl493456442
Member Author

@holiman Yes, sure. I'll check how to implement this general exporter/importer.

Contributor

holiman commented Aug 4, 2021 via email

cmd/utils/cmd.go (outdated, resolved)
cmd/utils/cmd.go (outdated, resolved)
cmd/utils/cmd.go (outdated, resolved)
cmd/utils/cmd.go (outdated):
}
defer fh.Close()

var reader io.Reader = fh
Contributor


^ still a good idea :)

Contributor

holiman commented Oct 11, 2021

My suggestions: holiman@cf1260d

@rjl493456442
Member Author

@holiman Can you move your code into this PR? It looks good to me.

Contributor

holiman commented Oct 12, 2021

Ok, done. Now it's nearly fully generic. For example, we could write an exporter that exports a particular trie root, or one that spits out all metadata to more easily analyze what has gone wrong, e.g. if someone has inconsistencies between pivot block / latest header / latest block / latest snapshot etc.
However, one thing the generic importer is incapable of is deleting elements.
If we were able to add deletions, we could also use this feature to delete the pivot marker (and a few other fields) to trigger a re-sync. I'm not 100% sure what extra usability we'd gain by adding deletions, but it's a decision we need to make now, because if we do want to support them, we need to modify the data format to not only be key/value pairs, but also have a flag to signal deletion.

Contributor

holiman commented Oct 12, 2021

Another thing to consider: adding a metadata field as the first element. It could contain version info, the exported 'type' and date of export.

Contributor

holiman commented Oct 12, 2021

Added a header now + testcases. The header contains timestamp, version and "kind"-string.

Contributor

holiman commented Oct 12, 2021

Now also with a magic fingerprint, so arbitrary RLP lists don't import just because they happen to look like the header, RLP-wise.
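Putting the two comments above together, the header carries a magic fingerprint plus version, kind and timestamp. A minimal sketch; the field names and the magic value here are illustrative assumptions, not geth's actual ones:

```go
package main

import (
	"errors"
	"fmt"
)

// headerMagic is a made-up fingerprint for illustration; its only job is
// to make it vanishingly unlikely that an arbitrary RLP list decodes as
// a valid export header.
const headerMagic = 0x3b6f2c58

// exportHeader sketches the stream header described above.
type exportHeader struct {
	Magic    uint32
	Version  uint64
	Kind     string // e.g. "preimage" or "snapshot"
	UnixTime uint64
}

// verify rejects streams with the wrong fingerprint or an unknown version.
func (h *exportHeader) verify() error {
	if h.Magic != headerMagic {
		return errors.New("not an export stream: bad magic")
	}
	if h.Version != 1 {
		return fmt.Errorf("unsupported export version %d", h.Version)
	}
	return nil
}
```

With versioning in place, the format can later grow (e.g. deletion support) without breaking old importers, which is exactly the point made below.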

Contributor

holiman commented Oct 12, 2021

Now that we have versioning, maybe we don't have to add support for deletions right now. We can add that in the future, if we decide we need it.

Contributor

@holiman holiman left a comment


LGTM (but then I wrote some of the code, so someone else maybe should thumb it up too)

cmd/geth/chaincmd.go (outdated, resolved)
Contributor

holiman commented Oct 19, 2021

We need to extend the format to handle deletions as well. If we import snapshot data, we need to delete the snapshot metadata, so it's regenerated based on the newest block and not interpreted as some old, already-present but overwritten snapshot.

Contributor

holiman commented Oct 19, 2021

Currently, the schema is an rlp stream:

header, k1, v1, k2, v2, ... kN,  vN 

Possible schemas for deletions, if element X is to be deleted

Direct approach

Instead of encoding key/value pairs, encode triplets: op/key/value.

header, op1, k1, v1, op2, k2, v2... opN, kN, vN

Pro: simple, no special cases
Con: High overhead, even if op is just a byte.

Magic key

Instead of doing key/value pairs, we'd have a special key meaning "delete", and the value would be the key to delete.

header, k1, v1, k2, v2...  <deleteMagic>, kX ...kN, vN 

Pro:

  • Zero overhead for all the additions, which most likely make up the bulk of the data

Cons:

  • We'd have a 'magic' key, e.g "DeleteElement".

Any other ideas?
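The "magic key" option can be sketched as below. This is an illustrative sketch against a plain map standing in for the database; the sentinel value is made up, and as the con above notes, a real scheme must guarantee the sentinel can never collide with an actual database key.

```go
package main

import "bytes"

// deleteMagic is a made-up reserved key: when it appears as the key, the
// accompanying value is itself a key to delete rather than data to store.
var deleteMagic = []byte("\x00__delete__")

// applyMagicKV applies one key/value pair from the stream to the db,
// interpreting the magic key as a deletion instruction.
func applyMagicKV(db map[string][]byte, key, val []byte) {
	if bytes.Equal(key, deleteMagic) {
		delete(db, string(val)) // val carries the key to remove
		return
	}
	db[string(key)] = val
}
```

Note the zero-overhead property claimed above: ordinary additions pay nothing, and only deletions spend an extra entry.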

@MariusVanDerWijden
Member

What do you think about something like this:

#Additions, #Deletions, header, ak1, av1, ak2, av2, ... akN, avN, dk1, dk2, ... dkN

Contributor

holiman commented Oct 19, 2021

What do you think about something like this:

Yes, that's also something I've considered. That scheme forces all deletions into a particular position, at the end. Which I guess is fine, as long as the full import is performed.

One might want to produce a dump which first of all deletes some marker, then starts filling data. That way, if the import is aborted, and geth started, there is no corruption. Whereas if we force deletions to go last, we lose that ability.

Also, it means that if there are for some reason a lot of deletions (though I can't think of why there would be), then the exporter would have to hang on to those and not flush them until after it's done with the additions.

Member Author

rjl493456442 commented Oct 21, 2021

@holiman @MariusVanDerWijden

What about adding a new area called metadata?

I think exporting a batch of deletion markers and then importing these markers into another db for deletion doesn't sound realistic. We can add one more area which contains customized key-value pairs (deletions can be included there).

The export file will then contain: EXPORT_HEADER || METADATA || DATA

Just like the header, the metadata can also be a struct

type entry struct {
	Key, Val []byte
}
type Metadata []entry

We can put all deletion markers there with nil values. Also, the metadata area is always handled before the data area.
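Filling in that sketch, applying the proposed metadata area could look like this. A plain map stands in for the database here, and `applyMetadata` is an illustrative name, not code from the PR:

```go
package main

// entry and Metadata mirror the struct proposed above: the metadata area
// is a list of customized key/value entries applied before the data area,
// where a nil value marks a deletion.
type entry struct {
	Key, Val []byte
}

type Metadata []entry

// applyMetadata processes the metadata area ahead of the bulk data, so
// e.g. stale snapshot markers are removed before new data arrives.
func applyMetadata(db map[string][]byte, md Metadata) {
	for _, e := range md {
		if e.Val == nil {
			delete(db, string(e.Key)) // deletion marker
		} else {
			db[string(e.Key)] = e.Val
		}
	}
}
```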

Contributor

holiman commented Oct 21, 2021

@rjl493456442 so essentially, your scheme would be

header, [ d1, d2, d3, .. dn], k1, v1, k2, v2

Where the deletion markers, from an RLP perspective, are not just appended (like the kv pairs), but form an actual RLP list?
I think that could work, since the delete-list should be pretty small. It also has nearly zero overhead and it's not ambiguous.

Contributor

@holiman holiman left a comment


I think it was a good choice to switch over to iterators

cmd/utils/cmd.go (outdated):
start = time.Now()
logged = time.Now()
)
for key, val, next := iter.Next(); next; key, val, next = iter.Next() {
Contributor


Both for this iterator and the next: the final bool value being returned isn't really safe to rely on. What you have implemented is not "are there more elements?", but rather "is there a chance that there are more elements?". Or, "go to next; was it ok?"
So it makes more sense to name it:

	for key, val, ok := iter.Next(); ok;  key, val, ok = iter.Next() {
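The `ok` semantics being suggested can be shown with a minimal slice-backed iterator. This is an illustrative sketch, not the PR's iterator type: `ok` reports only that the advance succeeded, which is why the loop condition tests it rather than peeking ahead.

```go
package main

// sliceIter is a toy iterator with the Next() (key, val, ok) shape
// discussed above.
type sliceIter struct {
	keys, vals [][]byte
	pos        int
}

// Next advances the iterator; ok means "the advance succeeded",
// not "more elements remain after this one".
func (it *sliceIter) Next() ([]byte, []byte, bool) {
	if it.pos >= len(it.keys) {
		return nil, nil, false
	}
	k, v := it.keys[it.pos], it.vals[it.pos]
	it.pos++
	return k, v, true
}

// drain shows the idiomatic consumption loop from the review comment.
func drain(it *sliceIter) int {
	count := 0
	for _, _, ok := it.Next(); ok; _, _, ok = it.Next() {
		count++
	}
	return count
}
```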

Contributor

holiman commented Nov 1, 2021

Rebased, and fixed the format to be op, key, value, op, key, value, ...
@rjl493456442 LGTY?
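Applying that final op/key/value framing on the import side can be sketched as below. A plain map stands in for the database, and the opcode values (0 = add, 1 = delete) plus the names `opAdd`/`opDel`/`applyOp` are assumptions for illustration, not necessarily the constants the PR landed with.

```go
package main

import "fmt"

// Operation byte preceding every key in the stream.
const (
	opAdd byte = 0 // store key/value
	opDel byte = 1 // remove key; value is ignored
)

// applyOp applies a single op/key/value entry to the db, rejecting
// opcodes this version of the format doesn't know about.
func applyOp(db map[string][]byte, op byte, key, val []byte) error {
	switch op {
	case opAdd:
		db[string(key)] = val
	case opDel:
		delete(db, string(key))
	default:
		return fmt.Errorf("unknown operation %d", op)
	}
	return nil
}
```

Failing hard on unknown opcodes is what lets the versioned format grow later without old nodes silently misreading new streams.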

@rjl493456442
Member Author

@holiman LGTM!

@fjl fjl removed the status:triage label Nov 2, 2021
@holiman holiman added this to the 1.10.12 milestone Nov 2, 2021
@holiman holiman merged commit 2e8b58f into ethereum:master Nov 2, 2021
sidhujag pushed a commit to syscoin/go-ethereum that referenced this pull request Nov 2, 2021
This PR offers two more database sub commands for exporting and importing data.
Two exporters are implemented: preimage and snapshot data respectively. 
The import command is generic, it can take any data export and import into leveldb. 
The data format has a 'magic' for disambiguation, and a version field for future compatibility.
yongjun925 pushed a commit to DODOEX/go-ethereum that referenced this pull request Dec 3, 2022

cesarhuret commented Oct 5, 2023

@holiman

what is "key" in this struct? Is it linked to the account's address? (a hash of the account's address?)
