
cmd/geth: implement data import and export #22931

Merged (20 commits) on Nov 2, 2021

Conversation

Member

@rjl493456442 rjl493456442 commented May 24, 2021

This PR offers two more database sub-commands for exporting and importing snapshot data.
These commands are useful in the following scenario: an archive node has just upgraded and started using snapshots. Regenerating the snapshot is very expensive, especially for an archive node with a huge database. Instead, node operators can snap-sync a fresh geth node (a few hours) and import all the snapshot data into the archive node, which can then pick it up and do the repair work (a few hours). Compared with the endless snapshot generation, this manual work is much faster.

These two commands can also be used more generally for importing/exporting arbitrary chain data.

Contributor

@holiman holiman left a comment


A few other thoughts:

  • It would be good to have a progress report every 8 seconds or so.
  • What happens if the user presses ctrl-c? Does it exit gracefully or just croak?
  • Since these are long-running operations, it might be neat if the user can press ctrl-c, exit orderly, and then restart from where it left off. Perhaps using a startKey or something.
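The interrupt handling asked for above can be sketched roughly like this. This is an illustrative stand-alone sketch, not the PR's code: `processWithInterrupt`, its signature, and the resume-by-count idea are assumptions; in real use the `interrupt` channel would be closed by a goroutine watching `signal.Notify(ch, os.Interrupt)`.

```go
package main

import (
	"fmt"
	"time"
)

// processWithInterrupt iterates over keys, checking the interrupt channel
// between items so that a Ctrl-C handler can stop the loop at a safe point.
// It returns how many items were processed; a future run could use that as
// a resume point (the hypothetical startKey idea above).
func processWithInterrupt(keys []string, interrupt <-chan struct{}) int {
	processed := 0
	logged := time.Now()
	for _, key := range keys {
		select {
		case <-interrupt:
			fmt.Printf("interrupted, processed %d items\n", processed)
			return processed
		default:
		}
		_ = key // the actual import/export work would happen here
		processed++
		if time.Since(logged) > 8*time.Second {
			// Progress report roughly every 8 seconds.
			fmt.Printf("progress: %d items\n", processed)
			logged = time.Now()
		}
	}
	return processed
}
```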

cmd/geth/dbcmd.go (outdated, resolved)
cmd/utils/cmd.go (outdated):
}
defer fh.Close()

var reader io.Reader = fh
Contributor


Wrapping it in a buffered reader might be a good idea

Contributor


^ still a good idea :)

cmd/utils/cmd.go (outdated, resolved)
Contributor

holiman commented May 25, 2021

Triage discussion: this could be 'generalized' so we have a generic importer which imports key/value pairs. Then we could have specific exporters which export preimages, snap data, or whatever it may be.
It does require a different binary format though, so we can distinguish between keys and values.

@rjl493456442
Member Author

@holiman I will not add the start option right now, since it's a little complicated and I'm not sure it's worth the complexity. For example, the snapshot data has two sub-types: account snapshot and storage snapshot. If we stopped somewhere in the last run, we would have to specify both the prefix and the start position.
Let's see how long it takes to import the snap data.

Comment on lines 610 to 612
// The prefix used to identify the snapshot data type,
// 0: account snapshot
// 1: storage snapshot
Contributor


If we do it this way, we're limiting imports to only the predefined types. If we instead encode it as [key, value], [key, value], [key, value], then the importer doesn't need to be datatype-aware, and we can use the same import function regardless of whether we're importing snapshot data or e.g. preimage data or trie data.

Contributor

holiman commented Aug 3, 2021

@rjl493456442 any thoughts on making it so that we only ever need one single generic importer -- even if we decide to have custom exporters for different data types? I think that would be pretty good, because then an older node could import data generated by a more recent one which is able to export the things it needs.

@rjl493456442
Member Author

@holiman Yes, sure. I'll check how to implement this general exporter/importer.

Contributor

holiman commented Aug 4, 2021 via email

cmd/utils/cmd.go (outdated, resolved)
cmd/utils/cmd.go (outdated, resolved)
cmd/utils/cmd.go (outdated, resolved)
cmd/utils/cmd.go (outdated):
}
defer fh.Close()

var reader io.Reader = fh
Contributor


^ still a good idea :)

Contributor

holiman commented Oct 11, 2021

My suggestions: holiman@cf1260d

@rjl493456442
Member Author

@holiman Can you move your code into this PR? It looks good to me.

Contributor

holiman commented Oct 12, 2021

Ok, done. Now it's nearly fully generic. For example, we could write an exporter that exports a particular trie root, or one that spits out all metadata to more easily analyze what has gone wrong, e.g. if someone has inconsistencies between pivot block / latest header / latest block / latest snapshot etc.
However, one thing the generic importer is incapable of is deleting elements.
If we were able to add deletions, we could also use this feature to delete the pivot marker (and a few other fields) to trigger a re-sync. I'm not 100% sure what extra usability we'd gain by adding deletions, but it's a decision we need to make now, because if we do want to support them, we need to modify the data format to not only be key/value pairs, but also have a flag to signal deletion.

Contributor

holiman commented Oct 12, 2021

Another thing to consider: adding a metadata field as the first element. It could contain version info, the exported 'type' and date of export.

Contributor

holiman commented Oct 12, 2021

Added a header now + testcases. The header contains timestamp, version and "kind"-string.

Contributor

holiman commented Oct 12, 2021

Now also with a magic fingerprint, so arbitrary RLP lists don't import just because they happen to look like the header, RLP-wise.
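Putting the two comments above together, the header carries a magic fingerprint plus version, kind and timestamp. A minimal sketch; the field names and the magic value here are illustrative assumptions, not geth's actual ones:

```go
package main

import (
	"errors"
	"fmt"
)

// headerMagic is a made-up fingerprint for illustration; its only job is
// to make it vanishingly unlikely that an arbitrary RLP list decodes as
// a valid export header.
const headerMagic = 0x3b6f2c58

// exportHeader sketches the stream header described above.
type exportHeader struct {
	Magic    uint32
	Version  uint64
	Kind     string // e.g. "preimage" or "snapshot"
	UnixTime uint64
}

// verify rejects streams with the wrong fingerprint or an unknown version.
func (h *exportHeader) verify() error {
	if h.Magic != headerMagic {
		return errors.New("not an export stream: bad magic")
	}
	if h.Version != 1 {
		return fmt.Errorf("unsupported export version %d", h.Version)
	}
	return nil
}
```

With versioning in place, the format can later grow (e.g. deletion support) without breaking old importers, which is exactly the point made below.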

Contributor

holiman commented Oct 12, 2021

Now that we have versioning, maybe we don't have to add support for deletions right now. We can add that in the future, if we decide we need it.

Contributor

@holiman holiman left a comment


LGTM (but then I wrote some of the code, so someone else maybe should thumb it up too)

cmd/geth/chaincmd.go (outdated, resolved)
Contributor

holiman commented Oct 19, 2021

We need to extend the format to handle deletions as well. If we import snapshot data, we need to delete the snapshot metadata, so it's regenerated based on the newest block and not interpreted as some old, already-present but overwritten snapshot.

Contributor

holiman commented Oct 19, 2021

Currently, the schema is an rlp stream:

header, k1, v1, k2, v2, ... kN,  vN 

Possible schemas for deletions, if element X is to be deleted

Direct approach

Instead of encoding key/value pairs, encode triplets: op/key/value.

header, op1, k1, v1, op2, k2, v2... opN, kN, vN

Pro: simple, no special cases
Con: High overhead, even if op is just a byte.

Magic key

Instead of doing key/value pairs, we'd have a special key meaning "delete", and the value would be the key to delete.

header, k1, v1, k2, v2...  <deleteMagic>, kX ...kN, vN 

Pro:

  • Zero overhead for all the additions, which most likely make up the bulk of the data

Cons:

  • We'd have a 'magic' key, e.g "DeleteElement".

Any other ideas?
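The "magic key" option can be sketched as below. This is an illustrative sketch against a plain map standing in for the database; the sentinel value is made up, and as the con above notes, a real scheme must guarantee the sentinel can never collide with an actual database key.

```go
package main

import "bytes"

// deleteMagic is a made-up reserved key: when it appears as the key, the
// accompanying value is itself a key to delete rather than data to store.
var deleteMagic = []byte("\x00__delete__")

// applyMagicKV applies one key/value pair from the stream to the db,
// interpreting the magic key as a deletion instruction.
func applyMagicKV(db map[string][]byte, key, val []byte) {
	if bytes.Equal(key, deleteMagic) {
		delete(db, string(val)) // val carries the key to remove
		return
	}
	db[string(key)] = val
}
```

Note the zero-overhead property claimed above: ordinary additions pay nothing, and only deletions spend an extra entry.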

@MariusVanDerWijden
Member

What do you think about something like this:

#Additions, #Deletions, header, ak1, av1, ak2, av2, ... akN, avN, dk1, dk2, ... dkN

Contributor

holiman commented Oct 19, 2021

What do you think about something like this:

Yes, that's also something I've considered. That scheme forces all deletions into a particular position, at the end. Which I guess is fine, as long as the full import is performed.

One might want to produce a dump which first of all deletes some marker, then starts filling data. That way, if the import is aborted, and geth started, there is no corruption. Whereas if we force deletions to go last, we lose that ability.

Also, it means that if there are for some reason a lot of deletions (though I can't think of why there would be), then the exporter would have to hang on to those and not flush them until after it's done with the additions.

Member Author

rjl493456442 commented Oct 21, 2021

@holiman @MariusVanDerWijden

What about adding a new area called metadata?

I think exporting a batch of deletion markers and then importing these markers into another db for deletion doesn't sound realistic. We can add one more area which contains customized key-value pairs (deletions can be included there).

The export file will then contain: EXPORT_HEADER || METADATA || DATA

Just like the header, the metadata can also be a struct

type entry struct {
	Key, Val []byte
}
type Metadata []entry

We can put all deletion markers there with nil values. Also, the metadata area is always handled before the data area.
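Filling in that sketch, applying the proposed metadata area could look like this. A plain map stands in for the database here, and `applyMetadata` is an illustrative name, not code from the PR:

```go
package main

// entry and Metadata mirror the struct proposed above: the metadata area
// is a list of customized key/value entries applied before the data area,
// where a nil value marks a deletion.
type entry struct {
	Key, Val []byte
}

type Metadata []entry

// applyMetadata processes the metadata area ahead of the bulk data, so
// e.g. stale snapshot markers are removed before new data arrives.
func applyMetadata(db map[string][]byte, md Metadata) {
	for _, e := range md {
		if e.Val == nil {
			delete(db, string(e.Key)) // deletion marker
		} else {
			db[string(e.Key)] = e.Val
		}
	}
}
```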

Contributor

holiman commented Oct 21, 2021

@rjl493456442 so essentially, your scheme would be

header, [ d1, d2, d3, .. dn], k1, v1, k2, v2

Where the deletion markers, from an RLP perspective, are not just appended (like the kv pairs), but form an actual RLP list?
I think that could work, since the delete-list should be pretty small. It also has nearly zero overhead and it's not ambiguous.

Contributor

@holiman holiman left a comment


I think it was a good choice to switch over to iterators

cmd/utils/cmd.go (outdated):
start = time.Now()
logged = time.Now()
)
for key, val, next := iter.Next(); next; key, val, next = iter.Next() {
Contributor


Both for this iterator and the next: the final bool value being returned isn't really safe to rely on. What you have implemented is not "are there more elements?", but rather "is there a chance that there are more elements?". Or, "go to next; was it ok?"
So it makes more sense to name it:

	for key, val, ok := iter.Next(); ok;  key, val, ok = iter.Next() {
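The `ok` semantics being suggested can be shown with a minimal slice-backed iterator. This is an illustrative sketch, not the PR's iterator type: `ok` reports only that the advance succeeded, which is why the loop condition tests it rather than peeking ahead.

```go
package main

// sliceIter is a toy iterator with the Next() (key, val, ok) shape
// discussed above.
type sliceIter struct {
	keys, vals [][]byte
	pos        int
}

// Next advances the iterator; ok means "the advance succeeded",
// not "more elements remain after this one".
func (it *sliceIter) Next() ([]byte, []byte, bool) {
	if it.pos >= len(it.keys) {
		return nil, nil, false
	}
	k, v := it.keys[it.pos], it.vals[it.pos]
	it.pos++
	return k, v, true
}

// drain shows the idiomatic consumption loop from the review comment.
func drain(it *sliceIter) int {
	count := 0
	for _, _, ok := it.Next(); ok; _, _, ok = it.Next() {
		count++
	}
	return count
}
```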

Contributor

holiman commented Nov 1, 2021

Rebased, and fixed the format to be op, key, value, op, key, value, ...
@rjl493456442 LGTY?
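Applying that final op/key/value framing on the import side can be sketched as below. A plain map stands in for the database, and the opcode values (0 = add, 1 = delete) plus the names `opAdd`/`opDel`/`applyOp` are assumptions for illustration, not necessarily the constants the PR landed with.

```go
package main

import "fmt"

// Operation byte preceding every key in the stream.
const (
	opAdd byte = 0 // store key/value
	opDel byte = 1 // remove key; value is ignored
)

// applyOp applies a single op/key/value entry to the db, rejecting
// opcodes this version of the format doesn't know about.
func applyOp(db map[string][]byte, op byte, key, val []byte) error {
	switch op {
	case opAdd:
		db[string(key)] = val
	case opDel:
		delete(db, string(key))
	default:
		return fmt.Errorf("unknown operation %d", op)
	}
	return nil
}
```

Failing hard on unknown opcodes is what lets the versioned format grow later without old nodes silently misreading new streams.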

@rjl493456442
Member Author

@holiman LGTM!

@fjl fjl removed the status:triage label Nov 2, 2021
@holiman holiman added this to the 1.10.12 milestone Nov 2, 2021
@holiman holiman merged commit 2e8b58f into ethereum:master Nov 2, 2021
sidhujag pushed a commit to syscoin/go-ethereum that referenced this pull request Nov 2, 2021
This PR offers two more database sub commands for exporting and importing data.
Two exporters are implemented: preimage and snapshot data respectively. 
The import command is generic, it can take any data export and import into leveldb. 
The data format has a 'magic' for disambiguation, and a version field for future compatibility.
yongjun925 pushed a commit to DODOEX/go-ethereum that referenced this pull request Dec 3, 2022

cesarhuret commented Oct 5, 2023

@holiman

what is "key" in this struct? Is it linked to the account's address? (a hash of the account's address?)
