
TODO: SimpleSerialize (SSZ) spec #2

Closed · hwwhww opened this issue Sep 22, 2018 · 14 comments

@hwwhww commented Sep 22, 2018

Opening an issue to follow up on our discussion on Gitter.

Specification requirements

  • Design rationale
  • Encoding
  • Decoding
  • Types
    • Integers
      • [not in the current implementation] Specify support for signed and unsigned integers.
      • [not in the current implementation] Static types?

Anything else? :)

Reference

cc @vbuterin @djrtwo @arnetheduck @mratsim @paulhauner @NatoliChris @PoSeyy

@hwwhww commented Sep 22, 2018

Reposting the Gitter discussion about integer types:

  • @djrtwo: What is the argument for not supporting any int size that satisfies size % 8 == 0?
  • @mratsim: for compiled languages, the default int types are all sizes that are powers of two: int128 and int256 work, but int40 or int24 don’t. That’s not super blocking, as we can always represent them as an array of bytes or as packed uint8 + uint16. I.e. in memory it’s just easier to use, say, a uint32 constrained to a certain range for a uint24. For serialization we would use an array of bytes.
  • @djrtwo: in memory, absolutely these things should be in 16/32/64 chunks. for the serialization, I see the value in tightly packing
  • @arnetheduck: when processing, it sometimes makes sense to use the smaller types in memory as well - compilers are actually pretty smart nowadays and in some static languages you can get away with 24-bit integers and the like. of course there are other considerations as well, alignment for the type, struct packing etc etc - all architecture-dependent
  • @hwwhww: Maybe use the integer type notation from Solidity: https://solidity.readthedocs.io/en/latest/types.html#integers
    Integers int / uint: Signed and unsigned integers of various sizes. Keywords uint8 to uint256 in steps of 8 (unsigned of 8 up to 256 bits) and int8 to int256. But don’t use this part: uint and int are aliases for uint256 and int256, respectively. <— that’s too confusing in our use cases.
  • @arnetheduck: yeah - having 8-bit increments can be useful sometimes - the corresponding feature in C/C++ is bit fields - https://en.cppreference.com/w/cpp/language/bit_field - smart optimizers can do interesting things with them because they know the upper and lower bounds more precisely - also, if there are large amounts of data, more fits in cache and less needs to be moved around, in general. the downside tends to be unaligned access - most platforms have the property that aligned reads are faster - on some, it's even an error to read unaligned data, and you might have to resort to reading a 24-bit value byte by byte
    llvm likewise uses arbitrary-precision integers (http://llvm.org/docs/LangRef.html#integer-type) while wasm limits itself to i32 and i64
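
To make the in-memory vs. on-the-wire distinction above concrete, here is a minimal Python sketch of a hypothetical uint24: held in memory as an ordinary integer (the analogue of a uint32 constrained to 24 bits in a compiled language) and serialized as exactly 3 bytes. The helper names and the big-endian byte order are illustrative assumptions, not part of any spec.

# Sketch only: a uint24 held in memory as an ordinary int constrained
# to 24 bits, serialized as exactly 3 bytes. The byte order (big-endian)
# and helper names are assumptions for illustration.

UINT24_MAX = 2**24 - 1

def serialize_uint24(value: int) -> bytes:
    # Enforce the 24-bit range at the serialization boundary.
    assert 0 <= value <= UINT24_MAX, "out of range for uint24"
    return value.to_bytes(3, 'big')

def deserialize_uint24(data: bytes) -> int:
    assert len(data) == 3, "uint24 must be exactly 3 bytes"
    return int.from_bytes(data, 'big')

assert deserialize_uint24(serialize_uint24(0xABCDEF)) == 0xABCDEF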

@paulhauner commented Sep 22, 2018

Good call @hwwhww!

Here's our README: https://github.com/sigp/lighthouse/tree/master/ssz

EDIT: Oh, I see it's already in your list :)

@p-mc2 commented Sep 24, 2018

This is great, I love having so much information all in one place. A few quick comments:

Some of the above on uint vs int and explicit types is covered here and will hopefully clear up this small(?) issue.

I'm writing a blog post on the basics of SSZ and the rationale/motivation for introducing a new serialization format. I think the biggest issue with something like this is getting people to understand the rationale for a new format and why it's necessary. Cheers.

@p-mc2 commented Sep 24, 2018

One thing I did want to question: in RLP the main motivations were simplicity and guaranteed absolute byte-perfect consistency. Obviously simplicity is carried over, and it seems like guaranteed absolute byte-perfect consistency is also a part of SimpleSerialize. Can someone confirm this is the case?

@ChihChengLiang

I have an archaeological discovery that might shed some light on this issue.

One of our core principles in Ethereum is simplicity; the protocol should be as simple as possible, and the protocol should not contain any black boxes. Every single feature of every single sub-protocol should be precisely 100% documented on the whitepaper or wiki, and implemented using that as a specification (ie. test-driven development).

Buterin, V. 2014, RLP vs. external technologies

@mratsim commented Sep 26, 2018

Here is the Nim implementation, some simple tests, and the discussion about a common test format.

One thing that I'm not too sure about in the current SSZ is the serialization of container types. For example:

# CrystallizedState
fields = {
    # List of validators
    'validators': [ValidatorRecord],
    # Last CrystallizedState recalculation
    'last_state_recalc': 'int64',
    # What active validators are part of the attester set
    # at what slot, and in what shard. Starts at slot
    # last_state_recalc - CYCLE_LENGTH
    'shard_and_committee_for_slots': [[ShardAndCommittee]],
    # The last justified slot
    'last_justified_slot': 'int64',
    # Number of consecutive justified slots ending at this one
    'justified_streak': 'int64',
    # The last finalized slot
    'last_finalized_slot': 'int64',
    # The current dynasty
    'current_dynasty': 'int64',
    # Records about the most recent crosslink for each shard
    'crosslink_records': [CrosslinkRecord],
    # Used to select the committees for each shard
    'dynasty_seed': 'hash32',
    # Start of the current dynasty
    'dynasty_start': 'int64'
}

# ValidatorRecord
fields = {
    # The validator's public key
    'pubkey': 'int256',
    # What shard the validator's balance will be sent to 
    # after withdrawal
    'withdrawal_shard': 'int16',
    # And what address
    'withdrawal_address': 'address',
    # The validator's current RANDAO beacon commitment
    'randao_commitment': 'hash32',
    # Current balance
    'balance': 'int128',
    # Dynasty where the validator is inducted
    'start_dynasty': 'int64',
    # Dynasty where the validator leaves
    'end_dynasty': 'int64'
}

# ShardAndCommittee
fields = {
    # The shard ID
    'shard_id': 'int16',
    # Validator indices
    'committee': ['int24']
}

# CrosslinkRecord
fields = {
    # What dynasty the crosslink was submitted in
    'dynasty': 'int64',
    # What slot
    'slot': 'int64',
    # The block hash
    'hash': 'hash32'
}
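
For context, one plausible recursive scheme for containers like these is to serialize fields in schema order and length-prefix every list and container payload. The following Python sketch is hedged: the 4-byte big-endian length prefix and the helper names are assumptions for illustration, not the actual SSZ rules.

# Assumed scheme: each list/container payload is prefixed with its byte
# length as a 4-byte big-endian integer; fields are emitted in schema order.

def serialize_list(items, serialize_item) -> bytes:
    body = b''.join(serialize_item(i) for i in items)
    return len(body).to_bytes(4, 'big') + body

def serialize_container(obj: dict, field_serializers: dict) -> bytes:
    # field_serializers maps field name -> function(value) -> bytes,
    # in the order the fields appear in the schema.
    body = b''.join(fn(obj[name]) for name, fn in field_serializers.items())
    return len(body).to_bytes(4, 'big') + body

def serialize_int16(v: int) -> bytes: return v.to_bytes(2, 'big')
def serialize_int24(v: int) -> bytes: return v.to_bytes(3, 'big')

# Hypothetical usage for the ShardAndCommittee record above:
encoded = serialize_container(
    {'shard_id': 3, 'committee': [1, 2, 5]},
    {'shard_id': serialize_int16,
     'committee': lambda xs: serialize_list(xs, serialize_int24)})
assert len(encoded) == 4 + 2 + 4 + 3 * 3  # outer length, shard_id, list length, items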

@mratsim commented Sep 29, 2018

Some thoughts from the past two days:

Type prefixes

Replying to @AlexeyAkhunov ethereum/beacon_chain#94 and ethereum/eth2.0-pm#8

As mentioned by @raulk in the ethereum/eth2.0-pm#8 call, Protocol Labs (libp2p) will probably develop specialized Wireshark dissectors for Eth2.0, which should help traffic analysis a lot.

However, I do think type prefixes would be useful at the top level for BeaconBlock, ActiveState and CrystallizedState, i.e. treating them as tagged unions/sum types, but they are probably unneeded for nested types like ShardAndCommittee, CrosslinkRecord, ValidatorRecord and AttestationRecord (unless you serialize and send them standalone?).

For instance, let's look at ActiveState and CrystallizedState:

# ActiveState
fields = {
    # Attestations that have not yet been processed
    'pending_attestations': [AttestationRecord],
    # Most recent 2 * CYCLE_LENGTH block hashes, older to newer
    'recent_block_hashes': ['hash32']
}
# CrystallizedState
fields = {
    # List of validators
    'validators': [ValidatorRecord],
    # Last CrystallizedState recalculation
    'last_state_recalc': 'int64',
    # What active validators are part of the attester set
    # at what slot, and in what shard. Starts at slot
    # last_state_recalc - CYCLE_LENGTH
    'shard_and_committee_for_slots': [[ShardAndCommittee]],
    # The last justified slot
    'last_justified_slot': 'int64',
    # Number of consecutive justified slots ending at this one
    'justified_streak': 'int64',
    # The last finalized slot
    'last_finalized_slot': 'int64',
    # The current dynasty
    'current_dynasty': 'int64',
    # Records about the most recent crosslink for each shard
    'crosslink_records': [CrosslinkRecord],
    # Used to select the committees for each shard
    'dynasty_seed': 'hash32',
    # Start of the current dynasty
    'dynasty_start': 'int64'
}

To be able to deserialize and distinguish between those efficiently you would need to send or agree on the types beforehand.

Also, having type prefixes for top-level types would allow concatenating multiple messages into a single one. I don't think we should use type prefixes for 'hash32' or 'address' unless they are serialized stand-alone; that would add a lot of bytes, especially for types that can appear in a list.

On schema version

From ethereum/beacon_chain#94 as well

I think we should have 2 bytes at the start to encode the SSZ major.minor version.
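
Putting the version header and the type prefixes together, a framed top-level message could start with two version bytes followed by a one-byte type tag. A minimal Python sketch; the tag values, version numbers, and helper names are all hypothetical:

# Hypothetical framing: 2 version bytes (major, minor) followed by a
# 1-byte type tag identifying the top-level sum-type variant.

SSZ_MAJOR, SSZ_MINOR = 1, 0  # assumed version numbers

TYPE_TAGS = {  # assumed tag assignments
    'BeaconBlock': 0x01,
    'ActiveState': 0x02,
    'CrystallizedState': 0x03,
}

def frame_message(type_name: str, payload: bytes) -> bytes:
    return bytes([SSZ_MAJOR, SSZ_MINOR, TYPE_TAGS[type_name]]) + payload

def parse_header(data: bytes):
    # Returns ((major, minor), type_tag, remaining payload).
    return (data[0], data[1]), data[2], data[3:]

version, tag, payload = parse_header(frame_message('ActiveState', b'...'))
assert version == (1, 0) and tag == 0x02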

On length prefix vs length suffix

ethereum/beacon_chain#94 and ethereum/eth2.0-pm#8

  • The argument for moving the length to the end is that otherwise you would have to iterate over the data structure once to get the length and then once again to feed the data. ==> All languages, even Haskell, have a vector implementation that tracks length, so you wouldn't need the first iteration to get the length.

  • Furthermore, putting the length at the end means you need a specific token to indicate the end of data, like ‘\0’ in C strings. This causes a lot of issues in C code; it also prevents reading by chunk into a buffer, since you would have to read token by token (or read by chunk and backtrack).
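
As a side note, a length prefix does not strictly require two passes either: a writer can reserve the prefix bytes, stream the body, and patch the length in afterwards. A minimal single-pass sketch (the 4-byte big-endian prefix is an assumption):

import io

# Sketch: emit a length prefix in a single pass by reserving the 4
# prefix bytes up front and patching them once the body is written.

def write_with_length_prefix(out: io.BytesIO, write_body) -> None:
    start = out.tell()
    out.write(b'\x00\x00\x00\x00')                    # placeholder prefix
    write_body(out)                                   # stream the body directly
    end = out.tell()
    out.seek(start)
    out.write((end - start - 4).to_bytes(4, 'big'))   # patch the real length
    out.seek(end)

buf = io.BytesIO()
write_with_length_prefix(buf, lambda o: o.write(b'hello'))
assert buf.getvalue() == b'\x00\x00\x00\x05hello'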

@raulk commented Sep 29, 2018

As mentioned by @raulk in the ethereum/eth2.0-pm#8 call, Protocol Labs (libp2p) will probably develop specialized Wireshark dissectors for Eth2.0, which should help traffic analysis a lot.

A community member has made progress with developing Wireshark dissectors for some basic libp2p protocols, like SecIO (security), Multistream (protocol selection), Yamux (multiplexing), etc. However, the challenge is accessing the session secrets to further decrypt the payload. We've briefly discussed that in this issue: libp2p/specs#46.

Wireshark uses an "onion" approach to dissect frames. It calls dissectors iteratively to decapsulate the different layers in the message. At one point the dissection is going to encounter an application specific protocol in the payload (eth2.0 protocol), and to decrypt it, a Wireshark dissector should be registered for it. It's best if those dissectors are developed and maintained by the Ethereum 2.0 community, as they are application-specific and they'll need to evolve in lockstep with the protocol.

EDIT: the existing dissectors are likely outdated and not exhaustively tested.

EDIT 2: Contributions super welcome. Let’s buidl this tooling together.

@arnetheduck

@mratsim

All languages, even Haskell, have a vector implementation that tracks length, so you wouldn't need the first iteration to get the length.

the argument is that you don't keep all the data in memory at all - i.e. not even in a vector. Consider a linked list - there are two common implementations, one with O(n) size and one with O(1) - the latter stores the size in a central location, but this breaks several nice properties of linked lists. Likewise, if you have the data in a database, you may want to stream it instead of putting it in a vector first - this brings down the max-memory requirement of your application. Both these things can be solved with a two-pass approach, which is sometimes appropriate.

Furthermore, putting the length at the end means you need a specific token to indicate the end of data, like ‘\0’ in C strings. This causes a lot of issues in C code; it also prevents reading by chunk into a buffer, since you would have to read token by token (or read by chunk and backtrack).

This depends - if you know the size of the full buffer (as is common when working with a datagram-style protocol, or if, as above, you're storing the total size somewhere), you can work backwards from there to figure out the positions of the fields (just like when the length is at the front and you're implicitly assuming the buffer starts at "0").
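
For illustration, a sketch of this backwards-reading approach, assuming a 4-byte big-endian length suffix on a datagram whose total size is known:

# Sketch: with a length suffix and a known total buffer size, the body's
# position is computed backwards from the end of the buffer.

def read_suffixed(buffer: bytes) -> bytes:
    assert len(buffer) >= 4, "buffer too short for a length suffix"
    body_len = int.from_bytes(buffer[-4:], 'big')
    assert body_len == len(buffer) - 4, "length suffix mismatch"
    return buffer[:-4]

datagram = b'hello' + (5).to_bytes(4, 'big')
assert read_suffixed(datagram) == b'hello'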

@arnetheduck

Based on my understanding, one of the priorities of the format was efficient in-memory access (along with a willingness to pay a price in terms of space efficiency).

I just went through the exercise of hardening our implementation against overruns and misaligned pointers (status-im/nimbus-eth2#7 - hopefully I caught all of them 😄) - the following things are noteworthy when writing cross-platform code:

  • every field (of size >1) must be alignment-checked, including list items - this causes some performance degradation and prevents us from using mmap and the like to directly map the data when loading from a trusted source
  • while it's possible to carefully nurse specific structs to minimize misalignment issues by manually ordering fields in a particular way, it remains impossible to do so generally:
    • anything following a variable-length field can be misaligned
    • anything that has a mix of type lengths will need to manually add padding/dummy fields

A possible solution to this issue is to introduce a rule to the serializer that ensures that padding is mandatory and deterministic:

  • 0-pad any data up to the size-aligned boundary of the next field (4 or 8 bytes for variable-length fields - the size must be aligned, but also the items themselves)
  • introduce field reordering, or possibly a check, such that fields in the serialized format remain ordered by size, descending - this minimizes the amount of padding necessary, but also makes it harder to do upgrades / add fields
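
A minimal sketch of the proposed deterministic 0-padding rule, assuming the serializer pads to the natural alignment of the next field's element size, capped at 8 bytes:

# Sketch: 0-pad the output so the next field, whose elements are
# element_size bytes wide, starts on a size-aligned boundary (capped at 8).

def pad_to_alignment(buf: bytearray, element_size: int) -> None:
    align = min(element_size, 8)
    remainder = len(buf) % align
    if remainder:
        buf.extend(b'\x00' * (align - remainder))

buf = bytearray(b'\x01')   # one byte already written
pad_to_alignment(buf, 4)   # next field holds 4-byte elements
assert len(buf) % 4 == 0   # the field now starts aligned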

For a good example of a well-aligned format, check out https://google.github.io/flatbuffers/flatbuffers_internals.html

It is worth noting that modern x86 processors have mostly solved unaligned access and the performance penalty is small - nonetheless, for several cases of optimized access, even x86 requires aligned data as the compiler's optimization level gets cranked up - SSE4.2 for example, as seen here: http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html

@recmo commented Oct 18, 2018

I was reading through the current spec. A couple of thoughts:

  • The terminology hash96/97 for BLS aggregate public keys is confusing. It is not a hash function in the normal cryptographic sense (though admittedly it shares some properties). Why not bytes96 instead?
  • In fact, why not have bytesN for fixed size byte strings much like the fixed size integers?
  • The List/Vector explanation assumes elements have a fixed length (len(list) * sizeof(element)), but the rest of the spec and the reference implementation support variable-sized elements. It is thus not clear whether variable-sized list items are supported. This potentially makes an O(n) vs O(1) difference for random-indexing performance, so clarity would be appreciated.
  • I'm not entirely clear on the intended scope of the spec: network format, on-disk storage, in-memory representation (i.e. mmap-able), ABI format, VM input format, etc. All would imply different design trade-offs.
  • Unlike RLP, this format is not self-describing. A schema needs to be provided to disambiguate the serialized data. Where in RLP the valid bytestrings are in one-to-one correspondence with semantic inputs, here they are not: a valid SSZ bytestring can have multiple interpretations depending on which schema is used. This means that a hash of SSZ data is not unique. This can be solved perfectly by hashing in a serialization of the schema, as in EIP-712. Alternatively, given the scope, it might be sufficient to manually disambiguate where necessary.
  • As far as I can see there is only one unique encoding for every input, so no issues there. Just don't forget the assert in the bool decoding :)
  • Fun facts: For every bytestring there exists a schema that will produce a valid decoding (proof: ['uint8'] * len(s)). For a given schema there always exists an invalid bytestring (proof: take a valid bytestring and append a byte).
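
The first fun fact is easy to demonstrate with a toy decoder (assuming each 'uint8' consumes exactly one byte):

# Toy demonstration: any bytestring s decodes successfully under the
# schema ['uint8'] * len(s), one byte per field.

def decode(schema, data: bytes):
    assert len(schema) == len(data) and all(t == 'uint8' for t in schema)
    return list(data)

s = b'\xde\xad\xbe\xef'
assert decode(['uint8'] * len(s), s) == [0xDE, 0xAD, 0xBE, 0xEF]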

@arnetheduck Here's another nice example of a well-designed serialization format meant for fast mmapping: https://capnproto.org/index.html. I do think that a format that takes alignment into consideration will necessarily not be 'simple'.

@arnetheduck

@recmo
it does deal with alignment: https://capnproto.org/encoding.html
also, flatbuffers has lots of valuable discussion on the topic: https://google.github.io/flatbuffers/flatbuffers_internals.html

@recmo commented Oct 29, 2018

@arnetheduck I know it handles alignment, that's why I brought it up. :)

The point I failed to make was that handling alignment and creating a "Simple" format are somewhat conflicting goals.

@djrtwo commented Nov 5, 2018

Let's take the discussion of alignment and its relevance to another issue if necessary.
