Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Initial SimpleSerialize (SSZ) spec #18

Merged
merged 9 commits into from
Oct 3, 2018
Merged

Conversation

NatoliChris
Copy link
Contributor

Initial specification outline for SSZ for #2.

  • Basic information about serializing types.
  • Basic information about deserializing types.
  • Checks to be performed for sanity.
  • Links to implementations.

TODO:

  • Ensure all implementations are listed.

There has been some discussion about ssz so if you notice any discrepancies would be great to have any input for changes 😄.

Also, if I've forgotten an implementation feel free to update the implementation list. There are some eth2.0 implementations that do not use ssz at this point in time (that I could see), but if I couldn't see it please make sure to add them.

Any help is appreciated, especially in terms of readability and making it easier to understand. We appreciate your time!

@NatoliChris
Copy link
Contributor Author

I think we should also add the test format once it has been finalised.

Thoughts? 😄

@djrtwo
Copy link
Contributor

djrtwo commented Oct 2, 2018

Thank you! Will review in the morning.

I've been debating about where to put the test specifications. We need to provide the generic format that our tests formats will conform to, and then we need a spec for each test format.

I was thinking about making a dir specs/tests and have specs/tests/format.md and then specs/tests/ssz.md, etc, etc.

The argument for this would be to not clutter up the specs with test format info.
The argument against would be that maybe it's nice to have everything in the same place :)

I'll add the generic test format doc tomorrow and we can discuss from there.


## About

`SimpleSerialize` was first proposed by Vitalik Buterin as the serializaiton
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: serializaiton -> serialization

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Completely missed the misspelling 😄 Thanks!

specs/simpleserialize.md Outdated Show resolved Hide resolved
@NatoliChris
Copy link
Contributor Author

Thanks @hwwhww 😄

@hwwhww hwwhww added the general:RFC Request for Comments label Oct 2, 2018
Copy link
Contributor

@mratsim mratsim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

One thing that is missing are "container types" serialization, for example ActiveState

# ActiveState
fields = {
    # Attestations that have not yet been processed
    'pending_attestations': [AttestationRecord],
    # Most recent 2 * CYCLE_LENGTH block hashes, older to newer
    'recent_block_hashes': ['hash32']
}

+ [Serialize/Encode](#serializeencode)
- [uint: 8/16/24/32/64/256](#uint-816243264256)
- [Address](#address)
- [Hash32](#hash32)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With BLS we will have Hash96 for the public keys maybe Hash97 if it comes with a recovery bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mratsim interesting point! I'll leave this open for discussion if we should tentatively add it in or leave it until we have more clarity on the point?

I'll draft it up and maybe add it into a PR?

Copy link
Contributor

@mratsim mratsim Oct 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a simple Hash32 and other hash sizes is probably enough at the moment


#### Hash32

The hash32 should already be a 32 byte length serialized data format. The safety
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hash32 works for Keccak/Blake2b[0 ..< 32] not for BLS

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, great point! Did we address in point above?

I think we should have the Hash97 and Hash96 separately or address similar to the way uint is serialized/deserialized?
Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think like uint

1. For each item in list:
1. serialize.
2. append to string.
2. Get size of serialized string. Encode into a 4 byte integer.
Copy link
Contributor

@mratsim mratsim Oct 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is Python specific. This should read as

#### List/Vectors

1. Get the number of raw bytes to serialize: it is `len(list) * sizeof(element)`.
    Encode that as a 4-byte bigEndian uint32
2. Append your elements in a packed manner.

* Note on efficiency: consider using a container that does not need to iterate over all elements to get its length. For example Python lists, C++ vectors or Rust Vec.

Then your Python implementation notes.

The perf notes are to address the problem highlighted by @AlexeyAkhunov, if the container does keep track of its length (for example LinkedLists) prefixing the length is inefficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This is great.

I think I'll add yours specifically, since it's more generic and should cater to larger of an audience!

Thanks for pointing me to the notes, I'll make sure to run through them and give a reference! 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can refer to this post: #2 (comment)

On length prefix vs length suffix

ethereum/beacon_chain#94 and ethereum/eth2.0-pm#8

  • The argument for moving at the end is because otherwise you would have to iterate on the data structure once to get the length then once again to get feed the data. ==> All languages even Haskell have a vector implementation that tracks length so you wouldn't need the first iteration to get length.

  • Furthermore putting length at the end means you need specific token to indicate end of data like ‘\0’ in cstring. This is causing a lot of issue in C code also this prevents reading by chunk into a buffer, you would have to read token by token (or read by chunk and backtrack).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mratsim! 👍

I'll hold this PR open until we get more clarification/acceptance?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's up to @djrtwo and @hwwhww, either we have all the discussions happening in this single PR or we kick the ball running by merging it and let the issues come 😄.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely don't want to block it, just thought it will be nice to have more eyes on it, like @mratsim. 👍👀 It looks good for me to merge it, as a clear document to be audited and to be disucussed.

@NatoliChris
Copy link
Contributor Author

Thanks @mratsim for the excellent comments!
I'll provide the updates shortly! 😄

@NatoliChris NatoliChris changed the title Initial SimpleSerialize (SSZ) spec [WIP] Initial SimpleSerialize (SSZ) spec Oct 2, 2018
@djrtwo
Copy link
Contributor

djrtwo commented Oct 2, 2018

I'm going to give it a look this morning and merge. Just need to make sure some sort of "work in progress" signal is at the top of the doc

@NatoliChris
Copy link
Contributor Author

Added Updates

  • Changed title to include explicit WIP status.
  • Updated Hash types to include hash96 and hash97 for BLS.
    • Note: It may be a point of discussion how this should be implemented. At this moment we went for simplicity and assumed it will be similar to the uint serialization technique.
  • Updated List to List/Vector and added @mratsim's suggestions.

There are currently 2 items that are to be discussed:

Container Serialization (@mratsim's comment):

Nice work!
One thing that is missing are "container types" serialization, for example ActiveState

# ActiveState
fields = {
    # Attestations that have not yet been processed
   'pending_attestations': [AttestationRecord],
    # Most recent 2 * CYCLE_LENGTH block hashes, older to newer
    'recent_block_hashes': ['hash32']
}

I feel like this should be an open discussion around the appropriate methods of serializing a container. @AlexeyAkhunov may also have input.

List/Vector Serialization

@mratsim's comment:

The perf notes are to address the problem highlighted by @AlexeyAkhunov, if the container does keep > track of its length (for example LinkedLists) prefixing the length is inefficient.

You can refer to this post: #2 (comment)

On length prefix vs length suffix

ethereum/beacon_chain#94 and ethereum/eth2.0-pm#8

  • The argument for moving at the end is because otherwise you would have to iterate on the data structure once to get the length then once again to get feed the data. ==> All languages even Haskell have a vector implementation that tracks length so you wouldn't need the first iteration to get length.
  • Furthermore putting length at the end means you need specific token to indicate end of data like ‘\0’ in cstring. This is causing a lot of issue in C code also this prevents reading by chunk into a buffer, you would have to read token by token (or read by chunk and backtrack).

It might be appropriate like @hwwhww suggested to make this a separate discussion.

Thanks for all the input 😄

Copy link
Contributor

@djrtwo djrtwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! 👍
A handful of small edits for consistency/clarity

| `byte_order` | Specifies [endianness:](https://en.wikipedia.org/wiki/Endianness) Big Endian or Little Endian. |
| `len` | Length/Number of Bytes. |
| `to_bytes` | Convert to bytes. Should take parameters ``size`` and ``byte_order``. |
| `from_bytes` | Convert form bytes to object. Should take ``bytes`` and ``byte_order``. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "form" -> "from"


| Constant | Value | Definition |
|:---------------|:-----:|:------------------------------------------------------------------------|
| `LENGTH_BYTES` | 4 | Number of bytes used for the length added before the serialized object. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe say 'before a variable-length serialized object'. To be clear at this point that not all items need a length prefix

specs/simpleserialize.md Show resolved Hide resolved
specs/simpleserialize.md Show resolved Hide resolved

1. Get the number of raw bytes to serialize: it is `len(list) * sizeof(element)`.
* Encode that as a `4-byte` **big endian** `uint32`.
2. Append your elements in a packed manner.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change "your" to "the"

specs/simpleserialize.md Show resolved Hide resolved

The decoding requires knowledge of the type of the item to be decoded. When
performing decoding on an entire serialized string, it also requires knowledge
of what order the objects have been serialized in.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"of the order in which the objects have been serialized"


```python
bytes_length = int.from_bytes(rawbytes[current_index:current_index + LENGTH_BYTES], 'big')
new_index = current_index + LENGTH_BYTES + bytes_lenth
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: bytes_lenth -> bytes_length

```python
bytes_length = int.from_bytes(rawbytes[current_index:current_index + LENGTH_BYTES], 'big')
new_index = current_index + LENGTH_BYTES + bytes_lenth
return rawbytes[current_index + LENGTH_BYTES:current_index+ LENGTH_BYTES +bytes_length], new_index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider using two helper vars defined before the return statement for the indices -- bytes_start & bytes_end

entire length of the list.


| Check type | code |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use "Check to perform" as in previous tables, and add the assertion in the sample code

@NatoliChris
Copy link
Contributor Author

Thanks @djrtwo! All applied 😄. Thanks for the excellent comments!

@djrtwo
Copy link
Contributor

djrtwo commented Oct 3, 2018

@NatoliChris Let's add a "TODO section/note" in the document for "container types" to make it clear it's missing.

After that, I'll merge this

@NatoliChris
Copy link
Contributor Author

Done 😄
Thanks again @djrtwo !

@djrtwo
Copy link
Contributor

djrtwo commented Oct 3, 2018

Excellent. Thanks!

@djrtwo djrtwo merged commit 66e316d into ethereum:master Oct 3, 2018
hwwhww pushed a commit that referenced this pull request Jul 23, 2020
* Simplify Makefile
* Add CircleCI configuration
hwwhww pushed a commit that referenced this pull request Jul 23, 2020
* Simplify Makefile
* Add CircleCI configuration
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
general:RFC Request for Comments
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants