I'm running all of the specification in indentation, and keeping my comments here. This will give me the ability to comment on parts of the document a bit at a time.
::: Binary Electronic Data specification
This binary data serialization format was created for use in
RESTful services or with Websockets. Compared to most other
formats, it comes with three special types: hyperlinks,
symbols and RFC 3339 dates.
The idea of including hyperlinks is huge. We need that for HATEOAS, so why not simply include it right away? That move is brilliant. Keep it! Also, RFC 3339 beats full ISO 8601 on simplicity. I like it a lot.
…
We are looking for any kind of feedback at this point.
Two precedents that must be studied:
- data.fressian—the storage format of Datomic
- UBF(A/B)—Joe Armstrong's Universal binary format
The reason to study these two is that they get a lot of things right that other formats get wrong. If you leave out something these formats have, you must at least provide a rationale for doing so.
The data.fressian format is described here: https://github.com/Datomic/fressian/wiki
The UBF format is described here: http://ubf.github.io/ubf/ubf-user-guide.en.html
For data.fressian:
- The data.fressian format generalizes BED symbols: any part of the data can be cached and then referenced by a single byte. This is useful since a repeated 8-byte timestamp value then costs only a 1-byte entry.
- A (mathematical) Set type is included
- UUIDs are included in the format as a core type
- Regular expressions are a data type
- Large pieces of data (strings and binaries) are stored in 64-kilobyte chunks. This helps decoders, since they can stream-process the data.
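To make the caching point concrete, here is a minimal sketch in Python. It is not fressian's actual wire format: the token names (`put`, `ref`, `raw`), the cache size, and the function names are all my own assumptions, and it works on tokens rather than bytes. The first time a value is seen it is sent in full and assigned a slot; every later occurrence is sent as a small slot reference.

```python
def cache_encode(values, cache_size=32):
    """Replace repeated values with cache references (hypothetical scheme)."""
    cache = {}  # value -> slot number; values must be hashable
    tokens = []
    for value in values:
        if value in cache:
            tokens.append(("ref", cache[value]))   # cheap reference to a slot
        elif len(cache) < cache_size:
            cache[value] = len(cache)               # assign the next free slot
            tokens.append(("put", value))           # send in full and cache it
        else:
            tokens.append(("raw", value))           # cache full: send uncached
    return tokens


def cache_decode(tokens):
    """Rebuild the original values; mirrors the slot assignment of the encoder."""
    slots = []
    values = []
    for kind, payload in tokens:
        if kind == "put":
            slots.append(payload)
            values.append(payload)
        elif kind == "ref":
            values.append(slots[payload])
        else:  # "raw"
            values.append(payload)
    return values
```

Note that this is the same kind of state the spec already needs for symbols; the only change is that any value may be cached, not just symbols.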
And for ubf(a):
- UBF(A) has semantic tags, that is, a way to annotate data with a tag. This means you can avoid user-defined data types whenever the value can be built from the types already in the BED spec: you just tag it with its meaning.
- UBF(A) runs as a virtual machine with a recognition stack. This allows you to cache data by popping the top of the stack into a register and pushing the register's contents back later.
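A rough sketch of how I picture semantic tags on the decoder side, in Python. This is not UBF(A) syntax; the `Tagged` class, the `interpret` function, and the `"uuid"` tag name are hypothetical and only illustrate the idea. The payload is built from core types only, and a decoder that does not know the tag can still read the underlying value.

```python
import uuid
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Tagged:
    tag: str    # semantic meaning of the value (hypothetical tag name)
    value: Any  # payload, built only from core types


def interpret(item, handlers):
    """Apply the handler for a known tag; fall back to the plain value otherwise."""
    if isinstance(item, Tagged):
        handler = handlers.get(item.tag)
        return handler(item.value) if handler else item.value
    return item


# Usage: a 16-byte binary tagged as a UUID, without a dedicated wire type.
handlers = {"uuid": lambda raw: uuid.UUID(bytes=raw)}
decoded = interpret(Tagged("uuid", bytes(16)), handlers)
```

The point is that an unknown tag degrades to plain data instead of a decoding error.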
Also, one thing I have considered is a two-stage format. First you encode the structure of the data, followed by the data itself. This means you can re-use the map keys across multiple maps which have the same shape. It avoids a lot of the needless copying of keys that you get in JSON. In addition, a parser can exit early if it doesn't recognize the structure as valid, so correct structure is enforced up front.
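Here is a small Python sketch of the two-stage idea, under my own assumptions: the function names and the `keys`/`rows` layout are made up for illustration, and keys are assumed to be sortable (for example, all strings). The structure, here just the key list, is written once; each record then contributes only its values, and a record that does not match the declared structure is rejected immediately.

```python
def encode_two_stage(records):
    """Stage 1: the shared key list. Stage 2: one row of values per record."""
    if not records:
        return {"keys": [], "rows": []}
    keys = sorted(records[0])
    rows = []
    for record in records:
        if sorted(record) != keys:
            # Early exit: the record does not have the declared structure.
            raise ValueError("record does not match the declared structure")
        rows.append([record[key] for key in keys])
    return {"keys": keys, "rows": rows}


def decode_two_stage(encoded):
    """Rebuild the list of maps from the key list and the value rows."""
    keys = encoded["keys"]
    records = []
    for row in encoded["rows"]:
        if len(row) != len(keys):
            raise ValueError("row length does not match the declared structure")
        records.append(dict(zip(keys, row)))
    return records
```

For two records with keys `id` and `name`, the keys are sent once instead of twice; the saving grows with the number of records.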
…
In addition to these types, it is possible to define up to
255 additional application-specific types using extensions.
I would probably use semantic tags for this to avoid half a billion different variants of the format.
:: Media type
…
: `application/x-bed`
…
: `application/x-bed-stream`
I really like this distinction.
…
:: Types
…
: Symbol
Symbols are a very simple and yet very important optimization
and serve to greatly reduce the quantity of data being sent.
I think I would look at generalized caching instead of this.
: Binary
: String
Strings must be valid UTF-8 text, as defined in RFC 3629.
Yes! Split binaries and strings in the spec. Good move.
: RFC 3339 date
…
: Integer
…
: IEEE 754 binary64
Double precision floating-point numbers can be encoded using
the IEEE 754 binary64 format. This format uses 64 bits to
encode the number.
…
: IEEE 754 decimal64
Decimal numbers can be encoded using the IEEE 754 decimal64
format. This format uses 64 bits to encode the number.
…
: Map
Maps are an unordered list of a fixed number of key/value pairs.
Keys can be of any type, although they probably should be
symbols when possible. Values can of course be of any type.
The order of key/value pairs has no meaning and does not
need to be preserved.
There must not be any key duplicate in the map. If a duplicate
is found then decoding must stop immediately and an error must
be thrown. Decoders can of course provide an option to disable
this behavior.
I really like that you rule out the corner cases right away. This usually helps. Consider encodings where keys can be reused efficiently among several maps in an array.
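The duplicate-key rule is cheap to implement in a decoder. A minimal Python sketch, assuming the decoder has already produced a sequence of key/value pairs; the names `build_map`, `DuplicateKeyError`, and `allow_duplicates` are mine, and "last value wins" when the check is disabled is my assumption, since the spec does not say what the relaxed behavior should be.

```python
class DuplicateKeyError(ValueError):
    """Raised when a decoded map contains the same key twice."""


def build_map(pairs, allow_duplicates=False):
    """Build a dict from decoded key/value pairs, rejecting duplicate keys."""
    result = {}
    for key, value in pairs:
        if key in result and not allow_duplicates:
            # Stop decoding immediately, as the spec requires.
            raise DuplicateKeyError(f"duplicate map key: {key!r}")
        result[key] = value  # with allow_duplicates, the last value wins
    return result
```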
: Array
Arrays are an ordered list of a fixed number of values.
Values can be of any type.
Thought: call these vectors and force all values to be the same type.
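To show what a same-typed vector buys you, here is a Python sketch using the standard `struct` module. The element-type codes and the header layout (one type byte plus a 32-bit count) are hypothetical and not part of BED; the byte order is big endian, as in the spec. The element type is written once in the header, and the elements follow as a packed block without per-value type bytes.

```python
import struct

# Hypothetical element-type codes; BED defines no such table.
ELEMENT_FORMATS = {"int32": (0x01, "i"), "float64": (0x02, "d")}


def encode_vector(kind, values):
    """Write one type code and a count, then the packed big-endian elements."""
    code, fmt = ELEMENT_FORMATS[kind]
    header = struct.pack(">BI", code, len(values))
    # struct.pack raises struct.error if a value does not fit the element type.
    return header + struct.pack(f">{len(values)}{fmt}", *values)


def decode_vector(data):
    """Read the header, then unpack all elements in one call."""
    code, count = struct.unpack_from(">BI", data, 0)
    fmt = next(f for c, f in ELEMENT_FORMATS.values() if c == code)
    return list(struct.unpack_from(f">{count}{fmt}", data, 5))
```

A decoder can also compute the total size of the vector from the header alone, which helps with the memory concerns raised under security considerations.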
…
: List
…
:: Extensions
…
:: Binary representation
The BED format uses network byte order (big endian).
…
:: Encoding/decoding state
To be able to encode or decode symbols appropriately, state
must be maintained.
…
:: Security considerations
The format provides no security mechanism. A number of problems
may occur during decoding or after decoding a message.
During decoding, extra care must be taken to avoid running out
of memory. Some of the types allow for very large values. However
it is not recommended to limit the size of these directly. Two
solutions are available.
…