Comments on the BED format #1

@jlouis

I'm quoting the whole specification indented and keeping my comments in between. This lets me comment on parts of the document a bit at a time.

::: Binary Electronic Data specification

This binary data serialization format was created for use in
RESTful services or with Websockets. Compared to most other
formats, it comes with three special types: hyperlinks,
symbols and RFC 3339 dates.

The idea of including hyperlinks is huge. We need that for HATEOAS, so why not simply include it right away? That move is brilliant. Keep it! Also, RFC 3339 beats ISO 8601 because it is simpler. I like it a lot.

…

We are looking for any kind of feedback at this point.

Two precedents that must be studied:

  • data.fressian—the storage format of Datomic
  • UBF(A/B)—Joe Armstrong's Universal binary format

The reason to study these two is that they get a lot of things right that other formats get wrong. If you leave out something these formats have, you must at least provide a rationale for doing so.

The data.fressian format is described here: https://github.com/Datomic/fressian/wiki

The UBF format is described here: http://ubf.github.io/ubf/ubf-user-guide.en.html

  • The data.fressian format generalizes BED symbols: any part of the data can be cached and then referenced by a single byte. This is useful since you can send a repeated 8-byte timestamp value as a 1-byte entry.
  • A (mathematical) Set type is included
  • UUIDs are included in the format as a core type
  • Regular expressions are a data type
  • Large pieces of data (strings and binaries) are stored in 64-kilobyte chunks. This helps decoders, since they can stream-process the data.
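To make the caching point concrete, here is a minimal sketch of a generalized value cache. It is not Fressian's actual wire format: the class names, the `("lit", ...)` / `("ref", ...)` markers and the 256-slot limit are assumptions chosen for illustration. The first occurrence of a payload is sent in full, and later occurrences are sent as a 1-byte slot number.

```python
class ValueCache:
    """Hypothetical encoder-side cache (illustration only, not Fressian's format)."""

    def __init__(self, capacity=256):
        self.capacity = capacity  # 256 slots so a slot number fits in one byte
        self.index = {}           # payload bytes -> slot number

    def encode(self, payload: bytes):
        """Return ("ref", slot_byte) for a cached payload, else ("lit", payload)."""
        slot = self.index.get(payload)
        if slot is not None:
            return ("ref", bytes([slot]))
        if len(self.index) < self.capacity:
            self.index[payload] = len(self.index)
        return ("lit", payload)


class ValueCacheDecoder:
    """Mirror of ValueCache: it fills its slots in the same order as the encoder."""

    def __init__(self, capacity=256):
        self.capacity = capacity
        self.slots = []

    def decode(self, kind: str, data: bytes) -> bytes:
        if kind == "ref":
            return self.slots[data[0]]
        if len(self.slots) < self.capacity:
            self.slots.append(data)
        return data
```

Both sides must apply the same fill rule, otherwise slot numbers drift apart. This is the same kind of state the spec already requires for symbols.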

And for UBF(A):

  • UBF(A) has semantic tags, that is, a way to annotate data with a tag. This means you can avoid user-defined data types whenever the value can be built from the types already in the BED spec, because you can just tag it with its meaning.
  • UBF(A) runs as a virtual machine with a recognition stack. This allows you to cache data by popping the top of the stack into a register and pushing it back again later.
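The two UBF(A) ideas above can be sketched with a toy stack machine. This is a simplified model and not UBF(A) syntax: the operation names `push`, `store`, `load` and `tag` are made up for this example, and a tagged value is represented as a plain `(tag, value)` tuple.

```python
def run(program):
    """Run a list of (op, arg) pairs and return the final recognition stack."""
    stack, registers = [], {}
    for op, arg in program:
        if op == "push":      # a decoded value lands on the stack
            stack.append(arg)
        elif op == "store":   # pop the top of the stack into a named register
            registers[arg] = stack.pop()
        elif op == "load":    # push a cached value back, costing one token
            stack.append(registers[arg])
        elif op == "tag":     # annotate the top of the stack with a semantic tag
            stack.append((arg, stack.pop()))
        else:
            raise ValueError(f"unknown op: {op!r}")
    return stack
```

For example, `run([("push", "a-long-repeated-string"), ("store", "A"), ("load", "A"), ("load", "A")])` leaves two copies of the string on the stack although it was transmitted once, and `run([("push", 1234), ("tag", "price-in-cents")])` yields a tagged integer rather than a new user-defined type.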

Also, one thing I have considered is a two-stage format: first you encode the structure of the data, then the data itself. This means map keys are written once and re-used across all map entries that have the same shape, which avoids a lot of the needless key copying you see in JSON. In addition, a parser can enforce correct structure up front and exit early if it doesn't recognize the structure as valid.
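A minimal sketch of the two-stage idea, assuming all records share the same key set. The function names and the in-memory `(structure, rows)` representation are hypothetical, and nothing here is a byte-level encoding.

```python
def encode_records(records):
    """Split same-shaped dicts into (structure, rows); keys are emitted once."""
    if not records:
        return (), []
    structure = tuple(records[0])  # stage 1: the keys
    rows = []
    for record in records:
        if record.keys() != set(structure):
            raise ValueError("record does not match the declared structure")
        rows.append([record[key] for key in structure])  # stage 2: values only
    return structure, rows


def decode_records(structure, rows):
    """Rebuild the dicts, rejecting a malformed row before building anything."""
    for row in rows:
        if len(row) != len(structure):
            raise ValueError("row length does not match the declared structure")
    return [dict(zip(structure, row)) for row in rows]
```

With two records `{"id": 1, "name": "a"}` and `{"id": 2, "name": "b"}`, the keys `id` and `name` appear once in the structure and the rows carry only `[1, "a"]` and `[2, "b"]`.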

…

In addition to these types, it is possible to define up to
255 additional application-specific types using extensions.

I would probably use semantic tags for this to avoid half a billion different variants of the format.

:: Media type

…

: `application/x-bed`

…

: `application/x-bed-stream`

I really like this distinction.

…

:: Types

…

: Symbol

Symbols are a very simple and yet very important optimization
and serve to greatly reduce the quantity of data being sent.

I think I would look at generalized caching instead of this.

: Binary
: String

Strings must be valid UTF-8 text, as defined in RFC 3629.

Yes! Split binaries and strings in the spec. Good move.
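The split matters in practice because a decoder can validate strings and leave binaries untouched. A small sketch, assuming the raw payload bytes have already been extracted from the message (the function name is hypothetical):

```python
def decode_string(raw: bytes) -> str:
    """Decode a string payload, rejecting anything that is not valid UTF-8."""
    try:
        # Python's strict UTF-8 decoder rejects overlong forms and surrogates.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("string payload is not valid UTF-8") from exc
```

A binary payload would skip this step entirely and be handed over as `bytes`.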

: RFC 3339 date

…

: Integer

…

: IEEE 754 binary64

Double precision floating-point numbers can be encoded using
the IEEE 754 binary64 format. This format uses 64 bits to
encode the number.

…

: IEEE 754 decimal64

Decimal numbers can be encoded using the IEEE 754 decimal64
format. This format uses 64 bits to encode the number.

…

: Map

Maps are an unordered list of a fixed number of key/value pairs.

Keys can be of any type, although they probably should be
symbols when possible. Values can of course be of any type.

The order of key/value pairs has no meaning and does not
need to be preserved.

There must not be any key duplicate in the map. If a duplicate
is found then decoding must stop immediately and an error must
be thrown. Decoders can of course provide an option to disable
this behavior.

I really like that you rule out the corner-cases right away. This usually helps. Consider encodings where keys can be reused efficiently among several maps in an array.
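The duplicate-key rule is cheap to enforce while the map is being built. A sketch, assuming the key/value pairs have already been decoded and that keys are hashable. The function name and the `allow_duplicates` option are illustrative, and the "last value wins" behavior when the check is disabled is my assumption, not something the spec states.

```python
def build_map(pairs, allow_duplicates=False):
    """Build a dict from decoded (key, value) pairs, failing on the first duplicate."""
    result = {}
    for key, value in pairs:
        if key in result and not allow_duplicates:
            # Stop immediately, as the spec requires.
            raise ValueError(f"duplicate map key: {key!r}")
        result[key] = value  # with the check disabled, the last value wins
    return result
```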

: Array

Arrays are an ordered list of a fixed number of values.

Values can be of any type.

Thought: call these vectors and force all values to be the same type.
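What that constraint could look like on the decoder side, as a rough sketch. The function name is hypothetical and "same type" is approximated here by the exact Python type of each element.

```python
def make_vector(values):
    """Return the values as a list, requiring every element to have the same type."""
    values = list(values)
    kinds = {type(value) for value in values}
    if len(kinds) > 1:
        names = sorted(kind.__name__ for kind in kinds)
        raise TypeError(f"vector elements must share one type, got: {names}")
    return values
```

A homogeneous vector would also let an encoder write the element type once for the whole vector instead of once per element.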

…

: List

…

:: Extensions

…

:: Binary representation

The BED format uses network byte order (big endian).
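For illustration, this is what network byte order means for a binary64 value in Python. The sketch covers only the 8 payload bytes. BED's type tags and framing are not shown here, and the helper names are hypothetical.

```python
import struct


def pack_binary64(value: float) -> bytes:
    """Encode a float as IEEE 754 binary64 in big-endian (network) byte order."""
    return struct.pack(">d", value)


def unpack_binary64(raw: bytes) -> float:
    """Decode 8 big-endian bytes back into a float."""
    (value,) = struct.unpack(">d", raw)
    return value
```

The `>` prefix selects big endian. Relying on the platform default would produce different bytes on little-endian machines.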

…

:: Encoding/decoding state

To be able to encode or decode symbols appropriately, state
must be maintained.

…

:: Security considerations

The format provides no security mechanism. A number of problems
may occur during decoding or after decoding a message.

During decoding, extra care must be taken to avoid running out
of memory. Some of the types allow for very large values. However
it is not recommended to limit the size of these directly. Two
solutions are available.

…
