prototype of binary json codec optimized for in-place access
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
LICENSE
README.md
index.js
package.json
read.js
test.js

README.md

bipf

Binary In-Place Format. A binary format designed for in-place (without parsing) reads, with schemaless json-like semantics.

motivation

in-place reads

In a database there are many cases where you need to read a bunch of records, filter out most of it (if one or two fields do not match) and then immediately write whats left to a network socket. With json, this means parsing possibly hundreds of thousands of json objects (which is suprisingly slow), and then reserializing whats left. An inplace format doesn't actually require parsing as a whole at all. You only need to parse the fields you actually read, and using length delimited fields instead of escapes, means you do not have to look at every byte to parse a field.

length delimited collections

Unfortunately, most binary json-like formats (such as msgpack and cbor) use element counts on collections (objects and arrays, in json-land) this means to find the end of a collection, you have to step past each item in it (including the fields in any object contained inside of it). However, if the collections are length delimited, meaning marked by the encoded byte length of the object, not the number of items inside it, then it's easy to jump right to the end of the object in one go. For this reason, databases (for example, mongodb, and couchdb) use length delimited collections.

format

Every type of field is encoded with a type tag and a length packed into a varint. This means that short types have a one byte tag, and one byte value. The type is stored in the lowest 3 bits, and the length the higher bits. Since a varint stores values up to 128 bits in a single byte, values less than 16 bytes long have a one byte tag, and values up to 8k long have a two byte tag, values up to 1048576 bytes have a 3 byte tag, and so on.

<tag: varint(encoding_length(value) << 3 | type)><value>

the type indicates the encoding of the value. valid types are:

STRING  : 0  // utf8 encoded string
BUFFER  : 1  // raw binary buffer
INT     : 2  // little endian 32 bit integer
DOUBLE  : 3  // little endian 64 bit float
ARRAY   : 4  // array of any other value
OBJECT  : 5  // list of string: value pairs
BOOLNULL: 6  // a boolean, or null.
EXTENDED: 7  // custom type. specific type should be indicated by varint at start of buffer.

All values must have a correct length field. This makes it possible to traverse all fields without looking at the values. Theirfor it is possible to quickly jump to any subvalue if you know it's path. If you are looking for a particular string, you can also skip any with the wrong length! Since object and array fields also begin with a length, you can jump past them if you know the do not contain the value you are looking for. This means that seeking inside a more tree like object is more efficient than seeking into a more list like object!

performance

This design is optimized for the performance of in-place reads. Encoding is expected to be slower because of the need to calculate the length of collections before encoding them. If encoding is within half as fast as a format intended for encoding perf, that is good. Of course, the intention with an in-place read system is that you encode once and then never decode. Just pass around the binary object, reading fields out when necessary.

Because of the length encoding, the ability to update in-place is very limited (not recommended actualy) but if you are building a system around immutable data, that is not much of a problem. Although, since subobjects are fully valid as an encoded value, you can easily copy a subobject into a new object, etc, without re-encoding.

benchmark

I did a simple benchmark, where I encoded and decoded this module's package.json file in various ways. Please not that I am comparing the performance of code written in C with code written in javascript. If the javascript is within 10x the performance of the C then we are doing well! (and a C implementation would likely close that gap)

The measurement is milliseconds to perform 10k operations. Code is at the end of ./test.js

binary.encode 228
JSON.stringify 38
binary.decode 212
JSON.parse 52
JSON.parse(buffer) 66
JSON.stringify(JSON.parse()) 96
binary.seek(string) 44
binary.seek(buffer) 10

As expected, binary.encode is much slower than JSON.stringify, but it's only 6 times worse. But the interesting comparison is JSON.stringify(JSON.parse()) and binary.seek(buffer). Often, in implementing a database, you need to read something from disk, examine one or two fields (to check if it matches a query) and then write it to network.

(note: the binary.seek operation is fairly realistic, we seek to the "dependencies" object, then look up "varint" inside of that, then decode the version range of "varint". So it's two comparisons and decoding a string out)

So, in JSON land, that usually means reading it, parsing it, checking it, stringifying it again. This involves reading each byte in the input and allocating memory for the parsed object. Then traversing that object in memory and writing something to a string (more memory allocation, and all this memory allocation means the garbage collector needs to handle it too)

But if we have in-place reads, we just read raw binary, seek into the appropiate places to check wether it's the objects we want, and then write it to the network directly. We don't allocate any new memory after reading it.

Further benchmarks and tests are necessary, but that it can be this fast using a javascript implementation is impressive.

cannonicisity

For a system with signatures, it's highly important that data is cannonical. There should be exactly one way to encode a given data structure. There are a few edge cases here that need to be checked for. (not implemented yet)

  • varints must not be zero padded
  • chrome and firefox preserve order of object keys, but any integer keys greater than zero come first, and are in increasing order.
  • the length of subfields must be checked to not excede their container's length. (This is a security issue)

These properties can all be checked by traversing the tags but without reading the keys or values. I will not consider this module ready until there are tests that cover these invalid cases, to ensure that implementations throw an error.

api

encode, decode, encodingLength follow the interface specified by abstract-encoding

encode (value, buffer, start) => length

write value to buffer from start. returns the number of bytes used.

decode (buffer, start) => value

read the next value from buffer at start. returns the value, and sets decode.bytes to number of bytes used.

encodingLength (value) => length

returns the length needed to encode value

getValueType (value) => type

returns the type tag that will be used to encode this type.

getEncodedType (buffer, start) => type

get the type tag at start

types.{string,buffer,int,double,array,object,boolnull,reserved}

an object containing the type tags.

seekKey (buffer, start, target) => pointer

seek for a key target within an object. If getEncodedType(buffer, start) !== types.object then will return -1. Otherwise, seekKey will iterate over the encoding object and return a pointer to where it starts.

Since this defines a recursive encoding, a pointer to any valid sub-encoding is a valid start value.

var obj = {
  foo: 1,
  bar: true,
  baz: 'hello'
}
//allocate a correctly sized buffer
var length = b.encodingLength(obj)
var buffer = Buffer.alloc(length)

//encode object to buffer
b.encode(obj, buffer, 0)

//parse entire object and read a single value
console.log(b.decode(buffer, 0).baz)

//seek and decode a single value
console.log(b.decode(buffer, b.seekKey(buffer, 0, 'baz')))

see performance section for discussion on the performance of seek - if it's only needed to parse a couple of elements, it can be significantly faster than parsing.

seekPath (buffer, start, path, path_start) => pointer

The same as seekKey, except for a recursive path. path should be an array of strings encoded with this format. path_start should a pointer to the encoded array in that buffer.

License

MIT