Look into use of DataViews and ArrayBuffers for more efficient data send/recv #2204

Closed
pzwang opened this Issue Apr 22, 2015 · 25 comments

Comments

@almarklein


Contributor

almarklein commented Apr 23, 2015

What we need

First a definition: this is all about improving the support for typed homogeneous array data. Let's call this a typed-array, which means e.g. Float32Array in JS, and a Numpy array in Python.

We need a format:

  • A format to serialize structured data, either to be sent directly between client and server, or to be exported to HTML, as an alternative to JSON, but with support for typed-arrays.
  • On the Python side, serialize numpy arrays directly, without turning into a list first.
  • On the JS side, parse directly into e.g. Float32Array, without going to a normal array first.
  • The array data should be stored in an efficient manner. Storing in a binary format is preferred to avoid base64 encoding.
  • Compression would be nice, but is not trivial in JS, so probably not a key necessity.
  • Need support for typed-arrays data of different dtypes (uint16, float32, float64, etc.)
  • Dimensionality: JS only knows 1D arrays, but maybe we want to support nd arrays at some point (via some sort of numpy.js)? It would also make the format usable in other contexts (e.g. as a lightweight scientific data container, or Python-R communication).
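To make the base64 point above concrete, here is a small sketch (not from the thread) showing the roughly 33% size overhead that base64 adds on top of raw array bytes:

```python
import base64

import numpy as np

# base64 encodes every 3 bytes of input as 4 bytes of output,
# so binary data grows by a factor of about 4/3 (plus padding).
arr = np.arange(1024, dtype=np.float32)
raw = arr.tobytes()          # 1024 elements * 4 bytes = 4096 bytes
b64 = base64.b64encode(raw)  # 5464 bytes

print(len(raw))              # 4096
print(len(b64))              # 5464
print(len(b64) / len(raw))   # ~1.33
```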

Proposal

I can see three options now:

A: use JSON
We can write a Python and JS "extension" to store typed-arrays as a dict, e.g.: {"__array__": "LONG_STRING_OF_BASE64", "size": 512, "dtype": "float32"}. The __array__ key is a special marker that a decoder can recognise, so the field can be decoded as a typed array. The advantage is that we keep using JSON, which people are familiar with. The disadvantage is that it's not very efficient, since the arrays need to be encoded/decoded in base64, which takes time and space.
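A minimal sketch of what such an encoder/decoder pair could look like on the Python side. The function names and the exact dict layout beyond __array__/size/dtype are illustrative, not a fixed spec:

```python
import base64
import json

import numpy as np

def encode(obj):
    """json.dumps hook: represent numpy arrays as a tagged dict."""
    if isinstance(obj, np.ndarray):
        return {"__array__": base64.b64encode(obj.tobytes()).decode("ascii"),
                "size": obj.size,
                "dtype": str(obj.dtype)}
    raise TypeError("not JSON serializable: %r" % (obj,))

def decode(d):
    """json.loads object hook: turn tagged dicts back into numpy arrays."""
    if "__array__" in d:
        raw = base64.b64decode(d["__array__"])
        return np.frombuffer(raw, dtype=d["dtype"])
    return d

arr = np.linspace(0, 1, 5, dtype=np.float32)
text = json.dumps({"data": arr}, default=encode)
out = json.loads(text, object_hook=decode)["data"]
```

On the JS side the same tagged dict would be decoded with atob plus a Float32Array view over the resulting buffer.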

B: use UBJSON
We can (mis)use UBJSON's feature of efficiently storing homogeneous arrays to store typed-arrays. The advantage is that it's binary and fast, and uses an existing protocol (though we would have to modify the parsers a bit). The disadvantage is that we're using UBJSON in a way it's not really intended for. It would also be limited to 1D arrays.

C: roll our own format
We could specify a new format with full support for typed arrays (and giving up JSON compatibility). If we derive it from UBJSON (which has a very smart and simple spec IMO) it would not be a huge task. The advantage is that we get exactly what we need. The disadvantage is that we would add yet another competing format.


details below


Here are some "binary json" variants:

  • ubjson is interesting in that it aims for compliance with JSON (whereas other JSON-ish formats introduce new features), yet it supports efficient storage of typed homogeneous arrays, which in turn allows storing binary data efficiently. The spec is also very simple, to ease adoption. I don't think typed homogeneous arrays are parsed into JS typed arrays, but maybe we can write a JS parser that does this, and a Python parser that is numpy-aware?
  • msgpack is a binary JSON variant that aims to be very compact. It allows binary data and also extensions (which make it less compatible with JSON). No support for typed homogeneous arrays (though that could be added with an extension). There is binary-pack, which adds support for "distinct string and binary types", though I cannot find any docs for it.
  • bson is another binary JSON format; it also has types like date, MD5, and more, making it not very compatible with JSON.
  • smile is another binary JSON format; it seems dated.
  • bjson seems dead.

The problem with most of these is that we need something more than JSON offers. This makes these formats not very suitable, except perhaps ubjson, for which we could hijack the support for storing homogeneous arrays efficiently as a means to store typed-arrays.
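For a feel of what hijacking ubjson's optimized container would mean, here is a sketch that emits a strongly-typed float32 array by hand. The byte layout follows my reading of the UBJSON spec ('[' '$' type '#' count, big-endian payload, no closing ']' when a count is given) — treat the details as an assumption to verify against the spec:

```python
import struct

def ubjson_float32_array(values):
    """Sketch of a UBJSON strongly-typed array: '[$d#U<n>' + payload.

    'd' is UBJSON's float32 marker and 'U' a uint8 count; UBJSON numeric
    payloads are big-endian. With an explicit count, the optimized
    container has no closing ']'.
    """
    assert len(values) < 256  # uint8 count only, for this sketch
    header = b"[$d#U" + bytes([len(values)])
    payload = struct.pack(">%df" % len(values), *values)
    return header + payload

buf = ubjson_float32_array([1.0, 2.0, 3.0])
```

A numpy-aware parser could slice the payload straight into np.frombuffer (after a byteswap to native endianness), and a JS parser could hand it to a Float32Array via a DataView.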

We could "extend" json by storing typed-arrays as base64 encoded strings. If we have a special encoder/decoder on both the JS and Python side, numpy arrays could transparently become typed JS arrays. See e.g. http://stackoverflow.com/a/24375113/2271927 and EJSON.

Other formats

  • For reference, I once wrote the SSDF format, which has implementations in Python and Matlab. It is similar to JSON (in its data types, and in that it's human-readable), but also supports nd typed-arrays, which are zlib-compressed and then base64-encoded.
  • XML: just kidding :)
  • hdf5 is too much

Since we're interested in storing data, we quickly end up in the more scientific formats, which are generally rather complex. Should we consider coming up with something ourselves?

@almarklein


Contributor

almarklein commented May 14, 2015

cc @bryevdv we already had an issue for this. Will start looking into this next week.

@almarklein almarklein self-assigned this May 14, 2015

@almarklein


Contributor

almarklein commented May 19, 2015

I looked into compression. Not necessary per se, but it would be nice to reduce the size of transfers and exported documents. Unfortunately, you'd need a third-party library, and these are all pretty big.

Examples for compression schemes that are also built into Python:

  • Pako is a zlib implementation in JS which is reportedly very fast. Comes in at a 5.5k sloc count though.
  • js-deflate provides deflate/inflate (i.e. zlib) compression at about 2k sloc.
  • lzma is an LZMA (de)compressor at 2.6k sloc.
  • here is a bz2 decompressor in 245 sloc, docs are very sparse, but it seems to have been the basis for many other bz2-related libs.

Other examples (would need something on the Python side):

  • lz-string is a string compression lib (no typed arrays IIUC) that provides LZW-ish compression at 485 sloc. Not an "official" format though.
  • lz4.js provides lz4 compression at 6k sloc.
  • minilzo implements LZO (de)compression in 560 sloc, based on minilzo.c which is written by the O in LZO.
  • LZ77-kit provides LZ77 compression for various languages. LZ77 is a component of many other compression schemes, but on its own it is a very simple algorithm, very fast, and offering moderate compression.
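As a rough illustration of what a zlib-style codec (the scheme Pako implements in JS) buys on array data — compression ratios depend heavily on the data's entropy, so this is only indicative:

```python
import zlib

import numpy as np

# Repetitive/structured data compresses well...
smooth = np.tile(np.arange(100, dtype=np.float32), 100).tobytes()
# ...while random float bits barely compress at all.
noisy = np.random.default_rng(0).random(10000).astype(np.float32).tobytes()

print(len(zlib.compress(smooth)) / len(smooth))  # small ratio
print(len(zlib.compress(noisy)) / len(noisy))    # close to 1.0
```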
@almarklein


Contributor

almarklein commented May 21, 2015

Ok, I think I've done enough googling and reading through format specs for now. I need some help moving further. In my first post of this issue I put together an overview and I propose three options. cc @bryevdv @pzwang

@bryevdv


Member

bryevdv commented May 21, 2015

@almarklein An idea that has been bandied about was to remove the actual data from ColumnDataSource, and instead have data sources be lightweight objects that are configured with a "remote" actual data store. This could be a URL to a REST endpoint, or a Blaze server. Or it could be a reference to a "local remote" data store that lives in the browser but is separate from all the other Bokeh models. I think if we did this kind of separation of the actual data payload from the lightweight data source model, it would allow all the normal Bokeh objects to remain simple plain JSON, and then just the data columns could be transmitted separately in an enhanced JSON, or non-JSON, format. I like the idea of making all the Bokeh models "lightweight" on its own, but if it would also help preserve the simple JSON representation for the majority of things, that would be another big point in favor for me. Thoughts @bokeh/core?
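A toy sketch of that separation, with all names (RemoteDataSource, resolve, the store keys) invented for illustration — this is not Bokeh API:

```python
# The model keeps only a reference to where the columns live,
# not the columns themselves, so it serializes to plain JSON.
class RemoteDataSource:
    def __init__(self, ref):
        self.ref = ref  # e.g. a REST URL, a Blaze server address, or a local key

    def to_json(self):
        return {"type": "RemoteDataSource", "ref": self.ref}

# Stand-in for a "local remote" store living alongside the models.
local_store = {}

def resolve(source):
    """Fetch the actual payload; in the browser this could be an XHR."""
    return local_store[source.ref]

local_store["lineplot/xs"] = [0, 1, 2, 3]
src = RemoteDataSource("lineplot/xs")
```

The point of the design is that src.to_json() stays trivially JSON-serializable while the bulk payload behind "lineplot/xs" can travel in whatever binary format is chosen.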

@bryevdv


Member

bryevdv commented May 21, 2015

Other comments: I'm OK with 1D only; if we want to do something simple like storing a simple shape in some conventional way, that would be OK too. Almost all the use-cases in Bokeh are around tabular columns, so that is what we should optimize for.

@almarklein


Contributor

almarklein commented May 21, 2015

If we separate the data, as you suggest, there is no longer a need for a structured data format, so we could probably do with something simpler in that case. I'm interested in hearing more about this ...

@almarklein almarklein referenced this issue Jul 20, 2015

Closed

Ongoing WebGL related dev #2590

16 of 22 tasks complete
@bryevdv


Member

bryevdv commented Jul 21, 2015

@almarklein we should probably have a call sometime soon. My current plan is to implement a binary protocol over web sockets. Doing this, it seems possible to send NumPy/Pandas data directly into a JS ArrayBuffer, which can have a typed array view on it without any copying. I'd like to get more input from people (and possibly help as well).

@bryevdv


Member

bryevdv commented Jul 21, 2015

To add a little more: I intend to make the wire protocol an implementation detail, so that we can change things later if we need to. For instance, msgpack has had some integration into blaze-server, so it might make sense to look at. But for now I am just going to do arr.tobytes along with a header that has type/shape info, which works perfectly fine.
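A minimal sketch of that scheme — raw arr.tobytes() plus a small type/shape header. The JSON header layout here is just one possible choice:

```python
import json

import numpy as np

# Sender side: one small header message, one raw binary message.
arr = np.arange(12, dtype=np.float32).reshape(3, 4)
header = json.dumps({"dtype": str(arr.dtype), "shape": list(arr.shape)})
payload = arr.tobytes()

# Receiver side: rebuild without any per-element decoding.
# (In JS this step is `new Float32Array(arrayBuffer)` plus the shape.)
meta = json.loads(header)
restored = np.frombuffer(payload, dtype=meta["dtype"]).reshape(meta["shape"])
```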

@almarklein


Contributor

almarklein commented Jul 21, 2015

If we're storing/sending the data separately, we only need to store a blob, a shape, and a type, so a simple dedicated format is fine IMO. Msgpack seems overkill unless we want to send the data along with all the model stuff that we now send via JSON.

I assume we can use the same format to store data in static HTML (but base64 encoded)?

@damianavila


Contributor

damianavila commented Jul 22, 2015

Interesting stuff... @bryevdv, do you have a branch where you prototyped this?

@waylonflinn


waylonflinn commented Feb 8, 2016

@bryevdv @almarklein I'm working out exactly the same data exchange problem for interoperability between numpy and weblas. I came to the same conclusion that you guys did: bytes with type and shape should be sufficient.

There aren't a ton of ways to do that, but I'd like to be compatible with you guys from the start. Do you have code or a simple spec you can share?

@bryevdv


Member

bryevdv commented Feb 8, 2016

I haven't worked anything specific out yet, beyond the "proof of concept", which was nothing more than sending arr.tobytes over a web socket as a binary message. But I'm certainly open to any discussion.

@bryevdv


Member

bryevdv commented Feb 8, 2016

@damianavila I found it, it's not much:

from __future__ import print_function
from flask import Flask, render_template
from tornado.wsgi import WSGIContainer
from tornado.web import Application, FallbackHandler
from tornado.websocket import WebSocketHandler
from tornado.ioloop import IOLoop

import numpy as np

arr = np.arange(10, dtype=np.float32)
arr_bytes = arr.tobytes()
shp = list(arr.shape)
meta = {
    "size": len(arr_bytes),
    "shape": shp,
    "type": "float32",
}


class WebSocket(WebSocketHandler):
    def open(self):
        print("Socket opened.")

    def on_message(self, message):
        self.write_message("\0")
        self.write_message(meta)
        self.write_message(arr_bytes, binary=True)

    def on_close(self):
        print("Socket closed.")

app = Flask('flasknado')

@app.route('/')
def index():
    return render_template('index.html')

if __name__ == "__main__":
    container = WSGIContainer(app)
    server = Application([
        (r'/array/', WebSocket),
        (r'.*', FallbackHandler, dict(fallback=container))
    ])
    server.listen(8080)
    IOLoop.instance().start()
@bryevdv


Member

bryevdv commented Feb 8, 2016

and then something like this:

/* Client-side component for the Flasknado! demo application. */

var socket = null;
var state = 0;
var header = null;
var array = null;
$(document).ready(function() {
    socket = new WebSocket("ws://" + document.domain + ":8080/array/");
    socket.binaryType = 'arraybuffer';

    socket.onopen = function() {
        socket.send("Joined");
    }

    socket.onmessage = function(message) {
        if (state == 0 && message.data == "\0") {
            state = 1;
        }
        else if (state == 1) {
            header = message.data;
            state = 2;
        }
        else if (state == 2) {
            array = new Float32Array(message.data);
            state = 0;
            debugger;
        }
    }
});

function submit() {
    var text = $("input#message").val();
    socket.send(text);
    $("input#message").val('');
}
@datnamer


datnamer commented Feb 8, 2016

Looks like datashader could use this as well, if it would reduce JSON comm overhead: pyviz/datashader#49 (comment) @brendancol @philippjfr

@waylonflinn


waylonflinn commented Feb 8, 2016

@bryevdv thanks! That's about where I am too (though I'm serializing to disk and serving with http-server). I'm working on something simple (based on npy) to augment this. Will keep you guys in the loop, if you're interested.

here's my client side code

var xhr = new XMLHttpRequest();
var data = null;

xhr.open("GET", "arr.buf", true);
xhr.responseType = "arraybuffer";

xhr.onload = function (e) {
  var arrayBuffer = xhr.response; // Note: not xhr.responseText
  if (arrayBuffer) {
    data = new Float32Array(arrayBuffer);
  }
};

xhr.send(null);

and here's the snippet for serializing

# given array 'a'
f = open('./arr.buf', 'wb')
f.write(a.astype(np.float32).tostring())
f.close()

@datnamer in my tests, serializing (float32) to disk as bytes (instead of json) reduces to 1/5 the size. very significant for me.
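A quick way to reproduce that kind of ratio locally (the exact factor depends on the values; random float32 data lands in the same ballpark as the ~5x reported above):

```python
import json

import numpy as np

rng = np.random.default_rng(0)
a = rng.random(10000).astype(np.float32)

as_json = json.dumps(a.tolist()).encode()  # each value becomes a long decimal string
as_bytes = a.tobytes()                     # each value is exactly 4 bytes

print(len(as_bytes) / len(as_json))  # well under 1; roughly 1/4 to 1/5 here
```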

@bryevdv


Member

bryevdv commented Feb 9, 2016

Size is an issue (though there are probably cases where the size actually increases, e.g. an array of small ints), but being able to skip the encoding entirely and get the data into a typed array view directly is another huge benefit. I should also clarify: we do have a higher-level protocol for Bokeh that allows for multipart messages. My intent was to send each buffer as a separate message part to avoid unnecessary copying. It's this "just for the array" part of the protocol that has not been fleshed out. Any input is certainly very welcome.
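The small-int counterexample is easy to demonstrate (illustrative, not from the thread): as text, "0" costs one character, while the raw value costs a full element width.

```python
import json

import numpy as np

a = np.zeros(1000, dtype=np.int64)

as_json = json.dumps(a.tolist()).encode()  # "0, " per element: ~3000 bytes
as_bytes = a.tobytes()                     # 8 bytes per element: 8000 bytes

print(len(as_json) < len(as_bytes))        # True: here text wins on size
```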

@waylonflinn


waylonflinn commented Feb 9, 2016

Thanks. I'm just beginning to feel my way around the space. It's great to hear how other smart people have solved the problem.

I like your point about just getting the data into a typed array as quickly as possible. I was also considering a separate descriptor file as an option for just that reason. Another option I like a lot (for the Ajax/HTTP case) is custom headers. Maybe using a custom mime type and an extra header field for shape. It would be great if this could be made to play well with caching, so that reshaping the same data didn't trigger a new download.

Love to hear thoughts on these ideas.
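A sketch of the custom-header idea; the mime type and the X-Array-* header names are invented for illustration, not any standard:

```python
import numpy as np

def make_response(arr):
    """Ship the raw buffer as the body, with dtype/shape in headers.

    Header names and the mime type are hypothetical.
    """
    return {
        "headers": {
            "Content-Type": "application/x-typed-array",
            "X-Array-Dtype": str(arr.dtype),
            "X-Array-Shape": ",".join(str(n) for n in arr.shape),
        },
        "body": arr.tobytes(),
    }

resp = make_response(np.ones((2, 3), dtype=np.float32))

# Client side: parse headers, view the body without copying per element.
shape = tuple(int(n) for n in resp["headers"]["X-Array-Shape"].split(","))
restored = np.frombuffer(resp["body"], dtype=resp["headers"]["X-Array-Dtype"]).reshape(shape)
```

One nice property of this split: the body is shape-independent raw bytes, so a cached body could in principle serve different reshapes of the same data, as hoped for above.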



@jakirkham


Contributor

jakirkham commented Sep 15, 2016

It would be really great to hear about progress in this space. I'm trying to prepare some stuff for publication, and Bokeh is going to play a part in that. However, I have noticed how big these HTML files are, and performance is something we are seeking to improve.

Using something like bson that can be serialized easily between pure Python or the Python Mongo API and JavaScript as well as go into MongoDB, seems pretty nice all around given what you want to achieve. If the format is too flexible, I suppose one can restrict themselves to the relevant subset that will work. Though maybe there are other constraints that I'm unaware of.

While compression is definitely a laudable goal, my recommendation would be to think about it after choosing a binary format that works. Compression is always a game of trade-offs and what one person is willing to give up another might not. So perhaps having a simple plugin interface for different compression options would be valuable to avoid being too attached to a particular one. Though I would note one general constraint that seems to be important to Bokeh (being an interactive data visualization program) is speed. If too much time is spent doing decompression, it can hurt user experience.
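One possible shape for such a plugin interface — a codec registry keyed by name, with identity as the default (all names illustrative):

```python
import zlib

# Each codec is a (compress, decompress) pair; "none" is the default,
# so messages are self-describing about how their payload was encoded.
CODECS = {
    "none": (lambda b: b, lambda b: b),
    "zlib": (zlib.compress, zlib.decompress),
}

def pack(payload, codec="none"):
    compress, _ = CODECS[codec]
    return {"codec": codec, "data": compress(payload)}

def unpack(msg):
    _, decompress = CODECS[msg["codec"]]
    return decompress(msg["data"])

raw = bytes(1000)  # a highly compressible payload
msg = pack(raw, "zlib")
```

Registering a new scheme is then just another entry in CODECS, which keeps the wire format from being married to any single compressor.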

@jakirkham


Contributor

jakirkham commented Oct 3, 2016

Also Numscrypt may be worth looking at as far as having an array in JavaScript/TypeScript.

@jakirkham


Contributor

jakirkham commented Oct 17, 2016

This is a little orthogonal from the serialization issue. However, scijs provides support for ndarrays in JavaScript along with a host of functions to work with and compute things from them. Probably worth a look at least.

@jakirkham


Contributor

jakirkham commented Oct 17, 2016

Also numjs, which builds on scijs may give a more NumPy-like feeling when working in JavaScript.

@bryevdv


Member

bryevdv commented Nov 8, 2016

some additional info in this experimental PR: #5429

basically, using a simple base64 encoding seems to give a ~3x improvement over non-websocket type renders, and a 14x speedup over push_notebook. So I think we will just start with a base64 approach; the trick is making it work completely over all the different possible ways to send, embed, and transmit things... I think there will need to be some comprehensive work starting from the lowest-level encoders, and also some consolidation of how push_notebook works.

@bryevdv


Member

bryevdv commented Jan 1, 2017

There are still possibilities of exploring other encodings, or multi-part messages in the context of the server. But the work in #5544 provides a clear improvement, and also sets a foundation. Future work should have new issues.
