
Server rewrite based on Tornado #2794

Merged
merged 306 commits into from
Nov 4, 2015
Conversation

bryevdv
Member

@bryevdv bryevdv commented Sep 1, 2015

This PR is intended to be a place to have concrete discussions about technical points in the new server development. I intend to submit a new PR with all commits squashed, when this is ready to merge.

Notes

To see some basic operation with messages passed back and forth from client and server, run

python bk.py  # basic server

and

python client.py # basic client

The class/logic in client.py is eventually intended to be rolled into bokeh.session; then real control and updating of the server will be possible, with no more ugly polling, ever.

Concepts

  • WSHandler — receives individual message fragments and passes them to a Receiver until a message is complete, then passes the message to a handler
  • Receiver — responsible for assembling message fragments into complete messages
  • Message — defines an operation on the server or client
  • Protocol — defines a specific collection of message revisions and helps automate their creation
  • ServerSession — an in-memory document cache for one client connection (e.g., a single browser tab)
  • ServerDocument — encapsulates a Bokeh Document on the server; manages a work queue that serializes any storage operations on the document

Wire Protocol

Modeled loosely after the IPython protocol. A message comprises these sub-parts, sent as individual websocket messages:

[
    # these are required
    b'foobarbaz',       # HMAC signature
    b'{header}',        # serialized header dict
    b'{metadata}',      # serialized metadata dict
    b'{content}',       # serialized content dict

    # these are optional, and come in pairs
    b'{buf_header}',    # serialized buffer header dict
    b'array',           # raw buffer payload data, possibly binary
    ...
]
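As a minimal sketch of framing a message this way (the helper name and HMAC details here are illustrative assumptions, not the actual Bokeh implementation):

```python
import hashlib
import hmac
import json

def frame_message(secret, header, metadata, content, buffers=()):
    # Illustrative only: build the list of websocket fragments
    # described above, signature first.
    frames = [
        json.dumps(header).encode("utf-8"),
        json.dumps(metadata).encode("utf-8"),
        json.dumps(content).encode("utf-8"),
    ]
    # optional buffer pairs: (buf_header_dict, raw_payload_bytes)
    for buf_header, payload in buffers:
        frames.append(json.dumps(buf_header).encode("utf-8"))
        frames.append(payload)
    # sign the required fragments so the receiver can verify them
    sig = hmac.new(secret, b"".join(frames[:3]), hashlib.sha256).hexdigest()
    return [sig.encode("utf-8")] + frames
```

A receiver would verify the signature by recomputing the same HMAC over the header/metadata/content fragments it assembles.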

New Document Idea

The basic idea is for Document to be a data structure composed of three parts: variables, models, stores:

class Document(HasProps):

    # basically a namespace for simple state to be included in Bokeh docs
    vars = Dict(String, Any)

    # like current models, except all lightweight -- no actual data
    models = Dict(String, Instance(Model))

    # actual data goes here
    stores = Dict(String, Instance(DataStore))

Various use-cases are possible:

  • sessions can copy everything when the session is created (no updates pushed to sessions)
  • sessions could share/be linked to some combination of vars, models, stores (updates get pushed to sessions)
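The first use-case can be sketched with plain dicts standing in for the property system and DataStore (neither of which exists yet, so everything here is illustrative):

```python
import copy

# Illustrative stand-in for the proposed Document; the real class
# would use the HasProps machinery and a DataStore type.
class Document:
    def __init__(self):
        self.vars = {}    # simple state namespace
        self.models = {}  # lightweight models, no data
        self.stores = {}  # actual data lives here

doc = Document()
doc.vars["title"] = "example"
doc.stores["source1"] = {"x": [1, 2, 3]}

# "sessions can copy everything when the session is created":
# the session gets an independent snapshot, so later edits to the
# canonical document are not pushed to it
session_doc = copy.deepcopy(doc)
doc.stores["source1"]["x"].append(4)
```

The shared/linked mode would instead hold references to some subset of vars/models/stores, so changes propagate to the session.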

Additionally, there should be some provision or mode for clients to store changes that they push to a session back to the persistent backend. We need to distinguish between changes pushed to the session cache vs. changes intended to be saved. A GH commit-vs-push analogy?

Questions

  • Message subclasses have handle_server and handle_client methods on them, which are intended to be used by the server and client respectively to handle incoming messages. Would it be worthwhile to factor out the handlers? There are definitely some nice reasons to do that, but it would also spread the codebase out even more.
  • The create method on Protocol picks out the right message type and calls create on the message class. It just passes any required params on as args and kwargs, but the user who calls create has to know what the additional params are for the specific message type. Maybe there is a better way?
  • Is there any particularly good way to start a Tornado server in the IPython notebook? Obviously we could just use subprocess, etc., but thought perhaps some tighter IPython integration might be possible.
  • Protocol is defined in an __init__; there's probably a better arrangement. Source layout suggestions welcome in general.
  • Currently, switching between encoded strings/bytes seems a bit ad-hoc and clunky. Is there a more consistent approach?

Todo

  • specify remaining messages/operations
  • implement new document class
  • specify storage API
  • Implement storage API
    • in-memory
    • redis
  • implement message handlers
  • implement JS client
  • improve python client, Session integration
  • admin page
  • index page
  • integrate/delegate auth checking
  • update polling demos to not do polling

@fpliger
Contributor

fpliger commented Sep 2, 2015

My hot reaction to the proposal is absolutely positive. Need to dedicate some more time to go through it more carefully but here are a couple of quick thoughts:

  • The document structure separation is a nice one. It makes things more flexible, and easier to manage partial changes/serialization of models and data.
  • I didn't see a mention of pushes/updates (I may be overlooking it). I'd strongly favor partial vars/models/data updates rather than monolithic entire-document updates.
  • With this new concept the server simply acts as a service. This means that the notion of a "bokeh-server applet" is gone. Applications (either web or native) should just handle their own logic on their own and "use" the bokeh-server service. Is that correct?

Before pouring out the river of thoughts I have 😝 it's better to wait for an answer to that last question..

@bryevdv
Member Author

bryevdv commented Sep 2, 2015

@fpliger Yes, I need to add some notes about updates; that's basically what the "decide what messages/operations exist" bullet is about. I had planned on supporting both full and incremental/partial updates (i.e., a message specifically for a streaming append, for instance, that checks that all columns get appended to equally and complains informatively if not). I'll put up some initial thoughts today and we can iterate on them.

Regarding applets, I was still imagining a way to associate "scripts" or "apps" with a document, so the code could be run in the server (in thread pool executors since we are async) with "direct" access to the models. This is basically @havocp's "server callbacks" afaict.

Edit: but maybe everything interacting through messages would be simpler and enough? I'm open to discussion though I feel like the want is to be able to "publish" an app and further I think in people's minds this implies it is somehow stored in the server as well.


@gen.coroutine
def workon(self, item):
    yield self._queue.put(item)
Contributor

This means you are not interested in the result of the processing, or even the information that the processing is done, right?

Member Author

Not exactly, most of the processing here will involve updating the persistent store, perhaps as a result of a longer running computation. Those updates might trigger sending of notifications. Right now this part is just basic scaffolding, though.

Member Author

For instance, a computationally expensive callback gets queued here that eventually updates the data source for a plot. That causes another chunk of work to get queued on the main IOLoop that will either push updates to the connected client, or send a message that there are updates available to pull. Does that seem reasonable? I suppose the yield here could return that secondary work and queue it on the other queue, though...
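The hand-off described here might look something like this sketch (all names hypothetical; plain concurrent.futures stands in for the Tornado executor/IOLoop plumbing):

```python
from concurrent.futures import ThreadPoolExecutor

notifications = []

def expensive_callback():
    # stands in for a CPU-heavy user callback that computes
    # new data for a plot's data source
    return {"x": [1, 2, 3]}

def notify_client(update):
    # stands in for queueing a "push update" (or "updates available")
    # message for the connected client on the main IOLoop
    notifications.append(update)

executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(expensive_callback)
# in Tornado this chaining would go through IOLoop.add_callback so the
# notification runs on the IOLoop thread; here we chain directly
future.add_done_callback(lambda f: notify_client(f.result()))
executor.shutdown(wait=True)
```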

Member Author

Speaking of this topic, I was planning on serializing any operations that might write to the document, and servicing reads from the document immediately.

@havocp
Contributor

havocp commented Sep 2, 2015

I'm very enthusiastic about this direction! Some comments from an initial skim of the code.

  1. You can probably simplify some protocol things here by relying on TCP connection properties of a websocket.
    • the per-message HMAC shouldn't be needed unless I'm missing something - TCP already protects you from corruption. The one thing that can happen is truncation (connection closed before you get the whole message), but you could detect that with just a length instead of the expensive HMAC.
    • messages could have an incrementing integer (serial number) rather than a relatively expensive UUID, because they are "scoped" by the connection (the session or connection could have a UUID, if you need a message to have a UUID you could combine the session UUID with the message's serial number)
  2. It isn't clear to me the difference between "header" and "metadata"
  3. I've often seen protocol versioning schemes like this that go unused - because the easier path is generally to make compatible protocol changes, instead, like adding fields or adding new messages (plus having a convention to ignore unknown messages). But before discussing how to evolve the protocol... what is the scenario where the client and server diverge? Won't we be serving the JavaScript from the same server that speaks the protocol, so they will always be in sync?
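The length-based truncation check from the first point could be as simple as this sketch (function name and framing are assumptions for illustration):

```python
def is_truncated(fragments, declared_length):
    # cheaper than an HMAC: TCP already guards against corruption,
    # so the only failure mode to detect is a short read, which a
    # declared total length catches
    received = sum(len(f) for f in fragments)
    return received != declared_length
```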

METADATA = 2
CONTENT = 3
BUFFER_HEADER = 4
BUFFER_PAYLOAD = 5
Contributor

A possibly-more-concise way to do this state machine is to make each state a function, and have a variable containing the current function. Like self.current_consumer = consume_hmac kind of thing. Then you avoid the "if" statements and the enum itself. Just a thought
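The function-per-state receiver could look like this sketch (only the three required fragments are handled, and fragment parsing is omitted; class and method names are illustrative):

```python
class Receiver:
    def __init__(self):
        self.message = None
        self._partial = {}
        self.current_consumer = self._consume_header

    def consume(self, fragment):
        # no "if" chain or state enum: just call the current state function
        self.current_consumer(fragment)

    def _consume_header(self, fragment):
        self._partial = {"header": fragment}
        self.current_consumer = self._consume_metadata

    def _consume_metadata(self, fragment):
        self._partial["metadata"] = fragment
        self.current_consumer = self._consume_content

    def _consume_content(self, fragment):
        self._partial["content"] = fragment
        self.message = self._partial  # message is now complete
        self.current_consumer = self._consume_header
```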

Member Author

Yup was already thinking of making basically that change.

@pitrou
Contributor

pitrou commented Sep 2, 2015

Message subclasses have handle_server and handle_client methods on them, which are intended used by the server and client respectively to handle incoming messages.

IMO deciding what to do with messages is the prerogative of the server and client, respectively, so those methods have nothing to do on the message objects themselves. Message objects should provide methods pertaining to their inner logic, but that's all.

raise gen.Return(None)

# make sure the session ID from the client message matches
if message.header['sessid'] != self.session.id:
Contributor

I would expect this check to be once when the connection opens, rather than per-message

Member Author

Just being paranoid I guess, seemed that perhaps a client could forge a different session id if it wanted to.

Contributor

what I meant I guess is you could send a special "my session id is ___" message as the first thing on the connection and then omit it from each message, then it can't be wrong and doesn't need to be checked per-message

Member Author

Yes, sending the session ID is something that happens when the server replies to a connection open with an ACK. I think it would still need to be sent if we are going to simplify the message IDs?

@bryevdv
Member Author

bryevdv commented Sep 2, 2015

IMO deciding what to do with messages is the prerogative of the server and client, respectively, so those methods have nothing to do on the message objects themselves. Message objects should provide methods pertaining to their inner logic, but that's all.

Yah, I tend to agree; I was just imagining an explosion of little handler classes. But better to do things right.

@bryevdv
Member Author

bryevdv commented Sep 2, 2015

the per-message HMAC shouldn't be needed unless I'm missing something - TCP already protects you from corruption. The one thing that can happen is truncation (connection closed before you get the whole message), but you could detect that with just a length instead of the expensive HMAC.
messages could have an incrementing integer (serial number) rather than a relatively expensive UUID, because they are "scoped" by the connection (the session or connection could have a UUID, if you need a message to have a UUID you could combine the session UUID with the message's serial number)

Full confession: some of this part was cargo-culting on my part. I cribbed the HMAC bit from the IPython protocol. I had assumed it was to detect corruption or adulteration of message contents over HTTP connections. I do agree that an incremental counter could suffice in conjunction with a session ID, but perhaps also with an origin tag (client or server)? So the client has one counter, and the server has another. Otherwise, how do we synchronize one counter across both?

Edit: perhaps an option to enable/disable HMAC checking?
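The two-counter idea could be sketched like this (purely illustrative names): each side tags its IDs with its origin, so the two streams can never collide and neither peer needs to know the other's counter value:

```python
import itertools

class MessageIds:
    # one instance per peer; 'origin' is "client" or "server"
    def __init__(self, session_id, origin):
        self.session_id = session_id
        self.origin = origin
        self._serial = itertools.count(1)  # cheap incrementing serial

    def next(self):
        # globally unique without a per-message UUID: session id
        # scopes the connection, origin scopes the direction
        return "%s:%s:%d" % (self.session_id, self.origin, next(self._serial))
```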

It isn't clear to me the difference between "header" and "metadata"

metadata is just an isolated scratch space. Imagine we want to be able to instrument messages arbitrarily for some kind of debug or trace mode.

I've often seen protocol versioning schemes like this that go unused - because the easier path is generally to make compatible protocol changes, instead, like adding fields or adding new messages (plus having a convention to ignore unknown messages). But before discussing how to evolve the protocol... what is the scenario where the client and server diverge? Won't we be serving the JavaScript from the same server that speaks the protocol, so they will always be in sync?

It's a hedge for sure, on the notion that if you do need it later, it's a royal pain to tack on after the fact. I have definitely seen protocol versioning used, so I dunno. One scenario I can easily imagine is adding new message types, which might not be incompatible per se, but it's nice to have a record of exactly where and how these things happen. And no, I think a user could create a document that, e.g., loads BokehJS from CDN but also expects to talk to a Bokeh server for the document data.

Edit: I should say that scenario is currently doable, but not trivially so. But in general, if a person publishes a document with a specific version of Bokeh, is there never an expectation that it be rendered with that same version? If so, I think a server deployed as a "public service" would actually need to be able to serve several versions of BokehJS.

@bryevdv
Member Author

bryevdv commented Sep 8, 2015

@bokeh/dev OK some notes on the latest pushes:

  • separated out handler code from messages
  • plumbing for "push" and "pull" document messages
  • scaffolding for storage backends and new Document

I think if just the new Document class and "push" and "pull" messages can be implemented with both an in-memory and a Redis storage backend, that would be enough to reproduce current server capability, and additional niceties (binary array buffers, streaming and partial update messages) could be added incrementally later. I think this is the best chance of having something ready for 0.10, so I plan to try to stand up something rudimentary by the beginning of next week.

@fpliger
Contributor

fpliger commented Sep 8, 2015

I think if just the new Document class and "push" and "pull" messages can be implemented with both an in-memory and a Redis storage backend, that would be enough to reproduce current server capability, and additional niceties (binary array buffers, streaming and partial update messages) could be added incrementally later. I think this is the best chance of having something ready for 0.10

@bryevdv do you intend to replace the current server completely with the new (limited) one in 0.10? If so, we should shout out loud about those limitations so users that have bokeh-server in production know that their app may break. I don't expect many users are running bokeh-server in production, but still, there may be a significant number... My gut feeling is that maybe there are users relying on the server streaming capabilities.

@bryevdv
Member Author

bryevdv commented Sep 8, 2015

@fpliger I may not have been clear, any limitations would be relative to the future new server, not relative to the current one. As I mentioned implementing push/pull should reproduce current server capability. I don't intend to replace anything until that parity is reached. (Maybe some syntax will break or usage will have to change, but the feature set should be at least equal before replacing.)

I don't have strong feelings about shipping both servers tho I think my preference is to have a clean break. If people need the old server for a time they could just hold off on upgrading.

@bryevdv
Member Author

bryevdv commented Sep 8, 2015

I also don't think getting this in 0.10 is especially likely, TBH. I guess I'd give it 50/50 odds at the moment. If not, then we can go straight to 0.11 when it is ready (definitely worth a minor version bump)

@fpliger
Contributor

fpliger commented Sep 9, 2015

@bryevdv sounds good. I misunderstood your previous comment. 👍

@damianavila damianavila mentioned this pull request Sep 11, 2015
def _start_helper(self):
self.io_loop = IOLoop.current()
atexit.register(self._atexit)
signal.signal(signal.SIGTERM, self._sigterm)
Contributor

Probably a stupid question, but why is this done asynchronously? Can't this simply be done in __init__ or start()?

Member Author

Well, this was inspired loosely from the notebook server init code. I'll try simplifying it.

@bryevdv
Member Author

bryevdv commented Sep 13, 2015

@pitrou thinking about this a bit more -- I think it will have to be a little more sophisticated. Basically we need to serialize all the "work" for a given document: we don't want any operations that write to a given document to overlap, but operations on different documents can be concurrent. I think it is possible for multiple operations on a single doc to overlap with the single ThreadPoolExecutor?

My original thinking was that each doc would have its own executor (with a thread/process limit of 1, so that jobs for that document would get serialized). Does this seem reasonable? I guess that makes it hard to control the total number of threads/processes. Do you have any other suggestions?

BTW I played around with ProcessPoolExecutor and it seems pretty flaky compared to ThreadPoolExecutor. OTOH things like callbacks could definitely be CPU intensive, so processes would seem preferable.

@pitrou
Contributor

pitrou commented Sep 13, 2015

BTW I played around with ProcessPoolExecutor and it seems pretty flaky compared to the thread pool executor.

Flaky how so? It's true that process execution can be more fragile (the child process may be killed independently of the parent), although recent concurrent.futures versions should normally be quite robust.

My original thinking was that each doc would it's own executor (with thread/process limit 1, so that jobs for that document would get serialized). Does this seem reasonable? I guess that makes it hard to control the total number of threads/processes.

How many documents do you have running tasks simultaneously? If it's a small multiple of the number of cores, then it sounds reasonable. Otherwise, you risk memory consumption issues, as well as cache thrashing. Using processes only makes the issue more severe.

Another possibility is to use a per-document Lock, but inside the main thread, not inside the workers. A threading Lock inside a worker would block the whole worker thread, while a coroutine Lock from the main thread code calling the executor will simply yield execution to other coroutines. See http://tornado.readthedocs.org/en/stable/locks.html#tornado.locks.Lock

So your scheduling code could look like:

    @gen.coroutine
    def schedule_task(self, doc, *some_args):
        # the coroutine lock serializes writers per-document
        # without blocking the IOLoop
        with (yield doc.lock.acquire()):
            res = yield executor.submit(doc.task, *some_args)
        raise gen.Return(res)

@bryevdv
Member Author

bryevdv commented Sep 14, 2015

Flaky how so? It's true that process execution can be more fragile (the child process may be killed independently of the parent), although recent concurrent.futures versions should normally be quite robust.

I mean running more than two or three clients immediately results in this:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Users/bryan/anaconda/envs/tornado/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/Users/bryan/anaconda/envs/tornado/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/bryan/anaconda/envs/tornado/lib/python3.4/concurrent/futures/process.py", line 228, in _queue_management_worker
    result_item = reader.recv()
  File "/Users/bryan/anaconda/envs/tornado/lib/python3.4/multiprocessing/connection.py", line 251, in recv
    return ForkingPickler.loads(buf.getbuffer())
EOFError

and then no more work is processed at all.

Regarding the number of documents, it's hard to say. I think the best we can do to start is to build in some monitoring capability so that it's more evident when adding resources is necessary. I will play around with your second idea though, thanks!

@pitrou
Contributor

pitrou commented Sep 14, 2015

I mean running more than two or three clients immediately results in this:

Do you have a short reproducer? If it's a Python bug, it may be useful reporting it...

@bryevdv
Member Author

bryevdv commented Sep 14, 2015

Do you have a short reproducer? If it's a Python bug, it may be useful reporting it...

I believe you can just change the thread pool to a process pool with what is currently checked in, and start up several clients at once, to reproduce.

@martindurant

Please permit me to summarise my understanding of the architecture and flow here, so that my misunderstandings can be squashed earlier rather than later.

The whole system revolves around the BokehServer Tornado application. It receives input from Clients (which can, e.g., create and update Document) and bokehJS in users' browsers (which can request to view documents and pass back user actions in a view). It provides views and updates back to bokehJS and can request actions by clients which have created documents and are still connected. The server holds a set of Documents, which may have different states in each session (browser view).
Messages are formatted according to a set Protocol, and the currently understood set of messages is in server.protocol.messages.* . Both clients and servers have Receivers to construct messages out of socket streams.
Clients may receive messages (e.g., callbacks from user interactions) from the server and act on them, or the simplest clients (scripts) will create a document and then exit without accepting messages. Any example script should be runnable on the server as-is.
Do JS and client communications happen on the same port and using the same Protocol, or will JS-side comms remain more similar to how they are now?

Is this on the right lines?
One question: if a Document has a distinct state for each session, and so user actions can depend upon that state, how is the state synced between the server and the client controlling that Document, which has to react to actions? Or am I wrong, and the actions are processed by the server alone, in which case I don't know how the client can provide a complete description of what to do for all actions. I see that all communication happens asynchronously, but presumably it is all waiting for IO rather than doing CPU-bound processing.

@bryevdv
Member Author

bryevdv commented Sep 15, 2015

@martindurant yes that is basically on the right track. One thing to clarify is that a session (containing a view of documents) is basically an in-memory cache for a browser tab. That's one of the nice things about using web sockets, you just automatically get this 1-1 correspondence.

There are complications, we are trying to figure out what to do about "callbacks" that a user might want to specify to do actual work in response to UI events. These obviously might involve CPU work. After a discussion with @havocp today we have a few (hopefully) clarifying ideas. I am going to try to get those in a Wiki document for wider review tonight.

fpliger and others added 12 commits November 3, 2015 16:57
…t and hplot with VBox and HBox to avoid auto add to plotting.curdoc
…OTE: population and taylor are still failing for other reasons
remove superfluous server examples
…izer

I had somehow reinvented Property.from_json in a very broken way, when
Property already had exactly the right method. This should fix tuples
with PlotObject members, and also no doubt a number of other cases.
Python default out of sync with js
an integer cannot be a callback arg, must be plot object. pass in nlines
into code directly.

also use the renderer that's returned from calling line()
clean-up ajax_source pylint - excuse to make an examples commit
@birdsarah
Member

two questions about selection_histogram.py example - https://github.com/bokeh/bokeh/blob/tornado/examples/plotting/server/selection_histogram.py

  1. Why are we doing box_select_tool.select_every_mousemove = False and lasso_select_tool.select_every_mousemove = False? I noticed it because I just changed the python default (70b987d) for box_select, so I'm guessing that we don't need the first one. It's also making me think that we need to change the python default for lasso_select.

  2. Why do I not get a fresh document/session (not sure of my terminology) when I open it in another browser? If I open this example in two separate browsers, the selection state made in one is shared across the windows. Whilst this is cool, I thought the default position was that fresh client instances got fresh documents. (In case it's relevant, my bokeh serve output is showing DEBUG:bokeh.server.tornado:[pid 5912] 3 clients connected, which is correct.)

Edit - I think I know the answer to the first question:

  1. we no longer need box_select_tool.select_every_mousemove = False
  2. we don't need to change the python default for lasso select as it matches the js already and True is a sensible default for the lasso select

@birdsarah
Member

ping @havocp and @fpliger about the above as I believe you were working on this example.

@havocp
Contributor

havocp commented Nov 4, 2015

The two browser tabs are synced because the session id is in the url giving them the same session. It has to be that way because the example works by having a Python client also talk to this same session. Remove the session ID and you should get a blank page (because no Python client to fill it in).

This whole setup is for local messing around, it isn't a production-multiuser-server-compatible setup. For a real app you would want to run the selection_histogram.py code in the server rather than as a client, which is the "bokeh serve foo.py" way. Ultimately we should port most examples to that way and only leave a couple maybe that show you can have a Python client talking to a fixed session id. But for step 1 I guess we're trying to keep the examples more similar to how they worked originally.

@havocp
Contributor

havocp commented Nov 4, 2015

The main reason we have this single-session behavior, if I remember right, would be for use with a local notebook, or for something like a dashboard display or demo that runs on a single computer.

@havocp havocp changed the title Extreme WIP for new tornado server discussion Server rewrite based on Tornado Nov 4, 2015
havocp added a commit that referenced this pull request Nov 4, 2015
Server rewrite based on Tornado
@havocp havocp merged commit 1214c2d into master Nov 4, 2015
@havocp havocp deleted the tornado branch November 4, 2015 17:13
@damianavila damianavila added this to the 0.11.0 milestone Nov 5, 2015