
An initial sketch of websocket protocol RFC #4

Draft · wants to merge 2 commits into base: master

Conversation

tailhook (Contributor)

No description provided.

@tailhook requested a review from a team — February 19, 2020 12:53
1. The URL to connect to is ``ws://<host>:<port>/ws/<database-name>``
2. We use a single WebSocket subprotocol name, ``edgedb-binary``.
3. Authentication is done at the HTTP level (see the Authentication_ section)
4. WebSocket framing is used for message delimitation. The first
Member

Looking at https://tools.ietf.org/html/rfc6455, I see no mention of user-defined message opcodes. What exactly does "WebSocket framing" mean? If we have to embed the message type in the payload body anyway, what does reframing buy us, and why can't we just embed the backend messages wholesale? That would make it possible to layer and reuse the protocol implementations instead of coupling them tightly.

@1st1 (Member), Feb 19, 2020

Here are my thoughts on why I think this is a bad idea:

Per the WebSocket protocol, an unmasked frame uses 2 to 10 bytes of header data. For data packets (the most common ones) the most likely header size is 4 bytes (payload longer than 125 bytes but shorter than 64 KiB). In our binary protocol a data message uses 5 bytes for the header (1 byte for the message kind, 4 bytes for the payload length). So purely from the standpoint of replacing our framing with WebSocket framing, we wouldn't save any data.
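The header-size comparison above can be sketched as follows (a minimal illustration of the RFC 6455 frame-header rules, not production code):

```python
def ws_header_size(payload_len: int, masked: bool = False) -> int:
    """Size in bytes of a WebSocket frame header (RFC 6455, section 5.2).

    The base header is 2 bytes; payloads of 126..65535 bytes add a 2-byte
    extended length field, and longer payloads add an 8-byte one.
    Client-to-server frames are masked and carry a 4-byte masking key.
    """
    size = 2
    if payload_len > 65535:
        size += 8
    elif payload_len > 125:
        size += 2
    if masked:
        size += 4
    return size

# EdgeDB binary protocol framing: 1-byte message type + 4-byte length.
EDGEDB_HEADER_SIZE = 5

# A typical unmasked server->client "data" frame of ~1 KiB:
assert ws_header_size(1024) == 4      # vs. 5 bytes for EdgeDB framing
assert ws_header_size(100) == 2
assert ws_header_size(100_000) == 10
```

This is why neither framing saves meaningful space over the other for common payload sizes.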

That said, if we use WebSocket as a tunnel we would be using both the WS framing and ours, which would lead to more overhead. Or would it? In a production setting, applications will use our Optimistic Execute protocol flow, where a client sends one message to the server and the server replies with a few "data" messages plus a "command complete" message. The server response can be packed into one WS frame. In this case the overhead of layering the EdgeDB protocol on top of WS is 4 bytes for sending the query and 4 bytes for receiving the results.

Performance-wise the picture is worse for WS framing. Using individual WS frames instead of EdgeDB frames would mean that unpacking N "data" messages leads to N runs of the WebSocket.onmessage JS callback. The callback needs logic to make sure that the protocol is in the right state before parsing the data. Overall, it's a bunch of additional work on top of the current while (message == 'D') parse_it() approach. Our experience with asyncpg showed that "data" message parsing is its most performance-critical part; even adding an extra "if" check can be visible in benchmarks. Using event callbacks instead of a loop will make data parsing at least an order of magnitude slower.
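The tight-loop approach being defended here can be sketched like this (a simplified framing where the 4-byte big-endian length counts only the payload; the real protocol's length semantics may differ, and the point is only the single loop over a buffer vs. one callback per frame):

```python
import struct

def parse_messages(buf: bytes):
    """Parse a buffer of concatenated type+length framed messages.

    Each message in this sketch is: 1-byte message type, 4-byte
    big-endian payload length, then the payload. All messages in the
    buffer are consumed in one tight loop -- no per-message callback.
    """
    messages = []
    pos = 0
    while pos + 5 <= len(buf):
        mtype = buf[pos:pos + 1]
        (length,) = struct.unpack_from(">I", buf, pos + 1)
        payload = buf[pos + 5:pos + 5 + length]
        messages.append((mtype, payload))
        pos += 5 + length
    return messages

# Three 'D' (data) messages followed by a 'C' (command complete):
buf = b"".join(b"D" + struct.pack(">I", 3) + b"abc" for _ in range(3))
buf += b"C" + struct.pack(">I", 0)
msgs = parse_messages(buf)
assert [t for t, _ in msgs] == [b"D", b"D", b"D", b"C"]
```

With per-frame WS delivery, each of those four messages would instead arrive as a separate onmessage event.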

Using WS framing instead of the native Postgres/EdgeDB framing essentially means that we're creating a new EdgeDB protocol that would co-exist with the current one. In the long term this would increase the complexity of the driver code; in the short term, we would need to re-architect our JS driver.

Switching our main binary protocol to WS framing also seems suboptimal -- not only would we have to break our protocol again, but the implementation would become slightly less efficient, because we'd need to repack "data" messages (right now we just buffer them and forward them to clients). Paul argues that using WS framing would lead to simpler driver implementations, as people wouldn't need to implement the buffering/framing logic themselves. However, given my experience of implementing two EdgeDB drivers, I can say that buffering is a relatively simple problem compared to data codecs and the overall protocol flow implementation.

Most importantly, WS is probably not the last protocol we'd like to use as a transport for EdgeDB. There will be more. Tunneling is an accepted practice in protocol design, and in this specific case its overhead is negligible.

Contributor Author

What exactly does "WebSocket framing" mean?

Yes. No custom opcodes. Only the message size is taken from the WebSocket codec.

Contributor Author

@yury here is my view on that:

  1. I don't care too much about 4 bytes of overhead.
  2. I don't think the callback overhead would be a performance issue. Like (1), it's technically overhead, but not a significant one. You still have the same state machine for subsequent packets.
  3. We still have a protocol that coexists with the current one, with both options.
  4. If changing the framing in the JS protocol is a problem, it smells like a bad design of either the JS implementation or the protocol itself. So yes, refactoring the JS implementation is probably a good idea.
  5. Yes, it's easy to implement framing, but it's also easy to get it wrong, and this is the kind of bug that is hard to debug and very bad (including security issues).

Squashing Data Messages

So the only really concerning performance issue here is squashing multiple Postgres data packets. But if I read this code correctly, it:

  1. Reads everything into the input buffer
  2. Checks every message header
  3. Copies all the messages into the output buffer anyway (i.e. we neither read from the input socket directly to the output socket, nor write from the input buffer directly to the output socket)

So if the copies are done anyway, it's easy to inject a header in between. Yes, it's technically the overhead of a single memcpy vs. multiple memcpys, but:

  1. For lots of small messages, parsing and checking messages (CPU branches) will dominate performance anyway.
  2. For large messages, exactly the same amount of copying is done (because it's unlikely that multiple large messages will be received in a single network read).

And yes, both variants of the protocol can be further enhanced. For example, it's not that hard to leverage the writev system call to interpose headers between messages without any copying (I wouldn't do that before we have a Rust implementation, though). And this kind of sophisticated implementation only needs to be done in the server.
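The writev idea can be illustrated with a small POSIX-only sketch: the framing header and the payloads live in separate buffers, and a single writev() call submits them without concatenating them first. The header/payload contents here are hypothetical, following the same type+length framing as above.

```python
import os

# Hypothetical header and payload buffers. writev() submits several
# non-contiguous buffers in one system call, so a framing header can be
# interposed in front of the payloads without copying them into a new
# buffer. (os.writev is POSIX-only; not available on Windows.)
header = b"D" + (7).to_bytes(4, "big")   # type 'D', payload length 7
payloads = [b"abc", b"defg"]

r, w = os.pipe()
written = os.writev(w, [header] + payloads)
os.close(w)
data = os.read(r, 1024)
os.close(r)

assert written == len(header) + sum(map(len, payloads))
assert data == header + b"abcdefg"
```

The same gather-write pattern is available in most languages (e.g. vectored I/O in Rust or iovec in C), which is why this optimization can stay server-internal.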

Contributor Author

So the only really concerning performance issue here is: squashing multiple postgres data packets.

Oh, by the way, we can't put a header at the start of the buffer before reading multiple Postgres messages, because we don't know the total length (and therefore the header) in advance. We can do some ugly hacks, but they are uglier than what I propose.

Member

So if the only thing we have is questionable, perceived, non-measured performance changes,

No, the main argument against reframing is that it removes the ability to layer protocols in a clean way (i.e. to separate layers as black boxes), and places an unnecessary maintenance burden on binding implementations.

The WebSocket protocol provides Ping/Pong messages which can only go between message frames, so if we start buffering and postgres lags too long

Once the data messages start flowing, there isn't usually a large lag between them. The biggest latency is in query planning and the time before the first row.

Member

The WebSocket protocol provides reliable message boundaries, so it makes a lot of sense to use them

Sure, and even if we had a 1:1 mapping between EdgeDB messages and WS messages, that is not a reason to change the format of the message (i.e. to elide the message length word).

@1st1 (Member), Feb 20, 2020

I don't think the callback overhead would be a performance issue. Like (1), it's technically overhead, but not a significant one. You still have the same state machine for subsequent packets.

@tailhook Don't hand-wave the performance issue (even though it's not the main concern). JS forEach is 20x slower than a regular for loop -- that's the overhead of calling a function. In simple cases JITs can inline the function, but only when the function is called in a loop. In this case it wouldn't be a simple function call in a loop; it would go through nodejs/browser event scheduling, so the overhead will likely skyrocket. But even if it's just a 20x slowdown, it's not acceptable, period.

Yes it's easy to implement framing, but it's also easy to get it wrong, and this is a kind of bug that are hard to debug and are super bad (including security issues).

There are many things in a protocol implementation that are hard to debug. Switching to WebSocket framing might make it easier for some people to implement the websocket/edgedb protocol specifically, but then they would also need to implement our regular binary protocol. So essentially we'd be telling people that instead of one protocol they need to implement two. If we tell them it's OK to implement only one, we'd essentially be fragmenting our own driver ecosystem.

If changing framing in JS protocol is a problem it smells like a bad design for either the JS implementation or the protocol itself. So yes, probably, refactoring JS implementation is a good idea.

Frankly speaking, this line of argument is becoming mildly annoying. The JS implementation is just fine, as is the Python implementation. Besides, this is not the point we should be discussing. It's an undeniable fact that supporting two different frame formats is extra work; let's leave it at that.


We have enough roadblocks for EdgeDB; a custom binary protocol is one of them. Making the protocol even more complex (or having two slightly different ones) adds yet another roadblock and muddies the waters.

I've provided a whole bunch of arguments for why fusing WebSocket framing with our protocol isn't a good idea; most of them have been hand-waved away. I suggest we close this argument.

Contributor Author

No, the main argument against reframing is that it removes the ability to layer protocols in a clean way (i.e. to separate layers as black boxes), and places an unnecessary maintenance burden on binding implementations.

Sure, and even if we have a 1:1 mapping between EdgeDB messages and WS messages, this is not the reason to change the format of the message (i.e. eliding the message length word).

I see it exactly the other way around: the burden is checking that the frames match and doing something if they don't.

And the other thing is also true for me: even if we can bundle multiple Postgres packets into a single WebSocket frame, I don't think there is a good reason to: the faster the first packet is delivered, the faster it will be parsed. The network between the app and EdgeDB can be slower than the one between EdgeDB and Postgres, and buffering on the EdgeDB side in such a case only increases memory usage and delays client-side parsing with no benefit.

Once the data messages start flowing there isn't usually a large lag between them. The biggest latency is in query planning and time before the first row.

Why? There is at least the network in between. And as far as I know Postgres can read data directly from disk for simple queries, and the disk can introduce arbitrary lag.

so the overhead will likely skyrocket. But even if it's just a 20x slowdown it's not acceptable, period.

That's clearly an overstatement. It can be 20x only for for (..) { x += 1 }; if it's for (..) { func(x) } it's already down to 2x. So the slowdown will be less than 20x, not higher. And there are a bunch of virtual calls for parsing every row (each codec, i.e. each individual cell), so the difference is likely imperceptible.

But both my opinion and yours are speculation, anyway.

Member

I see it exactly the other way around: the burden is checking that the frames match and doing something if they don't.

HTTP does not care about TCP headers and vice versa; this is no different. The entire point here is to separate the protocol layers cleanly.

the faster the first packet is delivered, the faster it will be parsed

There is always some amount of buffering involved to reduce network operation overhead. Postgres itself uses 16 KiB buffers. If your query returns a set of integers, you really don't want to wrap and flush each integer as a separate WebSocket message.

Why? There is at least the network in between. And as far as I know Postgres can read data directly from disk for simple queries, and the disk can introduce arbitrary lag.

I don't fully understand your point here. If your network and/or hardware is slow, then everything will be slow and you will have to set your expectations and timeouts accordingly.

Changes from the Binary Protocol
--------------------------------

1. The URL to connect to is ``ws://<host>:<port>/ws/<database-name>``
Member

cc @elprans

/ws/<database-name> is actually an interesting idea. We should probably require it for all our Ports.

This way you can configure an HTTP+EdgeQL port 8888 for database 'A' (use :8888/edgeql/A), an HTTP+GraphQL port for database 'B' (use :8888/graphql/B), and a WS port for 'A' (use :8888/ws/A/).

Currently our ports listen on /, which means that you need haproxy or nginx to expose several EdgeDB application ports on one network port.

Member

So are we making the database property on a Port a MULTI property?

Member

I thought we could have multiple application port objects (dropping the exclusive constraint on the network port), e.g.

INSERT Port { port := 8888, database := 'A', protocol := 'http+edgeql', ...};
INSERT Port { port := 8888, database := 'B', protocol := 'http+graphql', ...};
INSERT Port { port := 8888, database := 'A', protocol := 'ws+edgeql', ...};

Essentially, the new exclusive constraint would be on (port, database, protocol).

Member

Then calling them "ports" doesn't make a whole lot of sense anymore. In my opinion, we need to take a step back and maybe rethink the design of the whole concept.

Remember, the original motivation for these was to provide a simple interface for pre-allocated connection pools. In that sense, the only important things in the specification are the database name and the size of the pool. The protocol and other details are secondary and can actually reuse the same connection pool.

Member

In my opinion, we need to take a step back and maybe rethink the design of the whole concept.

Yes. The current design prevents one from easily exposing both EdgeQL and GraphQL on one HTTP port, which can be troublesome in some network setups. Ideally there should be a way to expose whatever HTTP-related things you want on one network port.

Perhaps we can still call them "Ports" and just redesign the schema:

type Port {
  property port -> int;
  link interfaces -> Interface;
}

type Interface {
  database
  concurrency
  protocol
}

(I don't like the Interface name, so it's just a placeholder).

I'm also wondering if we should change our binary protocol to send a \x00 byte when a socket connection is established; that would allow us to multiplex the edgedb-binary protocol along with WS/HTTP on a single network port (as Paul once suggested).

Contributor Author

I'm also wondering if we should change our binary protocol to send a \x00 byte when a socket connection is established; that would allow us to multiplex the edgedb-binary protocol along with WS/HTTP on a single network port (as Paul once suggested).

We can already distinguish, because our protocol always starts with V\x00. We only need to constrain the handshake to < 16 MiB to make that always true, which I think is sensible. Such connection negotiation is not okay to expose to the internet without TLS, though. But it's okay to use inside a private network.
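The sniffing this implies can be sketched as follows. The byte layout is assumed from the discussion above: the handshake message type is 'V', and if the handshake is constrained to < 16 MiB, the high byte of the 4-byte big-endian length is always 0x00 -- a prefix no HTTP request starts with, since HTTP methods are ASCII letters.

```python
def classify_connection(prefix: bytes) -> str:
    """Guess the protocol from the first bytes of a connection.

    b"V\x00" -> EdgeDB binary handshake (type 'V' + zero high length
    byte, given the < 16 MiB handshake constraint); anything else is
    treated as HTTP/WS. Sketch only; layout assumed from the thread.
    """
    if prefix.startswith(b"V\x00"):
        return "edgedb-binary"
    return "http"

assert classify_connection(b"V\x00\x00\x12\x34") == "edgedb-binary"
assert classify_connection(b"GET /ws/mydb HTTP/1.1\r\n") == "http"
```

A multiplexing listener would peek these two bytes and hand the socket to the matching protocol handler.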

Then calling them "ports" is not making a whole lot of sense anymore. In my opinion, we need to take a step back and maybe rethink the design of the whole concept.

Yes, we have already discussed that connection pools and ports should be separate features, allowing arbitrary mapping between ports and connection pools. But I think this is out of scope for this RFC.


Other than that, I haven't worked on configuring ports yet.

Member

Good point re V\x00.


This means using a string like ``edgedb-binary-0.8`` as the subprotocol name
for the handshake instead of negotiating the version in a separate
handshake message.
Member

I think we should use subprotocol name + version negotiation, but for the WS tunnel protocol. Essentially, "edgedb.1" means that we're using WS tunnel protocol version 1. The binary protocol version is negotiated inside the tunnel once it's established.

It's unlikely that we'll need to bump the tunnel protocol version, but having the ability to do so is crucial for long-term maintainability.
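The proposed negotiation can be sketched as a server-side subprotocol pick. Per RFC 6455, the client lists subprotocols in the Sec-WebSocket-Protocol handshake header and the server echoes back at most one. The "edgedb.1" name is taken from the comment above; the function and defaults are illustrative, not a settled part of the RFC under discussion.

```python
def select_subprotocol(offered, supported=("edgedb.1",)):
    """Pick the first client-offered subprotocol the server supports.

    A versioned tunnel name like "edgedb.1" carries the tunnel version;
    the binary protocol version would then be negotiated inside the
    tunnel after the WS handshake completes.
    """
    for name in offered:
        if name in supported:
            return name
    return None  # no agreement: the server should fail the WS handshake

assert select_subprotocol(["edgedb.1"]) == "edgedb.1"
assert select_subprotocol(["edgedb.2", "edgedb.1"]) == "edgedb.1"
assert select_subprotocol(["chat"]) is None
```

Bumping the tunnel version would then just mean adding "edgedb.2" to the supported tuple while keeping "edgedb.1" for old clients.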

Contributor Author

Since the subprotocol is just a string, we can always add a version later. So if it's unlikely that we'll ever bump the version, I see no reason to add it.

We should name it edgedb-tunnel, though, if we really go with tunneling.

Member

Since the subprotocol is just a string, we can always add a version later. So if it's unlikely that we'll ever bump the version, I see no reason to add it.

We may need to implement extra WS-specific auth as part of the tunneling protocol. @elprans, what do you think (IIRC it was your idea)?

We should name it edgedb-tunnel, though, if we really go with tunneling.

+1

3 participants