Use Lexicon and DAG-CBOR for subscribeRepos Frames #1137
Replies: 11 comments
-
If I'm not mistaken everything should be encoded as dag-cbor here :) So there's a quick win! Regarding there being two dag-cbor objects instead of one per message— one reason it works like this is so that the header precedes the body and can easily be parsed prior to the body. That allows implementers to use information contained in the header to interpret the body. An example of this is learning the lexicon type of object contained in the body prior to encountering it, so you can allocate memory for it ahead of time and do a single-pass unmarshalling. In dynamic languages like JS there's not a major win here, but we find it's fruitful in other languages like Go. I appreciate what you're saying about the upsides of using lexicon here. One concern I'd have is that it can be fairly verbose about things like unions, and every frame header is a union so it would always contain a A bit off topic, and maybe you already came across this, but there's a section in the cbor spec worth a peek about streaming cbor. It's peace of mind for me that this wire protocol would "just work" over TCP without the framing that websockets gives us and that the cbor spec sort of blesses the idea of streaming cbor. |
Beta Was this translation helpful? Give feedback.
-
The annoyance that originally motivated my thoughts in this direction, is that DAG-CBOR forbids concatenation of objects
I don't think the protocol is strictly violating this, because nobody is claiming that a Frame is an IPLD block, but it makes it tricky to parse because APIs are designed with this constraint in mind. The current TS implementation is able to use cborDecodeMulti to decode the header and body in one go However, a DAG-CBOR library won't necessarily have an API like this, making the separation of header and body inelegant. (concrete example: https://dag-cbor.readthedocs.io/en/latest/decoding.html) If using lexicons brings too much overhead, a more direct solution to this problem could be to prefix the header with a varint that specifies the length of the header (similar to in the CAR format) |
Beta Was this translation helpful? Give feedback.
-
Related thought: the overhead of lexicons could be mitigated by adding stream-level compression (making seemingly redundant/repeated expressions ~free) |
Beta Was this translation helpful? Give feedback.
-
I'm working on an atproto implementation in Rust and suffered from the same problem. Rust's decoding library for dag-cbor expects the data to be a single dag-cbor object, so in order to correctly parse the frame, it must explicitly detect that the decoded result is an "error with trailing data". I would like to avoid such a tricky process. Also, this may be a special situation when using serde in Rust, but it would be very helpful if the |
Beta Was this translation helpful? Give feedback.
-
I like the idea of using a 2-element list, probably preferable to my varint idea. (now I'm wondering why the CAR format isn't more like that, heh (edit: probably to make it easier to append to)) |
Beta Was this translation helpful? Give feedback.
-
As a bit of meta-context: the streaming wire protocol isn't totally set in stone, but we did iterate several times getting it to the state it is in now. It would obviously be best to make any final changes before opening federation, but it would already be a fairly disruptive thing to change today. So, some wiggle room on this, but not a ton. To clarify the original motivation, we had concerns about deserialization efficiency and wanted to ensure that we could handle multiple GByte/sec processing. One issue raised was that optimized DAG-CBOR parsers against a known schema could run faster, with a single pass and fewer allocations, but only if they knew the schema ahead of time. Another was to leave room for parsing just the header in a first pass, and then routing the rest of the frame as raw bytes (unparsed) to other cores/channels/etc. I've worked with Personally, i'd be open to open to trying to find a way to make the entire WebSocket frame a single DAG-CBOR object, while also retaining the efficiency of the existing golang setup. Maybe if the wrapper/header fields and ordering were constrained correctly, it could be both a valid DAG-CBOR object, and feasible to do manual parsing in golang? The other related itch to me is that the sequence number is in the payload and not as an optional field in the header. |
Beta Was this translation helpful? Give feedback.
-
I see, I had not considered such a case. In that case, it would certainly be better to have them concatenated as independent DAG-CBOR data... Even so, all JSON data that corresponds to a definition defined as a union on Lexicon contains |
Beta Was this translation helpful? Give feedback.
-
Mostly off-topic: On the subject of high-throughput processing, I was thinking it would be nice to be able to subscribe to a deterministic fraction of the firehose. For example, if I wanted to split the ingest workload over two nodes in my system, I'd like to be able to divert "all events with sequence number LSB=0" to one of them, and "all events with sequence number LSB=1" to the other - or some other deterministic partitioning scheme. |
Beta Was this translation helpful? Give feedback.
-
This is a good conversation, going to turn this issue into a "discussion". |
Beta Was this translation helpful? Give feedback.
-
Just looking at the recent docs update, the protocol is defined to use DAG-CBOR, which is good to see and addresses the first half of my concerns (iiuc, this is not a change, it's just documented now) However, my quibble with concatenation of DAG-CBOR objects still stands. My own DAG-CBOR parsing library can currently handle this, because I designed it with the use-case in mind - but I would like to update my API in a way that makes parsing of concatenated objects impossible. |
Beta Was this translation helpful? Give feedback.
-
Although, the current atproto implementation is still using a plain CBOR parser - which means it will accept messages which are valid CBOR but not valid DAG-CBOR, which seems like a bug. |
Beta Was this translation helpful? Give feedback.
-
It would appear that "firehose" data frames are formed by concatenating header and body CBOR objects, like so
atproto/packages/xrpc-server/src/stream/frames.ts
Lines 19 to 21 in 9d33cbb
This is entirely a matter of my personal taste, but I would much prefer it if frames were valid DAG-CBOR objects. Yes, the strictness properties of DAG-CBOR are not needed here, but it's also a much simpler format to work with than CBOR, with fewer parsing edge-cases.
For my atproto implementation, I'd like to be able to get away with only implementing DAG-CBOR, and not the rest of CBOR.
What if each Frame was a single DAG-CBOR encoded object, as specified by a lexicon document? (e.g. with a header and body field)
This seems like it would be a good way to reduce implementation complexity, enabling code reuse (I already have code for parsing, validating, and serializing lexicon objects - as do you!), and allowing the Frame format to be upgraded and extended in future just like any other Lexicon object.
Beta Was this translation helpful? Give feedback.
All reactions