Use Lexicon and DAG-CBOR for subscribeRepos Frames #1137

DavidBuchanan314 · 2023-05-31T02:02:29Z

DavidBuchanan314
May 31, 2023

It would appear that "firehose" data frames are formed by concatenating header and body CBOR objects, like so

atproto/packages/xrpc-server/src/stream/frames.ts

Lines 19 to 21 in 9d33cbb

    
           toBytes(): Uint8Array { 
        
             return uint8arrays.concat([cborEncode(this.header), cborEncode(this.body)]) 
        
           }

This is entirely a matter of my personal taste, but I would much prefer it if frames were valid DAG-CBOR objects. Yes, the strictness properties of DAG-CBOR are not needed here, but it's also a much simpler format to work with than CBOR, with fewer parsing edge-cases.

For my atproto implementation, I'd like to be able to get away with only implementing DAG-CBOR, and not the rest of CBOR.

What if each Frame was a single DAG-CBOR encoded object, as specified by a lexicon document? (e.g. with a header and body field)

This seems like it would be a good way to reduce implementation complexity, enabling code reuse (I already have code for parsing, validating, and serializing lexicon objects - as do you!), and allowing the Frame format to be upgraded and extended in future just like any other Lexicon object.

devinivy · 2023-05-31T14:21:56Z

devinivy
May 31, 2023
Maintainer

If I'm not mistaken everything should be encoded as dag-cbor here :) So there's a quick win!

Regarding there being two dag-cbor objects instead of one per message— one reason it works like this is so that the header precedes the body and can easily be parsed prior to the body. That allows implementers to use information contained in the header to interpret the body. An example of this is learning the lexicon type of object contained in the body prior to encountering it, so you can allocate memory for it ahead of time and do a single-pass unmarshalling. In dynamic languages like JS there's not a major win here, but we find it's fruitful in other languages like Go.

I appreciate what you're saying about the upsides of using lexicon here. One concern I'd have is that it can be fairly verbose about things like unions, and every frame header is a union so it would always contain a "$type": ... field. It was useful to drop the constraints of lexicon when developing a light wire protocol, which needs to be upgradeable (thanks dag-cbor!) but not anywhere near as extensible as lexicons need to be. Plus we specify these streaming endpoints in lexicon (example), so we'd kind of have lexicon sitting on top of lexicon!

A bit off topic, and maybe you already came across this, but there's a section in the cbor spec worth a peek about streaming cbor. It's peace of mind for me that this wire protocol would "just work" over TCP without the framing that websockets gives us and that the cbor spec sort of blesses the idea of streaming cbor.

0 replies

DavidBuchanan314 · 2023-05-31T20:28:13Z

DavidBuchanan314
May 31, 2023
Author

The annoyance that originally motivated my thoughts in this direction, is that DAG-CBOR forbids concatenation of objects

Encode and decode must operate on a single top-level CBOR object. Back-to-back concatenated objects are not allowed or supported, as suggested by section 5.1 of RFC 8949 for streaming applications. All bytes of an encoded DAG-CBOR object must decode to a single object. Extraneous bytes included in an IPLD block, whether valid or invalid CBOR, must not be accepted as valid DAG-CBOR.

I don't think the protocol is strictly violating this, because nobody is claiming that a Frame is an IPLD block, but it makes it tricky to parse because APIs are designed with this constraint in mind.

The current TS implementation is able to use cborDecodeMulti to decode the header and body in one go

atproto/packages/xrpc-server/src/stream/frames.ts

Line 29 in 9d33cbb

const decoded = cborDecodeMulti(bytes)

However, a DAG-CBOR library won't necessarily have an API like this, making the separation of header and body inelegant. (concrete example: https://dag-cbor.readthedocs.io/en/latest/decoding.html)

If using lexicons brings too much overhead, a more direct solution to this problem could be to prefix the header with a varint that specifies the length of the header (similar to in the CAR format)

0 replies

DavidBuchanan314 · 2023-05-31T23:52:11Z

DavidBuchanan314
May 31, 2023
Author

Related thought: the overhead of lexicons could be mitigated by adding stream-level compression (making seemingly redundant/repeated expressions ~free)

0 replies

sugyan · 2023-06-01T03:10:31Z

sugyan
Jun 1, 2023

I'm working on an atproto implementation in Rust and suffered from the same problem.

Rust's decoding library for dag-cbor expects the data to be a single dag-cbor object, so in order to correctly parse the frame, it must explicitly detect that the decoded result is an "error with trailing data". I would like to avoid such a tricky process.
https://github.com/sugyan/atrium/blob/461e7716c852ac49c745339c2495d4c2755ae0f7/atrium-xrpc-server/src/stream/frames.rs#L87-L89
(Would it be better if it were a single cbor object as ipld, which is a 2-element list of ipld in the header and ipld in the body...?)

Also, this may be a special situation when using serde in Rust, but it would be very helpful if the $type information is directly included in the body, since enum deserialization automatically determines the decoding destination type according to the contents of the specified tag, This eliminates the need to specify the type using switch statements, etc.

0 replies

DavidBuchanan314 · 2023-06-01T03:21:51Z

DavidBuchanan314
Jun 1, 2023
Author

I like the idea of using a 2-element list, probably preferable to my varint idea. (now I'm wondering why the CAR format isn't more like that, heh (edit: probably to make it easier to append to))

0 replies

bnewbold · 2023-06-01T22:21:40Z

bnewbold
Jun 1, 2023
Maintainer

As a bit of meta-context: the streaming wire protocol isn't totally set in stone, but we did iterate several times getting it to the state it is in now. It would obviously be best to make any final changes before opening federation, but it would already be a fairly disruptive thing to change today. So, some wiggle room on this, but not a ton.

To clarify the original motivation, we had concerns about deserialization efficiency and wanted to ensure that we could handle multiple GByte/sec processing. One issue raised was that optimized DAG-CBOR parsers against a known schema could run faster, with a single pass and fewer allocations, but only if they knew the schema ahead of time. Another was to leave room for parsing just the header in a first pass, and then routing the rest of the frame as raw bytes (unparsed) to other cores/channels/etc.

I've worked with serde a lot myself, and the enum deserialization feature is really nice. I don't think that pattern/idiom is particularly common in other languages, so what is a nice idiomatic solution for Rust/serde might not be idiomatic in general. I could be wrong about that, and the feature could become more popular over time. I'd also point out that, IIRC, it only works with a closed union (of known types), so it makes parsing a bit brittle. That does mostly match the current Lexicon closed-union semantics and "social contract", but it sure is nice having some wiggle room.

Personally, i'd be open to open to trying to find a way to make the entire WebSocket frame a single DAG-CBOR object, while also retaining the efficiency of the existing golang setup. Maybe if the wrapper/header fields and ordering were constrained correctly, it could be both a valid DAG-CBOR object, and feasible to do manual parsing in golang?

The other related itch to me is that the sequence number is in the payload and not as an optional field in the header.

0 replies

sugyan · 2023-06-02T00:50:01Z

sugyan
Jun 2, 2023

Another was to leave room for parsing just the header in a first pass, and then routing the rest of the frame as raw bytes (unparsed) to other cores/channels/etc.

I see, I had not considered such a case. In that case, it would certainly be better to have them concatenated as independent DAG-CBOR data...

Even so, all JSON data that corresponds to a definition defined as a union on Lexicon contains $type information, right? Even if it would be redundant, I would like to see the $type information included not only in the frame header but also in the frame body.

0 replies

DavidBuchanan314 · 2023-06-02T01:24:03Z

DavidBuchanan314
Jun 2, 2023
Author

Mostly off-topic: On the subject of high-throughput processing, I was thinking it would be nice to be able to subscribe to a deterministic fraction of the firehose. For example, if I wanted to split the ingest workload over two nodes in my system, I'd like to be able to divert "all events with sequence number LSB=0" to one of them, and "all events with sequence number LSB=1" to the other - or some other deterministic partitioning scheme.

0 replies

bnewbold · 2023-06-02T19:44:55Z

bnewbold
Jun 2, 2023
Maintainer

This is a good conversation, going to turn this issue into a "discussion".

0 replies

DavidBuchanan314 · 2023-06-15T12:01:19Z

DavidBuchanan314
Jun 15, 2023
Author

Just looking at the recent docs update, the protocol is defined to use DAG-CBOR, which is good to see and addresses the first half of my concerns (iiuc, this is not a change, it's just documented now)

bluesky-social/atproto-website@b4a75eb#diff-3a90140693364e9ae371ce8765af1d385c54479cd5cf2bb2bfd06f47c3e0a954R21

However, my quibble with concatenation of DAG-CBOR objects still stands. My own DAG-CBOR parsing library can currently handle this, because I designed it with the use-case in mind - but I would like to update my API in a way that makes parsing of concatenated objects impossible.

0 replies

DavidBuchanan314 · 2023-06-15T12:05:26Z

DavidBuchanan314
Jun 15, 2023
Author

Although, the current atproto implementation is still using a plain CBOR parser - which means it will accept messages which are valid CBOR but not valid DAG-CBOR, which seems like a bug.

https://github.com/bluesky-social/atproto/blob/318736816c65e86a554e09cb9dc2e095dff636c4/packages/common/src/ipld-multi.ts#LL23C3-L23C23

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use Lexicon and DAG-CBOR for subscribeRepos Frames #1137

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 11 comments

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Use Lexicon and DAG-CBOR for subscribeRepos Frames #1137

DavidBuchanan314 May 31, 2023

Replies: 11 comments

devinivy May 31, 2023 Maintainer

DavidBuchanan314 May 31, 2023 Author

DavidBuchanan314 May 31, 2023 Author

sugyan Jun 1, 2023

DavidBuchanan314 Jun 1, 2023 Author

bnewbold Jun 1, 2023 Maintainer

sugyan Jun 2, 2023

DavidBuchanan314 Jun 2, 2023 Author

bnewbold Jun 2, 2023 Maintainer

DavidBuchanan314 Jun 15, 2023 Author

DavidBuchanan314 Jun 15, 2023 Author

DavidBuchanan314
May 31, 2023

devinivy
May 31, 2023
Maintainer

DavidBuchanan314
May 31, 2023
Author

DavidBuchanan314
May 31, 2023
Author

sugyan
Jun 1, 2023

DavidBuchanan314
Jun 1, 2023
Author

bnewbold
Jun 1, 2023
Maintainer

sugyan
Jun 2, 2023

DavidBuchanan314
Jun 2, 2023
Author

bnewbold
Jun 2, 2023
Maintainer

DavidBuchanan314
Jun 15, 2023
Author

DavidBuchanan314
Jun 15, 2023
Author