Multicast for the scaled network #134
Replies: 19 comments 9 replies
-
|
A fun thing to think about: at 1 billion TPS, every byte in the network header costs 1 gigabyte per second of data. I have tried to reduce space as much as possible, though more optimization may be required. CRC32c is essentially free with hardware acceleration, so the source ID could be reduced from 16 bytes to 4: the number of actual senders on the multicast network should be limited, and senders are also segregated via the temporal sequence ID. |
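To make the arithmetic above concrete, here is a minimal sketch of the header-cost calculation. It assumes one frame per transaction and decimal gigabytes (10**9 bytes); the function name is hypothetical and not part of any BRC.

```python
def header_cost_gb_per_s(header_bytes: int, tps: int = 1_000_000_000) -> float:
    """Bandwidth consumed by header bytes alone, in GB/s (1 GB = 10**9 bytes).

    Assumes exactly one frame per transaction, which is an illustrative
    simplification rather than something the spec guarantees.
    """
    return header_bytes * tps / 1e9


# Shrinking the source ID from 16 bytes to 4 bytes at 1 billion TPS.
saved = header_cost_gb_per_s(16) - header_cost_gb_per_s(4)
print(f"Saved by the smaller source ID: {saved:.0f} GB/s")
```

Under these assumptions, the 12 bytes trimmed from the source ID are worth 12 GB/s network-wide.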
-
|
I'm still working on improvements to the wire frame format. In testing, I learned of shortcomings in the design of the monotonic sequence numbers. I am shifting to a hash chain to solve this and will have further updates to the pull request soon. |
-
|
Background on the retransmission work: multicast flows are UDP-based, without reliable delivery guarantees. If a receiver host misses a packet, it needs an efficient way to determine that data is missing and to request retransmission. BRC-124 achieves this with a sender-attributed, shard-group-bounded hash chain sequencing function applied to every packet. The tooling in the Retry Endpoint [1] repo implements a caching retry endpoint to facilitate both multicast and unicast transmission. It includes a beacon discovery mechanism for consumers, NACK ACK/MISS signaling mechanisms, and tier- and preference-based hierarchical escalation configuration. Shard Listener [2] demonstrates gap detection and requests. Shard Proxy [3] demonstrates sequence stamping as well as shard group and sender ID attribution. [1] Retry Endpoint |
-
|
Added PR BRC-126 for NACK-based retransmission protocol. |
-
|
The gap sequence retransmission is a bit tricky, as listeners must track chain sequences across group indexes/shards, and across senders (both original sources and/or proxies). Identifying a chain is important for retry endpoint rate limiting. I'm doing some work here now. |
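A rough sketch of the bookkeeping described here: one chain tip is tracked per (sender, shard group) pair, and a gap is flagged when an incoming frame's previous-link value does not match the stored tip. Everything below is an assumption for illustration only: the `Frame` fields, the 8-byte BLAKE2b link function, and the all-zero genesis value are hypothetical and are not the BRC-124 wire layout.

```python
import hashlib
from dataclasses import dataclass

GENESIS = bytes(8)  # hypothetical starting link for a new chain


def link(prev: bytes, payload: bytes) -> bytes:
    """Hypothetical link function: 8-byte BLAKE2b over previous link + payload."""
    return hashlib.blake2b(prev + payload, digest_size=8).digest()


@dataclass(frozen=True)
class Frame:
    sender_id: int    # original source or proxy (assumed field)
    shard_group: int  # group index / shard (assumed field)
    prev_link: bytes  # link value of the previous frame in this chain
    this_link: bytes  # link value of this frame
    payload: bytes


def make_chain(sender_id: int, shard_group: int, payloads: list) -> list:
    """Stamp a list of payloads into a chain, as a sender or proxy might."""
    frames, prev = [], GENESIS
    for payload in payloads:
        cur = link(prev, payload)
        frames.append(Frame(sender_id, shard_group, prev, cur, payload))
        prev = cur
    return frames


class ChainTracker:
    """Listener-side state: one chain tip per (sender, shard group)."""

    def __init__(self) -> None:
        self._tips = {}

    def observe(self, frame: Frame) -> bool:
        """Record the frame and return True if a gap precedes it."""
        key = (frame.sender_id, frame.shard_group)
        tip = self._tips.get(key)
        self._tips[key] = frame.this_link
        # The first frame seen on a chain cannot be judged either way.
        return tip is not None and tip != frame.prev_link
```

Keying the state by (sender, shard group) is also what gives a retry endpoint something to rate-limit on. Note that this naive version reports reordered frames as gaps too; a real listener would need a small reorder window before sending a NACK.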
-
|
I'll probably be making changes to the sharding procedure. 24 bits is probably too long for the current group addressing. I can't see network operators willingly choosing to break up the full transaction stream into more than say, 1024 total groups. Even that would probably be an administrative nightmare with a hard requirement for full configuration automation/debug. The shard portion of the multicast group address should take up no more than 16 bits, maybe more if some sort of hash derivation is used. This leaves us 48 to use for possible subtree group addressing, if it should ever be needed, which would be stacked with the shard addressing, so the same address covers both. Think downstream networks connected by listeners that bridge the domain to the core sharded flows, but filtered by subtrees and re-transmitted multicast. The multicast group can then be determined by subtree group, and then shard group. In this way, the multicast group limits of switching gear are transcended through specialty subscription. I'm working on a subtree group ID announcement protocol that links subtree IDs to subtree groups. The goal is to be able to encompass all subtrees associated with an arbitrary specialization or categorization of transaction flows over time. New subtrees are announced continuously, linking them to a particular group which downstream interested shard-listeners pick up. They can then filter packets for just the flows they are interested in. The flows may still be optionally sharded for load balancing. Expect a few things to change. Work in progress. |
-
|
BRC-127: Subtree group announcement protocol: #140 A method for transaction specialization filtering for network segments. Not every network needs the full transaction stream. Filter by groups of subtrees. |
-
|
BRC-128: Multicast Extended Transaction Frame Format: #141 Tacks on EF payload to BRC-124 frames. |
-
|
BRC-129: IPv6 Multicast Address Assignments: #143 Describes how to carve up the FF0X::B allocation assigned to BSV Association for Bitcoin SV Node Groups. See https://www.iana.org/assignments/ipv6-multicast-addresses/ipv6-multicast-addresses.xhtml for more detail. The last 16 bits are available for use in the assignment. This scheme takes a pragmatic approach by limiting the transaction shard groups to no more than 4096 shards. It leaves approximately 56K addresses in the middle for future expansion, and reserves the top 2048 suffixes (ending at :FFFF) for important network control groups. Regarding the shard group counts, this would mean a maximum of 4096 group subscriptions, over 4096 or fewer network links PER miner, to take on a full transaction feed! Imagine the administration on that! At this point, I cannot imagine that the administrative burden would be worth that level of load balancing, especially with 12 Terabit optical interfaces being in development by the network industry at this time (2026). |
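As an illustration of the carve-up, the sketch below classifies the 16-bit suffix of a group address into shard, reserved, or control space. It assumes shard groups occupy the lowest 4096 suffixes and control groups the highest 2048; the exact boundaries are whatever BRC-129 specifies and may differ. The function and constant names are hypothetical.

```python
import ipaddress

SUFFIX_SPACE = 1 << 16                        # last 16 bits of the group address
SHARD_LIMIT = 4096                            # assumed: shards use suffixes 0..4095
CONTROL_COUNT = 2048                          # top suffixes held for control groups
CONTROL_START = SUFFIX_SPACE - CONTROL_COUNT  # 0xF800 under this assumption


def suffix_of(group_addr: str) -> int:
    """Extract the low 16 bits of an IPv6 group address."""
    return int(ipaddress.IPv6Address(group_addr)) & 0xFFFF


def classify_suffix(suffix: int) -> str:
    """Return 'shard', 'reserved', or 'control' for a 16-bit suffix."""
    if not 0 <= suffix < SUFFIX_SPACE:
        raise ValueError("suffix must fit in 16 bits")
    if suffix < SHARD_LIMIT:
        return "shard"
    if suffix >= CONTROL_START:
        return "control"
    return "reserved"
```

A listener could run `classify_suffix(suffix_of(addr))` on a joined group to decide whether the traffic is transaction data or network control.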
-
|
I'll be changing BRC-124 yet again, back to a hybrid of the original submission (which was merged without proper review) and the new format. The key fields are a hash key composed of a 16-bit XXH64(Source IP || Shard bits || Subtree ID || Sequence number), followed by the 16-bit sequence number repeated as an integer. This will allow numerical gaps to be detected more simply than by following a chain, and opens up a path for multi-frame retransmission sequences. It will work better at very high throughput levels than trying to walk an arbitrary-length missing hash chain (even from both ends). The frame format is 92 bytes with this arrangement, and I really don't want to make it any longer, as that is already 92 GB/s at 1B TPS. I need to make sure XXH64 is a sound choice even at high throughput with a lot of retransmission keys being added continuously (1 per multicast frame). More changes coming. Development is active. |
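Here is a small sketch of why an integer sequence number makes gap detection simple: the missing range falls out of modular subtraction, including across the 16-bit wraparound. The function name and the half-window rule for treating a frame as late or reordered are my assumptions for illustration, not part of BRC-124.

```python
SEQ_BITS = 16
SEQ_MOD = 1 << SEQ_BITS  # sequence numbers wrap at 65536


def missing_between(last_seen: int, received: int) -> list[int]:
    """Sequence numbers skipped between last_seen and received (mod 2**16).

    A forward distance of more than half the sequence space is treated as a
    late or reordered frame rather than a gap (an assumed heuristic).
    """
    distance = (received - last_seen) % SEQ_MOD
    if distance == 0 or distance > SEQ_MOD // 2:
        return []  # duplicate, or an old frame arriving late
    return [(last_seen + i) % SEQ_MOD for i in range(1, distance)]


print(missing_between(10, 14))    # frames 11, 12, 13 were missed
print(missing_between(65534, 1))  # the gap spans the wraparound
```

Because the result is a contiguous run, a listener could ask for it as a single first/last range, which is one way a multi-frame retransmission request might be expressed.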
-
|
A comment on the existing multicast work in the BRC repo: Ty and Project Babbage did great work here, and I've read through BRC 80, 82, and 83 several times. These were written before the architecture was elucidated in the blog post I reference in the design document. That blog post described most of the details I needed to get started on an actual implementation that met the reference. There are a lot of good ideas in BRC 80, 82, and 83. The more I contemplated meeting the goals of the reference material, the more I realized that some of the concerns about MLDv2 were probably unfounded, because source-specific multicast is problematic in an environment where there can be many injection points or senders. The rudimentary group announcement frame types I've implemented follow BRC 80 and could probably be improved a lot. The existing BRCs were written before the Teranode source code and concrete implementation details around subtrees were publicly known. I would like to reconcile the work I'm doing here with the architectural features expressed previously in 80, 82, and 83, because I feel that brain work is valuable. I consider the architecture expressed in the reference material a good starting point, because it's close enough to get an actual implementation going. I look forward to collaborating to bring a real multicast network to deployment for the good of all network participants. |
-
|
More complication. I started to plan what an actual deployment would look like, starting at the very bottom with ip6gre tunnels making up the fabric links. The internet today is built on a 1500-byte MTU. To build without direct links, we need to handle fragmentation of packets. IPv4 does a lot of this for you, but we're building on IPv6, which does not. We need to handle fragmentation of UDP packets in the application. This means an encoding, serialization, and fragmentation scheme, along with error correction and retransmission guarantees for all fragments. BRC-124 and BRC-128 are fine for payloads of up to 1324 bytes on a basic test network using GRE6 tunnels: there are 84 bytes of overhead for the packets plus tunnel, plus the 92-byte header. The choice is either to cap the size of accepted transactions at the ingress point or to engineer around it. My instinct says to engineer for this, because even with a 9200-byte MTU over an end-to-end fabric supporting jumbo frames, we would still have to fragment to serve a 10 MB transaction, which is the current upper limit in BSV as I understand it. |
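The payload budget above works out as follows. This is a back-of-the-envelope sketch that assumes every fragment carries the full 92-byte header, that the 84 bytes of packet and tunnel overhead are the same at jumbo MTU, and that 10 MB means 10,000,000 bytes. The function names are hypothetical.

```python
PACKET_TUNNEL_OVERHEAD = 84  # bytes, packets + GRE6 tunnel (figure from the post)
FRAME_HEADER = 92            # bytes, BRC-124 frame header (figure from the post)


def payload_budget(mtu: int = 1500) -> int:
    """Bytes left for transaction payload in one unfragmented packet."""
    return mtu - PACKET_TUNNEL_OVERHEAD - FRAME_HEADER


def fragments_needed(tx_bytes: int, mtu: int = 1500) -> int:
    """Fragments required for a transaction, assuming a full header on each."""
    budget = payload_budget(mtu)
    return -(-tx_bytes // budget)  # integer ceiling division


for mtu in (1500, 9200):
    print(mtu, payload_budget(mtu), fragments_needed(10_000_000, mtu))
```

Under these assumptions a 10 MB transaction needs 7553 fragments at a 1500-byte MTU and still 1109 at 9200 bytes, so jumbo frames reduce the fragment count but do not remove the need for a fragmentation scheme.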
-
|
All test scenarios are passing now too. Did some work on a few issues and also expanded test coverage. Ran: 28 PASSED All scenarios passed. |
-
|
I've started to lab up BGP Equal Cost Multi Path (ECMP) with BGP AnyCast advertisement over both IPv4 and IPv6 to demonstrate the horizontal scalability and distribution potential of the ingress proxy. I'm using a combination of FRR and BIRD2 (on separate routers) to set up the adjacency configurations: BIRD2 on the proxies themselves, and FRR for their upstream router and the external ASN router. Also, the BRCs were merged:
|
-
|
BGP AnyCast ingress works great. Load balancing is easy with the stateless design. It could also be done with a hardware or software L4-L7 load balancer such as HAProxy or F5. The two methods can be combined for incredible scalability at the ingestion layer. I'm now adding more end-to-end testing to cover as many scenarios as possible. To this end, I've added a Go test framework with full Docker containerization support, including multicast bridging. It won't work in GitHub Actions using hosted runners, so I also implemented Dagger to pull the CI process out into the codebase. This will allow automated e2e test development to continue while waiting for a self-hosted runner solution to become available. When one is, GitHub executes the same CI flow naturally. I also shipped Helm charts for all the Go repos and am progressing towards full Kubernetes deployment capability. Reviews and feedback are welcome. |
-
|
Ran into some problems with GitHub while trying to migrate repositories to a new organization (same name as the old one). Hopefully I don't have to rename all the repositories because of it. I have a support ticket open. Nine of them are in limbo currently. :-( |
-
|
I'm going to evaluate the solution for moving to Source Specific Multicast, or at least baking in support for it at the component level. I don't think it's manageable in an environment where miners can come and go and sender/receiver IP addresses can change. It requires each subscriber to subscribe to each shard group PLUS sender source. Multiply N sources by X shards and it only gets more complicated the bigger the network gets. The group manifest advertisement service I'm building could be used to share manifest lists and coordinate this, but given the potentially disconnected nature of senders and listeners (who aren't necessarily also senders), this can get complicated quickly. Coordination must be handled at the application level. Unfortunately, RFC 8815 essentially deprecates Any Source Multicast (ASM) over inter-domain links, so I don't think there will be any hope of ISPs carrying the multicast group advertisements even if they were able to get over reluctance to run MP-BGP + PIM6 in the first place. The way forward for the foreseeable future is privately peered specialty ASNs, focused on multicast delivery, using inter-domain MP-BGP with ASM group advertisements as standard if the network needs to grow beyond one private operator. |
-
|
Current ingress proxy performance testing on a single system fully saturates a 10 Gbit NIC. I am using an old 6-core Intel PC from about 2017 and I can get 400,000+ packets per second. The throughput and PPS vary with payload size. Still, the results are promising and I'm planning an upgrade to a 25 Gbit NIC for further testing. I test using both software loopback (a dummy interface) and a hard loopback cable connecting the two ports of the dual-port NIC back-to-back. There is a cross-over point where packet size determines the throughput ceiling in the two configurations. Smaller packet sizes currently favor the dummy interface, and jumbo MTU sizes benefit the NIC scenario, of course. |
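For a rough sense of how packet size and PPS trade off on a 10 Gbit link, here is a small sketch. It assumes standard Ethernet framing, where each frame costs an extra 20 bytes on the wire (8 bytes of preamble plus a 12-byte inter-frame gap), and it ignores everything host-side. The function names are hypothetical and the numbers are theoretical ceilings, not measurements from my test rig.

```python
LINK_BPS = 10_000_000_000  # 10 Gbit/s
WIRE_EXTRA = 20            # preamble (8) + inter-frame gap (12), in bytes


def line_rate_pps(frame_bytes: int, link_bps: int = LINK_BPS) -> int:
    """Maximum whole frames per second for a given Ethernet frame size."""
    return link_bps // ((frame_bytes + WIRE_EXTRA) * 8)


def wire_bytes_to_saturate(pps: int, link_bps: int = LINK_BPS) -> float:
    """Average on-wire bytes per packet needed to fill the link at a given PPS."""
    return link_bps / (8 * pps)


print(line_rate_pps(64))                # small-frame ceiling
print(line_rate_pps(1518))              # standard full-size frame ceiling
print(wire_bytes_to_saturate(400_000))  # packet size that fills 10 Gbit at 400K PPS
```

Read this way, filling the link at 400K PPS takes roughly 3 KB per packet on the wire, which fits with jumbo MTU sizes favoring the NIC scenario, while small packets are limited by per-packet cost rather than link bandwidth.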
-
|
Concrete implementation deployment testing in progress: https://1bsv.net Beta participants welcome. Get in touch. |
-
I've begun building the multicast-in-multicast reference network topology.
More info: https://singulargrit.substack.com/p/multicast-within-multicast-anycast
To start, I've proposed a new raw wire frame format for transactions that dovetails onto BRC-12 (and probably the other extended formats also). BRC-124 was assigned after I had already started on BRC-122, hence the conflicting branch name.
NOTE: An early version of BRC-124 was committed to the BRC repo without peer review during some organizational cleanup. I've since updated the PR to incorporate hash-chain sequencing instead of numerical counters.
PR: #133
Here are some design details: https://github.com/lightwebinc/bitcoin-multicast/blob/main/DESIGN.md
Service implementations are also available. The proxy was previously tested to 400K PPS before my old development environment started dropping packets. I have tested listener functionality with the proxy and am currently testing retransmission functionality. Retransmissions require proper sequencing and source attribution in the transaction network header, which is why this frame format was submitted as a new BRC.