Description
React version: 19.3.0-canary-c0d218f0-20260324 (and a custom fork)
Steps To Reproduce
- the first set of benchmarks is in this repo: https://github.com/switz/rsc-benchmarks
- the second set is based on my fork implementing a fused RSC pipeline renderer (one pass; Fizz takes on more responsibility, preview here)
Some Background
I am working on a new RSC-based framework. There's been a lot of recent discussion around React-based frameworks' performance, mostly in benchmarks that aren't representative of real-world performance (aren't all web benchmarks?), but they did uncover some serious gaps. RSC performance is honestly sufficient for most use cases, but it could be much more efficient, and it gets throttled very quickly.
I spent a lot of time over the last two days digging into RSC rendering throughput with Claude, both in Next.js and outside of it (with my new framework, some pure React, etc.). I found two small perf wins in Fizz and Flight (I sent a PR for the Fizz one), but they are minimal in comparison to the two issues below. I spent most of the day debugging real-world scenarios: code running on the same k8s cluster across the same apps in different contexts (Next.js, my framework, etc.), all running through real-world networks. I then dropped down to baseline benchmarks to try to isolate the problems, and they reflected my real-world testing.
This is all based on single core, single threaded rendering. If I got anything wrong here, if I shoved my foot in my mouth, if I over-dramatized the situation, please tell me. I'm not an expert in web throughput engineering, cpu architectures, or javascript/react internals. I'm just a long-time software engineer who's having way too much fun building my own framework on what I consider to be the most complete web architecture.
The current behavior
These benchmarks are run in the simplest cases I could define, on my own M1 Max, on a single core at a time. To isolate the performance, I ran each test in a container so I could control CPU limits and get somewhat consistent results (is this bad? you tell me; I'm sure there's some issue with it). On the average web server, perf will be worse than it is here, which only exacerbates the issue. None of this is meant to be a perfect or clean-room benchmark, but I think you'll find fairly consistent results, as I did.
It's important to note that cpu-bound tasks just suffer in javascript environments. This isn't an I/O issue. You'll see below how the current RSC infrastructure compounds the CPU problem.
Node Streams vs Web Streams
The first issue is fairly well documented and sets the first performance ceiling. Next.js runs on Web streams (renderToReadableStream), which are implemented in JavaScript and are much slower than Node streams.
If you write a barebones RSC rendering test, you'll see that this is the first limit you hit.
| Metric | Node Streams | Web Streams | Difference |
|---|---|---|---|
| req/s | 1,004 | 743 | Node 35% faster |
| Median latency | 43ms | 58ms | Node 26% faster |
| P99 latency | 134ms | 139ms | ~same |
You get a 35% win in this particular case. Without it you'll eventually throttle, so for anyone deploying to Node environments, this should be the first priority.
But this is only the first win, and in most real-world scenarios it is negated, which brings us to the second major issue.
| Test | req/s | Median | Size | What it measures |
|---|---|---|---|---|
| [0] renderToString | 376 | 2.4ms | 116KB | Sync SSR baseline — no streams |
| [0a] Direct SSR (Node pipe) | 273 | 3.6ms | 116KB | Streaming SSR, no RSC |
| [0b] Direct SSR (Web stream) | 197 | 4.5ms | 116KB | Streaming SSR, Web streams |
| [1a] Flight serialize | 110 | 7.6ms | 235KB | RSC → Flight wire format |
| [1b] SSR from Flight (Node) | 100 | 6.8ms | 116KB | Pre-rendered Flight → HTML |
| [1c] SSR from Flight (Web) | 92 | 7.7ms | 116KB | Same, Web streams |
| [2a] Full RSC → Node | 44 | 22.7ms | 398KB | Flight + SSR + inject |
| [2b] Full RSC → Web | 36 | 25.6ms | 398KB | Same, Web streams |
| [3a] Full RSC → Node + gzip | 40 | 25.0ms | 22KB | Full pipeline + gzip |
| [3b] Full RSC → Web + gzip | 34 | 27.3ms | 22KB | Same, Web + gzip |
These results align with what we should expect:
- The full RSC pipeline is close to 10x worse than renderToString
- Node streams are 20-30% faster than Web streams, but in RSC that only buys a small amount of improved throughput
- The tee coupling (Flight + SSR) means the two streams fight each other on the event loop, which cuts performance in half right there
- Raw SSR (not RSC) gets decent performance at 273 req/s
Now, measuring req/s with one concurrent request isn't the best way to test real-world performance. But what you'll find if you dig in is that with more concurrency, CPU usage throttles even harder and memory usage balloons. So I think it's enough to show the drop-off.
So that brings us to...
Flight Serialization + SSR + Compression
When preparing an initial HTML response, React kicks off a flight serialization of the server component tree. This is because Fizz doesn't have knowledge of the server-client boundary.
After this serialization, the output gets tee'd into two streams for frameworks to consume:
- the flight serialization is converted to a fizz stream, which turns it back into a react tree, then the react tree is ssr'd/serialized to html
- the framework generates flight-based hydration <script> tags for injecting
On a single thread, these parallel streams back-pressure and compete for cpu. By serializing to flight, then back to fizz, then to html we end up throttling the single thread with a ton of unnecessary work.
So what's the solve? I mean, that's up to you guys. But on a single-thread, there's only one real pathway to improved performance: do less work.
So rather than the three-step intermediate serialization and deserialization, it would be better if there was a "fused" pipeline to handle rsc -> ssr in one pass. This would require some architectural changes to Fizz to identify client components and serialize the prop boundaries. My guess is that Fizz and Flight were given separate responsibilities because in theory you may want to run them in different places. But in practice, those of us shipping RSC servers run them together anyway.
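To make the idea concrete, here's a toy, self-contained sketch of a single-pass render. Every name here (`h`, `fusedRender`, the `CLIENT` marker) is invented for illustration and is not the prototype's actual code: server components are invoked inline during the one walk, while client boundaries emit a marker plus JSON-serialized props instead of recursing on the server.

```javascript
// Toy "fused" renderer: one tree walk, no intermediate Flight wire format.
// All names are invented for this sketch.
function h(type, props = {}, ...children) { return { type, props, children }; }

const CLIENT = Symbol('client'); // stand-in for a client-reference marker
function isClientReference(type) {
  return typeof type === 'object' && type !== null && type.$$typeof === CLIENT;
}

function fusedRender(node, out = []) {
  if (typeof node === 'string') { out.push(node); return out; }
  if (isClientReference(node.type)) {
    // Client boundary: serialize props once, emit a hydration marker, stop recursing.
    out.push(`<template data-client="${node.type.id}" data-props='${JSON.stringify(node.props)}'></template>`);
    return out;
  }
  if (typeof node.type === 'function') {
    // Server component: call it inline — same pass, no serialize/deserialize round-trip.
    return fusedRender(node.type(node.props), out);
  }
  out.push(`<${node.type}>`);
  for (const c of node.children) fusedRender(c, out);
  out.push(`</${node.type}>`);
  return out;
}

// Example tree: a server component containing one client boundary
const Counter = { $$typeof: CLIENT, id: 'Counter.js' };
const Page = () => h('main', {}, h('h1', {}, 'Products'), h(Counter, { start: 3 }));
console.log(fusedRender(h(Page)).join(''));
```

The real version would need escaping, Suspense, streaming, and React's actual module resolution, but it shows where the savings come from: the tree is walked once and props are serialized only at boundaries.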
I built a proof of concept with Claude (it's not wholly complete, perhaps the benchmarks are misleading, the props serialization is clearly incomplete, but it's a worthy exploration) and saw some real gains in performance, memory, and more consistent throughput under concurrent load. Especially for the pure server-component path (sans client components).
Per-Request Breakdown (226-product PLP)
| Phase | Full RSC Pipeline | Fused Renderer |
|---|---|---|
| Flight tree walk + encoding | 2.36ms | — |
| Props serialization (JSON) | 0.88ms | 0.88ms |
| Flight deserialize | 0.32ms | — |
| Fizz render | 3.53ms | 1.47ms ¹ |
| Hydration data to client | 0.65ms | 0.20ms |
| Fused boundary overhead | — | 1.73ms ² |
| TOTAL | 7.74ms | 4.28ms |
¹ Fused Fizz render matches plain renderToPipeableStream (1.42ms) — no Flight element overhead
² Markers, module resolution, props serialization, chunk output for hydration data
Our fused pipeline starts performing closer to raw SSR when there are no client components. But we do see a large drop-off once we serialize props into client components.
Throughput Comparison
| Mode | ms/req | req/s | Output | Description |
|---|---|---|---|---|
| Plain Fizz (no RSC) | 1.42ms | 702 | 102 KB | Theoretical ceiling |
| Fused (server-only) | 1.47ms | 680 | 102 KB | Matches plain Fizz |
| Fused (w/ client boundaries) | 4.28ms | 234 | 433 KB | 1.8x faster than full pipeline |
| Full RSC pipeline + hydration | 7.74ms | 129 | 411 KB | Current |
Where the 1.8x Comes From
| Eliminated | Saved |
|---|---|
| Flight tree walk + wire format encoding | 2.36ms |
| Flight wire format parsing | 0.32ms |
| Flight element reconstruction overhead in Fizz | 2.11ms |
| Flight payload inlining (JSON.stringify) | 0.45ms |
| Total eliminated | 5.24ms |
| Added | Cost |
|---|---|
| Props serialization at boundaries | 0.88ms |
| Hydration script emission | 0.20ms |
| Boundary markers + module resolution + queue management | 0.65ms |
| Total added | 1.73ms |
This works out to roughly 1.8x less CPU time than the current path. It's possible this is not a great benchmark, but it's only intended as a proof of concept.
Key Properties
| Property | Full Pipeline | Fused |
|---|---|---|
| Tree walks | 3 (Flight + Flight Client + Fizz) | 1 (Fizz only) |
| Serialization passes | 2 (Flight wire + inline payload) | 1 (props at boundaries) |
| Intermediate buffers | ~291 KB Flight wire format | None |
| Output to client | ~411 KB (HTML + hydration data) | ~433 KB (HTML + hydration data) |
| Peak heap (c=50) | 297 MB | 60 MB |
| Flight server modified | — | No (zero changes) |
| Reconciler modified | — | No (zero changes) |
Client components still suffer from expensive props serialization into hydration tags. But at the very least, memory usage goes way down and throughput becomes more consistent. Because the ergonomics of serializing data from server to client are so clean, it becomes very easy to heighten this issue without the end user understanding why.
I know there's been some past discussion of handling the duplication of hydration content. That duplication was somewhat undersold as no big deal because of compression, but it turns out that running the compression inline on the larger payload leads to even more CPU contention. You can offload compression to an external host (e.g. Cloudflare) or server, but then you're paying the transfer cost and relying on external processing power.
Why haven't these issues surfaced earlier?
Well, I don't know. At the end of the day, a full SSR rendering pipeline will almost always be more delayed by I/O (database, API requests, etc.) than by ~10-20ms of CPU time. The concurrency issue is fairly easily papered over with a few extra pods or cores. People get fewer req/s than they think.
On top of which, the most visible RSC deployments have been on serverless platforms, where each request often gets its own thread, so you wouldn't notice the concurrency issues unless you're looking at a traditional Node server or really digging in. The problem here isn't really the wall clock of the CPU time (imo), it's the bottlenecking and throughput; JS just plain suffers here. So this is all easy to miss, or perhaps worth dismissing as unnecessary optimization entirely.
But I think we'd all agree that higher throughput, better concurrency, lower memory, fewer bytes would be a net win if possible. And might bring RSC performance back to more traditional alternatives, while maintaining its architectural advantages.
After doing some research, I found an issue opened by @WIVSW from October that essentially identified much of this. They also saw a 10x drop in req/s when switching to the RSC pipeline.
The expected behavior
Ultimately, the desire is that RSCs render faster and more concurrently across a variety of scenarios (pure server rendering, many client boundaries/props serialization, and so on), with a reduction in memory usage and thrashing.
Hope this is useful. I spent some time trying to understand the internals of React, so if I got anything wrong, please point me in the right direction. Thanks for reading.