TextureCache: Deferred/batched EFB copies #7539
This PR started as an extension of the locking concept, which doesn't perform very well at the moment due to a few reasons. For the explanation below, I am talking about a configuration where EFB2RAM is used, or "EFB Copies to Texture Only" is disabled.
There will be no changes to performance in EFBToTex mode, so please don't spam up the thread complaining about things still being slow on your three-year-old phone with EFB2RAM force disabled. Initial reports would suggest that performance is better, even on Android, with EFB Copies to Texture Only disabled.
Currently, when a game issues an EFB copy, we encode the EFB to a temporary texture, "idle" the GPU, copy the encoded texture data from the GPU (may be in RAM or VRAM, depending on the driver) to the emulated console's RAM, and continue on our merry way. The only problem is that the "idle" step takes ages. GPUs like being given large batches of work, and crunching through all of it without the CPU standing there, all intimidating-like, waiting for them to finish.
So, you might think, "hey, that isn't a big deal, there's only a couple of them every frame." Well, for one, it can cause GPUs to think they're not getting enough work, and stay at a lower clock rate/lower power state. It also all adds up. Imagine if you had to do this 20 or 30 times per frame, going back and forth with the host GPU. All that time, emulation is frozen, and can't progress. So FPS drops to the floor, and that's why we ship with EFB Copies to Texture Only as the default. The console GPU has no problem crunching through all these copies, but it's a pretty big deal for us.
This feature abuses the fact that the CPU and GPU in the console run asynchronously of one another, and there are several well-defined methods of synchronizing the two. It's also similar to the reason that the dual core hack works so well. If the game (specifically the CPU) wants to read some of the texels in an EFB copy, it should wait for the GPU to finish writing to the memory where they're stored, right?
The DrawDone command stalls the CPU until the GPU executes the command. PE tokens can interrupt the CPU when the GPU encounters them. Command processor breakpoints can be used to interrupt the CPU when the GPU begins to process a certain distance into the FIFO. Note that the GX is pipelined, and the command processor is at the beginning of the pipeline, so a breakpoint does not guarantee that the pixel engine has finished writing to memory yet. I'd therefore expect this form of synchronization to be used less frequently for CPU access to copies.
So, if you're following everything so far, you'd probably think "wait, why do we have to write the copies out immediately, if the game is going to tell us when it wants to read the copy anyway?". Yep, that's what we're doing here. Instead of immediately "flushing" the EFB copy to emulated RAM, we just queue them all up, just how the GPU likes it, until there's a DrawDone or token, then write them out to memory in the order they came in.
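To make the queueing idea concrete, here is a minimal Python sketch. It is not the code in this PR: `PendingCopy`, `DeferredCopyQueue`, `on_efb_copy` and `on_sync_point` are hypothetical names, emulated RAM is modelled as a plain `bytearray`, and the GPU wait is reduced to a counter.

```python
from dataclasses import dataclass, field


@dataclass
class PendingCopy:
    """One encoded EFB copy waiting to be written to emulated RAM (hypothetical type)."""
    address: int
    data: bytes


@dataclass
class DeferredCopyQueue:
    """Queues EFB copies and writes them out only at a sync point (illustrative only)."""
    ram: bytearray
    pending: list = field(default_factory=list)
    gpu_idles: int = 0  # how many times we had to wait for the host GPU

    def on_efb_copy(self, address: int, data: bytes) -> None:
        # Instead of idling the GPU right away, just remember the copy.
        self.pending.append(PendingCopy(address, data))

    def on_sync_point(self) -> None:
        # Called for a DrawDone or a PE token: the game may read the copies now.
        if not self.pending:
            return
        self.gpu_idles += 1  # a single wait covers every queued copy
        for copy in self.pending:  # same order the copies were issued in
            self.ram[copy.address:copy.address + len(copy.data)] = copy.data
        self.pending.clear()
```

With this shape, a frame that issues many copies before its first DrawDone pays for one GPU wait instead of one per copy.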
But wait, there's a catch. Overlapping EFB copies. Currently, we invalidate the first EFB copy when a second one comes in at the same address, or overlaps the first. Here's a real-world example: Xenoblade Chronicles's sunset title screen. Copy from EFB->Texture, draw texture to EFB, copy EFB->Texture, draw, repeat. The second copy will invalidate the first, forcing a flush. It does about 6 copies to the same address, so instead of batching all 35 copies for the frame together, we're flushing every copy after the first! Not good!
What can we do about this? Well, we know it's going to the same address, and we flush EFB copies to RAM in the same order they come in, so the end result in RAM will be the same. But we can't use the old copy, since it's now outdated. So instead, let's remove it from the texture cache, so the high-resolution VRAM copy isn't used, but skip the flush. Boom, correct rendering, and batching! The 35 copies in the frame are all batched together, and we only have to idle the GPU once. Perfect.
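A rough Python sketch of that overlap rule, under the assumption that a copy can be described by a start address and a byte size. `Copy`, `TextureCacheSketch`, `add_copy` and `sync` are made-up names for illustration and do not mirror the real TextureCache interface.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    """A queued EFB copy (hypothetical type for this sketch)."""
    address: int
    size: int
    in_texture_cache: bool = True  # may the VRAM version still be sampled?


def overlaps(a: Copy, b: Copy) -> bool:
    # Two half-open byte ranges intersect.
    return a.address < b.address + b.size and b.address < a.address + a.size


class TextureCacheSketch:
    def __init__(self) -> None:
        self.pending = []
        self.flushes = 0  # number of times the GPU had to be idled

    def add_copy(self, new: Copy) -> None:
        for old in self.pending:
            if overlaps(old, new):
                # The old copy is outdated, so stop sampling its VRAM version,
                # but do NOT flush: writes still happen in issue order later.
                old.in_texture_cache = False
        self.pending.append(new)

    def sync(self) -> list:
        # DrawDone or token: write everything out with a single GPU idle.
        if self.pending:
            self.flushes += 1
        order = [c.address for c in self.pending]
        self.pending.clear()
        return order
```

Feeding six copies to the same address through `add_copy` leaves only the newest one usable from the texture cache and still costs a single flush at the next `sync()`.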
You might've thought of another optimization here. If these copies are going to the same address, what if they're also the same size? Copy B completely overlaps Copy A. In fact, this happens in Xenoblade. Well, why don't we just throw away Copy A entirely? Good idea. It's not needed anyway. Note that we can't skip the copy on the GPU, since we can't predict what the next copy is going to look like, as we're processing the command stream as it comes in. But we can skip the copy on the CPU.
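As a sketch of the CPU-side skip, assume each pending copy is an `(address, size, data)` tuple and that "completely overlaps" means the same address and the same size, as in the Xenoblade case. `queue_copy` and `flush` are hypothetical helpers, not functions from this PR.

```python
def queue_copy(pending, address, size, data):
    """Queue a copy, dropping older pending copies it fully replaces.

    Returns the new pending list and how many CPU-side writes were skipped.
    """
    kept = [c for c in pending if not (c[0] == address and c[1] == size)]
    skipped = len(pending) - len(kept)
    kept.append((address, size, data))
    return kept, skipped


def flush(pending, ram):
    """Write the surviving copies to emulated RAM in issue order."""
    for address, size, data in pending:
        ram[address:address + size] = data[:size]
    return ram


# Copy A, an unrelated copy, then Copy B over the same range as A.
pending = []
pending, _ = queue_copy(pending, 0x10, 4, b"AAAA")
pending, _ = queue_copy(pending, 0x20, 4, b"CCCC")
pending, skipped = queue_copy(pending, 0x10, 4, b"BBBB")
ram = flush(pending, bytearray(0x30))
```

Here Copy A never reaches `flush`, yet RAM ends up holding exactly what it would have held if A had been written and then overwritten by B.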
This whole idea works surprisingly well for most of the games we've tried. It looks like they're not too naughty and synchronize with the GPU when they want to read EFB copies. Which makes sense, because there are so many factors involved in how much time it takes the GPU to process commands that you can't really do cycle counting here.
I tested it on Android and the numbers aren't as exciting as they are on desktop. With The Legend of Zelda: The Wind Waker (EFB2RAM enabled) I went from 17.52 fps to 18.69 fps on OpenGL.
On Vulkan it goes from 14.70 fps to 18.90 fps. Both of these results were taken on Outset Island on an SD835 (OnePlus 5) phone.
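For reference, the relative gains in those Android numbers can be worked out with a few lines of Python. The fps values are the ones quoted above; `percent_gain` is just a helper name for this sketch.

```python
def percent_gain(before_fps: float, after_fps: float) -> float:
    """Relative frame-rate improvement, in percent."""
    return (after_fps / before_fps - 1.0) * 100.0


opengl_gain = percent_gain(17.52, 18.69)  # roughly 6.7 %
vulkan_gain = percent_gain(14.70, 18.90)  # roughly 28.6 %
```

So on this device the change is modest on OpenGL and considerably larger on Vulkan.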
3 times, most recently
Nov 5, 2018
Alright, here are some awesome results
Xenoblade Chronicles during sunset is a sort of worst case scenario. It's a 30 FPS game. Here's how it checks out.
EFB2RAM - 62 FPS - This isn't bad, right? Double FPS. But I'm on a beefy computer, so, that's still worrying.
But we don't force on EFB2RAM for Xenoblade, meaning this doesn't matter. Let's look at Wind Waker just in a regular ocean scene. Again, 30 FPS game, but remember EFB2RAM is forced ON.
EFB2RAM - 105 FPS - Over 3x speed on a Core i7-6700K is pretty good, but, weaker computers won't like the slowdown just to get the pictobox working.
Some other numbers
Silent Hill: Shattered Memories (60 FPS game) - EFB2RAM sampled for working snow. Taken from a demanding spot with lots of reflections near the game start.
Nov 7, 2018
Interesting to notice that now, in less demanding scenes, Copy to RAM seems faster than Copy to Texture (for example on my system, using D3D11, "The Last Story" at the beginning of gameplay is at 103 vs 100 fps, while moving to the hidden cave drops it to 58 vs 61)...