TextureCache: Deferred/batched EFB copies #7539

Merged
merged 4 commits into from Nov 7, 2018

Conversation

7 participants
@stenzek
Contributor

stenzek commented Nov 2, 2018

This PR started as an extension of the locking concept, which doesn't perform very well at the moment due to a few reasons. For the explanation below, I am talking about a configuration where EFB2RAM is used, or "EFB Copies to Texture Only" is disabled.

There will be no changes to performance in EFBToTex mode, so please don't spam up the thread complaining about things still being slow on your three-year-old phone with EFB2RAM force disabled. Initial reports would suggest that performance is better, even on Android, with EFB Copies to Texture Only disabled.

Currently, when a game issues an EFB copy, we encode the EFB to a temporary texture, "idle" the GPU, copy the encoded texture data from the GPU (may be in RAM or VRAM, depending on the driver) to the emulated console's RAM, and continue on our merry way. The only problem is that the "idle" step takes ages. GPUs like being given large batches of work, and crunching through all of it without the CPU standing there, all intimidating-like, waiting for them to finish.

So, you might think "hey, that isn't a big deal, there's only a couple of them every frame". Well, for one, it can cause GPUs to think they're not getting enough work, and stay at a lower clock rate/lower power state. It also all adds up. Imagine if you had to do this 20 or 30 times per frame, going back and forth with the host GPU. All that time, emulation is frozen, and can't progress. So FPS drops to the floor, and that's why we ship with EFB Copies to Texture Only as the default. The console GPU has no problem crunching through all these copies, but it's a pretty big deal for us.

This feature abuses the fact that the CPU and GPU in the console run asynchronously of one another, and there are several well-defined methods of synchronizing the two. It's also similar to the reason that the dual core hack works so well. If the game (specifically the CPU) wants to read some of the texels in an EFB copy, it should wait for the GPU to finish writing to the memory where they're stored, right?

The DrawDone command stalls the CPU until the GPU executes the command. PE tokens can interrupt the CPU when the GPU encounters them. Command processor breakpoints can interrupt the CPU when the GPU begins to process a certain distance into the FIFO. Note that the GX is pipelined, and the command processor is at the beginning of the pipeline, so a breakpoint does not guarantee that the pixel engine has finished writing to memory yet. I'd therefore expect this form of synchronization to be used less frequently for CPU access to copies.

So, if you're following everything so far, you'd probably think "wait, why do we have to write the copies out immediately, if the game is going to tell us when it wants to read the copy anyway?". Yep, that's what we're doing here. Instead of immediately "flushing" the EFB copy to emulated RAM, we just queue them all up, just how the GPU likes it, until there's a DrawDone or token, then write them out to memory in the order they came in.
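As a rough illustration of the queue-then-flush idea described above (not the code in this PR), here is a minimal C++ sketch. The names `PendingCopy`, `DeferredCopyQueue`, `QueueCopy`, `Flush` and `GpuIdleCount` are hypothetical, the "encoded" data is a plain byte vector standing in for a GPU staging texture, and the GPU idle is modelled as a simple counter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for an encoded EFB copy waiting to reach emulated RAM.
struct PendingCopy
{
  uint32_t dst_addr;             // destination offset in emulated RAM
  std::vector<uint8_t> encoded;  // stands in for data held in a GPU staging texture
};

class DeferredCopyQueue
{
public:
  explicit DeferredCopyQueue(std::vector<uint8_t>& ram) : m_ram(ram) {}

  // Called when the game issues an EFB copy: just queue it, no GPU idle.
  void QueueCopy(uint32_t dst_addr, std::vector<uint8_t> encoded)
  {
    assert(dst_addr + encoded.size() <= m_ram.size());
    m_pending.push_back({dst_addr, std::move(encoded)});
  }

  // Called at a sync point (DrawDone or a PE token): wait for the GPU once,
  // then write every queued copy to RAM in the order it was submitted.
  void Flush()
  {
    if (m_pending.empty())
      return;
    ++m_gpu_idle_count;  // models a single "idle the GPU" for the whole batch
    for (const PendingCopy& copy : m_pending)
      std::copy(copy.encoded.begin(), copy.encoded.end(), m_ram.begin() + copy.dst_addr);
    m_pending.clear();
  }

  int GpuIdleCount() const { return m_gpu_idle_count; }
  std::size_t PendingCount() const { return m_pending.size(); }

private:
  std::vector<uint8_t>& m_ram;
  std::vector<PendingCopy> m_pending;
  int m_gpu_idle_count = 0;
};
```

In this sketch, queuing any number of copies and then calling `Flush()` once costs one idle, whereas the immediate path would cost one idle per copy. Writing in submission order means a later copy to the same address wins, just as it would if each copy had been flushed immediately.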

But wait, there's a catch. Overlapping EFB copies. Currently, we invalidate the first EFB copy when a second one comes in at the same address, or overlaps the first. Here's a real-world example: Xenoblade Chronicles's sunset title screen. Copy from EFB->Texture, draw texture to EFB, copy EFB->Texture, draw, repeat. The second copy will invalidate the first, forcing a flush. It does about 6 copies to the same address, so instead of batching all 35 copies for the frame together, we're flushing every copy after the first! Not good!

What can we do about this? Well, we know it's going to the same address, and we flush EFB copies to RAM in the same order they come in, so the end result in RAM will be the same. But we can't use the old copy, since it's now outdated. So instead, let's remove it from the texture cache, so the high-resolution VRAM copy isn't used, but skip the flush. Boom, correct rendering, and batching! The 35 copies in the frame are all batched together, and we only have to idle the GPU once. Perfect.

You might've thought of another optimization here. If these copies are going to the same address, what if they're also the same size, so that Copy B completely overlaps Copy A? In fact, this happens in Xenoblade. Well, why don't we just throw away Copy A entirely? Good idea. It's not needed anyway. Note that we can't skip the copy on the GPU, since we can't predict what the next copy is going to look like, as we're processing the command stream as it comes in. But we can skip the copy on the CPU.
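The two overlap rules above can be sketched like this. It is a simplified model under stated assumptions, not the PR's implementation: `QueuedCopy`, `Overlaps`, `FullyCovers` and `AddCopy` are hypothetical names, copies are treated as flat byte ranges, and the two flags stand in for "still in the texture cache" and "still needs the CPU-side write to RAM".

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record for a queued (not yet flushed) EFB copy.
struct QueuedCopy
{
  uint32_t addr = 0;
  uint32_t size = 0;
  bool usable_as_texture = true;  // cleared once a newer copy overlaps this one
  bool needs_writeback = true;    // cleared once a newer copy fully covers this one
};

// True if the byte range [addr, addr + size) intersects the queued copy.
inline bool Overlaps(const QueuedCopy& old_copy, uint32_t addr, uint32_t size)
{
  return old_copy.addr < addr + size && addr < old_copy.addr + old_copy.size;
}

// True if the byte range [addr, addr + size) contains the whole queued copy.
inline bool FullyCovers(uint32_t addr, uint32_t size, const QueuedCopy& old_copy)
{
  return addr <= old_copy.addr && old_copy.addr + old_copy.size <= addr + size;
}

// Queue a new copy without flushing. Older overlapping copies are outdated, so
// they stop being used as textures; older copies that are completely covered
// also skip their CPU-side write, since the new copy overwrites every byte.
inline void AddCopy(std::vector<QueuedCopy>& queue, uint32_t addr, uint32_t size)
{
  assert(size > 0);
  for (QueuedCopy& old_copy : queue)
  {
    if (!Overlaps(old_copy, addr, size))
      continue;
    old_copy.usable_as_texture = false;
    if (FullyCovers(addr, size, old_copy))
      old_copy.needs_writeback = false;
  }
  QueuedCopy copy;
  copy.addr = addr;
  copy.size = size;
  queue.push_back(copy);
}
```

A partially overlapped copy keeps `needs_writeback` set because some of its bytes are not rewritten by the newer copy, and flushing in submission order still produces the same final contents in RAM.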

This whole idea works surprisingly well for most of the games we've tried. It looks like they're not too naughty and synchronize with the GPU when they want to read EFB copies. Which makes sense, because there are so many factors involved in how much time it takes the GPU to process commands that you can't really do cycle counting here.

@stenzek stenzek added the WIP label Nov 2, 2018

@JMC47

Contributor

JMC47 commented Nov 2, 2018

I tested it on Android and the numbers aren't as exciting as they are on desktop. With The Legend of Zelda: The Wind Waker (EFB2RAM enabled) I went from 17.52 fps to 18.69 fps on OpenGL.

On Vulkan it goes from 14.70 fps to 18.90 fps. Both of these results were taken on Outset Island on an SD835 (OnePlus 5) phone.

@degasus

Member

degasus commented Nov 2, 2018

I fear you also need a FlushEFBCopies() call either on tmem cache invalidation or on texture uploading. Else you might miss an efb copy not handled with our efb2tex path.

@stenzek stenzek force-pushed the stenzek:batched-efb-copies branch 3 times, most recently from 57cea12 to a6ccecf Nov 5, 2018

@stenzek stenzek removed the WIP label Nov 5, 2018

@stenzek

Contributor

stenzek commented Nov 5, 2018

OP edited with details on how it works. The new option is called "Defer EFB Copies to RAM" in the Hacks tab, and it's enabled by default, as we haven't found anything which doesn't work with it.

@stenzek stenzek force-pushed the stenzek:batched-efb-copies branch 2 times, most recently from 2f87682 to cc88830 Nov 5, 2018

@JMC47

Contributor

JMC47 commented Nov 6, 2018

Alright, here are some awesome results.

Xenoblade Chronicles during sunset is a sort of worst case scenario. It's a 30 FPS game. Here's how it checks out.

EFB2RAM - 62 FPS - This isn't bad, right? Double FPS. But I'm on a beefy computer, so, that's still worrying.
EFB2Tex - 180 FPS - This is awesome, 6x speed means that I'm well over what is required.
EFB2RAM Deferred - 156 FPS - Working EFB Copies at more than twice as fast as the current option. Beautiful.

But we don't force on EFB2RAM for Xenoblade, meaning this doesn't matter. Let's look at Wind Waker just in a regular ocean scene. Again, 30 FPS game, but remember EFB2RAM is forced ON.

EFB2RAM - 105 FPS - Over 3x speed on a Core i7-6700K is pretty good, but, weaker computers won't like the slowdown just to get the pictobox working.
EFB2Tex - 150 FPS - Not bad, but this means broken pictobox and other features.
EFB2RAM Deferred - 130 FPS - Again, a pretty nice gain for a game that isn't typically considered EFB copy limited.

Some other numbers

Silent Hill: Shattered Memories (60 FPS game) - EFB2RAM sampled for working snow. Taken from a demanding spot with lots of reflections near the game start.
EFB2RAM - 49 FPS
EFB2RAM Deferred - 55 FPS
EFB2Tex - 63 FPS

@delroth

Member

delroth commented Nov 6, 2018

Can you add the new option to analytics?

Other than that, LGTM. I've only had a cursory look but couldn't see anything wrong, and I think JMC has done good amounts of testing already :) I'd rather merge early and have a full 3 weeks for testing before next beta.

@stenzek stenzek force-pushed the stenzek:batched-efb-copies branch 2 times, most recently from 262f533 to f9c28d9 Nov 7, 2018

@stenzek stenzek force-pushed the stenzek:batched-efb-copies branch from f9c28d9 to a45f977 Nov 7, 2018

@delroth delroth merged commit dac58a8 into dolphin-emu:master Nov 7, 2018

8 checks passed

default Very basic checks passed, handed off to Buildbot.
lint Build succeeded on builder lint
pr-android Build succeeded on builder pr-android
pr-deb-dbg-x64 Build succeeded on builder pr-deb-dbg-x64
pr-deb-x64 Build succeeded on builder pr-deb-x64
pr-freebsd-x64 Build succeeded on builder pr-freebsd-x64
pr-osx-x64 Build succeeded on builder pr-osx-x64
pr-ubu-x64 Build succeeded on builder pr-ubu-x64

@psennermann

psennermann commented Nov 8, 2018

Interesting to notice that now, in less demanding scenes, Copy to RAM seems faster than Copy to Texture (for example on my system, using D3D11, "The Last Story" at the beginning of gameplay is at 103 vs 100 fps, while moving to the hidden cave it drops to 58 vs 61)...

@JMC47

Contributor

JMC47 commented Nov 8, 2018

Clocking up the GPU probably?

@DaRkL3AD3R

DaRkL3AD3R commented Nov 11, 2018

Anyone else seeing broken Pictograph photos in Wind Waker?

Hack Off:
https://i.imgur.com/CcE9XLh.png

Hack On:
https://i.imgur.com/27HD6dk.png

@delroth

Member

delroth commented Nov 11, 2018

@DaRkL3AD3R Stenzek mentioned that there might be a corruption issue with Vulkan right now, do you see the same problem with OGL? Thanks.

@binturongx10

binturongx10 commented Nov 12, 2018

This seems to have broken Rogue Squadron II's intro cutscenes with XFB to texture only disabled.

screenshot 3

@JMC47

Contributor

JMC47 commented Nov 12, 2018

If you turn it off does it work fine? Or is it permanently broken?

@binturongx10

binturongx10 commented Nov 12, 2018

The "Defer EFB Copies to RAM" setting does not affect it.

@binturongx10

binturongx10 commented Nov 12, 2018

But it started when this was merged.

@JMC47

Contributor

JMC47 commented Nov 12, 2018

Thanks for the report.
