Jit64: dcbx loop detection for improved performance when invalidating large memory regions. #10007

Merged
merged 4 commits into from Aug 17, 2021

AdmiralCurtiss
Contributor

@AdmiralCurtiss AdmiralCurtiss commented Aug 7, 2021

Unfortunately, I have to report that our recent dcbx fixes have once again made Mario Sunshine's boot sequence, and A Boy and his Blob in general, slow.

I profiled this a bit, and it turns out the big performance issue comes from a game calling dcbx on memory that is not mapped in the BAT: the BAT check fails, and JitCache_TranslateAddress() then has to go through the rather slow TranslatePageAddress(). This wouldn't be so bad if it happened once, but:

  • Mario Sunshine calls this on the entire address space, one cache line at a time.
  • A Boy and his Blob does this all the time in the background, on multiple ~266KB regions.

This, as it turns out, is slow. Profiling Mario Sunshine's boot sequence shows >70% CPU usage in the page translation code alone. That makes sense when you think about it: we're effectively invoking InvalidateICache() about 134 million times.
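The 134 million figure can be reproduced with a quick calculation. This is only a sanity check, and it assumes 32-byte cache lines and a loop that walks the full 32-bit address space; neither number is stated above, and `CacheLinesIn` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 32-byte cache lines and a full 32-bit (4 GiB) address space.
constexpr std::uint64_t kAddressSpaceBytes = std::uint64_t{1} << 32;
constexpr std::uint64_t kCacheLineBytes = 32;

// Number of per-line cache instructions needed to cover `bytes` of memory.
constexpr std::uint64_t CacheLinesIn(std::uint64_t bytes)
{
  assert(bytes % kCacheLineBytes == 0);  // whole cache lines only
  return bytes / kCacheLineBytes;
}

// 2^32 / 2^5 = 2^27 = 134,217,728 calls, one per cache line.
static_assert(CacheLinesIn(kAddressSpaceBytes) == 134'217'728);
```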

This PR tries to detect such loops and transform them into larger invalidation calls, limited by the remaining downcount, instead of one per cache line.
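As a rough illustration of the idea (not the code in this PR), the batching step could be sketched as follows. Everything here is hypothetical: `BatchDcbxLoop`, `BatchResult`, the per-iteration cycle cost, and the 32-byte line size are assumptions made for the example, and the real JIT emits x64 code rather than calling a helper like this.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumption: 32-byte cache lines.
constexpr std::uint32_t kCacheLineBytes = 32;

// Result of folding several loop iterations into one invalidation request.
struct BatchResult
{
  std::uint32_t lines;          // loop iterations handled in this batch
  std::uint32_t range_start;    // first byte of the range to invalidate
  std::uint64_t range_bytes;    // size of the range to invalidate
  std::uint32_t remaining_ctr;  // iterations left for later batches
};

// Hypothetical helper: decide how many iterations of a detected dcbx loop
// can be handled at once, limited by the remaining downcount.
BatchResult BatchDcbxLoop(std::uint32_t address, std::uint32_t ctr, std::int32_t downcount,
                          std::uint32_t cycles_per_iteration)
{
  assert(ctr >= 1);
  assert(cycles_per_iteration >= 1);

  // How many iterations the remaining downcount can pay for.
  const std::uint32_t affordable =
      downcount > 0 ? static_cast<std::uint32_t>(downcount) / cycles_per_iteration : 0;

  // Always make progress by at least one line, never exceed the loop counter.
  const std::uint32_t lines = std::min(std::max(affordable, std::uint32_t{1}), ctr);

  return BatchResult{lines, address, std::uint64_t{lines} * kCacheLineBytes, ctr - lines};
}
```

With a downcount of 300 and an assumed cost of 3 cycles per iteration, a 1000-iteration loop would be handled as one 100-line (3200-byte) invalidation, leaving 900 iterations for later batches.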

Probably want an ARM implementation of this too.

@JMC47
Contributor

JMC47 commented Aug 8, 2021

We probably need to do this before the end of the month for the progress report, since it'd be nice to have both performance and compatibility.

@Pokechu22
Contributor

For reference, libogc's cache_asm.S is (I think) a disassembly of the relevant SDK functions. You can see that almost all of the Range ones follow that pattern. The only exceptions are __LCEnable and LCAllocTags, which have two cache-related instructions per loop instead of one.
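A matcher for such a loop might look roughly like the sketch below. It assumes the pattern is a single cache instruction, an `addi` that advances the same register by 32, and a `bdnz` that branches back to the cache instruction; that shape is an assumption, not something spelled out in this thread. The `Inst` struct and `Op` enum are hypothetical stand-ins for a decoded instruction, not Dolphin types.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decoded-instruction model, for illustration only.
enum class Op { Dcbf, Dcbst, Dcbi, Addi, Bdnz, Other };

struct Inst
{
  Op op;
  int reg_a;  // address register for cache ops, source register for addi
  int reg_d;  // destination register for addi
  int imm;    // addi immediate, or branch displacement in bytes for bdnz
};

constexpr bool IsCacheOp(Op op)
{
  return op == Op::Dcbf || op == Op::Dcbst || op == Op::Dcbi;
}

// True if three consecutive instructions look like:
//   dcbx  0, rN
//   addi  rN, rN, 32
//   bdnz  <back to the dcbx>
constexpr bool LooksLikeRangeLoop(const Inst& first, const Inst& second, const Inst& third)
{
  return IsCacheOp(first.op) &&
         second.op == Op::Addi && second.reg_a == first.reg_a &&
         second.reg_d == first.reg_a && second.imm == 32 &&
         // The branch sits two instructions (8 bytes) after the cache op.
         third.op == Op::Bdnz && third.imm == -8;
}
```

Loops with two cache-related instructions per iteration, such as the `__LCEnable` and `LCAllocTags` exceptions mentioned above, would not match this three-instruction check.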

@AdmiralCurtiss AdmiralCurtiss force-pushed the x64-dcbx-in-loop branch 2 times, most recently from 344b6ca to 18ac407 Compare August 11, 2021 01:32
@AdmiralCurtiss AdmiralCurtiss marked this pull request as ready for review August 12, 2021 17:32
@AdmiralCurtiss AdmiralCurtiss changed the title [WIP] dcbx loop detection Jit64: dcbx loop detection for improved performance when invalidating large memory regions. Aug 12, 2021
@JMC47
Contributor

JMC47 commented Aug 12, 2021

I can confirm there's no more stuttering in Arc Rise Fantasia, Pokemon Colosseum's startup, and Super Mario Sunshine. Mario and Sonic at the London Olympic Games and Happy Feet are not crashing with this optimization.

Member

@JosJuice JosJuice left a comment

LGTM. I will submit a JitArm64 version as a follow-up PR once this is merged.

@JMC47 JMC47 merged commit d162015 into dolphin-emu:master Aug 17, 2021
11 checks passed
@AdmiralCurtiss AdmiralCurtiss deleted the x64-dcbx-in-loop branch August 17, 2021 01:34
4 participants