Jit64: dcbx loop detection for improved performance when invalidating large memory regions. #10007
We probably need to do this before the end of the month for the progress report, since it'd be nice to have performance and compatibility.
For reference, libogc's cache_asm.S is (I think) a disassembly of the relevant SDK functions. You can see that almost all of the
…s 32-byte aligned.
I can confirm there's no more stuttering in Arc Rise Fantasia, Pokemon Colosseum's startup, and Super Mario Sunshine. Mario and Sonic at the London Olympic Games and Happy Feet are not crashing with this optimization.
…multiple BAT or Page Table pages.
LGTM. I will submit a JitArm64 version as a follow-up PR once this is merged.
So unfortunately, I have to report that our recent dcbx fixes have made Mario Sunshine's boot sequence, and A Boy and His Blob in general, slow again.
I profiled this a bit and it turns out the big performance issue comes from a game calling dcbx, failing the BAT check on memory that is not mapped in the BAT, and then JitCache_TranslateAddress() having to go through the rather slow TranslatePageAddress(). This wouldn't be so bad once, but:
This, as it turns out, is slow. Profiling Mario Sunshine's boot sequence shows >70% CPU usage in just the page translation code. That makes sense when you think about it: we're effectively invoking InvalidateICache() about 134 million times.
This PR tries to detect such loops and transform them into larger invalidation calls, limited by the remaining downcount, instead of one per cache line.
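As a rough illustration of the batching idea (not the actual Jit64 code), the sketch below plans a single invalidation for a detected dcbx loop. It assumes the loop advances one 32-byte cache line per iteration and that each iteration costs a fixed number of cycles. The names PlanDcbxBatch, BatchedInvalidation, and cycles_per_iteration are hypothetical and do not come from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Cache line size on the GameCube/Wii CPU, in bytes.
constexpr std::uint32_t kCacheLineSize = 32;

// Hypothetical result type: one large invalidation replacing many small ones.
struct BatchedInvalidation
{
  std::uint32_t start_address;    // first address, aligned down to a cache line
  std::uint32_t num_lines;        // cache lines covered by the single call
  std::uint32_t cycles_consumed;  // cycles to subtract from the downcount
};

// Hypothetical helper: decide how many loop iterations can be folded into one
// invalidation without running past the remaining downcount.
BatchedInvalidation PlanDcbxBatch(std::uint32_t address, std::uint32_t loop_count,
                                  std::int32_t downcount,
                                  std::uint32_t cycles_per_iteration)
{
  assert(loop_count > 0);
  assert(cycles_per_iteration > 0);

  // Always cover at least one line, as the unbatched instruction would.
  std::uint32_t affordable = 1;
  if (downcount > 0)
  {
    affordable = std::max<std::uint32_t>(
        1, static_cast<std::uint32_t>(downcount) / cycles_per_iteration);
  }

  const std::uint32_t lines = std::min(loop_count, affordable);
  return BatchedInvalidation{address & ~(kCacheLineSize - 1), lines,
                             lines * cycles_per_iteration};
}
```

With a downcount of 640 and 4 cycles per iteration, a 1000-iteration loop would be folded into one call covering 160 lines, and the remaining iterations would be handled after the next timing slice. Capping by the downcount keeps event timing roughly as it was before batching.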
Probably want an ARM implementation of this too.