Jit: Improve block lookup performance through a shm memory segment. #11737
Conversation
Interesting idea. Though I am curious where the 2% speedup comes from... I guess it's either the shorter code or the larger table. Have you tried checking what happens if you use the existing code but increase the size of the table?
Hi, my best guess is that it's due to the larger table, since the following line entirely disappears when running with "perf record". I will check if increasing the map in the current code also helps.
It's both. First: current code

and

Second: bigger table (factor of 0x10, i.e. 16)

and

Third:
Okay. My only real concern with the overall idea then is how much RAM we would use up if the game ends up using every possible PC... But unless the game is specifically written to exploit Dolphin's behavior, we shouldn't come anywhere near that happening. So I guess it's fine.
You might be able to get rid of the `effectiveAddress` check in JitAsm.cpp, since each index in the block map now corresponds to exactly one PC value. Just make sure you still check `msrBits`.
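A minimal host-side model of this point, with hypothetical stand-in types rather than the JIT's actual dispatch code: with exactly one table slot per PC, a non-null entry already implies its effective address equals the current PC, so only the MSR translation bits still need validating.

```cpp
#include <cstdint>

// Hypothetical stand-in for the JitBlockData fields relevant to dispatch.
struct BlockData
{
  std::uint32_t effectiveAddress;
  std::uint32_t msrBits;
};

// Old hashed table: different PCs can share a slot, so the PC must be re-checked.
inline bool MatchesHashed(const BlockData* b, std::uint32_t pc, std::uint32_t msr)
{
  return b != nullptr && b->effectiveAddress == pc && b->msrBits == msr;
}

// New one-slot-per-PC table: the table index already encodes the PC,
// so only the MSR bits need to be compared.
inline bool MatchesDirect(const BlockData* b, std::uint32_t msr)
{
  return b != nullptr && b->msrBits == msr;
}
```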
Yes, I thought about that as well. But I think if the game goes rogue and indexes all possible PCs, eventually JitBaseBlockCache::Clear() would be called (from JitX::ClearCache) and the virtual file truncated.
Good point!
FYI: I have only tested Linux and Android so far, and I don't have an easy way to test macOS/Windows right now.
Would it be difficult to add a configuration option for this? (I can't allocate this much virtual memory on iOS. This only affects my personal fork though, so I'm not blocking this PR if it isn't possible or too hard.)

Isn't there already a similarly sized shm segment in Source/Core/Core/HW/Memmap.cpp for the fastmem?

Yes. I had to disable fastmem on my fork.

I made it so that it falls back to the original method when the shm segment can't be allocated.

That part of the change LGTM, thanks so much! I'll pass this back to Jos for further reviewing...
I fixed the formatting issues (sorry, for some reason "./Tools/lint.sh --force" didn't report them locally).
Can you rebase this? We ended up with a merge conflict here.
I rebased it already a while ago but didn't push yet, because I thought there was an issue with mGBA: it didn't compile. It turns out my distro's mGBA was too old, and after switching to the one in Externals it compiled again. To me that sounds like some version check is missing somewhere? Anyway, I pushed an update here.
```cpp
  return static_cast<u8*>(base);
}

void MemArena::ReleaseMemoryRegion()
{
  munmap(m_reserved_region, m_reserved_region_size);
```
I don't think `m_reserved_region_size` is set correctly on Android. You probably have to set it in ReserveMemoryRegion() like in the Unix variant. In fact, you can probably match both of those functions to the Unix ones (except for the flags):

dolphin/Source/Core/Common/MemArenaUnix.cpp, lines 66 to 87 in 8c67083:
```cpp
u8* MemArena::ReserveMemoryRegion(size_t memory_size)
{
  const int flags = MAP_ANON | MAP_PRIVATE;
  void* base = mmap(nullptr, memory_size, PROT_NONE, flags, -1, 0);
  if (base == MAP_FAILED)
  {
    PanicAlertFmt("Failed to map enough memory space: {}", LastStrerrorString());
    return nullptr;
  }
  m_reserved_region = base;
  m_reserved_region_size = memory_size;
  return static_cast<u8*>(base);
}

void MemArena::ReleaseMemoryRegion()
{
  if (m_reserved_region)
  {
    munmap(m_reserved_region, m_reserved_region_size);
    m_reserved_region = nullptr;
  }
}
```
(Hm looking at this, does Android still have the old memory mapping bug where other malloc/new calls can allocate in fastmem space? Maybe that should be fixed too, but probably not in this PR.)
Nobody ever ported the fix for that over to Android as far as I can recall. But yeah, let's leave it out of this PR.
I did these changes because otherwise the fastmem arena and the lookup table would overlap and destroy each other. But I can remove these changes from here once #11823 is merged (which I assume fixes it too).
```cpp
    m_block_map_arena.GrabSHMSegment(FAST_BLOCK_MAP_SIZE);
  }

  fast_block_map = (JitBlock**)m_block_map_arena.CreateView(0, FAST_BLOCK_MAP_SIZE);
```
`reinterpret_cast`, please.
```cpp
  JitBlock** fast_block_map = 0;
  Common::MemArena m_block_map_arena;

  // An alternative for the above fast_block_map but without a shm segment
  // in case the shm memory region couldn't be allocated.
  std::array<JitBlock*, FAST_BLOCK_MAP_FALLBACK_ELEMENTS>
      fast_block_map_fallback{};  // start_addr & mask -> number

  JitBlock** fast_block_map_ptr = 0;
```
If you're already touching these, can you attach an `m_` to the front of the names?
```diff
@@ -131,8 +132,9 @@ class JitBaseBlockCache
   // is valid (MSR.IR and MSR.DR, the address translation bits).
   static constexpr u32 JIT_CACHE_MSR_MASK = 0x30;

   static constexpr u32 FAST_BLOCK_MAP_ELEMENTS = 0x10000;
   static constexpr u32 FAST_BLOCK_MAP_MASK = FAST_BLOCK_MAP_ELEMENTS - 1;
+  static constexpr u64 FAST_BLOCK_MAP_SIZE = 0x2'0000'0000;
```
You should document how one arrives at this number. You have a guest memory space of 4 GB, and PPC instructions are always aligned to 4 bytes, so that gives you 4 GB / 4 bytes = 0x4000'0000 possible entries in the map (I guess it's a lookup table now, technically, but whatever). Each entry is a `JitBlock*`, which is 8 bytes on 64-bit machines. That's 8 GB, or 0x2'0000'0000 bytes.
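For reference, that arithmetic can be pinned down at compile time. The constant names below merely mirror the discussion here and are not Dolphin's actual declarations:

```cpp
#include <cstdint>

// Hypothetical constants mirroring the review comment, not Dolphin's code.
// 4 GiB of guest address space; PPC instructions are 4-byte aligned.
constexpr std::uint64_t GUEST_ADDRESS_SPACE = 0x1'0000'0000;
constexpr std::uint64_t POSSIBLE_PCS = GUEST_ADDRESS_SPACE / 4;  // 0x4000'0000 entries
// Each entry is a JitBlock*, which is 8 bytes on 64-bit hosts.
constexpr std::uint64_t FAST_BLOCK_MAP_SIZE = POSSIBLE_PCS * 8;

static_assert(FAST_BLOCK_MAP_SIZE == 0x2'0000'0000,
              "one 8-byte pointer per possible PC = 8 GiB of virtual space");
```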
```cpp
  CMP(64, R(RSCRATCH2),
      MDisp(RSCRATCH, static_cast<s32>(offsetof(JitBlockData, effectiveAddress))));
```
Not your fault, but jeez, this is evil and only works because `effectiveAddress` and `msrBits` are right next to each other in `JitBlockData`. This probably needs some comments and asserts...
More specifically, I would recommend something like:

```cpp
static_assert(offsetof(JitBlockData, effectiveAddress) + 4 == offsetof(JitBlockData, msrBits));
```

This way, any violations will be caught at compile time, and the assert is clear enough that I don't think writing a comment is necessary.
```cpp
  MOV(32, R(RSCRATCH), PPCSTATE(pc));

  MOV(64, R(RSCRATCH2), Imm64(icache));
  MOV(64, R(RSCRATCH), MComplex(RSCRATCH2, RSCRATCH, SCALE_2, 0));
```
This could also use a comment for how this actually arrives at the correct address; it took me a minute to figure out why `SCALE_2` is correct. (It's because each 4-byte offset of the PC register corresponds to an 8-byte offset in the lookup table, due to 8-byte host pointers.)
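The parenthetical can be checked with a small host-side model (hypothetical helpers, not Dolphin code): for a 4-byte-aligned `pc`, the byte offset of slot `pc / 4` in a table of 8-byte pointers is exactly `pc * 2`, which is what an x86 scaled-index address with scale factor 2 computes.

```cpp
#include <cstdint>

// Hypothetical helpers illustrating why SCALE_2 works for the block lookup.

// Byte offset of the table entry for a given PC: entry index times pointer size.
constexpr std::uint64_t TableByteOffset(std::uint32_t pc)
{
  return (static_cast<std::uint64_t>(pc) / 4) * 8;
}

// What MComplex(base, pc, SCALE_2, 0) adds to the table base address.
constexpr std::uint64_t ScaledIndex(std::uint32_t pc)
{
  return static_cast<std::uint64_t>(pc) * 2;
}

// Equal for every 4-byte-aligned PC, since pc / 4 * 8 == pc * 2 when pc % 4 == 0.
static_assert(TableByteOffset(0x80003A5C) == ScaledIndex(0x80003A5C),
              "scale-2 addressing reaches the right 8-byte slot");
```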
You can rebase this now. Please update the two calls to
By using a shm memory segment for the fast_block_map that is sparsely allocated by the OS (i.e. committed on write) instead of a statically allocated array, we can make the block lookup faster by:

* Having a bigger space available for lookup that doesn't take up too much memory, because the OS will only allocate the needed pages when they are written to.
* Decreasing the time spent looking up a block in the assembly dispatcher, due to fewer comparisons and shorter code (for example, the PC check has been entirely dropped, since only the msrBits need to be validated).

When the JIT block cache is full, the shm segment will also be released and reallocated, to avoid allocating too much memory. It will also be reset when the instruction cache is flushed by the PPC code, to avoid having stale entries. The code also falls back to the original method in case the memory segment couldn't be allocated.
Also, remove munmap from ReserveMemoryRegion in MemArenaAndroid and put it into ReleaseMemoryRegion; otherwise the lookup table would break.
Only works on Linux and Android so far.
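As a rough POSIX-only sketch of the core idea (hypothetical names; the real implementation goes through Common::MemArena and a shm segment): reserve the whole table as anonymous memory up front and let the kernel commit physical pages only when a slot is first written. `MAP_NORESERVE` is assumed here to keep the 8 GiB reservation cheap under strict overcommit settings.

```cpp
#include <cstddef>
#include <sys/mman.h>

struct JitBlock;  // opaque stand-in for Dolphin's real JitBlock

// Reserve one slot per possible PC. Physical pages are committed lazily by
// the OS on first write, so the mostly-empty table costs almost no RAM.
// Returns nullptr on failure so callers can fall back to a fixed-size array,
// as this PR does.
inline JitBlock** AllocateSparseBlockMap(std::size_t size_in_bytes)
{
  void* base = mmap(nullptr, size_in_bytes, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  return base == MAP_FAILED ? nullptr : static_cast<JitBlock**>(base);
}
```

Clearing the cache would then amount to dropping and re-creating the mapping (or `madvise(MADV_DONTNEED)`), which matches the "released and reallocated" behavior described above.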
It gives around a 2% performance boost on my x86_64 desktop when running with the speed limit disabled; on the arm64 device I've tested, the gains are less significant. But of course it might be that my testing method is wrong, and I wouldn't be surprised if the performance improvement is smaller or even imagined.
I'm filing a PR in this WIP state because I'd like to know if I'm doing something wrong, whether the performance gains are real, and whether it makes sense to continue with this work. I will fix the formatting to match Dolphin's code style after that.
In this branch I have extended the map even further and index via `(addr | ((msrBits & mask) >> 4))`:
https://github.com/krnlyng/dolphin/tree/block_map_msr
But this doesn't seem to change anything.