
Jit: Improve block lookup performance through a shm memory segment. #11737

Merged (1 commit, May 21, 2023)

Conversation

krnlyng
Contributor

@krnlyng krnlyng commented Apr 7, 2023

This also moves the munmap call out of ReserveMemoryRegion in MemArenaAndroid and into ReleaseMemoryRegion; otherwise the lookup table would break.

Only works on Linux and Android so far.

It gives around a 2% performance boost on my x86_64 desktop when running with the speed limit disabled; on the arm64 device I've tested, the gains are less significant. Of course, my testing method might be flawed, and I wouldn't be surprised if the improvement is smaller or even imagined.

I'm filing this PR in a WIP state because I'd like to know whether I'm doing something wrong, whether the performance gains are real, and whether it makes sense to continue with this work. I will fix the formatting to match Dolphin's code style afterwards.

In this branch I have extended the map even further and index via `(addr | ((msrBits & mask) >> 4))`:
https://github.com/krnlyng/dolphin/tree/block_map_msr
But this doesn't seem to change anything.

@JosJuice
Member

JosJuice commented Apr 7, 2023

Interesting idea. Though I am curious where the 2% speedup comes from... I guess it's either the shorter code or the larger table. Have you tried checking what happens if you use the existing code but increase the size of the table (FAST_BLOCK_MAP_ELEMENTS)?

@krnlyng
Contributor Author

krnlyng commented Apr 7, 2023

Hi,

My best guess is that it's due to the larger table, since the following line

     2.07%  CPU thread       dolphin-emu-nogui                 [.] JitBaseBlockCache::GetBlockFromStartAddress

entirely disappears when running with "perf record".

I will check if increasing the map in the current code itself also helps.

@krnlyng
Contributor Author

krnlyng commented Apr 7, 2023

It's both.

First: current code
Game runs at ~190-195 VPS.

    13.18%  CPU thread       [JIT] tid 1881645                  [.] JIT_Loop

and

     1.93%  CPU thread       dolphin-emu-nogui                  [.] JitBaseBlockCache::GetBlockFromStartAddress

Second: bigger table (a factor of 0x10, i.e. 16)
Game runs at ~195-200 VPS.

    14.74%  CPU thread       [JIT] tid 1882030                      [.] JIT_Loop

and

     0.14%  CPU thread       dolphin-emu-nogui                      [.] JitBaseBlockCache::GetBlockFromStartAddress

Third: mmap'ed version
Game runs at ~200-205 VPS.

    13.11%  CPU thread       [JIT] tid 1882979                  [.] JIT_Loop

@JosJuice
Member

JosJuice commented Apr 7, 2023

Okay. My only real concern with the overall idea then is how much RAM we would use up if the game ends up using every possible PC... But unless the game is specifically written to exploit Dolphin's behavior, we shouldn't come anywhere near that happening. So I guess it's fine.

Member

@JosJuice JosJuice left a comment


You might be able to get rid of the effectiveAddress check in JitAsm.cpp, since each index in the block map now corresponds to exactly one PC value. Just make sure you still check msrBits.

Source/Core/Core/PowerPC/JitCommon/JitCache.cpp (outdated, resolved)
@krnlyng
Contributor Author

krnlyng commented Apr 7, 2023

> Okay. My only real concern with the overall idea then is how much RAM we would use up if the game ends up using every possible PC... But unless the game is specifically written to exploit Dolphin's behavior, we shouldn't come anywhere near that happening. So I guess it's fine.

Yes, I thought about that as well. But I think if the game goes rogue and indexes all possible PCs, eventually JitBaseBlockCache::Clear() would be called (from JitX::ClearCache) and the virtual file truncated.

@JosJuice
Member

JosJuice commented Apr 7, 2023

Good point!

@krnlyng
Contributor Author

krnlyng commented Apr 7, 2023

FYI: I have only tested Linux and Android so far, and I don't have an easy way to test macOS/Windows right now.

@krnlyng krnlyng changed the title from "WIP: Jit: Replace fast_block_map with an mmap'ed lookup table." to "Jit: Improve block lookup performance through a shm memory segment." on Apr 7, 2023
@OatmealDome
Member

Would it be difficult to add a configuration option for this? (I can't allocate this much virtual memory on iOS. This only affects my personal fork though, so I'm not blocking this PR if it isn't possible or too hard.)

@krnlyng
Contributor Author

krnlyng commented Apr 7, 2023

> Would it be difficult to add a configuration option for this? (I can't allocate this much virtual memory on iOS. This only affects my personal fork though, so I'm not blocking this PR if it isn't possible or too hard.)

Isn't there already a similarly sized shm segment in Source/Core/Core/HW/Memmap.cpp for the fastmem?

@OatmealDome
Member

OatmealDome commented Apr 7, 2023

Yes. I had to disable fastmem on my fork.

@krnlyng
Contributor Author

krnlyng commented Apr 8, 2023

> Yes. I had to disable fastmem on my fork.

I made it so that it falls back to the original method when the shm segment can't be allocated.
I hope that works for you!

@OatmealDome
Member

That part of the change LGTM, thanks so much! I'll pass this back to Jos for further reviewing...

@krnlyng
Contributor Author

krnlyng commented Apr 20, 2023

I fixed the formatting issues (sorry, for some reason "./Tools/lint.sh --force" didn't report them locally).

@AdmiralCurtiss
Contributor

Can you rebase this? We ended up with a merge conflict here.

@krnlyng
Contributor Author

krnlyng commented May 9, 2023

I had already rebased it a while ago, but didn't push yet because I thought there was an issue with mGBA: it didn't compile. It turns out my distro's mGBA was too old, and after switching to the one in External it compiled again.

To me that sounds like some version check is missing somewhere?

Anyway, I pushed an update here.

```cpp
  return static_cast<u8*>(base);
}

void MemArena::ReleaseMemoryRegion()
{
  munmap(m_reserved_region, m_reserved_region_size);
```
Contributor

I don't think m_reserved_region_size is set correctly on Android. You probably have to set it in ReserveMemoryRegion() like in the Unix variant. In fact, you can probably match both of those functions to the Unix ones (except for the flags):

```cpp
u8* MemArena::ReserveMemoryRegion(size_t memory_size)
{
  const int flags = MAP_ANON | MAP_PRIVATE;
  void* base = mmap(nullptr, memory_size, PROT_NONE, flags, -1, 0);
  if (base == MAP_FAILED)
  {
    PanicAlertFmt("Failed to map enough memory space: {}", LastStrerrorString());
    return nullptr;
  }
  m_reserved_region = base;
  m_reserved_region_size = memory_size;
  return static_cast<u8*>(base);
}

void MemArena::ReleaseMemoryRegion()
{
  if (m_reserved_region)
  {
    munmap(m_reserved_region, m_reserved_region_size);
    m_reserved_region = nullptr;
  }
}
```

Contributor

(Hm looking at this, does Android still have the old memory mapping bug where other malloc/new calls can allocate in fastmem space? Maybe that should be fixed too, but probably not in this PR.)

Member

Nobody ever ported the fix for that over to Android as far as I can recall. But yeah, let's leave it out of this PR.

Contributor Author

I did these changes because otherwise the fastmem arena and the lookup table would overlap and destroy each other. But I can remove these changes from here once #11823 is merged (which I assume fixes it too).

```cpp
    m_block_map_arena.GrabSHMSegment(FAST_BLOCK_MAP_SIZE);
  }

  fast_block_map = (JitBlock**)m_block_map_arena.CreateView(0, FAST_BLOCK_MAP_SIZE);
```
Contributor

reinterpret_cast, please.

Comment on lines 212 to 220
```cpp
  JitBlock** fast_block_map = 0;
  Common::MemArena m_block_map_arena;

  // An alternative for the above fast_block_map but without a shm segment
  // in case the shm memory region couldn't be allocated.
  std::array<JitBlock*, FAST_BLOCK_MAP_FALLBACK_ELEMENTS>
      fast_block_map_fallback{};  // start_addr & mask -> number

  JitBlock** fast_block_map_ptr = 0;
```
Contributor

If you're already touching these, can you attach an m_ to the front of the names?

@@ -131,8 +132,9 @@ class JitBaseBlockCache

```cpp
  // is valid (MSR.IR and MSR.DR, the address translation bits).
  static constexpr u32 JIT_CACHE_MSR_MASK = 0x30;

  static constexpr u32 FAST_BLOCK_MAP_ELEMENTS = 0x10000;
  static constexpr u32 FAST_BLOCK_MAP_MASK = FAST_BLOCK_MAP_ELEMENTS - 1;
  static constexpr u64 FAST_BLOCK_MAP_SIZE = 0x2'0000'0000;
```
Contributor

You should document how one arrives at this number: you have a guest address space of 4 GB, and PPC instructions are always aligned to 4 bytes, so that gives you 4 GB / 4 bytes = 0x4000'0000 possible entries in the map (I guess it's technically a lookup table now, but whatever). Each entry is a JitBlock*, which is 8 bytes on 64-bit machines. That's 8 GB, or 0x2'0000'0000 bytes.

Comment on lines +152 to +159
```cpp
  CMP(64, R(RSCRATCH2),
      MDisp(RSCRATCH, static_cast<s32>(offsetof(JitBlockData, effectiveAddress))));
```
Contributor

Not your fault, but jeez, this is evil, and it only works because effectiveAddress and msrBits are right next to each other in JitBlockData. This probably needs some comments and asserts...

Member

More specifically, I would recommend something like:

```cpp
static_assert(offsetof(JitBlockData, effectiveAddress) + 4 == offsetof(JitBlockData, msrBits));
```

This way, any violations will be caught at compile time, and the assert is clear enough that I don't think writing a comment is necessary.

```cpp
  MOV(32, R(RSCRATCH), PPCSTATE(pc));

  MOV(64, R(RSCRATCH2), Imm64(icache));
  MOV(64, R(RSCRATCH), MComplex(RSCRATCH2, RSCRATCH, SCALE_2, 0));
```
Contributor

This could also use a comment on how this actually arrives at the correct address; it took me a minute to figure out why SCALE_2 is correct.

(It's because each 4-byte offset of the PC register corresponds to an 8-byte offset in the lookup table, due to 8-byte host pointers.)

@krnlyng
Contributor Author

krnlyng commented May 11, 2023

I dropped the changes that would conflict with #11823. But this also means that it crashes on Android unless #11823 is merged too.

@AdmiralCurtiss
Contributor

You can rebase this now. Please update the two calls to GrabSHMSegment() and add a second parameter with a unique string for the memory segment, e.g. "dolphin-emu-jitblock".

By using a shm memory segment for the fast_block_map that is sparsely
allocated (i.e. on write by the OS) instead of a statically allocated
array, we can make the block lookup faster by:
* Having a bigger space available for lookup that doesn't take up
  too much memory, because the OS will only allocate the needed
  pages when written to.
* Decreasing the time spent looking up a block in the assembly
  dispatcher due to fewer comparisons and shorter code (for example,
  the pc check has been entirely dropped since only the msrBits need
  to be validated).

When the JIT block cache is full, the shm segment will also be released
and reallocated to avoid allocating too much memory. It will also be
reset when the instruction cache is flushed by the PPC code to avoid
having stale entries.

Also fall back to the original method in case the memory segment
couldn't be allocated.
@AdmiralCurtiss AdmiralCurtiss merged commit 8fd61d0 into dolphin-emu:master May 21, 2023
14 checks passed