
Jit: Improve block lookup performance through a shm memory segment. #11737

Merged (1 commit, May 21, 2023)

Conversation

krnlyng
Contributor

@krnlyng krnlyng commented Apr 7, 2023

This also moves the munmap call out of ReserveMemoryRegion in MemArenaAndroid and into ReleaseMemoryRegion; otherwise the lookup table would break.

Only works on Linux and Android so far.

It gives around a 2% performance boost on my x86_64 desktop when running with the speed limit disabled; on the arm64 device I've tested, the gains are less significant. Of course, my testing method might be flawed, and I wouldn't be surprised if the improvement is smaller or even imagined.

I'm filing this PR in a WIP state because I'd like to know whether I'm doing something wrong, whether the performance gains are real, and whether it makes sense to continue with this work. I will fix the formatting to match Dolphin's code style afterwards.

In this branch I have extended the map even further and index via `(addr | ((msrBits & mask) >> 4))`:
https://github.com/krnlyng/dolphin/tree/block_map_msr
But this doesn't seem to change anything.

@JosJuice
Member

JosJuice commented Apr 7, 2023

Interesting idea. Though I am curious where the 2% speedup comes from... I guess it's either the shorter code or the larger table. Have you tried checking what happens if you use the existing code but increase the size of the table (FAST_BLOCK_MAP_ELEMENTS)?

@krnlyng
Contributor Author

krnlyng commented Apr 7, 2023

Hi,

My best guess is that it's due to the larger table, since the following line

     2.07%  CPU thread       dolphin-emu-nogui                 [.] JitBaseBlockCache::GetBlockFromStartAddress

entirely disappears when running with "perf record".

I will check if increasing the map in the current code itself also helps.

@krnlyng
Contributor Author

krnlyng commented Apr 7, 2023

It's both.

First: current code
Game runs at ~190-195 VPS.

    13.18%  CPU thread       [JIT] tid 1881645                  [.] JIT_Loop

and

     1.93%  CPU thread       dolphin-emu-nogui                  [.] JitBaseBlockCache::GetBlockFromStartAddress

Second: bigger table (a factor of 0x10, i.e. 16)
Game runs at ~195-200 VPS.

    14.74%  CPU thread       [JIT] tid 1882030                      [.] JIT_Loop

and

     0.14%  CPU thread       dolphin-emu-nogui                      [.] JitBaseBlockCache::GetBlockFromStartAddress

Third: mmap'ed version
Game runs at ~200-205 VPS.

    13.11%  CPU thread       [JIT] tid 1882979                  [.] JIT_Loop

@JosJuice
Member

JosJuice commented Apr 7, 2023

Okay. My only real concern with the overall idea then is how much RAM we would use up if the game ends up using every possible PC... But unless the game is specifically written to exploit Dolphin's behavior, we shouldn't come anywhere near that happening. So I guess it's fine.

Member

@JosJuice JosJuice left a comment


You might be able to get rid of the effectiveAddress check in JitAsm.cpp, since each index in the block map now corresponds to exactly one PC value. Just make sure you still check msrBits.

Source/Core/Core/PowerPC/JitCommon/JitCache.cpp (outdated, resolved)
@krnlyng
Contributor Author

krnlyng commented Apr 7, 2023

> Okay. My only real concern with the overall idea then is how much RAM we would use up if the game ends up using every possible PC... But unless the game is specifically written to exploit Dolphin's behavior, we shouldn't come anywhere near that happening. So I guess it's fine.

Yes, I thought about that as well. But I think if the game goes rogue and indexes all possible PCs, eventually JitBaseBlockCache::Clear() would be called (from JitX::ClearCache) and the virtual file truncated.

@JosJuice
Member

JosJuice commented Apr 7, 2023

Good point!

@krnlyng
Contributor Author

krnlyng commented Apr 7, 2023

FYI: I have only tested Linux and Android so far, and I don't have an easy way to test macOS/Windows right now.

@krnlyng krnlyng changed the title from "WIP: Jit: Replace fast_block_map with an mmap'ed lookup table." to "Jit: Improve block lookup performance through a shm memory segment." on Apr 7, 2023
@OatmealDome
Member

Would it be difficult to add a configuration option for this? (I can't allocate this much virtual memory on iOS. This only affects my personal fork though, so I'm not blocking this PR if it isn't possible or too hard.)

@krnlyng
Contributor Author

krnlyng commented Apr 7, 2023

> Would it be difficult to add a configuration option for this? (I can't allocate this much virtual memory on iOS. This only affects my personal fork though, so I'm not blocking this PR if it isn't possible or too hard.)

Isn't there already a similarly sized shm segment in Source/Core/Core/HW/Memmap.cpp for the fastmem?

@OatmealDome
Member

OatmealDome commented Apr 7, 2023

Yes. I had to disable fastmem on my fork.

@krnlyng
Contributor Author

krnlyng commented Apr 8, 2023

> Yes. I had to disable fastmem on my fork.

I made it so that it falls back to the original method when the shm segment can't be allocated.
I hope that works for you!

@OatmealDome
Member

That part of the change LGTM, thanks so much! I'll pass this back to Jos for further reviewing...

@krnlyng
Contributor Author

krnlyng commented Apr 20, 2023

I fixed the formatting issues (sorry, for some reason "./Tools/lint.sh --force" didn't report them locally).

@AdmiralCurtiss
Contributor

Can you rebase this? We ended up with a merge conflict here.

@krnlyng
Contributor Author

krnlyng commented May 9, 2023

I had already rebased it a while ago, but didn't push yet because I thought there was an issue with mGBA: it didn't compile. It turns out my distro's mGBA was too old, and after switching to the one in External it compiled again.

To me that sounds like some version check is missing somewhere?

Anyway, I pushed an update here.

```cpp
  return static_cast<u8*>(base);
}

void MemArena::ReleaseMemoryRegion()
{
  munmap(m_reserved_region, m_reserved_region_size);
```
Contributor

I don't think m_reserved_region_size is set correctly on Android. You probably have to set it in ReserveMemoryRegion() like in the Unix variant. In fact, you can probably match both of those functions to the Unix ones (except for the flags):

```cpp
u8* MemArena::ReserveMemoryRegion(size_t memory_size)
{
  const int flags = MAP_ANON | MAP_PRIVATE;
  void* base = mmap(nullptr, memory_size, PROT_NONE, flags, -1, 0);
  if (base == MAP_FAILED)
  {
    PanicAlertFmt("Failed to map enough memory space: {}", LastStrerrorString());
    return nullptr;
  }
  m_reserved_region = base;
  m_reserved_region_size = memory_size;
  return static_cast<u8*>(base);
}

void MemArena::ReleaseMemoryRegion()
{
  if (m_reserved_region)
  {
    munmap(m_reserved_region, m_reserved_region_size);
    m_reserved_region = nullptr;
  }
}
```

Contributor

(Hm looking at this, does Android still have the old memory mapping bug where other malloc/new calls can allocate in fastmem space? Maybe that should be fixed too, but probably not in this PR.)

Member

Nobody ever ported the fix for that over to Android as far as I can recall. But yeah, let's leave it out of this PR.

Contributor Author

I did these changes because otherwise the fastmem arena and the lookup table would overlap and destroy each other. But I can remove these changes from here once #11823 is merged (which I assume fixes it too).

```cpp
    m_block_map_arena.GrabSHMSegment(FAST_BLOCK_MAP_SIZE);
  }

  fast_block_map = (JitBlock**)m_block_map_arena.CreateView(0, FAST_BLOCK_MAP_SIZE);
```
Contributor

reinterpret_cast, please.

Comment on lines 212 to 220
```cpp
  JitBlock** fast_block_map = 0;
  Common::MemArena m_block_map_arena;

  // An alternative for the above fast_block_map but without a shm segment
  // in case the shm memory region couldn't be allocated.
  std::array<JitBlock*, FAST_BLOCK_MAP_FALLBACK_ELEMENTS>
      fast_block_map_fallback{};  // start_addr & mask -> number

  JitBlock** fast_block_map_ptr = 0;
```
Contributor

If you're already touching these, can you attach an m_ to the front of the names?

@@ -131,8 +132,9 @@ class JitBaseBlockCache

```cpp
  // is valid (MSR.IR and MSR.DR, the address translation bits).
  static constexpr u32 JIT_CACHE_MSR_MASK = 0x30;

  static constexpr u32 FAST_BLOCK_MAP_ELEMENTS = 0x10000;
  static constexpr u32 FAST_BLOCK_MAP_MASK = FAST_BLOCK_MAP_ELEMENTS - 1;
  static constexpr u64 FAST_BLOCK_MAP_SIZE = 0x2'0000'0000;
```
Contributor

You should document how one arrives at this number: you have a guest address space of 4 GB, and PPC instructions are always aligned to 4 bytes, so that gives you 4 GB / 4 bytes = 0x4000'0000 possible entries in the map (I guess it's technically a lookup table now, but whatever). Each entry is a JitBlock*, which is 8 bytes on 64-bit machines. That's 8 GB, or 0x2'0000'0000 bytes.

Comment on lines +152 to +159
```cpp
  CMP(64, R(RSCRATCH2),
      MDisp(RSCRATCH, static_cast<s32>(offsetof(JitBlockData, effectiveAddress))));
```
Contributor

Not your fault, but jeez, this is evil, and it only works because effectiveAddress and msrBits are right next to each other in JitBlockData. This probably needs some comments and asserts...

Member

More specifically, I would recommend something like:

```cpp
static_assert(offsetof(JitBlockData, effectiveAddress) + 4 == offsetof(JitBlockData, msrBits));
```

This way, any violations will be caught at compile time, and the assert is clear enough that I don't think writing a comment is necessary.

```cpp
  MOV(32, R(RSCRATCH), PPCSTATE(pc));

  MOV(64, R(RSCRATCH2), Imm64(icache));
  MOV(64, R(RSCRATCH), MComplex(RSCRATCH2, RSCRATCH, SCALE_2, 0));
```
Contributor

This could also use a comment on how this actually arrives at the correct address; it took me a minute to figure out why SCALE_2 is correct.

(It's because each 4-byte offset of the PC register corresponds to an 8-byte offset in the lookup table, due to 8-byte host pointers.)

@krnlyng
Contributor Author

krnlyng commented May 11, 2023

I dropped the changes that would conflict with #11823. But this also means that it crashes on Android unless #11823 is merged too.

@AdmiralCurtiss
Contributor

You can rebase this now. Please update the two calls to GrabSHMSegment() and add a second parameter with a unique string for the memory segment, e.g. "dolphin-emu-jitblock".

By using a shm memory segment for the fast_block_map that is sparsely
allocated (i.e. on write by the OS) instead of a statically allocated
array, we can make the block lookup faster by:
* Having a bigger space available for lookup that doesn't take up
  too much memory, because the OS will only allocate the needed
  pages when written to.
* Decreasing the time spent looking up a block in the assembly
  dispatcher due to fewer comparisons and shorter code (for example,
  the pc check has been entirely dropped since only the msrBits need
  to be validated).

When the JIT block cache is full, the shm segment will also be released
and reallocated to avoid allocating too much memory. It will also be
reset when the instruction cache is flushed by the PPC code to avoid
having stale entries.

Also fall back to the original method in case the memory segment
couldn't be allocated.
@AdmiralCurtiss AdmiralCurtiss merged commit 8fd61d0 into dolphin-emu:master May 21, 2023
14 checks passed