Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not use magic numbers for RAM sizes/masks in PPCCache #11764

Merged
merged 1 commit into from Apr 22, 2023

Conversation

Minty-Meeo
Copy link
Member

Memory Override + Write-Back Cache emulation = game crash

@@ -168,11 +171,11 @@ void Cache::Invalidate(u32 addr)
if (valid[set] & (1U << way))
{
if (addrs[set][way] & CACHE_VMEM_BIT)
lookup_table_vmem[(addrs[set][way] >> 5) & 0xfffff] = 0xff;
lookup_table_vmem[(addrs[set][way] >> 5) & (memory.GetFakeVMemMask() >> 5)] = 0xff;
Copy link
Member

@lioncash lioncash Apr 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps make local helpers like:

static u32 GoodName(const Memory::MemoryManager& memory)
{
  return memory.GetFakeVMemMask() >> 5;
}
static u32 GoodName2(const Memory::MemoryManager& memory)
{
  return memory.GetExRamMask() >> 5;
}
static u32 GoodName3(const Memory::MemoryManager& memory)
{
  return memory.GetRamMask() >> 5;
}

so all the repeated shifting in 8 places isn't necessary

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, the symmetry of the statement is pleasing. If helpers were necessary, though, I believe a single helper which performs the right shift by 5 to align to an eight-word boundary would be all we need. But then, any decent name describing this action will undoubtedly make the statement even wider than it already is.

@Pokechu22
Copy link
Contributor

Can't this cause out of bounds accesses to these arrays?

// Note: This is only for performance purposes; this same data could be computed at runtime
// from the tags and valid fields (and that's how it's done on the actual cache)
std::array<u8, 1 << 20> lookup_table{};
std::array<u8, 1 << 21> lookup_table_ex{};
std::array<u8, 1 << 20> lookup_table_vmem{};

IMO it'd be better to get rid of them entirely like I tried to do in #10826 - and now that the data cache is implemented I can get decent performance data without needing to run it on pure interpreter for icache.

@Minty-Meeo
Copy link
Member Author

Minty-Meeo commented Apr 17, 2023

Whoops, you are right. Swapping operator[](size_t) for at(size_t) immediately revealed an exception. I wonder how this didn't catastrophically crash while testing w/ the Colossal Caverns romhack.

@Minty-Meeo Minty-Meeo force-pushed the ppc-cache-extendo-ram branch 3 times, most recently from 2cbef2f to d688a73 Compare April 17, 2023 08:21
@Minty-Meeo Minty-Meeo changed the title Do not use magic numbers for RAM masks in PPCCache Do not use magic numbers for RAM sizes/masks in PPCCache Apr 17, 2023
@Minty-Meeo
Copy link
Member Author

I think the lookup table sizes are meant to match the emulator-allocated size of memory, not the real size of memory. The original magic numbers would suggest such. Is this correct?

@shuffle2
Copy link
Contributor

isn't this code in a hot path?

@Minty-Meeo
Copy link
Member Author

A very hot path, though I didn't notice a severe decrease in emulation speed. Certainly nothing more significant than what Write-Back Cache emulation already causes. The only optimization I could offer would be more functions which return RAM sizes/masks that are already right-shifted by five, since the RAM sizes never changes after MemoryManager::Init(). Are you requesting that?

@Minty-Meeo
Copy link
Member Author

This change probably saves a single instruction per PPCCache method execution, so the difference is completely imperceptible to me. Still, an instruction saved is an instruction earned.

@AdmiralCurtiss
Copy link
Contributor

Why not just store the masks in the Cache struct itself?

@Minty-Meeo
Copy link
Member Author

Storing them in the Cache / InstructionCache classes meant having two copies of the pre-calculated values, which I felt could mislead someone to believe there is a difference between the data and instruction cache eight-word ram masks. I really wanted to store them in a relevant class which encompassed both the dCache and iCache, but none I knew of were easily accessible from the functions that need the values.
Alternatively, I could just make them static members of the classes, but I think that goes against the recent de-globalization work around the codebase.
If your suggestion helps get this merged, however, I'll gladly do it.

@Minty-Meeo
Copy link
Member Author

Is this final solution satisfactory?

@AdmiralCurtiss
Copy link
Contributor

Yeah I like that better personally, @Pokechu22 and @lioncash, opinions?

@AdmiralCurtiss
Copy link
Contributor

(though I don't really understand why they're called eight_word_mask)

@Minty-Meeo
Copy link
Member Author

An L1 cache block contains eight words (32 bytes). All of this right-shifting by five is to convert the size/mask in bytes to the size/mask in cache blocks.

@shuffle2
Copy link
Contributor

To be clear: I don't think that change makes a difference in terms of perf, the compiler can just perform the AND before the right shift as they're both unsigned and shift amount is the same - making the logical code identical in either case. The overhead of this PR was/is the indirect mem access to fetch the mask since it's no longer an immediate. The change moves the indirect access to this instead of the memory singleton, but the memory singleton is already accessed elsewhere in the function, so it's probably a wash either way.

Memory Override + Write-Back Cache emulation = game crash
@Minty-Meeo
Copy link
Member Author

Minty-Meeo commented Apr 21, 2023

Thank you, shuffle2, for having the first good suggestion so far. I was totally blind to that potential logic optimization.
Cache::Init(), Cache::Invalidate(), Cache::DoState(), and InstructionCache::Invalidate() are the only ones which did not already reach for the MemoryManager singleton before. Those are a lot less hot than, say, Cache::Read() and Cache::Write(), so I wouldn't lose sleep over it. Plus, I'm sure the situation will improve with further de-globalization, as the references can eventually be passed directly to the methods which need them rather than walking through System first.

Copy link
Contributor

@Pokechu22 Pokechu22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some measurements with this patch:

Patch
diff --git a/Source/Core/Core/HW/DVD/DVDThread.cpp b/Source/Core/Core/HW/DVD/DVDThread.cpp
index d861bba2eb..036b60a0ec 100644
--- a/Source/Core/Core/HW/DVD/DVDThread.cpp
+++ b/Source/Core/Core/HW/DVD/DVDThread.cpp
@@ -332,6 +332,13 @@ void DVDThread::DVDThreadMain()
 {
   Common::SetCurrentThreadName("DVD thread");
 
+  m_file_logger.m_time_to_open = 0;
+  m_file_logger.m_time_to_boot = 0;
+  m_file_logger.m_time_to_intro = 0;
+  m_file_logger.m_time_to_menu = 0;
+  m_file_logger.m_time_to_attract = 0;
+  m_file_logger.m_timer.Start();
+
   while (true)
   {
     m_request_queue_expanded.Wait();
diff --git a/Source/Core/Core/HW/DVD/FileMonitor.cpp b/Source/Core/Core/HW/DVD/FileMonitor.cpp
index 6e08006a6a..1efd711662 100644
--- a/Source/Core/Core/HW/DVD/FileMonitor.cpp
+++ b/Source/Core/Core/HW/DVD/FileMonitor.cpp
@@ -88,6 +88,29 @@ void FileLogger::Log(const DiscIO::Volume& volume, const DiscIO::Partition& part
   // Update the last accessed file
   m_previous_partition = partition;
   m_previous_file_offset = file_offset;
+
+  if (path == "opening.bnr")
+  {
+    m_time_to_open = m_timer.ElapsedMs();
+  }
+  else if (path == "data/nintendo.szs")
+  {
+    m_time_to_boot = m_timer.ElapsedMs();
+  }
+  else if (path == "data/Entrance.thp")
+  {
+    m_time_to_intro = m_timer.ElapsedMs();
+  }
+  else if (path == "data/particle.szs")
+  {
+    m_time_to_menu = m_timer.ElapsedMs();
+  }
+  else if (path == "data/autodemoA.thp")
+  {
+    m_time_to_attract = m_timer.ElapsedMs();
+    SuccessAlertFmt("Open: {}\nBoot: {}\nIntro: {}\nMenu: {}\nAttract: {}", m_time_to_open,
+                    m_time_to_boot, m_time_to_intro, m_time_to_menu, m_time_to_attract);
+  }
 }
 
 }  // namespace FileMonitor
diff --git a/Source/Core/Core/HW/DVD/FileMonitor.h b/Source/Core/Core/HW/DVD/FileMonitor.h
index 38de351460..84016af971 100644
--- a/Source/Core/Core/HW/DVD/FileMonitor.h
+++ b/Source/Core/Core/HW/DVD/FileMonitor.h
@@ -5,6 +5,7 @@
 
 #include "Common/CommonTypes.h"
 #include "DiscIO/Volume.h"
+#include "Common/Timer.h"
 
 namespace FileMonitor
 {
@@ -16,6 +17,13 @@ public:
 
   void Log(const DiscIO::Volume& volume, const DiscIO::Partition& partition, u64 offset);
 
+  u64 m_time_to_open;
+  u64 m_time_to_boot;
+  u64 m_time_to_intro;
+  u64 m_time_to_menu;
+  u64 m_time_to_attract;
+  Common::Timer m_timer;
+
 private:
   DiscIO::Partition m_previous_partition;
   u64 m_previous_file_offset = 0;
diff --git a/Source/Core/Core/PowerPC/MMU.cpp b/Source/Core/Core/PowerPC/MMU.cpp
index a1597b96a5..09c4eccb66 100644
--- a/Source/Core/Core/PowerPC/MMU.cpp
+++ b/Source/Core/Core/PowerPC/MMU.cpp
@@ -1099,7 +1099,9 @@ void MMU::StoreDCacheLine(u32 address)
 
   if (m_ppc_state.msr.DR)
   {
-    auto translated_address = TranslateAddress<XCheckTLBFlag::Write>(address);
+    // Note: per ppc_750cl.pdf, dcbst is treated as a load, not a store
+    // (presumably because it doesn't change the value in memory that the CPU sees)
+    auto translated_address = TranslateAddress<XCheckTLBFlag::Read>(address);
     if (translated_address.result == TranslateAddressResultEnum::DIRECT_STORE_SEGMENT)
     {
       return;
@@ -1107,7 +1109,7 @@ void MMU::StoreDCacheLine(u32 address)
     if (translated_address.result == TranslateAddressResultEnum::PAGE_FAULT)
     {
       // If translation fails, generate a DSI.
-      GenerateDSIException(address, true);
+      GenerateDSIException(address, false);
       return;
     }
     address = translated_address.address;
@@ -1145,15 +1147,22 @@ void MMU::FlushDCacheLine(u32 address)
 
   if (m_ppc_state.msr.DR)
   {
-    auto translated_address = TranslateAddress<XCheckTLBFlag::Write>(address);
+    // Note: per ppc_750cl.pdf, dcbf is treated as a load, not a store
+    // (presumably because it doesn't change the value in memory that the CPU sees)
+    auto translated_address = TranslateAddress<XCheckTLBFlag::Read>(address);
     if (translated_address.result == TranslateAddressResultEnum::DIRECT_STORE_SEGMENT)
     {
       return;
     }
     if (translated_address.result == TranslateAddressResultEnum::PAGE_FAULT)
     {
-      // If translation fails, generate a DSI.
-      GenerateDSIException(address, true);
+      // XXX Based on ppc_750cl.pdf, a DSI exception should occur if translation fails.
+      // However, on startup, Super Mario Sunshine calls OSProtectRange(0,0,0x80000000,0) and
+      // OSProtectRange(1,0x83000000,0x7d000000,0) on startup, which in turn call
+      // DCFlushRange(0,0x80000000) and DCFlushRange(0x83000000,0x7d000000), which then runs
+      // dcbf for every block from 0 to 0x7fffffe0 and 0x83000000 to 0xffffffe0.
+      // Since this does not crash, presumably no exception should be generated.
+      // GenerateDSIException(address, false);
       return;
     }
     address = translated_address.address;
@@ -1169,15 +1178,18 @@ void MMU::TouchDCacheLine(u32 address, bool store)
 
   if (m_ppc_state.msr.DR)
   {
-    auto translated_address = TranslateAddress<XCheckTLBFlag::Write>(address);
+    // Note: per ppc_750cl.pdf, dcbt and dcbtst are treated as loads, not stores
+    // (presumably because it doesn't change the value in memory that the CPU sees)
+    auto translated_address = TranslateAddress<XCheckTLBFlag::Read>(address);
     if (translated_address.result == TranslateAddressResultEnum::DIRECT_STORE_SEGMENT)
     {
       return;
     }
     if (translated_address.result == TranslateAddressResultEnum::PAGE_FAULT)
     {
-      // If translation fails, generate a DSI.
-      GenerateDSIException(address, true);
+      // Per ppc_750cl.pdf: "Additionally, no exception occurs in the case of a translation fault or
+      // protection violation."
+      // GenerateDSIException(address, false);
       return;
     }
     address = translated_address.address;

This patch adds some rough time measuring using the file monitor log system (for which the corresponding log type must be enabled) to measure how long it takes for various parts of Super Mario Sunshine to load. However, Super Mario Sunshine fails to boot with the cache enabled on master, and this patch also includes a workaround for that (which I'll PR in a bit once I'm more confident of it).

Specifically, I measured how long it took for the game to boot (via the GC IPL), then play through the airplane cutscene (since that video is decompressed on the CPU), then idle on the title screen, then start the attract mode cutscene. I used the null video backend and had the speed limit at unlimited.

Originally, it took between 230.6-233.5 seconds for this to happen on my machine. Now it takes between 228.9-231.3 seconds. (I'm not sure if this difference is statistically significant with my sample size of 5.) Note that with the cache disabled, at 100% speed, it takes about 160.5 seconds. Still, this seems to be faster, or at least not slower, than it was before, and for a bugfix that's useful.

Raw data:
Master:
---------------------------
Information
---------------------------
Open: 1860
Boot: 15099
Intro: 23967
Menu: 181464
Attract: 231831
---------------------------
OK   
---------------------------

---------------------------
Information
---------------------------
Open: 1870
Boot: 15039
Intro: 23778
Menu: 180367
Attract: 230788
---------------------------
OK   
---------------------------

---------------------------
Information
---------------------------
Open: 1805
Boot: 14955
Intro: 23746
Menu: 179866
Attract: 230558
---------------------------
OK   
---------------------------

---------------------------
Information
---------------------------
Open: 1894
Boot: 15278
Intro: 24269
Menu: 182882
Attract: 233477
---------------------------
OK   
---------------------------

---------------------------
Information
---------------------------
Open: 1867
Boot: 15121
Intro: 23906
Menu: 180851
Attract: 230978
---------------------------
OK   
---------------------------

PR:

---------------------------
Information
---------------------------
Open: 1858
Boot: 15248
Intro: 24218
Menu: 181120
Attract: 231344
---------------------------
OK   
---------------------------

---------------------------
Information
---------------------------
Open: 1809
Boot: 15040
Intro: 23859
Menu: 179272
Attract: 229530
---------------------------
OK   
---------------------------

---------------------------
Information
---------------------------
Open: 1830
Boot: 14969
Intro: 23858
Menu: 178834
Attract: 228870
---------------------------
OK   
---------------------------

---------------------------
Information
---------------------------
Open: 1806
Boot: 14923
Intro: 23736
Menu: 180230
Attract: 230363
---------------------------
OK   
---------------------------

---------------------------
Information
---------------------------
Open: 1773
Boot: 15164
Intro: 23895
Menu: 179846
Attract: 229977
---------------------------
OK   
---------------------------

I also tried removing the lookup tables entirely or switching to std::unordered_map, but both of those seemed to have worse performance. The test wasn't super scientific though. (I was doing other stuff on my PC at the time, specifically watching YouTube videos, but I did try to make sure the load was about the same in all cases).

@Pokechu22 Pokechu22 merged commit 8fbfee0 into dolphin-emu:master Apr 22, 2023
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
5 participants