
Bulk SDKHooks perform significantly worse on Linux than Windows #1935

Closed
6 tasks done
Vauff opened this issue Feb 18, 2023 · 23 comments

@Vauff
Contributor

Vauff commented Feb 18, 2023

Help us help you

  • I have checked that my issue doesn't exist yet.
  • I have tried my absolute best to reduce the problem-space and have provided the absolute smallest test-case possible.
  • I can always reproduce the issue with the provided description below.

Environment

  • Operating System version:
    Affected: Debian GNU/Linux 11 (bullseye)
    Test comparison: Microsoft Windows Server 2019 Standard
  • Game/AppID (with version if applicable): CS:GO 730/740
  • Current SourceMod version: 1.11.0.6931
  • Current SourceMod snapshot: 1.12.0.6982
  • Current Metamod:Source snapshot: 1.12.0-dev+1167

Description

Bulk creation of SDKHooks on Linux servers seems to perform disproportionately worse than on an equivalent Windows server. I have verified this by creating identical Windows & Linux test servers on the same hardware.

With the test case provided below, peak frame time from a round restart jumps from 290.30 ms to 3951.61 ms when the plugin is loaded on Linux, but only from 175.21 ms to 251.07 ms when it is loaded on Windows. I don't have exact values for when this happened on a live server, but it was enough to trigger the watchdog timer every time a round restarted on the affected map.

Problematic Code (or Steps to Reproduce)

The issue is noticeable when lots of entities are hooked in bulk; common plugin designs such as the one below can result in this happening at round start.

#include <sourcemod>
#include <sdkhooks>

public void OnEntityCreated(int entity, const char[] classname)
{
  SDKHook(entity, SDKHook_SpawnPost, Hook_SpawnPost);
}

void Hook_SpawnPost(int entity)
{
  // any code here
}

I have prepared a CS:GO test map that has 8193 entities (mostly lights) for easily testing this issue. If this test case sounds unrealistic, I assure you this exists in real maps, and it is how I discovered the issue in the first place.

Logs

  • Please attach in separate files: game output, library logs, kernel logs, and any other supporting information.
  • In case of a crash, please attach minidump or dump analyze output.

VProf output in all cases is from enabling sv_cheats and running endround once.
Linux with plugin
Linux without plugin
Windows with plugin
Windows without plugin

@asherkin
Member

Could you grab a perf report with the plugin on Linux? As we're likely talking about time spent outside the game here, VProf is unlikely to drill down usefully. Make sure you've got JIT debug metadata enabled in core.cfg; the default should work fine, just check that you can see smx frames in the report.

@Vauff
Contributor Author

Vauff commented Feb 18, 2023

> Could you grab a perf report with the plugin on Linux? As we're likely talking outside of the game here VProf is unlikely to drill down usefully. Make sure you've got JIT debug metadata enabled in core.cfg, the default should work fine, just check you can see smx frames in the report.

perf.data

@tsuza
Contributor

tsuza commented Feb 18, 2023

This reminds me of this: nosoop/SM-TFCustAttr#9 (comment). Same SDKHook tanking performance (on Linux, on top of that).

@Kenzzer
Member

Kenzzer commented Feb 20, 2023

This is a massive shot in the dark, but I believe this is at least partially, if not completely, responsible for the time loss:
https://github.com/alliedmodders/metamod-source/blob/80d3f9c14d5a78c3e93b9822e983eace41981d4f/core/sourcehook/sh_memory.h#L225 is called whenever a hook is set up for the first time, but there are other scenarios where it's also called, such as when the hooked vtable is simply different to begin with, which will almost certainly happen if entities are hooked independently of their classnames, and thus of their vtable pointers.

Previous performance issues surrounding sourcehook SetMemAccess were partially addressed in #1523

@KyleSanderson
Member

> This is a massive shot in the dark, but i believe this is at least partially if not completely responsible for the time loss: https://github.com/alliedmodders/metamod-source/blob/80d3f9c14d5a78c3e93b9822e983eace41981d4f/core/sourcehook/sh_memory.h#L225 which is called whenever a hook is setup for the first time, but there are other scenarios where it's also called, like if the vtable hooked is just different to begin with, which will most certainly happen if entities are hooked independently of their classnames, and thus vtable ptr.
>
> Previous performance issues surrounding sourcehook SetMemAccess were partially addressed in #1523

There's been a lot of changes to mprotect in the last year around performance. Which kernel are you running?

@chrb22

chrb22 commented Feb 24, 2023

I've also seen very bad performance from SDKHooks, though that was something that developed over time, and I'm not sure it's the same problem as this issue. It would happen when SetTransmit hooks got created/removed on the first creation/last destruction of a classname currently in the world. Shooting a single rocket as Soldier in TF2 would end up making the net_graph var spike to well over 15 ms after some hours (look for tf2_hide.smx in vprof or perf).

However, the lag wasn't just because of the SetTransmit creations/destructions, since it only happened when my plugin was running, which isn't the plugin that used the laggy hooks (my plugin doesn't use SDKHooks). After testing different things, it turned out to be StoreToAddress calls with updateMemAccess cluelessly left at its default value that slowed the hooks down. I'm not writing to a vtable or the like, though; instead it's a chunk of memory allocated with g_pMemAlloc->Alloc in PackedEntity::AllocAndCopyPadded (called from CGameServer::SendClientMessages->SV_ComputeClientPacks->PackEntities_Normal->SV_PackEntity).

From my understanding, both StoreToAddress with updateMemAccess = true and SDKHooks update page permissions, so I got the idea to use updateMemAccess = false on my calls instead. That ended up fixing the lag, but I don't understand how the StoreToAddress calls would affect SDKHooks in any way, given the memory I write to is allocated by the game and already writable. The fact that I'm using a DHooks detour on CGameServer::SendClientMessages might also be relevant, because I don't recall the lag happening when that detour wasn't active. I might be remembering wrong, though, as the test results for narrowing down the lag weren't written down.

@KyleSanderson
Member

Hmm. While a kernel version hasn't been provided yet (Debian 11 is 5.10 stock?), maple trees came in for LTS 6.1, which, admittedly, is relatively recent, but looks directly at the area where the issue is (though perhaps not the cause). There are other kernels with various changes to this (so as long as we're not talking about a 2.x/3.x/4.x kernel, we're at least in a range).

The current thought from the perf (thank you) is this is unrelated to SDKHooks, but in SourceHook. When SetMemAccess is called, we run mprotect changing one of the page bits to W from R_XP, which per docs from (over) 2 decades ago creates a new VMA mapping. There have been various merge schemes and similar when the page bits are the same, which we presently don't restore with SourceHook.

As TF inherently is heavier than most with entity creation (presumably a quick match server with CS:GO could be in the same tier over time - or with 8k entities being 4x as bad as historically possible), this could accelerate a problem with VMA allocations causing fragmentation in the page tables. Windows may not be suffering from the same issue because the pages may already be RWX (or their equivalent with virtualprotect).

The question now is: is this on the right track... This is how SourceHook checks the page bits https://github.com/alliedmodders/metamod-source/blob/d5030d06123b56fb96ab447fab8a508fc7ccab49/core/sourcehook/sh_memory.h#L56

If you're able to reproduce the issue, you should be able to cat /proc/$pid/maps and watch the allocations over time. So if the initial map causes it, what happens after 6 level changes with the same level? Are there hundreds or thousands added each level change, perpetually? Does it do something different without MM:S present? Does nothing change?

@Vauff
Contributor Author

Vauff commented Feb 25, 2023

Yes this was tested on kernel 5.10, I'll follow up in a bit with additional testing.

@Vauff
Contributor Author

Vauff commented Feb 27, 2023

> hmm. While a kernel version hasn't been provided yet (Debian 11 is 5.10 retail?), Maple Trees came in for LTS 6.1 which, admittedly, is relatively recent, but looks directly at this from where the issue is (but perhaps not the cause). There's other kernels with various changes to this (so as long as we're not talking a 2.x/3.x/4.x kernel we're at least in a range).

On some weaker test hardware, I'm only seeing vprof peak frame time drop from 12310.81 ms to 10710.14 ms when upgrading the Debian 11 kernel from 5.10 to 6.1.

> If you're able to reproduce the issue, you could be able to cat /proc/$pid/maps and see the allocations over time. So if the initial map causes it, what happens with 6 level changes over time with the level. Are there hundreds or thousands added each level change perpetually? does it do something different without MM:S present? Does nothing change?

The number of allocations seems to stay pretty consistent across multiple map and round changes. Running the server without MM:S does reduce the number of them, but that seems expected given no extra binaries are loaded anymore. Unloading MM:S does of course also stop the lag (same as removing the plugin).

@Kenzzer
Member

Kenzzer commented Feb 27, 2023

> hmm. While a kernel version hasn't been provided yet (Debian 11 is 5.10 retail?), Maple Trees came in for LTS 6.1 which, admittedly, is relatively recent, but looks directly at this from where the issue is (but perhaps not the cause). There's other kernels with various changes to this (so as long as we're not talking a 2.x/3.x/4.x kernel we're at least in a range).

Debian 9.13 on 4.19.0-0.bpo.9-amd64 and Debian 10 on 4.19.0-22-cloud-amd64.
Note: I'm well aware Debian 9 is end of life, but I'm not the owner of that machine, only the latter which is awaiting an upgrade.

> Windows may not be suffering from the same issue because the pages may already be RWX (or their equivalent with virtualprotect).

> The question now is: is this on the right track... This is how SourceHook checks the page bits https://github.com/alliedmodders/metamod-source/blob/d5030d06123b56fb96ab447fab8a508fc7ccab49/core/sourcehook/sh_memory.h#L56

Although it's not quite what you describe, the Discord user @nosoop attempted to skip GetPageBits altogether and always set RWX, and didn't notice a significant performance increase. Although that could be due to what you say here:

> we run mprotect changing one of the page bits to W from R_XP, which per docs from (over) 2 decades ago creates a new VMA mapping.

And it recreates everything even though no bits changed on the subsequent calls?

Here's their message

> nosoop — 20/02/2023 10:15
> my baseless suspicion is that GetPageBits is not good
>
> I wonder what the consequences are if you skip that operation and just set rwx on the memory page
> [later...]
> doesn't look like there's a significant change

@PeakKS
Contributor

PeakKS commented Apr 17, 2023

Think I mostly fixed it (hardware dependent). Using pkeys I brought this from around 6 seconds to around 1.

  1. Patch SetMemAccess in sh_memory to the following:
static int pkey = pkey_alloc(0, 0);
return pkey_mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access, pkey);
  2. Recompile both metamod and sourcemod with this
  3. Add this plugin for a basic test:
#include <sourcemod>
#include <sdkhooks>

float starttime = 0.0;
float endtime = 0.0;

public void OnMapStart() {
    float elapsed = endtime - starttime;
    PrintToServer("-------------------------------------\n"
               ..."Entity creation occurred in %f seconds\n"
               ..."-------------------------------------", elapsed);
}

public void OnEntityCreated(int entity, const char[] classname)
{
  if (starttime == 0.0)
    starttime = GetEngineTime();
  endtime = GetEngineTime();
  SDKHook(entity, SDKHook_SpawnPost, Hook_SpawnPost);
}

void Hook_SpawnPost(int entity)
{
}
  4. Start server with the test map from the original post.

Not sure of the best way for us to implement this, seeing as pkeys are a relatively new addition to x86 and I can't even find a comprehensive list of supported CPUs.

@danielstm

danielstm commented Apr 17, 2023

Sounds great, I need to test this.
But I'm not exactly sure if I'm doing it correctly, as pkey_mprotect returns an int.
We should probably replace this line:
return mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access)==0 ? true : false;
with
static int pkey = pkey_alloc(0, 0); return pkey_mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access, pkey);
or with
static int pkey = pkey_alloc(0, 0); return pkey_mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access, pkey)==0 ? true : false;

@PeakKS
Contributor

PeakKS commented Apr 17, 2023

> Sounds great, I need to test this. But I'm not exactly sure if I'm doing it correct as pkey_mprotect is returning an int. We should probably replace this line: return mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access)==0 ? true : false; with static int pkey = pkey_alloc(0, 0); return pkey_mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access, pkey); or with static int pkey = pkey_alloc(0, 0); return pkey_mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access, pkey)==0 ? true : false;

Whoops, you are correct. I'm surprised that didn't cause any errors; I guess there is no error checking from the caller. Anyway, I retested, and adding this back in doesn't affect the speedup.

Sidenote: it seems to be possible to just use the default pkey (0), so there is no need for the pkey_alloc.

@dvander
Member

dvander commented Apr 17, 2023

If we're supposing that mprotect is killing performance here, I'd assume it primarily affects two cases of hooks:

  1. Changing the hook count on a vtable from 0 to 1, or 1 to 0, very frequently.
  2. Hooking many instances that have many different vtables (such as entities).

Is that what people are observing?

pkey_mprotect looks great, but as you pointed out, it's new hardware only, and even then maybe Intel only.

I'm going to ask a dumb question. Why do we need to bother setting back to r-x in the first place?

@PeakKS
Contributor

PeakKS commented Apr 17, 2023

>> Sounds great, I need to test this. But I'm not exactly sure if I'm doing it correct as pkey_mprotect is returning an int. We should probably replace this line: return mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access)==0 ? true : false; with static int pkey = pkey_alloc(0, 0); return pkey_mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access, pkey); or with static int pkey = pkey_alloc(0, 0); return pkey_mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access, pkey)==0 ? true : false;

> Whoops you are correct. I'm surprised that didn't cause any errors, I guess there is no error checking from the caller. Anyways I retested and adding this back in doesn't affect the speedup.

> Sidenote, it seems to be possible to just use the default pkey (0), so there is no need for the pkey_alloc.

Okay, I realized I messed up AGAIN; I actually tested
return pkey_mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access, pkey)? true : false;
instead of
return pkey_mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access, pkey)==0 ? true : false;

Now the pkey version seems to have the same performance. Interestingly, only 11 of the hooks are actually fired (with the correct implementation). With the bad implementation, one of the hooks is still fired though.

@KyleSanderson
Member

So, summarizing the findings from Discord...

At least with the lights example, the single light entity is notified through SDKHooks as created, spawned, and then immediately destroyed, with the same entity index cycled through 8000 times, which stresses this code back to SH_ADD/REMOVE_HOOK-equivalent performance. This is confirmed on Linux CS:GO. Windows is another matter, but I suspect it doesn't go through the same cycle, so the comparison is invalid (even if it does, this is the setup/teardown cost of hooking a vtable on each platform).

One possible option is to not release the vtable hook when the last entity disappears. This removes the setup/teardown cost at the expense of memory and lookup performance (albeit, because it's all vectorized, the lookup should take around 1 µs or even less). It might get weird quickly with things like CEntity should it be removed, but if we reset the table at level end, that would paper over the problem.

Open to thoughts, but this is actually behaving as it did before the vtable hooking changes (albeit worse now, because the entity limit has been 4x'd since I was having issues all those years ago).

@danielstm

>> Sounds great, I need to test this. But I'm not exactly sure if I'm doing it correct as pkey_mprotect is returning an int. We should probably replace this line: return mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access)==0 ? true : false; with static int pkey = pkey_alloc(0, 0); return pkey_mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access, pkey); or with static int pkey = pkey_alloc(0, 0); return pkey_mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access, pkey)==0 ? true : false;

> Whoops you are correct. I'm surprised that didn't cause any errors, I guess there is no error checking from the caller. Anyways I retested and adding this back in doesn't affect the speedup.
> Sidenote, it seems to be possible to just use the default pkey (0), so there is no need for the pkey_alloc.

> Okay I realized I messed up AGAIN, and realized I tested return pkey_mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access, pkey)? true : false; instead of return pkey_mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access, pkey)==0 ? true : false;

> Now pkey version seems to have the same performance. Interestingly only 11 of the hooks are actually fired (with the correct implementation). With the bad implementation one of the hooks is still fired though.

This leads to the question of whether this is needed or not. Did the test plugin work correctly?

@PeakKS
Contributor

PeakKS commented Apr 18, 2023

> Leading to the question if this is needed or not? Did the test plugin work correctly?

The test works, but the thing being tested wasn't. pkey_mprotect does not appear to offer much, if any, advantage over normal mprotect.

If anyone wants to test out not deleting the vtable hooks, just edit extension.cpp from the sdkhooks extension, deleting all occurrences of:

if (pawnhooks.size() == 0)
{
  delete vtablehooklist[listentry];
  vtablehooklist.erase(vtablehooklist.begin() + listentry);
  listentry--;
}

from all the Unhook functions. That should do it; it doesn't break anything, afaict.

@Kenzzer
Member

Kenzzer commented Apr 18, 2023

> I'm going to ask a dumb question. Why do we need to bother setting back to r-x in the first place?

That question was raised in the Discord when the issue was opened, but nobody has any idea. I assume you're implying we leave the memory writable? I personally don't see harm in this; any MM:S plugins or SourceMod plugins can write to memory-protected areas if they want to anyway. Of course, keeping the access violation crash would be desirable, but is it worth the performance trade-off?

@PeakKS
Contributor

PeakKS commented Apr 18, 2023

New dumb question: where is any code setting protections back to r-x? I don't see anything in MM or SM, and checking /proc/pid/maps for srcds_linux I see plenty of pages left as rwx.

@KyleSanderson
Member

You would patch them back when the hook is released. However, as there's nothing inherently wrong with mprotect here, the gains will be very minimal.

@KaelaSavia
Contributor

KaelaSavia commented Dec 5, 2023

The issue still needs some kind of bandaid applied, especially in TF2, where entities such as syringe gun needles get created and destroyed at a fast rate, and SDKHooking them can quickly eat up CPU resources.

Additionally, a similar thing occurs with tf_weapon weapon entities, which are hooked (albeit not by SDKHooks) by default through criticals.cpp


@Vauff
Contributor Author

Vauff commented Jun 10, 2024

With #2094 merged fixing this in SDKHooks, I suppose this can be closed now. The root issue still exists in SourceHook (despite a ~50% speedup in dev branch here), but that would be more worthy of a Metamod issue, if there's even anything that can still be done about it.

@Vauff Vauff closed this as completed Jun 10, 2024