Bulk SDKHooks perform significantly worse on Linux than Windows #1935
Comments
Could you grab a |
|
This reminds me of this: nosoop/SM-TFCustAttr#9 (comment). Same SDKHook tanking performance (on Linux on top of that). |
This is a massive shot in the dark, but I believe this is at least partially, if not completely, responsible for the time loss: Previous performance issues surrounding sourcehook |
There have been a lot of changes to mprotect in the last year around performance. Which kernel are you running? |
I've also seen very bad performance from SDKHooks, though that was something that developed over time and I'm not sure it's the same problem as this issue. It would happen when However the lag wasn't just because |
Hmm. While a kernel version hasn't been provided yet (Debian 11 is 5.10 retail?), Maple Trees came in for LTS 6.1 which, admittedly, is relatively recent, but looks directly at this from where the issue is (but perhaps not the cause). There are other kernels with various changes to this (so long as we're not talking about a 2.x/3.x/4.x kernel, we're at least in range).
The current thought from the perf (thank you) is that this is unrelated to SDKHooks and lies in SourceHook instead. When SetMemAccess is called, we run mprotect, changing one of the page bits to W from R_XP, which per docs from (over) 2 decades ago creates a new VMA mapping. There have been various merge schemes and the like for when the page bits are the same, but we presently don't restore the bits with SourceHook. As TF is inherently heavier than most with entity creation (presumably a quick match server with CS:GO could be in the same tier over time, or with 8k entities being 4x as bad as historically possible), this could accelerate a problem with VMA allocations causing fragmentation in the page tables. Windows may not be suffering from the same issue because the pages may already be RWX (or their equivalent with VirtualProtect).
The question now is: is this on the right track...
This is how SourceHook checks the page bits: https://github.com/alliedmodders/metamod-source/blob/d5030d06123b56fb96ab447fab8a508fc7ccab49/core/sourcehook/sh_memory.h#L56
If you're able to reproduce the issue, you could be able to |
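To make the VMA theory above easier to check, here is a minimal Linux-only sketch (not SourceHook code) that maps three pages, changes the protection of only the middle one with mprotect, and counts the entries in /proc/self/maps before and after. The names `CountMappings`, `VmaCounts` and `MeasureSplit` are made up for illustration, and it uses a plain read-only anonymous mapping rather than r-x code pages, on the assumption that the split/merge behaviour is analogous.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Count the mappings (VMAs) of this process as listed in /proc/self/maps.
static int CountMappings()
{
    std::ifstream maps("/proc/self/maps");
    std::string line;
    int count = 0;
    while (std::getline(maps, line))
        ++count;
    return count;
}

// Hypothetical result type: each field stays -1 if that step failed.
struct VmaCounts
{
    int before;   // three pages mapped with uniform protection
    int split;    // only the middle page made writable
    int restored; // middle page set back to the original protection
};

static VmaCounts MeasureSplit()
{
    VmaCounts counts = { -1, -1, -1 };
    const long pageSize = sysconf(_SC_PAGESIZE);
    if (pageSize <= 0)
        return counts;
    const size_t page = static_cast<size_t>(pageSize);

    // Warm-up read so the stream's own allocations do not skew later counts.
    CountMappings();

    void *base = mmap(nullptr, 3 * page, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return counts;
    char *middle = static_cast<char *>(base) + page;

    counts.before = CountMappings();

    // Changing one page in the middle forces the kernel to split the mapping.
    if (mprotect(middle, page, PROT_READ | PROT_WRITE) == 0)
        counts.split = CountMappings();

    // Restoring the original bits gives the kernel a chance to merge again.
    if (mprotect(middle, page, PROT_READ) == 0)
        counts.restored = CountMappings();

    munmap(base, 3 * page);
    return counts;
}
```

Calling `MeasureSplit()` and printing the three fields should show more entries in `split` than in `before`. Whether `restored` drops back depends on the kernel's merge logic, which is the part in question here.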
Yes this was tested on kernel 5.10, I'll follow up in a bit with additional testing. |
On some weaker test hardware, I'm only seeing vprof peak frame time drop from
The amount of allocations seems to stay pretty consistent throughout multiple map and round changes. Running the server without MM:S does reduce the amount of them, but that seems expected given no extra bins are loaded anymore. Unloading MM:S does also of course stop the lag (same as removing the plugin). |
Debian
Although not what you describe, the Discord user @nosoop attempted to skip GetPageBits altogether and always set RWX, and didn't notice a significant performance increase. Although that could be due to what you say here
And it recreates everything even though no bits changed on the subsequent calls? Here's their message
|
Think I mostly fixed it (hardware dependent). Using pkeys I brought this from around 6 seconds to around 1.
static int pkey = pkey_alloc(0, 0);
return pkey_mprotect(SH_LALIGN(addr), len + SH_LALDIF(addr), access, pkey);
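#include <sourcemod>
#include <sdkhooks>

// Engine time at which the first and the most recent entity were created.
float starttime = 0.0;
float endtime = 0.0;

public void OnMapStart() {
    // Report the time between the first and the last entity creation seen so far.
    float elapsed = endtime - starttime;
    PrintToServer("-------------------------------------\n"
        ..."Entity creation occurred in %f seconds\n"
        ..."-------------------------------------", elapsed);
}

public void OnEntityCreated(int entity, const char[] classname)
{
    if (starttime == 0.0)
        starttime = GetEngineTime();
    endtime = GetEngineTime();
    // Hook every entity so that bulk creation exercises the hook setup path.
    SDKHook(entity, SDKHook_SpawnPost, Hook_SpawnPost);
}

// Intentionally empty: only the cost of installing the hook is being measured.
void Hook_SpawnPost(int entity)
{
}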
Not sure of the best way for us to implement this, seeing as pkeys are a relatively new addition to x86 and I can't even find a comprehensive list of supported CPUs. |
Sounds great, I need to test this. |
Whoops you are correct. I'm surprised that didn't cause any errors, I guess there is no error checking from the caller. Anyways I retested and adding this back in doesn't affect the speedup. Sidenote, it seems to be possible to just use the default pkey (0), so there is no need for the pkey_alloc. |
If we're supposing that mprotect is killing performance here, I'd assume it primarily affects two cases of hooks:
Is that what people are observing? pkey_mprotect looks great, but as you pointed out, new hardware only, and even then maybe Intel only. I'm going to ask a dumb question. Why do we need to bother setting back to r-x in the first place? |
Okay I realized I messed up AGAIN, and realized I tested Now pkey version seems to have the same performance. Interestingly only 11 of the hooks are actually fired (with the correct implementation). With the bad implementation one of the hooks is still fired though. |
So, summarizing the findings from Discord... At least with the lights example, the single light entity is notified through SDKHooks as created, spawned, and then immediately destroyed, with the same entity index cycled through 8000 times, which stresses this code back to the SH_ADD/REMOVE_HOOK equivalent performance. This is confirmed on Linux CS:GO. Windows is another matter, but I suspect it doesn't go through the same cycle, so the comparison is invalid (even if it does, this is the setup/teardown cost to hook a vtable on each platform).
One possible option is to not release the vtable hook when the last entity disappears. This removes the setup/teardown cost at the expense of memory and lookup performance (albeit because it's all vectorized, it should be like 1us or even less). It might get weird quickly with things like CEntity should it be removed, but if we reset the table at levelend it would paper over that problem.
Open to thoughts, but this is actually behaving as it did before the vtable hooking changes (albeit worse now, because the entity limit has been 4x'd since I was having issues all those years ago). |
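As a rough illustration of the "don't release the vtable hook" option above, the following sketch counts how many times the expensive setup step would run with and without retaining empty hook lists. `HookEntry`, `HookManager` and `SimulateChurn` are hypothetical stand-ins, not the actual SDKHooks types, and the real vtable patching is reduced to a counter.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical stand-in for a per-vtable hook list.
struct HookEntry
{
    int liveHooks = 0; // entities currently hooked on this vtable
};

class HookManager
{
public:
    explicit HookManager(bool retainEmpty) : m_retainEmpty(retainEmpty) {}

    void Hook(const void *vtable)
    {
        auto it = m_entries.find(vtable);
        if (it == m_entries.end())
        {
            // Stand-in for the costly part: patching the vtable.
            ++m_setups;
            it = m_entries.emplace(vtable, HookEntry()).first;
        }
        ++it->second.liveHooks;
    }

    void Unhook(const void *vtable)
    {
        auto it = m_entries.find(vtable);
        if (it == m_entries.end() || it->second.liveHooks == 0)
            return;
        // Only tear down immediately when empty lists are not retained.
        if (--it->second.liveHooks == 0 && !m_retainEmpty)
            m_entries.erase(it);
    }

    // Reset the whole table at level end, as suggested in the comment above.
    void OnLevelEnd() { m_entries.clear(); }

    int Setups() const { return m_setups; }
    std::size_t Size() const { return m_entries.size(); }

private:
    bool m_retainEmpty;
    int m_setups = 0;
    std::unordered_map<const void *, HookEntry> m_entries;
};

// Create and destroy one entity of the same class `cycles` times and
// return how many vtable setups that required.
static int SimulateChurn(bool retainEmpty, int cycles)
{
    static const int fakeVtable = 0; // address used as a fake vtable key
    HookManager manager(retainEmpty);
    for (int i = 0; i < cycles; ++i)
    {
        manager.Hook(&fakeVtable);
        manager.Unhook(&fakeVtable);
    }
    manager.OnLevelEnd();
    return manager.Setups();
}
```

With 8000 cycles, `SimulateChurn(false, 8000)` performs 8000 setups while `SimulateChurn(true, 8000)` performs one. The trade-off is that the retained entry stays in memory until `OnLevelEnd()` runs.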
Leading to the question if this is needed or not? Did the test plugin work correctly? |
Test works but the thing that was being tested wasn't. pkey_mprotect does not appear to offer much, if any, advantage over normal mprotect. If anyone wants to test out not deleting the vtable hooks, just edit extension.cpp from the sdkhooks extension, deleting all occurrences of:
if (pawnhooks.size() == 0)
{
    delete vtablehooklist[listentry];
    vtablehooklist.erase(vtablehooklist.begin() + listentry);
    listentry--;
}
from the all the |
That question was raised in the Discord when the issue was opened, but nobody has any idea. I assume you're suggesting leaving the memory writable? I personally don't see harm in this; any mm plugins or sourcemod plugins can write to protected memory areas if they want to anyway. Ofc keeping the access violation crash would be desirable, but is it worth the performance trade-off? |
New dumb question, where is any code setting protections back to r-x? I don't see anything in mm or sm. And checking /proc/pid/maps for srcds_linux I see plenty of pages left as rwx. |
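For anyone wanting to repeat the /proc/pid/maps check above without eyeballing the file, here is a small sketch that counts mappings whose permission field starts with rwx. `IsRwxLine` and `CountRwxMappings` are illustrative names. It relies only on the documented maps line layout, where the second whitespace-separated field holds the permissions.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// True if a maps line describes a readable, writable and executable mapping.
// Expected layout: "start-end perms offset dev inode [path]".
static bool IsRwxLine(const std::string &line)
{
    std::istringstream fields(line);
    std::string range, perms;
    if (!(fields >> range >> perms) || perms.size() < 3)
        return false;
    return perms[0] == 'r' && perms[1] == 'w' && perms[2] == 'x';
}

// Count rwx mappings in a maps file such as "/proc/<pid>/maps".
// Returns -1 if the file cannot be opened.
static int CountRwxMappings(const std::string &path)
{
    std::ifstream maps(path);
    if (!maps)
        return -1;
    std::string line;
    int count = 0;
    while (std::getline(maps, line))
    {
        if (IsRwxLine(line))
            ++count;
    }
    return count;
}
```

Pointing `CountRwxMappings` at the maps file of the running srcds_linux process (which needs permission to read it) before and after loading the plugin would show whether pages are being left writable.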
You would patch them back when the hook is released. However, as there's nothing wrong with mprotect inherently here, the gains will be very minimal. |
The issue still needs some kind of band-aid, especially in TF2, where entities such as syringe gun needles are created and destroyed at a fast rate and SDKHooking them can quickly eat up CPU resources. Additionally, a similar thing occurs with tf_weapon entities, which are hooked by default (albeit not by SDKHooks) through criticals.cpp |
Help us help you
Environment
Affected:
Debian GNU/Linux 11 (bullseye)
Test comparison:
Microsoft Windows Server 2019 Standard
Description
Bulk creation of SDKHooks on Linux servers seems to perform disproportionately worse than on an equivalent Windows server. I have verified this by creating identical Windows & Linux test servers on the same hardware.
With the test case provided below, peak frame time from round restart jumps from 290.30 ms to 3951.61 ms when the plugin is loaded on Linux, and from 175.21 ms to 251.07 ms when the plugin is loaded on Windows. I don't have exact values for when this happened on a live server, but it was enough to trigger the watchdog timer every time a round restarted on the affected map.
Problematic Code (or Steps to Reproduce)
The issue is noticeable when lots of entities are hooked in bulk. Common plugin designs such as the one seen below can result in this happening at round start.
I have prepared a CS:GO test map that has 8193 entities (mostly lights) for easily testing this issue. If this test case sounds unrealistic, I assure you this exists in real maps, and is how I even discovered the issue initially.
Logs
VProf output in all cases is from enabling sv_cheats and running endround once.
Linux with plugin
Linux without plugin
Windows with plugin
Windows without plugin