
Transparent Huge Page support #5816

Open · ryao opened this issue May 4, 2022 · 18 comments
Labels: Feature Request (New feature or request)

ryao commented May 4, 2022

Feature Request

I confirm:

  • that I haven't found another request for this feature.
  • that I have checked whether there are updates for my system available that
    contain this feature already.

Description

Add an environment variable for turning on transparent huge page support on the heap via madvise(addr, size, MADV_HUGEPAGE).
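
A minimal sketch of what such a toggle could look like (editor's illustration: the PROTON_THP variable and the alloc_region() wrapper are hypothetical, not existing Proton code):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Reserve anonymous memory and, when the (hypothetical) PROTON_THP
 * environment variable is set to 1, hint the kernel to back the
 * region with transparent huge pages. madvise() is advisory, so a
 * failure here is harmless. */
static void *alloc_region(size_t size)
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
	               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	const char *env = getenv("PROTON_THP");
	if (env && strcmp(env, "1") == 0)
		madvise(p, size, MADV_HUGEPAGE);
	return p;
}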

Additional changes could be made to further improve transparent huge page support, such as:

  1. Modifying the heap to allocate huge-page-aligned regions.
  2. Implementing MEM_LARGE_PAGES in VirtualAlloc in Wine.

Justification [optional]

This is Linux specific, so I doubt Wine would accept the patch.

This should be done because fewer TLB misses from the use of transparent huge pages should slightly improve CPU-bound performance in games running under Proton. Implementing an option to turn it on for evaluation purposes would probably be best; that would allow the community to gather data to determine whether it should be on by default.

Risks [optional]

Setting it system-wide is said to harm performance:

https://www.reddit.com/r/linux_gaming/comments/uhfjyt/comment/i75z26g/

It is possible that this could cause a performance regression.

References [optional]

There are claims on reddit that games can benefit from this:

https://www.reddit.com/r/linux_gaming/comments/uhfjyt/underrated_advice_for_improving_gaming/

The manual page states:

This feature is primarily aimed at applications that use large mappings of data and access large regions of that memory at a time

https://man7.org/linux/man-pages/man2/madvise.2.html

This describes a video game.

Microsoft documentation on Windows MEM_LARGE_PAGES:

https://docs.microsoft.com/en-us/windows/win32/memory/large-page-support

Kernel documentation on THP support:

https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html

kisak-valve added the Feature Request label May 4, 2022
ryao commented May 4, 2022

I wrote a really small library to test this feature without modifying Proton:

#include <sys/mman.h>

/* Runs at library load time (before main) when LD_PRELOADed. */
__attribute__((constructor)) static void init(void)
{
	/* 0x800000000000 = 2^47 bytes: blanket the entire x86-64 user
	 * address space, since we do not know where the heap is. */
	madvise(0, 0x800000000000, MADV_HUGEPAGE);
}

It can be compiled with gcc -shared -fPIC -o thp.so thp.c. Then you can use LD_PRELOAD=/path/to/thp.so %command% to launch a game with it. This assumes a 64-bit system, such that the system gcc produces a 64-bit binary; the game therefore needs to be 64-bit for this to run.

I was not sure where the heap was, so I just told the kernel to use huge pages everywhere. A quick test is not showing any use of huge pages via grep -i huge /proc/meminfo. My guess is that either the pages need to already be allocated when this function is called, or the system needs to have huge pages pre-allocated (which was the case for using THP via QEMU in the past).

In any case, this naive approach did not work. I still think the proposal on reddit is a potentially good idea, but it needs further investigation. :/

ryao commented May 5, 2022

It turns out that it is necessary to preallocate huge pages via hugeadm from libhugetlbfs. On my system, I ran hugeadm --pool-pages-min 2MB:5000, which will allocate ~10GB to huge pages (my system can handle it).

Someone on reddit pointed out that there is already libhugetlbfs for this. However, some testing found that it did nothing (even after mounting hugetlbfs). I tried using HUGETLB_FORCE_ELFMAP=yes HUGETLB_MORECORE=yes HUGETLB_NO_PREFAULT=yes after reading the man page, but again, nothing happened according to /proc/meminfo. Setting HUGETLB_DEBUG=3 revealed multiple problems:

libhugetlbfs [vserver:21100]: INFO: Segment 2's unaligned memsz is too small: 0x45c8 < 0x45000
libhugetlbfs [vserver:21100]: INFO: Segment 3's unaligned memsz is too small: 0x8f21 < 0x40000
libhugetlbfs [vserver:21100]: INFO: Segment 4's unaligned memsz is too small: 0x4c88 < 0x37000
libhugetlbfs [vserver:21100]: INFO: Segment 5's unaligned memsz is too small: 0xf90 < 0x30bb0
libhugetlbfs [vserver:21100]: INFO: No segments were appropriate for remapping
libhugetlbfs [vserver:21100]: INFO: Not setting up morecore because it's not available (see issue #52).

This seems to be a known issue in glibc 2.33 and newer:

libhugetlbfs/libhugetlbfs#52

As mentioned there, glibc 2.35 has added support for huge pages as a workaround. Provided that GLIBC_TUNABLES was not disabled at build time, GLIBC_TUNABLES=glibc.malloc.hugetlb=2 will enable it. With that, I see huge pages being used when I launch Overwatch in Lutris and check at the title screen:

# grep -i huge /proc/meminfo
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:    5000
HugePages_Free:     3040
HugePages_Rsvd:      290
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        10240000 kB

However, after looking at /proc/$(pgrep -i overwatch)/numa_maps, I can see that not all allocations are backed by huge pages. Specifically, here are the counts after filtering out most mmap()'ed files:

# cat /proc/$(pgrep -i overwatch)/numa_maps | grep -viE 'file=/(u|l|v|h)' | grep kernelpagesize_kB=4 | wc
   6331   44355  428023
# cat /proc/$(pgrep -i overwatch)/numa_maps | grep -viE 'file=/(u|l|v|h)' | grep kernelpagesize_kB=2048 | wc
   1206    9648  123012

I imagine that the non-huge-page allocations are coming from Wine's memory allocator, whose allocations matter far more for games than glibc's. I guess Wine's allocator needs to be patched to leverage huge pages when available. :/

ryao commented May 5, 2022

Someone on Discord tried using glxgears to test this. I am not sure it is the best test, but it is easy to run. I repeated their test after making some minor changes (such as turning off compositing in KWin) and found a very slight improvement:

richard@vserver ~ $ /usr/bin/env __GL_SYNC_TO_VBLANK=0 glxgears
196431 frames in 5.0 seconds = 39286.027 FPS
197137 frames in 5.0 seconds = 39427.332 FPS
195896 frames in 5.0 seconds = 39179.164 FPS
^C
richard@vserver ~ $ /usr/bin/env __GL_SYNC_TO_VBLANK=0 GLIBC_TUNABLES=glibc.malloc.hugetlb=2 glxgears
196174 frames in 5.0 seconds = 39234.695 FPS
197737 frames in 5.0 seconds = 39547.320 FPS
197599 frames in 5.0 seconds = 39519.699 FPS
197471 frames in 5.0 seconds = 39494.102 FPS
^C

Wine's memory allocator needs to be patched before any tests are worth doing on Windows software running on Linux, but native software such as glxgears can have huge pages turned on via glibc (provided it is 2.35 or greater), so people could experiment with native software to see if there is an improvement. Note that anyone doing these tests would want to do the following:

  1. sudo hugeadm --pool-pages-min 2MB:1024 (or a higher number)
  2. Change the launch configuration for the game in Steam to GLIBC_TUNABLES=glibc.malloc.hugetlb=2 %command%.

Patola commented May 5, 2022

These glxgears FPS improvements, calculated from the 3 samples before and 4 samples after huge pages were activated, amount to a 0.38% improvement (39297.5076... FPS average before, 39448.954 FPS average after). Considering the risk of regression in certain cases, is this worth the effort?

aeikum (Collaborator) commented May 5, 2022

I doubt we're interested in this unless someone can demonstrate a real performance benefit.

ryao commented May 5, 2022

These glxgears FPS improvements, calculated from the 3 samples before and 4 samples after huge pages were activated, amount to a 0.38% improvement (39297.5076... FPS average before, 39448.954 FPS average after). Considering the risk of regression in certain cases, is this worth the effort?

@Patola I do not think glxgears is a good benchmark, since it is tiny and unlikely to incur many TLB misses (although we could measure that with Linux perf). You need to test something bigger, like a real game. I only posted numbers to preempt someone else posting glxgears numbers measured less carefully, after seeing that happen on Discord, even though I knew glxgears was not a great benchmark for evaluating this. To be honest, I was surprised to see anything that looked like an improvement at all. :/

That said, I would only expect a 1-3% improvement from lower TLB misses. Maybe systems with integrated graphics would benefit more.

I doubt we're interested in this unless someone can demonstrate a real performance benefit.

@aeikum What would constitute a real performance benefit? If we can consistently observe a 1% improvement somewhere that is relevant to games on Linux, would that be enough? I do not want to set unrealistic expectations. The theoretical benefits of huge pages are:

  • With 2MB pages, we avoid a pointer indirection on TLB misses, reduce the number of TLB misses and reduce the amount of memory used by the page tables by a factor of 512.
  • With 1GB pages, we avoid two pointer indirections on TLB misses, reduce the number of TLB misses and reduce the amount of memory used by the page tables by a factor of 262144.
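
(For reference, those factors are just the page-size ratios: 2 MiB / 4 KiB = 512 and 1 GiB / 4 KiB = 262144, so each huge-page entry replaces that many base-page entries.)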

I would not expect this to give more than a 1-3% performance improvement. Finding it would require that we either find a native game that demonstrates a real improvement through the glibc huge page code or patch wine so that people could test games for a real improvement.

ryao commented May 5, 2022

I did some benchmarks in both Shadow of the Tomb Raider and Civilization VI. I will not post the Shadow of the Tomb Raider data since it showed identical performance. However, half of the allocations in Shadow of the Tomb Raider still used 4K pages, so perhaps glibc's method for using huge pages did not work for the allocations that matter. I tried Civilization VI's AI benchmark under the assumption that it would use glibc's allocator for its AI decisions. This showed a slight improvement:

Civilization 6 AI average turn time at 4K pages:

Run 1: 10.56
Run 2: 10.54
Run 3: 10.50

Civilization 6 AI average turn time at 2M pages:

Run 1: 10.51
Run 2: 10.41
Run 3: 10.41

That is an average improvement of 0.8544%.

My test hardware was:

  • AMD Ryzen 7 5800X
  • 64GB 3200MHz ECC CL22 RAM
  • Asus Strix GeForce RTX 3080

I guess huge pages have a smaller impact than I thought. Maybe someone with integrated graphics like the guy in the reddit thread would see a bigger improvement. :/

ryao commented May 5, 2022

I rebooted my PC and made a last-ditch effort to test huge page support. This time, I started Shadow of the Tomb Raider after a fresh boot and ran 3 benchmark runs back to back. I then set Shadow of the Tomb Raider to start with GLIBC_TUNABLES=glibc.malloc.hugetlb=2 %command%, rebooted, ran the following commands as root from a virtual terminal before logging into my desktop, and then ran 3 more benchmark runs back to back.

hugeadm --pool-pages-min 2M:12G
hugeadm --thp-always

Here are the results:

Shadow of The Tomb Raider (4KB pages):

Run 1: 267 FPS
Run 2: 262 FPS
Run 3: 264 FPS

Shadow of the Tomb Raider (2MB huge pages):

Run 1: 274 FPS
Run 2: 268 FPS
Run 3: 271 FPS

This is a 2.52% improvement in FPS. It appears to come from Linux forcing transparent huge pages onto allocations that do not use glibc malloc, because prior tests without hugeadm --thp-always showed no difference when relying solely on GLIBC_TUNABLES=glibc.malloc.hugetlb=2 to turn on huge page support. Examining /proc/$pid/numa_maps suggests that more allocations are using huge pages thanks to hugeadm --thp-always.

Here are links to screenshots showing my graphical settings:

https://steamuserimages-a.akamaihd.net/ugc/1823405591147219865/19628FBFC8B64CD0D2B7DAD03F61260887D452D1/?imw=2048&imh=1152&ima=fit&impolicy=Letterbox&imcolor=%23000000&letterbox=true

https://steamuserimages-a.akamaihd.net/ugc/1823405591147225282/93A0FB5B73C86D21C81787BC3314119482560799/?imw=5000&imh=5000&ima=fit&impolicy=Letterbox&imcolor=%23000000&letterbox=false

I have a 4K display, but I had to change the game resolution to 1920x1080 while reducing the resolution modifier to its lowest to get the game to be 100% CPU bound.

Additionally, while I have Nvidia graphics, it seems that the Intel and AMD graphics drivers have huge page support:

https://lists.freedesktop.org/archives/dri-devel/2017-August/150732.html
https://lists.freedesktop.org/archives/intel-gfx/2017-September/139810.html

There would probably be a greater benefit on systems with integrated graphics; I suspect this could benefit the Steam Deck. We should not need to force the kernel to always use transparent huge pages to gain this performance increase; proper use of madvise() in the software stack ought to be sufficient. So far, I have failed to insert it well enough to measure a difference, but forcing everything to use transparent huge pages (as suggested on reddit) has revealed that there is a performance benefit in Shadow of the Tomb Raider (when the game is CPU bound).

Kron4ek commented May 6, 2022

For what it's worth, on my system I definitely see an improvement with huge pages enabled.

I tested CPU-bound performance in The Witcher 3 (800x600 windowed and all settings on low) on my low-end Intel Pentium G4620 and AMD RX 560 and got around 7% higher performance with transparent huge pages enabled: 102 FPS without THP vs 109 FPS with THP. My GPU is underloaded in both cases, but with THP it is slightly less underloaded.

I tested several times to be sure; with huge pages I always get a bit higher performance. Well, at least in The Witcher 3; I haven't tested other games yet.

GenocideStomper commented (quoting Kron4ek):

For what it's worth, on my system I definitely see an improvement with huge pages enabled.

Nice! What did you do to enable it?

Kron4ek commented May 6, 2022

@GenocideStomper I enabled it with:

# echo always > /sys/kernel/mm/transparent_hugepage/enabled

machinedgod commented May 11, 2022

From my own research: transparent hugepages are a different mechanism from hugepages. The latter have to be manually requested, using one of several allocators (mimalloc and others); madvise is an optional call that gives the kernel advice about how a mapping will be used. Transparent hugepages, however, are really that: transparently handled by the kernel.

From the documentation (https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html), the guaranteed way to get a transparent hugepage is to set the sys option to always and then request an aligned allocation 2M in size (I assume it's not specifically a 2M constant, but whatever /proc/meminfo shows as the hugepage size); see the sketch at the end of this comment.

Quote:

Transparent Hugepage Support maximizes the usefulness of free memory if compared to the reservation approach of hugetlbfs by allowing all unused memory to be used as cache or other movable (or even unmovable entities). It doesn’t require reservation to prevent hugepage allocation failures to be noticeable from userland. It allows paging and all other advanced VM features to be available on the hugepages. It requires no modifications for applications to take advantage of it.

You can monitor usage (again, all of this is in the kernel documentation) by grepping for AnonHugePages, and I see consistent usage on my system as I keep opening and closing applications.

So my conclusion is: the kernel knows what it's doing without manual meddling.
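
A minimal sketch of that aligned-allocation approach (editor's illustration; assumes a 2 MiB huge page size, check Hugepagesize in /proc/meminfo):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

int main(void)
{
	void *p = aligned_alloc(HPAGE_SIZE, HPAGE_SIZE);
	if (!p)
		return 1;

	/* Redundant under the "always" policy, but makes the region
	 * eligible under the "madvise" policy too. */
	madvise(p, HPAGE_SIZE, MADV_HUGEPAGE);

	/* Touch the region so it gets faulted in; AnonHugePages in
	 * /proc/meminfo should grow by 2048 kB if a huge page backs it. */
	memset(p, 1, HPAGE_SIZE);

	free(p);
	return 0;
}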

ryao commented May 11, 2022

Replying to #5816 (comment)

I came to a similar conclusion last week, but forgot to post it. There is no need to statically allocate huge pages to use huge pages.

ipr commented May 21, 2022

The thing with hugepages is that their usability/effectiveness can be affected by memory fragmentation once the system has been in use for a while. So the measurements should cover the various situations in which people actually play games (including long sessions).

Originally hugepages were implemented for use with virtual machines (IIRC), and some software has observed worse results with them for some reason (though I don't know why Hadoop would suffer from it... maybe it has been fixed?).

So it might not be a universal win for every case, and it would be beneficial to test more scenarios.
If it does give a consistent enough performance win, then it would be good to have, maybe as an option in the beginning?

Retidurc commented:

Replying to #5816 (comment)

I came to a similar conclusion last week, but forgot to post it. There is no need to statically allocate huge pages to use huge pages.

I can confirm.

Hugepages and transparent hugepages share a similar goal but are obtained differently.

THP doesn't need pre-allocated hugepage pools or complex configuration; however, THP is limited to anonymous memory regions such as heap and stack space, and it can degrade performance under high memory pressure, as the kernel spends more time finding contiguous memory regions or defragmenting memory.

With hugepages you gain more fine-grained control, and probably coverage of other types of memory region, but they require more configuration.
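
To make the difference concrete, here is a small sketch of the two paths (editor's illustration, assuming a 2 MiB huge page size):

#include <sys/mman.h>

#define SZ (2UL * 1024 * 1024)

/* hugetlbfs path: explicitly request a pre-allocated huge page.
 * Fails if no huge page pool has been configured. */
void *explicit_hugepage(void)
{
	return mmap(NULL, SZ, PROT_READ | PROT_WRITE,
	            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
}

/* THP path: map ordinary anonymous memory and merely hint the
 * kernel; no pool or extra configuration required. */
void *transparent_hugepage(void)
{
	void *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
	               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p != MAP_FAILED)
		madvise(p, SZ, MADV_HUGEPAGE);
	return p;
}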

polarathene commented:

Probably not too convincing of a sample, but may be relevant to share in this issue: https://www.youtube.com/watch?v=DSGKq5KSkPw&t=520s

A Steam Deck user talks about enabling THP and shows some graphs comparing before/after frame rates. They demonstrate a 10% improvement in the minimum frame rate (or rather the 0.1% low). It's not much in their example; it translates into roughly 2 frames.

While that is perhaps not too impressive, it might be a better way to look at the benefits beyond average / max FPS improvements.


Opt-in via madvise would be preferable to always on systems with workloads where THP negatively affects software.

  • Databases don't seem to be hit too heavily, and that concern is more relevant to server-only deployments.

  • On workstations and personal systems, software like VMware is known to frequently encounter freezes/stalls of several seconds or longer due to kcompactd activity.

    I have a system with roughly a third of the available 32GB of memory allocated (and another third used as disk cache) and no disk swap, and it still ran into this issue (the VM has 4GB of memory assigned, with mostly idle CPU and low memory usage, but often becomes unresponsive from simple interactions in any app in the VM unless I drop caches or reboot). The host CPU ramps up due to kcompactd activity.

Beyond that, some distros default to madvise, but most users can just enable THP always if it's worthwhile to them. I assume the minor improvements wouldn't justify the effort required to support opt-in via madvise within Steam?

pchome (Contributor) commented Aug 21, 2023

MADV_HUGEPAGE

It's GLIBC_TUNABLES=glibc.malloc.hugetlb=1 for madvise systems then, not 2.
No additional configuration needed.

Tunable: glibc.malloc.hugetlb

This tunable controls the usage of Huge Pages on malloc calls. The default value is 0, which disables any additional support on malloc.

Setting its value to 1 enables the use of madvise with MADV_HUGEPAGE after memory allocation with mmap. It is enabled only if the system supports Transparent Huge Page (currently only on Linux).

Setting its value to 2 enables the use of Huge Page directly with mmap with the use of MAP_HUGETLB flag. The huge page size to use will be the default one provided by the system. A value larger than 2 specifies huge page size, which will be matched against the system supported ones. If provided value is invalid, MAP_HUGETLB will not be used.

Run watch grep -i huge /proc/meminfo before launching the app and you will see AnonHugePages change.
Also, watch --differences grep thp /proc/vmstat should show some stats.

A value of 2 does nothing on systems without a configured hugetlb pool.
