Add support for memory mappings with large pages #7977

lexprfuncall · 2023-12-21T23:01:50Z

This change extends the memory managers in Erlang/OTP to
optionally map virtual memory on Linux using large pages. Large
pages make more efficient use of the TLB by reducing the dynamic
frequency of TLB misses thereby increasing application
performance. The effect of reducing TLB misses can be dramatic.
On our server workloads we have observed >10% performance
improvements simply from enabling mappings with large pages.

The opportunity for such an improvement comes from the growing
disparity between the size of the TLB and main memory. The
typical 64-bit x86 has a 64-entry L1 data TLB and a 128-entry L1
instruction TLB, addressing 256KiB and 512KiB of data and text
respectively, at 4KiB per entry. This is 4-5 orders of magnitude
smaller than the amount of memory installed in the typical server
of today. On a TLB miss the processor must perform an
address translation by traversing a multi-level page table in main
memory, a costly operation. This overhead can largely be avoided
by using a larger page size. For mappings using a 2MiB or 1GiB
page in lieu of a 4KiB page the L1 TLB can address 32MiB or 4GiB,
respectively, a more meaningful fraction of typical working sets.
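
The reach figures above are simply entry count times page size. A minimal sketch of that arithmetic in C follows; `tlb_reach` is a hypothetical helper, not part of this change. The 4KiB cases use the entry counts quoted above, while the 16-entry and 4-entry counts for 2MiB and 1GiB pages are only what the 32MiB and 4GiB figures imply and are an assumption here.

```c
#include <assert.h>
#include <stdint.h>

#define KiB (UINT64_C(1) << 10)
#define MiB (UINT64_C(1) << 20)
#define GiB (UINT64_C(1) << 30)

/* Bytes of address space a TLB can translate without missing:
 * the number of entries times the page size mapped by each entry. */
uint64_t tlb_reach(uint64_t entries, uint64_t page_size)
{
    assert(page_size != 0);
    return entries * page_size;
}
```

For example, `tlb_reach(64, 4 * KiB)` gives 256KiB for the L1 data TLB and `tlb_reach(128, 4 * KiB)` gives 512KiB for the L1 instruction TLB.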

To take advantage of large pages in Erlang, we identified three
classes of virtual memory allocations that would benefit from
large pages: the .text segment, heap allocations, and JIT
allocations. While there are many strategies for using large
pages on Linux, we chose to use Transparent Huge Pages (THP), which are
the most flexible. Briefly put, in order to use THP we arranged
for all of these classes of virtual memory allocations to be
aligned and sized to a multiple of a large page and performed a
system call that advised the kernel to use large pages for the
mapping.
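
As a rough illustration of the "align, size, then advise" step, the sketch below trims a region to its large-page-aligned interior and calls madvise(2) with `MADV_HUGEPAGE` on it. The helper names `align_up` and `advise_large_pages` are hypothetical and this is not the code in this change; it assumes the large page size is a power of two.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Round addr up to the next multiple of align (a power of two). */
uintptr_t align_up(uintptr_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~((uintptr_t)align - 1);
}

/* Hypothetical helper: advise the kernel to use large pages for the part
 * of [start, start + len) that begins and ends on a large page boundary.
 * Returns the number of bytes advised, or 0 if nothing was advised. */
size_t advise_large_pages(void *start, size_t len, size_t large_page_size)
{
    uintptr_t lo = align_up((uintptr_t)start, large_page_size);
    uintptr_t hi = ((uintptr_t)start + len) & ~((uintptr_t)large_page_size - 1);

    if (hi <= lo)
        return 0;               /* no whole aligned large page fits */
#ifdef MADV_HUGEPAGE
    if (madvise((void *)lo, hi - lo, MADV_HUGEPAGE) != 0)
        return 0;               /* advice rejected, e.g. THP unavailable */
    return (size_t)(hi - lo);
#else
    return 0;                   /* platform has no MADV_HUGEPAGE */
#endif
}
```

The advice is only a hint, so a caller would normally treat a return value of 0 as "continue with ordinary pages" rather than as an error.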

While this change is Linux specific, for the most part, other
operating systems have similar, and sometimes better, mechanisms
for achieving the same effect. We believe it is possible to
use large pages for heap and JIT allocations on FreeBSD, Linux,
Solaris, and Windows. The changes to AsmJit in particular
already contain work to this effect.

CLAassistant · 2023-12-21T23:01:56Z

All committers have signed the CLA.

github-actions · 2023-12-21T23:03:03Z

CT Test Results

4 files 146 suites 47m 3s ⏱️
1 637 tests 1 584 ✅ 53 💤 0 ❌
2 150 runs 2 078 ✅ 72 💤 0 ❌

Results for commit aebc8f7.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally.

Artifacts

// Erlang/OTP Github Action Bot

jhogberg

Thanks for the PR! Comments below :)

erts/emulator/beam/erl_init.c

erts/emulator/sys/common/erl_mmap.c

erts/emulator/beam/jit/beam_jit_main.cpp

erts/emulator/beam/erl_init.c

erts/emulator/asmjit/core/jitallocator.cpp

mikpe · 2023-12-29T15:06:59Z

I've only skimmed this PR, but FWIW I tried using Linux hugepages back around R15/R16 or so. Transparent hugepages were unusable (caused kernel lockups in the RHEL kernels we had at the time). What I ended up doing was to use explicit hugepages for the super carrier, but nothing else. That worked but wasn't a win (and required non-default kernel options) so I didn't pursue it.

lexprfuncall · 2023-12-29T23:29:06Z

@jhogberg you left some other comments which all seem very reasonable. I'll follow up on them after the weekend. Also, sorry for all the one-off comments. Responding from an e-mail does not give you the "start a review" option which led me to having my replies dribble out incrementally. I'll be more tidy with the remainder of my comments.

jhogberg · 2023-12-29T23:36:17Z

@jhogberg you left some other comments which all seem very reasonable. I'll follow up on them after the weekend.

Thanks!

Also, sorry for all the one-off comments. Responding from an e-mail does not give you the "start a review" option which led me to having my replies dribble out incrementally. I'll be more tidy with the remainder of my comments.

I don't mind. :)

lexprfuncall · 2023-12-29T23:39:12Z

@mikpe we have had pretty good luck even back in the R15/R16 era with large pages. Back then, it was shown to be profitable to align allocations on a 2MiB boundary and use FreeBSD's superpage feature on 64-bit x86.

As for HugeTLB, I have an option to use them for asmjit's dual-mapping mode on Linux where THP is not an option. In general, my experience is that the administrative overhead for having large amounts of huge pages is tricky when you are running in a shared environment and rebooting the machine between jobs is not an option. For the JIT where we use, say, 128MiB of memory, it's not a big deal to reserve a few 2MiB pages everywhere. With today's memory sizes, that will fly under the radar. For the super carrier which can be 10s or 100s of GiB, it's less practical.

lexprfuncall

Thank you everyone for all of your feedback. I think I've made all of the requested changes. One significant highlight is that I have pulled out the JIT-related changes. Because of its entanglement with third-party code, I will create a separate pull request for it.

erts/emulator/sys/common/erl_mmap.c

jhogberg · 2024-01-16T12:37:22Z

Thanks, I've added it to our daily builds. :)

lexprfuncall

Took a few tries but the CI is now green. The combination of an older version of clang and the -Werror and -Wunused-command-line-argument options ultimately meant some extra code was needed to find the subset of the flags clang-10 supports for segment alignment among those GCC actually uses.

jhogberg · 2024-01-18T08:24:46Z

erts/emulator/sys/unix/sys.c

+            continue;
+        if (from > &etext)
+            break;
+        if ((UWord)from % sys_large_page_size != 0)


This crashes if is_linux_thp_enabled() returns non-zero and /sys/kernel/mm/transparent_hugepage/hpage_pmd_size cannot be opened, resulting in sys_large_page_size being zero. We've got two test rigs where this is the case, one with kernel version 4.4.74 and the other with 3.12.60.

Perhaps is_linux_thp_enabled should also check get_large_page_size() != 0?

lexprfuncall · 2024-01-19

Sorry, I think I let that bug creep in after I rearranged the code.

Along the lines of what you have suggested, I have added an early exit in this function if sys_large_page_size == 0. That should prevent the later % sys_large_page_size from exploding. This variation appealed to my aesthetic sense of having the thp enabled check just be about reading and parsing the relevant sysfs(5) file. However, if you think it's better to put this check in is_linux_thp_enabled I am certainly okay with that too.

jhogberg · 2024-01-22

That's fine. :)
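
The failure mode discussed in this thread can be sketched as follows: if the sysfs file cannot be opened, the page size comes back as zero and any later `% size` divides by zero unless it is guarded. The helpers `read_size_from_file` and `is_large_page_aligned` are hypothetical names for illustration only, not the functions in sys.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Read a single decimal size from a file such as
 * /sys/kernel/mm/transparent_hugepage/hpage_pmd_size.
 * Returns 0 if the file cannot be opened or parsed. */
size_t read_size_from_file(const char *path)
{
    unsigned long long value = 0;
    FILE *f;

    assert(path != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return 0;               /* e.g. older kernels without this file */
    if (fscanf(f, "%llu", &value) != 1)
        value = 0;
    fclose(f);
    return (size_t)value;
}

/* Early exit when the size is unknown, so the modulo never sees zero. */
int is_large_page_aligned(const void *addr, size_t large_page_size)
{
    if (large_page_size == 0)
        return 0;
    return (uintptr_t)addr % large_page_size == 0;
}
```

Either placement of the zero check discussed above has the same effect; the sketch simply puts it next to the modulo.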

jhogberg · 2024-01-25T11:17:04Z

This is pretty much ready to merge now. I think the only thing left is to move the flag affecting mseg_alloc over to https://www.erlang.org/doc/man/erts_alloc.html#system-flags-effecting-erts_alloc and document it.

Since we may want to introduce explicit huge pages later (optionally with size when arch supports that?), I'm thinking of calling the option +MMhp transparent | off but if anyone has a better suggestion I'm all ears.

jhogberg · 2024-01-29T20:55:08Z

We've settled on +MMhp transparent | off. Once the option's renamed and documented we're ready to merge it. :)

lexprfuncall · 2024-01-30T01:48:28Z

This is pretty much ready to merge now. I think the only thing left is to move the flag affecting mseg_alloc over to https://www.erlang.org/doc/man/erts_alloc.html#system-flags-effecting-erts_alloc and document it.

Sure. Would you like me to take care of that?

Since we may want to introduce explicit huge pages later (optionally with size when arch supports that?), I'm thinking of calling the option +MMhp transparent | off but if anyone has a better suggestion I'm all ears.

I think my choice of a flag name encoding a Linux-specific mechanism like THP or HugeTLB was unfortunate as the patch is trivially generalized to other operating systems which have semantics slightly different from Linux THP. If I had to do it again, I'd call it +mseg_use_large_pages or an unpronounceable variant thereof.

Also, prescribing the mechanism on Linux in some all-or-nothing way prevents using both mechanisms where appropriate. For example, 1GiB pages are not supported by the iTLB so we like to use HugeTLB to get a few 1GiB pages for data and THP for the rest.

Anyway, what I have in mind is having the "on" value enable some sensible default support for large pages. THP is a good default for Linux and we can support FreeBSD, macOS, Solaris and Win32 in a similar way with a very small amount of new platform specific code (likely less than was required for Linux). More exotic things like Linux HugeTLB or the similar feature on AIX probably require a few extra flags to specify policies for things like selecting the page size, hoarding the reserved pages at startup, or what to do when the reserved pages are exhausted. That will be hard to configure in a single flag so I’d imagine it would be kindest to users to configure that separately.

What do you think? As an aside, I’m happy to make the follow-up pull request to support FreeBSD and macOS, at least.

vans163 · 2024-01-30T04:21:01Z

This would be really nice, we have long wanted hugepages to work natively on Linux (seems they are only supported on FreeBSD via superpages).

jhogberg · 2024-01-30T07:49:20Z

Sure. Would you like me to take care of that?

Yes please. :)

I think my choice of a flag name encoding a Linux-specific mechanism like THP or HugeTLB was unfortunate as the patch is trivially generalized to other operating systems which have semantics slightly different from Linux THP. If I had to do it again, I'd call it +mseg_use_large_pages or an unpronounceable variant thereof.

Also, prescribing the mechanism on Linux in some all-or-nothing way prevents using both mechanisms where appropriate. For example, 1GiB pages are not supported by the iTLB so we like to use HugeTLB to get a few 1GiB pages for data and THP for the rest.

Anyway, what I have in mind is having the "on" value enable some sensible default support for large pages. THP is a good default for Linux and we can support FreeBSD, macOS, Solaris and Win32 in a similar way with a very small amount of new platform specific code (likely less than was required for Linux). More exotic things like Linux HugeTLB or the similar feature on AIX probably require a few extra flags to specify policies for things like selecting the page size, hoarding the reserved pages at startup, or what to do when the reserved pages are exhausted. That will be hard to configure in a single flag so I’d imagine it would be kindest to users to configure that separately.

What do you think? As an aside, I’m happy to make the follow-up pull request to support FreeBSD and macOS, at least.

We can always make +MMhp<> more specific, for example +MMhpi for instructions (.text segment, JIT once supported) and +MMhpd for data. The 1GiB pages for data case could be handled through the super-carrier mechanism (introduce +MMschp?), which already handles much of what you mentioned with regards to reservation, running out of pages, and so on.

jhogberg · 2024-01-30T12:02:26Z

Oh, just as a heads-up, the documentation will move over to ExDoc tomorrow so maybe we should hold off on adding that part until then (if you've already made the changes, I'll port them over).

jhogberg · 2024-01-31T13:05:13Z

#8026 has been merged now, the erts_alloc documentation has moved to https://github.com/erlang/otp/blob/master/erts/doc/references/erts_alloc.md. :)

A brief description of the new format can be found here.

lexprfuncall · 2024-02-01T02:14:42Z

Yes please. :)

Sure, I'll take care of it in the next few days.

We can always make +MMhp<> more specific, for example +MMhpi for instructions (.text segment, JIT once supported) and +MMhpd for data.

What I was proposing might be a little different, which is to have the flag toggle the use of large pages and leave it up to the runtime to pick a mechanism, rather than specifying the details on the commandline of what mechanism (transparent, explicit, etc.) and where (static text, data, etc.)

Consider the following cases...

1. For FreeBSD, HP-UX, Linux with THP set to always, recent Solaris and, supposedly, recent Windows, you can take advantage of large pages by, at most, arranging to allocate virtual memory at the right alignment and size.
2. For Linux with THP set to madvise, macOS, older Solaris, and most or all Windows, there are advisory and mandatory mechanisms available for requesting large pages. If the mechanism is mandatory, you need to retry it and fall back to something reasonable.
3. For AIX and Linux with HugeTLB, the mechanism allocates from a finite pool of reserved pages. This also creates lots of corner cases (consider the case of calling fork(2)) so it can create correctness issues if not used with care.

At least for cases 1 and 2, it is enough to have a single flag that turns large pages on or off through whatever combination of mechanisms makes sense on the host operating system. This is easiest for users and this flag would be a good candidate for eventually being on by default since there is rarely a downside to having it enabled.

For case 3, a hairy parameter, like the +MMhp<> you proposed, that can encode all of the information needed to configure the huge pages (page sizes, count of pages, NUMA domain, etc.) and what to do if the pages are not available is probably unavoidable. For users that have already gone to the trouble to provision huge pages, the added burden is probably acceptable.
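
To make the "retry and fall back" idea concrete, here is a minimal sketch using Linux as the example platform. It first asks for an explicit huge page mapping with `MAP_HUGETLB`, which fails when the reserved pool is empty or the length is not a multiple of the huge page size, and then falls back to an ordinary mapping with an advisory `MADV_HUGEPAGE` hint. The function name `map_with_fallback` and the policy it encodes are assumptions for illustration, not what the runtime does.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map len bytes of anonymous memory, preferring
 * explicit huge pages. Sets *used_explicit to 1 if the explicit
 * mechanism succeeded and to 0 if the fallback path was taken.
 * Returns NULL if both attempts fail. */
void *map_with_fallback(size_t len, int *used_explicit)
{
    void *p = MAP_FAILED;

    assert(len != 0 && used_explicit != NULL);
    *used_explicit = 0;

#ifdef MAP_HUGETLB
    /* Mandatory mechanism: either we get huge pages or the call fails. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) {
        *used_explicit = 1;
        return p;
    }
#endif

    /* Fallback: ordinary mapping plus a best-effort advisory hint. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
#ifdef MADV_HUGEPAGE
    (void)madvise(p, len, MADV_HUGEPAGE);
#endif
    return p;
}
```

A real policy would also have to decide page size, NUMA placement and what to do after fork(2), which is the part that is hard to express in a single flag.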

The 1GiB pages for data case could be handled through the super-carrier mechanism (introduce +MMschp?), which already handles much of what you mentioned with regards to reservation, running out of pages, and so on.

I also implemented support for 1GiB pages using HugeTLB. It was a more invasive change since you need to decide which parts of the super carrier get the 1GiB huge pages. (I also had to replace a call to fork(2).) If the super carrier runs out of 1GiB pages it is still reasonable to use other mechanisms. IIRC, the fallback in the super carrier is either mseg or sysalloc which is less efficient, at least for us.

jhogberg · 2024-02-01T08:50:37Z

What I was proposing might be a little different, which is to have the flag toggle the use of large pages and leave it up to the runtime to pick a mechanism, rather than specifying the details on the commandline of what mechanism (transparent, explicit, etc.) and where (static text, data, etc.)

Then we're on the same page, you shouldn't need to touch the tricky flags unless you need to, but they should be there for "case 3" once we cross that bridge.

I also implemented support for 1GiB pages using HugeTLB. It was a more invasive change since you need to decide which parts of the super carrier get the 1GiB huge pages. (I also had to replace a call to fork(2).) If the super carrier runs out of 1GiB pages it is still reasonable to use other mechanisms. IIRC, the fallback in the super carrier is either mseg or sysalloc which is less efficient, at least for us.

Then we'll fall back differently under +MMschp on or whatever we decide to call it, it doesn't have to work the way it does today :-)

lexprfuncall · 2024-02-01T23:16:19Z

Then we're on the same page, you shouldn't need to touch the tricky flags unless you need to, but they should be there for "case 3" once we cross that bridge.

To confirm, the direction is to rename the flag as you had suggested above but have an "on" and "off" setting for now, with off as the default, right?

Then we'll fall back differently under +MMschp on or whatever we decide to call it, it doesn't have to work the way it does today :-)

Indeed, this will need a different design.

jhogberg · 2024-02-02T08:33:56Z

To confirm, the direction is to rename the flag as you had suggested above but have an "on" and "off" setting for now, with off as the default, right?

Yeah, on/off works :-)

lexprfuncall · 2024-02-02T21:18:15Z

Yeah, on/off works :-)

Thanks for clarifying. I will go ahead and update the flag and add the documentation sometime next week.

Linux allows an application to remap its .text segment at runtime using transparent huge pages. To do this, an application needs to determine the start and length of the .text segment and pass this range as an argument to the madvise(2) system call. There are many techniques for doing this; the approach chosen in this change is to parse the /proc filesystem. An alternative would be to get this information from the ELF header. For this to work reliably, the start address of the text segment should be aligned to a multiple of the size of a transparent huge page, and the length of the text segment should be greater than the size of a transparent huge page and ideally a multiple of that size. Finally, page sizes and support for multiple page sizes vary by architecture. This change only supports 64-bit x86 but, in theory, it can be generalized to other architectures that support THP.
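
A minimal sketch of the /proc approach, assuming the usual `start-end perms ...` layout of /proc/self/maps lines: parse each line, pick the executable mapping that contains a known code address, and compute the sub-range that starts and ends on a large page boundary. All three function names are hypothetical and this is not the parser in this change; passing the resulting range to madvise(2) is left out.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Parse one maps line; returns 1 on success and fills in the range
 * and whether the mapping is executable. */
int parse_maps_line(const char *line, uintptr_t *start, uintptr_t *end,
                    int *executable)
{
    uint64_t lo, hi;
    char perms[5] = {0};

    if (sscanf(line, "%" SCNx64 "-%" SCNx64 " %4s", &lo, &hi, perms) != 3)
        return 0;
    *start = (uintptr_t)lo;
    *end = (uintptr_t)hi;
    *executable = (perms[2] == 'x');
    return 1;
}

/* Find the executable mapping of this process that contains addr.
 * Lines longer than the buffer are read in pieces; the leftover
 * pieces normally fail to parse and are skipped. */
int find_exec_mapping(const void *addr, uintptr_t *start, uintptr_t *end)
{
    char line[512];
    int found = 0;
    FILE *f = fopen("/proc/self/maps", "r");

    if (f == NULL)
        return 0;
    while (!found && fgets(line, sizeof line, f) != NULL) {
        uintptr_t lo, hi;
        int exec;
        if (parse_maps_line(line, &lo, &hi, &exec) && exec &&
            lo <= (uintptr_t)addr && (uintptr_t)addr < hi) {
            *start = lo;
            *end = hi;
            found = 1;
        }
    }
    fclose(f);
    return found;
}

/* Length of the largest part of [start, end) aligned to lp (a power of
 * two) at both ends; 0 if no whole large page fits. */
size_t large_page_span(uintptr_t start, uintptr_t end, size_t lp,
                       uintptr_t *aligned_start)
{
    uintptr_t mask, lo, hi;

    assert(lp != 0 && (lp & (lp - 1)) == 0);
    mask = (uintptr_t)lp - 1;
    lo = (start + mask) & ~mask;
    hi = end & ~mask;
    if (hi <= lo)
        return 0;
    *aligned_start = lo;
    return (size_t)(hi - lo);
}
```

A span of 0 corresponds to the case described above where the text segment is too small or too poorly aligned for remapping to help.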

In order for mseg pages to be reliably mapped with pages larger than the default page size, the mapping must start and end at a multiple of the larger page size. To do this, this change adds an abstraction for performing a memory mapping with a specified alignment. On operating systems like SunOS 5.9 and later, this is done by passing some extra flags to mmap(2). On operating systems without such a capability, we must do this manually by over-allocating and freeing the excess. The logic in this change only affects super carrier allocations but it can be generalized to other mseg allocations.
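
The manual "over-allocate and free the excess" technique can be sketched as below. `mmap_aligned` is a hypothetical name, not the abstraction added in erl_mmap.c, and the sketch assumes the alignment is a power of two no smaller than the system page size and that the length is a multiple of the alignment.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical helper: map len bytes of anonymous memory starting at a
 * multiple of align. Returns NULL on failure. */
void *mmap_aligned(size_t len, size_t align)
{
    size_t padded, head, tail;
    uintptr_t base, aligned;
    char *raw;

    assert(align != 0 && (align & (align - 1)) == 0);
    assert(len != 0 && len % align == 0);

    padded = len + align;       /* enough slack to slide to a boundary */
    if (padded < len)
        return NULL;            /* size overflow */

    raw = mmap(NULL, padded, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    base = (uintptr_t)raw;
    aligned = (base + align - 1) & ~((uintptr_t)align - 1);
    head = (size_t)(aligned - base);    /* unused bytes before the block */
    tail = padded - head - len;         /* unused bytes after the block */

    if (head != 0)
        munmap(raw, head);
    if (tail != 0)
        munmap((char *)aligned + len, tail);
    return (void *)aligned;
}
```

The cost of this approach is a transient reservation of one extra alignment unit of address space per mapping, which is released immediately.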

lexprfuncall · 2024-02-09T05:33:09Z

@jhogberg I have updated the documentation and changed the flag name to be a +M option that can be set to either on or off with off as the default. Since it turns out "huge page" means something specific on Windows I just used the large page terminology throughout. Please let me know if there are any other needed changes.

jhogberg · 2024-02-13T09:13:17Z

Thanks, it looks great, I'll merge it once 27-rc1 is released (in a day or so?). :-)

jhogberg · 2024-02-15T08:57:41Z

Merged, thanks again for the PR! :)

lexprfuncall force-pushed the large-pages branch 2 times, most recently from 3f7a1f0 to 24bd2fc Compare December 22, 2023 02:47

jhogberg requested changes Dec 28, 2023

jhogberg self-assigned this Dec 28, 2023

jhogberg added the team:VM Assigned to OTP team VM label Dec 28, 2023

lexprfuncall force-pushed the large-pages branch 3 times, most recently from ef64081 to 38126f1 Compare January 14, 2024 19:18

lexprfuncall commented Jan 14, 2024

lexprfuncall requested a review from jhogberg January 14, 2024 19:20

lexprfuncall force-pushed the large-pages branch 2 times, most recently from f1429c9 to 1a24a03 Compare January 15, 2024 20:35

jhogberg added the testing currently being tested, tag is used by OTP internal CI label Jan 16, 2024

lexprfuncall force-pushed the large-pages branch 4 times, most recently from d051b04 to 5815264 Compare January 17, 2024 02:20

lexprfuncall commented Jan 17, 2024

jhogberg reviewed Jan 18, 2024

lexprfuncall force-pushed the large-pages branch from 5815264 to 2c694e3 Compare January 19, 2024 00:45

jhogberg removed the testing currently being tested, tag is used by OTP internal CI label Jan 29, 2024

lexprfuncall added 2 commits February 8, 2024 15:11

lexprfuncall force-pushed the large-pages branch from 2c694e3 to aebc8f7 Compare February 9, 2024 05:05

jhogberg added the testing currently being tested, tag is used by OTP internal CI label Feb 14, 2024

jhogberg merged commit 06533ae into erlang:master Feb 15, 2024
18 checks passed

garazdawi mentioned this pull request Jun 17, 2024

ELF-binaries in the erts-15.0/bin of Hexpm Docker images are too big (and full or zeros) #8574

Closed
