Add support for memory mappings with large pages #7977

Merged: 2 commits merged into erlang:master from the large-pages branch on Feb 15, 2024

Conversation

lexprfuncall
Contributor

This change extends the memory managers in Erlang/OTP to
optionally map virtual memory on Linux using large pages.  Large
pages make more efficient use of the TLB by reducing the dynamic
frequency of TLB misses thereby increasing application
performance.  The effect of reducing TLB misses can be dramatic.
On our server workloads we have observed >10% performance
improvements simply from enabling mappings with large pages.

The opportunity for such an improvement comes from the growing
disparity between the size of the TLB and main memory.  The
typical 64-bit x86 has a 64-entry L1 data TLB and a 128-entry L1
instruction TLB, addressing 256KiB and 512KiB of data and text
respectively, at 4KiB per entry.  This is 4-5 orders of magnitude
smaller than the amount of memory installed in the typical server
of today.  When a TLB lookup misses, the processor must perform an
address translation by traversing a multi-level page table in main
memory, a costly operation.  This overhead can largely be avoided
by using a larger page size.  For mappings using a 2MiB or 1GiB
page in lieu of a 4KiB page the L1 TLB can address 32MiB or 4GiB,
respectively, a more meaningful fraction of typical working sets.
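
The reach figures above follow from multiplying the number of TLB entries by the page size. A minimal sketch of that arithmetic in C; the function name `tlb_reach` is illustrative and not part of this change. Note that the 32MiB and 4GiB figures correspond to 16 and 4 entries respectively, which suggests the L1 TLB holds fewer entries for large pages; those entry counts are inferred from the arithmetic, not stated in the description.

```c
#include <assert.h>
#include <stdint.h>

/* TLB reach: how much memory can be addressed without a TLB miss,
 * i.e. the number of entries multiplied by the page size in bytes.
 * (Illustrative helper, not part of the pull request.) */
uint64_t tlb_reach(uint64_t entries, uint64_t page_size)
{
    return entries * page_size;
}
```

For example, `tlb_reach(64, 4096)` gives 262144 bytes (256KiB), matching the L1 data TLB figure quoted above.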

To take advantage of large pages in Erlang, we identified three
classes of virtual memory allocations that would benefit from
large pages: the .text segment, heap allocations, and JIT
allocations.  While there are many strategies for using large
pages on Linux, we chose to use Transparent Huge Pages (THP), which
are the most flexible.  Briefly put, in order to use THP we arranged
for all of these classes of virtual memory allocations to be
aligned and sized to a multiple of a large page and performed a
system call that advised the kernel to use large pages for the
mapping.
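
The "aligned, sized, and advised" recipe can be sketched as follows. This is only an illustration under stated assumptions: `round_up` and `advise_large_pages` are hypothetical names, not functions from this change, and the sketch assumes the advice is given with `madvise(2)` and `MADV_HUGEPAGE`, which is the standard Linux interface for requesting THP on a range.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes MADV_HUGEPAGE on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Round n up to the next multiple of align (align must be a power of two).
 * Hypothetical helper for sizing a mapping to whole large pages. */
size_t round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Ask the kernel to back [addr, addr + len) with transparent huge pages.
 * Returns 0 on success and -1 if the range is not aligned and sized to
 * the large page size, or if the advice is unavailable or rejected. */
int advise_large_pages(void *addr, size_t len, size_t large_page_size)
{
    if (large_page_size == 0 || len == 0)
        return -1;
    if ((uintptr_t)addr % large_page_size != 0 || len % large_page_size != 0)
        return -1;
#ifdef MADV_HUGEPAGE
    return madvise(addr, len, MADV_HUGEPAGE);
#else
    return -1;                 /* platform without MADV_HUGEPAGE */
#endif
}
```

The advice is only a hint: the kernel may still use base pages, so callers should treat a failure here as non-fatal.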

While this change is Linux specific, for the most part, other
operating systems have similar, and sometimes better, mechanisms
for achieving the same effect.  We believe it is possible to
use large pages for heap and JIT allocations on FreeBSD, Linux,
Solaris, and Windows.  The changes to AsmJit in particular
already contain work to this effect.

@CLAassistant

CLAassistant commented Dec 21, 2023

CLA assistant check
All committers have signed the CLA.

Contributor

github-actions bot commented Dec 21, 2023

CT Test Results

    4 files    146 suites   47m 3s ⏱️
1 637 tests 1 584 ✅ 53 💤 0 ❌
2 150 runs  2 078 ✅ 72 💤 0 ❌

Results for commit aebc8f7.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

@lexprfuncall lexprfuncall force-pushed the large-pages branch 2 times, most recently from 3f7a1f0 to 24bd2fc Compare December 22, 2023 02:47
Contributor

@jhogberg jhogberg left a comment

Thanks for the PR! Comments below :)

erts/emulator/beam/erl_init.c (outdated, resolved)
erts/emulator/beam/erl_init.c (outdated, resolved)
erts/emulator/beam/erl_init.c (outdated, resolved)
erts/emulator/beam/erl_init.c (outdated, resolved)
erts/emulator/beam/erl_init.c (outdated, resolved)
erts/emulator/sys/common/erl_mmap.c (outdated, resolved)
erts/emulator/sys/common/erl_mmap.c (resolved)
erts/emulator/beam/jit/beam_jit_main.cpp (outdated, resolved)
erts/emulator/beam/erl_init.c (outdated, resolved)
erts/emulator/asmjit/core/jitallocator.cpp (outdated, resolved)
@jhogberg jhogberg self-assigned this Dec 28, 2023
@jhogberg jhogberg added the team:VM Assigned to OTP team VM label Dec 28, 2023
@mikpe
Contributor

mikpe commented Dec 29, 2023

I've only skimmed this PR, but FWIW I tried using Linux hugepages back around R15/R16 or so. Transparent hugepages were unusable (caused kernel lockups in the RHEL kernels we had at the time). What I ended up doing was to use explicit hugepages for the super carrier, but nothing else. That worked but wasn't a win (and required non-default kernel options) so I didn't pursue it.

@lexprfuncall
Contributor Author

@jhogberg you left some other comments which all seem very reasonable. I'll follow up on them after the weekend. Also, sorry for all the one-off comments. Responding from an e-mail does not give you the "start a review" option, which led to my replies dribbling out incrementally. I'll be more tidy with the remainder of my comments.

@jhogberg
Contributor

@jhogberg you left some other comments which all seem very reasonable. I'll follow-up on them after the weekend.

Thanks!

Also, sorry for all the one-off comments. Responding from an e-mail does not give you the "start a review" option, which led to my replies dribbling out incrementally. I'll be more tidy with the remainder of my comments.

I don't mind. :)

@lexprfuncall
Contributor Author

@mikpe we have had pretty good luck with large pages, even back in the R15/R16 era. Back then, it was shown to be profitable to align allocations on a 2MiB boundary and use FreeBSD's superpage feature on 64-bit x86.

As for HugeTLB, I have an option to use them for asmjit's dual-mapping mode on Linux where THP is not an option. In general, my experience is that the administrative overhead for having large amounts of huge pages is tricky when you are running in a shared environment and rebooting the machine between jobs is not an option. For the JIT where we use, say, 128MiB of memory, it's not a big deal to reserve a few 2MiB pages everywhere. With today's memory sizes, that will fly under the radar. For the super carrier which can be 10s or 100s of GiB, it's less practical.

@lexprfuncall lexprfuncall force-pushed the large-pages branch 3 times, most recently from ef64081 to 38126f1 Compare January 14, 2024 19:18
Contributor Author

@lexprfuncall lexprfuncall left a comment

Thank you everyone for all of your feedback. I think I've made all of the requested changes. One significant highlight is that I have pulled out the JIT-related changes. Because of its entanglement with third-party code, I will create a separate pull request for it.

erts/emulator/sys/common/erl_mmap.c (outdated, resolved)
erts/emulator/sys/common/erl_mmap.c (resolved)
erts/emulator/sys/common/erl_mmap.c (outdated, resolved)
erts/emulator/sys/common/erl_mmap.c (resolved)
erts/emulator/sys/common/erl_mmap.c (outdated, resolved)
erts/emulator/sys/common/erl_mmap.c (resolved)
erts/emulator/sys/common/erl_mmap.c (outdated, resolved)
erts/emulator/sys/common/erl_mmap.c (outdated, resolved)
erts/emulator/sys/common/erl_mmap.c (resolved)
erts/emulator/sys/common/erl_mmap.c (outdated, resolved)
@lexprfuncall lexprfuncall force-pushed the large-pages branch 2 times, most recently from f1429c9 to 1a24a03 Compare January 15, 2024 20:35
@jhogberg jhogberg added the testing currently being tested, tag is used by OTP internal CI label Jan 16, 2024
@jhogberg
Contributor

Thanks, I've added it to our daily builds. :)

@lexprfuncall lexprfuncall force-pushed the large-pages branch 4 times, most recently from d051b04 to 5815264 Compare January 17, 2024 02:20
Contributor Author

@lexprfuncall lexprfuncall left a comment

Took a few tries but the CI is now green. The combination of an older version of clang and the -Werror and -Wunused-command-line-argument options ultimately meant some extra code was needed to find the subset of the flags clang-10 supports for segment alignment among those GCC actually uses.

continue;
if (from > &etext)
break;
if ((UWord)from % sys_large_page_size != 0)
Contributor

This crashes if is_linux_thp_enabled() returns non-zero and /sys/kernel/mm/transparent_hugepage/hpage_pmd_size cannot be opened, resulting in sys_large_page_size being zero. We've got two test rigs where this is the case, one with kernel version 4.4.74 and the other with 3.12.60.

Perhaps is_linux_thp_enabled should also check get_large_page_size() != 0?

Contributor Author

Sorry, I think I let that bug creep in after I rearranged the code.

Along the lines of what you have suggested, I have added an early exit in this function if sys_large_page_size == 0. That should prevent the later % sys_large_page_size from exploding. This variation appealed to my aesthetic sense of having the THP-enabled check just be about reading and parsing the relevant sysfs(5) file. However, if you think it's better to put this check in is_linux_thp_enabled, I am certainly okay with that too.

Contributor

That's fine. :)
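
The failure mode discussed in this thread (a THP-enabled check that succeeds while the page size reads as zero) can be sketched as below. This is a hedged illustration, not the code from the pull request: all three function names are hypothetical, and treating both the `always` and `madvise` modes as "enabled" is an assumption. The sketch relies on the sysfs `enabled` file marking the active mode with square brackets and on `hpage_pmd_size` holding a decimal byte count.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Given the contents of the THP "enabled" sysfs file, for example
 * "always [madvise] never", return 1 if the bracketed (active) mode is
 * "always" or "madvise", and 0 otherwise. */
int thp_mode_allows_large_pages(const char *contents)
{
    assert(contents != NULL);
    return strstr(contents, "[always]") != NULL
        || strstr(contents, "[madvise]") != NULL;
}

/* Read a decimal byte count from path. Returns 0 if the file cannot be
 * opened or parsed, so 0 means "large page size unknown". */
size_t read_large_page_size(const char *path)
{
    unsigned long long value = 0;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return 0;
    if (fscanf(f, "%llu", &value) != 1)
        value = 0;
    fclose(f);
    return (size_t)value;
}

/* Alignment test that stays safe when the page size is unknown. */
int is_large_page_aligned(uintptr_t addr, size_t large_page_size)
{
    if (large_page_size == 0)
        return 0;              /* early exit: never evaluate addr % 0 */
    return addr % large_page_size == 0;
}
```

On kernels where `/sys/kernel/mm/transparent_hugepage/hpage_pmd_size` is missing, `read_large_page_size` returns 0 and the guarded alignment check simply reports "not aligned" instead of dividing by zero.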

@jhogberg
Contributor

This is pretty much ready to merge now. I think the only thing left is to move the flag affecting mseg_alloc over to https://www.erlang.org/doc/man/erts_alloc.html#system-flags-effecting-erts_alloc and documenting it.

Since we may want to introduce explicit huge pages later (optionally with size when arch supports that?), I'm thinking of calling the option +MMhp transparent | off, but if anyone has a better suggestion I'm all ears.

@jhogberg jhogberg removed the testing currently being tested, tag is used by OTP internal CI label Jan 29, 2024
@jhogberg
Contributor

We've settled on +MMhp transparent | off. Once the option's renamed and documented we're ready to merge it. :)

@lexprfuncall
Contributor Author

This is pretty much ready to merge now. I think the only thing left is to move the flag affecting mseg_alloc over to https://www.erlang.org/doc/man/erts_alloc.html#system-flags-effecting-erts_alloc and documenting it.

Sure. Would you like me to take care of that?

Since we may want to introduce explicit huge pages later (optionally with size when arch supports that?), I'm thinking of calling the option +MMhp transparent | off, but if anyone has a better suggestion I'm all ears.

I think my choice of a flag name encoding a Linux-specific mechanism like THP or HugeTLB was unfortunate as the patch is trivially generalized to other operating systems which have semantics slightly different from Linux THP. If I had to do it again, I'd call it +mseg_use_large_pages or an unpronounceable variant thereof.

Also, prescribing the mechanism on Linux in some all-or-nothing way prevents using both mechanisms where appropriate. For example, 1GiB pages are not supported by the iTLB so we like to use HugeTLB to get a few 1GiB pages for data and THP for the rest.

Anyway, what I have in mind is having the "on" value enable some sensible default support for large pages. THP is a good default for Linux and we can support FreeBSD, macOS, Solaris and Win32 in a similar way with a very small amount of new platform specific code (likely less than was required for Linux). More exotic things like Linux HugeTLB or the similar feature on AIX probably require a few extra flags to specify policies for things like selecting the page size, hoarding the reserved pages at startup, or what to do when the reserved pages are exhausted. That will be hard to configure in a single flag so I’d imagine it would be kindest to users to configure that separately.

What do you think? As an aside, I’m happy to make the follow-up pull request to support FreeBSD and macOS, at least.

@vans163
Contributor

vans163 commented Jan 30, 2024

This is pretty much ready to merge now. I think the only thing left is to move the flag affecting mseg_alloc over to https://www.erlang.org/doc/man/erts_alloc.html#system-flags-effecting-erts_alloc and documenting it.

Sure. Would you like me to take care of that?

Since we may want to introduce explicit huge pages later (optionally with size when arch supports that?), I'm thinking of calling the option +MMhp transparent | off, but if anyone has a better suggestion I'm all ears.

I think my choice of a flag name encoding a Linux-specific mechanism like THP or HugeTLB was unfortunate as the patch is trivially generalized to other operating systems which have semantics slightly different from Linux THP. If I had to do it again, I'd call it +mseg_use_large_pages or an unpronounceable variant thereof.

Also, prescribing the mechanism on Linux in some all-or-nothing way prevents using both mechanisms where appropriate. For example, 1GiB pages are not supported by the iTLB so we like to use HugeTLB to get a few 1GiB pages for data and THP for the rest.

Anyway, what I have in mind is having the "on" value enable some sensible default support for large pages. THP is a good default for Linux and we can support FreeBSD, macOS, Solaris and Win32 in a similar way with a very small amount of new platform specific code (likely less than was required for Linux). More exotic things like Linux HugeTLB or the similar feature on AIX probably require a few extra flags to specify policies for things like selecting the page size, hoarding the reserved pages at startup, or what to do when the reserved pages are exhausted. That will be hard to configure in a single flag so I’d imagine it would be kindest to users to configure that separately.

What do you think? As an aside, I’m happy to make the follow-up pull request to support FreeBSD and macOS, at least.

This would be really nice, we have long wanted hugepages to work natively on Linux (seems they are only supported on FreeBSD via superpages).

@jhogberg
Contributor

Sure. Would you like me to take care of that?

Yes please. :)

I think my choice of a flag name encoding a Linux-specific mechanism like THP or HugeTLB was unfortunate as the patch is trivially generalized to other operating systems which have semantics slightly different from Linux THP. If I had to do it again, I'd call it +mseg_use_large_pages or an unpronounceable variant thereof.

Also, prescribing the mechanism on Linux in some all-or-nothing way prevents using both mechanisms where appropriate. For example, 1GiB pages are not supported by the iTLB so we like to use HugeTLB to get a few 1GiB pages for data and THP for the rest.

Anyway, what I have in mind is having the "on" value enable some sensible default support for large pages. THP is a good default for Linux and we can support FreeBSD, macOS, Solaris and Win32 in a similar way with a very small amount of new platform specific code (likely less than was required for Linux). More exotic things like Linux HugeTLB or the similar feature on AIX probably require a few extra flags to specify policies for things like selecting the page size, hoarding the reserved pages at startup, or what to do when the reserved pages are exhausted. That will be hard to configure in a single flag so I’d imagine it would be kindest to users to configure that separately.

What do you think? As an aside, I’m happy to make the follow-up pull request to support FreeBSD and macOS, at least.

We can always have +MMhp<> more specific, for example +MMhpi for instructions (.text segment, JIT once supported) and +MMhpd for data. The 1GiB pages for data case could be handled through the super-carrier mechanism (introduce +MMschp?), which already handles much of what you mentioned with regards to reservation, running out of pages, and so on.

@jhogberg
Contributor

jhogberg commented Jan 30, 2024

Oh, just as a heads-up, the documentation will move over to ExDoc tomorrow so maybe we should hold off on adding that part until then (if you've already made the changes, I'll port them over).

@jhogberg
Contributor

jhogberg commented Jan 31, 2024

#8026 has been merged now, the erts_alloc documentation has moved to https://github.com/erlang/otp/blob/master/erts/doc/references/erts_alloc.md. :)

A brief description of the new format can be found here.

@lexprfuncall
Contributor Author

Yes please. :)

Sure, I'll take care of it in the next few days.

We can always have +MMhp<> more specific, for example +MMhpi for instructions (.text segment, JIT once supported) and +MMhpd for data.

What I was proposing might be a little different, which is to have the flag toggle the use of large pages and leave it up to the runtime to pick a mechanism, rather than specifying the details on the commandline of what mechanism (transparent, explicit, etc.) and where (static text, data, etc.)

Consider the following cases...

  1. For FreeBSD, HP-UX, Linux with THP set to always, recent Solaris and, supposedly, recent Windows, you can take advantage of large pages by, at most, arranging to allocate virtual memory at the right alignment and size.

  2. For Linux with THP set to madvise, macOS, older Solaris, and most or all Windows, there are advisory and mandatory mechanisms available for requesting large pages. If the mechanism is mandatory, you need to retry it and fall back to something reasonable.

  3. For AIX and Linux with HugeTLB, the mechanism allocates from a finite pool of reserved pages. This also creates lots of corner cases (consider the case of calling fork(2)) so it can create correctness issues if not used with care.

At least for cases 1 and 2, it is enough to have a single flag that turns large pages on or off through whatever combination of mechanisms makes sense on the host operating system. This is easiest for users and this flag would be a good candidate for eventually being on by default since there is rarely a downside to having it enabled.

For case 3, a hairy parameter, like the +MMhp<> you proposed, that can encode all of the information needed to configure the huge pages (page sizes, count of pages, NUMA domain, etc.) and what to do if the pages are not available is probably unavoidable. For users that have already gone to the trouble to provision huge pages, the added burden is probably acceptable.

The 1GiB pages for data case could be handled through the super-carrier mechanism (introduce +MMschp?), which already handles much of what you mentioned with regards to reservation, running out of pages, and so on.

I also implemented support for 1GiB pages using HugeTLB. It was a more invasive change since you need to decide which parts of the super carrier get the 1GiB huge pages. (I also had to replace a call to fork(2).) If the super carrier runs out of 1GiB pages it is still reasonable to use other mechanisms. IIRC, the fallback in the super carrier is either mseg or sysalloc which is less efficient, at least for us.

@jhogberg
Contributor

jhogberg commented Feb 1, 2024

What I was proposing might be a little different, which is to have the flag toggle the use of large pages and leave it up to the runtime to pick a mechanism, rather than specifying the details on the commandline of what mechanism (transparent, explicit, etc.) and where (static text, data, etc.)

Then we're on the same page, you shouldn't need to touch the tricky flags unless you need to, but they should be there for "case 3" once we cross that bridge.

I also implemented support for 1GiB pages using HugeTLB. It was a more invasive change since you need to decide which parts of the super carrier get the 1GiB huge pages. (I also had to replace a call to fork(2).) If the super carrier runs out of 1GiB pages it is still reasonable to use other mechanisms. IIRC, the fallback in the super carrier is either mseg or sysalloc which is less efficient, at least for us.

Then we'll fall back differently under +MMschp on or whatever we decide to call it, it doesn't have to work the way it does today :-)

@lexprfuncall
Contributor Author

Then we're on the same page, you shouldn't need to touch the tricky flags unless you need to, but they should be there for "case 3" once we cross that bridge.

To confirm, the direction is to rename the flag as you had suggested above but have an "on" and "off" setting for now, with off as the default, right?

Then we'll fall back differently under +MMschp on or whatever we decide to call it, it doesn't have to work the way it does today :-)

Indeed, this will need a different design.

@jhogberg
Contributor

jhogberg commented Feb 2, 2024

To confirm, the direction is to rename the flag as you had suggested above but have an "on" and "off" setting for now, with off as the default, right?

Yeah, on/off works :-)

@lexprfuncall
Contributor Author

Yeah, on/off works :-)

Thanks for clarifying. I will go ahead and update the flag and add the documentation sometime next week.

Linux allows an application to remap its .text segment at runtime
using transparent huge pages.  To do this, an application needs to
determine the start and length of the .text segment and pass this
range as an argument to the madvise(2) system call.  There are many
techniques for doing this; the approach chosen in this change is to
parse the /proc filesystem.  An alternative would be to get this
information from the ELF header.

For this to work reliably, the start address of the text segment
should be aligned to a multiple of the size of a transparent huge
page, and the length of the text segment should be greater than the size of a
transparent huge page and ideally a multiple of that size.
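
To illustrate the /proc-based approach, here is a minimal sketch of parsing one line of `/proc/self/maps` and shrinking the resulting range to large-page boundaries. It is not the code from this change: `parse_exec_mapping` and `large_page_subrange` are hypothetical names, and selecting mappings by their `r` and `x` permission bits is an assumption about how the text segment might be recognized.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Parse the "start-end perms" prefix of one /proc/<pid>/maps line.
 * Returns 1 and fills start/end if the mapping is readable and
 * executable, 0 otherwise (including on malformed input). */
int parse_exec_mapping(const char *line, uintptr_t *start, uintptr_t *end)
{
    uint64_t lo, hi;
    char perms[5] = {0};

    if (sscanf(line, "%" SCNx64 "-%" SCNx64 " %4s", &lo, &hi, perms) != 3)
        return 0;
    if (perms[0] != 'r' || perms[2] != 'x')
        return 0;
    *start = (uintptr_t)lo;
    *end = (uintptr_t)hi;
    return 1;
}

/* Shrink [start, end) to the largest sub-range whose bounds are both
 * multiples of large_page_size. Returns the sub-range length in bytes
 * and stores its start in *aligned_start; returns 0 if no whole large
 * page fits or the page size is unknown (zero). */
size_t large_page_subrange(uintptr_t start, uintptr_t end,
                           size_t large_page_size, uintptr_t *aligned_start)
{
    uintptr_t lo, hi;

    if (large_page_size == 0 || end <= start)
        return 0;
    lo = (start + large_page_size - 1) / large_page_size * large_page_size;
    hi = end / large_page_size * large_page_size;
    if (hi <= lo)
        return 0;
    *aligned_start = lo;
    return (size_t)(hi - lo);
}
```

A caller would read `/proc/self/maps` line by line, for instance with `fgets`, and pass each non-empty sub-range to `madvise(2)` with `MADV_HUGEPAGE`. A text segment that is not linked at a large-page boundary loses up to one large page at each end, which is why the alignment requirement above matters.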

Finally, page sizes and support for multiple page sizes vary by
architecture.  This change only supports the 64-bit x86 but, in
theory, it can be generalized to other architectures that support THP.

In order for mseg pages to be reliably mapped with pages larger than
the default page size, the mapping must start and end at a multiple of
the larger page size.

To do this, this change adds an abstraction for performing a
memory-mapping with a specified alignment.  On operating systems
like SunOS 5.9 and later, this is done by passing some extra flags to
mmap(2).  On operating systems without such a capability, we must do
this manually by over-allocating and freeing the excess.

The logic in this change only affects super carrier allocations but it
can be generalized to other mseg allocations.
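
The manual fallback (over-allocate, then free the excess) can be sketched as follows. This is an illustration only: `mmap_aligned` is a hypothetical name rather than the abstraction added by this change, and the sketch assumes `size` is a multiple of the base page size and `align` is a power of two no smaller than the base page size.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Map size bytes of anonymous memory starting at a multiple of align.
 * Returns MAP_FAILED on error. */
void *mmap_aligned(size_t size, size_t align)
{
    size_t padded, head, tail;
    uintptr_t base;
    char *raw;

    assert(align != 0 && (align & (align - 1)) == 0);

    padded = size + align;     /* extra room to slide the start upwards */
    if (padded < size)
        return MAP_FAILED;     /* size_t overflow */

    raw = mmap(NULL, padded, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return MAP_FAILED;

    /* First address at or after raw that is a multiple of align. */
    base = ((uintptr_t)raw + align - 1) & ~((uintptr_t)align - 1);
    head = (size_t)(base - (uintptr_t)raw);
    tail = padded - head - size;

    /* Give back the unused pages before and after the aligned region. */
    if (head > 0)
        munmap(raw, head);
    if (tail > 0)
        munmap((char *)base + size, tail);

    return (void *)base;
}
```

The cost of this approach is a transiently larger reservation of virtual address space and up to two extra `munmap` calls per mapping, which is why a native alignment flag is preferable where the operating system offers one.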
@lexprfuncall
Contributor Author

@jhogberg I have updated the documentation and changed the flag name to be a +M option that can be set to either on or off with off as the default. Since it turns out "huge page" means something specific on Windows I just used the large page terminology throughout. Please let me know if there are any other needed changes.

@jhogberg
Contributor

Thanks, it looks great, I'll merge it once 27-rc1 is released (in a day or so?). :-)

@jhogberg jhogberg added the testing currently being tested, tag is used by OTP internal CI label Feb 14, 2024
@jhogberg jhogberg merged commit 06533ae into erlang:master Feb 15, 2024
18 checks passed
@jhogberg
Contributor

Merged, thanks again for the PR! :)
