Lazily allocate thread stacks #143

Closed
nyh opened this issue Dec 26, 2013 · 9 comments

Comments

@nyh
Contributor

nyh commented Dec 26, 2013

We currently allocate for each pthread, by default, a 1 MB stack. This allocation is done with mmap with mmu::mmap_populate, i.e., we force the entire allocation to be done up-front. This is a terrible waste: thread stacks are an excellent candidate for lazy allocation - they cannot use huge-pages anyway (they are less than 2MB, and have a guard page), and they often use much less memory than the full 1MB allotment.

Unfortunately, OSv crashes with lazy population of thread stacks, e.g., in tst-pipe.so, as explained in commit 41efdc1 which made the allocation immediate (with mmu::mmap_populate). The problem is that with lazy allocation of stacks, any code can cause us to need to allocate one more page for the stack, and cause a page fault, which may sleep (waiting on a mutex for memory allocation, for example). The crash happens when the fault occurs in some OSv "kernel" code which has preemption disabled or irq disabled but runs on the user's stack (e.g., RCU lock, wait_until, the scheduler, or numerous other places), and a sleep is not allowed.

On the OSv mailing list, I proposed the following workaround: In preempt_disable() and irq_disable(), before we disable preemption, read a byte 4096 bytes ahead of the current stack top. Usually, this will add a tiny performance penalty (a single read, not even an if()), but if the next page of the stack is not yet allocated, it will cause the page fault to be taken here and now, before we disable preemption.
We also need to read a byte from the stack in thread::init_stack(), because new threads start with preemption disabled, so we need them to start with a minimal part of the stack allocated.

If we want more than 4096 bytes of available stack to be guaranteed, we can read several bytes at 4096-byte stride, but this is slower than reading a single byte. Instead, we can consider doing this: read just one byte a few pages from the current stack position, and modify the page-fault handler to allocate more than one adjacent page on each page fault.
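
A minimal sketch of the proposed read, assuming a 4096-byte page size (the function name and exact placement are illustrative, not actual OSv code):

    // Illustrative sketch only: touch the next stack page *before* disabling
    // preemption or interrupts, so any page fault is taken while the fault
    // handler is still allowed to sleep.
    inline void pre_fault_next_stack_page()
    {
        char probe;
        // Read one byte a page (4096 bytes) below the current stack frame;
        // if that page is not yet populated, the fault happens here and now.
        *(volatile char*)(&probe - 4096);
    }
    // This read would go at the start of preempt_disable(), irq_disable()
    // and thread::init_stack(), per the proposal above.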

On the mailing list, Avi Kivity suggested other options to solve our lazily-allocated stack problem:

  • insert "thunk code" between user and kernel code that switches the stacks to known resident stacks. We could abuse the elf linker code to do that for us, at run time.
  • use -fsplit-stack to allow a dynamically allocated, discontiguous stack on physical memory
@nyh
Contributor Author

nyh commented Feb 7, 2014

Just as an example of the kind of savings I am hoping to get from this:

Right now, in even a small application ("make image=rogue", for example) I see 180 threads.
Threads have variable stack sizes (some are 4K, pthreads are 1MB by default), but the vast majority of these 180 threads are ZFS threads which have a 65 KB stack.
If half of this stack is unused (just a ballpark guesstimate...), we have 32 KB wasted for each of the 180 threads. That's almost 6 MB, which is 13% of the memory used by this tiny application (I can run it on a 46 MB guest).

The memory saving will be much larger in applications which use many pthreads, e.g., heavily threaded Java applications.

penberg pushed a commit that referenced this issue Oct 8, 2014
Linux's mmap() API has a flag, MAP_STACK, which is a no-op on Linux, but
was added to the API in case "some architectures require special treatment
for stack allocations" (see mmap(2)). Unfortunately, OSv indeed has a
special requirement for stack memory: As explained in issue #143, we
currently do not support demand-paged stacks (i.e., accessing the stack
cannot cause a page fault), so the stack must be allocated pre-faulted
and pinned.

So if an application did us the courtesy of actually using MAP_STACK
when mmap'ing a stack, let's return the favor by mapping pre-faulted
memory which will actually work if used as a stack.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
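
For reference, this is the kind of application code the commit caters to - a generic POSIX example (not OSv code) of a pthread stack supplied via mmap() with MAP_STACK, which OSv now maps pre-faulted:

    #include <pthread.h>
    #include <sys/mman.h>
    #include <stddef.h>

    static void* worker(void* arg) { return arg; }

    int spawn_with_mmapped_stack(pthread_t* t)
    {
        const size_t stack_size = 1 << 20;   /* 1 MB */
        void* stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
        if (stack == MAP_FAILED) {
            return -1;
        }
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstack(&attr, stack, stack_size);
        return pthread_create(t, &attr, worker, NULL);
    }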
nyh added a commit that referenced this issue Oct 5, 2016
In issue #790, Timmons C. Player discovered something dangerous with
on-stack sched::thread objects: The application's stack may be mmap()ed,
so the scheduler now needs to access an application's mmap()ed memory
area. Those accesses can potentially have all sorts of problems (e.g.,
page faults in the scheduler, although in practice issue #143 precludes
it). But more concretely, the "lazy TLB flush" code added in commit
7e38453 means that the scheduler may see old pages for an application's
mmap()ed pages, so it cannot work with a sched::thread in an mmap()ed
area.

This patch prevents on-stack sched::thread construction by hiding the
constructor, and only allowing a thread to be created through the function
sched::thread::make(...), which creates the object on the heap.

This patch is long but it more-or-less contains just the following
types of changes:

1. The change to sched.hh to hide the sched::thread constructor, and
   instead offer a sched::thread::make(...) function.

2. Every class which used a "sched::thread" member was replaced by
   a std::unique_ptr<sched::thread> member.

   One "casualty" of this change is that a class that used to hold a
   sched::thread member will now hold a sched::thread pointer, and use an
   extra allocation and indirection - even if we know for sure that the
   containing object will always be allocated on the heap (such as the
   case in pthread.cc's pthread::_thread, for example). This can be worked
   around by making the exceptional class a friend of sched::thread to be
   able to use the hidden constructor - but for now I decided the tiny
   performance benefit isn't important enough to make this exception.

3. Every place which used "new sched::thread(...)" to create the thread,
   now uses "sched::thread::make(...)" instead.

   Sadly we no longer allow "new sched::thread(...)", although there is
   nothing wrong with it, because disabling the constructor also disables
   the new expression.

4. One test in misc-wake.cc (which wasn't run by default anyway, because
   it is named "misc-*") was commented out because it relied on creating
   thread objects in mmap()ed areas.

One of the places modified is rcu_flush(), whose on-stack sched::thread
objects are what caused issue #790.

Fixes #790.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
myechuri pushed a commit to myechuri/osv that referenced this issue Jun 22, 2017
@nyh
Contributor Author

nyh commented May 29, 2019

In this mailing-list thread, https://groups.google.com/d/msg/osv-dev/LqUAWjzVpSc/Yf4bTdi0AwAJ, Roberto and Waldek noticed that a Java program with 400 user threads uses over half a gigabyte just for these threads' stacks, and that on Linux this doesn't happen. It happens because of this issue (Java's default stack size is 1MB).

@wkozaczuk
Collaborator

This seems like quite an important issue to fix from a memory utilization standpoint. I wonder what the pros and cons of the 3 options are:

  • single-byte read to pre-fault the next stack page before disabling preemption
    • one obvious downside is that if an application wants to use more than a single page (or whatever other buffer we want to keep available in advance) this will not work
  • "thunk code"
    • how would we implement it?
  • -fsplit-stack

@nyh
Contributor Author

nyh commented May 30, 2019

Hi @wkozaczuk, my answers and recommendations:

In the single-byte-read page fault approach, the issue is not the application wanting to use more than a single page, but the kernel. It's not even all the kernel code - it's only preemption-disabled parts of the kernel which will be limited to 4096 bytes of stack. I'm hoping that most of the kernel code does not have preemption disabled, so maybe 4096 bytes will be enough, but I really don't know.

I also suggested how we can read more than one byte at 4096-byte intervals to pre-fault more than one page (at some performance cost), but I think the best option is perhaps the one I already mentioned - bringing in several adjacent pages on one single page fault. In other words, the code reads just one byte 4096 bytes down (the stack grows downward...), but if this one read gets a page fault, the page fault handler will allocate not only this one page, but also the previous few pages, if they are still part of the same mapped range. Or, since this feature is very stack-specific, we can do this only for stacks: currently, libc_flags_to_mmap() converts MAP_STACK to "mmu::mmap_populate". It should convert it to a new "mmu::mmap_stack", and the page fault handler will only do the "populate a few pages below the faulting one" trick when it sees a mapping with the mmu::mmap_stack parameter. Other OSv stack-allocation code which today uses mmu::mmap_populate (see commit 41efdc1) will also need to use the new mmu::mmap_stack instead.
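
A rough sketch of that idea (mmu::mmap_stack does not exist yet; the flag values, populate_page() and the handler shape below are illustrative assumptions, not OSv's actual code):

    #include <sys/mman.h>
    #include <cstdint>

    namespace mmu {
        constexpr unsigned mmap_populate = 0x01;   // existing flag (value illustrative)
        constexpr unsigned mmap_stack    = 0x02;   // proposed new flag
        constexpr uintptr_t page_size    = 4096;
    }

    // libc_flags_to_mmap(): translate MAP_STACK to the new mmu::mmap_stack
    // instead of mmu::mmap_populate.
    unsigned libc_flags_to_mmap(int flags)
    {
        unsigned mmap_flags = 0;
        if (flags & MAP_STACK) {
            mmap_flags |= mmu::mmap_stack;         // was: mmu::mmap_populate
        }
        // ... other flag translations ...
        return mmap_flags;
    }

    void populate_page(uintptr_t page_addr);       // allocates and maps one page

    // In the page-fault handler, only for a mapping marked mmu::mmap_stack:
    // populate the faulting page plus a few pages below it (the stack grows
    // downward), staying inside the mapped range.
    void populate_stack_fault(uintptr_t fault_addr, uintptr_t vma_start)
    {
        constexpr int extra_pages = 3;
        uintptr_t page = fault_addr & ~(mmu::page_size - 1);
        for (int i = 0; i <= extra_pages && page >= vma_start; i++, page -= mmu::page_size) {
            populate_page(page);
        }
    }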

About thunk code - I think it's complicated; we should only look into it if we're out of other options.
About "-fsplit-stack" - I have no idea how it works, and if it does work on OSv. But even if it did, how would Java use it? I thought Java itself determined that we allocate 1MB stacks for its threads, and doesn't use "split stack" features.

@wkozaczuk
Collaborator

@nyh So it looks like your proposal is probably the best way to go. As I understand it, it would not only apply to application threads (where the app and OSv share the same stack anyway) but also to pure kernel threads (like the ZFS ones), correct? In theory the same solution could be applied to the syscall stack, which we malloc on the first syscall execution.

@nyh
Contributor Author

nyh commented May 30, 2019

@wkozaczuk this issue is relevant to stacks allocated by mmap(), not by malloc(). All our Posix threads used by applications use mmap, because our pthread::allocate_stack() uses mmu::map_anon(nullptr, size, mmu::mmap_populate, ...), which is also an mmap (and, as I said above, explicitly uses mmap_populate for immediate population; it shouldn't - perhaps it should use a new mmu::mmap_stack).

Most other internal threads in OSv have smaller stacks which are allocated by malloc(), not by mmap(). See init_stack() in arch/x64/arch-switch.hh. These are always allocated immediately and aren't lazily populated, and the thinking was that this is fine for the small stacks (usually 64 KB) we use there. So nothing needs to change there, I think.

@wkozaczuk
Collaborator

wkozaczuk commented Dec 18, 2019

Some takeaways from discussing the ongoing implementation effort - https://groups.google.com/d/msg/osv-dev/oaB5CHThc7M/2JCIvzgpAQAJ:

  1. We need to handle the nesting scenario where preempt_disable might be called after irq_disable was already called, and vice versa. This means we cannot read a byte from the page ahead blindly, as the stack may have already advanced (gotten deeper) and -4096(%rsp) could lead to another page fault which we cannot handle. This can be prevented by introducing a separate stack_page_read_counter that we would increment and decrement in the right places around disabling/enabling preemption/interrupts, and trigger the page read only if the counter is 1 (look for a WIP patch here; see also the sketch after this list).

  2. All the increments and decrements of the stack_page_read_counter should be symmetrical. Decrementing it in wait_for_interrupts() does not seem to be, but if I remember correctly, when I was trying to get it all to work I had to put it there, maybe because we start with 1. But I am no longer convinced it is necessary. We need to understand it better.

  3. Ideally, we should not even be trying to read the next page on kernel threads, or even better on threads with a stack mapped with mmap_stack. That can be accomplished by initializing the stack_page_read_counter to some value higher than 1 (10 or 11) so it never reaches 0 to trigger the page read. This could possibly be set as a one-time thing in the thread_main_c method (see arch/x64/arch-switch.hh). It receives the thread pointer, so we should somehow be able to determine how its stack was constructed. If not, we can add a boolean flag to the thread class.

  4. In theory, we could get by without the stack_page_read_counter by just checking sched::preemptable() AND somehow reading the flags register (PUSHF?) to directly check whether the interrupt flag is set in the read_stack_page method. But I have a feeling it would be more expensive.

  5. Using the counter may not be that expensive given we already use a similar mechanism to implement preemptable().
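
A rough sketch of the counter idea from point 1 (the names stack_page_read_counter and read_stack_page() come from the mailing-list discussion, but the initial value and the exact increment points are precisely the open questions of points 2 and 3, so treat this as illustrative):

    // Pre-fault the next stack page only on the outermost disable, never
    // when already nested inside a preemption/irq-disabled section.
    static __thread unsigned stack_page_read_counter = 0;

    inline void read_stack_page()
    {
        char probe;
        *(volatile char*)(&probe - 4096);   // touch one page below the current stack position
    }

    inline void disable_prologue()          // called right before preempt_disable()/irq_disable()
    {
        if (++stack_page_read_counter == 1) {
            // Outermost disable: nothing is disabled yet, so a stack page
            // fault taken here can still sleep safely.
            read_stack_page();
        }
    }

    inline void enable_epilogue()           // called right after preempt_enable()/irq_enable()
    {
        --stack_page_read_counter;
    }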

@wkozaczuk
Collaborator

@nyh Do you have any comments about these most recent takeaways and a solution proposed on the mailing list?

wkozaczuk pushed a commit that referenced this issue Aug 28, 2022
This is the original patch from Jan 11, 2021 applied manually
to resolve conflicts caused by subsequent changes to the relevant
files since.
Waldemar Kozaczuk
------------------------------------------------------------------
Since commit 695375f, new threads start
with preemption disabled, and then immediately (in thread_main_c()) enable
preemption. However, this code runs on the thread's stack, which can be
supplied by the user. If this stack is lazily allocated (not pre-populated,
with pages to be populated only on first use), we will get a page fault
when thread_main_c() begins to run with preemption disabled.

This can happen if the application creates a thread stack with mmap()
without passing the MAP_STACK option.

Note that it is still strongly recommended to pre-populate thread
stacks by using MAP_STACK (or MAP_POPULATE), because we still haven't
committed a fix for issue #143. However, let's at least not crash
immediately and consistently, as we did before this patch.

This patch simply reads from the top byte of the user-supplied stack
before starting the thread - causing it to be faulted in (if needed)
at that time instead of when the thread starts. The first stack page
should be enough to start thread_main_c() and get to the point where
it re-enables preemption.

This patch also includes a test reproducing this issue and demonstrating
that it is fixed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
wkozaczuk added a commit that referenced this issue Oct 16, 2022
This is the 1st of a total of eight patches that implement optional
support of the so-called "lazy stack" feature. The lazy stack is well explained
by issue #143 and allows saving a substantial amount of memory if an
application spawns many pthreads with large stacks, by letting the stack grow
dynamically as needed instead of getting pre-populated ahead of time.

The crux of this solution, like the previous versions, is based on the observation
that OSv's memory fault handler requires that both interrupts and preemption
be enabled when a fault is triggered. Therefore, if the stack is dynamically mapped,
we need to make sure that stack page faults NEVER happen in the relatively
few places of kernel code that execute with either interrupts or preemption
disabled. We satisfy this requirement by "pre-faulting" the stack: reading
a byte one page (4096 bytes) below the stack pointer just before preemption or
interrupts are disabled. Now, the problem is complicated by the fact that kernel
code A that disables preemption or interrupts may nest by calling another kernel
function B that also disables preemption or interrupts, in which case function
B should NOT try to pre-fault the stack, otherwise the fault handler will abort
because the constraint stated before is violated. In short, we cannot "blindly"
or unconditionally pre-fault the stack in all places before interrupts or
preemption are disabled.

Some of the previous solutions modified both arch::irq_disable() and
sched::preempt_disable() to check if both preemption and interrupts
are enabled and only then try to read a byte at a -4096 offset. Unfortunately,
this makes it more costly than envisioned by Nadav Har'El - instead of a single
instruction to read from memory, the compiler needs 4-5 instructions to read the
data if preemption and interrupts are enabled and perform the relevant jump.
To make it worse, the implementation of arch::irq_enabled() is pretty expensive,
at least on x64, and uses the stack with pushfq. To avoid that, the previous
solutions would add a new thread-local counter and pack the irq-disabling counter
together with the preemption one. But even with this optimization I found this
approach to deteriorate performance quite substantially. For example, the memory
allocation logic disables preemption in quite a few places (see core/mempool.cc),
and the corresponding test - misc-free-perf.cc - would show performance - the
number of malloc()/free() calls executed per second - degrade on average by 15-20%.

So this latest version, implemented by this and the next 7 patches, takes a
different approach. Instead of putting the conditional pre-faulting of the stack
in both irq_disable() and preempt_disable(), we analyze the OSv code to find all
places where irq_disable() and/or preempt_disable() is called directly (or
sometimes indirectly) and pre-fault the stack there where necessary. This makes
it obviously way more laborious and prone to human error (we can miss some places),
but makes it way more performant (no noticeable performance degradation observed)
compared to the earlier versions described in the paragraph above.

As we analyze all call sites, we need to make some observations to help us
decide what exactly to do in each case:
- do nothing
- blindly pre-fault the stack (single instruction)
- conditionally pre-fault the stack (hopefully in very few places)

Rule 1: Do nothing if the call site in question ALWAYS executes in a kernel thread.

Rule 2: Do nothing if the call site executes on a populated stack - this includes
        the above but also code executing on an interrupt, exception or syscall stack.

Rule 3: Do nothing if the call site executes when we know that either interrupts
        or preemption are disabled. A good example is an interrupt handler
        or code within WITH_LOCK(irq_lock) or WITH_LOCK(preemption_lock) blocks.

Rule 4: Pre-fault unconditionally if we know that BOTH preemption and interrupts
        are enabled. In most cases this can only be deduced by analysing where
        the particular function is called. In general any such function called by
        user code like libc would satisfy the condition. But sometimes it is tricky
        because the kernel might be calling libc functions, such as malloc().

Rule 5: Otherwise, pre-fault the stack by determining dynamically:
        only if sched::preemptable() and irq::enabled()

One general rule is that all potential stack page faults would happen
on an application thread's stack when some kernel code gets executed down
the call stack.

In general we identify the call sites in the following categories:
- direct calls to arch::irq_disable() and arch::irq_disable_notrace() (tracepoints)
- direct calls to sched::preempt_disable()
- code using WITH_LOCK() with instance of irq_lock_type or irq_save_lock_type
- code using WITH_LOCK(preempt_lock)
- code using WITH_LOCK(osv::rcu_read_lock)

The above locations can be found with a simple grep, but also with an IDE like
CLion from JetBrains, which can more efficiently find all direct and, more
importantly, indirect usages of the call sites identified above.

So this patch lays the groundwork by defining the inline assembly to pre-fault
the stack where necessary and introduces two build parameters -
CONF_lazy_stack and CONF_lazy_stack_invariant - that are disabled by default.
The first one is used in all places to enable the lazy stack logic, and the second
one is used to add code with some related invariants that will help us reason
about the code and decide whether we should do nothing, pre-fault the stack
"blindly", or pre-fault it conditionally.

The remaining 7 patches mostly add the pre-fault code in relevant places
but also annotate code with some invariants using assert().

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
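
The pre-fault itself boils down to a single read one page below the stack pointer. A sketch of the two flavors described above, guarded by the new build parameter (the exact function names and assembly in the actual patches may differ):

    namespace sched { bool preemptable(); }
    namespace arch  { bool irq_enabled(); }

    // "Blind" pre-fault (Rule 4): a single read at rsp - 4096, used where we
    // know both preemption and interrupts are still enabled.
    inline void prefault_stack_blind()
    {
    #if CONF_lazy_stack
        char tmp;
        asm volatile("movb -4096(%%rsp), %0" : "=r"(tmp));
    #endif
    }

    // Conditional pre-fault (Rule 5): only when both preemption and interrupts
    // are enabled, so the fault handler is actually allowed to run.
    inline void prefault_stack_conditional()
    {
    #if CONF_lazy_stack
        if (sched::preemptable() && arch::irq_enabled()) {
            prefault_stack_blind();
        }
    #endif
    }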
@wkozaczuk
Collaborator

This has been implemented by this series of commits - https://github.com/search?q=repo%3Acloudius-systems%2Fosv+%22lazy+stack%22&type=commits. Even though lazy stack support is not enabled by default (see the lazy stack flag in conf/base.mk), because we deem it experimental, I suggest we close this issue as implemented.

This issue was closed.