
Commit 06201e0

davidhildenbrand authored and Alexander Gordeev committed
s390/mm: Re-enable the shared zeropage for !PV and !skeys KVM guests
commit fa41ba0 ("s390/mm: avoid empty zero pages for KVM guests to avoid postcopy hangs") introduced an undesired side effect when combined with memory ballooning and VM migration: memory that is part of the inflated memory balloon will consume memory.

Assume we have a 100GiB VM and inflate the balloon to 40GiB. Our VM will consume ~60GiB of memory. If we now trigger a VM migration, hypervisors like QEMU will read all VM memory. As s390x does not support the shared zeropage, we'll end up allocating memory for all previously-inflated memory that is part of the memory balloon: 40 GiB. So we might easily (and unexpectedly) crash the VM on the migration source.

Even worse, hypervisors like QEMU optimize zeropage migration to not consume memory on the migration destination: when migrating a "page full of zeroes", they check whether the target memory is already zero (by reading the destination memory) and avoid writing to the memory so as not to allocate it. However, s390x will also allocate memory here, implying that on the migration destination, too, we will end up allocating all previously-inflated memory that is part of the memory balloon.

This is especially bad if actual memory overcommit was not desired, e.g., when memory ballooning is used for dynamic VM memory resizing, setting aside some memory during boot that can be added later on demand. Alternatives like virtio-mem that would avoid this issue are not yet available on s390x.

There could be ways to optimize some cases in user space: before reading memory in an anonymous private mapping on the migration source, check via /proc/self/pagemap whether anything is already populated. Similarly, check on the migration destination before reading. While that would avoid populating tables full of shared zeropages on all architectures, it's harder to get right and performant, and it requires user space changes. Further, with postcopy live migration we must place a page, so there, "avoid touching memory to avoid allocating memory" is not really possible. (Note that previously we would have falsely inserted shared zeropages into processes using UFFDIO_ZEROPAGE where mm_forbids_zeropage() would have actually forbidden it.)

PV is currently incompatible with memory ballooning, and in the common case, KVM guests don't make use of storage keys. Instead of zapping zeropages when enabling storage keys / PV, which turned out to be problematic in the past, let's do exactly what we do with KSM pages: trigger unsharing faults to replace the shared zeropages by proper anonymous folios.

What about the added latency when enabling storage keys? Having a lot of zeropages in applicable environments (PV, legacy guests, unittests) is unexpected. Further, KSM could already unshare the zeropages today, and unmerging KSM pages when enabling storage keys would unshare the KSM-placed zeropages in the same way, resulting in the same latency.

[ agordeev: Fixed sparse and checkpatch complaints and error handling ]

Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Fixes: fa41ba0 ("s390/mm: avoid empty zero pages for KVM guests to avoid postcopy hangs")
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240411161441.910170-3-david@redhat.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
1 parent 90a7592 commit 06201e0
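
The commit message above mentions a possible user-space mitigation: checking /proc/self/pagemap before touching memory. As background only (this is not part of the commit and not how QEMU behaves today), a minimal sketch of such a check, assuming the documented pagemap format (bit 63 = page present, bit 62 = swapped); the helper name page_is_populated() is hypothetical:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the page backing addr is present or
 * swapped according to /proc/self/pagemap (one 64-bit entry per page),
 * 0 if it is still unpopulated, -1 on error.
 */
static int page_is_populated(int pagemap_fd, const void *addr)
{
        long page_size = sysconf(_SC_PAGESIZE);
        uint64_t entry;
        off_t off = ((uintptr_t)addr / page_size) * sizeof(entry);

        if (pread(pagemap_fd, &entry, sizeof(entry), off) != (ssize_t)sizeof(entry))
                return -1;
        return !!(entry & (3ULL << 62)); /* present (bit 63) or swapped (bit 62) */
}

int main(void)
{
        int fd = open("/proc/self/pagemap", O_RDONLY);
        char probe = 0;

        if (fd < 0)
                return 1;
        printf("probe page populated: %d\n", page_is_populated(fd, &probe));
        close(fd);
        return 0;
}

As the message notes, such a check cannot help with postcopy live migration, where a page must be placed regardless.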

File tree

6 files changed (+146, -47 lines):

arch/s390/include/asm/gmap.h
arch/s390/include/asm/mmu.h
arch/s390/include/asm/mmu_context.h
arch/s390/include/asm/pgtable.h
arch/s390/kvm/kvm-s390.c
arch/s390/mm/gmap.c

arch/s390/include/asm/gmap.h

Lines changed: 1 addition & 1 deletion
@@ -146,7 +146,7 @@ int gmap_mprotect_notify(struct gmap *, unsigned long start,
 
 void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
                             unsigned long gaddr, unsigned long vmaddr);
-int gmap_mark_unmergeable(void);
+int s390_disable_cow_sharing(void);
 void s390_unlist_old_asce(struct gmap *gmap);
 int s390_replace_asce(struct gmap *gmap);
 void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns);

arch/s390/include/asm/mmu.h

Lines changed: 5 additions & 0 deletions
@@ -32,6 +32,11 @@ typedef struct {
        unsigned int uses_skeys:1;
        /* The mmu context uses CMM. */
        unsigned int uses_cmm:1;
+       /*
+        * The mmu context allows COW-sharing of memory pages (KSM, zeropage).
+        * Note that COW-sharing during fork() is currently always allowed.
+        */
+       unsigned int allow_cow_sharing:1;
        /* The gmaps associated with this context are allowed to use huge pages. */
        unsigned int allow_gmap_hpage_1m:1;
 } mm_context_t;

arch/s390/include/asm/mmu_context.h

Lines changed: 1 addition & 0 deletions
@@ -35,6 +35,7 @@ static inline int init_new_context(struct task_struct *tsk,
        mm->context.has_pgste = 0;
        mm->context.uses_skeys = 0;
        mm->context.uses_cmm = 0;
+       mm->context.allow_cow_sharing = 1;
        mm->context.allow_gmap_hpage_1m = 0;
 #endif
        switch (mm->context.asce_limit) {

arch/s390/include/asm/pgtable.h

Lines changed: 13 additions & 3 deletions
@@ -566,10 +566,20 @@ static inline pud_t set_pud_bit(pud_t pud, pgprot_t prot)
 }
 
 /*
- * In the case that a guest uses storage keys
- * faults should no longer be backed by zero pages
+ * As soon as the guest uses storage keys or enables PV, we deduplicate all
+ * mapped shared zeropages and prevent new shared zeropages from getting
+ * mapped.
  */
-#define mm_forbids_zeropage mm_has_pgste
+#define mm_forbids_zeropage mm_forbids_zeropage
+static inline int mm_forbids_zeropage(struct mm_struct *mm)
+{
+#ifdef CONFIG_PGSTE
+       if (!mm->context.allow_cow_sharing)
+               return 1;
+#endif
+       return 0;
+}
+
 
 static inline int mm_uses_skeys(struct mm_struct *mm)
 {
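
For context on where this hook takes effect: the generic anonymous fault path only maps the shared zeropage when mm_forbids_zeropage() is false. A heavily condensed sketch of that consumer (based on do_anonymous_page() in mm/memory.c; not part of this diff, surrounding checks and error handling elided):

        /* Use the zero-page for reads (condensed sketch, details elided) */
        if (!(vmf->flags & FAULT_FLAG_WRITE) &&
            !mm_forbids_zeropage(vma->vm_mm)) {
                /* install a read-only mapping of the shared zeropage */
        } else {
                /* allocate and map a real anonymous folio */
        }

With allow_cow_sharing cleared, every such fault therefore takes the allocation path, which is exactly what the unsharing logic in gmap.c below relies on.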

arch/s390/kvm/kvm-s390.c

Lines changed: 1 addition & 3 deletions
@@ -2631,9 +2631,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
                if (r)
                        break;
 
-               mmap_write_lock(current->mm);
-               r = gmap_mark_unmergeable();
-               mmap_write_unlock(current->mm);
+               r = s390_disable_cow_sharing();
                if (r)
                        break;
 
arch/s390/mm/gmap.c

Lines changed: 125 additions & 40 deletions
@@ -2549,41 +2549,6 @@ static inline void thp_split_mm(struct mm_struct *mm)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-/*
- * Remove all empty zero pages from the mapping for lazy refaulting
- * - This must be called after mm->context.has_pgste is set, to avoid
- *   future creation of zero pages
- * - This must be called after THP was disabled.
- *
- * mm contracts with s390, that even if mm were to remove a page table,
- * racing with the loop below and so causing pte_offset_map_lock() to fail,
- * it will never insert a page table containing empty zero pages once
- * mm_forbids_zeropage(mm) i.e. mm->context.has_pgste is set.
- */
-static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
-                           unsigned long end, struct mm_walk *walk)
-{
-       unsigned long addr;
-
-       for (addr = start; addr != end; addr += PAGE_SIZE) {
-               pte_t *ptep;
-               spinlock_t *ptl;
-
-               ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
-               if (!ptep)
-                       break;
-               if (is_zero_pfn(pte_pfn(*ptep)))
-                       ptep_xchg_direct(walk->mm, addr, ptep, __pte(_PAGE_INVALID));
-               pte_unmap_unlock(ptep, ptl);
-       }
-       return 0;
-}
-
-static const struct mm_walk_ops zap_zero_walk_ops = {
-       .pmd_entry = __zap_zero_pages,
-       .walk_lock = PGWALK_WRLOCK,
-};
-
 /*
  * switch on pgstes for its userspace process (for kvm)
  */
@@ -2601,22 +2566,142 @@ int s390_enable_sie(void)
        mm->context.has_pgste = 1;
        /* split thp mappings and disable thp for future mappings */
        thp_split_mm(mm);
-       walk_page_range(mm, 0, TASK_SIZE, &zap_zero_walk_ops, NULL);
        mmap_write_unlock(mm);
        return 0;
 }
 EXPORT_SYMBOL_GPL(s390_enable_sie);
 
-int gmap_mark_unmergeable(void)
+static int find_zeropage_pte_entry(pte_t *pte, unsigned long addr,
+                                  unsigned long end, struct mm_walk *walk)
+{
+       unsigned long *found_addr = walk->private;
+
+       /* Return 1 of the page is a zeropage. */
+       if (is_zero_pfn(pte_pfn(*pte))) {
+               /*
+                * Shared zeropage in e.g., a FS DAX mapping? We cannot do the
+                * right thing and likely don't care: FAULT_FLAG_UNSHARE
+                * currently only works in COW mappings, which is also where
+                * mm_forbids_zeropage() is checked.
+                */
+               if (!is_cow_mapping(walk->vma->vm_flags))
+                       return -EFAULT;
+
+               *found_addr = addr;
+               return 1;
+       }
+       return 0;
+}
+
+static const struct mm_walk_ops find_zeropage_ops = {
+       .pte_entry = find_zeropage_pte_entry,
+       .walk_lock = PGWALK_WRLOCK,
+};
+
+/*
+ * Unshare all shared zeropages, replacing them by anonymous pages. Note that
+ * we cannot simply zap all shared zeropages, because this could later
+ * trigger unexpected userfaultfd missing events.
+ *
+ * This must be called after mm->context.allow_cow_sharing was
+ * set to 0, to avoid future mappings of shared zeropages.
+ *
+ * mm contracts with s390, that even if mm were to remove a page table,
+ * and racing with walk_page_range_vma() calling pte_offset_map_lock()
+ * would fail, it will never insert a page table containing empty zero
+ * pages once mm_forbids_zeropage(mm) i.e.
+ * mm->context.allow_cow_sharing is set to 0.
+ */
+static int __s390_unshare_zeropages(struct mm_struct *mm)
+{
+       struct vm_area_struct *vma;
+       VMA_ITERATOR(vmi, mm, 0);
+       unsigned long addr;
+       vm_fault_t fault;
+       int rc;
+
+       for_each_vma(vmi, vma) {
+               /*
+                * We could only look at COW mappings, but it's more future
+                * proof to catch unexpected zeropages in other mappings and
+                * fail.
+                */
+               if ((vma->vm_flags & VM_PFNMAP) || is_vm_hugetlb_page(vma))
+                       continue;
+               addr = vma->vm_start;
+
+retry:
+               rc = walk_page_range_vma(vma, addr, vma->vm_end,
+                                        &find_zeropage_ops, &addr);
+               if (rc < 0)
+                       return rc;
+               else if (!rc)
+                       continue;
+
+               /* addr was updated by find_zeropage_pte_entry() */
+               fault = handle_mm_fault(vma, addr,
+                                       FAULT_FLAG_UNSHARE | FAULT_FLAG_REMOTE,
+                                       NULL);
+               if (fault & VM_FAULT_OOM)
+                       return -ENOMEM;
+               /*
+                * See break_ksm(): even after handle_mm_fault() returned 0, we
+                * must start the lookup from the current address, because
+                * handle_mm_fault() may back out if there's any difficulty.
+                *
+                * VM_FAULT_SIGBUS and VM_FAULT_SIGSEGV are unexpected but
+                * maybe they could trigger in the future on concurrent
+                * truncation. In that case, the shared zeropage would be gone
+                * and we can simply retry and make progress.
+                */
+               cond_resched();
+               goto retry;
+       }
+
+       return 0;
+}
+
+static int __s390_disable_cow_sharing(struct mm_struct *mm)
 {
+       int rc;
+
+       if (!mm->context.allow_cow_sharing)
+               return 0;
+
+       mm->context.allow_cow_sharing = 0;
+
+       /* Replace all shared zeropages by anonymous pages. */
+       rc = __s390_unshare_zeropages(mm);
        /*
         * Make sure to disable KSM (if enabled for the whole process or
         * individual VMAs). Note that nothing currently hinders user space
         * from re-enabling it.
         */
-       return ksm_disable(current->mm);
+       if (!rc)
+               rc = ksm_disable(mm);
+       if (rc)
+               mm->context.allow_cow_sharing = 1;
+       return rc;
+}
+
+/*
+ * Disable most COW-sharing of memory pages for the whole process:
+ * (1) Disable KSM and unmerge/unshare any KSM pages.
+ * (2) Disallow shared zeropages and unshare any zerpages that are mapped.
+ *
+ * Not that we currently don't bother with COW-shared pages that are shared
+ * with parent/child processes due to fork().
+ */
+int s390_disable_cow_sharing(void)
+{
+       int rc;
+
+       mmap_write_lock(current->mm);
+       rc = __s390_disable_cow_sharing(current->mm);
+       mmap_write_unlock(current->mm);
+       return rc;
 }
-EXPORT_SYMBOL_GPL(gmap_mark_unmergeable);
+EXPORT_SYMBOL_GPL(s390_disable_cow_sharing);
 
 /*
  * Enable storage key handling from now on and initialize the storage
@@ -2685,7 +2770,7 @@ int s390_enable_skey(void)
                goto out_up;
 
        mm->context.uses_skeys = 1;
-       rc = gmap_mark_unmergeable();
+       rc = __s390_disable_cow_sharing(mm);
        if (rc) {
                mm->context.uses_skeys = 0;
                goto out_up;
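
Taken together, the hunks above give two paths into the new helper: one from the PV enablement path in KVM (which takes the mmap lock itself) and one from storage-key enablement in s390_enable_skey(), which already holds the mmap lock and therefore calls the double-underscore variant directly. A rough sketch of the resulting call flow, derived from this diff; the ioctl/command names are added for orientation only:

/*
 * KVM_S390_PV_COMMAND (KVM_PV_ENABLE)        s390_enable_skey()
 *   kvm_s390_handle_pv()                       (mmap lock already held)
 *     s390_disable_cow_sharing()               mm->context.uses_skeys = 1;
 *       mmap_write_lock(current->mm);          __s390_disable_cow_sharing(mm)
 *       __s390_disable_cow_sharing(mm)
 *         mm->context.allow_cow_sharing = 0;
 *         __s390_unshare_zeropages(mm);   // FAULT_FLAG_UNSHARE per zeropage
 *         ksm_disable(mm);                // unmerge existing KSM pages
 *       mmap_write_unlock(current->mm);
 */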
