
Commit 22a49f6

Committed by Alexander Gordeev
Merge branch 'shared-zeropage' into features
David Hildenbrand says:

===================

This series fixes one issue with uffd + shared zeropages on s390x and makes
sure that "ordinary" KVM guests can make use of shared zeropages again.

userfaultfd can currently end up mapping shared zeropages into processes
that forbid shared zeropages. This only applies to s390x, where it matters
for correctly handling PV guests and guests that use storage keys. Fix it
by placing a zeroed folio instead of the shared zeropage during
UFFDIO_ZEROPAGE.

I stumbled over this issue while looking into a customer scenario that is
using:

(1) Memory ballooning for dynamic resizing. Start a VM with, say, 100 GiB
    and inflate the balloon during boot to 60 GiB. The VM has ~40 GiB
    available and additional memory can be "fake hotplugged" to the VM
    later on demand by deflating the balloon. Actual memory overcommit is
    not desired, so physical memory would only be moved between VMs.

(2) Live migration of VMs between sites to evacuate servers in case of
    emergency.

Without the shared zeropage, during (2), the VM would suddenly consume
100 GiB on the migration source and destination. On the migration source,
where we don't expect memory overcommit, we could easily end up crashing
the VM during migration.

Independent of that, memory handed back to the hypervisor using "free page
reporting" would end up consuming actual memory after the migration on the
destination, not getting freed up until reused+freed again.

While there might be ways to optimize parts of this in QEMU, we really
should just support the shared zeropage again for ordinary VMs.

We only expect legacy guests to make use of storage keys, so let's only
stop using shared zeropages when storage keys are enabled or when PV is
enabled. To not break userfaultfd like we did in the past, don't zap the
shared zeropages, but instead trigger unsharing faults, just like we do for
unsharing KSM pages in break_ksm(). Unsharing faults will simply replace
the shared zeropage by a zeroed anonymous folio. We can already trigger the
same fault path using GUP, when trying to long-term pin a shared zeropage,
but also when unmerging a KSM-placed zeropage, so this is nothing new.

Patch #1 tested on x86-64 by forcing mm_forbids_zeropage() to be 1, and
running the uffd selftests.

Patch #2 tested on s390x: the live migration scenario now works as
expected, and kvm-unit-tests that trigger usage of skeys work well, whereby
I can see detection and unsharing of shared zeropages. Further (as broken
in v2), I tested that the shared zeropage is no longer populated after
skeys are used -- that mm_forbids_zeropage() works as expected:

    ./s390x-run s390x/skey.elf \
     -no-shutdown \
     -chardev socket,id=monitor,path=/var/tmp/mon,server,nowait \
     -mon chardev=monitor,mode=readline

Then, in another shell:

    # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
    Rss: 31484 kB
    # echo "dump-guest-memory tmp" | sudo nc -U /var/tmp/mon
    ...
    # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
    Rss: 160452 kB

-> Reading guest memory does not populate the shared zeropage

Doing the same with selftest.elf (no skeys):

    # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
    Rss: 30900 kB
    # echo "dump-guest-memory tmp" | sudo nc -U /var/tmp/mon
    ...
    # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
    Rss: 30924 kB

-> Reading guest memory does populate the shared zeropage

===================

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 parents cc4edb9 + 06201e0 commit 22a49f6

File tree

7 files changed: +181 additions, -47 deletions

arch/s390/include/asm/gmap.h

Lines changed: 1 addition & 1 deletion
@@ -146,7 +146,7 @@ int gmap_mprotect_notify(struct gmap *, unsigned long start,
 
 void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
 			     unsigned long gaddr, unsigned long vmaddr);
-int gmap_mark_unmergeable(void);
+int s390_disable_cow_sharing(void);
 void s390_unlist_old_asce(struct gmap *gmap);
 int s390_replace_asce(struct gmap *gmap);
 void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns);

arch/s390/include/asm/mmu.h

Lines changed: 5 additions & 0 deletions
@@ -32,6 +32,11 @@ typedef struct {
 	unsigned int uses_skeys:1;
 	/* The mmu context uses CMM. */
 	unsigned int uses_cmm:1;
+	/*
+	 * The mmu context allows COW-sharing of memory pages (KSM, zeropage).
+	 * Note that COW-sharing during fork() is currently always allowed.
+	 */
+	unsigned int allow_cow_sharing:1;
 	/* The gmaps associated with this context are allowed to use huge pages. */
 	unsigned int allow_gmap_hpage_1m:1;
 } mm_context_t;

arch/s390/include/asm/mmu_context.h

Lines changed: 1 addition & 0 deletions
@@ -35,6 +35,7 @@ static inline int init_new_context(struct task_struct *tsk,
 	mm->context.has_pgste = 0;
 	mm->context.uses_skeys = 0;
 	mm->context.uses_cmm = 0;
+	mm->context.allow_cow_sharing = 1;
 	mm->context.allow_gmap_hpage_1m = 0;
 #endif
 	switch (mm->context.asce_limit) {

arch/s390/include/asm/pgtable.h

Lines changed: 13 additions & 3 deletions
@@ -572,10 +572,20 @@ static inline pud_t set_pud_bit(pud_t pud, pgprot_t prot)
 }
 
 /*
- * In the case that a guest uses storage keys
- * faults should no longer be backed by zero pages
+ * As soon as the guest uses storage keys or enables PV, we deduplicate all
+ * mapped shared zeropages and prevent new shared zeropages from getting
+ * mapped.
  */
-#define mm_forbids_zeropage mm_has_pgste
+#define mm_forbids_zeropage mm_forbids_zeropage
+static inline int mm_forbids_zeropage(struct mm_struct *mm)
+{
+#ifdef CONFIG_PGSTE
+	if (!mm->context.allow_cow_sharing)
+		return 1;
+#endif
+	return 0;
+}
+
 static inline int mm_uses_skeys(struct mm_struct *mm)
 {
 #ifdef CONFIG_PGSTE
arch/s390/kvm/kvm-s390.c

Lines changed: 1 addition & 3 deletions
@@ -2631,9 +2631,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
 		if (r)
 			break;
 
-		mmap_write_lock(current->mm);
-		r = gmap_mark_unmergeable();
-		mmap_write_unlock(current->mm);
+		r = s390_disable_cow_sharing();
 		if (r)
 			break;
 
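The call site above sits in the KVM_PV_ENABLE handler, so it runs when userspace converts a guest into a protected (PV) guest. A heavily simplified, hypothetical sketch of that userspace side follows; real PV enablement needs considerably more setup (secure parameters, unpacking the guest image, etc.), and enable_pv() is an illustrative helper, not an existing API. It is shown only to locate the entry point that now ends up in s390_disable_cow_sharing().

#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

int enable_pv(void)
{
	int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
	int vm = ioctl(kvm, KVM_CREATE_VM, 0);
	struct kvm_pv_cmd cmd = { .cmd = KVM_PV_ENABLE };

	/*
	 * Inside the kernel, KVM_PV_ENABLE now calls
	 * s390_disable_cow_sharing() before converting the guest,
	 * unsharing any mapped shared zeropages and disabling KSM.
	 */
	return ioctl(vm, KVM_S390_PV_COMMAND, &cmd);
}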
arch/s390/mm/gmap.c

Lines changed: 125 additions & 40 deletions
@@ -2549,41 +2549,6 @@ static inline void thp_split_mm(struct mm_struct *mm)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-/*
- * Remove all empty zero pages from the mapping for lazy refaulting
- * - This must be called after mm->context.has_pgste is set, to avoid
- *   future creation of zero pages
- * - This must be called after THP was disabled.
- *
- * mm contracts with s390, that even if mm were to remove a page table,
- * racing with the loop below and so causing pte_offset_map_lock() to fail,
- * it will never insert a page table containing empty zero pages once
- * mm_forbids_zeropage(mm) i.e. mm->context.has_pgste is set.
- */
-static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
-			    unsigned long end, struct mm_walk *walk)
-{
-	unsigned long addr;
-
-	for (addr = start; addr != end; addr += PAGE_SIZE) {
-		pte_t *ptep;
-		spinlock_t *ptl;
-
-		ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
-		if (!ptep)
-			break;
-		if (is_zero_pfn(pte_pfn(*ptep)))
-			ptep_xchg_direct(walk->mm, addr, ptep, __pte(_PAGE_INVALID));
-		pte_unmap_unlock(ptep, ptl);
-	}
-	return 0;
-}
-
-static const struct mm_walk_ops zap_zero_walk_ops = {
-	.pmd_entry = __zap_zero_pages,
-	.walk_lock = PGWALK_WRLOCK,
-};
-
 /*
  * switch on pgstes for its userspace process (for kvm)
  */
@@ -2601,22 +2566,142 @@ int s390_enable_sie(void)
 	mm->context.has_pgste = 1;
 	/* split thp mappings and disable thp for future mappings */
 	thp_split_mm(mm);
-	walk_page_range(mm, 0, TASK_SIZE, &zap_zero_walk_ops, NULL);
 	mmap_write_unlock(mm);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(s390_enable_sie);
 
-int gmap_mark_unmergeable(void)
+static int find_zeropage_pte_entry(pte_t *pte, unsigned long addr,
+				   unsigned long end, struct mm_walk *walk)
+{
+	unsigned long *found_addr = walk->private;
+
+	/* Return 1 of the page is a zeropage. */
+	if (is_zero_pfn(pte_pfn(*pte))) {
+		/*
+		 * Shared zeropage in e.g., a FS DAX mapping? We cannot do the
+		 * right thing and likely don't care: FAULT_FLAG_UNSHARE
+		 * currently only works in COW mappings, which is also where
+		 * mm_forbids_zeropage() is checked.
+		 */
+		if (!is_cow_mapping(walk->vma->vm_flags))
+			return -EFAULT;
+
+		*found_addr = addr;
+		return 1;
+	}
+	return 0;
+}
+
+static const struct mm_walk_ops find_zeropage_ops = {
+	.pte_entry = find_zeropage_pte_entry,
+	.walk_lock = PGWALK_WRLOCK,
+};
+
+/*
+ * Unshare all shared zeropages, replacing them by anonymous pages. Note that
+ * we cannot simply zap all shared zeropages, because this could later
+ * trigger unexpected userfaultfd missing events.
+ *
+ * This must be called after mm->context.allow_cow_sharing was
+ * set to 0, to avoid future mappings of shared zeropages.
+ *
+ * mm contracts with s390, that even if mm were to remove a page table,
+ * and racing with walk_page_range_vma() calling pte_offset_map_lock()
+ * would fail, it will never insert a page table containing empty zero
+ * pages once mm_forbids_zeropage(mm) i.e.
+ * mm->context.allow_cow_sharing is set to 0.
+ */
+static int __s390_unshare_zeropages(struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, 0);
+	unsigned long addr;
+	vm_fault_t fault;
+	int rc;
+
+	for_each_vma(vmi, vma) {
+		/*
+		 * We could only look at COW mappings, but it's more future
+		 * proof to catch unexpected zeropages in other mappings and
+		 * fail.
+		 */
+		if ((vma->vm_flags & VM_PFNMAP) || is_vm_hugetlb_page(vma))
+			continue;
+		addr = vma->vm_start;
+
+retry:
+		rc = walk_page_range_vma(vma, addr, vma->vm_end,
+					 &find_zeropage_ops, &addr);
+		if (rc < 0)
+			return rc;
+		else if (!rc)
+			continue;
+
+		/* addr was updated by find_zeropage_pte_entry() */
+		fault = handle_mm_fault(vma, addr,
+					FAULT_FLAG_UNSHARE | FAULT_FLAG_REMOTE,
+					NULL);
+		if (fault & VM_FAULT_OOM)
+			return -ENOMEM;
+		/*
+		 * See break_ksm(): even after handle_mm_fault() returned 0, we
+		 * must start the lookup from the current address, because
+		 * handle_mm_fault() may back out if there's any difficulty.
+		 *
+		 * VM_FAULT_SIGBUS and VM_FAULT_SIGSEGV are unexpected but
+		 * maybe they could trigger in the future on concurrent
+		 * truncation. In that case, the shared zeropage would be gone
+		 * and we can simply retry and make progress.
+		 */
+		cond_resched();
+		goto retry;
+	}
+
+	return 0;
+}
+
+static int __s390_disable_cow_sharing(struct mm_struct *mm)
 {
+	int rc;
+
+	if (!mm->context.allow_cow_sharing)
+		return 0;
+
+	mm->context.allow_cow_sharing = 0;
+
+	/* Replace all shared zeropages by anonymous pages. */
+	rc = __s390_unshare_zeropages(mm);
 	/*
 	 * Make sure to disable KSM (if enabled for the whole process or
 	 * individual VMAs). Note that nothing currently hinders user space
 	 * from re-enabling it.
 	 */
-	return ksm_disable(current->mm);
+	if (!rc)
+		rc = ksm_disable(mm);
+	if (rc)
+		mm->context.allow_cow_sharing = 1;
+	return rc;
+}
+
+/*
+ * Disable most COW-sharing of memory pages for the whole process:
+ * (1) Disable KSM and unmerge/unshare any KSM pages.
+ * (2) Disallow shared zeropages and unshare any zerpages that are mapped.
+ *
+ * Not that we currently don't bother with COW-shared pages that are shared
+ * with parent/child processes due to fork().
+ */
+int s390_disable_cow_sharing(void)
+{
+	int rc;
+
+	mmap_write_lock(current->mm);
+	rc = __s390_disable_cow_sharing(current->mm);
+	mmap_write_unlock(current->mm);
+	return rc;
 }
-EXPORT_SYMBOL_GPL(gmap_mark_unmergeable);
+EXPORT_SYMBOL_GPL(s390_disable_cow_sharing);
 
 /*
  * Enable storage key handling from now on and initialize the storage
@@ -2685,7 +2770,7 @@ int s390_enable_skey(void)
 		goto out_up;
 
 	mm->context.uses_skeys = 1;
-	rc = gmap_mark_unmergeable();
+	rc = __s390_disable_cow_sharing(mm);
 	if (rc) {
 		mm->context.uses_skeys = 0;
 		goto out_up;

mm/userfaultfd.c

Lines changed: 35 additions & 0 deletions
@@ -316,6 +316,38 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
 	goto out;
 }
 
+static int mfill_atomic_pte_zeroed_folio(pmd_t *dst_pmd,
+					 struct vm_area_struct *dst_vma,
+					 unsigned long dst_addr)
+{
+	struct folio *folio;
+	int ret = -ENOMEM;
+
+	folio = vma_alloc_zeroed_movable_folio(dst_vma, dst_addr);
+	if (!folio)
+		return ret;
+
+	if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL))
+		goto out_put;
+
+	/*
+	 * The memory barrier inside __folio_mark_uptodate makes sure that
+	 * zeroing out the folio become visible before mapping the page
+	 * using set_pte_at(). See do_anonymous_page().
+	 */
+	__folio_mark_uptodate(folio);
+
+	ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
+				       &folio->page, true, 0);
+	if (ret)
+		goto out_put;
+
+	return 0;
+out_put:
+	folio_put(folio);
+	return ret;
+}
+
 static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
 				     struct vm_area_struct *dst_vma,
 				     unsigned long dst_addr)
@@ -324,6 +356,9 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
 	spinlock_t *ptl;
 	int ret;
 
+	if (mm_forbids_zeropage(dst_vma->vm_mm))
+		return mfill_atomic_pte_zeroed_folio(dst_pmd, dst_vma, dst_addr);
+
 	_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
 					 dst_vma->vm_page_prot));
 	ret = -EAGAIN;

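From userspace, nothing changes: UFFDIO_ZEROPAGE keeps working, but on an mm that forbids the shared zeropage (e.g. a KVM guest using storage keys or PV on s390x) the kernel now installs a zeroed anonymous folio instead. A minimal, hypothetical user of the ioctl is sketched below (error handling trimmed; userfaultfd may require privileges or a permissive vm.unprivileged_userfaultfd sysctl):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };

	ioctl(uffd, UFFDIO_API, &api);

	char *area = mmap(NULL, page, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Register the range for missing-fault handling. */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = page },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/*
	 * Resolve the range with a zero page. With this commit, an mm for
	 * which mm_forbids_zeropage() is true transparently gets a zeroed
	 * anonymous folio here instead of the shared zeropage.
	 */
	struct uffdio_zeropage zp = {
		.range = { .start = (unsigned long)area, .len = page },
	};
	ioctl(uffd, UFFDIO_ZEROPAGE, &zp);

	printf("first byte: %d\n", area[0]);
	return 0;
}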