
Commit 3ea2771

Mel Gorman authored and torvalds committed
mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries
Nadav Amit identified a theoretical race between page reclaim and mprotect
due to TLB flushes being batched outside of the PTL being held. He described
the race as follows:

        CPU0                            CPU1
        ----                            ----
                                        user accesses memory using RW PTE
                                        [PTE now cached in TLB]
        try_to_unmap_one()
        ==> ptep_get_and_clear()
        ==> set_tlb_ubc_flush_pending()
                                        mprotect(addr, PROT_READ)
                                        ==> change_pte_range()
                                        ==> [ PTE non-present - no flush ]

                                        user writes using cached RW PTE
        ...

        try_to_unmap_flush()

The same type of race exists for reads when protecting for PROT_NONE and
also exists for operations that can leave an old TLB entry behind, such as
munmap, mremap and madvise.

For some operations like mprotect, it is not necessarily a data integrity
issue but it is a correctness issue, as there is a window where an mprotect
that limits access still allows access. For munmap, it is potentially a data
integrity issue, although the race is unlikely to be hit because an munmap,
mmap and return to userspace must all complete in the window between reclaim
dropping the PTL and flushing the TLB. However, it is theoretically possible,
so handle this issue by flushing the mm if reclaim is potentially currently
batching TLB flushes.

Other instances where a flush is required for a present pte should be ok, as
either the page lock is held preventing parallel reclaim, or a page reference
count is elevated preventing a parallel free leading to corruption. In the
case of page_mkclean there isn't an obvious path that userspace could take
advantage of without using the operations that are guarded by this patch.
Other users such as gup, in a race with reclaim, look just at PTEs. Huge
page variants should be ok as they don't race with reclaim. mincore only
looks at PTEs. userfault also should be ok as, if a parallel reclaim takes
place, it will either fault the page back in or read some of the data before
the flush occurs, triggering a fault.

Note that a variant of this patch was acked by Andy Lutomirski, but that was
for the x86 parts on top of his PCID work, which didn't make the 4.13 merge
window as expected. His ack is dropped from this version and there will be
a follow-on patch on top of PCID that will include his ack.

[akpm@linux-foundation.org: tweak comments]
[akpm@linux-foundation.org: fix spello]
Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: <stable@vger.kernel.org> [v4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1 parent 27e37d8 commit 3ea2771
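
The snippet below is a standalone sketch of the pattern this commit introduces, not the kernel code itself: reclaim marks an mm as having a batched TLB flush pending after it clears a PTE, and the first PTL-serialised operation that could otherwise skip a flush (mprotect, munmap, mremap, madvise) performs the flush itself if the mark is still set. Only the field and function names mirror the patch; the struct, the printf "flush" and main() are invented for illustration, and atomic_signal_fence() stands in for the kernel's barrier().

/*
 * Standalone illustration of the tlb_flush_batched pattern (assumed names,
 * user-space C; NOT the kernel implementation shown in the diffs below).
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct mm_sketch {
        bool tlb_flush_batched;         /* mirrors mm->tlb_flush_batched */
};

/* Stand-in for flush_tlb_mm(); a real flush is architecture code. */
static void flush_tlb_mm_sketch(struct mm_sketch *mm)
{
        printf("TLB flushed for mm %p\n", (void *)mm);
}

/* Reclaim side: called after the PTE is cleared, before the batched flush. */
static void set_tlb_ubc_flush_pending_sketch(struct mm_sketch *mm)
{
        /* Compiler fence standing in for barrier(): keep this store ordered
         * after the PTE clear that precedes the call. */
        atomic_signal_fence(memory_order_seq_cst);
        mm->tlb_flush_batched = true;
}

/* mprotect/munmap/mremap/madvise side: called under the page-table lock. */
static void flush_tlb_batched_pending_sketch(struct mm_sketch *mm)
{
        if (mm->tlb_flush_batched) {
                flush_tlb_mm_sketch(mm);
                /* Keep the clear ordered after the flush, as the patch does. */
                atomic_signal_fence(memory_order_seq_cst);
                mm->tlb_flush_batched = false;
        }
}

int main(void)
{
        struct mm_sketch mm = { .tlb_flush_batched = false };

        set_tlb_ubc_flush_pending_sketch(&mm); /* reclaim batches a flush   */
        flush_tlb_batched_pending_sketch(&mm); /* mprotect path flushes it  */
        flush_tlb_batched_pending_sketch(&mm); /* already clear: no flush   */
        return 0;
}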

7 files changed: 48 additions, 1 deletion

include/linux/mm_types.h

Lines changed: 4 additions & 0 deletions
@@ -494,6 +494,10 @@ struct mm_struct {
         * PROT_NONE or PROT_NUMA mapped page.
         */
        bool tlb_flush_pending;
+#endif
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+       /* See flush_tlb_batched_pending() */
+       bool tlb_flush_batched;
 #endif
        struct uprobes_state uprobes_state;
 #ifdef CONFIG_HUGETLB_PAGE

mm/internal.h

Lines changed: 4 additions & 1 deletion
@@ -498,14 +498,17 @@ extern struct workqueue_struct *mm_percpu_wq;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 void try_to_unmap_flush(void);
 void try_to_unmap_flush_dirty(void);
+void flush_tlb_batched_pending(struct mm_struct *mm);
 #else
 static inline void try_to_unmap_flush(void)
 {
 }
 static inline void try_to_unmap_flush_dirty(void)
 {
 }
-
+static inline void flush_tlb_batched_pending(struct mm_struct *mm)
+{
+}
 #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */

 extern const struct trace_print_flags pageflag_names[];

mm/madvise.c

Lines changed: 1 addition & 0 deletions
@@ -320,6 +320,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,

        tlb_remove_check_page_size_change(tlb, PAGE_SIZE);
        orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+       flush_tlb_batched_pending(mm);
        arch_enter_lazy_mmu_mode();
        for (; addr != end; pte++, addr += PAGE_SIZE) {
                ptent = *pte;

mm/memory.c

Lines changed: 1 addition & 0 deletions
@@ -1197,6 +1197,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
        init_rss_vec(rss);
        start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        pte = start_pte;
+       flush_tlb_batched_pending(mm);
        arch_enter_lazy_mmu_mode();
        do {
                pte_t ptent = *pte;

mm/mprotect.c

Lines changed: 1 addition & 0 deletions
@@ -64,6 +64,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
                        atomic_read(&vma->vm_mm->mm_users) == 1)
                target_node = numa_node_id();

+       flush_tlb_batched_pending(vma->vm_mm);
        arch_enter_lazy_mmu_mode();
        do {
                oldpte = *pte;

mm/mremap.c

Lines changed: 1 addition & 0 deletions
@@ -152,6 +152,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
        new_ptl = pte_lockptr(mm, new_pmd);
        if (new_ptl != old_ptl)
                spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+       flush_tlb_batched_pending(vma->vm_mm);
        arch_enter_lazy_mmu_mode();

        for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,

mm/rmap.c

Lines changed: 36 additions & 0 deletions
@@ -604,6 +604,13 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
        arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
        tlb_ubc->flush_required = true;

+       /*
+        * Ensure compiler does not re-order the setting of tlb_flush_batched
+        * before the PTE is cleared.
+        */
+       barrier();
+       mm->tlb_flush_batched = true;
+
        /*
         * If the PTE was dirty then it's best to assume it's writable. The
         * caller must use try_to_unmap_flush_dirty() or try_to_unmap_flush()
@@ -631,6 +638,35 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)

        return should_defer;
 }
+
+/*
+ * Reclaim unmaps pages under the PTL but do not flush the TLB prior to
+ * releasing the PTL if TLB flushes are batched. It's possible for a parallel
+ * operation such as mprotect or munmap to race between reclaim unmapping
+ * the page and flushing the page. If this race occurs, it potentially allows
+ * access to data via a stale TLB entry. Tracking all mm's that have TLB
+ * batching in flight would be expensive during reclaim so instead track
+ * whether TLB batching occurred in the past and if so then do a flush here
+ * if required. This will cost one additional flush per reclaim cycle paid
+ * by the first operation at risk such as mprotect and mumap.
+ *
+ * This must be called under the PTL so that an access to tlb_flush_batched
+ * that is potentially a "reclaim vs mprotect/munmap/etc" race will synchronise
+ * via the PTL.
+ */
+void flush_tlb_batched_pending(struct mm_struct *mm)
+{
+       if (mm->tlb_flush_batched) {
+               flush_tlb_mm(mm);
+
+               /*
+                * Do not allow the compiler to re-order the clearing of
+                * tlb_flush_batched before the tlb is flushed.
+                */
+               barrier();
+               mm->tlb_flush_batched = false;
+       }
+}
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 {
