
Commit c0d0381

mjkravetz authored and torvalds committed

hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.

While discussing the issue with huge_pte_offset [1], I remembered that there were more outstanding hugetlb races. These issues are:

1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation, causing invalid global reserve counts and state.

A previous attempt was made to use i_mmap_rwsem in this manner as described at [2]. However, those patches were reverted starting with [3] due to locking issues.

To effectively use i_mmap_rwsem to address the above issues, it needs to be held (in read mode) during page fault processing. However, during fault processing we need to lock the page we will be adding. Lock ordering requires that we take the page lock before i_mmap_rwsem. Waiting until after taking the page lock is too late in the fault process for the synchronization we want to do. To address this lock ordering issue, the following patches change the lock ordering for hugetlb pages. This is not too invasive, as hugetlbfs processing is done separately from core mm in many places. However, I don't really like this idea. Much ugliness is contained in the new routine hugetlb_page_mapping_lock_write() of patch 1.

The only other way I can think of to address these issues is by catching all the races. After catching a race: clean up, back out, retry, etc., as needed. This can get really ugly, especially for huge page reservations. At one time, I started writing some of the reservation backout code for page faults, and it got so ugly and complicated that I went down the path of adding synchronization to avoid the races. Any other suggestions would be welcome.
[1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
[2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
[4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
[5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/

This patch (of 2):

While looking at BUGs associated with invalid huge page map counts, it was observed that a huge pte pointer could become 'invalid' and point to another task's page table. Consider the following:

A task takes a page fault on a shared hugetlbfs file and calls huge_pte_alloc to get a ptep. Suppose the returned ptep points to a shared pmd.

Now, another task truncates the hugetlbfs file. As part of truncation, it unmaps everyone who has the file mapped. If the range being truncated is covered by a shared pmd, huge_pmd_unshare will be called. For all but the last user of the shared pmd, huge_pmd_unshare will clear the pud pointing to the pmd. If the task in the middle of the page fault is not the last user, the ptep returned by huge_pte_alloc now points to another task's page table, or worse. This leads to bad things such as incorrect page map/reference counts or invalid memory references.

To fix, expand the use of i_mmap_rwsem as follows:

- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called. huge_pmd_share is only called via huge_pte_alloc, so callers of huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers of huge_pte_alloc continue to hold the semaphore until finished with the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.

One problem with this scheme is that it requires taking i_mmap_rwsem before taking the page lock during page faults. This is not the order specified in the rest of mm code. Handling of hugetlbfs pages is mostly isolated today.
Therefore, we use this alternative locking order for PageHuge() pages:

  mapping->i_mmap_rwsem
    hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
      page->flags PG_locked (lock_page)

To help with lock ordering issues, hugetlb_page_mapping_lock_write() is introduced to write-lock the i_mmap_rwsem associated with a page. In most cases it is easy to get the address_space via vma->vm_file->f_mapping. However, in the case of migration or memory errors for anon pages we do not have an associated vma. A new routine _get_hugetlb_page_mapping() will use anon_vma to get the address_space in these cases.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1 parent 49aef71 commit c0d0381

8 files changed: +234 −19 lines changed

fs/hugetlbfs/inode.c

Lines changed: 2 additions & 0 deletions

@@ -450,7 +450,9 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 		if (unlikely(page_mapped(page))) {
 			BUG_ON(truncate_op);
 
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			i_mmap_lock_write(mapping);
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 			hugetlb_vmdelete_list(&mapping->i_mmap,
 					index * pages_per_huge_page(h),
 					(index + 1) * pages_per_huge_page(h));

include/linux/fs.h

Lines changed: 5 additions & 0 deletions

@@ -526,6 +526,11 @@ static inline void i_mmap_lock_write(struct address_space *mapping)
 	down_write(&mapping->i_mmap_rwsem);
 }
 
+static inline int i_mmap_trylock_write(struct address_space *mapping)
+{
+	return down_write_trylock(&mapping->i_mmap_rwsem);
+}
+
 static inline void i_mmap_unlock_write(struct address_space *mapping)
 {
 	up_write(&mapping->i_mmap_rwsem);

include/linux/hugetlb.h

Lines changed: 8 additions & 0 deletions

@@ -109,6 +109,8 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
 
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
 
+struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
+
 extern int sysctl_hugetlb_shm_group;
 extern struct list_head huge_boot_pages;
 
@@ -151,6 +153,12 @@ static inline unsigned long hugetlb_total_pages(void)
 	return 0;
 }
 
+static inline struct address_space *hugetlb_page_mapping_lock_write(
+					struct page *hpage)
+{
+	return NULL;
+}
+
 static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
 				pte_t *ptep)
 {

mm/hugetlb.c

Lines changed: 145 additions & 11 deletions

@@ -1322,6 +1322,106 @@ int PageHeadHuge(struct page *page_head)
 	return get_compound_page_dtor(page_head) == free_huge_page;
 }
 
+/*
+ * Find address_space associated with hugetlbfs page.
+ * Upon entry page is locked and page 'was' mapped although mapped state
+ * could change.  If necessary, use anon_vma to find vma and associated
+ * address space.  The returned mapping may be stale, but it can not be
+ * invalid as page lock (which is held) is required to destroy mapping.
+ */
+static struct address_space *_get_hugetlb_page_mapping(struct page *hpage)
+{
+	struct anon_vma *anon_vma;
+	pgoff_t pgoff_start, pgoff_end;
+	struct anon_vma_chain *avc;
+	struct address_space *mapping = page_mapping(hpage);
+
+	/* Simple file based mapping */
+	if (mapping)
+		return mapping;
+
+	/*
+	 * Even anonymous hugetlbfs mappings are associated with an
+	 * underlying hugetlbfs file (see hugetlb_file_setup in mmap
+	 * code).  Find a vma associated with the anonymous vma, and
+	 * use the file pointer to get address_space.
+	 */
+	anon_vma = page_lock_anon_vma_read(hpage);
+	if (!anon_vma)
+		return mapping;  /* NULL */
+
+	/* Use first found vma */
+	pgoff_start = page_to_pgoff(hpage);
+	pgoff_end = pgoff_start + hpage_nr_pages(hpage) - 1;
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
+					pgoff_start, pgoff_end) {
+		struct vm_area_struct *vma = avc->vma;
+
+		mapping = vma->vm_file->f_mapping;
+		break;
+	}
+
+	anon_vma_unlock_read(anon_vma);
+	return mapping;
+}
+
+/*
+ * Find and lock address space (mapping) in write mode.
+ *
+ * Upon entry, the page is locked which allows us to find the mapping
+ * even in the case of an anon page.  However, locking order dictates
+ * the i_mmap_rwsem be acquired BEFORE the page lock.  This is hugetlbfs
+ * specific.  So, we first try to lock the sema while still holding the
+ * page lock.  If this works, great!  If not, then we need to drop the
+ * page lock and then acquire i_mmap_rwsem and reacquire page lock.  Of
+ * course, need to revalidate state along the way.
+ */
+struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage)
+{
+	struct address_space *mapping, *mapping2;
+
+	mapping = _get_hugetlb_page_mapping(hpage);
+retry:
+	if (!mapping)
+		return mapping;
+
+	/*
+	 * If no contention, take lock and return
+	 */
+	if (i_mmap_trylock_write(mapping))
+		return mapping;
+
+	/*
+	 * Must drop page lock and wait on mapping sema.
+	 * Note:  Once page lock is dropped, mapping could become invalid.
+	 * As a hack, increase map count until we lock page again.
+	 */
+	atomic_inc(&hpage->_mapcount);
+	unlock_page(hpage);
+	i_mmap_lock_write(mapping);
+	lock_page(hpage);
+	atomic_add_negative(-1, &hpage->_mapcount);
+
+	/* verify page is still mapped */
+	if (!page_mapped(hpage)) {
+		i_mmap_unlock_write(mapping);
+		return NULL;
+	}
+
+	/*
+	 * Get address space again and verify it is the same one
+	 * we locked.  If not, drop lock and retry.
+	 */
+	mapping2 = _get_hugetlb_page_mapping(hpage);
+	if (mapping2 != mapping) {
+		i_mmap_unlock_write(mapping);
+		mapping = mapping2;
+		goto retry;
+	}
+
+	return mapping;
+}
+
 pgoff_t __basepage_index(struct page *page)
 {
 	struct page *page_head = compound_head(page);

@@ -3312,6 +3412,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	int cow;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
+	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct mmu_notifier_range range;
 	int ret = 0;
 
@@ -3322,6 +3423,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 					vma->vm_start,
 					vma->vm_end);
 		mmu_notifier_invalidate_range_start(&range);
+	} else {
+		/*
+		 * For shared mappings i_mmap_rwsem must be held to call
+		 * huge_pte_alloc, otherwise the returned ptep could go
+		 * away if part of a shared pmd and another thread calls
+		 * huge_pmd_unshare.
+		 */
+		i_mmap_lock_read(mapping);
 	}
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
@@ -3399,6 +3508,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 
 	if (cow)
 		mmu_notifier_invalidate_range_end(&range);
+	else
+		i_mmap_unlock_read(mapping);
 
 	return ret;
 }
@@ -3847,13 +3958,15 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		};
 
 		/*
-		 * hugetlb_fault_mutex must be dropped before
-		 * handling userfault.  Reacquire after handling
-		 * fault to make calling code simpler.
+		 * hugetlb_fault_mutex and i_mmap_rwsem must be
+		 * dropped before handling userfault.  Reacquire
+		 * after handling fault to make calling code simpler.
		 */
 		hash = hugetlb_fault_mutex_hash(mapping, idx);
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		i_mmap_unlock_read(mapping);
 		ret = handle_userfault(&vmf, VM_UFFD_MISSING);
+		i_mmap_lock_read(mapping);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 		goto out;
 	}
@@ -4018,6 +4131,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
+		/*
+		 * Since we hold no locks, ptep could be stale.  That is
+		 * OK as we are only making decisions based on content and
+		 * not actually modifying content here.
+		 */
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
 			migration_entry_wait_huge(vma, mm, ptep);
@@ -4031,14 +4149,29 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 	}
 
+	/*
+	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
+	 * until finished with ptep.  This prevents huge_pmd_unshare from
+	 * being called elsewhere and making the ptep no longer valid.
+	 *
+	 * ptep could have already be assigned via huge_pte_offset.  That
+	 * is OK, as huge_pte_alloc will return the same value unless
+	 * something has changed.
+	 */
 	mapping = vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, vma, haddr);
+	i_mmap_lock_read(mapping);
+	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
+	if (!ptep) {
+		i_mmap_unlock_read(mapping);
+		return VM_FAULT_OOM;
+	}
 
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
+	idx = vma_hugecache_offset(h, vma, haddr);
 	hash = hugetlb_fault_mutex_hash(mapping, idx);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
@@ -4126,6 +4259,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 out_mutex:
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+	i_mmap_unlock_read(mapping);
 	/*
 	 * Generally it's safe to hold refcount during waiting page lock. But
 	 * here we just wait to defer the next page fault to avoid busy loop and
@@ -4776,10 +4910,12 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
  * !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner. pmd allocation is essential for the shared case because
- * pud has to be populated inside the same i_mmap_rwsem section - otherwise
- * racing tasks could either miss the sharing (see huge_pte_offset) or select a
- * bad pmd for sharing.
+ * code much cleaner.
+ *
+ * This routine must be called with i_mmap_rwsem held in at least read mode.
+ * For hugetlbfs, this prevents removal of any page table entries associated
+ * with the address space.  This is important as we are setting up sharing
+ * based on existing page table entries (mappings).
  */
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 {
@@ -4796,7 +4932,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
 
-	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -4826,7 +4961,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
-	i_mmap_unlock_read(mapping);
 	return pte;
 }
 
@@ -4837,7 +4971,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
  * indicated by page_count > 1, unmap is achieved by clearing pud and
  * decrementing the ref count. If count == 1, the pte page is not shared.
  *
- * called with page table lock held.
+ * Called with page table lock held and i_mmap_rwsem held in write mode.
 *
 * returns: 1 successfully unmapped a shared pte page
 *	    0 the underlying pte page is not shared, or it is the last user

mm/memory-failure.c

Lines changed: 27 additions & 2 deletions

@@ -954,7 +954,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
 	struct address_space *mapping;
 	LIST_HEAD(tokill);
-	bool unmap_success;
+	bool unmap_success = true;
 	int kill = 1, forcekill;
 	struct page *hpage = *hpagep;
 	bool mlocked = PageMlocked(hpage);
@@ -1016,7 +1016,32 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	if (kill)
 		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
 
-	unmap_success = try_to_unmap(hpage, ttu);
+	if (!PageHuge(hpage)) {
+		unmap_success = try_to_unmap(hpage, ttu);
+	} else {
+		/*
+		 * For hugetlb pages, try_to_unmap could potentially call
+		 * huge_pmd_unshare.  Because of this, take semaphore in
+		 * write mode here and set TTU_RMAP_LOCKED to indicate we
+		 * have taken the lock at this higher level.
+		 *
+		 * Note that the call to hugetlb_page_mapping_lock_write
+		 * is necessary even if mapping is already set.  It handles
+		 * ugliness of potentially having to drop page lock to obtain
+		 * i_mmap_rwsem.
+		 */
+		mapping = hugetlb_page_mapping_lock_write(hpage);
+
+		if (mapping) {
+			unmap_success = try_to_unmap(hpage,
+						     ttu|TTU_RMAP_LOCKED);
+			i_mmap_unlock_write(mapping);
+		} else {
+			pr_info("Memory failure: %#lx: could not find mapping for mapped huge page\n",
+				pfn);
+			unmap_success = false;
+		}
+	}
 	if (!unmap_success)
 		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
 		       pfn, page_mapcount(hpage));

mm/migrate.c

Lines changed: 22 additions & 3 deletions

@@ -1282,6 +1282,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	int page_was_mapped = 0;
 	struct page *new_hpage;
 	struct anon_vma *anon_vma = NULL;
+	struct address_space *mapping = NULL;
 
 	/*
 	 * Migratability of hugepages depends on architectures and their size.
@@ -1329,18 +1330,36 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 		goto put_anon;
 
 	if (page_mapped(hpage)) {
+		/*
+		 * try_to_unmap could potentially call huge_pmd_unshare.
+		 * Because of this, take semaphore in write mode here and
+		 * set TTU_RMAP_LOCKED to let lower levels know we have
+		 * taken the lock.
+		 */
+		mapping = hugetlb_page_mapping_lock_write(hpage);
+		if (unlikely(!mapping))
+			goto unlock_put_anon;
+
 		try_to_unmap(hpage,
-			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
+			TTU_RMAP_LOCKED);
 		page_was_mapped = 1;
+		/*
+		 * Leave mapping locked until after subsequent call to
+		 * remove_migration_ptes()
+		 */
 	}
 
 	if (!page_mapped(hpage))
 		rc = move_to_new_page(new_hpage, hpage, mode);
 
-	if (page_was_mapped)
+	if (page_was_mapped) {
 		remove_migration_ptes(hpage,
-			rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage, false);
+			rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage, true);
+		i_mmap_unlock_write(mapping);
+	}
 
+unlock_put_anon:
 	unlock_page(new_hpage);
 
 put_anon:
