
Commit 0a31bc9

hnaz authored and torvalds committed
mm: memcontrol: rewrite uncharge API
The memcg uncharging code that is involved towards the end of a page's lifetime - truncation, reclaim, swapout, migration - is impressively complicated and fragile.

Because anonymous and file pages were always charged before they had their page->mapping established, uncharges had to happen when the page type could still be known from the context; as in unmap for anonymous, page cache removal for file and shmem pages, and swap cache truncation for swap pages. However, these operations happen well before the page is actually freed, and so a lot of synchronization is necessary:

- Charging, uncharging, page migration, and charge migration all need to take a per-page bit spinlock as they could race with uncharging.

- Swap cache truncation happens during both swap-in and swap-out, and possibly repeatedly before the page is actually freed. This means that the memcg swapout code is called from many contexts that make no sense and it has to figure out the direction from page state to make sure memory and memory+swap are always correctly charged.

- On page migration, the old page might be unmapped but then reused, so memcg code has to prevent untimely uncharging in that case. Because this code - which should be a simple charge transfer - is so special-cased, it is not reusable for replace_page_cache().

But now that charged pages always have a page->mapping, introduce mem_cgroup_uncharge(), which is called after the final put_page(), when we know for sure that nobody is looking at the page anymore.

For page migration, introduce mem_cgroup_migrate(), which is called after the migration is successful and the new page is fully rmapped. Because the old page is no longer uncharged after migration, prevent double charges by decoupling the page's memcg association (PCG_USED and pc->mem_cgroup) from the page holding an actual charge. The new bits PCG_MEM and PCG_MEMSW represent the respective charges and are transferred to the new page during migration.

mem_cgroup_migrate() is suitable for replace_page_cache() as well, which gets rid of mem_cgroup_replace_page_cache(). However, care needs to be taken because both the source and the target page can already be charged and on the LRU when fuse is splicing: grab the page lock on the charge moving side to prevent changing pc->mem_cgroup of a page under migration. Also, the lruvecs of both pages change as we uncharge the old and charge the new during migration, and putback may race with us, so grab the lru lock and isolate the pages iff on LRU to prevent races and ensure the pages are on the right lruvec afterward.

Swap accounting is massively simplified: because the page is no longer uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry before the final put_page() in page reclaim.

Finally, page_cgroup changes are now protected by whatever protection the page itself offers: anonymous pages are charged under the page table lock, whereas page cache insertions, swapin, and migration hold the page lock. Uncharging happens under full exclusion with no outstanding references. Charging and uncharging also ensure that the page is off-LRU, which serializes against charge migration. Remove the very costly page_cgroup lock and set pc->flags non-atomically.
[mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
[vdavydov@parallels.com: fix flags definition]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Tested-by: Jet Chen <jet.chen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
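In short, the new model hooks uncharging into the final reference drop. A minimal sketch of the idea (illustrative only: the wrapper name and the free path are placeholders, not the literal kernel code):

        /*
         * Illustrative sketch of the new lifetime rule: the memcg charge
         * is dropped only once the last reference is gone, so no page-type
         * dispatch and no per-page bit spinlock are needed.
         */
        static void release_page_sketch(struct page *page)
        {
                if (put_page_testzero(page)) {          /* refcount hit zero */
                        mem_cgroup_uncharge(page);      /* single, type-agnostic hook */
                        free_hot_cold_page(page, false);
                }
        }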
1 parent 00501b5 commit 0a31bc9

16 files changed: 389 additions and 768 deletions

Documentation/cgroups/memcg_test.txt

Lines changed: 6 additions & 122 deletions
@@ -29,28 +29,13 @@ Please note that implementation details can be changed.
 2. Uncharge
 	a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

-	mem_cgroup_uncharge_page()
-	  Called when an anonymous page is fully unmapped. I.e., mapcount goes
-	  to 0. If the page is SwapCache, uncharge is delayed until
-	  mem_cgroup_uncharge_swapcache().
-
-	mem_cgroup_uncharge_cache_page()
-	  Called when a page-cache is deleted from radix-tree. If the page is
-	  SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache().
-
-	mem_cgroup_uncharge_swapcache()
-	  Called when SwapCache is removed from radix-tree. The charge itself
-	  is moved to swap_cgroup. (If mem+swap controller is disabled, no
-	  charge to swap occurs.)
+	mem_cgroup_uncharge()
+	  Called when a page's refcount goes down to 0.

 	mem_cgroup_uncharge_swap()
 	  Called when swp_entry's refcnt goes down to 0. A charge against swap
 	  disappears.

-	mem_cgroup_end_migration(old, new)
-	At success of migration old is uncharged (if necessary), a charge
-	to new page is committed. At failure, charge to old page is committed.
-
 3. charge-commit-cancel
 	Memcg pages are charged in two steps:
 		mem_cgroup_try_charge()
@@ -69,18 +54,6 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	Anonymous page is newly allocated at
 		  - page fault into MAP_ANONYMOUS mapping.
 		  - Copy-On-Write.
-	It is charged right after it's allocated before doing any page table
-	related operations. Of course, it's uncharged when another page is used
-	for the fault address.
-
-	At freeing anonymous page (by exit() or munmap()), zap_pte() is called
-	and pages for ptes are freed one by one.(see mm/memory.c). Uncharges
-	are done at page_remove_rmap() when page_mapcount() goes down to 0.
-
-	Another page freeing is by page-reclaim (vmscan.c) and anonymous
-	pages are swapped out. In this case, the page is marked as
-	PageSwapCache(). uncharge() routine doesn't uncharge the page marked
-	as SwapCache(). It's delayed until __delete_from_swap_cache().

 4.1 Swap-in.
 	At swap-in, the page is taken from swap-cache. There are 2 cases.
@@ -89,41 +62,6 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	(b) If the SwapCache has been mapped by processes, it has been
 	    charged already.

-	This swap-in is one of the most complicated work. In do_swap_page(),
-	following events occur when pte is unchanged.
-
-	(1) the page (SwapCache) is looked up.
-	(2) lock_page()
-	(3) try_charge_swapin()
-	(4) reuse_swap_page() (may call delete_swap_cache())
-	(5) commit_charge_swapin()
-	(6) swap_free().
-
-	Considering following situation for example.
-
-	(A) The page has not been charged before (2) and reuse_swap_page()
-	    doesn't call delete_from_swap_cache().
-	(B) The page has not been charged before (2) and reuse_swap_page()
-	    calls delete_from_swap_cache().
-	(C) The page has been charged before (2) and reuse_swap_page() doesn't
-	    call delete_from_swap_cache().
-	(D) The page has been charged before (2) and reuse_swap_page() calls
-	    delete_from_swap_cache().
-
-	memory.usage/memsw.usage changes to this page/swp_entry will be
-	Case          (A)      (B)      (C)      (D)
-	Event
-	Before (2)    0/ 1     0/ 1     1/ 1     1/ 1
-	===========================================
-	(3)          +1/+1    +1/+1    +1/+1    +1/+1
-	(4)           -        0/ 0     -       -1/ 0
-	(5)           0/-1     0/ 0    -1/-1     0/ 0
-	(6)           -        0/-1     -        0/-1
-	===========================================
-	Result        1/ 1     1/ 1     1/ 1     1/ 1
-
-	In any cases, charges to this page should be 1/ 1.
-
 4.2 Swap-out.
 	At swap-out, typical state transition is below.

@@ -136,28 +74,20 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	    swp_entry's refcnt -= 1.


-	At (b), the page is marked as SwapCache and not uncharged.
-	At (d), the page is removed from SwapCache and a charge in page_cgroup
-	is moved to swap_cgroup.
-
 	Finally, at task exit,
 	(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
-	Here, a charge in swap_cgroup disappears.

 5. Page Cache
 	Page Cache is charged at
 	- add_to_page_cache_locked().

-	uncharged at
-	- __remove_from_page_cache().
-
 	The logic is very clear. (About migration, see below)
 	Note: __remove_from_page_cache() is called by remove_from_page_cache()
 	and __remove_mapping().

 6. Shmem(tmpfs) Page Cache
-	Memcg's charge/uncharge have special handlers of shmem. The best way
-	to understand shmem's page state transition is to read mm/shmem.c.
+	The best way to understand shmem's page state transition is to read
+	mm/shmem.c.
 	But brief explanation of the behavior of memcg around shmem will be
 	helpful to understand the logic.

@@ -170,56 +100,10 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	It's charged when...
 	- A new page is added to shmem's radix-tree.
 	- A swp page is read. (move a charge from swap_cgroup to page_cgroup)
-	It's uncharged when
-	- A page is removed from radix-tree and not SwapCache.
-	- When SwapCache is removed, a charge is moved to swap_cgroup.
-	- When swp_entry's refcnt goes down to 0, a charge in swap_cgroup
-	  disappears.

 7. Page Migration
-	One of the most complicated functions is page-migration-handler.
-	Memcg has 2 routines. Assume that we are migrating a page's contents
-	from OLDPAGE to NEWPAGE.
-
-	Usual migration logic is..
-	(a) remove the page from LRU.
-	(b) allocate NEWPAGE (migration target)
-	(c) lock by lock_page().
-	(d) unmap all mappings.
-	(e-1) If necessary, replace entry in radix-tree.
-	(e-2) move contents of a page.
-	(f) map all mappings again.
-	(g) pushback the page to LRU.
-	(-) OLDPAGE will be freed.
-
-	Before (g), memcg should complete all necessary charge/uncharge to
-	NEWPAGE/OLDPAGE.
-
-	The point is....
-	- If OLDPAGE is anonymous, all charges will be dropped at (d) because
-	  try_to_unmap() drops all mapcount and the page will not be
-	  SwapCache.
-
-	- If OLDPAGE is SwapCache, charges will be kept at (g) because
-	  __delete_from_swap_cache() isn't called at (e-1)
-
-	- If OLDPAGE is page-cache, charges will be kept at (g) because
-	  remove_from_swap_cache() isn't called at (e-1)
-
-	memcg provides following hooks.
-
-	- mem_cgroup_prepare_migration(OLDPAGE)
-	  Called after (b) to account a charge (usage += PAGE_SIZE) against
-	  memcg which OLDPAGE belongs to.
-
-	- mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
-	  Called after (f) before (g).
-	  If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
-	  charged, a charge by prepare_migration() is automatically canceled.
-	  If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
-
-	But zap_pte() (by exit or munmap) can be called while migration,
-	we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
+
+	mem_cgroup_migrate()

 8. LRU
 	Each memcg has its own private LRU. Now, its handling is under global
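The two-hook prepare/end protocol deleted above collapses into a single call made only on success. A hedged sketch of a caller (the wrapper name and the migration_ok flag are illustrative, not a specific kernel call site):

        /*
         * Sketch: nothing to prepare, nothing to cancel. The charge moves
         * only after migration succeeded and NEWPAGE is fully rmapped.
         */
        static void migrate_memcg_sketch(struct page *oldpage,
                                         struct page *newpage,
                                         bool migration_ok)
        {
                if (migration_ok)
                        mem_cgroup_migrate(oldpage, newpage, false);
        }

lrucare is false here on the assumption that both pages are isolated, as in regular migration; replace_page_cache_page() passes true because fuse can splice pages that are already charged and on the LRU.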

include/linux/memcontrol.h

Lines changed: 15 additions & 34 deletions
@@ -60,15 +60,17 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 			      bool lrucare);
 void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);

-struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
-struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
+void mem_cgroup_uncharge(struct page *page);
+
+/* Batched uncharging */
+void mem_cgroup_uncharge_start(void);
+void mem_cgroup_uncharge_end(void);

-/* For coalescing uncharge for reducing memcg' overhead*/
-extern void mem_cgroup_uncharge_start(void);
-extern void mem_cgroup_uncharge_end(void);
+void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
+			bool lrucare);

-extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_uncharge_cache_page(struct page *page);
+struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);

 bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 				  struct mem_cgroup *memcg);
@@ -96,12 +98,6 @@ bool mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *memcg)

 extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);

-extern void
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-			     struct mem_cgroup **memcgp);
-extern void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-	struct page *oldpage, struct page *newpage, bool migration_ok);
-
 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup *,
 				   struct mem_cgroup_reclaim_cookie *);
@@ -116,8 +112,6 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
 void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
-extern void mem_cgroup_replace_page_cache(struct page *oldpage,
-					struct page *newpage);

 static inline void mem_cgroup_oom_enable(void)
 {
@@ -235,19 +229,21 @@ static inline void mem_cgroup_cancel_charge(struct page *page,
 {
 }

-static inline void mem_cgroup_uncharge_start(void)
+static inline void mem_cgroup_uncharge(struct page *page)
 {
 }

-static inline void mem_cgroup_uncharge_end(void)
+static inline void mem_cgroup_uncharge_start(void)
 {
 }

-static inline void mem_cgroup_uncharge_page(struct page *page)
+static inline void mem_cgroup_uncharge_end(void)
 {
 }

-static inline void mem_cgroup_uncharge_cache_page(struct page *page)
+static inline void mem_cgroup_migrate(struct page *oldpage,
+				      struct page *newpage,
+				      bool lrucare)
 {
 }

@@ -286,17 +282,6 @@ static inline struct cgroup_subsys_state
 	return NULL;
 }

-static inline void
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-			     struct mem_cgroup **memcgp)
-{
-}
-
-static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-	struct page *oldpage, struct page *newpage, bool migration_ok)
-{
-}
-
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -392,10 +377,6 @@ static inline
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
-static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
-				struct page *newpage)
-{
-}
 #endif /* CONFIG_MEMCG */

 #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
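The retained uncharge_start/end pair brackets bulk frees so the per-page uncharges coalesce. A minimal usage sketch (the list walk is illustrative, not a specific kernel call site):

        /* Sketch: batch the memcg bookkeeping across many final puts. */
        static void release_list_sketch(struct list_head *pages)
        {
                struct page *page, *next;

                mem_cgroup_uncharge_start();
                list_for_each_entry_safe(page, next, pages, lru) {
                        list_del(&page->lru);
                        put_page(page); /* final put ends in mem_cgroup_uncharge() */
                }
                mem_cgroup_uncharge_end();
        }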

include/linux/page_cgroup.h

Lines changed: 5 additions & 38 deletions
@@ -3,9 +3,9 @@

 enum {
 	/* flags for mem_cgroup */
-	PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
-	PCG_USED, /* this object is in use. */
-	PCG_MIGRATION, /* under page migration */
+	PCG_USED = 0x01,	/* This page is charged to a memcg */
+	PCG_MEM = 0x02,		/* This page holds a memory charge */
+	PCG_MEMSW = 0x04,	/* This page holds a memory+swap charge */
 	__NR_PCG_FLAGS,
 };

@@ -44,42 +44,9 @@ static inline void __init page_cgroup_init(void)
 struct page_cgroup *lookup_page_cgroup(struct page *page);
 struct page *lookup_cgroup_page(struct page_cgroup *pc);

-#define TESTPCGFLAG(uname, lname)					\
-static inline int PageCgroup##uname(struct page_cgroup *pc)		\
-	{ return test_bit(PCG_##lname, &pc->flags); }
-
-#define SETPCGFLAG(uname, lname)					\
-static inline void SetPageCgroup##uname(struct page_cgroup *pc)	\
-	{ set_bit(PCG_##lname, &pc->flags); }
-
-#define CLEARPCGFLAG(uname, lname)					\
-static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ clear_bit(PCG_##lname, &pc->flags); }
-
-#define TESTCLEARPCGFLAG(uname, lname)					\
-static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
-
-TESTPCGFLAG(Used, USED)
-CLEARPCGFLAG(Used, USED)
-SETPCGFLAG(Used, USED)
-
-SETPCGFLAG(Migration, MIGRATION)
-CLEARPCGFLAG(Migration, MIGRATION)
-TESTPCGFLAG(Migration, MIGRATION)
-
-static inline void lock_page_cgroup(struct page_cgroup *pc)
-{
-	/*
-	 * Don't take this lock in IRQ context.
-	 * This lock is for pc->mem_cgroup, USED, MIGRATION
-	 */
-	bit_spin_lock(PCG_LOCK, &pc->flags);
-}
-
-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline int PageCgroupUsed(struct page_cgroup *pc)
 {
-	bit_spin_unlock(PCG_LOCK, &pc->flags);
+	return !!(pc->flags & PCG_USED);
 }

 #else /* CONFIG_MEMCG */
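With PCG_LOCK gone, the flags turn from atomic bit numbers into plain masks read and written under the page's own protection. A sketch of what commit-time flag setting can look like under the new scheme (the helper and its memsw parameter are illustrative):

        /*
         * Sketch: one plain store replaces lock_page_cgroup() plus several
         * atomic bitops; safe because charging holds the page table lock or
         * the page lock, and the page is off-LRU at this point.
         */
        static void commit_charge_flags_sketch(struct page_cgroup *pc, bool memsw)
        {
                pc->flags = PCG_USED | PCG_MEM | (memsw ? PCG_MEMSW : 0);
        }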

include/linux/swap.h

Lines changed: 8 additions & 4 deletions
@@ -381,9 +381,13 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
 }
 #endif
 #ifdef CONFIG_MEMCG_SWAP
-extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
+extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
+extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
 #else
-static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
+static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
+{
+}
+static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
 {
 }
 #endif
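Per the commit message, reclaim now transfers the memory+swap charge to the swap entry just before the swap cache entry is dropped. A simplified sketch of that ordering (the helper is illustrative; the tree_lock is assumed to be taken by the caller, as in __remove_mapping()):

        /*
         * Sketch of the reclaim-side sequence: move the memsw charge to
         * the swap_cgroup, delete the swap cache entry, then free it with
         * the new page-less swapcache_free().
         */
        static void swapout_charge_sketch(struct page *page,
                                          struct address_space *mapping)
        {
                swp_entry_t swap = { .val = page_private(page) };

                mem_cgroup_swapout(page, swap);
                __delete_from_swap_cache(page);
                spin_unlock_irq(&mapping->tree_lock);   /* taken by caller */
                swapcache_free(swap);
        }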
@@ -443,7 +447,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t, struct page *page);
+extern void swapcache_free(swp_entry_t);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
@@ -507,7 +511,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }

-static inline void swapcache_free(swp_entry_t swp, struct page *page)
+static inline void swapcache_free(swp_entry_t swp)
 {
 }

mm/filemap.c

Lines changed: 1 addition & 3 deletions
@@ -234,7 +234,6 @@ void delete_from_page_cache(struct page *page)
 	spin_lock_irq(&mapping->tree_lock);
 	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);

 	if (freepage)
 		freepage(page);
@@ -490,8 +489,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		if (PageSwapBacked(new))
 			__inc_zone_page_state(new, NR_SHMEM);
 		spin_unlock_irq(&mapping->tree_lock);
-		/* mem_cgroup codes must not be called under tree_lock */
-		mem_cgroup_replace_page_cache(old, new);
+		mem_cgroup_migrate(old, new, true);
 		radix_tree_preload_end();
 		if (freepage)
 			freepage(old);
