
Commit 79dfdac

Michal Hocko authored and torvalds committed
memcg: make oom_lock 0 and 1 based rather than counter
Commit 867578c ("memcg: fix oom kill behavior") introduced an oom_lock counter which is incremented by mem_cgroup_oom_lock when we are about to handle a memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if oom_lock > 1 to prevent multiple OOM kills at the same time. The counter is then decremented by mem_cgroup_oom_unlock called from the same function.

This works correctly, but it can lead to serious starvation when we have many processes triggering OOM and many CPUs available for them (I have tested with 16 CPUs).

Consider a process (call it A) which gets the oom_lock (the first one that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other processes that are blocked on the mutex. While A releases the mutex and calls mem_cgroup_out_of_memory, the others will wake up (one after another), increase the counter and fall into sleep (memcg_oom_waitq). Once A finishes mem_cgroup_out_of_memory it takes the mutex again, decreases oom_lock and wakes other tasks (if releasing memory by somebody else - e.g. the killed process - hasn't done it yet).

A testcase would look like this (assume malloc XXX is a program allocating XXX megabytes of memory which touches all allocated pages in a tight loop; a minimal sketch of such a helper is given below):

  # swapoff SWAP_DEVICE
  # cgcreate -g memory:A
  # cgset -r memory.oom_control=0 A
  # cgset -r memory.limit_in_bytes=200M A
  # for i in `seq 100`
  # do
  #     cgexec -g memory:A malloc 10 &
  # done

The main problem here is that all processes still race for the mutex and there is no guarantee that the counter gets back to 0 for those that got back to mem_cgroup_handle_oom. In the end the whole convoy in/decreases the counter but we never get to 1, which would enable killing, so nothing useful can be done. The time is basically unbounded because it highly depends on scheduling and ordering on the mutex (I have seen this take hours...).

This patch replaces the counter by a simple {un}lock semantic. As mem_cgroup_oom_{un}lock works on a subtree of a hierarchy, we have to make sure that nobody else races with us, which is guaranteed by the memcg_oom_mutex.

We have to be careful while locking subtrees because we can encounter a subtree which is already locked:

  hierarchy:
            A
           / \
          B   \
         / \   \
        C   D   E

The B - C - D tree might already be locked. While we want to enable locking the E subtree, because OOM situations cannot influence each other, we definitely do not want to allow locking A. Therefore we have to refuse the lock if any subtree is already locked and clear up the lock for all nodes that have been set up, up to the failure point.

On the other hand, we have to make sure that the rest of the world will recognize that a group is under OOM even though it doesn't hold the lock. Therefore we introduce an under_oom variable which is incremented and decremented for the whole subtree when we enter and leave mem_cgroup_handle_oom respectively. under_oom, unlike oom_lock, doesn't need to be updated under memcg_oom_mutex because its users only check a single group and they use atomic operations for that.
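The "malloc" helper referenced in the testcases above and below is not part of the kernel tree; the commit message only describes its behaviour. A minimal userspace sketch matching that description (the program name, argument handling and the 4 KiB touch stride are assumptions) might look like:

/*
 * malloc.c - hypothetical test helper: allocate <megabytes> MB and keep
 * touching every page in a tight loop so the memory stays resident and
 * charged to the calling memcg.
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        size_t mb = argc > 1 ? strtoul(argv[1], NULL, 10) : 1;
        size_t size = mb << 20;
        char *buf = malloc(size);
        size_t i;

        if (!buf) {
                perror("malloc");
                return 1;
        }
        for (;;)                /* touch all pages in a tight loop */
                for (i = 0; i < size; i += 4096)
                        buf[i] = 1;
        return 0;
}

Built as "malloc", it would then be started via cgexec exactly as in the testcases.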
This can be checked easily by the following test case:

  # cgcreate -g memory:A
  # cgset -r memory.use_hierarchy=1 A
  # cgset -r memory.oom_control=1 A
  # cgset -r memory.limit_in_bytes=100M A
  # cgset -r memory.memsw.limit_in_bytes=100M A
  # cgcreate -g memory:A/B
  # cgset -r memory.oom_control=1 A/B
  # cgset -r memory.limit_in_bytes=20M A/B
  # cgset -r memory.memsw.limit_in_bytes=20M A/B
  # cgexec -g memory:A/B malloc 30 &    # -> this will be blocked by OOM of group B
  # cgexec -g memory:A   malloc 80 &    # -> this will be blocked by OOM of group A

While B gets the oom_lock, A will not get it. Both of them go into sleep and wait for an external action. We can make the limit higher for A to force waking it up:

  # cgset -r memory.memsw.limit_in_bytes=300M A
  # cgset -r memory.limit_in_bytes=300M A

malloc in A has to wake up even though it doesn't hold the oom_lock.

Finally, the unlock path is very easy because we always unlock only the subtree we have locked previously, while we always decrement under_oom.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
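Both groups in the test above run with memory.oom_control=1, i.e. the OOM killer is disabled, so the "external action" normally comes from a user-space handler that learns about the OOM through the cgroup v1 eventfd notification interface, whose registration side is the mem_cgroup_oom_register_event() path touched by this patch. A hedged sketch of such a watcher, assuming the memory controller is mounted at /sys/fs/cgroup/memory and the group is called A:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
        char buf[64];
        uint64_t events;
        int efd = eventfd(0, 0);
        int ofd = open("/sys/fs/cgroup/memory/A/memory.oom_control", O_RDONLY);
        int cfd = open("/sys/fs/cgroup/memory/A/cgroup.event_control", O_WRONLY);

        if (efd < 0 || ofd < 0 || cfd < 0) {
                perror("open/eventfd");
                return 1;
        }
        /* register "<eventfd fd> <memory.oom_control fd>" for OOM events */
        snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
        if (write(cfd, buf, strlen(buf)) < 0) {
                perror("write cgroup.event_control");
                return 1;
        }
        /*
         * Blocks until the group hits OOM; per the register_event hunk below,
         * it is signalled immediately if the group is already under_oom.
         */
        if (read(efd, &events, sizeof(events)) == sizeof(events))
                printf("group A hit OOM %llu time(s)\n",
                       (unsigned long long)events);
        return 0;
}

Such a handler would typically raise the limits (as with the cgset commands above) or kill a task by hand once the notification arrives.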
1 parent bb2a0de commit 79dfdac


mm/memcontrol.c

Lines changed: 70 additions & 16 deletions
@@ -246,7 +246,10 @@ struct mem_cgroup {
          * Should the accounting and control be hierarchical, per subtree?
          */
         bool use_hierarchy;
-        atomic_t oom_lock;
+
+        bool oom_lock;
+        atomic_t under_oom;
+
         atomic_t refcnt;
 
         int swappiness;
@@ -1722,37 +1725,83 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 /*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
+ * Has to be called with memcg_oom_mutex
  */
 static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
 {
-        int x, lock_count = 0;
-        struct mem_cgroup *iter;
+        int lock_count = -1;
+        struct mem_cgroup *iter, *failed = NULL;
+        bool cond = true;
 
-        for_each_mem_cgroup_tree(iter, mem) {
-                x = atomic_inc_return(&iter->oom_lock);
-                lock_count = max(x, lock_count);
+        for_each_mem_cgroup_tree_cond(iter, mem, cond) {
+                bool locked = iter->oom_lock;
+
+                iter->oom_lock = true;
+                if (lock_count == -1)
+                        lock_count = iter->oom_lock;
+                else if (lock_count != locked) {
+                        /*
+                         * this subtree of our hierarchy is already locked
+                         * so we cannot give a lock.
+                         */
+                        lock_count = 0;
+                        failed = iter;
+                        cond = false;
+                }
         }
 
-        if (lock_count == 1)
-                return true;
-        return false;
+        if (!failed)
+                goto done;
+
+        /*
+         * OK, we failed to lock the whole subtree so we have to clean up
+         * what we set up to the failing subtree
+         */
+        cond = true;
+        for_each_mem_cgroup_tree_cond(iter, mem, cond) {
+                if (iter == failed) {
+                        cond = false;
+                        continue;
+                }
+                iter->oom_lock = false;
+        }
+done:
+        return lock_count;
 }
 
+/*
+ * Has to be called with memcg_oom_mutex
+ */
 static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
 {
         struct mem_cgroup *iter;
 
+        for_each_mem_cgroup_tree(iter, mem)
+                iter->oom_lock = false;
+        return 0;
+}
+
+static void mem_cgroup_mark_under_oom(struct mem_cgroup *mem)
+{
+        struct mem_cgroup *iter;
+
+        for_each_mem_cgroup_tree(iter, mem)
+                atomic_inc(&iter->under_oom);
+}
+
+static void mem_cgroup_unmark_under_oom(struct mem_cgroup *mem)
+{
+        struct mem_cgroup *iter;
+
         /*
          * When a new child is created while the hierarchy is under oom,
          * mem_cgroup_oom_lock() may not be called. We have to use
         * atomic_add_unless() here.
          */
         for_each_mem_cgroup_tree(iter, mem)
-                atomic_add_unless(&iter->oom_lock, -1, 0);
-        return 0;
+                atomic_add_unless(&iter->under_oom, -1, 0);
 }
 
-
 static DEFINE_MUTEX(memcg_oom_mutex);
 static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);

@@ -1794,7 +1843,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem)
 
 static void memcg_oom_recover(struct mem_cgroup *mem)
 {
-        if (mem && atomic_read(&mem->oom_lock))
+        if (mem && atomic_read(&mem->under_oom))
                 memcg_wakeup_oom(mem);
 }

@@ -1812,6 +1861,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
         owait.wait.private = current;
         INIT_LIST_HEAD(&owait.wait.task_list);
         need_to_kill = true;
+        mem_cgroup_mark_under_oom(mem);
+
         /* At first, try to OOM lock hierarchy under mem.*/
         mutex_lock(&memcg_oom_mutex);
         locked = mem_cgroup_oom_lock(mem);
@@ -1835,10 +1886,13 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
                 finish_wait(&memcg_oom_waitq, &owait.wait);
         }
         mutex_lock(&memcg_oom_mutex);
-        mem_cgroup_oom_unlock(mem);
+        if (locked)
+                mem_cgroup_oom_unlock(mem);
         memcg_wakeup_oom(mem);
         mutex_unlock(&memcg_oom_mutex);
 
+        mem_cgroup_unmark_under_oom(mem);
+
         if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
                 return false;
         /* Give chance to dying process */
@@ -4505,7 +4559,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
         list_add(&event->list, &memcg->oom_notify);
 
         /* already in OOM ? */
-        if (atomic_read(&memcg->oom_lock))
+        if (atomic_read(&memcg->under_oom))
                 eventfd_signal(eventfd, 1);
         mutex_unlock(&memcg_oom_mutex);

@@ -4540,7 +4594,7 @@ static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 
         cb->fill(cb, "oom_kill_disable", mem->oom_kill_disable);
 
-        if (atomic_read(&mem->oom_lock))
+        if (atomic_read(&mem->under_oom))
                 cb->fill(cb, "under_oom", 1);
         else
                 cb->fill(cb, "under_oom", 0);
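The last two hunks are what make the new state visible to userspace: the immediate eventfd signal on registration and the under_oom field of memory.oom_control are now driven by the per-group under_oom counter rather than by oom_lock. A hedged sketch of reading that file (the cgroup mount point and group name are again assumptions):

#include <stdio.h>

int main(void)
{
        char key[32];
        int val;
        FILE *f = fopen("/sys/fs/cgroup/memory/A/memory.oom_control", "r");

        if (!f) {
                perror("fopen");
                return 1;
        }
        /* the file contains "oom_kill_disable <0|1>" and "under_oom <0|1>" */
        while (fscanf(f, "%31s %d", key, &val) == 2)
                printf("%s = %d\n", key, val);
        fclose(f);
        return 0;
}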
