Commit 9a6ebbd

jankara authored and brauner committed
writeback: Avoid excessively long inode switching times
With lazytime mount option enabled we can be switching many dirty inodes on cgroup exit to the parent cgroup. The numbers observed in practice when systemd slice of a large cron job exits can easily reach hundreds of thousands or millions. The logic in inode_do_switch_wbs() which sorts the inode into appropriate place in b_dirty list of the target wb however has linear complexity in the number of dirty inodes thus overall time complexity of switching all the inodes is quadratic leading to workers being pegged for hours consuming 100% of the CPU and switching inodes to the parent wb.

Simple reproducer of the issue:

    FILES=10000
    # Filesystem mounted with lazytime mount option
    MNT=/mnt/
    echo "Creating files and switching timestamps"
    for (( j = 0; j < 50; j ++ )); do
        mkdir $MNT/dir$j
        for (( i = 0; i < $FILES; i++ )); do
            echo "foo" >$MNT/dir$j/file$i
        done
        touch -a -t 202501010000 $MNT/dir$j/file*
    done
    wait
    echo "Syncing and flushing"
    sync
    echo 3 >/proc/sys/vm/drop_caches

    echo "Reading all files from a cgroup"
    mkdir /sys/fs/cgroup/unified/mycg1 || exit
    echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit

    for (( j = 0; j < 50; j ++ )); do
        cat /mnt/dir$j/file* >/dev/null &
    done
    wait
    echo "Switching wbs"
    # Now rmdir the cgroup after the script exits

We need to maintain b_dirty list ordering to keep writeback happy so instead of sorting inode into appropriate place just append it at the end of the list and clobber dirtied_time_when. This may result in inode writeback starting later after cgroup switch however cgroup switches are rare so it shouldn't matter much. Since the cgroup had write access to the inode, there are no practical concerns of the possible DoS issues.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
1 parent 66c14dc commit 9a6ebbd

1 file changed

+11
-10
lines changed

fs/fs-writeback.c

Lines changed: 11 additions & 10 deletions
@@ -445,22 +445,23 @@ static bool inode_do_switch_wbs(struct inode *inode,
 	 * Transfer to @new_wb's IO list if necessary. If the @inode is dirty,
 	 * the specific list @inode was on is ignored and the @inode is put on
 	 * ->b_dirty which is always correct including from ->b_dirty_time.
-	 * The transfer preserves @inode->dirtied_when ordering. If the @inode
-	 * was clean, it means it was on the b_attached list, so move it onto
-	 * the b_attached list of @new_wb.
+	 * If the @inode was clean, it means it was on the b_attached list, so
+	 * move it onto the b_attached list of @new_wb.
 	 */
 	if (!list_empty(&inode->i_io_list)) {
 		inode->i_wb = new_wb;
 
 		if (inode->i_state & I_DIRTY_ALL) {
-			struct inode *pos;
-
-			list_for_each_entry(pos, &new_wb->b_dirty, i_io_list)
-				if (time_after_eq(inode->dirtied_when,
-						  pos->dirtied_when))
-					break;
+			/*
+			 * We need to keep b_dirty list sorted by
+			 * dirtied_time_when. However properly sorting the
+			 * inode in the list gets too expensive when switching
+			 * many inodes. So just attach inode at the end of the
+			 * dirty list and clobber the dirtied_time_when.
+			 */
+			inode->dirtied_time_when = jiffies;
 			inode_io_list_move_locked(inode, new_wb,
-						  pos->i_io_list.prev);
+						  &new_wb->b_dirty);
 		} else {
 			inode_cgwb_move_to_attached(inode, new_wb);
 		}
