bmastbergen

Background

The original intent here was to backport c2dbe32 to address CVE-2023-52707. That requires prerequisite commit acf21dc, which adds the function wake_up_pollfree(). The prerequisite landed in mainline, and was backported to many long-term (LT) kernels, as part of a 5-commit series, which itself includes commit f2fe402 to address CVE-2021-47505. I've gone ahead and added all 5 of those commits, even though technically not all of them are build-time or run-time prerequisites for the CVE fixes at hand. Based on their commit messages they are real fixes for real problems, so it seems worthwhile to have them.

Commits

    wait: add wake_up_pollfree()

    jira VULN-63551
    cve-pre CVE-2021-47505
    commit-author Eric Biggers <ebiggers@google.com>
    commit 42288cb44c4b5fff7653bc392b583a2b8bd6a8c0

    binder: use wake_up_pollfree()

    jira VULN-63551
    cve-pre CVE-2021-47505
    commit-author Eric Biggers <ebiggers@google.com>
    commit a880b28a71e39013e357fd3adccd1d8a31bc69a8

    signalfd: use wake_up_pollfree()

    jira VULN-63551
    cve-pre CVE-2021-47505
    commit-author Eric Biggers <ebiggers@google.com>
    commit 9537bae0da1f8d1e2361ab6d0479e8af7824e160

    aio: keep poll requests on waitqueue until completed

    jira VULN-63551
    cve-pre CVE-2021-47505
    commit-author Eric Biggers <ebiggers@google.com>
    commit 363bee27e25804d8981dd1c025b4ad49dc39c530

    aio: fix use-after-free due to missing POLLFREE handling

    jira VULN-63551
    cve CVE-2021-47505
    commit-author Eric Biggers <ebiggers@google.com>
    commit 50252e4b5e989ce64555c7aef7516bdefc2fea72

    sched/psi: Fix use-after-free in ep_remove_wait_queue()

    jira VULN-4566
    cve CVE-2023-52707
    commit-author Munehisa Kamata <kamatam@amazon.com>
    commit c2dbe32d5db5c4ead121cf86dabd5ab691fb47fe

Build Log

/home/brett/kernel-src-tree
Running make mrproper...
[TIMER]{MRPROPER}: 12s
x86_64 architecture detected, copying config
'configs/kernel-x86_64-rhel.config' -> '.config'
Setting Local Version for build
CONFIG_LOCALVERSION="-bmastbergen_ciqlts9_2_many-vulns-10-1-25-8dd024108c06"
Making olddefconfig
--
  HOSTCC  scripts/kconfig/util.o
  HOSTLD  scripts/kconfig/conf
#
# configuration written to .config
#
Starting Build
  SYSHDR  arch/x86/include/generated/uapi/asm/unistd_32.h
  SYSHDR  arch/x86/include/generated/uapi/asm/unistd_64.h
  SYSHDR  arch/x86/include/generated/uapi/asm/unistd_x32.h
  SYSTBL  arch/x86/include/generated/asm/syscalls_32.h
  SYSHDR  arch/x86/include/generated/asm/unistd_64_x32.h
--
  BTF [M] sound/virtio/virtio_snd.ko
  LD [M]  sound/xen/snd_xen_front.ko
  LD [M]  virt/lib/irqbypass.ko
  BTF [M] sound/xen/snd_xen_front.ko
  BTF [M] virt/lib/irqbypass.ko
[TIMER]{BUILD}: 932s
Making Modules
  INSTALL /lib/modules/5.14.0-bmastbergen_ciqlts9_2_many-vulns-10-1-25-8dd024108c06+/kernel/arch/x86/crypto/blowfish-x86_64.ko
  INSTALL /lib/modules/5.14.0-bmastbergen_ciqlts9_2_many-vulns-10-1-25-8dd024108c06+/kernel/arch/x86/crypto/blake2s-x86_64.ko
  INSTALL /lib/modules/5.14.0-bmastbergen_ciqlts9_2_many-vulns-10-1-25-8dd024108c06+/kernel/arch/x86/crypto/camellia-aesni-avx-x86_64.ko
  INSTALL /lib/modules/5.14.0-bmastbergen_ciqlts9_2_many-vulns-10-1-25-8dd024108c06+/kernel/arch/x86/crypto/camellia-aesni-avx2.ko
--
  STRIP   /lib/modules/5.14.0-bmastbergen_ciqlts9_2_many-vulns-10-1-25-8dd024108c06+/kernel/virt/lib/irqbypass.ko
  SIGN    /lib/modules/5.14.0-bmastbergen_ciqlts9_2_many-vulns-10-1-25-8dd024108c06+/kernel/sound/x86/snd-hdmi-lpe-audio.ko
  SIGN    /lib/modules/5.14.0-bmastbergen_ciqlts9_2_many-vulns-10-1-25-8dd024108c06+/kernel/virt/lib/irqbypass.ko
  SIGN    /lib/modules/5.14.0-bmastbergen_ciqlts9_2_many-vulns-10-1-25-8dd024108c06+/kernel/sound/xen/snd_xen_front.ko
  DEPMOD  /lib/modules/5.14.0-bmastbergen_ciqlts9_2_many-vulns-10-1-25-8dd024108c06+
[TIMER]{MODULES}: 8s
Making Install
sh ./arch/x86/boot/install.sh \
	5.14.0-bmastbergen_ciqlts9_2_many-vulns-10-1-25-8dd024108c06+ arch/x86/boot/bzImage \
	System.map "/boot"
[TIMER]{INSTALL}: 57s
Checking kABI
kABI check passed
Setting Default Kernel to /boot/vmlinuz-5.14.0-bmastbergen_ciqlts9_2_many-vulns-10-1-25-8dd024108c06+ and Index to 2
Hopefully Grub2.0 took everything ... rebooting after time metrices
[TIMER]{MRPROPER}: 12s
[TIMER]{BUILD}: 932s
[TIMER]{MODULES}: 8s
[TIMER]{INSTALL}: 57s
[TIMER]{TOTAL} 1028s
Rebooting in 10 seconds

Testing

selftest-5.14.0-284.30.1.el9_2.92ciq_lts.11.1.x86_64-1.log

selftest-5.14.0-bmastbergen_ciqlts9_2_many-vulns-10-1-25-8dd024108c06+-1.log

brett@lycia ~/ciq/many-92-vulns-10-1-25
 % grep ^ok selftest-5.14.0-284.30.1.el9_2.92ciq_lts.11.1.x86_64-1.log | wc -l
330
brett@lycia ~/ciq/many-92-vulns-10-1-25
 % grep ^ok selftest-5.14.0-bmastbergen_ciqlts9_2_many-vulns-10-1-25-8dd024108c06+-1.log | wc -l
330
brett@lycia ~/ciq/many-92-vulns-10-1-25
 % grep ok <(diff -adU0 <(grep ^ok selftest-5.14.0-284.30.1.el9_2.92ciq_lts.11.1.x86_64-1.log | sort -h) <(grep ^ok selftest-5.14.0-bmastbergen_ciqlts9_2_many-vulns-10-1-25-8dd024108c06+-1.log | sort -h))
-ok 1 selftests: livepatch: test-livepatch.sh # SKIP
+ok 1 selftests: livepatch: test-livepatch.sh
-ok 1 selftests: zram: zram.sh # SKIP
+ok 1 selftests: zram: zram.sh
-ok 2 selftests: cgroup: test_kmem
-ok 2 selftests: livepatch: test-callbacks.sh # SKIP
+ok 2 selftests: livepatch: test-callbacks.sh
+ok 32 selftests: net: l2tp.sh
-ok 3 selftests: cgroup: test_core
-ok 3 selftests: livepatch: test-shadow-vars.sh # SKIP
+ok 3 selftests: livepatch: test-shadow-vars.sh
-ok 46 selftests: net: devlink_port_split.py # SKIP
+ok 46 selftests: net: devlink_port_split.py
+ok 47 selftests: net: drop_monitor_tests.sh
-ok 4 selftests: livepatch: test-state.sh # SKIP
+ok 4 selftests: livepatch: test-state.sh
-ok 53 selftests: net: gro.sh
-ok 5 selftests: livepatch: test-ftrace.sh # SKIP
+ok 5 selftests: livepatch: test-ftrace.sh
-ok 6 selftests: cgroup: test_stress.sh
+ok 6 selftests: net: tls
+ok 9 selftests: net: test_bpf.sh
brett@lycia ~/ciq/many-92-vulns-10-1-25
 %

jira VULN-63551
cve-pre CVE-2021-47505
commit-author Eric Biggers <ebiggers@google.com>
commit 42288cb

Several ->poll() implementations are special in that they use a
waitqueue whose lifetime is the current task, rather than the struct
file as is normally the case.  This is okay for blocking polls, since a
blocking poll occurs within one task; however, non-blocking polls
require another solution.  This solution is for the queue to be cleared
before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.

However, that has a bug: wake_up_poll() calls __wake_up() with
nr_exclusive=1.  Therefore, if there are multiple "exclusive" waiters,
and the wakeup function for the first one returns a positive value, only
that one will be called.  That's *not* what's needed for POLLFREE;
POLLFREE is special in that it really needs to wake up everyone.
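The early stop described above can be sketched with a small userspace model. This is not kernel code: `model_waiter`, `model_wake_up()` and `MODEL_WQ_FLAG_EXCLUSIVE` are hypothetical names invented for illustration, and the loop only mirrors the rule stated in the commit message (stop once `nr_exclusive` exclusive waiters have returned a positive value).

```c
#include <stddef.h>

/* Userspace model only; these names are illustrative, not kernel API. */
#define MODEL_WQ_FLAG_EXCLUSIVE 0x01

struct model_waiter {
    unsigned int flags; /* MODEL_WQ_FLAG_EXCLUSIVE or 0 */
    int ret;            /* value this waiter's wakeup callback returns */
    int woken;          /* set to 1 once the callback has been called */
};

/*
 * Call each waiter's callback in queue order. Stop early once
 * nr_exclusive exclusive waiters have returned a positive value.
 * Passing nr_exclusive == 0 means the walk never stops early.
 * Returns the number of callbacks that were called.
 */
static int model_wake_up(struct model_waiter *q, size_t n, int nr_exclusive)
{
    int called = 0;

    for (size_t i = 0; i < n; i++) {
        q[i].woken = 1;
        called++;
        if (q[i].ret > 0 && (q[i].flags & MODEL_WQ_FLAG_EXCLUSIVE) &&
            --nr_exclusive == 0)
            break;
    }
    return called;
}
```

With three exclusive waiters that each return 1, `model_wake_up(q, 3, 1)` calls only the first one, while `model_wake_up(q, 3, 0)` reaches all three, which is the behaviour POLLFREE needs.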

Considering the three non-blocking poll systems:

- io_uring poll doesn't handle POLLFREE at all, so it is broken anyway.

- aio poll is unaffected, since it doesn't support exclusive waits.
  However, that's fragile, as someone could add this feature later.

- epoll doesn't appear to be broken by this, since its wakeup function
  returns 0 when it sees POLLFREE.  But this is fragile.

Although there is a workaround (see epoll), it's better to define a
function which always sends POLLFREE to all waiters.  Add such a
function.  Also make it verify that the queue really becomes empty after
all waiters have been woken up.
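As a rough illustration of the "wake everyone, then verify the queue is empty" idea, here is a userspace sketch. The types and the function `model_wake_up_pollfree()` are hypothetical stand-ins, and the assumption that a well-behaved waiter unlinks itself when it sees POLLFREE is part of the model rather than a statement about any particular kernel implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; these names are illustrative, not kernel API. */
struct model_entry {
    struct model_entry *next;
    bool handles_pollfree; /* does this waiter unlink itself on POLLFREE? */
};

struct model_queue {
    struct model_entry *head;
};

/* Push an entry onto the front of the queue. */
static void model_add(struct model_queue *q, struct model_entry *e)
{
    e->next = q->head;
    q->head = e;
}

/*
 * Deliver POLLFREE to every entry with no early stop. Entries that
 * handle it are unlinked; entries that ignore it stay queued.
 * Returns true if the queue is empty afterwards, which is the
 * condition the commit message says the new function verifies.
 */
static bool model_wake_up_pollfree(struct model_queue *q)
{
    struct model_entry **link = &q->head;

    while (*link) {
        struct model_entry *e = *link;

        if (e->handles_pollfree) {
            *link = e->next; /* waiter removes itself from the queue */
            e->next = NULL;
        } else {
            link = &e->next; /* waiter ignored POLLFREE and stays */
        }
    }
    return q->head == NULL;
}
```

In this model a `false` return marks a waiter that would still point at the queue after it is freed, which is the situation the emptiness check is meant to flag.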

	Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
	Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211209010455.42744-2-ebiggers@kernel.org
	Signed-off-by: Eric Biggers <ebiggers@google.com>
(cherry picked from commit 42288cb)
	Signed-off-by: Brett Mastbergen <bmastbergen@ciq.com>

jira VULN-63551
cve-pre CVE-2021-47505
commit-author Eric Biggers <ebiggers@google.com>
commit a880b28

wake_up_poll() uses nr_exclusive=1, so it's not guaranteed to wake up
all exclusive waiters.  Yet, POLLFREE *must* wake up all waiters.  epoll
and aio poll are fortunately not affected by this, but it's very
fragile.  Thus, the new function wake_up_pollfree() has been introduced.

Convert binder to use wake_up_pollfree().

	Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: f5cb779 ("ANDROID: binder: remove waitqueue when thread exits.")
	Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211209010455.42744-3-ebiggers@kernel.org
	Signed-off-by: Eric Biggers <ebiggers@google.com>
(cherry picked from commit a880b28)
	Signed-off-by: Brett Mastbergen <bmastbergen@ciq.com>

jira VULN-63551
cve-pre CVE-2021-47505
commit-author Eric Biggers <ebiggers@google.com>
commit 9537bae

wake_up_poll() uses nr_exclusive=1, so it's not guaranteed to wake up
all exclusive waiters.  Yet, POLLFREE *must* wake up all waiters.  epoll
and aio poll are fortunately not affected by this, but it's very
fragile.  Thus, the new function wake_up_pollfree() has been introduced.

Convert signalfd to use wake_up_pollfree().

	Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: d80e731 ("epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()")
	Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211209010455.42744-4-ebiggers@kernel.org
	Signed-off-by: Eric Biggers <ebiggers@google.com>
(cherry picked from commit 9537bae)
	Signed-off-by: Brett Mastbergen <bmastbergen@ciq.com>

jira VULN-63551
cve-pre CVE-2021-47505
commit-author Eric Biggers <ebiggers@google.com>
commit 363bee2

Currently, aio_poll_wake() will always remove the poll request from the
waitqueue.  Then, if aio_poll_complete_work() sees that none of the
polled events are ready and the request isn't cancelled, it re-adds the
request to the waitqueue.  (This can easily happen when polling a file
that doesn't pass an event mask when waking up its waitqueue.)

This is fundamentally broken for two reasons:

  1. If a wakeup occurs between vfs_poll() and the request being
     re-added to the waitqueue, it will be missed because the request
     wasn't on the waitqueue at the time.  Therefore, IOCB_CMD_POLL
     might never complete even if the polled file is ready.

  2. When the request isn't on the waitqueue, there is no way to be
     notified that the waitqueue is being freed (which happens when its
     lifetime is shorter than the struct file's).  This is supposed to
     happen via the waitqueue entries being woken up with POLLFREE.

Therefore, leave the requests on the waitqueue until they are actually
completed (or cancelled).  To keep track of when aio_poll_complete_work
needs to be scheduled, use new fields in struct poll_iocb.  Remove the
'done' field which is now redundant.
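The missed-wakeup window from reason 1 can be replayed deterministically in a small userspace model. Everything here is a hypothetical simplification: `model_req`, `model_wakeup()` and `model_replay()` are invented names, and a single flag stands in for "the request saw a wakeup".

```c
#include <stdbool.h>

/* Userspace model only; these names are illustrative, not kernel API. */
struct model_req {
    bool on_queue; /* is the request currently on the waitqueue? */
    bool notified; /* has it seen a wakeup since the last readiness check? */
};

/* A wakeup only reaches requests that are on the queue. */
static void model_wakeup(struct model_req *r)
{
    if (r->on_queue)
        r->notified = true;
}

/*
 * Replay this sequence: a wakeup arrives with nothing ready, the worker
 * checks the file (still not ready), the file then becomes ready and a
 * second wakeup fires, and only afterwards does the worker finish.
 * dequeue_on_wake selects the old behaviour (remove on wake, re-add later).
 * Returns whether the second wakeup was seen by the request.
 */
static bool model_replay(bool dequeue_on_wake)
{
    struct model_req r = { .on_queue = true, .notified = false };

    model_wakeup(&r);       /* first wakeup, no events ready */
    if (dequeue_on_wake)
        r.on_queue = false; /* old behaviour: entry removed on wake */
    r.notified = false;     /* worker checks readiness: nothing yet */

    model_wakeup(&r);       /* file becomes ready in this window */

    if (dequeue_on_wake)
        r.on_queue = true;  /* old behaviour: re-added, but too late */
    return r.notified;
}
```

In this model `model_replay(true)` returns false because the second wakeup finds nothing on the queue, while `model_replay(false)` returns true because the request never left it.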

Note that this is consistent with how sys_poll() and eventpoll work;
their wakeup functions do *not* remove the waitqueue entries.

Fixes: 2c14fa8 ("aio: implement IOCB_CMD_POLL")
	Cc: <stable@vger.kernel.org> # v4.18+
Link: https://lore.kernel.org/r/20211209010455.42744-5-ebiggers@kernel.org
	Signed-off-by: Eric Biggers <ebiggers@google.com>
(cherry picked from commit 363bee2)
	Signed-off-by: Brett Mastbergen <bmastbergen@ciq.com>

jira VULN-63551
cve CVE-2021-47505
commit-author Eric Biggers <ebiggers@google.com>
commit 50252e4

signalfd_poll() and binder_poll() are special in that they use a
waitqueue whose lifetime is the current task, rather than the struct
file as is normally the case.  This is okay for blocking polls, since a
blocking poll occurs within one task; however, non-blocking polls
require another solution.  This solution is for the queue to be cleared
before it is freed, by sending a POLLFREE notification to all waiters.

Unfortunately, only eventpoll handles POLLFREE.  A second type of
non-blocking poll, aio poll, was added in kernel v4.18, and it doesn't
handle POLLFREE.  This allows a use-after-free to occur if a signalfd or
binder fd is polled with aio poll, and the waitqueue gets freed.

Fix this by making aio poll handle POLLFREE.

A patch by Ramji Jiyani <ramjiyani@google.com>
(https://lore.kernel.org/r/20211027011834.2497484-1-ramjiyani@google.com)
tried to do this by making aio_poll_wake() always complete the request
inline if POLLFREE is seen.  However, that solution had two bugs.
First, it introduced a deadlock, as it unconditionally locked the aio
context while holding the waitqueue lock, which inverts the normal
locking order.  Second, it didn't consider that POLLFREE notifications
are missed while the request has been temporarily de-queued.

The second problem was solved by my previous patch.  This patch then
properly fixes the use-after-free by handling POLLFREE in a
deadlock-free way.  It does this by taking advantage of the fact that
freeing of the waitqueue is RCU-delayed, similar to what eventpoll does.

Fixes: 2c14fa8 ("aio: implement IOCB_CMD_POLL")
	Cc: <stable@vger.kernel.org> # v4.18+
Link: https://lore.kernel.org/r/20211209010455.42744-6-ebiggers@kernel.org
	Signed-off-by: Eric Biggers <ebiggers@google.com>
(cherry picked from commit 50252e4)
	Signed-off-by: Brett Mastbergen <bmastbergen@ciq.com>

jira VULN-4566
cve CVE-2023-52707
commit-author Munehisa Kamata <kamatam@amazon.com>
commit c2dbe32

If a non-root cgroup gets removed when there is a thread that registered
trigger and is polling on a pressure file within the cgroup, the polling
waitqueue gets freed in the following path:

 do_rmdir
   cgroup_rmdir
     kernfs_drain_open_files
       cgroup_file_release
         cgroup_pressure_release
           psi_trigger_destroy

However, the polling thread still has a reference to the pressure file and
will access the freed waitqueue when the file is closed or upon exit:

 fput
   ep_eventpoll_release
     ep_free
       ep_remove_wait_queue
         remove_wait_queue

This results in use-after-free as pasted below.

The fundamental problem here is that cgroup_file_release() (and
consequently waitqueue's lifetime) is not tied to the file's real lifetime.
Using wake_up_pollfree() here might be less than ideal, but it is in line
with the comment at commit 42288cb ("wait: add wake_up_pollfree()")
since the waitqueue's lifetime is not tied to file's one and can be
considered as another special case. While this would be fixable by somehow
making cgroup_file_release() be tied to the fput(), it would require
sizable refactoring at cgroups or higher layer which might be more
justifiable if we identify more cases like this.

  BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
  Write of size 4 at addr ffff88810e625328 by task a.out/4404

	CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
	Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
	Call Trace:
	<TASK>
	dump_stack_lvl+0x73/0xa0
	print_report+0x16c/0x4e0
	kasan_report+0xc3/0xf0
	kasan_check_range+0x2d2/0x310
	_raw_spin_lock_irqsave+0x60/0xc0
	remove_wait_queue+0x1a/0xa0
	ep_free+0x12c/0x170
	ep_eventpoll_release+0x26/0x30
	__fput+0x202/0x400
	task_work_run+0x11d/0x170
	do_exit+0x495/0x1130
	do_group_exit+0x100/0x100
	get_signal+0xd67/0xde0
	arch_do_signal_or_restart+0x2a/0x2b0
	exit_to_user_mode_prepare+0x94/0x100
	syscall_exit_to_user_mode+0x20/0x40
	do_syscall_64+0x52/0x90
	entry_SYSCALL_64_after_hwframe+0x63/0xcd
	</TASK>

 Allocated by task 4404:

	kasan_set_track+0x3d/0x60
	__kasan_kmalloc+0x85/0x90
	psi_trigger_create+0x113/0x3e0
	pressure_write+0x146/0x2e0
	cgroup_file_write+0x11c/0x250
	kernfs_fop_write_iter+0x186/0x220
	vfs_write+0x3d8/0x5c0
	ksys_write+0x90/0x110
	do_syscall_64+0x43/0x90
	entry_SYSCALL_64_after_hwframe+0x63/0xcd

 Freed by task 4407:

	kasan_set_track+0x3d/0x60
	kasan_save_free_info+0x27/0x40
	____kasan_slab_free+0x11d/0x170
	slab_free_freelist_hook+0x87/0x150
	__kmem_cache_free+0xcb/0x180
	psi_trigger_destroy+0x2e8/0x310
	cgroup_file_release+0x4f/0xb0
	kernfs_drain_open_files+0x165/0x1f0
	kernfs_drain+0x162/0x1a0
	__kernfs_remove+0x1fb/0x310
	kernfs_remove_by_name_ns+0x95/0xe0
	cgroup_addrm_files+0x67f/0x700
	cgroup_destroy_locked+0x283/0x3c0
	cgroup_rmdir+0x29/0x100
	kernfs_iop_rmdir+0xd1/0x140
	vfs_rmdir+0xfe/0x240
	do_rmdir+0x13d/0x280
	__x64_sys_rmdir+0x2c/0x30
	do_syscall_64+0x43/0x90
	entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: 0e94682 ("psi: introduce psi monitor")
	Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
	Signed-off-by: Mengchi Cheng <mengcc@amazon.com>
	Signed-off-by: Ingo Molnar <mingo@kernel.org>
	Acked-by: Suren Baghdasaryan <surenb@google.com>
	Acked-by: Peter Zijlstra <peterz@infradead.org>
	Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/
Link: https://lore.kernel.org/r/20230214212705.4058045-1-kamatam@amazon.com
(cherry picked from commit c2dbe32)
	Signed-off-by: Brett Mastbergen <bmastbergen@ciq.com>

@bmastbergen bmastbergen requested a review from a team October 2, 2025 14:44
@PlaidCat PlaidCat requested review from a team and removed request for a team October 3, 2025 14:24

@PlaidCat PlaidCat left a comment

:shipit:

@bmastbergen bmastbergen merged commit b4f997e into ciqlts9_2 Oct 3, 2025
4 checks passed
@bmastbergen bmastbergen deleted the bmastbergen_ciqlts9_2/many-vulns-10-1-25 branch October 3, 2025 15:29