Skip to content

SIG-CLOUD-9: X86/KASLR fixes for HMM utilized by CUDA + temp fixes/workaround for HMM kselftest #4

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

PlaidCat
Copy link
Collaborator

Bug fix requested by SIG-CLOUD user to address users needs with GPUs, CUDA and HMM.
This is the integration and temporary fix to allow the HMM kselftest to not live in an infinite loop.

Internal Ticket: SECO-170

BUILD

[maple@r9-sigcloud-builder kernel-src-tree]$ ../kernel-tools/kernel_build.sh
/mnt/code/kernel-src-tree
  CLEAN   arch/x86/boot/compressed
  CLEAN   arch/x86/boot

  CLEAN   include/config include/generated arch/x86/include/generated .config .config.old .version Module.symvers certs/signing_key.pem certs/signing_key.x509 certs/x509.genkey
[TIMER]{MRPROPER}: 7s
x86_64 architecture detected, copying config
'configs/kernel-x86_64-rhel.config' -> '.config'
Setting Local Version for build
CONFIG_LOCALVERSION="-jmaple_cuda_bug_fix"
Making olddefconfig
  HOSTCC  scripts/basic/fixdep
  HOSTCC  scripts/kconfig/conf.o
  HOSTCC  scripts/kconfig/confdata.o
  HOSTCC  scripts/kconfig/expr.o
  LEX     scripts/kconfig/lexer.lex.c
  YACC    scripts/kconfig/parser.tab.[ch]
  HOSTCC  scripts/kconfig/lexer.lex.o
  HOSTCC  scripts/kconfig/menu.o
  HOSTCC  scripts/kconfig/parser.tab.o
  HOSTCC  scripts/kconfig/preprocess.o
  HOSTCC  scripts/kconfig/symbol.o
  HOSTCC  scripts/kconfig/util.o
  HOSTLD  scripts/kconfig/conf
#
# configuration written to .config
#
Starting Build

  BTF [M] sound/xen/snd_xen_front.ko
  BTF [M] virt/lib/irqbypass.ko
[TIMER]{BUILD}: 1494s
Making Modules
  INSTALL /lib/modules/5.14.0-jmaple_cuda_bug_fix+/kernel/arch/x86/crypto/blake2s-x86_64.ko
  STRIP   /lib/modules/5.14.0-jmaple_cuda_bug_fix+/kernel/arch/x86/crypto/blake2s-x86_64.ko
  SIGN    /lib/modules/5.14.0-jmaple_cuda_bug_fix+/kernel/arch/x86/crypto/blake2s-x86_64.ko

  SIGN    /lib/modules/5.14.0-jmaple_cuda_bug_fix+/kernel/virt/lib/irqbypass.ko
  DEPMOD  /lib/modules/5.14.0-jmaple_cuda_bug_fix+
[TIMER]{MODULES}: 31s
Making Install
sh ./arch/x86/boot/install.sh 5.14.0-jmaple_cuda_bug_fix+ \
	arch/x86/boot/bzImage System.map "/boot"
[TIMER]{INSTALL}: 20s
Checking kABI
kABI check passed
Setting Default Kernel to /boot/vmlinuz-5.14.0-jmaple_cuda_bug_fix+ and Index to 4
The default is /boot/loader/entries/9d52c46c412f49e287269f012ed1d6ff-5.14.0-jmaple_cuda_bug_fix+.conf with index 4 and kernel /boot/vmlinuz-5.14.0-jmaple_cuda_bug_fix+
The default is /boot/loader/entries/9d52c46c412f49e287269f012ed1d6ff-5.14.0-jmaple_cuda_bug_fix+.conf with index 4 and kernel /boot/vmlinuz-5.14.0-jmaple_cuda_bug_fix+
Generating grub configuration file ...
Adding boot menu entry for UEFI Firmware Settings ...
done
Hopefully Grub2.0 took everything ... rebooting after time metrices
[TIMER]{MRPROPER}: 7s
[TIMER]{BUILD}: 1494s
[TIMER]{MODULES}: 31s
[TIMER]{INSTALL}: 20s
[TIMER]{TOTAL} 1556s

kABI Check

See in the build log

Checking kABI
kABI check passed

Kernel Self Tests

There have been issues with kselftests for Rocky9 so in this case we're only testing the HMM situation. Which also has its own problems in TEARDOWN on Failure.

5 Runs per pre and post update

hmm_update_wrong.txt is intentionally wrong to show that there is data that is different

[jmaple@devbox code]$ for i in $(seq 1 $(( ${#files[@]} - 1)) ); do echo ${files[i]}; diff ${files[0]} ${files[i]}; done
hmm_default_test2.txt
hmm_default_test3.txt
hmm_default_test4.txt
hmm_default_test5.txt
hmm_update_test1.txt
hmm_update_test2.txt
hmm_update_test3.txt
hmm_update_test4.txt
hmm_update_test5.txt
hmm_update_wrong.txt
240a241
> # Maple changed below here
242,245c243,246
< #            OK  hmm2.hmm2_device_coherent.double_map
< ok 56 hmm2.hmm2_device_coherent.double_map
< # FAILED: 53 / 56 tests passed.
< # Totals: pass:51 fail:3 xfail:0 xpass:0 skip:2 error:0
---
> #            FAIL  hmm2.hmm2_device_coherent.double_map
> fail 56 hmm2.hmm2_device_coherent.double_map
> # FAILED: 52 / 56 tests passed.
> # Totals: pass:50 fail:4 xfail:0 xpass:0 skip:2 error:0

Example

[maple@r9-sigcloud-builder mm]$ cat /mnt/code/hmm_default_test1.txt
running: bash
--------------------------------
running bash ./test_hmm.sh smoke
--------------------------------
Running smoke test. Note, this test provides basic coverage.
TAP version 13
1..56
# Starting 56 tests from 5 test cases.
#  RUN           hmm.hmm_device_private.open_close ...
#            OK  hmm.hmm_device_private.open_close
ok 1 hmm.hmm_device_private.open_close
#  RUN           hmm.hmm_device_private.anon_read ...
#            OK  hmm.hmm_device_private.anon_read
ok 2 hmm.hmm_device_private.anon_read
#  RUN           hmm.hmm_device_private.anon_read_prot ...
#            OK  hmm.hmm_device_private.anon_read_prot
ok 3 hmm.hmm_device_private.anon_read_prot
#  RUN           hmm.hmm_device_private.anon_write ...
#            OK  hmm.hmm_device_private.anon_write
ok 4 hmm.hmm_device_private.anon_write
#  RUN           hmm.hmm_device_private.anon_write_prot ...
#            OK  hmm.hmm_device_private.anon_write_prot
ok 5 hmm.hmm_device_private.anon_write_prot
#  RUN           hmm.hmm_device_private.anon_write_child ...
#            OK  hmm.hmm_device_private.anon_write_child
ok 6 hmm.hmm_device_private.anon_write_child
#  RUN           hmm.hmm_device_private.anon_write_child_shared ...
#            OK  hmm.hmm_device_private.anon_write_child_shared
ok 7 hmm.hmm_device_private.anon_write_child_shared
#  RUN           hmm.hmm_device_private.anon_write_huge ...
#            OK  hmm.hmm_device_private.anon_write_huge
ok 8 hmm.hmm_device_private.anon_write_huge
#  RUN           hmm.hmm_device_private.anon_write_hugetlbfs ...
#            OK  hmm.hmm_device_private.anon_write_hugetlbfs
ok 9 # SKIP Huge page could not be allocated
#  RUN           hmm.hmm_device_private.file_read ...
#            OK  hmm.hmm_device_private.file_read
ok 10 hmm.hmm_device_private.file_read
#  RUN           hmm.hmm_device_private.file_write ...
#            OK  hmm.hmm_device_private.file_write
ok 11 hmm.hmm_device_private.file_write
#  RUN           hmm.hmm_device_private.migrate ...
#            OK  hmm.hmm_device_private.migrate
ok 12 hmm.hmm_device_private.migrate
#  RUN           hmm.hmm_device_private.migrate_fault ...
#            OK  hmm.hmm_device_private.migrate_fault
ok 13 hmm.hmm_device_private.migrate_fault
#  RUN           hmm.hmm_device_private.migrate_release ...
#            OK  hmm.hmm_device_private.migrate_release
ok 14 hmm.hmm_device_private.migrate_release
#  RUN           hmm.hmm_device_private.migrate_shared ...
#            OK  hmm.hmm_device_private.migrate_shared
ok 15 hmm.hmm_device_private.migrate_shared
#  RUN           hmm.hmm_device_private.migrate_multiple ...
#            OK  hmm.hmm_device_private.migrate_multiple
ok 16 hmm.hmm_device_private.migrate_multiple
#  RUN           hmm.hmm_device_private.anon_read_multiple ...
#            OK  hmm.hmm_device_private.anon_read_multiple
ok 17 hmm.hmm_device_private.anon_read_multiple
#  RUN           hmm.hmm_device_private.anon_teardown ...
#            OK  hmm.hmm_device_private.anon_teardown
ok 18 hmm.hmm_device_private.anon_teardown
#  RUN           hmm.hmm_device_private.mixedmap ...
#            OK  hmm.hmm_device_private.mixedmap
ok 19 hmm.hmm_device_private.mixedmap
#  RUN           hmm.hmm_device_private.compound ...
#            OK  hmm.hmm_device_private.compound
ok 20 hmm.hmm_device_private.compound
#  RUN           hmm.hmm_device_private.exclusive ...
#          FAIL  hmm.hmm_device_private.exclusive
not ok 21 hmm.hmm_device_private.exclusive
#  RUN           hmm.hmm_device_private.exclusive_mprotect ...
#          FAIL  hmm.hmm_device_private.exclusive_mprotect
not ok 22 hmm.hmm_device_private.exclusive_mprotect
#  RUN           hmm.hmm_device_private.exclusive_cow ...
#          FAIL  hmm.hmm_device_private.exclusive_cow
not ok 23 hmm.hmm_device_private.exclusive_cow
#  RUN           hmm.hmm_device_private.hmm_gup_test ...
#            OK  hmm.hmm_device_private.hmm_gup_test
ok 24 # SKIP Skipping test, could not find gup_test driver
#  RUN           hmm.hmm_device_private.hmm_cow_in_device ...
#            OK  hmm.hmm_device_private.hmm_cow_in_device
ok 25 hmm.hmm_device_private.hmm_cow_in_device
#  RUN           hmm.hmm_device_coherent.open_close ...
#            OK  hmm.hmm_device_coherent.open_close
ok 26 hmm.hmm_device_coherent.open_close
#  RUN           hmm.hmm_device_coherent.anon_read ...
#            OK  hmm.hmm_device_coherent.anon_read
ok 27 hmm.hmm_device_coherent.anon_read
#  RUN           hmm.hmm_device_coherent.anon_read_prot ...
#            OK  hmm.hmm_device_coherent.anon_read_prot
ok 28 hmm.hmm_device_coherent.anon_read_prot
#  RUN           hmm.hmm_device_coherent.anon_write ...
#            OK  hmm.hmm_device_coherent.anon_write
ok 29 hmm.hmm_device_coherent.anon_write
#  RUN           hmm.hmm_device_coherent.anon_write_prot ...
#            OK  hmm.hmm_device_coherent.anon_write_prot
ok 30 hmm.hmm_device_coherent.anon_write_prot
#  RUN           hmm.hmm_device_coherent.anon_write_child ...
#            OK  hmm.hmm_device_coherent.anon_write_child
ok 31 hmm.hmm_device_coherent.anon_write_child
#  RUN           hmm.hmm_device_coherent.anon_write_child_shared ...
#            OK  hmm.hmm_device_coherent.anon_write_child_shared
ok 32 hmm.hmm_device_coherent.anon_write_child_shared
#  RUN           hmm.hmm_device_coherent.anon_write_huge ...
#            OK  hmm.hmm_device_coherent.anon_write_huge
ok 33 hmm.hmm_device_coherent.anon_write_huge
#  RUN           hmm.hmm_device_coherent.anon_write_hugetlbfs ...
#            OK  hmm.hmm_device_coherent.anon_write_hugetlbfs
ok 34 hmm.hmm_device_coherent.anon_write_hugetlbfs
#  RUN           hmm.hmm_device_coherent.file_read ...
#            OK  hmm.hmm_device_coherent.file_read
ok 35 hmm.hmm_device_coherent.file_read
#  RUN           hmm.hmm_device_coherent.file_write ...
#            OK  hmm.hmm_device_coherent.file_write
ok 36 hmm.hmm_device_coherent.file_write
#  RUN           hmm.hmm_device_coherent.migrate ...
#            OK  hmm.hmm_device_coherent.migrate
ok 37 hmm.hmm_device_coherent.migrate
#  RUN           hmm.hmm_device_coherent.migrate_fault ...
#            OK  hmm.hmm_device_coherent.migrate_fault
ok 38 hmm.hmm_device_coherent.migrate_fault
#  RUN           hmm.hmm_device_coherent.migrate_release ...
#            OK  hmm.hmm_device_coherent.migrate_release
ok 39 hmm.hmm_device_coherent.migrate_release
#  RUN           hmm.hmm_device_coherent.migrate_shared ...
#            OK  hmm.hmm_device_coherent.migrate_shared
ok 40 hmm.hmm_device_coherent.migrate_shared
#  RUN           hmm.hmm_device_coherent.migrate_multiple ...
#            OK  hmm.hmm_device_coherent.migrate_multiple
ok 41 hmm.hmm_device_coherent.migrate_multiple
#  RUN           hmm.hmm_device_coherent.anon_read_multiple ...
#            OK  hmm.hmm_device_coherent.anon_read_multiple
ok 42 hmm.hmm_device_coherent.anon_read_multiple
#  RUN           hmm.hmm_device_coherent.anon_teardown ...
#            OK  hmm.hmm_device_coherent.anon_teardown
ok 43 hmm.hmm_device_coherent.anon_teardown
#  RUN           hmm.hmm_device_coherent.mixedmap ...
#            OK  hmm.hmm_device_coherent.mixedmap
ok 44 hmm.hmm_device_coherent.mixedmap
#  RUN           hmm.hmm_device_coherent.compound ...
#            OK  hmm.hmm_device_coherent.compound
ok 45 hmm.hmm_device_coherent.compound
#  RUN           hmm.hmm_device_coherent.exclusive ...
#            OK  hmm.hmm_device_coherent.exclusive
ok 46 hmm.hmm_device_coherent.exclusive
#  RUN           hmm.hmm_device_coherent.exclusive_mprotect ...
#            OK  hmm.hmm_device_coherent.exclusive_mprotect
ok 47 hmm.hmm_device_coherent.exclusive_mprotect
#  RUN           hmm.hmm_device_coherent.exclusive_cow ...
#            OK  hmm.hmm_device_coherent.exclusive_cow
ok 48 hmm.hmm_device_coherent.exclusive_cow
#  RUN           hmm.hmm_device_coherent.hmm_gup_test ...
#            OK  hmm.hmm_device_coherent.hmm_gup_test
ok 49 hmm.hmm_device_coherent.hmm_gup_test
#  RUN           hmm.hmm_device_coherent.hmm_cow_in_device ...
#            OK  hmm.hmm_device_coherent.hmm_cow_in_device
ok 50 hmm.hmm_device_coherent.hmm_cow_in_device
#  RUN           hmm2.hmm2_device_private.migrate_mixed ...
#            OK  hmm2.hmm2_device_private.migrate_mixed
ok 51 hmm2.hmm2_device_private.migrate_mixed
#  RUN           hmm2.hmm2_device_private.snapshot ...
#            OK  hmm2.hmm2_device_private.snapshot
ok 52 hmm2.hmm2_device_private.snapshot
#  RUN           hmm2.hmm2_device_private.double_map ...
#            OK  hmm2.hmm2_device_private.double_map
ok 53 hmm2.hmm2_device_private.double_map
#  RUN           hmm2.hmm2_device_coherent.migrate_mixed ...
#            OK  hmm2.hmm2_device_coherent.migrate_mixed
ok 54 hmm2.hmm2_device_coherent.migrate_mixed
#  RUN           hmm2.hmm2_device_coherent.snapshot ...
#            OK  hmm2.hmm2_device_coherent.snapshot
ok 55 hmm2.hmm2_device_coherent.snapshot
#  RUN           hmm2.hmm2_device_coherent.double_map ...
#            OK  hmm2.hmm2_device_coherent.double_map
ok 56 hmm2.hmm2_device_coherent.double_map
# FAILED: 53 / 56 tests passed.
# Totals: pass:51 fail:3 xfail:0 xpass:0 skip:2 error:0
[PASS]
0
SUMMARY: PASS=1 SKIP=0 FAIL=0

jira SECO-170
bugfix cuda hangs
commit-author Thomas Gleixner <tglx@linutronix.de>
commit ea72ce5
upstream-diff Missing CONFIG_KMSAN commits in
arch/x86/include/asm/pgtable_64_types.h

iounmap() on x86 occasionally fails to unmap because the provided valid
ioremap address is not below high_memory. It turned out that this
happens due to KASLR.

KASLR uses the full address space between PAGE_OFFSET and vaddr_end to
randomize the starting points of the direct map, vmalloc and vmemmap
regions.  It thereby limits the size of the direct map by using the
installed memory size plus an extra configurable margin for hot-plug
memory.  This limitation is done to gain more randomization space
because otherwise only the holes between the direct map, vmalloc,
vmemmap and vaddr_end would be usable for randomizing.

The limited direct map size is not exposed to the rest of the kernel, so
the memory hot-plug and resource management related code paths still
operate under the assumption that the available address space can be
determined with MAX_PHYSMEM_BITS.

request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
downwards.  That means the first allocation happens past the end of the
direct map and if unlucky this address is in the vmalloc space, which
causes high_memory to become greater than VMALLOC_START and consequently
causes iounmap() to fail for valid ioremap addresses.

MAX_PHYSMEM_BITS cannot be changed for that because the randomization
does not align with address bit boundaries and there are other places
which actually require to know the maximum number of address bits.  All
remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found
to be correct.

Cure this by exposing the end of the direct map via PHYSMEM_END and use
that for the memory hot-plug and resource management related places
instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END
maps to a variable which is initialized by the KASLR initialization and
otherwise it is based on MAX_PHYSMEM_BITS as before.

To prevent future hickups add a check into add_pages() to catch callers
trying to add memory above PHYSMEM_END.

Fixes: 0483e1f ("x86/mm: Implement ASLR for kernel memory regions")
	Reported-by: Max Ramanouski <max8rr8@gmail.com>
	Reported-by: Alistair Popple <apopple@nvidia.com>
	Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-By: Max Ramanouski <max8rr8@gmail.com>
	Tested-by: Alistair Popple <apopple@nvidia.com>
	Reviewed-by: Dan Williams <dan.j.williams@intel.com>
	Reviewed-by: Alistair Popple <apopple@nvidia.com>
	Reviewed-by: Kees Cook <kees@kernel.org>
	Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx
(cherry picked from commit ea72ce5)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
jira SECO-170

In Rocky9 if you run ./run_vmtests.sh -t hmm it will fail and cause an
infinite loop on ASSERTs in FIXTURE_TEARDOWN()
This temporary fix is based on the discussion here
https://patchwork.kernel.org/project/linux-kselftest/patch/26017fe3-5ad7-6946-57db-e5ec48063ceb@suse.cz/#25046055

We will investigate further kselftest updates that will resolve the root
causes of this.
@PlaidCat PlaidCat requested a review from gvrose8192 October 23, 2024 20:51
Copy link

@gvrose8192 gvrose8192 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@PlaidCat PlaidCat merged commit 9b46c72 into sig-cloud-9/5.14.0-427.40.1.el9_4 Oct 23, 2024
@PlaidCat PlaidCat deleted the jmaple_r9-sigcloud-cuda_bug_fix branch October 23, 2024 21:05
@PlaidCat PlaidCat restored the jmaple_r9-sigcloud-cuda_bug_fix branch October 24, 2024 14:26
@PlaidCat PlaidCat deleted the jmaple_r9-sigcloud-cuda_bug_fix branch October 24, 2024 14:26
PlaidCat added a commit that referenced this pull request Nov 1, 2024
jira LE-2015
cve CVE-2024-40904
Rebuild_History Non-Buildable kernel-5.14.0-427.42.1.el9_4
commit-author Alan Stern <stern@rowland.harvard.edu>
commit 22f0081

The syzbot fuzzer found that the interrupt-URB completion callback in
the cdc-wdm driver was taking too long, and the driver's immediate
resubmission of interrupt URBs with -EPROTO status combined with the
dummy-hcd emulation to cause a CPU lockup:

cdc_wdm 1-1:1.0: nonzero urb status received: -71
cdc_wdm 1-1:1.0: wdm_int_callback - 0 bytes
watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [syz-executor782:6625]
CPU#0 Utilization every 4s during lockup:
	#1:  98% system,	  0% softirq,	  3% hardirq,	  0% idle
	#2:  98% system,	  0% softirq,	  3% hardirq,	  0% idle
	#3:  98% system,	  0% softirq,	  3% hardirq,	  0% idle
	#4:  98% system,	  0% softirq,	  3% hardirq,	  0% idle
	#5:  98% system,	  1% softirq,	  3% hardirq,	  0% idle
Modules linked in:
irq event stamp: 73096
hardirqs last  enabled at (73095): [<ffff80008037bc00>] console_emit_next_record kernel/printk/printk.c:2935 [inline]
hardirqs last  enabled at (73095): [<ffff80008037bc00>] console_flush_all+0x650/0xb74 kernel/printk/printk.c:2994
hardirqs last disabled at (73096): [<ffff80008af10b00>] __el1_irq arch/arm64/kernel/entry-common.c:533 [inline]
hardirqs last disabled at (73096): [<ffff80008af10b00>] el1_interrupt+0x24/0x68 arch/arm64/kernel/entry-common.c:551
softirqs last  enabled at (73048): [<ffff8000801ea530>] softirq_handle_end kernel/softirq.c:400 [inline]
softirqs last  enabled at (73048): [<ffff8000801ea530>] handle_softirqs+0xa60/0xc34 kernel/softirq.c:582
softirqs last disabled at (73043): [<ffff800080020de8>] __do_softirq+0x14/0x20 kernel/softirq.c:588
CPU: 0 PID: 6625 Comm: syz-executor782 Tainted: G        W          6.10.0-rc2-syzkaller-g8867bbd4a056 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024

Testing showed that the problem did not occur if the two error
messages -- the first two lines above -- were removed; apparently adding
material to the kernel log takes a surprisingly large amount of time.

In any case, the best approach for preventing these lockups and to
avoid spamming the log with thousands of error messages per second is
to ratelimit the two dev_err() calls.  Therefore we replace them with
dev_err_ratelimited().

	Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
	Suggested-by: Greg KH <gregkh@linuxfoundation.org>
Reported-and-tested-by: syzbot+5f996b83575ef4058638@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-usb/00000000000073d54b061a6a1c65@google.com/
Reported-and-tested-by: syzbot+1b2abad17596ad03dcff@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-usb/000000000000f45085061aa9b37e@google.com/
Fixes: 9908a32 ("USB: remove err() macro from usb class drivers")
Link: https://lore.kernel.org/linux-usb/40dfa45b-5f21-4eef-a8c1-51a2f320e267@rowland.harvard.edu/
	Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/29855215-52f5-4385-b058-91f42c2bee18@rowland.harvard.edu
	Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 22f0081)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
PlaidCat added a commit that referenced this pull request Dec 17, 2024
jira LE-2169
cve CVE-2024-38540
Rebuild_History Non-Buildable kernel-4.18.0-553.27.1.el8_10
commit-author Michal Schmidt <mschmidt@redhat.com>
commit 78cfd17
Empty-Commit: Cherry-Pick Conflicts during history rebuild.
Will be included in final tarball splat. Ref for failed cherry-pick at:
ciq/ciq_backports/kernel-4.18.0-553.27.1.el8_10/78cfd171.failed

Undefined behavior is triggered when bnxt_qplib_alloc_init_hwq is called
with hwq_attr->aux_depth != 0 and hwq_attr->aux_stride == 0.
In that case, "roundup_pow_of_two(hwq_attr->aux_stride)" gets called.
roundup_pow_of_two is documented as undefined for 0.

Fix it in the one caller that had this combination.

The undefined behavior was detected by UBSAN:
  UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
  shift exponent 64 is too large for 64-bit type 'long unsigned int'
  CPU: 24 PID: 1075 Comm: (udev-worker) Not tainted 6.9.0-rc6+ #4
  Hardware name: Abacus electric, s.r.o. - servis@abacus.cz Super Server/H12SSW-iN, BIOS 2.7 10/25/2023
  Call Trace:
   <TASK>
   dump_stack_lvl+0x5d/0x80
   ubsan_epilogue+0x5/0x30
   __ubsan_handle_shift_out_of_bounds.cold+0x61/0xec
   __roundup_pow_of_two+0x25/0x35 [bnxt_re]
   bnxt_qplib_alloc_init_hwq+0xa1/0x470 [bnxt_re]
   bnxt_qplib_create_qp+0x19e/0x840 [bnxt_re]
   bnxt_re_create_qp+0x9b1/0xcd0 [bnxt_re]
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __kmalloc+0x1b6/0x4f0
   ? create_qp.part.0+0x128/0x1c0 [ib_core]
   ? __pfx_bnxt_re_create_qp+0x10/0x10 [bnxt_re]
   create_qp.part.0+0x128/0x1c0 [ib_core]
   ib_create_qp_kernel+0x50/0xd0 [ib_core]
   create_mad_qp+0x8e/0xe0 [ib_core]
   ? __pfx_qp_event_handler+0x10/0x10 [ib_core]
   ib_mad_init_device+0x2be/0x680 [ib_core]
   add_client_context+0x10d/0x1a0 [ib_core]
   enable_device_and_get+0xe0/0x1d0 [ib_core]
   ib_register_device+0x53c/0x630 [ib_core]
   ? srso_alias_return_thunk+0x5/0xfbef5
   bnxt_re_probe+0xbd8/0xe50 [bnxt_re]
   ? __pfx_bnxt_re_probe+0x10/0x10 [bnxt_re]
   auxiliary_bus_probe+0x49/0x80
   ? driver_sysfs_add+0x57/0xc0
   really_probe+0xde/0x340
   ? pm_runtime_barrier+0x54/0x90
   ? __pfx___driver_attach+0x10/0x10
   __driver_probe_device+0x78/0x110
   driver_probe_device+0x1f/0xa0
   __driver_attach+0xba/0x1c0
   bus_for_each_dev+0x8f/0xe0
   bus_add_driver+0x146/0x220
   driver_register+0x72/0xd0
   __auxiliary_driver_register+0x6e/0xd0
   ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
   bnxt_re_mod_init+0x3e/0xff0 [bnxt_re]
   ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
   do_one_initcall+0x5b/0x310
   do_init_module+0x90/0x250
   init_module_from_file+0x86/0xc0
   idempotent_init_module+0x121/0x2b0
   __x64_sys_finit_module+0x5e/0xb0
   do_syscall_64+0x82/0x160
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? syscall_exit_to_user_mode_prepare+0x149/0x170
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? syscall_exit_to_user_mode+0x75/0x230
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? do_syscall_64+0x8e/0x160
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __count_memcg_events+0x69/0x100
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? count_memcg_events.constprop.0+0x1a/0x30
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? handle_mm_fault+0x1f0/0x300
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? do_user_addr_fault+0x34e/0x640
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f4e5132821d
  Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e3 db 0c 00 f7 d8 64 89 01 48
  RSP: 002b:00007ffca9c906a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
  RAX: ffffffffffffffda RBX: 0000563ec8a8f130 RCX: 00007f4e5132821d
  RDX: 0000000000000000 RSI: 00007f4e518fa07d RDI: 000000000000003b
  RBP: 00007ffca9c90760 R08: 00007f4e513f6b20 R09: 00007ffca9c906f0
  R10: 0000563ec8a8faa0 R11: 0000000000000246 R12: 00007f4e518fa07d
  R13: 0000000000020000 R14: 0000563ec8409e90 R15: 0000563ec8a8fa60
   </TASK>
  ---[ end trace ]---

Fixes: 0c4dcd6 ("RDMA/bnxt_re: Refactor hardware queue memory allocation")
	Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Link: https://lore.kernel.org/r/20240507103929.30003-1-mschmidt@redhat.com
	Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
	Signed-off-by: Leon Romanovsky <leon@kernel.org>
(cherry picked from commit 78cfd17)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>

# Conflicts:
#	drivers/infiniband/hw/bnxt_re/qplib_fp.c
PlaidCat pushed a commit that referenced this pull request Dec 19, 2024
Its used from trace__run(), for the 'perf trace' live mode, i.e. its
strace-like, non-perf.data file processing mode, the most common one.

The trace__run() function will set trace->host using machine__new_host()
that is supposed to give a machine instance representing the running
machine, and since we'll use perf_env__arch_strerrno() to get the right
errno -> string table, we need to use machine->env, so initialize it in
machine__new_host().

Before the patch:

  (gdb) run trace --errno-summary -a sleep 1
  <SNIP>
   Summary of events:

   gvfs-afc-volume (3187), 2 events, 0.0%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     pselect6               1      0     0.000     0.000     0.000     0.000      0.00%

   GUsbEventThread (3519), 2 events, 0.0%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     poll                   1      0     0.000     0.000     0.000     0.000      0.00%
  <SNIP>
  Program received signal SIGSEGV, Segmentation fault.
  0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478
  478		if (env->arch_strerrno == NULL)
  (gdb) bt
  #0  0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478
  #1  0x00000000004b75d2 in thread__dump_stats (ttrace=0x14f58f0, trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4673
  #2  0x00000000004b78bf in trace__fprintf_thread (fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>, thread=0x10fa0b0, trace=0x7fffffffa5b0) at builtin-trace.c:4708
  #3  0x00000000004b7ad9 in trace__fprintf_thread_summary (trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4747
  #4  0x00000000004b656e in trace__run (trace=0x7fffffffa5b0, argc=2, argv=0x7fffffffde60) at builtin-trace.c:4456
  #5  0x00000000004ba43e in cmd_trace (argc=2, argv=0x7fffffffde60) at builtin-trace.c:5487
  #6  0x00000000004c0414 in run_builtin (p=0xec3068 <commands+648>, argc=5, argv=0x7fffffffde60) at perf.c:351
  #7  0x00000000004c06bb in handle_internal_command (argc=5, argv=0x7fffffffde60) at perf.c:404
  #8  0x00000000004c0814 in run_argv (argcp=0x7fffffffdc4c, argv=0x7fffffffdc40) at perf.c:448
  #9  0x00000000004c0b5d in main (argc=5, argv=0x7fffffffde60) at perf.c:560
  (gdb)

After:

  root@number:~# perf trace -a --errno-summary sleep 1
  <SNIP>
     pw-data-loop (2685), 1410 events, 16.0%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     epoll_wait           188      0   983.428     0.000     5.231    15.595      8.68%
     ioctl                 94      0     0.811     0.004     0.009     0.016      2.82%
     read                 188      0     0.322     0.001     0.002     0.006      5.15%
     write                141      0     0.280     0.001     0.002     0.018      8.39%
     timerfd_settime       94      0     0.138     0.001     0.001     0.007      6.47%

   gnome-control-c (179406), 1848 events, 20.9%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     poll                 222      0   959.577     0.000     4.322    21.414     11.40%
     recvmsg              150      0     0.539     0.001     0.004     0.013      5.12%
     write                300      0     0.442     0.001     0.001     0.007      3.29%
     read                 150      0     0.183     0.001     0.001     0.009      5.53%
     getpid               102      0     0.101     0.000     0.001     0.008      7.82%

  root@number:~#

Fixes: 54373b5 ("perf env: Introduce perf_env__arch_strerrno()")
Reported-by: Veronika Molnarova <vmolnaro@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Veronika Molnarova <vmolnaro@redhat.com>
Acked-by: Michael Petlan <mpetlan@redhat.com>
Tested-by: Michael Petlan <mpetlan@redhat.com>
Link: https://lore.kernel.org/r/Z0XffUgNSv_9OjOi@x1
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
PlaidCat pushed a commit that referenced this pull request Dec 19, 2024
…s_lock

For storing a value to a queue attribute, the queue_attr_store function
first freezes the queue (->q_usage_counter(io)) and then acquire
->sysfs_lock. This seems not correct as the usual ordering should be to
acquire ->sysfs_lock before freezing the queue. This incorrect ordering
causes the following lockdep splat which we are able to reproduce always
simply by accessing /sys/kernel/debug file using ls command:

[   57.597146] WARNING: possible circular locking dependency detected
[   57.597154] 6.12.0-10553-gb86545e02e8c #20 Tainted: G        W
[   57.597162] ------------------------------------------------------
[   57.597168] ls/4605 is trying to acquire lock:
[   57.597176] c00000003eb56710 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x58/0xc0
[   57.597200]
               but task is already holding lock:
[   57.597207] c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4
[   57.597226]
               which lock already depends on the new lock.

[   57.597233]
               the existing dependency chain (in reverse order) is:
[   57.597241]
               -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}:
[   57.597255]        down_write+0x6c/0x18c
[   57.597264]        start_creating+0xb4/0x24c
[   57.597274]        debugfs_create_dir+0x2c/0x1e8
[   57.597283]        blk_register_queue+0xec/0x294
[   57.597292]        add_disk_fwnode+0x2e4/0x548
[   57.597302]        brd_alloc+0x2c8/0x338
[   57.597309]        brd_init+0x100/0x178
[   57.597317]        do_one_initcall+0x88/0x3e4
[   57.597326]        kernel_init_freeable+0x3cc/0x6e0
[   57.597334]        kernel_init+0x34/0x1cc
[   57.597342]        ret_from_kernel_user_thread+0x14/0x1c
[   57.597350]
               -> #4 (&q->debugfs_mutex){+.+.}-{4:4}:
[   57.597362]        __mutex_lock+0xfc/0x12a0
[   57.597370]        blk_register_queue+0xd4/0x294
[   57.597379]        add_disk_fwnode+0x2e4/0x548
[   57.597388]        brd_alloc+0x2c8/0x338
[   57.597395]        brd_init+0x100/0x178
[   57.597402]        do_one_initcall+0x88/0x3e4
[   57.597410]        kernel_init_freeable+0x3cc/0x6e0
[   57.597418]        kernel_init+0x34/0x1cc
[   57.597426]        ret_from_kernel_user_thread+0x14/0x1c
[   57.597434]
               -> #3 (&q->sysfs_lock){+.+.}-{4:4}:
[   57.597446]        __mutex_lock+0xfc/0x12a0
[   57.597454]        queue_attr_store+0x9c/0x110
[   57.597462]        sysfs_kf_write+0x70/0xb0
[   57.597471]        kernfs_fop_write_iter+0x1b0/0x2ac
[   57.597480]        vfs_write+0x3dc/0x6e8
[   57.597488]        ksys_write+0x84/0x140
[   57.597495]        system_call_exception+0x130/0x360
[   57.597504]        system_call_common+0x160/0x2c4
[   57.597516]
               -> #2 (&q->q_usage_counter(io)#21){++++}-{0:0}:
[   57.597530]        __submit_bio+0x5ec/0x828
[   57.597538]        submit_bio_noacct_nocheck+0x1e4/0x4f0
[   57.597547]        iomap_readahead+0x2a0/0x448
[   57.597556]        xfs_vm_readahead+0x28/0x3c
[   57.597564]        read_pages+0x88/0x41c
[   57.597571]        page_cache_ra_unbounded+0x1ac/0x2d8
[   57.597580]        filemap_get_pages+0x188/0x984
[   57.597588]        filemap_read+0x13c/0x4bc
[   57.597596]        xfs_file_buffered_read+0x88/0x17c
[   57.597605]        xfs_file_read_iter+0xac/0x158
[   57.597614]        vfs_read+0x2d4/0x3b4
[   57.597622]        ksys_read+0x84/0x144
[   57.597629]        system_call_exception+0x130/0x360
[   57.597637]        system_call_common+0x160/0x2c4
[   57.597647]
               -> #1 (mapping.invalidate_lock#2){++++}-{4:4}:
[   57.597661]        down_read+0x6c/0x220
[   57.597669]        filemap_fault+0x870/0x100c
[   57.597677]        xfs_filemap_fault+0xc4/0x18c
[   57.597684]        __do_fault+0x64/0x164
[   57.597693]        __handle_mm_fault+0x1274/0x1dac
[   57.597702]        handle_mm_fault+0x248/0x484
[   57.597711]        ___do_page_fault+0x428/0xc0c
[   57.597719]        hash__do_page_fault+0x30/0x68
[   57.597727]        do_hash_fault+0x90/0x35c
[   57.597736]        data_access_common_virt+0x210/0x220
[   57.597745]        _copy_from_user+0xf8/0x19c
[   57.597754]        sel_write_load+0x178/0xd54
[   57.597762]        vfs_write+0x108/0x6e8
[   57.597769]        ksys_write+0x84/0x140
[   57.597777]        system_call_exception+0x130/0x360
[   57.597785]        system_call_common+0x160/0x2c4
[   57.597794]
               -> #0 (&mm->mmap_lock){++++}-{4:4}:
[   57.597806]        __lock_acquire+0x17cc/0x2330
[   57.597814]        lock_acquire+0x138/0x400
[   57.597822]        __might_fault+0x7c/0xc0
[   57.597830]        filldir64+0xe8/0x390
[   57.597839]        dcache_readdir+0x80/0x2d4
[   57.597846]        iterate_dir+0xd8/0x1d4
[   57.597855]        sys_getdents64+0x88/0x2d4
[   57.597864]        system_call_exception+0x130/0x360
[   57.597872]        system_call_common+0x160/0x2c4
[   57.597881]
               other info that might help us debug this:

[   57.597888] Chain exists of:
                 &mm->mmap_lock --> &q->debugfs_mutex --> &sb->s_type->i_mutex_key#3

[   57.597905]  Possible unsafe locking scenario:

[   57.597911]        CPU0                    CPU1
[   57.597917]        ----                    ----
[   57.597922]   rlock(&sb->s_type->i_mutex_key#3);
[   57.597932]                                lock(&q->debugfs_mutex);
[   57.597940]                                lock(&sb->s_type->i_mutex_key#3);
[   57.597950]   rlock(&mm->mmap_lock);
[   57.597958]
                *** DEADLOCK ***

[   57.597965] 2 locks held by ls/4605:
[   57.597971]  #0: c0000000137c12f8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0xcc/0x154
[   57.597989]  #1: c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4

Prevent the above lockdep warning by acquiring ->sysfs_lock before
freezing the queue while storing a queue attribute in queue_attr_store
function. Later, we also found[1] another function __blk_mq_update_nr_
hw_queues where we first freeze queue and then acquire the ->sysfs_lock.
So we've also updated lock ordering in __blk_mq_update_nr_hw_queues
function and ensured that in all code paths we follow the correct lock
ordering i.e. acquire ->sysfs_lock before freezing the queue.

[1] https://lore.kernel.org/all/CAFj5m9Ke8+EHKQBs_Nk6hqd=LGXtk4mUxZUN5==ZcCjnZSBwHw@mail.gmail.com/

Reported-by: kjain@linux.ibm.com
Fixes: af28141 ("block: freeze the queue in queue_attr_store")
Tested-by: kjain@linux.ibm.com
Cc: hch@lst.de
Cc: axboe@kernel.dk
Cc: ritesh.list@gmail.com
Cc: ming.lei@redhat.com
Cc: gjoyce@linux.ibm.com
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241210144222.1066229-1-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
gvrose8192 added a commit that referenced this pull request Jan 13, 2025
jira VULN-6730
cve CVE-2023-4921
commit-author valis <sec@valis.email>
commit 8fc134f

When the plug qdisc is used as a class of the qfq qdisc it could trigger a
UAF. This issue can be reproduced with following commands:

  tc qdisc add dev lo root handle 1: qfq
  tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
  tc qdisc add dev lo parent 1:1 handle 2: plug
  tc filter add dev lo parent 1: basic classid 1:1
  ping -c1 127.0.0.1

and boom:

[  285.353793] BUG: KASAN: slab-use-after-free in qfq_dequeue+0xa7/0x7f0
[  285.354910] Read of size 4 at addr ffff8880bad312a8 by task ping/144
[  285.355903]
[  285.356165] CPU: 1 PID: 144 Comm: ping Not tainted 6.5.0-rc3+ #4
[  285.357112] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[  285.358376] Call Trace:
[  285.358773]  <IRQ>
[  285.359109]  dump_stack_lvl+0x44/0x60
[  285.359708]  print_address_description.constprop.0+0x2c/0x3c0
[  285.360611]  kasan_report+0x10c/0x120
[  285.361195]  ? qfq_dequeue+0xa7/0x7f0
[  285.361780]  qfq_dequeue+0xa7/0x7f0
[  285.362342]  __qdisc_run+0xf1/0x970
[  285.362903]  net_tx_action+0x28e/0x460
[  285.363502]  __do_softirq+0x11b/0x3de
[  285.364097]  do_softirq.part.0+0x72/0x90
[  285.364721]  </IRQ>
[  285.365072]  <TASK>
[  285.365422]  __local_bh_enable_ip+0x77/0x90
[  285.366079]  __dev_queue_xmit+0x95f/0x1550
[  285.366732]  ? __pfx_csum_and_copy_from_iter+0x10/0x10
[  285.367526]  ? __pfx___dev_queue_xmit+0x10/0x10
[  285.368259]  ? __build_skb_around+0x129/0x190
[  285.368960]  ? ip_generic_getfrag+0x12c/0x170
[  285.369653]  ? __pfx_ip_generic_getfrag+0x10/0x10
[  285.370390]  ? csum_partial+0x8/0x20
[  285.370961]  ? raw_getfrag+0xe5/0x140
[  285.371559]  ip_finish_output2+0x539/0xa40
[  285.372222]  ? __pfx_ip_finish_output2+0x10/0x10
[  285.372954]  ip_output+0x113/0x1e0
[  285.373512]  ? __pfx_ip_output+0x10/0x10
[  285.374130]  ? icmp_out_count+0x49/0x60
[  285.374739]  ? __pfx_ip_finish_output+0x10/0x10
[  285.375457]  ip_push_pending_frames+0xf3/0x100
[  285.376173]  raw_sendmsg+0xef5/0x12d0
[  285.376760]  ? do_syscall_64+0x40/0x90
[  285.377359]  ? __static_call_text_end+0x136578/0x136578
[  285.378173]  ? do_syscall_64+0x40/0x90
[  285.378772]  ? kasan_enable_current+0x11/0x20
[  285.379469]  ? __pfx_raw_sendmsg+0x10/0x10
[  285.380137]  ? __sock_create+0x13e/0x270
[  285.380673]  ? __sys_socket+0xf3/0x180
[  285.381174]  ? __x64_sys_socket+0x3d/0x50
[  285.381725]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.382425]  ? __rcu_read_unlock+0x48/0x70
[  285.382975]  ? ip4_datagram_release_cb+0xd8/0x380
[  285.383608]  ? __pfx_ip4_datagram_release_cb+0x10/0x10
[  285.384295]  ? preempt_count_sub+0x14/0xc0
[  285.384844]  ? __list_del_entry_valid+0x76/0x140
[  285.385467]  ? _raw_spin_lock_bh+0x87/0xe0
[  285.386014]  ? __pfx__raw_spin_lock_bh+0x10/0x10
[  285.386645]  ? release_sock+0xa0/0xd0
[  285.387148]  ? preempt_count_sub+0x14/0xc0
[  285.387712]  ? freeze_secondary_cpus+0x348/0x3c0
[  285.388341]  ? aa_sk_perm+0x177/0x390
[  285.388856]  ? __pfx_aa_sk_perm+0x10/0x10
[  285.389441]  ? check_stack_object+0x22/0x70
[  285.390032]  ? inet_send_prepare+0x2f/0x120
[  285.390603]  ? __pfx_inet_sendmsg+0x10/0x10
[  285.391172]  sock_sendmsg+0xcc/0xe0
[  285.391667]  __sys_sendto+0x190/0x230
[  285.392168]  ? __pfx___sys_sendto+0x10/0x10
[  285.392727]  ? kvm_clock_get_cycles+0x14/0x30
[  285.393328]  ? set_normalized_timespec64+0x57/0x70
[  285.393980]  ? _raw_spin_unlock_irq+0x1b/0x40
[  285.394578]  ? __x64_sys_clock_gettime+0x11c/0x160
[  285.395225]  ? __pfx___x64_sys_clock_gettime+0x10/0x10
[  285.395908]  ? _copy_to_user+0x3e/0x60
[  285.396432]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.397086]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.397734]  ? do_syscall_64+0x71/0x90
[  285.398258]  __x64_sys_sendto+0x74/0x90
[  285.398786]  do_syscall_64+0x64/0x90
[  285.399273]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.399949]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.400605]  ? do_syscall_64+0x71/0x90
[  285.401124]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.401807] RIP: 0033:0x495726
[  285.402233] Code: ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 09
[  285.404683] RSP: 002b:00007ffcc25fb618 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  285.405677] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 0000000000495726
[  285.406628] RDX: 0000000000000040 RSI: 0000000002518750 RDI: 0000000000000000
[  285.407565] RBP: 00000000005205ef R08: 00000000005f8838 R09: 000000000000001c
[  285.408523] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002517634
[  285.409460] R13: 00007ffcc25fb6f0 R14: 0000000000000003 R15: 0000000000000000
[  285.410403]  </TASK>
[  285.410704]
[  285.410929] Allocated by task 144:
[  285.411402]  kasan_save_stack+0x1e/0x40
[  285.411926]  kasan_set_track+0x21/0x30
[  285.412442]  __kasan_slab_alloc+0x55/0x70
[  285.412973]  kmem_cache_alloc_node+0x187/0x3d0
[  285.413567]  __alloc_skb+0x1b4/0x230
[  285.414060]  __ip_append_data+0x17f7/0x1b60
[  285.414633]  ip_append_data+0x97/0xf0
[  285.415144]  raw_sendmsg+0x5a8/0x12d0
[  285.415640]  sock_sendmsg+0xcc/0xe0
[  285.416117]  __sys_sendto+0x190/0x230
[  285.416626]  __x64_sys_sendto+0x74/0x90
[  285.417145]  do_syscall_64+0x64/0x90
[  285.417624]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.418306]
[  285.418531] Freed by task 144:
[  285.418960]  kasan_save_stack+0x1e/0x40
[  285.419469]  kasan_set_track+0x21/0x30
[  285.419988]  kasan_save_free_info+0x27/0x40
[  285.420556]  ____kasan_slab_free+0x109/0x1a0
[  285.421146]  kmem_cache_free+0x1c2/0x450
[  285.421680]  __netif_receive_skb_core+0x2ce/0x1870
[  285.422333]  __netif_receive_skb_one_core+0x97/0x140
[  285.423003]  process_backlog+0x100/0x2f0
[  285.423537]  __napi_poll+0x5c/0x2d0
[  285.424023]  net_rx_action+0x2be/0x560
[  285.424510]  __do_softirq+0x11b/0x3de
[  285.425034]
[  285.425254] The buggy address belongs to the object at ffff8880bad31280
[  285.425254]  which belongs to the cache skbuff_head_cache of size 224
[  285.426993] The buggy address is located 40 bytes inside of
[  285.426993]  freed 224-byte region [ffff8880bad31280, ffff8880bad31360)
[  285.428572]
[  285.428798] The buggy address belongs to the physical page:
[  285.429540] page:00000000f4b77674 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbad31
[  285.430758] flags: 0x100000000000200(slab|node=0|zone=1)
[  285.431447] page_type: 0xffffffff()
[  285.431934] raw: 0100000000000200 ffff88810094a8c0 dead000000000122 0000000000000000
[  285.432757] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
[  285.433562] page dumped because: kasan: bad access detected
[  285.434144]
[  285.434320] Memory state around the buggy address:
[  285.434828]  ffff8880bad31180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.435580]  ffff8880bad31200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.436264] >ffff8880bad31280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  285.436777]                                   ^
[  285.437106]  ffff8880bad31300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[  285.437616]  ffff8880bad31380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.438126] ==================================================================
[  285.438662] Disabling lock debugging due to kernel taint

Fix this by:
1. Changing sch_plug's .peek handler to qdisc_peek_dequeued(), a
function compatible with non-work-conserving qdiscs
2. Checking the return value of qdisc_dequeue_peeked() in sch_qfq.

Fixes: 462dbc9 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
	Reported-by: valis <sec@valis.email>
	Signed-off-by: valis <sec@valis.email>
	Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20230901162237.11525-1-jhs@mojatatu.com
	Signed-off-by: Paolo Abeni <pabeni@redhat.com>
(cherry picked from commit 8fc134f)
	Signed-off-by: Greg Rose <g.v.rose@ciq.com>
gvrose8192 added a commit that referenced this pull request Jan 13, 2025
jira VULN-6730
cve CVE-2023-4921
commit-author valis <sec@valis.email>
commit 8fc134f

When the plug qdisc is used as a class of the qfq qdisc it could trigger a
UAF. This issue can be reproduced with following commands:

  tc qdisc add dev lo root handle 1: qfq
  tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
  tc qdisc add dev lo parent 1:1 handle 2: plug
  tc filter add dev lo parent 1: basic classid 1:1
  ping -c1 127.0.0.1

and boom:

[  285.353793] BUG: KASAN: slab-use-after-free in qfq_dequeue+0xa7/0x7f0
[  285.354910] Read of size 4 at addr ffff8880bad312a8 by task ping/144
[  285.355903]
[  285.356165] CPU: 1 PID: 144 Comm: ping Not tainted 6.5.0-rc3+ #4
[  285.357112] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[  285.358376] Call Trace:
[  285.358773]  <IRQ>
[  285.359109]  dump_stack_lvl+0x44/0x60
[  285.359708]  print_address_description.constprop.0+0x2c/0x3c0
[  285.360611]  kasan_report+0x10c/0x120
[  285.361195]  ? qfq_dequeue+0xa7/0x7f0
[  285.361780]  qfq_dequeue+0xa7/0x7f0
[  285.362342]  __qdisc_run+0xf1/0x970
[  285.362903]  net_tx_action+0x28e/0x460
[  285.363502]  __do_softirq+0x11b/0x3de
[  285.364097]  do_softirq.part.0+0x72/0x90
[  285.364721]  </IRQ>
[  285.365072]  <TASK>
[  285.365422]  __local_bh_enable_ip+0x77/0x90
[  285.366079]  __dev_queue_xmit+0x95f/0x1550
[  285.366732]  ? __pfx_csum_and_copy_from_iter+0x10/0x10
[  285.367526]  ? __pfx___dev_queue_xmit+0x10/0x10
[  285.368259]  ? __build_skb_around+0x129/0x190
[  285.368960]  ? ip_generic_getfrag+0x12c/0x170
[  285.369653]  ? __pfx_ip_generic_getfrag+0x10/0x10
[  285.370390]  ? csum_partial+0x8/0x20
[  285.370961]  ? raw_getfrag+0xe5/0x140
[  285.371559]  ip_finish_output2+0x539/0xa40
[  285.372222]  ? __pfx_ip_finish_output2+0x10/0x10
[  285.372954]  ip_output+0x113/0x1e0
[  285.373512]  ? __pfx_ip_output+0x10/0x10
[  285.374130]  ? icmp_out_count+0x49/0x60
[  285.374739]  ? __pfx_ip_finish_output+0x10/0x10
[  285.375457]  ip_push_pending_frames+0xf3/0x100
[  285.376173]  raw_sendmsg+0xef5/0x12d0
[  285.376760]  ? do_syscall_64+0x40/0x90
[  285.377359]  ? __static_call_text_end+0x136578/0x136578
[  285.378173]  ? do_syscall_64+0x40/0x90
[  285.378772]  ? kasan_enable_current+0x11/0x20
[  285.379469]  ? __pfx_raw_sendmsg+0x10/0x10
[  285.380137]  ? __sock_create+0x13e/0x270
[  285.380673]  ? __sys_socket+0xf3/0x180
[  285.381174]  ? __x64_sys_socket+0x3d/0x50
[  285.381725]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.382425]  ? __rcu_read_unlock+0x48/0x70
[  285.382975]  ? ip4_datagram_release_cb+0xd8/0x380
[  285.383608]  ? __pfx_ip4_datagram_release_cb+0x10/0x10
[  285.384295]  ? preempt_count_sub+0x14/0xc0
[  285.384844]  ? __list_del_entry_valid+0x76/0x140
[  285.385467]  ? _raw_spin_lock_bh+0x87/0xe0
[  285.386014]  ? __pfx__raw_spin_lock_bh+0x10/0x10
[  285.386645]  ? release_sock+0xa0/0xd0
[  285.387148]  ? preempt_count_sub+0x14/0xc0
[  285.387712]  ? freeze_secondary_cpus+0x348/0x3c0
[  285.388341]  ? aa_sk_perm+0x177/0x390
[  285.388856]  ? __pfx_aa_sk_perm+0x10/0x10
[  285.389441]  ? check_stack_object+0x22/0x70
[  285.390032]  ? inet_send_prepare+0x2f/0x120
[  285.390603]  ? __pfx_inet_sendmsg+0x10/0x10
[  285.391172]  sock_sendmsg+0xcc/0xe0
[  285.391667]  __sys_sendto+0x190/0x230
[  285.392168]  ? __pfx___sys_sendto+0x10/0x10
[  285.392727]  ? kvm_clock_get_cycles+0x14/0x30
[  285.393328]  ? set_normalized_timespec64+0x57/0x70
[  285.393980]  ? _raw_spin_unlock_irq+0x1b/0x40
[  285.394578]  ? __x64_sys_clock_gettime+0x11c/0x160
[  285.395225]  ? __pfx___x64_sys_clock_gettime+0x10/0x10
[  285.395908]  ? _copy_to_user+0x3e/0x60
[  285.396432]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.397086]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.397734]  ? do_syscall_64+0x71/0x90
[  285.398258]  __x64_sys_sendto+0x74/0x90
[  285.398786]  do_syscall_64+0x64/0x90
[  285.399273]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.399949]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.400605]  ? do_syscall_64+0x71/0x90
[  285.401124]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.401807] RIP: 0033:0x495726
[  285.402233] Code: ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 09
[  285.404683] RSP: 002b:00007ffcc25fb618 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  285.405677] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 0000000000495726
[  285.406628] RDX: 0000000000000040 RSI: 0000000002518750 RDI: 0000000000000000
[  285.407565] RBP: 00000000005205ef R08: 00000000005f8838 R09: 000000000000001c
[  285.408523] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002517634
[  285.409460] R13: 00007ffcc25fb6f0 R14: 0000000000000003 R15: 0000000000000000
[  285.410403]  </TASK>
[  285.410704]
[  285.410929] Allocated by task 144:
[  285.411402]  kasan_save_stack+0x1e/0x40
[  285.411926]  kasan_set_track+0x21/0x30
[  285.412442]  __kasan_slab_alloc+0x55/0x70
[  285.412973]  kmem_cache_alloc_node+0x187/0x3d0
[  285.413567]  __alloc_skb+0x1b4/0x230
[  285.414060]  __ip_append_data+0x17f7/0x1b60
[  285.414633]  ip_append_data+0x97/0xf0
[  285.415144]  raw_sendmsg+0x5a8/0x12d0
[  285.415640]  sock_sendmsg+0xcc/0xe0
[  285.416117]  __sys_sendto+0x190/0x230
[  285.416626]  __x64_sys_sendto+0x74/0x90
[  285.417145]  do_syscall_64+0x64/0x90
[  285.417624]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.418306]
[  285.418531] Freed by task 144:
[  285.418960]  kasan_save_stack+0x1e/0x40
[  285.419469]  kasan_set_track+0x21/0x30
[  285.419988]  kasan_save_free_info+0x27/0x40
[  285.420556]  ____kasan_slab_free+0x109/0x1a0
[  285.421146]  kmem_cache_free+0x1c2/0x450
[  285.421680]  __netif_receive_skb_core+0x2ce/0x1870
[  285.422333]  __netif_receive_skb_one_core+0x97/0x140
[  285.423003]  process_backlog+0x100/0x2f0
[  285.423537]  __napi_poll+0x5c/0x2d0
[  285.424023]  net_rx_action+0x2be/0x560
[  285.424510]  __do_softirq+0x11b/0x3de
[  285.425034]
[  285.425254] The buggy address belongs to the object at ffff8880bad31280
[  285.425254]  which belongs to the cache skbuff_head_cache of size 224
[  285.426993] The buggy address is located 40 bytes inside of
[  285.426993]  freed 224-byte region [ffff8880bad31280, ffff8880bad31360)
[  285.428572]
[  285.428798] The buggy address belongs to the physical page:
[  285.429540] page:00000000f4b77674 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbad31
[  285.430758] flags: 0x100000000000200(slab|node=0|zone=1)
[  285.431447] page_type: 0xffffffff()
[  285.431934] raw: 0100000000000200 ffff88810094a8c0 dead000000000122 0000000000000000
[  285.432757] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
[  285.433562] page dumped because: kasan: bad access detected
[  285.434144]
[  285.434320] Memory state around the buggy address:
[  285.434828]  ffff8880bad31180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.435580]  ffff8880bad31200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.436264] >ffff8880bad31280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  285.436777]                                   ^
[  285.437106]  ffff8880bad31300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[  285.437616]  ffff8880bad31380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.438126] ==================================================================
[  285.438662] Disabling lock debugging due to kernel taint

Fix this by:
1. Changing sch_plug's .peek handler to qdisc_peek_dequeued(), a
function compatible with non-work-conserving qdiscs
2. Checking the return value of qdisc_dequeue_peeked() in sch_qfq.

Fixes: 462dbc9 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
	Reported-by: valis <sec@valis.email>
	Signed-off-by: valis <sec@valis.email>
	Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20230901162237.11525-1-jhs@mojatatu.com
	Signed-off-by: Paolo Abeni <pabeni@redhat.com>
(cherry picked from commit 8fc134f)
	Signed-off-by: Greg Rose <g.v.rose@ciq.com>
gvrose8192 added a commit that referenced this pull request Jan 14, 2025
jira VULN-6730
cve CVE-2023-4921
commit-author valis <sec@valis.email>
commit 8fc134f

When the plug qdisc is used as a class of the qfq qdisc it could trigger a
UAF. This issue can be reproduced with following commands:

  tc qdisc add dev lo root handle 1: qfq
  tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
  tc qdisc add dev lo parent 1:1 handle 2: plug
  tc filter add dev lo parent 1: basic classid 1:1
  ping -c1 127.0.0.1

and boom:

[  285.353793] BUG: KASAN: slab-use-after-free in qfq_dequeue+0xa7/0x7f0
[  285.354910] Read of size 4 at addr ffff8880bad312a8 by task ping/144
[  285.355903]
[  285.356165] CPU: 1 PID: 144 Comm: ping Not tainted 6.5.0-rc3+ #4
[  285.357112] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[  285.358376] Call Trace:
[  285.358773]  <IRQ>
[  285.359109]  dump_stack_lvl+0x44/0x60
[  285.359708]  print_address_description.constprop.0+0x2c/0x3c0
[  285.360611]  kasan_report+0x10c/0x120
[  285.361195]  ? qfq_dequeue+0xa7/0x7f0
[  285.361780]  qfq_dequeue+0xa7/0x7f0
[  285.362342]  __qdisc_run+0xf1/0x970
[  285.362903]  net_tx_action+0x28e/0x460
[  285.363502]  __do_softirq+0x11b/0x3de
[  285.364097]  do_softirq.part.0+0x72/0x90
[  285.364721]  </IRQ>
[  285.365072]  <TASK>
[  285.365422]  __local_bh_enable_ip+0x77/0x90
[  285.366079]  __dev_queue_xmit+0x95f/0x1550
[  285.366732]  ? __pfx_csum_and_copy_from_iter+0x10/0x10
[  285.367526]  ? __pfx___dev_queue_xmit+0x10/0x10
[  285.368259]  ? __build_skb_around+0x129/0x190
[  285.368960]  ? ip_generic_getfrag+0x12c/0x170
[  285.369653]  ? __pfx_ip_generic_getfrag+0x10/0x10
[  285.370390]  ? csum_partial+0x8/0x20
[  285.370961]  ? raw_getfrag+0xe5/0x140
[  285.371559]  ip_finish_output2+0x539/0xa40
[  285.372222]  ? __pfx_ip_finish_output2+0x10/0x10
[  285.372954]  ip_output+0x113/0x1e0
[  285.373512]  ? __pfx_ip_output+0x10/0x10
[  285.374130]  ? icmp_out_count+0x49/0x60
[  285.374739]  ? __pfx_ip_finish_output+0x10/0x10
[  285.375457]  ip_push_pending_frames+0xf3/0x100
[  285.376173]  raw_sendmsg+0xef5/0x12d0
[  285.376760]  ? do_syscall_64+0x40/0x90
[  285.377359]  ? __static_call_text_end+0x136578/0x136578
[  285.378173]  ? do_syscall_64+0x40/0x90
[  285.378772]  ? kasan_enable_current+0x11/0x20
[  285.379469]  ? __pfx_raw_sendmsg+0x10/0x10
[  285.380137]  ? __sock_create+0x13e/0x270
[  285.380673]  ? __sys_socket+0xf3/0x180
[  285.381174]  ? __x64_sys_socket+0x3d/0x50
[  285.381725]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.382425]  ? __rcu_read_unlock+0x48/0x70
[  285.382975]  ? ip4_datagram_release_cb+0xd8/0x380
[  285.383608]  ? __pfx_ip4_datagram_release_cb+0x10/0x10
[  285.384295]  ? preempt_count_sub+0x14/0xc0
[  285.384844]  ? __list_del_entry_valid+0x76/0x140
[  285.385467]  ? _raw_spin_lock_bh+0x87/0xe0
[  285.386014]  ? __pfx__raw_spin_lock_bh+0x10/0x10
[  285.386645]  ? release_sock+0xa0/0xd0
[  285.387148]  ? preempt_count_sub+0x14/0xc0
[  285.387712]  ? freeze_secondary_cpus+0x348/0x3c0
[  285.388341]  ? aa_sk_perm+0x177/0x390
[  285.388856]  ? __pfx_aa_sk_perm+0x10/0x10
[  285.389441]  ? check_stack_object+0x22/0x70
[  285.390032]  ? inet_send_prepare+0x2f/0x120
[  285.390603]  ? __pfx_inet_sendmsg+0x10/0x10
[  285.391172]  sock_sendmsg+0xcc/0xe0
[  285.391667]  __sys_sendto+0x190/0x230
[  285.392168]  ? __pfx___sys_sendto+0x10/0x10
[  285.392727]  ? kvm_clock_get_cycles+0x14/0x30
[  285.393328]  ? set_normalized_timespec64+0x57/0x70
[  285.393980]  ? _raw_spin_unlock_irq+0x1b/0x40
[  285.394578]  ? __x64_sys_clock_gettime+0x11c/0x160
[  285.395225]  ? __pfx___x64_sys_clock_gettime+0x10/0x10
[  285.395908]  ? _copy_to_user+0x3e/0x60
[  285.396432]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.397086]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.397734]  ? do_syscall_64+0x71/0x90
[  285.398258]  __x64_sys_sendto+0x74/0x90
[  285.398786]  do_syscall_64+0x64/0x90
[  285.399273]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.399949]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.400605]  ? do_syscall_64+0x71/0x90
[  285.401124]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.401807] RIP: 0033:0x495726
[  285.402233] Code: ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 09
[  285.404683] RSP: 002b:00007ffcc25fb618 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  285.405677] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 0000000000495726
[  285.406628] RDX: 0000000000000040 RSI: 0000000002518750 RDI: 0000000000000000
[  285.407565] RBP: 00000000005205ef R08: 00000000005f8838 R09: 000000000000001c
[  285.408523] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002517634
[  285.409460] R13: 00007ffcc25fb6f0 R14: 0000000000000003 R15: 0000000000000000
[  285.410403]  </TASK>
[  285.410704]
[  285.410929] Allocated by task 144:
[  285.411402]  kasan_save_stack+0x1e/0x40
[  285.411926]  kasan_set_track+0x21/0x30
[  285.412442]  __kasan_slab_alloc+0x55/0x70
[  285.412973]  kmem_cache_alloc_node+0x187/0x3d0
[  285.413567]  __alloc_skb+0x1b4/0x230
[  285.414060]  __ip_append_data+0x17f7/0x1b60
[  285.414633]  ip_append_data+0x97/0xf0
[  285.415144]  raw_sendmsg+0x5a8/0x12d0
[  285.415640]  sock_sendmsg+0xcc/0xe0
[  285.416117]  __sys_sendto+0x190/0x230
[  285.416626]  __x64_sys_sendto+0x74/0x90
[  285.417145]  do_syscall_64+0x64/0x90
[  285.417624]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.418306]
[  285.418531] Freed by task 144:
[  285.418960]  kasan_save_stack+0x1e/0x40
[  285.419469]  kasan_set_track+0x21/0x30
[  285.419988]  kasan_save_free_info+0x27/0x40
[  285.420556]  ____kasan_slab_free+0x109/0x1a0
[  285.421146]  kmem_cache_free+0x1c2/0x450
[  285.421680]  __netif_receive_skb_core+0x2ce/0x1870
[  285.422333]  __netif_receive_skb_one_core+0x97/0x140
[  285.423003]  process_backlog+0x100/0x2f0
[  285.423537]  __napi_poll+0x5c/0x2d0
[  285.424023]  net_rx_action+0x2be/0x560
[  285.424510]  __do_softirq+0x11b/0x3de
[  285.425034]
[  285.425254] The buggy address belongs to the object at ffff8880bad31280
[  285.425254]  which belongs to the cache skbuff_head_cache of size 224
[  285.426993] The buggy address is located 40 bytes inside of
[  285.426993]  freed 224-byte region [ffff8880bad31280, ffff8880bad31360)
[  285.428572]
[  285.428798] The buggy address belongs to the physical page:
[  285.429540] page:00000000f4b77674 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbad31
[  285.430758] flags: 0x100000000000200(slab|node=0|zone=1)
[  285.431447] page_type: 0xffffffff()
[  285.431934] raw: 0100000000000200 ffff88810094a8c0 dead000000000122 0000000000000000
[  285.432757] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
[  285.433562] page dumped because: kasan: bad access detected
[  285.434144]
[  285.434320] Memory state around the buggy address:
[  285.434828]  ffff8880bad31180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.435580]  ffff8880bad31200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.436264] >ffff8880bad31280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  285.436777]                                   ^
[  285.437106]  ffff8880bad31300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[  285.437616]  ffff8880bad31380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.438126] ==================================================================
[  285.438662] Disabling lock debugging due to kernel taint

Fix this by:
1. Changing sch_plug's .peek handler to qdisc_peek_dequeued(), a
function compatible with non-work-conserving qdiscs
2. Checking the return value of qdisc_dequeue_peeked() in sch_qfq.

Fixes: 462dbc9 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
	Reported-by: valis <sec@valis.email>
	Signed-off-by: valis <sec@valis.email>
	Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20230901162237.11525-1-jhs@mojatatu.com
	Signed-off-by: Paolo Abeni <pabeni@redhat.com>
(cherry picked from commit 8fc134f)
	Signed-off-by: Greg Rose <g.v.rose@ciq.com>
pvts-mat pushed a commit to pvts-mat/kernel-src-tree that referenced this pull request Jan 14, 2025
jira LE-1907
Rebuild_History Non-Buildable kernel-rt-5.14.0-284.30.1.rt14.315.el9_2
commit-author minoura makoto <minoura@valinux.co.jp>
commit b18cba0

Commit 9130b8d ("SUNRPC: allow for upcalls for the same uid
but different gss service") introduced `auth` argument to
__gss_find_upcall(), but in gss_pipe_downcall() it was left as NULL
since it (and auth->service) was not (yet) determined.

When multiple upcalls with the same uid and different service are
ongoing, it could happen that __gss_find_upcall(), which returns the
first match found in the pipe->in_downcall list, could not find the
correct gss_msg corresponding to the downcall we are looking for.
Moreover, it might return a msg which is not sent to rpc.gssd yet.

We could see mount.nfs process hung in D state with multiple mount.nfs
are executed in parallel.  The call trace below is of CentOS 7.9
kernel-3.10.0-1160.24.1.el7.x86_64 but we observed the same hang w/
elrepo kernel-ml-6.0.7-1.el7.

PID: 71258  TASK: ffff91ebd4be0000  CPU: 36  COMMAND: "mount.nfs"
 #0 [ffff9203ca3234f8] __schedule at ffffffffa3b8899f
 ctrliq#1 [ffff9203ca323580] schedule at ffffffffa3b88eb9
 ctrliq#2 [ffff9203ca323590] gss_cred_init at ffffffffc0355818 [auth_rpcgss]
 ctrliq#3 [ffff9203ca323658] rpcauth_lookup_credcache at ffffffffc0421ebc
[sunrpc]
 ctrliq#4 [ffff9203ca3236d8] gss_lookup_cred at ffffffffc0353633 [auth_rpcgss]
 ctrliq#5 [ffff9203ca3236e8] rpcauth_lookupcred at ffffffffc0421581 [sunrpc]
 ctrliq#6 [ffff9203ca323740] rpcauth_refreshcred at ffffffffc04223d3 [sunrpc]
 ctrliq#7 [ffff9203ca3237a0] call_refresh at ffffffffc04103dc [sunrpc]
 ctrliq#8 [ffff9203ca3237b8] __rpc_execute at ffffffffc041e1c9 [sunrpc]
 ctrliq#9 [ffff9203ca323820] rpc_execute at ffffffffc0420a48 [sunrpc]

The scenario is like this. Let's say there are two upcalls for
services A and B, A -> B in pipe->in_downcall, B -> A in pipe->pipe.

When rpc.gssd reads pipe to get the upcall msg corresponding to
service B from pipe->pipe and then writes the response, in
gss_pipe_downcall the msg corresponding to service A will be picked
because only uid is used to find the msg and it is before the one for
B in pipe->in_downcall.  And the process waiting for the msg
corresponding to service A will be woken up.

Actual scheduing of that process might be after rpc.gssd processes the
next msg.  In rpc_pipe_generic_upcall it clears msg->errno (for A).
The process is scheduled to see gss_msg->ctx == NULL and
gss_msg->msg.errno == 0, therefore it cannot break the loop in
gss_create_upcall and is never woken up after that.

This patch adds a simple check to ensure that a msg which is not
sent to rpc.gssd yet is not chosen as the matching upcall upon
receiving a downcall.

	Signed-off-by: minoura makoto <minoura@valinux.co.jp>
	Signed-off-by: Hiroshi Shimamoto <h-shimamoto@nec.com>
	Tested-by: Hiroshi Shimamoto <h-shimamoto@nec.com>
	Cc: Trond Myklebust <trondmy@hammerspace.com>
Fixes: 9130b8d ("SUNRPC: allow for upcalls for same uid but different gss service")
	Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
(cherry picked from commit b18cba0)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
pvts-mat pushed a commit to pvts-mat/kernel-src-tree that referenced this pull request Jan 14, 2025
jira LE-1907
Rebuild_History Non-Buildable kernel-rt-5.14.0-284.30.1.rt14.315.el9_2
commit-author Stefan Assmann <sassmann@kpanic.de>
commit 4e264be

When a system with E810 with existing VFs gets rebooted the following
hang may be observed.

 Pid 1 is hung in iavf_remove(), part of a network driver:
 PID: 1        TASK: ffff965400e5a340  CPU: 24   COMMAND: "systemd-shutdow"
  #0 [ffffaad04005fa50] __schedule at ffffffff8b3239cb
  ctrliq#1 [ffffaad04005fae8] schedule at ffffffff8b323e2d
  ctrliq#2 [ffffaad04005fb00] schedule_hrtimeout_range_clock at ffffffff8b32cebc
  ctrliq#3 [ffffaad04005fb80] usleep_range_state at ffffffff8b32c930
  ctrliq#4 [ffffaad04005fbb0] iavf_remove at ffffffffc12b9b4c [iavf]
  ctrliq#5 [ffffaad04005fbf0] pci_device_remove at ffffffff8add7513
  ctrliq#6 [ffffaad04005fc10] device_release_driver_internal at ffffffff8af08baa
  ctrliq#7 [ffffaad04005fc40] pci_stop_bus_device at ffffffff8adcc5fc
  ctrliq#8 [ffffaad04005fc60] pci_stop_and_remove_bus_device at ffffffff8adcc81e
  ctrliq#9 [ffffaad04005fc70] pci_iov_remove_virtfn at ffffffff8adf9429
 ctrliq#10 [ffffaad04005fca8] sriov_disable at ffffffff8adf98e4
 ctrliq#11 [ffffaad04005fcc8] ice_free_vfs at ffffffffc04bb2c8 [ice]
 ctrliq#12 [ffffaad04005fd10] ice_remove at ffffffffc04778fe [ice]
 ctrliq#13 [ffffaad04005fd38] ice_shutdown at ffffffffc0477946 [ice]
 ctrliq#14 [ffffaad04005fd50] pci_device_shutdown at ffffffff8add58f1
 ctrliq#15 [ffffaad04005fd70] device_shutdown at ffffffff8af05386
 ctrliq#16 [ffffaad04005fd98] kernel_restart at ffffffff8a92a870
 ctrliq#17 [ffffaad04005fda8] __do_sys_reboot at ffffffff8a92abd6
 ctrliq#18 [ffffaad04005fee0] do_syscall_64 at ffffffff8b317159
 ctrliq#19 [ffffaad04005ff08] __context_tracking_enter at ffffffff8b31b6fc
 ctrliq#20 [ffffaad04005ff18] syscall_exit_to_user_mode at ffffffff8b31b50d
 ctrliq#21 [ffffaad04005ff28] do_syscall_64 at ffffffff8b317169
 ctrliq#22 [ffffaad04005ff50] entry_SYSCALL_64_after_hwframe at ffffffff8b40009b
     RIP: 00007f1baa5c13d7  RSP: 00007fffbcc55a98  RFLAGS: 00000202
     RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007f1baa5c13d7
     RDX: 0000000001234567  RSI: 0000000028121969  RDI: 00000000fee1dead
     RBP: 00007fffbcc55ca0   R8: 0000000000000000   R9: 00007fffbcc54e90
     R10: 00007fffbcc55050  R11: 0000000000000202  R12: 0000000000000005
     R13: 0000000000000000  R14: 00007fffbcc55af0  R15: 0000000000000000
     ORIG_RAX: 00000000000000a9  CS: 0033  SS: 002b

During reboot all drivers PM shutdown callbacks are invoked.
In iavf_shutdown() the adapter state is changed to __IAVF_REMOVE.
In ice_shutdown() the call chain above is executed, which at some point
calls iavf_remove(). However iavf_remove() expects the VF to be in one
of the states __IAVF_RUNNING, __IAVF_DOWN or __IAVF_INIT_FAILED. If
that's not the case it sleeps forever.
So if iavf_shutdown() gets invoked before iavf_remove() the system will
hang indefinitely because the adapter is already in state __IAVF_REMOVE.

Fix this by returning from iavf_remove() if the state is __IAVF_REMOVE,
as we already went through iavf_shutdown().

Fixes: 9745780 ("iavf: Add waiting so the port is initialized in remove")
Fixes: a841733 ("iavf: Fix race condition between iavf_shutdown and iavf_remove")
	Reported-by: Marius Cornea <mcornea@redhat.com>
	Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
	Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
	Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
	Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
(cherry picked from commit 4e264be)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
pvts-mat pushed a commit to pvts-mat/kernel-src-tree that referenced this pull request Jan 14, 2025
jira LE-1907
Rebuild_History Non-Buildable kernel-rt-5.14.0-284.30.1.rt14.315.el9_2
commit-author Michael Ellerman <mpe@ellerman.id.au>
commit 6d65028

As reported by Alan, the CFI (Call Frame Information) in the VDSO time
routines is incorrect since commit ce7d805 ("powerpc/vdso: Prepare
for switching VDSO to generic C implementation.").

DWARF has a concept called the CFA (Canonical Frame Address), which on
powerpc is calculated as an offset from the stack pointer (r1). That
means when the stack pointer is changed there must be a corresponding
CFI directive to update the calculation of the CFA.

The current code is missing those directives for the changes to r1,
which prevents gdb from being able to generate a backtrace from inside
VDSO functions, eg:

  Breakpoint 1, 0x00007ffff7f804dc in __kernel_clock_gettime ()
  (gdb) bt
  #0  0x00007ffff7f804dc in __kernel_clock_gettime ()
  ctrliq#1  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  ctrliq#2  0x00007fffffffd960 in ?? ()
  ctrliq#3  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  Backtrace stopped: frame did not save the PC

Alan helpfully describes some rules for correctly maintaining the CFI information:

  1) Every adjustment to the current frame address reg (ie. r1) must be
     described, and exactly at the instruction where r1 changes. Why?
     Because stack unwinding might want to access previous frames.

  2) If a function changes LR or any non-volatile register, the save
     location for those regs must be given. The CFI can be at any
     instruction after the saves up to the point that the reg is
     changed.
     (Exception: LR save should be described before a bl. not after)

  3) If asychronous unwind info is needed then restores of LR and
     non-volatile regs must also be described. The CFI can be at any
     instruction after the reg is restored up to the point where the
     save location is (potentially) trashed.

Fix the inability to backtrace by adding CFI directives describing the
changes to r1, ie. satisfying rule 1.

Also change the information for LR to point to the copy saved on the
stack, not the value in r0 that will be overwritten by the function
call.

Finally, add CFI directives describing the save/restore of r2.

With the fix gdb can correctly back trace and navigate up and down the stack:

  Breakpoint 1, 0x00007ffff7f804dc in __kernel_clock_gettime ()
  (gdb) bt
  #0  0x00007ffff7f804dc in __kernel_clock_gettime ()
  ctrliq#1  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  ctrliq#2  0x0000000100015b60 in gettime ()
  ctrliq#3  0x000000010000c8bc in print_long_format ()
  ctrliq#4  0x000000010000d180 in print_current_files ()
  ctrliq#5  0x00000001000054ac in main ()
  (gdb) up
  ctrliq#1  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  (gdb)
  ctrliq#2  0x0000000100015b60 in gettime ()
  (gdb)
  ctrliq#3  0x000000010000c8bc in print_long_format ()
  (gdb)
  ctrliq#4  0x000000010000d180 in print_current_files ()
  (gdb)
  ctrliq#5  0x00000001000054ac in main ()
  (gdb)
  Initial frame selected; you cannot go up.
  (gdb) down
  ctrliq#4  0x000000010000d180 in print_current_files ()
  (gdb)
  ctrliq#3  0x000000010000c8bc in print_long_format ()
  (gdb)
  ctrliq#2  0x0000000100015b60 in gettime ()
  (gdb)
  ctrliq#1  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  (gdb)
  #0  0x00007ffff7f804dc in __kernel_clock_gettime ()
  (gdb)

Fixes: ce7d805 ("powerpc/vdso: Prepare for switching VDSO to generic C implementation.")
	Cc: stable@vger.kernel.org # v5.11+
	Reported-by: Alan Modra <amodra@gmail.com>
	Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
	Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Link: https://lore.kernel.org/r/20220502125010.1319370-1-mpe@ellerman.id.au
(cherry picked from commit 6d65028)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
pvts-mat pushed a commit to pvts-mat/kernel-src-tree that referenced this pull request Jan 14, 2025
jira LE-1907
Rebuild_History Non-Buildable kernel-rt-5.14.0-284.30.1.rt14.315.el9_2
commit-author Eelco Chaudron <echaudro@redhat.com>
commit de9df6c

Currently, the per cpu upcall counters are allocated after the vport is
created and inserted into the system. This could lead to the datapath
accessing the counters before they are allocated resulting in a kernel
Oops.

Here is an example:

  PID: 59693    TASK: ffff0005f4f51500  CPU: 0    COMMAND: "ovs-vswitchd"
   #0 [ffff80000a39b5b0] __switch_to at ffffb70f0629f2f4
   ctrliq#1 [ffff80000a39b5d0] __schedule at ffffb70f0629f5cc
   ctrliq#2 [ffff80000a39b650] preempt_schedule_common at ffffb70f0629fa60
   ctrliq#3 [ffff80000a39b670] dynamic_might_resched at ffffb70f0629fb58
   ctrliq#4 [ffff80000a39b680] mutex_lock_killable at ffffb70f062a1388
   ctrliq#5 [ffff80000a39b6a0] pcpu_alloc at ffffb70f0594460c
   ctrliq#6 [ffff80000a39b750] __alloc_percpu_gfp at ffffb70f05944e68
   ctrliq#7 [ffff80000a39b760] ovs_vport_cmd_new at ffffb70ee6961b90 [openvswitch]
   ...

  PID: 58682    TASK: ffff0005b2f0bf00  CPU: 0    COMMAND: "kworker/0:3"
   #0 [ffff80000a5d2f40] machine_kexec at ffffb70f056a0758
   ctrliq#1 [ffff80000a5d2f70] __crash_kexec at ffffb70f057e2994
   ctrliq#2 [ffff80000a5d3100] crash_kexec at ffffb70f057e2ad8
   ctrliq#3 [ffff80000a5d3120] die at ffffb70f0628234c
   ctrliq#4 [ffff80000a5d31e0] die_kernel_fault at ffffb70f062828a8
   ctrliq#5 [ffff80000a5d3210] __do_kernel_fault at ffffb70f056a31f4
   ctrliq#6 [ffff80000a5d3240] do_bad_area at ffffb70f056a32a4
   ctrliq#7 [ffff80000a5d3260] do_translation_fault at ffffb70f062a9710
   ctrliq#8 [ffff80000a5d3270] do_mem_abort at ffffb70f056a2f74
   ctrliq#9 [ffff80000a5d32a0] el1_abort at ffffb70f06297dac
  ctrliq#10 [ffff80000a5d32d0] el1h_64_sync_handler at ffffb70f06299b24
  ctrliq#11 [ffff80000a5d3410] el1h_64_sync at ffffb70f056812dc
  ctrliq#12 [ffff80000a5d3430] ovs_dp_upcall at ffffb70ee6963c84 [openvswitch]
  ctrliq#13 [ffff80000a5d3470] ovs_dp_process_packet at ffffb70ee6963fdc [openvswitch]
  ctrliq#14 [ffff80000a5d34f0] ovs_vport_receive at ffffb70ee6972c78 [openvswitch]
  ctrliq#15 [ffff80000a5d36f0] netdev_port_receive at ffffb70ee6973948 [openvswitch]
  ctrliq#16 [ffff80000a5d3720] netdev_frame_hook at ffffb70ee6973a28 [openvswitch]
  ctrliq#17 [ffff80000a5d3730] __netif_receive_skb_core.constprop.0 at ffffb70f06079f90

We moved the per cpu upcall counter allocation to the existing vport
alloc and free functions to solve this.

Fixes: 95637d9 ("net: openvswitch: release vport resources on failure")
Fixes: 1933ea3 ("net: openvswitch: Add support to count upcall packets")
	Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
	Reviewed-by: Simon Horman <simon.horman@corigine.com>
	Acked-by: Aaron Conole <aconole@redhat.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit de9df6c)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
gvrose8192 added a commit that referenced this pull request Jan 16, 2025
jira VULN-6730
cve CVE-2023-4921
commit-author valis <sec@valis.email>
commit 8fc134f

When the plug qdisc is used as a class of the qfq qdisc it could trigger a
UAF. This issue can be reproduced with following commands:

  tc qdisc add dev lo root handle 1: qfq
  tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
  tc qdisc add dev lo parent 1:1 handle 2: plug
  tc filter add dev lo parent 1: basic classid 1:1
  ping -c1 127.0.0.1

and boom:

[  285.353793] BUG: KASAN: slab-use-after-free in qfq_dequeue+0xa7/0x7f0
[  285.354910] Read of size 4 at addr ffff8880bad312a8 by task ping/144
[  285.355903]
[  285.356165] CPU: 1 PID: 144 Comm: ping Not tainted 6.5.0-rc3+ #4
[  285.357112] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[  285.358376] Call Trace:
[  285.358773]  <IRQ>
[  285.359109]  dump_stack_lvl+0x44/0x60
[  285.359708]  print_address_description.constprop.0+0x2c/0x3c0
[  285.360611]  kasan_report+0x10c/0x120
[  285.361195]  ? qfq_dequeue+0xa7/0x7f0
[  285.361780]  qfq_dequeue+0xa7/0x7f0
[  285.362342]  __qdisc_run+0xf1/0x970
[  285.362903]  net_tx_action+0x28e/0x460
[  285.363502]  __do_softirq+0x11b/0x3de
[  285.364097]  do_softirq.part.0+0x72/0x90
[  285.364721]  </IRQ>
[  285.365072]  <TASK>
[  285.365422]  __local_bh_enable_ip+0x77/0x90
[  285.366079]  __dev_queue_xmit+0x95f/0x1550
[  285.366732]  ? __pfx_csum_and_copy_from_iter+0x10/0x10
[  285.367526]  ? __pfx___dev_queue_xmit+0x10/0x10
[  285.368259]  ? __build_skb_around+0x129/0x190
[  285.368960]  ? ip_generic_getfrag+0x12c/0x170
[  285.369653]  ? __pfx_ip_generic_getfrag+0x10/0x10
[  285.370390]  ? csum_partial+0x8/0x20
[  285.370961]  ? raw_getfrag+0xe5/0x140
[  285.371559]  ip_finish_output2+0x539/0xa40
[  285.372222]  ? __pfx_ip_finish_output2+0x10/0x10
[  285.372954]  ip_output+0x113/0x1e0
[  285.373512]  ? __pfx_ip_output+0x10/0x10
[  285.374130]  ? icmp_out_count+0x49/0x60
[  285.374739]  ? __pfx_ip_finish_output+0x10/0x10
[  285.375457]  ip_push_pending_frames+0xf3/0x100
[  285.376173]  raw_sendmsg+0xef5/0x12d0
[  285.376760]  ? do_syscall_64+0x40/0x90
[  285.377359]  ? __static_call_text_end+0x136578/0x136578
[  285.378173]  ? do_syscall_64+0x40/0x90
[  285.378772]  ? kasan_enable_current+0x11/0x20
[  285.379469]  ? __pfx_raw_sendmsg+0x10/0x10
[  285.380137]  ? __sock_create+0x13e/0x270
[  285.380673]  ? __sys_socket+0xf3/0x180
[  285.381174]  ? __x64_sys_socket+0x3d/0x50
[  285.381725]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.382425]  ? __rcu_read_unlock+0x48/0x70
[  285.382975]  ? ip4_datagram_release_cb+0xd8/0x380
[  285.383608]  ? __pfx_ip4_datagram_release_cb+0x10/0x10
[  285.384295]  ? preempt_count_sub+0x14/0xc0
[  285.384844]  ? __list_del_entry_valid+0x76/0x140
[  285.385467]  ? _raw_spin_lock_bh+0x87/0xe0
[  285.386014]  ? __pfx__raw_spin_lock_bh+0x10/0x10
[  285.386645]  ? release_sock+0xa0/0xd0
[  285.387148]  ? preempt_count_sub+0x14/0xc0
[  285.387712]  ? freeze_secondary_cpus+0x348/0x3c0
[  285.388341]  ? aa_sk_perm+0x177/0x390
[  285.388856]  ? __pfx_aa_sk_perm+0x10/0x10
[  285.389441]  ? check_stack_object+0x22/0x70
[  285.390032]  ? inet_send_prepare+0x2f/0x120
[  285.390603]  ? __pfx_inet_sendmsg+0x10/0x10
[  285.391172]  sock_sendmsg+0xcc/0xe0
[  285.391667]  __sys_sendto+0x190/0x230
[  285.392168]  ? __pfx___sys_sendto+0x10/0x10
[  285.392727]  ? kvm_clock_get_cycles+0x14/0x30
[  285.393328]  ? set_normalized_timespec64+0x57/0x70
[  285.393980]  ? _raw_spin_unlock_irq+0x1b/0x40
[  285.394578]  ? __x64_sys_clock_gettime+0x11c/0x160
[  285.395225]  ? __pfx___x64_sys_clock_gettime+0x10/0x10
[  285.395908]  ? _copy_to_user+0x3e/0x60
[  285.396432]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.397086]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.397734]  ? do_syscall_64+0x71/0x90
[  285.398258]  __x64_sys_sendto+0x74/0x90
[  285.398786]  do_syscall_64+0x64/0x90
[  285.399273]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.399949]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.400605]  ? do_syscall_64+0x71/0x90
[  285.401124]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.401807] RIP: 0033:0x495726
[  285.402233] Code: ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 09
[  285.404683] RSP: 002b:00007ffcc25fb618 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  285.405677] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 0000000000495726
[  285.406628] RDX: 0000000000000040 RSI: 0000000002518750 RDI: 0000000000000000
[  285.407565] RBP: 00000000005205ef R08: 00000000005f8838 R09: 000000000000001c
[  285.408523] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002517634
[  285.409460] R13: 00007ffcc25fb6f0 R14: 0000000000000003 R15: 0000000000000000
[  285.410403]  </TASK>
[  285.410704]
[  285.410929] Allocated by task 144:
[  285.411402]  kasan_save_stack+0x1e/0x40
[  285.411926]  kasan_set_track+0x21/0x30
[  285.412442]  __kasan_slab_alloc+0x55/0x70
[  285.412973]  kmem_cache_alloc_node+0x187/0x3d0
[  285.413567]  __alloc_skb+0x1b4/0x230
[  285.414060]  __ip_append_data+0x17f7/0x1b60
[  285.414633]  ip_append_data+0x97/0xf0
[  285.415144]  raw_sendmsg+0x5a8/0x12d0
[  285.415640]  sock_sendmsg+0xcc/0xe0
[  285.416117]  __sys_sendto+0x190/0x230
[  285.416626]  __x64_sys_sendto+0x74/0x90
[  285.417145]  do_syscall_64+0x64/0x90
[  285.417624]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.418306]
[  285.418531] Freed by task 144:
[  285.418960]  kasan_save_stack+0x1e/0x40
[  285.419469]  kasan_set_track+0x21/0x30
[  285.419988]  kasan_save_free_info+0x27/0x40
[  285.420556]  ____kasan_slab_free+0x109/0x1a0
[  285.421146]  kmem_cache_free+0x1c2/0x450
[  285.421680]  __netif_receive_skb_core+0x2ce/0x1870
[  285.422333]  __netif_receive_skb_one_core+0x97/0x140
[  285.423003]  process_backlog+0x100/0x2f0
[  285.423537]  __napi_poll+0x5c/0x2d0
[  285.424023]  net_rx_action+0x2be/0x560
[  285.424510]  __do_softirq+0x11b/0x3de
[  285.425034]
[  285.425254] The buggy address belongs to the object at ffff8880bad31280
[  285.425254]  which belongs to the cache skbuff_head_cache of size 224
[  285.426993] The buggy address is located 40 bytes inside of
[  285.426993]  freed 224-byte region [ffff8880bad31280, ffff8880bad31360)
[  285.428572]
[  285.428798] The buggy address belongs to the physical page:
[  285.429540] page:00000000f4b77674 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbad31
[  285.430758] flags: 0x100000000000200(slab|node=0|zone=1)
[  285.431447] page_type: 0xffffffff()
[  285.431934] raw: 0100000000000200 ffff88810094a8c0 dead000000000122 0000000000000000
[  285.432757] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
[  285.433562] page dumped because: kasan: bad access detected
[  285.434144]
[  285.434320] Memory state around the buggy address:
[  285.434828]  ffff8880bad31180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.435580]  ffff8880bad31200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.436264] >ffff8880bad31280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  285.436777]                                   ^
[  285.437106]  ffff8880bad31300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[  285.437616]  ffff8880bad31380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.438126] ==================================================================
[  285.438662] Disabling lock debugging due to kernel taint

Fix this by:
1. Changing sch_plug's .peek handler to qdisc_peek_dequeued(), a
function compatible with non-work-conserving qdiscs
2. Checking the return value of qdisc_dequeue_peeked() in sch_qfq.

Fixes: 462dbc9 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
	Reported-by: valis <sec@valis.email>
	Signed-off-by: valis <sec@valis.email>
	Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20230901162237.11525-1-jhs@mojatatu.com
	Signed-off-by: Paolo Abeni <pabeni@redhat.com>
(cherry picked from commit 8fc134f)
	Signed-off-by: Greg Rose <g.v.rose@ciq.com>
github-actions bot pushed a commit that referenced this pull request Jul 4, 2025
JIRA: https://issues.redhat.com/browse/RHEL-78197

upstream
========
commit 9daa05c
Author: Namhyung Kim <namhyung@kernel.org>
Date: Mon Mar 10 17:04:16 2025 -0700

description
===========
The env.pmu_mapping can be leaked when it reads data from a pipe on AMD.
For a pipe data, it reads the header data including pmu_mapping from
PERF_RECORD_HEADER_FEATURE runtime.  But it's already set in:

  perf_session__new()
    __perf_session__new()
      evlist__init_trace_event_sample_raw()
        evlist__has_amd_ibs()
          perf_env__nr_pmu_mappings()

Then it'll overwrite that when it processes the HEADER_FEATURE record.
Here's a report from address sanitizer.

  Direct leak of 2689 byte(s) in 1 object(s) allocated from:
    #0 0x7fed8f814596 in realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98
    #1 0x5595a7d416b1 in strbuf_grow util/strbuf.c:64
    #2 0x5595a7d414ef in strbuf_init util/strbuf.c:25
    #3 0x5595a7d0f4b7 in perf_env__read_pmu_mappings util/env.c:362
    #4 0x5595a7d12ab7 in perf_env__nr_pmu_mappings util/env.c:517
    #5 0x5595a7d89d2f in evlist__has_amd_ibs util/amd-sample-raw.c:315
    #6 0x5595a7d87fb2 in evlist__init_trace_event_sample_raw util/sample-raw.c:23
    #7 0x5595a7d7f893 in __perf_session__new util/session.c:179
    #8 0x5595a7b79572 in perf_session__new util/session.h:115
    #9 0x5595a7b7e9dc in cmd_report builtin-report.c:1603
    #10 0x5595a7c019eb in run_builtin perf.c:351
    #11 0x5595a7c01c92 in handle_internal_command perf.c:404
    #12 0x5595a7c01deb in run_argv perf.c:448
    #13 0x5595a7c02134 in main perf.c:556
    #14 0x7fed85833d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

Let's free the existing pmu_mapping data if any.

    Cc: Ravi Bangoria <ravi.bangoria@amd.com>
    Link: https://lore.kernel.org/r/20250311000416.817631-1-namhyung@kernel.org
    Signed-off-by: Namhyung Kim <namhyung@kernel.org>

Signed-off-by: Michael Petlan <mpetlan@redhat.com>
github-actions bot pushed a commit that referenced this pull request Jul 5, 2025
JIRA: https://issues.redhat.com/browse/RHEL-78837
CVE: CVE-2024-57980

commit c6ef3a7
Author: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Date: Fri, 8 Nov 2024 01:51:30 +0200

  If the uvc_status_init() function fails to allocate the int_urb, it will
  free the dev->status pointer but doesn't reset the pointer to NULL. This
  results in the kfree() call in uvc_status_cleanup() trying to
  double-free the memory. Fix it by resetting the dev->status pointer to
  NULL after freeing it.

  Fixes: a31a405 ("V4L/DVB:usbvideo:don't use part of buffer for USB transfer #4")
  Cc: stable@vger.kernel.org
  Link: https://lore.kernel.org/r/20241107235130.31372-1-laurent.pinchart@ideasonboard.com
  Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
  Reviewed by: Ricardo Ribalda <ribalda@chromium.org>
  Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

Signed-off-by: Desnes Nunes <desnesn@redhat.com>
github-actions bot pushed a commit that referenced this pull request Jul 5, 2025
JIRA: https://issues.redhat.com/browse/RHEL-96372
CVE: CVE-2025-37958
Conflicts:
  * context conflicts due to RHEL-9 missing upstream commit 29e847d
    ("mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop")
    which is part of a bigger series "Reclaim lazyfree THP without splitting"
    and is irrelevant to the fixlet for this CVE.

commit be6e843
Author: Gavin Guo <gavinguo@igalia.com>
Date:   Mon Apr 21 19:35:36 2025 +0800

    mm/huge_memory: fix dereferencing invalid pmd migration entry

    When migrating a THP, concurrent access to the PMD migration entry during
    a deferred split scan can lead to an invalid address access, as
    illustrated below.  To prevent this invalid access, it is necessary to
    check the PMD migration entry and return early.  In this context, there is
    no need to use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the
    equality of the target folio.  Since the PMD migration entry is locked, it
    cannot be served as the target.

    Mailing list discussion and explanation from Hugh Dickins: "An anon_vma
    lookup points to a location which may contain the folio of interest, but
    might instead contain another folio: and weeding out those other folios is
    precisely what the "folio != pmd_folio((*pmd)" check (and the "risk of
    replacing the wrong folio" comment a few lines above it) is for."

    BUG: unable to handle page fault for address: ffffea60001db008
    CPU: 0 UID: 0 PID: 2199114 Comm: tee Not tainted 6.14.0+ #4 NONE
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
    RIP: 0010:split_huge_pmd_locked+0x3b5/0x2b60
    Call Trace:
    <TASK>
    try_to_migrate_one+0x28c/0x3730
    rmap_walk_anon+0x4f6/0x770
    unmap_folio+0x196/0x1f0
    split_huge_page_to_list_to_order+0x9f6/0x1560
    deferred_split_scan+0xac5/0x12a0
    shrinker_debugfs_scan_write+0x376/0x470
    full_proxy_write+0x15c/0x220
    vfs_write+0x2fc/0xcb0
    ksys_write+0x146/0x250
    do_syscall_64+0x6a/0x120
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

    The bug is found by syzkaller on an internal kernel, then confirmed on
    upstream.

    Link: https://lkml.kernel.org/r/20250421113536.3682201-1-gavinguo@igalia.com
    Link: https://lore.kernel.org/all/20250414072737.1698513-1-gavinguo@igalia.com/
    Link: https://lore.kernel.org/all/20250418085802.2973519-1-gavinguo@igalia.com/
    Fixes: 84c3fc4 ("mm: thp: check pmd migration entry in common path")
    Signed-off-by: Gavin Guo <gavinguo@igalia.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Hugh Dickins <hughd@google.com>
    Acked-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Gavin Shan <gshan@redhat.com>
    Cc: Florent Revest <revest@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
github-actions bot pushed a commit that referenced this pull request Jul 5, 2025
… context

The current use of a mutex to protect the notifier hashtable accesses
can lead to issues in the atomic context. It results in the below
kernel warnings:

  |  BUG: sleeping function called from invalid context at kernel/locking/mutex.c:258
  |  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 9, name: kworker/0:0
  |  preempt_count: 1, expected: 0
  |  RCU nest depth: 0, expected: 0
  |  CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 6.14.0 #4
  |  Workqueue: ffa_pcpu_irq_notification notif_pcpu_irq_work_fn
  |  Call trace:
  |   show_stack+0x18/0x24 (C)
  |   dump_stack_lvl+0x78/0x90
  |   dump_stack+0x18/0x24
  |   __might_resched+0x114/0x170
  |   __might_sleep+0x48/0x98
  |   mutex_lock+0x24/0x80
  |   handle_notif_callbacks+0x54/0xe0
  |   notif_get_and_handle+0x40/0x88
  |   generic_exec_single+0x80/0xc0
  |   smp_call_function_single+0xfc/0x1a0
  |   notif_pcpu_irq_work_fn+0x2c/0x38
  |   process_one_work+0x14c/0x2b4
  |   worker_thread+0x2e4/0x3e0
  |   kthread+0x13c/0x210
  |   ret_from_fork+0x10/0x20

To address this, replace the mutex with an rwlock to protect the notifier
hashtable accesses. This ensures that read-side locking does not sleep and
multiple readers can acquire the lock concurrently, avoiding unnecessary
contention and potential deadlocks. Writer access remains exclusive,
preserving correctness.

This change resolves warnings from lockdep about potential sleep in
atomic context.

Cc: Jens Wiklander <jens.wiklander@linaro.org>
Reported-by: Jérôme Forissier <jerome.forissier@linaro.org>
Closes: OP-TEE/optee_os#7394
Fixes: e057344 ("firmware: arm_ffa: Add interfaces to request notification callbacks")
Message-Id: <20250528-ffa_notif_fix-v1-3-5ed7bc7f8437@arm.com>
Reviewed-by: Jens Wiklander <jens.wiklander@linaro.org>
Tested-by: Jens Wiklander <jens.wiklander@linaro.org>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
github-actions bot pushed a commit that referenced this pull request Jul 8, 2025
JIRA: https://issues.redhat.com/browse/RHEL-74410

commit c5b6aba
Author: Maxim Levitsky <mlevitsk@redhat.com>
Date:   Mon May 12 14:04:02 2025 -0400

    locking/mutex: implement mutex_trylock_nested

    Despite the fact that several lockdep-related checks are skipped when
    calling trylock* versions of the locking primitives, for example
    mutex_trylock, each time the mutex is acquired, a held_lock is still
    placed onto the lockdep stack by __lock_acquire() which is called
    regardless of whether the trylock* or regular locking API was used.

    This means that if the caller successfully acquires more than
    MAX_LOCK_DEPTH locks of the same class, even when using mutex_trylock,
    lockdep will still complain that the maximum depth of the held lock stack
    has been reached and disable itself.

    For example, the following error currently occurs in the ARM version
    of KVM, once the code tries to lock all vCPUs of a VM configured with more
    than MAX_LOCK_DEPTH vCPUs, a situation that can easily happen on modern
    systems, where having more than 48 CPUs is common, and it's also common to
    run VMs that have vCPU counts approaching that number:

    [  328.171264] BUG: MAX_LOCK_DEPTH too low!
    [  328.175227] turning off the locking correctness validator.
    [  328.180726] Please attach the output of /proc/lock_stat to the bug report
    [  328.187531] depth: 48  max: 48!
    [  328.190678] 48 locks held by qemu-kvm/11664:
    [  328.194957]  #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
    [  328.204048]  #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.212521]  #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.220991]  #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.229463]  #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.237934]  #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.246405]  #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0

    Luckily, in all instances that require locking all vCPUs, the
    'kvm->lock' is taken a priori, and that fact makes it possible to use
    the little known feature of lockdep, called a 'nest_lock', to avoid this
    warning and subsequent lockdep self-disablement.

    The action of 'nested lock' being provided to lockdep's lock_acquire(),
    causes the lockdep to detect that the top of the held lock stack contains
    a lock of the same class and then increment its reference counter instead
    of pushing a new held_lock item onto that stack.

    See __lock_acquire for more information.

    Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Message-ID: <20250512180407.659015-2-mlevitsk@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
github-actions bot pushed a commit that referenced this pull request Jul 8, 2025
JIRA: https://issues.redhat.com/browse/RHEL-74410

commit b586c5d
Author: Maxim Levitsky <mlevitsk@redhat.com>
Date:   Mon May 12 14:04:06 2025 -0400

    KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUs

    Use kvm_trylock_all_vcpus instead of a custom implementation when locking
    all vCPUs of a VM, to avoid triggering a lockdep warning, in the case in
    which the VM is configured to have more than MAX_LOCK_DEPTH vCPUs.

    This fixes the following false lockdep warning:

    [  328.171264] BUG: MAX_LOCK_DEPTH too low!
    [  328.175227] turning off the locking correctness validator.
    [  328.180726] Please attach the output of /proc/lock_stat to the bug report
    [  328.187531] depth: 48  max: 48!
    [  328.190678] 48 locks held by qemu-kvm/11664:
    [  328.194957]  #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
    [  328.204048]  #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.212521]  #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.220991]  #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.229463]  #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.237934]  #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.246405]  #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0

    Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
    Acked-by: Marc Zyngier <maz@kernel.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Message-ID: <20250512180407.659015-6-mlevitsk@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
github-actions bot pushed a commit that referenced this pull request Jul 11, 2025
…ux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.16, take #4

- Gracefully fail initialising pKVM if the interrupt controller isn't
  GICv3

- Also gracefully fail initialising pKVM if the carveout allocation
  fails

- Fix the computing of the minimum MMIO range required for the host on
  stage-2 fault

- Fix the generation of the GICv3 Maintenance Interrupt in nested mode
github-actions bot pushed a commit that referenced this pull request Jul 11, 2025
… context

commit 9ca7a42 upstream.

The current use of a mutex to protect the notifier hashtable accesses
can lead to issues in the atomic context. It results in the below
kernel warnings:

  |  BUG: sleeping function called from invalid context at kernel/locking/mutex.c:258
  |  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 9, name: kworker/0:0
  |  preempt_count: 1, expected: 0
  |  RCU nest depth: 0, expected: 0
  |  CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 6.14.0 #4
  |  Workqueue: ffa_pcpu_irq_notification notif_pcpu_irq_work_fn
  |  Call trace:
  |   show_stack+0x18/0x24 (C)
  |   dump_stack_lvl+0x78/0x90
  |   dump_stack+0x18/0x24
  |   __might_resched+0x114/0x170
  |   __might_sleep+0x48/0x98
  |   mutex_lock+0x24/0x80
  |   handle_notif_callbacks+0x54/0xe0
  |   notif_get_and_handle+0x40/0x88
  |   generic_exec_single+0x80/0xc0
  |   smp_call_function_single+0xfc/0x1a0
  |   notif_pcpu_irq_work_fn+0x2c/0x38
  |   process_one_work+0x14c/0x2b4
  |   worker_thread+0x2e4/0x3e0
  |   kthread+0x13c/0x210
  |   ret_from_fork+0x10/0x20

To address this, replace the mutex with an rwlock to protect the notifier
hashtable accesses. This ensures that read-side locking does not sleep and
multiple readers can acquire the lock concurrently, avoiding unnecessary
contention and potential deadlocks. Writer access remains exclusive,
preserving correctness.

This change resolves warnings from lockdep about potential sleep in
atomic context.

Cc: Jens Wiklander <jens.wiklander@linaro.org>
Reported-by: Jérôme Forissier <jerome.forissier@linaro.org>
Closes: OP-TEE/optee_os#7394
Fixes: e057344 ("firmware: arm_ffa: Add interfaces to request notification callbacks")
Message-Id: <20250528-ffa_notif_fix-v1-3-5ed7bc7f8437@arm.com>
Reviewed-by: Jens Wiklander <jens.wiklander@linaro.org>
Tested-by: Jens Wiklander <jens.wiklander@linaro.org>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
github-actions bot pushed a commit that referenced this pull request Jul 15, 2025
JIRA: https://issues.redhat.com/browse/RHEL-96382
CVE: CVE-2025-37958

commit be6e843
Author: Gavin Guo <gavinguo@igalia.com>
Date:   Mon Apr 21 19:35:36 2025 +0800

    mm/huge_memory: fix dereferencing invalid pmd migration entry

    When migrating a THP, concurrent access to the PMD migration entry during
    a deferred split scan can lead to an invalid address access, as
    illustrated below.  To prevent this invalid access, it is necessary to
    check the PMD migration entry and return early.  In this context, there is
    no need to use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the
    equality of the target folio.  Since the PMD migration entry is locked, it
    cannot be served as the target.

    Mailing list discussion and explanation from Hugh Dickins: "An anon_vma
    lookup points to a location which may contain the folio of interest, but
    might instead contain another folio: and weeding out those other folios is
    precisely what the "folio != pmd_folio((*pmd)" check (and the "risk of
    replacing the wrong folio" comment a few lines above it) is for."

    BUG: unable to handle page fault for address: ffffea60001db008
    CPU: 0 UID: 0 PID: 2199114 Comm: tee Not tainted 6.14.0+ #4 NONE
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
    RIP: 0010:split_huge_pmd_locked+0x3b5/0x2b60
    Call Trace:
    <TASK>
    try_to_migrate_one+0x28c/0x3730
    rmap_walk_anon+0x4f6/0x770
    unmap_folio+0x196/0x1f0
    split_huge_page_to_list_to_order+0x9f6/0x1560
    deferred_split_scan+0xac5/0x12a0
    shrinker_debugfs_scan_write+0x376/0x470
    full_proxy_write+0x15c/0x220
    vfs_write+0x2fc/0xcb0
    ksys_write+0x146/0x250
    do_syscall_64+0x6a/0x120
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

    The bug is found by syzkaller on an internal kernel, then confirmed on
    upstream.

    Link: https://lkml.kernel.org/r/20250421113536.3682201-1-gavinguo@igalia.com
    Link: https://lore.kernel.org/all/20250414072737.1698513-1-gavinguo@igalia.com/
    Link: https://lore.kernel.org/all/20250418085802.2973519-1-gavinguo@igalia.com/
    Fixes: 84c3fc4 ("mm: thp: check pmd migration entry in common path")
    Signed-off-by: Gavin Guo <gavinguo@igalia.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Hugh Dickins <hughd@google.com>
    Acked-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Gavin Shan <gshan@redhat.com>
    Cc: Florent Revest <revest@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
github-actions bot pushed a commit that referenced this pull request Jul 15, 2025
…migration entry

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-10/-/merge_requests/1048

JIRA: https://issues.redhat.com/browse/RHEL-96382
CVE: CVE-2025-37958

```
commit be6e843
Author: Gavin Guo <gavinguo@igalia.com>
Date:   Mon Apr 21 19:35:36 2025 +0800

    mm/huge_memory: fix dereferencing invalid pmd migration entry

    When migrating a THP, concurrent access to the PMD migration entry during
    a deferred split scan can lead to an invalid address access, as
    illustrated below.  To prevent this invalid access, it is necessary to
    check the PMD migration entry and return early.  In this context, there is
    no need to use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the
    equality of the target folio.  Since the PMD migration entry is locked, it
    cannot be served as the target.

    Mailing list discussion and explanation from Hugh Dickins: "An anon_vma
    lookup points to a location which may contain the folio of interest, but
    might instead contain another folio: and weeding out those other folios is
    precisely what the "folio != pmd_folio((*pmd)" check (and the "risk of
    replacing the wrong folio" comment a few lines above it) is for."

    BUG: unable to handle page fault for address: ffffea60001db008
    CPU: 0 UID: 0 PID: 2199114 Comm: tee Not tainted 6.14.0+ #4 NONE
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
    RIP: 0010:split_huge_pmd_locked+0x3b5/0x2b60
    Call Trace:
    <TASK>
    try_to_migrate_one+0x28c/0x3730
    rmap_walk_anon+0x4f6/0x770
    unmap_folio+0x196/0x1f0
    split_huge_page_to_list_to_order+0x9f6/0x1560
    deferred_split_scan+0xac5/0x12a0
    shrinker_debugfs_scan_write+0x376/0x470
    full_proxy_write+0x15c/0x220
    vfs_write+0x2fc/0xcb0
    ksys_write+0x146/0x250
    do_syscall_64+0x6a/0x120
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

    The bug is found by syzkaller on an internal kernel, then confirmed on
    upstream.

    Link: https://lkml.kernel.org/r/20250421113536.3682201-1-gavinguo@igalia.com
    Link: https://lore.kernel.org/all/20250414072737.1698513-1-gavinguo@igalia.com/
    Link: https://lore.kernel.org/all/20250418085802.2973519-1-gavinguo@igalia.com/
    Fixes: 84c3fc4 ("mm: thp: check pmd migration entry in common path")
    Signed-off-by: Gavin Guo <gavinguo@igalia.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Hugh Dickins <hughd@google.com>
    Acked-by: Zi Yan <ziy@nvidia.com>
    Reviewed-by: Gavin Shan <gshan@redhat.com>
    Cc: Florent Revest <revest@google.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
```

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>

---

<small>Created 2025-06-13 12:31 UTC by backporter - [KWF FAQ](https://red.ht/kernel_workflow_doc) - [Slack #team-kernel-workflow](https://redhat-internal.slack.com/archives/C04LRUPMJQ5) - [Source](https://gitlab.com/cki-project/kernel-workflow/-/blob/main/webhook/utils/backporter.py) - [Documentation](https://gitlab.com/cki-project/kernel-workflow/-/blob/main/docs/README.backporter.md) - [Report an issue](https://issues.redhat.com/secure/CreateIssueDetails!init.jspa?pid=12334433&issuetype=1&priority=4&summary=backporter+webhook+issue&components=kernel-workflow+/+backporter)</small>

Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: Donald Dutile <ddutile@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Julio Faracco <jfaracco@redhat.com>
github-actions bot pushed a commit that referenced this pull request Jul 16, 2025
JIRA: https://issues.redhat.com/browse/RHEL-89169

commit 78ab4be
Author: Fedor Pchelkin <pchelkin@ispras.ru>
Date:   Tue May 6 14:55:39 2025 +0300

    wifi: mt76: disable napi on driver removal
    
    A warning on driver removal started occurring after commit 9dd05df
    ("net: warn if NAPI instance wasn't shut down"). Disable tx napi before
    deleting it in mt76_dma_cleanup().
    
     WARNING: CPU: 4 PID: 18828 at net/core/dev.c:7288 __netif_napi_del_locked+0xf0/0x100
     CPU: 4 UID: 0 PID: 18828 Comm: modprobe Not tainted 6.15.0-rc4 #4 PREEMPT(lazy)
     Hardware name: ASUS System Product Name/PRIME X670E-PRO WIFI, BIOS 3035 09/05/2024
     RIP: 0010:__netif_napi_del_locked+0xf0/0x100
     Call Trace:
     <TASK>
     mt76_dma_cleanup+0x54/0x2f0 [mt76]
     mt7921_pci_remove+0xd5/0x190 [mt7921e]
     pci_device_remove+0x47/0xc0
     device_release_driver_internal+0x19e/0x200
     driver_detach+0x48/0x90
     bus_remove_driver+0x6d/0xf0
     pci_unregister_driver+0x2e/0xb0
     __do_sys_delete_module.isra.0+0x197/0x2e0
     do_syscall_64+0x7b/0x160
     entry_SYSCALL_64_after_hwframe+0x76/0x7e
    
    Tested with mt7921e but the same pattern can be actually applied to other
    mt76 drivers calling mt76_dma_cleanup() during removal. Tx napi is enabled
    in their *_dma_init() functions and only toggled off and on again inside
    their suspend/resume/reset paths. So it should be okay to disable tx
    napi in such a generic way.
    
    Found by Linux Verification Center (linuxtesting.org).
    
    Fixes: 2ac515a ("mt76: mt76x02: use napi polling for tx cleanup")
    Cc: stable@vger.kernel.org
    Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
    Tested-by: Ming Yen Hsieh <mingyen.hsieh@mediatek.com>
    Link: https://patch.msgid.link/20250506115540.19045-1-pchelkin@ispras.ru
    Signed-off-by: Felix Fietkau <nbd@nbd.name>

Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
PlaidCat added a commit that referenced this pull request Jul 16, 2025
jira LE-3587
cve CVE-2024-57980
Rebuild_History Non-Buildable kernel-4.18.0-553.62.1.el8_10
commit-author Laurent Pinchart <laurent.pinchart@ideasonboard.com>
commit c6ef3a7

If the uvc_status_init() function fails to allocate the int_urb, it will
free the dev->status pointer but doesn't reset the pointer to NULL. This
results in the kfree() call in uvc_status_cleanup() trying to
double-free the memory. Fix it by resetting the dev->status pointer to
NULL after freeing it.

Fixes: a31a405 ("V4L/DVB:usbvideo:don't use part of buffer for USB transfer #4")
	Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20241107235130.31372-1-laurent.pinchart@ideasonboard.com
	Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Reviewed by: Ricardo Ribalda <ribalda@chromium.org>
	Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
(cherry picked from commit c6ef3a7)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
PlaidCat added a commit that referenced this pull request Jul 16, 2025
jira LE-3587
Rebuild_History Non-Buildable kernel-4.18.0-553.62.1.el8_10
commit-author David Hildenbrand <david@redhat.com>
commit 2ccd42b
Empty-Commit: Cherry-Pick Conflicts during history rebuild.
Will be included in final tarball splat. Ref for failed cherry-pick at:
ciq/ciq_backports/kernel-4.18.0-553.62.1.el8_10/2ccd42b9.failed

If we finds a vq without a name in our input array in
virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer
to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq.

Consequently, we create only a queue if it actually exists (name != NULL)
and assign an incremental queue index to each such existing queue.

However, in virtio_ccw_register_adapter_ind()->get_airq_indicator() we
will not ignore these "non-existing queues", but instead assign an airq
indicator to them.

Besides never releasing them in virtio_ccw_drop_indicators() (because
there is no virtqueue), the bigger issue seems to be that there will be a
disagreement between the device and the Linux guest about the airq
indicator to be used for notifying a queue, because the indicator bit
for adapter I/O interrupt is derived from the queue index.

The virtio spec states under "Setting Up Two-Stage Queue Indicators":

	... indicator contains the guest address of an area wherein the
	indicators for the devices are contained, starting at bit_nr, one
	bit per virtqueue of the device.

And further in "Notification via Adapter I/O Interrupts":

	For notifying the driver of virtqueue buffers, the device sets the
	bit in the guest-provided indicator area at the corresponding
	offset.

For example, QEMU uses in virtio_ccw_notify() the queue index (passed as
"vector") to select the relevant indicator bit. If a queue does not exist,
it does not have a corresponding indicator bit assigned, because it
effectively doesn't have a queue index.

Using a virtio-balloon-ccw device under QEMU with free-page-hinting
disabled ("free-page-hint=off") but free-page-reporting enabled
("free-page-reporting=on") will result in free page reporting
not working as expected: in the virtio_balloon driver, we'll be stuck
forever in virtballoon_free_page_report()->wait_event(), because the
waitqueue will not be woken up as the notification from the device is
lost: it would use the wrong indicator bit.

Free page reporting stops working and we get splats (when configured to
detect hung wqs) like:

 INFO: task kworker/1:3:463 blocked for more than 61 seconds.
       Not tainted 6.14.0 #4
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/1:3 [...]
 Workqueue: events page_reporting_process
 Call Trace:
  [<000002f404e6dfb2>] __schedule+0x402/0x1640
  [<000002f404e6f22e>] schedule+0x3e/0xe0
  [<000002f3846a88fa>] virtballoon_free_page_report+0xaa/0x110 [virtio_balloon]
  [<000002f40435c8a4>] page_reporting_process+0x2e4/0x740
  [<000002f403fd3ee2>] process_one_work+0x1c2/0x400
  [<000002f403fd4b96>] worker_thread+0x296/0x420
  [<000002f403fe10b4>] kthread+0x124/0x290
  [<000002f403f4e0dc>] __ret_from_fork+0x3c/0x60
  [<000002f404e77272>] ret_from_fork+0xa/0x38

There was recently a discussion [1] whether the "holes" should be
treated differently again, effectively assigning also non-existing
queues a queue index: that should also fix the issue, but requires other
workarounds to not break existing setups.

Let's fix it without affecting existing setups for now by properly ignoring
the non-existing queues, so the indicator bits will match the queue
indexes.

[1] https://lore.kernel.org/all/cover.1720611677.git.mst@redhat.com/

Fixes: a229989 ("virtio: don't allocate vqs when names[i] = NULL")
	Reported-by: Chandra Merla <cmerla@redhat.com>
	Cc: stable@vger.kernel.org
	Signed-off-by: David Hildenbrand <david@redhat.com>
	Tested-by: Thomas Huth <thuth@redhat.com>
	Reviewed-by: Thomas Huth <thuth@redhat.com>
	Reviewed-by: Cornelia Huck <cohuck@redhat.com>
	Acked-by: Michael S. Tsirkin <mst@redhat.com>
	Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20250402203621.940090-1-david@redhat.com
	Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
(cherry picked from commit 2ccd42b)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>

# Conflicts:
#	drivers/s390/virtio/virtio_ccw.c
github-actions bot pushed a commit that referenced this pull request Jul 17, 2025
JIRA: https://issues.redhat.com/browse/RHEL-89168

commit 78ab4be
Author: Fedor Pchelkin <pchelkin@ispras.ru>
Date:   Tue May 6 14:55:39 2025 +0300

    wifi: mt76: disable napi on driver removal
    
    A warning on driver removal started occurring after commit 9dd05df
    ("net: warn if NAPI instance wasn't shut down"). Disable tx napi before
    deleting it in mt76_dma_cleanup().
    
     WARNING: CPU: 4 PID: 18828 at net/core/dev.c:7288 __netif_napi_del_locked+0xf0/0x100
     CPU: 4 UID: 0 PID: 18828 Comm: modprobe Not tainted 6.15.0-rc4 #4 PREEMPT(lazy)
     Hardware name: ASUS System Product Name/PRIME X670E-PRO WIFI, BIOS 3035 09/05/2024
     RIP: 0010:__netif_napi_del_locked+0xf0/0x100
     Call Trace:
     <TASK>
     mt76_dma_cleanup+0x54/0x2f0 [mt76]
     mt7921_pci_remove+0xd5/0x190 [mt7921e]
     pci_device_remove+0x47/0xc0
     device_release_driver_internal+0x19e/0x200
     driver_detach+0x48/0x90
     bus_remove_driver+0x6d/0xf0
     pci_unregister_driver+0x2e/0xb0
     __do_sys_delete_module.isra.0+0x197/0x2e0
     do_syscall_64+0x7b/0x160
     entry_SYSCALL_64_after_hwframe+0x76/0x7e
    
    Tested with mt7921e but the same pattern can be actually applied to other
    mt76 drivers calling mt76_dma_cleanup() during removal. Tx napi is enabled
    in their *_dma_init() functions and only toggled off and on again inside
    their suspend/resume/reset paths. So it should be okay to disable tx
    napi in such a generic way.
    
    Found by Linux Verification Center (linuxtesting.org).
    
    Fixes: 2ac515a ("mt76: mt76x02: use napi polling for tx cleanup")
    Cc: stable@vger.kernel.org
    Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
    Tested-by: Ming Yen Hsieh <mingyen.hsieh@mediatek.com>
    Link: https://patch.msgid.link/20250506115540.19045-1-pchelkin@ispras.ru
    Signed-off-by: Felix Fietkau <nbd@nbd.name>

Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
PlaidCat added a commit that referenced this pull request Jul 29, 2025
jira LE-3666
Rebuild_History Non-Buildable kernel-5.14.0-570.30.1.el9_6
commit-author David Hildenbrand <david@redhat.com>
commit 2ccd42b

If we finds a vq without a name in our input array in
virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer
to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq.

Consequently, we create only a queue if it actually exists (name != NULL)
and assign an incremental queue index to each such existing queue.

However, in virtio_ccw_register_adapter_ind()->get_airq_indicator() we
will not ignore these "non-existing queues", but instead assign an airq
indicator to them.

Besides never releasing them in virtio_ccw_drop_indicators() (because
there is no virtqueue), the bigger issue seems to be that there will be a
disagreement between the device and the Linux guest about the airq
indicator to be used for notifying a queue, because the indicator bit
for adapter I/O interrupt is derived from the queue index.

The virtio spec states under "Setting Up Two-Stage Queue Indicators":

	... indicator contains the guest address of an area wherein the
	indicators for the devices are contained, starting at bit_nr, one
	bit per virtqueue of the device.

And further in "Notification via Adapter I/O Interrupts":

	For notifying the driver of virtqueue buffers, the device sets the
	bit in the guest-provided indicator area at the corresponding
	offset.

For example, QEMU uses in virtio_ccw_notify() the queue index (passed as
"vector") to select the relevant indicator bit. If a queue does not exist,
it does not have a corresponding indicator bit assigned, because it
effectively doesn't have a queue index.

Using a virtio-balloon-ccw device under QEMU with free-page-hinting
disabled ("free-page-hint=off") but free-page-reporting enabled
("free-page-reporting=on") will result in free page reporting
not working as expected: in the virtio_balloon driver, we'll be stuck
forever in virtballoon_free_page_report()->wait_event(), because the
waitqueue will not be woken up as the notification from the device is
lost: it would use the wrong indicator bit.

Free page reporting stops working and we get splats (when configured to
detect hung wqs) like:

 INFO: task kworker/1:3:463 blocked for more than 61 seconds.
       Not tainted 6.14.0 #4
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/1:3 [...]
 Workqueue: events page_reporting_process
 Call Trace:
  [<000002f404e6dfb2>] __schedule+0x402/0x1640
  [<000002f404e6f22e>] schedule+0x3e/0xe0
  [<000002f3846a88fa>] virtballoon_free_page_report+0xaa/0x110 [virtio_balloon]
  [<000002f40435c8a4>] page_reporting_process+0x2e4/0x740
  [<000002f403fd3ee2>] process_one_work+0x1c2/0x400
  [<000002f403fd4b96>] worker_thread+0x296/0x420
  [<000002f403fe10b4>] kthread+0x124/0x290
  [<000002f403f4e0dc>] __ret_from_fork+0x3c/0x60
  [<000002f404e77272>] ret_from_fork+0xa/0x38

There was recently a discussion [1] whether the "holes" should be
treated differently again, effectively assigning also non-existing
queues a queue index: that should also fix the issue, but requires other
workarounds to not break existing setups.

Let's fix it without affecting existing setups for now by properly ignoring
the non-existing queues, so the indicator bits will match the queue
indexes.

[1] https://lore.kernel.org/all/cover.1720611677.git.mst@redhat.com/

Fixes: a229989 ("virtio: don't allocate vqs when names[i] = NULL")
	Reported-by: Chandra Merla <cmerla@redhat.com>
	Cc: stable@vger.kernel.org
	Signed-off-by: David Hildenbrand <david@redhat.com>
	Tested-by: Thomas Huth <thuth@redhat.com>
	Reviewed-by: Thomas Huth <thuth@redhat.com>
	Reviewed-by: Cornelia Huck <cohuck@redhat.com>
	Acked-by: Michael S. Tsirkin <mst@redhat.com>
	Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20250402203621.940090-1-david@redhat.com
	Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
(cherry picked from commit 2ccd42b)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
PlaidCat added a commit that referenced this pull request Jul 29, 2025
jira LE-3666
cve CVE-2025-37958
Rebuild_History Non-Buildable kernel-5.14.0-570.30.1.el9_6
commit-author Gavin Guo <gavinguo@igalia.com>
commit be6e843
Empty-Commit: Cherry-Pick Conflicts during history rebuild.
Will be included in final tarball splat. Ref for failed cherry-pick at:
ciq/ciq_backports/kernel-5.14.0-570.30.1.el9_6/be6e843f.failed

When migrating a THP, concurrent access to the PMD migration entry during
a deferred split scan can lead to an invalid address access, as
illustrated below.  To prevent this invalid access, it is necessary to
check the PMD migration entry and return early.  In this context, there is
no need to use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the
equality of the target folio.  Since the PMD migration entry is locked, it
cannot be served as the target.

Mailing list discussion and explanation from Hugh Dickins: "An anon_vma
lookup points to a location which may contain the folio of interest, but
might instead contain another folio: and weeding out those other folios is
precisely what the "folio != pmd_folio((*pmd)" check (and the "risk of
replacing the wrong folio" comment a few lines above it) is for."

BUG: unable to handle page fault for address: ffffea60001db008
CPU: 0 UID: 0 PID: 2199114 Comm: tee Not tainted 6.14.0+ #4 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:split_huge_pmd_locked+0x3b5/0x2b60
Call Trace:
<TASK>
try_to_migrate_one+0x28c/0x3730
rmap_walk_anon+0x4f6/0x770
unmap_folio+0x196/0x1f0
split_huge_page_to_list_to_order+0x9f6/0x1560
deferred_split_scan+0xac5/0x12a0
shrinker_debugfs_scan_write+0x376/0x470
full_proxy_write+0x15c/0x220
vfs_write+0x2fc/0xcb0
ksys_write+0x146/0x250
do_syscall_64+0x6a/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e

The bug is found by syzkaller on an internal kernel, then confirmed on
upstream.

Link: https://lkml.kernel.org/r/20250421113536.3682201-1-gavinguo@igalia.com
Link: https://lore.kernel.org/all/20250414072737.1698513-1-gavinguo@igalia.com/
Link: https://lore.kernel.org/all/20250418085802.2973519-1-gavinguo@igalia.com/
Fixes: 84c3fc4 ("mm: thp: check pmd migration entry in common path")
	Signed-off-by: Gavin Guo <gavinguo@igalia.com>
	Acked-by: David Hildenbrand <david@redhat.com>
	Acked-by: Hugh Dickins <hughd@google.com>
	Acked-by: Zi Yan <ziy@nvidia.com>
	Reviewed-by: Gavin Shan <gshan@redhat.com>
	Cc: Florent Revest <revest@google.com>
	Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
	Cc: Miaohe Lin <linmiaohe@huawei.com>
	Cc: <stable@vger.kernel.org>
	Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit be6e843)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>

# Conflicts:
#	mm/huge_memory.c
PlaidCat added a commit that referenced this pull request Jul 29, 2025
jira LE-3666
cve CVE-2024-57980
Rebuild_History Non-Buildable kernel-5.14.0-570.30.1.el9_6
commit-author Laurent Pinchart <laurent.pinchart@ideasonboard.com>
commit c6ef3a7

If the uvc_status_init() function fails to allocate the int_urb, it will
free the dev->status pointer but doesn't reset the pointer to NULL. This
results in the kfree() call in uvc_status_cleanup() trying to
double-free the memory. Fix it by resetting the dev->status pointer to
NULL after freeing it.

Fixes: a31a405 ("V4L/DVB:usbvideo:don't use part of buffer for USB transfer #4")
	Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20241107235130.31372-1-laurent.pinchart@ideasonboard.com
	Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Reviewed by: Ricardo Ribalda <ribalda@chromium.org>
	Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
(cherry picked from commit c6ef3a7)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
github-actions bot pushed a commit that referenced this pull request Aug 2, 2025
pert script tests fails with segmentation fault as below:

  92: perf script tests:
  --- start ---
  test child forked, pid 103769
  DB test
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.012 MB /tmp/perf-test-script.7rbftEpOzX/perf.data (9 samples) ]
  /usr/libexec/perf-core/tests/shell/script.sh: line 35:
  103780 Segmentation fault      (core dumped)
  perf script -i "${perfdatafile}" -s "${db_test}"
  --- Cleaning up ---
  ---- end(-1) ----
  92: perf script tests                                               : FAILED!

Backtrace pointed to :
	#0  0x0000000010247dd0 in maps.machine ()
	#1  0x00000000101d178c in db_export.sample ()
	#2  0x00000000103412c8 in python_process_event ()
	#3  0x000000001004eb28 in process_sample_event ()
	#4  0x000000001024fcd0 in machines.deliver_event ()
	#5  0x000000001025005c in perf_session.deliver_event ()
	#6  0x00000000102568b0 in __ordered_events__flush.part.0 ()
	#7  0x0000000010251618 in perf_session.process_events ()
	#8  0x0000000010053620 in cmd_script ()
	#9  0x00000000100b5a28 in run_builtin ()
	#10 0x00000000100b5f94 in handle_internal_command ()
	#11 0x0000000010011114 in main ()

Further investigation reveals that this occurs in the `perf script tests`,
because it uses `db_test.py` script. This script sets `perf_db_export_mode = True`.

With `perf_db_export_mode` enabled, if a sample originates from a hypervisor,
perf doesn't set maps for "[H]" sample in the code. Consequently, `al->maps` remains NULL
when `maps__machine(al->maps)` is called from `db_export__sample`.

As al->maps can be NULL in case of Hypervisor samples , use thread->maps
because even for Hypervisor sample, machine should exist.
If we don't have machine for some reason, return -1 to avoid segmentation fault.

Reported-by: Disha Goel <disgoel@linux.ibm.com>
Signed-off-by: Aditya Bodkhe <aditya.b1@linux.ibm.com>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Tested-by: Disha Goel <disgoel@linux.ibm.com>
Link: https://lore.kernel.org/r/20250429065132.36839-1-adityab1@linux.ibm.com
Suggested-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
github-actions bot pushed a commit that referenced this pull request Aug 2, 2025
Without the change `perf `hangs up on charaster devices. On my system
it's enough to run system-wide sampler for a few seconds to get the
hangup:

    $ perf record -a -g --call-graph=dwarf
    $ perf report
    # hung

`strace` shows that hangup happens on reading on a character device
`/dev/dri/renderD128`

    $ strace -y -f -p 2780484
    strace: Process 2780484 attached
    pread64(101</dev/dri/renderD128>, strace: Process 2780484 detached

It's call trace descends into `elfutils`:

    $ gdb -p 2780484
    (gdb) bt
    #0  0x00007f5e508f04b7 in __libc_pread64 (fd=101, buf=0x7fff9df7edb0, count=0, offset=0)
        at ../sysdeps/unix/sysv/linux/pread64.c:25
    #1  0x00007f5e52b79515 in read_file () from /<<NIX>>/elfutils-0.192/lib/libelf.so.1
    #2  0x00007f5e52b25666 in libdw_open_elf () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
    #3  0x00007f5e52b25907 in __libdw_open_file () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
    #4  0x00007f5e52b120a9 in dwfl_report_elf@@ELFUTILS_0.156 ()
       from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
    #5  0x000000000068bf20 in __report_module (al=al@entry=0x7fff9df80010, ip=ip@entry=139803237033216, ui=ui@entry=0x5369b5e0)
        at util/dso.h:537
    #6  0x000000000068c3d1 in report_module (ip=139803237033216, ui=0x5369b5e0) at util/unwind-libdw.c:114
    #7  frame_callback (state=0x535aef10, arg=0x5369b5e0) at util/unwind-libdw.c:242
    #8  0x00007f5e52b261d3 in dwfl_thread_getframes () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
    #9  0x00007f5e52b25bdb in get_one_thread_cb () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
    #10 0x00007f5e52b25faa in dwfl_getthreads () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
    #11 0x00007f5e52b26514 in dwfl_getthread_frames () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
    #12 0x000000000068c6ce in unwind__get_entries (cb=cb@entry=0x5d4620 <unwind_entry>, arg=arg@entry=0x10cd5fa0,
        thread=thread@entry=0x1076a290, data=data@entry=0x7fff9df80540, max_stack=max_stack@entry=127,
        best_effort=best_effort@entry=false) at util/thread.h:152
    #13 0x00000000005dae95 in thread__resolve_callchain_unwind (evsel=0x106006d0, thread=0x1076a290, cursor=0x10cd5fa0,
        sample=0x7fff9df80540, max_stack=127, symbols=true) at util/machine.c:2939
    #14 thread__resolve_callchain_unwind (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, sample=0x7fff9df80540,
        max_stack=127, symbols=true) at util/machine.c:2920
    #15 __thread__resolve_callchain (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, evsel@entry=0x7fff9df80440,
        sample=0x7fff9df80540, parent=parent@entry=0x7fff9df804a0, root_al=root_al@entry=0x7fff9df80440, max_stack=127, symbols=true)
        at util/machine.c:2970
    #16 0x00000000005d0cb2 in thread__resolve_callchain (thread=<optimized out>, cursor=<optimized out>, evsel=0x7fff9df80440,
        sample=<optimized out>, parent=0x7fff9df804a0, root_al=0x7fff9df80440, max_stack=127) at util/machine.h:198
    #17 sample__resolve_callchain (sample=<optimized out>, cursor=<optimized out>, parent=parent@entry=0x7fff9df804a0,
        evsel=evsel@entry=0x106006d0, al=al@entry=0x7fff9df80440, max_stack=max_stack@entry=127) at util/callchain.c:1127
    #18 0x0000000000617e08 in hist_entry_iter__add (iter=iter@entry=0x7fff9df80480, al=al@entry=0x7fff9df80440, max_stack_depth=127,
        arg=arg@entry=0x7fff9df81ae0) at util/hist.c:1255
    #19 0x000000000045d2d0 in process_sample_event (tool=0x7fff9df81ae0, event=<optimized out>, sample=0x7fff9df80540,
        evsel=0x106006d0, machine=<optimized out>) at builtin-report.c:334
    #20 0x00000000005e3bb1 in perf_session__deliver_event (session=0x105ff2c0, event=0x7f5c7d735ca0, tool=0x7fff9df81ae0,
        file_offset=2914716832, file_path=0x105ffbf0 "perf.data") at util/session.c:1367
    #21 0x00000000005e8d93 in do_flush (oe=0x105ffa50, show_progress=false) at util/ordered-events.c:245
    #22 __ordered_events__flush (oe=0x105ffa50, how=OE_FLUSH__ROUND, timestamp=<optimized out>) at util/ordered-events.c:324
    #23 0x00000000005e1f64 in perf_session__process_user_event (session=0x105ff2c0, event=0x7f5c7d752b18, file_offset=2914835224,
        file_path=0x105ffbf0 "perf.data") at util/session.c:1419
    #24 0x00000000005e47c7 in reader__read_event (rd=rd@entry=0x7fff9df81260, session=session@entry=0x105ff2c0,
    --Type <RET> for more, q to quit, c to continue without paging--
    quit
        prog=prog@entry=0x7fff9df81220) at util/session.c:2132
    #25 0x00000000005e4b37 in reader__process_events (rd=0x7fff9df81260, session=0x105ff2c0, prog=0x7fff9df81220)
        at util/session.c:2181
    #26 __perf_session__process_events (session=0x105ff2c0) at util/session.c:2226
    #27 perf_session__process_events (session=session@entry=0x105ff2c0) at util/session.c:2390
    #28 0x0000000000460add in __cmd_report (rep=0x7fff9df81ae0) at builtin-report.c:1076
    #29 cmd_report (argc=<optimized out>, argv=<optimized out>) at builtin-report.c:1827
    #30 0x00000000004c5a40 in run_builtin (p=p@entry=0xd8f7f8 <commands+312>, argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0)
        at perf.c:351
    #31 0x00000000004c5d63 in handle_internal_command (argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0) at perf.c:404
    #32 0x0000000000442de3 in run_argv (argcp=<synthetic pointer>, argv=<synthetic pointer>) at perf.c:448
    #33 main (argc=<optimized out>, argv=0x7fff9df844b0) at perf.c:556

The hangup happens because nothing in` perf` or `elfutils` checks if a
mapped file is easily readable.

The change conservatively skips all non-regular files.

Signed-off-by: Sergei Trofimovich <slyich@gmail.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250505174419.2814857-1-slyich@gmail.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
github-actions bot pushed a commit that referenced this pull request Aug 2, 2025
Symbolize stack traces by creating a live machine. Add this
functionality to dump_stack and switch dump_stack users to use
it. Switch TUI to use it. Add stack traces to the child test function
which can be useful to diagnose blocked code.

Example output:
```
$ perf test -vv PERF_RECORD_
...
  7: PERF_RECORD_* events & perf_sample fields:
  7: PERF_RECORD_* events & perf_sample fields                       : Running (1 active)
^C
Signal (2) while running tests.
Terminating tests with the same signal
Internal test harness failure. Completing any started tests:
:  7: PERF_RECORD_* events & perf_sample fields:

---- unexpected signal (2) ----
    #0 0x55788c6210a3 in child_test_sig_handler builtin-test.c:0
    #1 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0
    #2 0x7fc12fe99687 in __internal_syscall_cancel cancellation.c:64
    #3 0x7fc12fee5f7a in clock_nanosleep@GLIBC_2.2.5 clock_nanosleep.c:72
    #4 0x7fc12fef1393 in __nanosleep nanosleep.c:26
    #5 0x7fc12ff02d68 in __sleep sleep.c:55
    #6 0x55788c63196b in test__PERF_RECORD perf-record.c:0
    #7 0x55788c620fb0 in run_test_child builtin-test.c:0
    #8 0x55788c5bd18d in start_command run-command.c:127
    #9 0x55788c621ef3 in __cmd_test builtin-test.c:0
    #10 0x55788c6225bf in cmd_test ??:0
    #11 0x55788c5afbd0 in run_builtin perf.c:0
    #12 0x55788c5afeeb in handle_internal_command perf.c:0
    #13 0x55788c52b383 in main ??:0
    #14 0x7fc12fe33ca8 in __libc_start_call_main libc_start_call_main.h:74
    #15 0x7fc12fe33d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128
    #16 0x55788c52b9d1 in _start ??:0

---- unexpected signal (2) ----
    #0 0x55788c6210a3 in child_test_sig_handler builtin-test.c:0
    #1 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0
    #2 0x7fc12fea3a14 in pthread_sigmask@GLIBC_2.2.5 pthread_sigmask.c:45
    #3 0x7fc12fe49fd9 in __GI___sigprocmask sigprocmask.c:26
    #4 0x7fc12ff2601b in __longjmp_chk longjmp.c:36
    #5 0x55788c6210c0 in print_test_result.isra.0 builtin-test.c:0
    #6 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0
    #7 0x7fc12fe99687 in __internal_syscall_cancel cancellation.c:64
    #8 0x7fc12fee5f7a in clock_nanosleep@GLIBC_2.2.5 clock_nanosleep.c:72
    #9 0x7fc12fef1393 in __nanosleep nanosleep.c:26
    #10 0x7fc12ff02d68 in __sleep sleep.c:55
    #11 0x55788c63196b in test__PERF_RECORD perf-record.c:0
    #12 0x55788c620fb0 in run_test_child builtin-test.c:0
    #13 0x55788c5bd18d in start_command run-command.c:127
    #14 0x55788c621ef3 in __cmd_test builtin-test.c:0
    #15 0x55788c6225bf in cmd_test ??:0
    #16 0x55788c5afbd0 in run_builtin perf.c:0
    #17 0x55788c5afeeb in handle_internal_command perf.c:0
    #18 0x55788c52b383 in main ??:0
    #19 0x7fc12fe33ca8 in __libc_start_call_main libc_start_call_main.h:74
    #20 0x7fc12fe33d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128
    #21 0x55788c52b9d1 in _start ??:0
  7: PERF_RECORD_* events & perf_sample fields                       : Skip (permissions)
```

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250624210500.2121303-1-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
github-actions bot pushed a commit that referenced this pull request Aug 2, 2025
Calling perf top with branch filters enabled on Intel CPU's
with branch counters logging (A.K.A LBR event logging [1]) support
results in a segfault.

$ perf top  -e '{cpu_core/cpu-cycles/,cpu_core/event=0xc6,umask=0x3,frontend=0x11,name=frontend_retired_dsb_miss/}' -j any,counter
...
Thread 27 "perf" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffafff76c0 (LWP 949003)]
perf_env__find_br_cntr_info (env=0xf66dc0 <perf_env>, nr=0x0, width=0x7fffafff62c0) at util/env.c:653
653			*width = env->cpu_pmu_caps ? env->br_cntr_width :
(gdb) bt
 #0  perf_env__find_br_cntr_info (env=0xf66dc0 <perf_env>, nr=0x0, width=0x7fffafff62c0) at util/env.c:653
 #1  0x00000000005b1599 in symbol__account_br_cntr (branch=0x7fffcc3db580, evsel=0xfea2d0, offset=12, br_cntr=8) at util/annotate.c:345
 #2  0x00000000005b17fb in symbol__account_cycles (addr=5658172, start=5658160, sym=0x7fffcc0ee420, cycles=539, evsel=0xfea2d0, br_cntr=8) at util/annotate.c:389
 #3  0x00000000005b1976 in addr_map_symbol__account_cycles (ams=0x7fffcd7b01d0, start=0x7fffcd7b02b0, cycles=539, evsel=0xfea2d0, br_cntr=8) at util/annotate.c:422
 #4  0x000000000068d57f in hist__account_cycles (bs=0x110d288, al=0x7fffafff6540, sample=0x7fffafff6760, nonany_branch_mode=false, total_cycles=0x0, evsel=0xfea2d0) at util/hist.c:2850
 #5  0x0000000000446216 in hist_iter__top_callback (iter=0x7fffafff6590, al=0x7fffafff6540, single=true, arg=0x7fffffff9e00) at builtin-top.c:737
 #6  0x0000000000689787 in hist_entry_iter__add (iter=0x7fffafff6590, al=0x7fffafff6540, max_stack_depth=127, arg=0x7fffffff9e00) at util/hist.c:1359
 #7  0x0000000000446710 in perf_event__process_sample (tool=0x7fffffff9e00, event=0x110d250, evsel=0xfea2d0, sample=0x7fffafff6760, machine=0x108c968) at builtin-top.c:845
 #8  0x0000000000447735 in deliver_event (qe=0x7fffffffa120, qevent=0x10fc200) at builtin-top.c:1211
 #9  0x000000000064ccae in do_flush (oe=0x7fffffffa120, show_progress=false) at util/ordered-events.c:245
 #10 0x000000000064d005 in __ordered_events__flush (oe=0x7fffffffa120, how=OE_FLUSH__TOP, timestamp=0) at util/ordered-events.c:324
 #11 0x000000000064d0ef in ordered_events__flush (oe=0x7fffffffa120, how=OE_FLUSH__TOP) at util/ordered-events.c:342
 #12 0x00000000004472a9 in process_thread (arg=0x7fffffff9e00) at builtin-top.c:1120
 #13 0x00007ffff6e7dba8 in start_thread (arg=<optimized out>) at pthread_create.c:448
 #14 0x00007ffff6f01b8c in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

The cause is that perf_env__find_br_cntr_info tries to access a
null pointer pmu_caps in the perf_env struct. A similar issue exists
for homogeneous core systems which use the cpu_pmu_caps structure.

Fix this by populating cpu_pmu_caps and pmu_caps structures with
values from sysfs when calling perf top with branch stack sampling
enabled.

[1], LBR event logging introduced here:
https://lore.kernel.org/all/20231025201626.3000228-5-kan.liang@linux.intel.com/

Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Thomas Falcon <thomas.falcon@intel.com>
Link: https://lore.kernel.org/r/20250612163659.1357950-2-thomas.falcon@intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
github-actions bot pushed a commit that referenced this pull request Aug 13, 2025
JIRA: https://issues.redhat.com/browse/RHEL-15711
Upstream status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git

Intel TDX protects guest VM's from malicious host and certain physical
attacks.  TDX introduces a new operation mode, Secure Arbitration Mode
(SEAM) to isolate and protect guest VM's.  A TDX guest VM runs in SEAM and,
unlike VMX, direct control and interaction with the guest by the host VMM
is not possible.  Instead, Intel TDX Module, which also runs in SEAM,
provides a SEAMCALL API.

The SEAMCALL that provides the ability to enter a guest is TDH.VP.ENTER.
The TDX Module processes TDH.VP.ENTER, and enters the guest via VMX
VMLAUNCH/VMRESUME instructions.  When a guest VM-exit requires host VMM
interaction, the TDH.VP.ENTER SEAMCALL returns to the host VMM (KVM).

Add tdh_vp_enter() to wrap the SEAMCALL invocation of TDH.VP.ENTER;
tdh_vp_enter() needs to be noinstr because VM entry in KVM is noinstr
as well, which is for two reasons:
* marking the area as CT_STATE_GUEST via guest_state_enter_irqoff() and
  guest_state_exit_irqoff()
* IRET must be avoided between VM-exit and NMI handling, in order to
  avoid prematurely releasing the NMI inhibit.

TDH.VP.ENTER is different from other SEAMCALLs in several ways: it
uses more arguments, and after it returns some host state may need to be
restored.  Therefore tdh_vp_enter() uses __seamcall_saved_ret() instead of
__seamcall_ret(); since it is the only caller of __seamcall_saved_ret(),
it can be made noinstr also.

TDH.VP.ENTER arguments are passed through General Purpose Registers (GPRs).
For the special case of the TD guest invoking TDG.VP.VMCALL, nearly any GPR
can be used, as well as XMM0 to XMM15. Notably, RBP is not used, and Linux
mandates the TDX Module feature NO_RBP_MOD, which is enforced elsewhere.
Additionally, XMM registers are not required for the existing Guest
Hypervisor Communication Interface and are handled by existing KVM code
should they be modified by the guest.

There are 2 input formats and 5 output formats for TDH.VP.ENTER arguments.
Input #1 : Initial entry or following a previous async. TD Exit
Input #2 : Following a previous TDCALL(TDG.VP.VMCALL)
Output #1 : On Error (No TD Entry)
Output #2 : Async. Exits with a VMX Architectural Exit Reason
Output #3 : Async. Exits with a non-VMX TD Exit Status
Output #4 : Async. Exits with Cross-TD Exit Details
Output #5 : On TDCALL(TDG.VP.VMCALL)

Currently, to keep things simple, the wrapper function does not attempt
to support different formats, and just passes all the GPRs that could be
used.  The GPR values are held by KVM in the area set aside for guest
GPRs.  KVM code uses the guest GPR area (vcpu->arch.regs[]) to set up for
or process results of tdh_vp_enter().

Therefore changing tdh_vp_enter() to use more complex argument formats
would also alter the way KVM code interacts with tdh_vp_enter().

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20241121201448.36170-2-adrian.hunter@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 69e23fa)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
github-actions bot pushed a commit that referenced this pull request Aug 13, 2025
JIRA: https://issues.redhat.com/browse/RHEL-95318

commit c5b6aba
Author: Maxim Levitsky <mlevitsk@redhat.com>
Date:   Mon May 12 14:04:02 2025 -0400

    locking/mutex: implement mutex_trylock_nested

    Despite the fact that several lockdep-related checks are skipped when
    calling trylock* versions of the locking primitives, for example
    mutex_trylock, each time the mutex is acquired, a held_lock is still
    placed onto the lockdep stack by __lock_acquire() which is called
    regardless of whether the trylock* or regular locking API was used.

    This means that if the caller successfully acquires more than
    MAX_LOCK_DEPTH locks of the same class, even when using mutex_trylock,
    lockdep will still complain that the maximum depth of the held lock stack
    has been reached and disable itself.

    For example, the following error currently occurs in the ARM version
    of KVM, once the code tries to lock all vCPUs of a VM configured with more
    than MAX_LOCK_DEPTH vCPUs, a situation that can easily happen on modern
    systems, where having more than 48 CPUs is common, and it's also common to
    run VMs that have vCPU counts approaching that number:

    [  328.171264] BUG: MAX_LOCK_DEPTH too low!
    [  328.175227] turning off the locking correctness validator.
    [  328.180726] Please attach the output of /proc/lock_stat to the bug report
    [  328.187531] depth: 48  max: 48!
    [  328.190678] 48 locks held by qemu-kvm/11664:
    [  328.194957]  #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
    [  328.204048]  #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.212521]  #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.220991]  #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.229463]  #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.237934]  #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.246405]  #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0

    Luckily, in all instances that require locking all vCPUs, the
    'kvm->lock' is taken a priori, and that fact makes it possible to use
    the little known feature of lockdep, called a 'nest_lock', to avoid this
    warning and subsequent lockdep self-disablement.

    The action of 'nested lock' being provided to lockdep's lock_acquire(),
    causes the lockdep to detect that the top of the held lock stack contains
    a lock of the same class and then increment its reference counter instead
    of pushing a new held_lock item onto that stack.

    See __lock_acquire for more information.

    Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Message-ID: <20250512180407.659015-2-mlevitsk@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
github-actions bot pushed a commit that referenced this pull request Aug 13, 2025
JIRA: https://issues.redhat.com/browse/RHEL-95318

commit b586c5d
Author: Maxim Levitsky <mlevitsk@redhat.com>
Date:   Mon May 12 14:04:06 2025 -0400

    KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUs

    Use kvm_trylock_all_vcpus instead of a custom implementation when locking
    all vCPUs of a VM, to avoid triggering a lockdep warning, in the case in
    which the VM is configured to have more than MAX_LOCK_DEPTH vCPUs.

    This fixes the following false lockdep warning:

    [  328.171264] BUG: MAX_LOCK_DEPTH too low!
    [  328.175227] turning off the locking correctness validator.
    [  328.180726] Please attach the output of /proc/lock_stat to the bug report
    [  328.187531] depth: 48  max: 48!
    [  328.190678] 48 locks held by qemu-kvm/11664:
    [  328.194957]  #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
    [  328.204048]  #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.212521]  #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.220991]  #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.229463]  #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.237934]  #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
    [  328.246405]  #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0

    Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
    Acked-by: Marc Zyngier <maz@kernel.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Message-ID: <20250512180407.659015-6-mlevitsk@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
github-actions bot pushed a commit that referenced this pull request Aug 19, 2025
JIRA: https://issues.redhat.com/browse/RHEL-72226
JIRA: https://issues.redhat.com/browse/RHEL-73517
CVE: CVE-2025-21674
Upstream-status: v6.13

commit 2c36880
Author: Leon Romanovsky <leon@kernel.org>
Date:   Wed Jan 15 13:39:08 2025 +0200

    net/mlx5e: Fix inversion dependency warning while enabling IPsec tunnel

    Attempt to enable IPsec packet offload in tunnel mode in debug kernel
    generates the following kernel panic, which is happening due to two
    issues:
    1. In SA add section, the should be _bh() variant when marking SA mode.
    2. There is not needed flush_workqueue in SA delete routine. It is not
    needed as at this stage as it is removed from SADB and the running work
    will be canceled later in SA free.

     =====================================================
     WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
     6.12.0+ #4 Not tainted
     -----------------------------------------------------
     charon/1337 [HC0[0]:SC0[4]:HE1:SE0] is trying to acquire:
     ffff88810f365020 (&xa->xa_lock#24){+.+.}-{3:3}, at: mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]

     and this task is already holding:
     ffff88813e0f0d48 (&x->lock){+.-.}-{3:3}, at: xfrm_state_delete+0x16/0x30
     which would create a new lock dependency:
      (&x->lock){+.-.}-{3:3} -> (&xa->xa_lock#24){+.+.}-{3:3}

     but this new dependency connects a SOFTIRQ-irq-safe lock:
      (&x->lock){+.-.}-{3:3}

     ... which became SOFTIRQ-irq-safe at:
       lock_acquire+0x1be/0x520
       _raw_spin_lock_bh+0x34/0x40
       xfrm_timer_handler+0x91/0xd70
       __hrtimer_run_queues+0x1dd/0xa60
       hrtimer_run_softirq+0x146/0x2e0
       handle_softirqs+0x266/0x860
       irq_exit_rcu+0x115/0x1a0
       sysvec_apic_timer_interrupt+0x6e/0x90
       asm_sysvec_apic_timer_interrupt+0x16/0x20
       default_idle+0x13/0x20
       default_idle_call+0x67/0xa0
       do_idle+0x2da/0x320
       cpu_startup_entry+0x50/0x60
       start_secondary+0x213/0x2a0
       common_startup_64+0x129/0x138

     to a SOFTIRQ-irq-unsafe lock:
      (&xa->xa_lock#24){+.+.}-{3:3}

     ... which became SOFTIRQ-irq-unsafe at:
     ...
       lock_acquire+0x1be/0x520
       _raw_spin_lock+0x2c/0x40
       xa_set_mark+0x70/0x110
       mlx5e_xfrm_add_state+0xe48/0x2290 [mlx5_core]
       xfrm_dev_state_add+0x3bb/0xd70
       xfrm_add_sa+0x2451/0x4a90
       xfrm_user_rcv_msg+0x493/0x880
       netlink_rcv_skb+0x12e/0x380
       xfrm_netlink_rcv+0x6d/0x90
       netlink_unicast+0x42f/0x740
       netlink_sendmsg+0x745/0xbe0
       __sock_sendmsg+0xc5/0x190
       __sys_sendto+0x1fe/0x2c0
       __x64_sys_sendto+0xdc/0x1b0
       do_syscall_64+0x6d/0x140
       entry_SYSCALL_64_after_hwframe+0x4b/0x53

     other info that might help us debug this:

      Possible interrupt unsafe locking scenario:

            CPU0                    CPU1
            ----                    ----
       lock(&xa->xa_lock#24);
                                    local_irq_disable();
                                    lock(&x->lock);
                                    lock(&xa->xa_lock#24);
       <Interrupt>
         lock(&x->lock);

      *** DEADLOCK ***

     2 locks held by charon/1337:
      #0: ffffffff87f8f858 (&net->xfrm.xfrm_cfg_mutex){+.+.}-{4:4}, at: xfrm_netlink_rcv+0x5e/0x90
      #1: ffff88813e0f0d48 (&x->lock){+.-.}-{3:3}, at: xfrm_state_delete+0x16/0x30

     the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
     -> (&x->lock){+.-.}-{3:3} ops: 29 {
        HARDIRQ-ON-W at:
                         lock_acquire+0x1be/0x520
                         _raw_spin_lock_bh+0x34/0x40
                         xfrm_alloc_spi+0xc0/0xe60
                         xfrm_alloc_userspi+0x5f6/0xbc0
                         xfrm_user_rcv_msg+0x493/0x880
                         netlink_rcv_skb+0x12e/0x380
                         xfrm_netlink_rcv+0x6d/0x90
                         netlink_unicast+0x42f/0x740
                         netlink_sendmsg+0x745/0xbe0
                         __sock_sendmsg+0xc5/0x190
                         __sys_sendto+0x1fe/0x2c0
                         __x64_sys_sendto+0xdc/0x1b0
                         do_syscall_64+0x6d/0x140
                         entry_SYSCALL_64_after_hwframe+0x4b/0x53
        IN-SOFTIRQ-W at:
                         lock_acquire+0x1be/0x520
                         _raw_spin_lock_bh+0x34/0x40
                         xfrm_timer_handler+0x91/0xd70
                         __hrtimer_run_queues+0x1dd/0xa60
                         hrtimer_run_softirq+0x146/0x2e0
                         handle_softirqs+0x266/0x860
                         irq_exit_rcu+0x115/0x1a0
                         sysvec_apic_timer_interrupt+0x6e/0x90
                         asm_sysvec_apic_timer_interrupt+0x16/0x20
                         default_idle+0x13/0x20
                         default_idle_call+0x67/0xa0
                         do_idle+0x2da/0x320
                         cpu_startup_entry+0x50/0x60
                         start_secondary+0x213/0x2a0
                         common_startup_64+0x129/0x138
        INITIAL USE at:
                        lock_acquire+0x1be/0x520
                        _raw_spin_lock_bh+0x34/0x40
                        xfrm_alloc_spi+0xc0/0xe60
                        xfrm_alloc_userspi+0x5f6/0xbc0
                        xfrm_user_rcv_msg+0x493/0x880
                        netlink_rcv_skb+0x12e/0x380
                        xfrm_netlink_rcv+0x6d/0x90
                        netlink_unicast+0x42f/0x740
                        netlink_sendmsg+0x745/0xbe0
                        __sock_sendmsg+0xc5/0x190
                        __sys_sendto+0x1fe/0x2c0
                        __x64_sys_sendto+0xdc/0x1b0
                        do_syscall_64+0x6d/0x140
                        entry_SYSCALL_64_after_hwframe+0x4b/0x53
      }
      ... key      at: [<ffffffff87f9cd20>] __key.18+0x0/0x40

     the dependencies between the lock to be acquired
      and SOFTIRQ-irq-unsafe lock:
     -> (&xa->xa_lock#24){+.+.}-{3:3} ops: 9 {
        HARDIRQ-ON-W at:
                         lock_acquire+0x1be/0x520
                         _raw_spin_lock_bh+0x34/0x40
                         mlx5e_xfrm_add_state+0xc5b/0x2290 [mlx5_core]
                         xfrm_dev_state_add+0x3bb/0xd70
                         xfrm_add_sa+0x2451/0x4a90
                         xfrm_user_rcv_msg+0x493/0x880
                         netlink_rcv_skb+0x12e/0x380
                         xfrm_netlink_rcv+0x6d/0x90
                         netlink_unicast+0x42f/0x740
                         netlink_sendmsg+0x745/0xbe0
                         __sock_sendmsg+0xc5/0x190
                         __sys_sendto+0x1fe/0x2c0
                         __x64_sys_sendto+0xdc/0x1b0
                         do_syscall_64+0x6d/0x140
                         entry_SYSCALL_64_after_hwframe+0x4b/0x53
        SOFTIRQ-ON-W at:
                         lock_acquire+0x1be/0x520
                         _raw_spin_lock+0x2c/0x40
                         xa_set_mark+0x70/0x110
                         mlx5e_xfrm_add_state+0xe48/0x2290 [mlx5_core]
                         xfrm_dev_state_add+0x3bb/0xd70
                         xfrm_add_sa+0x2451/0x4a90
                         xfrm_user_rcv_msg+0x493/0x880
                         netlink_rcv_skb+0x12e/0x380
                         xfrm_netlink_rcv+0x6d/0x90
                         netlink_unicast+0x42f/0x740
                         netlink_sendmsg+0x745/0xbe0
                         __sock_sendmsg+0xc5/0x190
                         __sys_sendto+0x1fe/0x2c0
                         __x64_sys_sendto+0xdc/0x1b0
                         do_syscall_64+0x6d/0x140
                         entry_SYSCALL_64_after_hwframe+0x4b/0x53
        INITIAL USE at:
                        lock_acquire+0x1be/0x520
                        _raw_spin_lock_bh+0x34/0x40
                        mlx5e_xfrm_add_state+0xc5b/0x2290 [mlx5_core]
                        xfrm_dev_state_add+0x3bb/0xd70
                        xfrm_add_sa+0x2451/0x4a90
                        xfrm_user_rcv_msg+0x493/0x880
                        netlink_rcv_skb+0x12e/0x380
                        xfrm_netlink_rcv+0x6d/0x90
                        netlink_unicast+0x42f/0x740
                        netlink_sendmsg+0x745/0xbe0
                        __sock_sendmsg+0xc5/0x190
                        __sys_sendto+0x1fe/0x2c0
                        __x64_sys_sendto+0xdc/0x1b0
                        do_syscall_64+0x6d/0x140
                        entry_SYSCALL_64_after_hwframe+0x4b/0x53
      }
      ... key      at: [<ffffffffa078ff60>] __key.48+0x0/0xfffffffffff210a0 [mlx5_core]
      ... acquired at:
        __lock_acquire+0x30a0/0x5040
        lock_acquire+0x1be/0x520
        _raw_spin_lock_bh+0x34/0x40
        mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
        xfrm_dev_state_delete+0x90/0x160
        __xfrm_state_delete+0x662/0xae0
        xfrm_state_delete+0x1e/0x30
        xfrm_del_sa+0x1c2/0x340
        xfrm_user_rcv_msg+0x493/0x880
        netlink_rcv_skb+0x12e/0x380
        xfrm_netlink_rcv+0x6d/0x90
        netlink_unicast+0x42f/0x740
        netlink_sendmsg+0x745/0xbe0
        __sock_sendmsg+0xc5/0x190
        __sys_sendto+0x1fe/0x2c0
        __x64_sys_sendto+0xdc/0x1b0
        do_syscall_64+0x6d/0x140
        entry_SYSCALL_64_after_hwframe+0x4b/0x53

     stack backtrace:
     CPU: 7 UID: 0 PID: 1337 Comm: charon Not tainted 6.12.0+ #4
     Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
     Call Trace:
      <TASK>
      dump_stack_lvl+0x74/0xd0
      check_irq_usage+0x12e8/0x1d90
      ? print_shortest_lock_dependencies_backwards+0x1b0/0x1b0
      ? check_chain_key+0x1bb/0x4c0
      ? __lockdep_reset_lock+0x180/0x180
      ? check_path.constprop.0+0x24/0x50
      ? mark_lock+0x108/0x2fb0
      ? print_circular_bug+0x9b0/0x9b0
      ? mark_lock+0x108/0x2fb0
      ? print_usage_bug.part.0+0x670/0x670
      ? check_prev_add+0x1c4/0x2310
      check_prev_add+0x1c4/0x2310
      __lock_acquire+0x30a0/0x5040
      ? lockdep_set_lock_cmp_fn+0x190/0x190
      ? lockdep_set_lock_cmp_fn+0x190/0x190
      lock_acquire+0x1be/0x520
      ? mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
      ? lockdep_hardirqs_on_prepare+0x400/0x400
      ? __xfrm_state_delete+0x5f0/0xae0
      ? lock_downgrade+0x6b0/0x6b0
      _raw_spin_lock_bh+0x34/0x40
      ? mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
      mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
      xfrm_dev_state_delete+0x90/0x160
      __xfrm_state_delete+0x662/0xae0
      xfrm_state_delete+0x1e/0x30
      xfrm_del_sa+0x1c2/0x340
      ? xfrm_get_sa+0x250/0x250
      ? check_chain_key+0x1bb/0x4c0
      xfrm_user_rcv_msg+0x493/0x880
      ? copy_sec_ctx+0x270/0x270
      ? check_chain_key+0x1bb/0x4c0
      ? lockdep_set_lock_cmp_fn+0x190/0x190
      ? lockdep_set_lock_cmp_fn+0x190/0x190
      netlink_rcv_skb+0x12e/0x380
      ? copy_sec_ctx+0x270/0x270
      ? netlink_ack+0xd90/0xd90
      ? netlink_deliver_tap+0xcd/0xb60
      xfrm_netlink_rcv+0x6d/0x90
      netlink_unicast+0x42f/0x740
      ? netlink_attachskb+0x730/0x730
      ? lock_acquire+0x1be/0x520
      netlink_sendmsg+0x745/0xbe0
      ? netlink_unicast+0x740/0x740
      ? __might_fault+0xbb/0x170
      ? netlink_unicast+0x740/0x740
      __sock_sendmsg+0xc5/0x190
      ? fdget+0x163/0x1d0
      __sys_sendto+0x1fe/0x2c0
      ? __x64_sys_getpeername+0xb0/0xb0
      ? do_user_addr_fault+0x856/0xe30
      ? lock_acquire+0x1be/0x520
      ? __task_pid_nr_ns+0x117/0x410
      ? lock_downgrade+0x6b0/0x6b0
      __x64_sys_sendto+0xdc/0x1b0
      ? lockdep_hardirqs_on_prepare+0x284/0x400
      do_syscall_64+0x6d/0x140
      entry_SYSCALL_64_after_hwframe+0x4b/0x53
     RIP: 0033:0x7f7d31291ba4
     Code: 7d e8 89 4d d4 e8 4c 42 f7 ff 44 8b 4d d0 4c 8b 45 c8 89 c3 44 8b 55 d4 8b 7d e8 b8 2c 00 00 00 48 8b 55 d8 48 8b 75 e0 0f 05 <48> 3d 00 f0 ff ff 77 34 89 df 48 89 45 e8 e8 99 42 f7 ff 48 8b 45
     RSP: 002b:00007f7d2ccd94f0 EFLAGS: 00000297 ORIG_RAX: 000000000000002c
     RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7d31291ba4
     RDX: 0000000000000028 RSI: 00007f7d2ccd96a0 RDI: 000000000000000a
     RBP: 00007f7d2ccd9530 R08: 00007f7d2ccd9598 R09: 000000000000000c
     R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000028
     R13: 00007f7d2ccd9598 R14: 00007f7d2ccd96a0 R15: 00000000000000e1
      </TASK>

    Fixes: 4c24272 ("net/mlx5e: Listen to ARP events to update IPsec L2 headers in tunnel mode")
    Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
    Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
github-actions bot pushed a commit that referenced this pull request Aug 19, 2025
JIRA: https://issues.redhat.com/browse/RHEL-71129

upstream
========
commit ea04fe1
Author: Aditya Bodkhe <aditya.b1@linux.ibm.com>
Date: Tue Apr 29 12:21:32 2025 +0530

description
===========
pert script tests fails with segmentation fault as below:

  92: perf script tests:
  --- start ---
  test child forked, pid 103769
  DB test
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.012 MB /tmp/perf-test-script.7rbftEpOzX/perf.data (9 samples) ]
  /usr/libexec/perf-core/tests/shell/script.sh: line 35:
  103780 Segmentation fault      (core dumped)
  perf script -i "${perfdatafile}" -s "${db_test}"
  --- Cleaning up ---
  ---- end(-1) ----
  92: perf script tests                                               : FAILED!

Backtrace pointed to :
	#0  0x0000000010247dd0 in maps.machine ()
	#1  0x00000000101d178c in db_export.sample ()
	#2  0x00000000103412c8 in python_process_event ()
	#3  0x000000001004eb28 in process_sample_event ()
	#4  0x000000001024fcd0 in machines.deliver_event ()
	#5  0x000000001025005c in perf_session.deliver_event ()
	#6  0x00000000102568b0 in __ordered_events__flush.part.0 ()
	#7  0x0000000010251618 in perf_session.process_events ()
	#8  0x0000000010053620 in cmd_script ()
	#9  0x00000000100b5a28 in run_builtin ()
	#10 0x00000000100b5f94 in handle_internal_command ()
	#11 0x0000000010011114 in main ()

Further investigation reveals that this occurs in the `perf script tests`,
because it uses `db_test.py` script. This script sets `perf_db_export_mode = True`.

With `perf_db_export_mode` enabled, if a sample originates from a hypervisor,
perf doesn't set maps for "[H]" sample in the code. Consequently, `al->maps` remains NULL
when `maps__machine(al->maps)` is called from `db_export__sample`.

As al->maps can be NULL in case of Hypervisor samples , use thread->maps
because even for Hypervisor sample, machine should exist.
If we don't have machine for some reason, return -1 to avoid segmentation fault.

    Reported-by: Disha Goel <disgoel@linux.ibm.com>
    Signed-off-by: Aditya Bodkhe <aditya.b1@linux.ibm.com>
    Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
    Tested-by: Disha Goel <disgoel@linux.ibm.com>
    Link: https://lore.kernel.org/r/20250429065132.36839-1-adityab1@linux.ibm.com
    Suggested-by: Adrian Hunter <adrian.hunter@intel.com>
    Signed-off-by: Namhyung Kim <namhyung@kernel.org>

Signed-off-by: Jakub Brnak <jbrnak@redhat.com>
github-actions bot pushed a commit that referenced this pull request Aug 19, 2025
JIRA: https://issues.redhat.com/browse/RHEL-72227
JIRA: https://issues.redhat.com/browse/RHEL-73520
CVE: CVE-2025-21674
Upstream-status: v6.13
Conflicts:
- drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec.c
	Due to commit 1cfc87c (tags/kernel-6.12.0-106.el10) which
	is an OOO backport of
	43eca05 xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free} (v6.16-rc1)
	-> Adjust context

commit 2c36880
Author: Leon Romanovsky <leon@kernel.org>
Date:   Wed Jan 15 13:39:08 2025 +0200

    net/mlx5e: Fix inversion dependency warning while enabling IPsec tunnel

    Attempt to enable IPsec packet offload in tunnel mode in debug kernel
    generates the following kernel panic, which is happening due to two
    issues:
    1. In SA add section, the should be _bh() variant when marking SA mode.
    2. There is not needed flush_workqueue in SA delete routine. It is not
    needed as at this stage as it is removed from SADB and the running work
    will be canceled later in SA free.

     =====================================================
     WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
     6.12.0+ #4 Not tainted
     -----------------------------------------------------
     charon/1337 [HC0[0]:SC0[4]:HE1:SE0] is trying to acquire:
     ffff88810f365020 (&xa->xa_lock#24){+.+.}-{3:3}, at: mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]

     and this task is already holding:
     ffff88813e0f0d48 (&x->lock){+.-.}-{3:3}, at: xfrm_state_delete+0x16/0x30
     which would create a new lock dependency:
      (&x->lock){+.-.}-{3:3} -> (&xa->xa_lock#24){+.+.}-{3:3}

     but this new dependency connects a SOFTIRQ-irq-safe lock:
      (&x->lock){+.-.}-{3:3}

     ... which became SOFTIRQ-irq-safe at:
       lock_acquire+0x1be/0x520
       _raw_spin_lock_bh+0x34/0x40
       xfrm_timer_handler+0x91/0xd70
       __hrtimer_run_queues+0x1dd/0xa60
       hrtimer_run_softirq+0x146/0x2e0
       handle_softirqs+0x266/0x860
       irq_exit_rcu+0x115/0x1a0
       sysvec_apic_timer_interrupt+0x6e/0x90
       asm_sysvec_apic_timer_interrupt+0x16/0x20
       default_idle+0x13/0x20
       default_idle_call+0x67/0xa0
       do_idle+0x2da/0x320
       cpu_startup_entry+0x50/0x60
       start_secondary+0x213/0x2a0
       common_startup_64+0x129/0x138

     to a SOFTIRQ-irq-unsafe lock:
      (&xa->xa_lock#24){+.+.}-{3:3}

     ... which became SOFTIRQ-irq-unsafe at:
     ...
       lock_acquire+0x1be/0x520
       _raw_spin_lock+0x2c/0x40
       xa_set_mark+0x70/0x110
       mlx5e_xfrm_add_state+0xe48/0x2290 [mlx5_core]
       xfrm_dev_state_add+0x3bb/0xd70
       xfrm_add_sa+0x2451/0x4a90
       xfrm_user_rcv_msg+0x493/0x880
       netlink_rcv_skb+0x12e/0x380
       xfrm_netlink_rcv+0x6d/0x90
       netlink_unicast+0x42f/0x740
       netlink_sendmsg+0x745/0xbe0
       __sock_sendmsg+0xc5/0x190
       __sys_sendto+0x1fe/0x2c0
       __x64_sys_sendto+0xdc/0x1b0
       do_syscall_64+0x6d/0x140
       entry_SYSCALL_64_after_hwframe+0x4b/0x53

     other info that might help us debug this:

      Possible interrupt unsafe locking scenario:

            CPU0                    CPU1
            ----                    ----
       lock(&xa->xa_lock#24);
                                    local_irq_disable();
                                    lock(&x->lock);
                                    lock(&xa->xa_lock#24);
       <Interrupt>
         lock(&x->lock);

      *** DEADLOCK ***

     2 locks held by charon/1337:
      #0: ffffffff87f8f858 (&net->xfrm.xfrm_cfg_mutex){+.+.}-{4:4}, at: xfrm_netlink_rcv+0x5e/0x90
      #1: ffff88813e0f0d48 (&x->lock){+.-.}-{3:3}, at: xfrm_state_delete+0x16/0x30

     the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
     -> (&x->lock){+.-.}-{3:3} ops: 29 {
        HARDIRQ-ON-W at:
                         lock_acquire+0x1be/0x520
                         _raw_spin_lock_bh+0x34/0x40
                         xfrm_alloc_spi+0xc0/0xe60
                         xfrm_alloc_userspi+0x5f6/0xbc0
                         xfrm_user_rcv_msg+0x493/0x880
                         netlink_rcv_skb+0x12e/0x380
                         xfrm_netlink_rcv+0x6d/0x90
                         netlink_unicast+0x42f/0x740
                         netlink_sendmsg+0x745/0xbe0
                         __sock_sendmsg+0xc5/0x190
                         __sys_sendto+0x1fe/0x2c0
                         __x64_sys_sendto+0xdc/0x1b0
                         do_syscall_64+0x6d/0x140
                         entry_SYSCALL_64_after_hwframe+0x4b/0x53
        IN-SOFTIRQ-W at:
                         lock_acquire+0x1be/0x520
                         _raw_spin_lock_bh+0x34/0x40
                         xfrm_timer_handler+0x91/0xd70
                         __hrtimer_run_queues+0x1dd/0xa60
                         hrtimer_run_softirq+0x146/0x2e0
                         handle_softirqs+0x266/0x860
                         irq_exit_rcu+0x115/0x1a0
                         sysvec_apic_timer_interrupt+0x6e/0x90
                         asm_sysvec_apic_timer_interrupt+0x16/0x20
                         default_idle+0x13/0x20
                         default_idle_call+0x67/0xa0
                         do_idle+0x2da/0x320
                         cpu_startup_entry+0x50/0x60
                         start_secondary+0x213/0x2a0
                         common_startup_64+0x129/0x138
        INITIAL USE at:
                        lock_acquire+0x1be/0x520
                        _raw_spin_lock_bh+0x34/0x40
                        xfrm_alloc_spi+0xc0/0xe60
                        xfrm_alloc_userspi+0x5f6/0xbc0
                        xfrm_user_rcv_msg+0x493/0x880
                        netlink_rcv_skb+0x12e/0x380
                        xfrm_netlink_rcv+0x6d/0x90
                        netlink_unicast+0x42f/0x740
                        netlink_sendmsg+0x745/0xbe0
                        __sock_sendmsg+0xc5/0x190
                        __sys_sendto+0x1fe/0x2c0
                        __x64_sys_sendto+0xdc/0x1b0
                        do_syscall_64+0x6d/0x140
                        entry_SYSCALL_64_after_hwframe+0x4b/0x53
      }
      ... key      at: [<ffffffff87f9cd20>] __key.18+0x0/0x40

     the dependencies between the lock to be acquired
      and SOFTIRQ-irq-unsafe lock:
     -> (&xa->xa_lock#24){+.+.}-{3:3} ops: 9 {
        HARDIRQ-ON-W at:
                         lock_acquire+0x1be/0x520
                         _raw_spin_lock_bh+0x34/0x40
                         mlx5e_xfrm_add_state+0xc5b/0x2290 [mlx5_core]
                         xfrm_dev_state_add+0x3bb/0xd70
                         xfrm_add_sa+0x2451/0x4a90
                         xfrm_user_rcv_msg+0x493/0x880
                         netlink_rcv_skb+0x12e/0x380
                         xfrm_netlink_rcv+0x6d/0x90
                         netlink_unicast+0x42f/0x740
                         netlink_sendmsg+0x745/0xbe0
                         __sock_sendmsg+0xc5/0x190
                         __sys_sendto+0x1fe/0x2c0
                         __x64_sys_sendto+0xdc/0x1b0
                         do_syscall_64+0x6d/0x140
                         entry_SYSCALL_64_after_hwframe+0x4b/0x53
        SOFTIRQ-ON-W at:
                         lock_acquire+0x1be/0x520
                         _raw_spin_lock+0x2c/0x40
                         xa_set_mark+0x70/0x110
                         mlx5e_xfrm_add_state+0xe48/0x2290 [mlx5_core]
                         xfrm_dev_state_add+0x3bb/0xd70
                         xfrm_add_sa+0x2451/0x4a90
                         xfrm_user_rcv_msg+0x493/0x880
                         netlink_rcv_skb+0x12e/0x380
                         xfrm_netlink_rcv+0x6d/0x90
                         netlink_unicast+0x42f/0x740
                         netlink_sendmsg+0x745/0xbe0
                         __sock_sendmsg+0xc5/0x190
                         __sys_sendto+0x1fe/0x2c0
                         __x64_sys_sendto+0xdc/0x1b0
                         do_syscall_64+0x6d/0x140
                         entry_SYSCALL_64_after_hwframe+0x4b/0x53
        INITIAL USE at:
                        lock_acquire+0x1be/0x520
                        _raw_spin_lock_bh+0x34/0x40
                        mlx5e_xfrm_add_state+0xc5b/0x2290 [mlx5_core]
                        xfrm_dev_state_add+0x3bb/0xd70
                        xfrm_add_sa+0x2451/0x4a90
                        xfrm_user_rcv_msg+0x493/0x880
                        netlink_rcv_skb+0x12e/0x380
                        xfrm_netlink_rcv+0x6d/0x90
                        netlink_unicast+0x42f/0x740
                        netlink_sendmsg+0x745/0xbe0
                        __sock_sendmsg+0xc5/0x190
                        __sys_sendto+0x1fe/0x2c0
                        __x64_sys_sendto+0xdc/0x1b0
                        do_syscall_64+0x6d/0x140
                        entry_SYSCALL_64_after_hwframe+0x4b/0x53
      }
      ... key      at: [<ffffffffa078ff60>] __key.48+0x0/0xfffffffffff210a0 [mlx5_core]
      ... acquired at:
        __lock_acquire+0x30a0/0x5040
        lock_acquire+0x1be/0x520
        _raw_spin_lock_bh+0x34/0x40
        mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
        xfrm_dev_state_delete+0x90/0x160
        __xfrm_state_delete+0x662/0xae0
        xfrm_state_delete+0x1e/0x30
        xfrm_del_sa+0x1c2/0x340
        xfrm_user_rcv_msg+0x493/0x880
        netlink_rcv_skb+0x12e/0x380
        xfrm_netlink_rcv+0x6d/0x90
        netlink_unicast+0x42f/0x740
        netlink_sendmsg+0x745/0xbe0
        __sock_sendmsg+0xc5/0x190
        __sys_sendto+0x1fe/0x2c0
        __x64_sys_sendto+0xdc/0x1b0
        do_syscall_64+0x6d/0x140
        entry_SYSCALL_64_after_hwframe+0x4b/0x53

     stack backtrace:
     CPU: 7 UID: 0 PID: 1337 Comm: charon Not tainted 6.12.0+ #4
     Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
     Call Trace:
      <TASK>
      dump_stack_lvl+0x74/0xd0
      check_irq_usage+0x12e8/0x1d90
      ? print_shortest_lock_dependencies_backwards+0x1b0/0x1b0
      ? check_chain_key+0x1bb/0x4c0
      ? __lockdep_reset_lock+0x180/0x180
      ? check_path.constprop.0+0x24/0x50
      ? mark_lock+0x108/0x2fb0
      ? print_circular_bug+0x9b0/0x9b0
      ? mark_lock+0x108/0x2fb0
      ? print_usage_bug.part.0+0x670/0x670
      ? check_prev_add+0x1c4/0x2310
      check_prev_add+0x1c4/0x2310
      __lock_acquire+0x30a0/0x5040
      ? lockdep_set_lock_cmp_fn+0x190/0x190
      ? lockdep_set_lock_cmp_fn+0x190/0x190
      lock_acquire+0x1be/0x520
      ? mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
      ? lockdep_hardirqs_on_prepare+0x400/0x400
      ? __xfrm_state_delete+0x5f0/0xae0
      ? lock_downgrade+0x6b0/0x6b0
      _raw_spin_lock_bh+0x34/0x40
      ? mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
      mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
      xfrm_dev_state_delete+0x90/0x160
      __xfrm_state_delete+0x662/0xae0
      xfrm_state_delete+0x1e/0x30
      xfrm_del_sa+0x1c2/0x340
      ? xfrm_get_sa+0x250/0x250
      ? check_chain_key+0x1bb/0x4c0
      xfrm_user_rcv_msg+0x493/0x880
      ? copy_sec_ctx+0x270/0x270
      ? check_chain_key+0x1bb/0x4c0
      ? lockdep_set_lock_cmp_fn+0x190/0x190
      ? lockdep_set_lock_cmp_fn+0x190/0x190
      netlink_rcv_skb+0x12e/0x380
      ? copy_sec_ctx+0x270/0x270
      ? netlink_ack+0xd90/0xd90
      ? netlink_deliver_tap+0xcd/0xb60
      xfrm_netlink_rcv+0x6d/0x90
      netlink_unicast+0x42f/0x740
      ? netlink_attachskb+0x730/0x730
      ? lock_acquire+0x1be/0x520
      netlink_sendmsg+0x745/0xbe0
      ? netlink_unicast+0x740/0x740
      ? __might_fault+0xbb/0x170
      ? netlink_unicast+0x740/0x740
      __sock_sendmsg+0xc5/0x190
      ? fdget+0x163/0x1d0
      __sys_sendto+0x1fe/0x2c0
      ? __x64_sys_getpeername+0xb0/0xb0
      ? do_user_addr_fault+0x856/0xe30
      ? lock_acquire+0x1be/0x520
      ? __task_pid_nr_ns+0x117/0x410
      ? lock_downgrade+0x6b0/0x6b0
      __x64_sys_sendto+0xdc/0x1b0
      ? lockdep_hardirqs_on_prepare+0x284/0x400
      do_syscall_64+0x6d/0x140
      entry_SYSCALL_64_after_hwframe+0x4b/0x53
     RIP: 0033:0x7f7d31291ba4
     Code: 7d e8 89 4d d4 e8 4c 42 f7 ff 44 8b 4d d0 4c 8b 45 c8 89 c3 44 8b 55 d4 8b 7d e8 b8 2c 00 00 00 48 8b 55 d8 48 8b 75 e0 0f 05 <48> 3d 00 f0 ff ff 77 34 89 df 48 89 45 e8 e8 99 42 f7 ff 48 8b 45
     RSP: 002b:00007f7d2ccd94f0 EFLAGS: 00000297 ORIG_RAX: 000000000000002c
     RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7d31291ba4
     RDX: 0000000000000028 RSI: 00007f7d2ccd96a0 RDI: 000000000000000a
     RBP: 00007f7d2ccd9530 R08: 00007f7d2ccd9598 R09: 000000000000000c
     R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000028
     R13: 00007f7d2ccd9598 R14: 00007f7d2ccd96a0 R15: 00000000000000e1
      </TASK>

    Fixes: 4c24272 ("net/mlx5e: Listen to ARP events to update IPsec L2 headers in tunnel mode")
    Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
    Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
github-actions bot pushed a commit that referenced this pull request Aug 21, 2025
commit f7fa852 upstream.

Add the SM6115 MDSS compatible to clients compatible list, as it also
needs that workaround.
Without this workaround, for example, QRB4210 RB2 which is based on
SM4250/SM6115 generates a lot of smmu unhandled context faults during
boot:

arm_smmu_context_fault: 116854 callbacks suppressed
arm-smmu c600000.iommu: Unhandled context fault: fsr=0x402,
iova=0x5c0ec600, fsynr=0x320021, cbfrsynra=0x420, cb=5
arm-smmu c600000.iommu: FSR    = 00000402 [Format=2 TF], SID=0x420
arm-smmu c600000.iommu: FSYNR0 = 00320021 [S1CBNDX=50 PNU PLVL=1]
arm-smmu c600000.iommu: Unhandled context fault: fsr=0x402,
iova=0x5c0d7800, fsynr=0x320021, cbfrsynra=0x420, cb=5
arm-smmu c600000.iommu: FSR    = 00000402 [Format=2 TF], SID=0x420

and also failed initialisation of lontium lt9611uxc, gpu and dpu is
observed:
(binding MDSS components triggered by lt9611uxc have failed)

 ------------[ cut here ]------------
 !aspace
 WARNING: CPU: 6 PID: 324 at drivers/gpu/drm/msm/msm_gem_vma.c:130 msm_gem_vma_init+0x150/0x18c [msm]
 Modules linked in: ... (long list of modules)
 CPU: 6 UID: 0 PID: 324 Comm: (udev-worker) Not tainted 6.15.0-03037-gaacc73ceeb8b #4 PREEMPT
 Hardware name: Qualcomm Technologies, Inc. QRB4210 RB2 (DT)
 pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : msm_gem_vma_init+0x150/0x18c [msm]
 lr : msm_gem_vma_init+0x150/0x18c [msm]
 sp : ffff80008144b280
  		...
 Call trace:
  msm_gem_vma_init+0x150/0x18c [msm] (P)
  get_vma_locked+0xc0/0x194 [msm]
  msm_gem_get_and_pin_iova_range+0x4c/0xdc [msm]
  msm_gem_kernel_new+0x48/0x160 [msm]
  msm_gpu_init+0x34c/0x53c [msm]
  adreno_gpu_init+0x1b0/0x2d8 [msm]
  a6xx_gpu_init+0x1e8/0x9e0 [msm]
  adreno_bind+0x2b8/0x348 [msm]
  component_bind_all+0x100/0x230
  msm_drm_bind+0x13c/0x3d0 [msm]
  try_to_bring_up_aggregate_device+0x164/0x1d0
  __component_add+0xa4/0x174
  component_add+0x14/0x20
  dsi_dev_attach+0x20/0x34 [msm]
  dsi_host_attach+0x58/0x98 [msm]
  devm_mipi_dsi_attach+0x34/0x90
  lt9611uxc_attach_dsi.isra.0+0x94/0x124 [lontium_lt9611uxc]
  lt9611uxc_probe+0x540/0x5fc [lontium_lt9611uxc]
  i2c_device_probe+0x148/0x2a8
  really_probe+0xbc/0x2c0
  __driver_probe_device+0x78/0x120
  driver_probe_device+0x3c/0x154
  __driver_attach+0x90/0x1a0
  bus_for_each_dev+0x68/0xb8
  driver_attach+0x24/0x30
  bus_add_driver+0xe4/0x208
  driver_register+0x68/0x124
  i2c_register_driver+0x48/0xcc
  lt9611uxc_driver_init+0x20/0x1000 [lontium_lt9611uxc]
  do_one_initcall+0x60/0x1d4
  do_init_module+0x54/0x1fc
  load_module+0x1748/0x1c8c
  init_module_from_file+0x74/0xa0
  __arm64_sys_finit_module+0x130/0x2f8
  invoke_syscall+0x48/0x104
  el0_svc_common.constprop.0+0xc0/0xe0
  do_el0_svc+0x1c/0x28
  el0_svc+0x2c/0x80
  el0t_64_sync_handler+0x10c/0x138
  el0t_64_sync+0x198/0x19c
 ---[ end trace 0000000000000000 ]---
 msm_dpu 5e01000.display-controller: [drm:msm_gpu_init [msm]] *ERROR* could not allocate memptrs: -22
 msm_dpu 5e01000.display-controller: failed to load adreno gpu
 platform a400000.remoteproc:glink-edge:apr:service@7:dais: Adding to iommu group 19
 msm_dpu 5e01000.display-controller: failed to bind 5900000.gpu (ops a3xx_ops [msm]): -22
 msm_dpu 5e01000.display-controller: adev bind failed: -22
 lt9611uxc 0-002b: failed to attach dsi to host
 lt9611uxc 0-002b: probe with driver lt9611uxc failed with error -22

Suggested-by: Bjorn Andersson <andersson@kernel.org>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Fixes: 3581b70 ("drm/msm/disp/dpu1: add support for display on SM6115")
Cc: stable@vger.kernel.org
Signed-off-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lore.kernel.org/r/20250613173238.15061-1-alexey.klimov@linaro.org
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
github-actions bot pushed a commit that referenced this pull request Aug 22, 2025
JIRA: https://issues.redhat.com/browse/RHEL-47242

commit 69e23fa
Author: Kai Huang <kai.huang@intel.com>
Date:   Thu Nov 21 22:14:40 2024 +0200

    x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest

    Intel TDX protects guest VM's from malicious host and certain physical
    attacks.  TDX introduces a new operation mode, Secure Arbitration Mode
    (SEAM) to isolate and protect guest VM's.  A TDX guest VM runs in SEAM and,
    unlike VMX, direct control and interaction with the guest by the host VMM
    is not possible.  Instead, Intel TDX Module, which also runs in SEAM,
    provides a SEAMCALL API.

    The SEAMCALL that provides the ability to enter a guest is TDH.VP.ENTER.
    The TDX Module processes TDH.VP.ENTER, and enters the guest via VMX
    VMLAUNCH/VMRESUME instructions.  When a guest VM-exit requires host VMM
    interaction, the TDH.VP.ENTER SEAMCALL returns to the host VMM (KVM).

    Add tdh_vp_enter() to wrap the SEAMCALL invocation of TDH.VP.ENTER;
    tdh_vp_enter() needs to be noinstr because VM entry in KVM is noinstr
    as well, which is for two reasons:
    * marking the area as CT_STATE_GUEST via guest_state_enter_irqoff() and
      guest_state_exit_irqoff()
    * IRET must be avoided between VM-exit and NMI handling, in order to
      avoid prematurely releasing the NMI inhibit.

    TDH.VP.ENTER is different from other SEAMCALLs in several ways: it
    uses more arguments, and after it returns some host state may need to be
    restored.  Therefore tdh_vp_enter() uses __seamcall_saved_ret() instead of
    __seamcall_ret(); since it is the only caller of __seamcall_saved_ret(),
    it can be made noinstr also.

    TDH.VP.ENTER arguments are passed through General Purpose Registers (GPRs).
    For the special case of the TD guest invoking TDG.VP.VMCALL, nearly any GPR
    can be used, as well as XMM0 to XMM15. Notably, RBP is not used, and Linux
    mandates the TDX Module feature NO_RBP_MOD, which is enforced elsewhere.
    Additionally, XMM registers are not required for the existing Guest
    Hypervisor Communication Interface and are handled by existing KVM code
    should they be modified by the guest.

    There are 2 input formats and 5 output formats for TDH.VP.ENTER arguments.
    Input #1 : Initial entry or following a previous async. TD Exit
    Input #2 : Following a previous TDCALL(TDG.VP.VMCALL)
    Output #1 : On Error (No TD Entry)
    Output #2 : Async. Exits with a VMX Architectural Exit Reason
    Output #3 : Async. Exits with a non-VMX TD Exit Status
    Output #4 : Async. Exits with Cross-TD Exit Details
    Output #5 : On TDCALL(TDG.VP.VMCALL)

    Currently, to keep things simple, the wrapper function does not attempt
    to support different formats, and just passes all the GPRs that could be
    used.  The GPR values are held by KVM in the area set aside for guest
    GPRs.  KVM code uses the guest GPR area (vcpu->arch.regs[]) to set up for
    or process results of tdh_vp_enter().

    Therefore changing tdh_vp_enter() to use more complex argument formats
    would also alter the way KVM code interacts with tdh_vp_enter().

    Signed-off-by: Kai Huang <kai.huang@intel.com>
    Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
    Message-ID: <20241121201448.36170-2-adrian.hunter@intel.com>
    Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
github-actions bot pushed a commit that referenced this pull request Aug 22, 2025
BPF CI testing report a UAF issue:

  [   16.446633] BUG: kernel NULL pointer dereference, address: 000000000000003  0
  [   16.447134] #PF: supervisor read access in kernel mod  e
  [   16.447516] #PF: error_code(0x0000) - not-present pag  e
  [   16.447878] PGD 0 P4D   0
  [   16.448063] Oops: Oops: 0000 [#1] PREEMPT SMP NOPT  I
  [   16.448409] CPU: 0 UID: 0 PID: 9 Comm: kworker/0:1 Tainted: G           OE      6.13.0-rc3-g89e8a75fda73-dirty #4  2
  [   16.449124] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODUL  E
  [   16.449502] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/201  4
  [   16.450201] Workqueue: smc_hs_wq smc_listen_wor  k
  [   16.450531] RIP: 0010:smc_listen_work+0xc02/0x159  0
  [   16.452158] RSP: 0018:ffffb5ab40053d98 EFLAGS: 0001024  6
  [   16.452526] RAX: 0000000000000001 RBX: 0000000000000002 RCX: 000000000000030  0
  [   16.452994] RDX: 0000000000000280 RSI: 00003513840053f0 RDI: 000000000000000  0
  [   16.453492] RBP: ffffa097808e3800 R08: ffffa09782dba1e0 R09: 000000000000000  5
  [   16.453987] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0978274640  0
  [   16.454497] R13: 0000000000000000 R14: 0000000000000000 R15: ffffa09782d4092  0
  [   16.454996] FS:  0000000000000000(0000) GS:ffffa097bbc00000(0000) knlGS:000000000000000  0
  [   16.455557] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003  3
  [   16.455961] CR2: 0000000000000030 CR3: 0000000102788004 CR4: 0000000000770ef  0
  [   16.456459] PKRU: 5555555  4
  [   16.456654] Call Trace  :
  [   16.456832]  <TASK  >
  [   16.456989]  ? __die+0x23/0x7  0
  [   16.457215]  ? page_fault_oops+0x180/0x4c  0
  [   16.457508]  ? __lock_acquire+0x3e6/0x249  0
  [   16.457801]  ? exc_page_fault+0x68/0x20  0
  [   16.458080]  ? asm_exc_page_fault+0x26/0x3  0
  [   16.458389]  ? smc_listen_work+0xc02/0x159  0
  [   16.458689]  ? smc_listen_work+0xc02/0x159  0
  [   16.458987]  ? lock_is_held_type+0x8f/0x10  0
  [   16.459284]  process_one_work+0x1ea/0x6d  0
  [   16.459570]  worker_thread+0x1c3/0x38  0
  [   16.459839]  ? __pfx_worker_thread+0x10/0x1  0
  [   16.460144]  kthread+0xe0/0x11  0
  [   16.460372]  ? __pfx_kthread+0x10/0x1  0
  [   16.460640]  ret_from_fork+0x31/0x5  0
  [   16.460896]  ? __pfx_kthread+0x10/0x1  0
  [   16.461166]  ret_from_fork_asm+0x1a/0x3  0
  [   16.461453]  </TASK  >
  [   16.461616] Modules linked in: bpf_testmod(OE) [last unloaded: bpf_testmod(OE)  ]
  [   16.462134] CR2: 000000000000003  0
  [   16.462380] ---[ end trace 0000000000000000 ]---
  [   16.462710] RIP: 0010:smc_listen_work+0xc02/0x1590

The direct cause of this issue is that after smc_listen_out_connected(),
newclcsock->sk may be NULL since it will releases the smcsk. Therefore,
if the application closes the socket immediately after accept,
newclcsock->sk can be NULL. A possible execution order could be as
follows:

smc_listen_work                                 | userspace
-----------------------------------------------------------------
lock_sock(sk)                                   |
smc_listen_out_connected()                      |
| \- smc_listen_out                             |
|    | \- release_sock                          |
     | |- sk->sk_data_ready()                   |
                                                | fd = accept();
                                                | close(fd);
                                                |  \- socket->sk = NULL;
/* newclcsock->sk is NULL now */
SMC_STAT_SERV_SUCC_INC(sock_net(newclcsock->sk))

Since smc_listen_out_connected() will not fail, simply swapping the order
of the code can easily fix this issue.

Fixes: 3b2dec2 ("net/smc: restructure client and server code in af_smc")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Link: https://patch.msgid.link/20250818054618.41615-1-alibuda@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants