[syzbot] KASAN: use-after-free Read in skb_release_head_state #6

tedd-an · 2021-04-12T21:43:17Z

Hello,

syzbot found the following issue on:

HEAD commit: 996e435 Merge tag 'zonefs-5.11-rc3' of git://git.kernel.o..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=149f3770d00000
kernel config: https://syzkaller.appspot.com/x/.config?x=bacfc914704718d3
dashboard link: https://syzkaller.appspot.com/bug?extid=60c13361d933487eed83
compiler: gcc (GCC) 10.1.0-syz 20200507

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+60c13361d933487eed83@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: use-after-free in skb_dst_drop include/net/dst.h:269 [inline]
BUG: KASAN: use-after-free in skb_release_head_state+0x223/0x250 net/core/skbuff.c:653
Read of size 8 at addr ffff888020b57a58 by task syz-executor.3/23125

CPU: 0 PID: 23125 Comm: syz-executor.3 Not tainted 5.11.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x107/0x163 lib/dump_stack.c:120
print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:230
__kasan_report mm/kasan/report.c:396 [inline]
kasan_report.cold+0x79/0xd5 mm/kasan/report.c:413
skb_dst_drop include/net/dst.h:269 [inline]
skb_release_head_state+0x223/0x250 net/core/skbuff.c:653
skb_release_all net/core/skbuff.c:667 [inline]
__kfree_skb net/core/skbuff.c:683 [inline]
kfree_skb net/core/skbuff.c:701 [inline]
kfree_skb+0xfa/0x3f0 net/core/skbuff.c:695
hci_dev_do_open+0xa4a/0x1a00 net/bluetooth/hci_core.c:1619
hci_dev_open+0x132/0x300 net/bluetooth/hci_core.c:1685
hci_sock_ioctl+0x5b6/0x840 net/bluetooth/hci_sock.c:1025
sock_do_ioctl+0xcb/0x2d0 net/socket.c:1037
sock_ioctl+0x477/0x6a0 net/socket.c:1177
vfs_ioctl fs/ioctl.c:48 [inline]
__do_sys_ioctl fs/ioctl.c:753 [inline]
__se_sys_ioctl fs/ioctl.c:739 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45e087
Code: 48 83 c4 08 48 89 d8 5b 5d c3 66 0f 1f 84 00 00 00 00 00 48 89 e8 48 f7 d8 48 39 c3 0f 92 c0 eb 92 66 90 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 6d b5 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fff4e2da0d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 000000000045e087
RDX: 0000000000000000 RSI: 00000000400448c9 RDI: 0000000000000003
RBP: 00007fff4e2da0f0 R08: 0000000000000000 R09: 00007fd7d2806700
R10: 00007fd7d28069d0 R11: 0000000000000246 R12: 0000000002df8914
R13: 0000000000000004 R14: 0000000000000000 R15: 0000000000000000

Allocated by task 8520:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:401 [inline]
____kasan_kmalloc.constprop.0+0x82/0xa0 mm/kasan/common.c:429
kasan_slab_alloc include/linux/kasan.h:205 [inline]
slab_post_alloc_hook mm/slab.h:512 [inline]
slab_alloc_node mm/slub.c:2891 [inline]
slab_alloc mm/slub.c:2899 [inline]
kmem_cache_alloc+0x1c6/0x440 mm/slub.c:2904
skb_clone+0x14f/0x3c0 net/core/skbuff.c:1449
hci_cmd_work+0x18f/0x390 net/bluetooth/hci_core.c:5007
process_one_work+0x98d/0x15f0 kernel/workqueue.c:2275
worker_thread+0x64c/0x1120 kernel/workqueue.c:2421
kthread+0x3b1/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296

Freed by task 8519:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
kasan_set_track+0x1c/0x30 mm/kasan/common.c:46
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:356
____kasan_slab_free+0xe1/0x110 mm/kasan/common.c:362
kasan_slab_free include/linux/kasan.h:188 [inline]
slab_free_hook mm/slub.c:1547 [inline]
slab_free_freelist_hook+0x5d/0x150 mm/slub.c:1580
slab_free mm/slub.c:3142 [inline]
kmem_cache_free+0x82/0x350 mm/slub.c:3158
kfree_skbmem+0xef/0x1b0 net/core/skbuff.c:627
__kfree_skb net/core/skbuff.c:684 [inline]
kfree_skb net/core/skbuff.c:701 [inline]
kfree_skb+0x140/0x3f0 net/core/skbuff.c:695
hci_cmd_work+0x182/0x390 net/bluetooth/hci_core.c:5005
process_one_work+0x98d/0x15f0 kernel/workqueue.c:2275
worker_thread+0x64c/0x1120 kernel/workqueue.c:2421
kthread+0x3b1/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296

The buggy address belongs to the object at ffff888020b57a00
which belongs to the cache skbuff_head_cache of size 232
The buggy address is located 88 bytes inside of
232-byte region [ffff888020b57a00, ffff888020b57ae8)
The buggy address belongs to the page:
page:00000000508c5c1a refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888020b57140 pfn:0x20b57
flags: 0xfff00000000200(slab)
raw: 00fff00000000200 ffffea00008b8b88 ffffea000098cd48 ffff888010e75640
raw: ffff888020b57140 00000000000c000b 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff888020b57900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888020b57980: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc
>ffff888020b57a00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                    ^
ffff888020b57a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
ffff888020b57b00: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
==================================================================
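The offsets quoted in the report can be cross-checked by hand. The following is only a sketch: it assumes generic KASAN, where one shadow byte describes 8 bytes of memory, and it assumes each row of the "Memory state" dump starts at the address printed on that row. The macro and helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the report above. */
#define OBJ_START 0xffff888020b57a00ULL /* start of the skbuff_head_cache object */
#define OBJ_SIZE  232ULL                /* object size reported by KASAN */
#define BAD_ADDR  0xffff888020b57a58ULL /* address of the faulting 8-byte read */

/* Assumption: generic KASAN, one shadow byte per 8 bytes of memory. */
#define SHADOW_GRANULE 8ULL

/* Byte offset of addr inside the object that starts at obj_start. */
static uint64_t offset_in_object(uint64_t addr, uint64_t obj_start)
{
	assert(addr >= obj_start);
	return addr - obj_start;
}

/* Zero-based index of the shadow byte, within a dump row, that covers addr. */
static uint64_t shadow_index_in_row(uint64_t addr, uint64_t row_base)
{
	assert(addr >= row_base);
	return (addr - row_base) / SHADOW_GRANULE;
}
```

With the values from the report, the offset comes out as 88 bytes, the region ends at ffff888020b57ae8, and the read falls under shadow byte index 11 of the row marked with `>`, which is `fb` in the dump.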

This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


A race condition is triggered when usermode control is given to userspace before the kernel's MSFT query responds, resulting in an unexpected response to userspace's reset command.

The issue can be observed in btmon:

< HCI Command: Vendor (0x3f|0x001e) plen 2                    #3 [hci0]
        05 01                                                  ..
@ USER Open: bt_stack_manage (privileged) version 2.22  {0x0002} [hci0]
< HCI Command: Reset (0x03|0x0003) plen 0                     #4 [hci0]
> HCI Event: Command Complete (0x0e) plen 5                   #5 [hci0]
      Vendor (0x3f|0x001e) ncmd 1
        Status: Command Disallowed (0x0c)
        05                                                     .
> HCI Event: Command Complete (0x0e) plen 4                   #6 [hci0]
      Reset (0x03|0x0003) ncmd 2
        Status: Success (0x00)

Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Jesse Melhuish <melhuishj@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

The exit function fixes a memory leak with the src field as detected by leak sanitizer. An example of which is:

Indirect leak of 25133184 byte(s) in 207 object(s) allocated from:
    #0 0x7f199ecfe987 in __interceptor_calloc libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x55defe638224 in annotated_source__alloc_histograms util/annotate.c:803
    #2 0x55defe6397e4 in symbol__hists util/annotate.c:952
    #3 0x55defe639908 in symbol__inc_addr_samples util/annotate.c:968
    #4 0x55defe63aa29 in hist_entry__inc_addr_samples util/annotate.c:1119
    #5 0x55defe499a79 in hist_iter__report_callback tools/perf/builtin-report.c:182
    #6 0x55defe7a859d in hist_entry_iter__add util/hist.c:1236
    #7 0x55defe49aa63 in process_sample_event tools/perf/builtin-report.c:315
    #8 0x55defe731bc8 in evlist__deliver_sample util/session.c:1473
    #9 0x55defe731e38 in machines__deliver_event util/session.c:1510
    #10 0x55defe732a23 in perf_session__deliver_event util/session.c:1590
    #11 0x55defe72951e in ordered_events__deliver_event util/session.c:183
    #12 0x55defe740082 in do_flush util/ordered-events.c:244
    #13 0x55defe7407cb in __ordered_events__flush util/ordered-events.c:323
    #14 0x55defe740a61 in ordered_events__flush util/ordered-events.c:341
    #15 0x55defe73837f in __perf_session__process_events util/session.c:2390
    #16 0x55defe7385ff in perf_session__process_events util/session.c:2420
    ...

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@arm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Martin Liška <mliska@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20211112035124.94327-3-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

As guest_irq comes from the KVM_IRQFD API call, an out-of-bounds value may trigger a crash in svm_update_pi_irte():

crash> bt
PID: 22218  TASK: ffff951a6ad74980  CPU: 73  COMMAND: "vcpu8"
 #0 [ffffb1ba6707fa40] machine_kexec at ffffffff8565b397
 #1 [ffffb1ba6707fa90] __crash_kexec at ffffffff85788a6d
 #2 [ffffb1ba6707fb58] crash_kexec at ffffffff8578995d
 #3 [ffffb1ba6707fb70] oops_end at ffffffff85623c0d
 #4 [ffffb1ba6707fb90] no_context at ffffffff856692c9
 #5 [ffffb1ba6707fbf8] exc_page_fault at ffffffff85f95b51
 #6 [ffffb1ba6707fc50] asm_exc_page_fault at ffffffff86000ace
    [exception RIP: svm_update_pi_irte+227]
    RIP: ffffffffc0761b53  RSP: ffffb1ba6707fd08  RFLAGS: 00010086
    RAX: ffffb1ba6707fd78  RBX: ffffb1ba66d91000  RCX: 0000000000000001
    RDX: 00003c803f63f1c0  RSI: 000000000000019a  RDI: ffffb1ba66db2ab8
    RBP: 000000000000019a   R8: 0000000000000040   R9: ffff94ca41b82200
    R10: ffffffffffffffcf  R11: 0000000000000001  R12: 0000000000000001
    R13: 0000000000000001  R14: ffffffffffffffcf  R15: 000000000000005f
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffffb1ba6707fdb8] kvm_irq_routing_update at ffffffffc09f19a1 [kvm]
 #8 [ffffb1ba6707fde0] kvm_set_irq_routing at ffffffffc09f2133 [kvm]
 #9 [ffffb1ba6707fe18] kvm_vm_ioctl at ffffffffc09ef544 [kvm]
    RIP: 00007f143c36488b  RSP: 00007f143a4e04b8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 00007f05780041d0  RCX: 00007f143c36488b
    RDX: 00007f05780041d0  RSI: 000000004008ae6a  RDI: 0000000000000020
    RBP: 00000000000004e8   R8: 0000000000000008   R9: 00007f05780041e0
    R10: 00007f0578004560  R11: 0000000000000246  R12: 00000000000004e0
    R13: 000000000000001a  R14: 00007f1424001c60  R15: 00007f0578003bc0
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

VMX already fixed this in commit 3a8b067 (KVM: VMX: Do not BUG() on out-of-bounds guest IRQ), so the same fix can be applied here.

Co-developed-by: Yi Liu <liu.yi24@zte.com.cn>
Signed-off-by: Yi Liu <liu.yi24@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Message-Id: <20220309113025.44469-1-wang.yi59@zte.com.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
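The kind of check described above can be sketched in user-space C. This is not the kernel patch: the struct, its fields, and the function name are hypothetical stand-ins, and the only point illustrated is that an index supplied by user space is validated before it is used to index a table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an IRQ routing table; not the kernel's struct. */
struct toy_irq_routing_table {
	size_t nr_rt_entries; /* number of valid slots in map[] */
	const int *map;       /* map[i] != 0 means a route exists for IRQ i */
};

/*
 * Return 1 if guest_irq may be used to index the table, 0 otherwise.
 * The index originates in user space, so it is checked before any access.
 */
static int toy_guest_irq_valid(const struct toy_irq_routing_table *rt,
			       size_t guest_irq)
{
	assert(rt != NULL);
	if (guest_irq >= rt->nr_rt_entries)
		return 0;                   /* out of bounds: reject instead of crashing */
	return rt->map[guest_irq] != 0; /* in bounds: still require an actual route */
}
```

A caller would skip the update and return early when the function reports 0, rather than dereferencing an entry beyond the end of the table.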

Andrii Nakryiko says:

====================
Add libbpf support for USDT (User Statically-Defined Tracing) probes. USDTs are an important part of the tracing, and BPF, ecosystem, widely used in mission-critical production applications for observability, performance analysis, and debugging.

And while USDTs themselves are a pretty complicated abstraction built on top of uprobes, for end-users USDT is as natural a primitive as uprobes themselves. And thus it's important for libbpf to provide the best possible user experience when it comes to building tracing applications relying on USDTs.

USDTs historically presented a lot of challenges for libbpf's no compilation-on-the-fly general approach to BPF tracing. BCC utilizes the power of on-the-fly source code generation and compilation using its embedded Clang toolchain, which was impractical for the more lightweight and thus more rigid libbpf-based approach. But still, with enough diligence and BPF cookies it's possible to implement USDT support that feels as natural as tracing any uprobe.

This patch set is the culmination of such effort to add libbpf USDT support following the spirit and philosophy of BPF CO-RE (even though it's not inherently relying on BPF CO-RE much, see patch #1 for some notes regarding this). Each respective patch has enough details and explanations, so I won't go into details here.

In the end, I think the overall usability of libbpf's USDT support *exceeds* the status quo set by BCC due to the elimination of awkward runtime USDT supporting code generation. It also exceeds BCC's capabilities due to the use of BPF cookie. This eliminates the need to determine a USDT call site (and thus specifics about how exactly to fetch arguments) based on its *absolute IP address*, which is impossible with shared libraries if no PID is specified (as we then just *can't* know absolute IP at which shared library is loaded, because it might be different for each process). With BPF cookie this is not a problem as we record "call site ID" directly in a BPF cookie value. This makes it possible to do a system-wide tracing of a USDT defined in a shared library. Think about tracing some USDT in libc across any process in the system, both running at the time of attachment and all the new processes started *afterwards*. This is a very powerful capability that allows more efficient observability and tracing tooling.

Once this functionality lands, the plan is to extend libbpf-bootstrap ([0]) with a USDT example. It will also become possible to start converting BCC tools that rely on USDTs to their libbpf-based counterparts ([1]).

It's worth noting that a preliminary version of this code is currently used and tested in production code running a fleet-wide observability toolkit.

Libbpf functionality is broken down into 5 mostly logically independent parts, for ease of reviewing:
  - patch #1 adds BPF-side implementation;
  - patch #2 adds user-space APIs and wires bpf_link for USDTs;
  - patch #3 adds the most mundane pieces: handling ELF, parsing USDT notes, dealing with memory segments, relative vs absolute addresses, etc;
  - patch #4 adds internal ID allocation and setting up/tearing down of BPF-side state (spec and IP-to-ID mapping);
  - patch #5 implements x86/x86-64-specific logic of parsing USDT argument specifications;
  - patch #6 adds testing of various basic aspects of handling of USDT;
  - patch #7 extends the set of tests with more combinations of semaphore, executable vs shared library, and PID filter options.

  [0] https://github.com/libbpf/libbpf-bootstrap
  [1] https://github.com/iovisor/bcc/tree/master/libbpf-tools

v2->v3:
  - fix typos, leave link to systemtap doc, acks, etc (Dave);
  - include sys/sdt.h to avoid extra system-wide package dependencies;
v1->v2:
  - huge high-level comment describing how all the moving parts fit together (Alan, Alexei);
  - switched from `__hidden __weak` to `static inline __noinline` for now, as there is a bug in BPF linker breaking final BPF object file due to invalid .BTF.ext data; I want to fix it separately at which point I'll switch back to __hidden __weak again. The fix isn't trivial, so I don't want to block on that. Same for __weak variable lookup bug that Hengqi reported.
  - various fixes and improvements, addressing other feedback (Alan, Hengqi);

Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Dave Marchevsky <davemarchevsky@fb.com>
Cc: Hengqi Chen <hengqi.chen@gmail.com>
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Ido Schimmel says:

====================
mlxsw: Preparations for line cards support

Currently, mlxsw registers thermal zones as well as hwmon entries for objects such as transceiver modules and gearboxes. In upcoming modular systems, these objects are no longer found on the main board (i.e., slot 0), but on plug-able line cards. This patchset prepares mlxsw for such systems in terms of hwmon, thermal and cable access support.

Patches #1-#3 gradually prepare mlxsw for transceiver modules access support for line cards by splitting some of the internal structures and some APIs.

Patches #4-#5 gradually prepare mlxsw for hwmon support for line cards by splitting some of the internal structures and augmenting them with a slot index.

Patches #6-#7 do the same for thermal zones.

Patch #8 selects cooling device for binding to a thermal zone by exact name match to prevent binding to non-relevant devices.

Patch #9 replaces internal define for thermal zone name length with a common define.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Ido Schimmel says:

====================
mlxsw: Line cards status tracking

When a line card is provisioned, netdevs corresponding to the ports found on the line card are registered. User space can then perform various logical configurations (e.g., splitting, setting MTU) on these netdevs.

However, since the line card is not present / powered on (i.e., it is not in 'active' state), user space cannot access the various components found on the line card. For example, user space cannot read the temperature of gearboxes or transceiver modules found on the line card via hwmon / thermal. Similarly, it cannot dump the EEPROM contents of these transceiver modules. The above is only possible when the line card becomes active.

This patchset solves the problem by tracking the status of each line card and invoking callbacks from interested parties when a line card becomes active / inactive.

Patchset overview:

Patch #1 adds the infrastructure in the line cards core that allows users to register a set of callbacks that are invoked when a line card becomes active / inactive. To avoid races, if a line card is already active during registration, the got_active() callback is invoked.

Patches #2-#3 are preparations.

Patch #4 changes the port module core to register a set of callbacks with the line cards core. See detailed description with examples in the commit message.

Patches #5-#6 do the same with regards to thermal / hwmon support, so that user space will be able to monitor the temperature of various components on the line card when it becomes active.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
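The registration behaviour described for patch #1 (invoke got_active() immediately when the line card is already active) can be sketched as a small user-space model. Everything here is hypothetical: the `toy_` types and functions are not mlxsw's API, only one listener is supported, and the locking a driver would need around these updates is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback set; the callback names follow the cover letter only. */
struct toy_linecard_ops {
	void (*got_active)(void *priv);
	void (*got_inactive)(void *priv);
};

struct toy_linecard {
	bool active;
	const struct toy_linecard_ops *ops; /* single listener, for brevity */
	void *priv;
};

/*
 * Register a listener. If the card is already active, got_active() is
 * called right away so the listener does not miss the transition.
 */
static void toy_linecard_register(struct toy_linecard *lc,
				  const struct toy_linecard_ops *ops, void *priv)
{
	assert(lc != NULL && ops != NULL);
	lc->ops = ops;
	lc->priv = priv;
	if (lc->active && ops->got_active)
		ops->got_active(priv);
}

/* Change the card state and notify the listener only on real transitions. */
static void toy_linecard_set_active(struct toy_linecard *lc, bool active)
{
	if (lc->active == active)
		return;
	lc->active = active;
	if (!lc->ops)
		return;
	if (active && lc->ops->got_active)
		lc->ops->got_active(lc->priv);
	else if (!active && lc->ops->got_inactive)
		lc->ops->got_inactive(lc->priv);
}

/* Example listener: counts how many times each callback fired. */
struct toy_counter { int up; int down; };
static void toy_count_up(void *priv)   { ((struct toy_counter *)priv)->up++; }
static void toy_count_down(void *priv) { ((struct toy_counter *)priv)->down++; }
```

Registering against a card that is already active bumps `up` once without any state change, which is the race-avoidance behaviour the cover letter describes.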

While handling PCI errors (AER flow) driver tries to disable NAPI [napi_disable()] after NAPI is deleted [__netif_napi_del()] which causes unexpected system hang/crash. System message log shows the following: ======================================= [ 3222.537510] EEH: Detected PCI bus error on PHB#384-PE#800000 [ 3222.537511] EEH: This PCI device has failed 2 times in the last hour and will be permanently disabled after 5 failures. [ 3222.537512] EEH: Notify device drivers to shutdown [ 3222.537513] EEH: Beginning: 'error_detected(IO frozen)' [ 3222.537514] EEH: PE#800000 (PCI 0384:80:00.0): Invoking bnx2x->error_detected(IO frozen) [ 3222.537516] bnx2x: [bnx2x_io_error_detected:14236(eth14)]IO error detected [ 3222.537650] EEH: PE#800000 (PCI 0384:80:00.0): bnx2x driver reports: 'need reset' [ 3222.537651] EEH: PE#800000 (PCI 0384:80:00.1): Invoking bnx2x->error_detected(IO frozen) [ 3222.537651] bnx2x: [bnx2x_io_error_detected:14236(eth13)]IO error detected [ 3222.537729] EEH: PE#800000 (PCI 0384:80:00.1): bnx2x driver reports: 'need reset' [ 3222.537729] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset' [ 3222.537890] EEH: Collect temporary log [ 3222.583481] EEH: of node=0384:80:00.0 [ 3222.583519] EEH: PCI device/vendor: 168e14e4 [ 3222.583557] EEH: PCI cmd/status register: 00100140 [ 3222.583557] EEH: PCI-E capabilities and status follow: [ 3222.583744] EEH: PCI-E 00: 00020010 012c8da2 00095d5e 00455c82 [ 3222.583892] EEH: PCI-E 10: 10820000 00000000 00000000 00000000 [ 3222.583893] EEH: PCI-E 20: 00000000 [ 3222.583893] EEH: PCI-E AER capability register set follows: [ 3222.584079] EEH: PCI-E AER 00: 13c10001 00000000 00000000 00062030 [ 3222.584230] EEH: PCI-E AER 10: 00002000 000031c0 000001e0 00000000 [ 3222.584378] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 [ 3222.584416] EEH: PCI-E AER 30: 00000000 00000000 [ 3222.584416] EEH: of node=0384:80:00.1 [ 3222.584454] EEH: PCI device/vendor: 168e14e4 [ 
3222.584491] EEH: PCI cmd/status register: 00100140 [ 3222.584492] EEH: PCI-E capabilities and status follow: [ 3222.584677] EEH: PCI-E 00: 00020010 012c8da2 00095d5e 00455c82 [ 3222.584825] EEH: PCI-E 10: 10820000 00000000 00000000 00000000 [ 3222.584826] EEH: PCI-E 20: 00000000 [ 3222.584826] EEH: PCI-E AER capability register set follows: [ 3222.585011] EEH: PCI-E AER 00: 13c10001 00000000 00000000 00062030 [ 3222.585160] EEH: PCI-E AER 10: 00002000 000031c0 000001e0 00000000 [ 3222.585309] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 [ 3222.585347] EEH: PCI-E AER 30: 00000000 00000000 [ 3222.586872] RTAS: event: 5, Type: Platform Error (224), Severity: 2 [ 3222.586873] EEH: Reset without hotplug activity [ 3224.762767] EEH: Beginning: 'slot_reset' [ 3224.762770] EEH: PE#800000 (PCI 0384:80:00.0): Invoking bnx2x->slot_reset() [ 3224.762771] bnx2x: [bnx2x_io_slot_reset:14271(eth14)]IO slot reset initializing... [ 3224.762887] bnx2x 0384:80:00.0: enabling device (0140 -> 0142) [ 3224.768157] bnx2x: [bnx2x_io_slot_reset:14287(eth14)]IO slot reset --> driver unload Uninterruptible tasks ===================== crash> ps | grep UN 213 2 11 c000000004c89e00 UN 0.0 0 0 [eehd] 215 2 0 c000000004c80000 UN 0.0 0 0 [kworker/0:2] 2196 1 28 c000000004504f00 UN 0.1 15936 11136 wickedd 4287 1 9 c00000020d076800 UN 0.0 4032 3008 agetty 4289 1 20 c00000020d056680 UN 0.0 7232 3840 agetty 32423 2 26 c00000020038c580 UN 0.0 0 0 [kworker/26:3] 32871 4241 27 c0000002609ddd00 UN 0.1 18624 11648 sshd 32920 10130 16 c00000027284a100 UN 0.1 48512 12608 sendmail 33092 32987 0 c000000205218b00 UN 0.1 48512 12608 sendmail 33154 4567 16 c000000260e51780 UN 0.1 48832 12864 pickup 33209 4241 36 c000000270cb6500 UN 0.1 18624 11712 sshd 33473 33283 0 c000000205211480 UN 0.1 48512 12672 sendmail 33531 4241 37 c00000023c902780 UN 0.1 18624 11648 sshd EEH handler hung while bnx2x sleeping and holding RTNL lock =========================================================== crash> bt 213 PID: 
213  TASK: c000000004c89e00  CPU: 11  COMMAND: "eehd"
 #0 [c000000004d477e0] __schedule at c000000000c70808
 #1 [c000000004d478b0] schedule at c000000000c70ee0
 #2 [c000000004d478e0] schedule_timeout at c000000000c76dec
 #3 [c000000004d479c0] msleep at c0000000002120cc
 #4 [c000000004d479f0] napi_disable at c000000000a06448
                                       ^^^^^^^^^^^^^^^^
 #5 [c000000004d47a30] bnx2x_netif_stop at c0080000018dba94 [bnx2x]
 #6 [c000000004d47a60] bnx2x_io_slot_reset at c0080000018a551c [bnx2x]
 #7 [c000000004d47b20] eeh_report_reset at c00000000004c9bc
 #8 [c000000004d47b90] eeh_pe_report at c00000000004d1a8
 #9 [c000000004d47c40] eeh_handle_normal_event at c00000000004da64

And the sleeping source code
============================
crash> dis -ls c000000000a06448
FILE: ../net/core/dev.c
LINE: 6702

  6697  {
  6698          might_sleep();
  6699          set_bit(NAPI_STATE_DISABLE, &n->state);
  6700
  6701          while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
* 6702                  msleep(1);
  6703          while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
  6704                  msleep(1);
  6705
  6706          hrtimer_cancel(&n->timer);
  6707
  6708          clear_bit(NAPI_STATE_DISABLE, &n->state);
  6709  }

EEH calls into bnx2x twice based on the system log above, first through bnx2x_io_error_detected() and then bnx2x_io_slot_reset(), and executes the following call chains:

bnx2x_io_error_detected()
  +-> bnx2x_eeh_nic_unload()
       +-> bnx2x_del_all_napi()
            +-> __netif_napi_del()

bnx2x_io_slot_reset()
  +-> bnx2x_netif_stop()
       +-> bnx2x_napi_disable()
            +->napi_disable()

Fix this by correcting the sequence of NAPI APIs usage, that is delete the NAPI after disabling it.

Fixes: 7fa6f34 ("bnx2x: AER revised")
Reported-by: David Christensen <drc@linux.vnet.ibm.com>
Tested-by: David Christensen <drc@linux.vnet.ibm.com>
Signed-off-by: Manish Chopra <manishc@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Link: https://lore.kernel.org/r/20220426153913.6966-1-manishc@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
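The ordering rule stated in the fix (disable first, delete afterwards) can be pictured with a toy lifecycle model. This is only an illustration under loud assumptions: the `toy_` names are hypothetical, none of this is the kernel's NAPI code, and where the real napi_disable() was observed sleeping in the report, the model simply returns an error.

```c
#include <assert.h>
#include <stddef.h>

/* Toy lifecycle model; these are not the kernel's NAPI structures or functions. */
enum toy_napi_state { TOY_NAPI_ENABLED, TOY_NAPI_DISABLED, TOY_NAPI_DELETED };

struct toy_napi { enum toy_napi_state state; };

/* Disable the instance. Disabling something already deleted is the misuse
 * from the report, so the model flags it with -1. */
static int toy_napi_disable(struct toy_napi *n)
{
	assert(n != NULL);
	if (n->state == TOY_NAPI_DELETED)
		return -1;
	n->state = TOY_NAPI_DISABLED;
	return 0;
}

/* Delete the instance. In this toy model deletion is only accepted once
 * the instance has been disabled. */
static int toy_napi_del(struct toy_napi *n)
{
	assert(n != NULL);
	if (n->state != TOY_NAPI_DISABLED)
		return -1;
	n->state = TOY_NAPI_DELETED;
	return 0;
}

/* Teardown in the order the fix describes: disable, then delete. */
static int toy_teardown(struct toy_napi *n)
{
	if (toy_napi_disable(n))
		return -1;
	return toy_napi_del(n);
}
```

In the model, `toy_teardown()` succeeds on an enabled instance, while a later `toy_napi_disable()` on the same instance is rejected, mirroring the second call chain in the report.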

The current DP driver implementation adds the fail safe mode in dp_hpd_plug_handle(), which is expected to be executed in event thread context. However, a circular locking dependency is possible (see the stack trace below) now that the eDP driver calls dp_hpd_plug_handle() from dp_bridge_enable(), which is executed in drm_thread context.

After reviewing all possible methods, and as discussed on https://patchwork.freedesktop.org/patch/483155/, supporting EDID compliance tests in the driver is quite hacky. As seen with other vendor drivers, supporting these will be much easier with IGT. Hence remove all the related fail safe code so that no circular locking can happen.

Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>

======================================================
WARNING: possible circular locking dependency detected
5.15.35-lockdep #6 Tainted: G W
------------------------------------------------------
frecon/429 is trying to acquire lock:
ffffff808dc3c4e8 (&dev->mode_config.mutex){+.+.}-{3:3}, at: dp_panel_add_fail_safe_mode+0x4c/0xa0

but task is already holding lock:
ffffff808dc441e0 (&kms->commit_lock[i]){+.+.}-{3:3}, at: lock_crtcs+0xb4/0x124

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is: -> #3 (&kms->commit_lock[i]){+.+.}-{3:3}: __mutex_lock_common+0x174/0x1a64 mutex_lock_nested+0x98/0xac lock_crtcs+0xb4/0x124 msm_atomic_commit_tail+0x330/0x748 commit_tail+0x19c/0x278 drm_atomic_helper_commit+0x1dc/0x1f0 drm_atomic_commit+0xc0/0xd8 drm_atomic_helper_set_config+0xb4/0x134 drm_mode_setcrtc+0x688/0x1248 drm_ioctl_kernel+0x1e4/0x338 drm_ioctl+0x3a4/0x684 __arm64_sys_ioctl+0x118/0x154 invoke_syscall+0x78/0x224 el0_svc_common+0x178/0x200 do_el0_svc+0x94/0x13c el0_svc+0x5c/0xec el0t_64_sync_handler+0x78/0x108 el0t_64_sync+0x1a4/0x1a8 -> #2 (crtc_ww_class_mutex){+.+.}-{3:3}: __mutex_lock_common+0x174/0x1a64 ww_mutex_lock+0xb8/0x278 modeset_lock+0x304/0x4ac drm_modeset_lock+0x4c/0x7c drmm_mode_config_init+0x4a8/0xc50 msm_drm_init+0x274/0xac0 msm_drm_bind+0x20/0x2c try_to_bring_up_master+0x3dc/0x470 __component_add+0x18c/0x3c0 component_add+0x1c/0x28 dp_display_probe+0x954/0xa98 platform_probe+0x124/0x15c really_probe+0x1b0/0x5f8 __driver_probe_device+0x174/0x20c driver_probe_device+0x70/0x134 __device_attach_driver+0x130/0x1d0 bus_for_each_drv+0xfc/0x14c __device_attach+0x1bc/0x2bc device_initial_probe+0x1c/0x28 bus_probe_device+0x94/0x178 deferred_probe_work_func+0x1a4/0x1f0 process_one_work+0x5d4/0x9dc worker_thread+0x898/0xccc kthread+0x2d4/0x3d4 ret_from_fork+0x10/0x20 -> #1 (crtc_ww_class_acquire){+.+.}-{0:0}: ww_acquire_init+0x1c4/0x2c8 drm_modeset_acquire_init+0x44/0xc8 drm_helper_probe_single_connector_modes+0xb0/0x12dc drm_mode_getconnector+0x5dc/0xfe8 drm_ioctl_kernel+0x1e4/0x338 drm_ioctl+0x3a4/0x684 __arm64_sys_ioctl+0x118/0x154 invoke_syscall+0x78/0x224 el0_svc_common+0x178/0x200 do_el0_svc+0x94/0x13c el0_svc+0x5c/0xec el0t_64_sync_handler+0x78/0x108 el0t_64_sync+0x1a4/0x1a8 -> #0 (&dev->mode_config.mutex){+.+.}-{3:3}: __lock_acquire+0x2650/0x672c lock_acquire+0x1b4/0x4ac __mutex_lock_common+0x174/0x1a64 mutex_lock_nested+0x98/0xac dp_panel_add_fail_safe_mode+0x4c/0xa0 
dp_hpd_plug_handle+0x1f0/0x280 dp_bridge_enable+0x94/0x2b8 drm_atomic_bridge_chain_enable+0x11c/0x168 drm_atomic_helper_commit_modeset_enables+0x500/0x740 msm_atomic_commit_tail+0x3e4/0x748 commit_tail+0x19c/0x278 drm_atomic_helper_commit+0x1dc/0x1f0 drm_atomic_commit+0xc0/0xd8 drm_atomic_helper_set_config+0xb4/0x134 drm_mode_setcrtc+0x688/0x1248 drm_ioctl_kernel+0x1e4/0x338 drm_ioctl+0x3a4/0x684 __arm64_sys_ioctl+0x118/0x154 invoke_syscall+0x78/0x224 el0_svc_common+0x178/0x200 do_el0_svc+0x94/0x13c el0_svc+0x5c/0xec el0t_64_sync_handler+0x78/0x108 el0t_64_sync+0x1a4/0x1a8 Changes in v2: -- re text commit title -- remove all fail safe mode Changes in v3: -- remove dp_panel_add_fail_safe_mode() from dp_panel.h -- add Fixes Changes in v5: -- to=dianders@chromium.org Changes in v6: -- fix Fixes commit ID Fixes: 8b2c181 ("drm/msm/dp: add fail safe mode outside of event_mutex context") Reported-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Kuogee Hsieh <quic_khsieh@quicinc.com> Link: https://lore.kernel.org/r/1651007534-31842-1-git-send-email-quic_khsieh@quicinc.com Signed-off-by: Rob Clark <robdclark@chromium.org>

A recent commit that modified the fib route event handler to handle events according to their priority introduced a use-after-free[0] in mp->mfi pointer usage. The pointer is now not just cached in order to be compared to following fib_info instances, but is also dereferenced to obtain fib_priority. However, since the mlx5 lag code doesn't hold a reference to fib_info during the whole mp->mfi lifetime, the pointer could be used after the fib_info instance has already been freed by kernel infrastructure code.

Don't ever dereference the mp->mfi pointer. Refactor it to be of 'const void*' type and cache the fib_info priority in a dedicated integer. Group fib_info-related data into a dedicated 'fib' structure that will be further extended by following patches in the series.

[0]: [ 203.588029] ================================================================== [ 203.590161] BUG: KASAN: use-after-free in mlx5_lag_fib_update+0xabd/0xd60 [mlx5_core] [ 203.592386] Read of size 4 at addr ffff888144df2050 by task kworker/u20:4/138 [ 203.594766] CPU: 3 PID: 138 Comm: kworker/u20:4 Tainted: G B 5.17.0-rc7+ #6 [ 203.596751] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 203.598813] Workqueue: mlx5_lag_mp mlx5_lag_fib_update [mlx5_core] [ 203.600053] Call Trace: [ 203.600608] <TASK> [ 203.601110] dump_stack_lvl+0x48/0x5e [ 203.601860] print_address_description.constprop.0+0x1f/0x160 [ 203.602950] ? mlx5_lag_fib_update+0xabd/0xd60 [mlx5_core] [ 203.604073] ? mlx5_lag_fib_update+0xabd/0xd60 [mlx5_core] [ 203.605177] kasan_report.cold+0x83/0xdf [ 203.605969] ? mlx5_lag_fib_update+0xabd/0xd60 [mlx5_core] [ 203.607102] mlx5_lag_fib_update+0xabd/0xd60 [mlx5_core] [ 203.608199] ? mlx5_lag_init_fib_work+0x1c0/0x1c0 [mlx5_core] [ 203.609382] ? read_word_at_a_time+0xe/0x20 [ 203.610463] ? strscpy+0xa0/0x2a0 [ 203.611463] process_one_work+0x722/0x1270 [ 203.612344] worker_thread+0x540/0x11e0 [ 203.613136] ? 
rescuer_thread+0xd50/0xd50 [ 203.613949] kthread+0x26e/0x300 [ 203.614627] ? kthread_complete_and_exit+0x20/0x20 [ 203.615542] ret_from_fork+0x1f/0x30 [ 203.616273] </TASK> [ 203.617174] Allocated by task 3746: [ 203.617874] kasan_save_stack+0x1e/0x40 [ 203.618644] __kasan_kmalloc+0x81/0xa0 [ 203.619394] fib_create_info+0xb41/0x3c50 [ 203.620213] fib_table_insert+0x190/0x1ff0 [ 203.621020] fib_magic.isra.0+0x246/0x2e0 [ 203.621803] fib_add_ifaddr+0x19f/0x670 [ 203.622563] fib_inetaddr_event+0x13f/0x270 [ 203.623377] blocking_notifier_call_chain+0xd4/0x130 [ 203.624355] __inet_insert_ifa+0x641/0xb20 [ 203.625185] inet_rtm_newaddr+0xc3d/0x16a0 [ 203.626009] rtnetlink_rcv_msg+0x309/0x880 [ 203.626826] netlink_rcv_skb+0x11d/0x340 [ 203.627626] netlink_unicast+0x4cc/0x790 [ 203.628430] netlink_sendmsg+0x762/0xc00 [ 203.629230] sock_sendmsg+0xb2/0xe0 [ 203.629955] ____sys_sendmsg+0x58a/0x770 [ 203.630756] ___sys_sendmsg+0xd8/0x160 [ 203.631523] __sys_sendmsg+0xb7/0x140 [ 203.632294] do_syscall_64+0x35/0x80 [ 203.633045] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 203.634427] Freed by task 0: [ 203.635063] kasan_save_stack+0x1e/0x40 [ 203.635844] kasan_set_track+0x21/0x30 [ 203.636618] kasan_set_free_info+0x20/0x30 [ 203.637450] __kasan_slab_free+0xfc/0x140 [ 203.638271] kfree+0x94/0x3b0 [ 203.638903] rcu_core+0x5e4/0x1990 [ 203.639640] __do_softirq+0x1ba/0x5d3 [ 203.640828] Last potentially related work creation: [ 203.641785] kasan_save_stack+0x1e/0x40 [ 203.642571] __kasan_record_aux_stack+0x9f/0xb0 [ 203.643478] call_rcu+0x88/0x9c0 [ 203.644178] fib_release_info+0x539/0x750 [ 203.644997] fib_table_delete+0x659/0xb80 [ 203.645809] fib_magic.isra.0+0x1a3/0x2e0 [ 203.646617] fib_del_ifaddr+0x93f/0x1300 [ 203.647415] fib_inetaddr_event+0x9f/0x270 [ 203.648251] blocking_notifier_call_chain+0xd4/0x130 [ 203.649225] __inet_del_ifa+0x474/0xc10 [ 203.650016] devinet_ioctl+0x781/0x17f0 [ 203.650788] inet_ioctl+0x1ad/0x290 [ 203.651533] sock_do_ioctl+0xce/0x1c0 [ 203.652315] 
sock_ioctl+0x27b/0x4f0 [ 203.653058] __x64_sys_ioctl+0x124/0x190 [ 203.653850] do_syscall_64+0x35/0x80 [ 203.654608] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 203.666952] The buggy address belongs to the object at ffff888144df2000 which belongs to the cache kmalloc-256 of size 256 [ 203.669250] The buggy address is located 80 bytes inside of 256-byte region [ffff888144df2000, ffff888144df2100) [ 203.671332] The buggy address belongs to the page: [ 203.672273] page:00000000bf6c9314 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x144df0 [ 203.674009] head:00000000bf6c9314 order:2 compound_mapcount:0 compound_pincount:0 [ 203.675422] flags: 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff) [ 203.676819] raw: 002ffff800010200 0000000000000000 dead000000000122 ffff888100042b40 [ 203.678384] raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000 [ 203.679928] page dumped because: kasan: bad access detected [ 203.681455] Memory state around the buggy address: [ 203.682421] ffff888144df1f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 203.683863] ffff888144df1f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 203.685310] >ffff888144df2000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 203.686701] ^ [ 203.687820] ffff888144df2080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 203.689226] ffff888144df2100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 203.690620] ================================================================== Fixes: ad11c4f ("net/mlx5e: Lag, Only handle events from highest priority multipath entry") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
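
The idea of keeping a pointer only as an identity and caching the one field that is needed can be sketched in userspace C roughly as follows. This is a minimal model, not the mlx5 code: `fake_fib_info`, `lag_fib_cache`, `lag_fib_set` and `lag_fib_should_ignore` are hypothetical names, and the comparison rule (a larger or equal priority value loses) is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct fib_info. */
struct fake_fib_info {
	unsigned int fib_priority;
};

/* Cached route state: the pointer is kept purely as an opaque identity. */
struct lag_fib_cache {
	const void *mfi;       /* only compared, never dereferenced */
	unsigned int priority; /* copied while the object was known to be alive */
};

/* Record a route while the caller still guarantees that 'fi' is valid. */
static void lag_fib_set(struct lag_fib_cache *c, const struct fake_fib_info *fi)
{
	c->mfi = fi;
	c->priority = fi->fib_priority;
}

/*
 * Assumed rule for this sketch: ignore an event for a different route
 * whose priority value is not smaller than the cached one. Only the
 * cached integer is read; the stale pointer is never followed.
 */
static bool lag_fib_should_ignore(const struct lag_fib_cache *c,
				  const struct fake_fib_info *fi)
{
	return c->mfi && c->mfi != fi && fi->fib_priority >= c->priority;
}
```

Because the cache holds a copy of the priority, the original object can be freed without making later comparisons unsafe.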

Ido Schimmel says:

====================
mlxsw: Various updates

Patches #1-#3 add missing topology diagrams in selftests and perform small cleanups.

Patches #4-#5 make small adjustments in QoS configuration. See detailed description in the commit messages.

Patches #6-#8 reduce the number of background EMAD transactions. The driver periodically queries the device (via EMAD transactions) about updates that cannot happen in certain situations. This can negatively impact the latency of time critical transactions, as the device is busy processing other transactions.

Before:

 # perf stat -a -e devlink:devlink_hwmsg -- sleep 10

 Performance counter stats for 'system wide':

 452 devlink:devlink_hwmsg

 10.009736160 seconds time elapsed

After:

 # perf stat -a -e devlink:devlink_hwmsg -- sleep 10

 Performance counter stats for 'system wide':

 0 devlink:devlink_hwmsg

 10.001726333 seconds time elapsed
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Ido Schimmel says:

====================
mlxsw: A dedicated notifier block for router code

Petr says:

Currently all netdevice events are handled in the centralized notifier handler maintained by spectrum.c. Since a number of events involve router code, spectrum.c needs to dispatch them to spectrum_router.c. The spectrum module therefore needs to know more about the router code than it should, and there are several API points through which the two modules communicate.

In this patchset, move the bulk of the router-related event handling to the router code. Some of the knowledge has to stay: spectrum.c cannot veto events that the router supports, and vice versa. But beyond that, the two can ignore each other's details, which leads to more focused and simpler code.

As a side effect, this fixes L3 HW stats support on tunnel netdevices.

The patch set progresses as follows:

- In patch #1, change spectrum code to not bounce L3 enslavement, which the router code supports.
- In patch #2, add a new do-nothing notifier block to the router code.
- In patches #3-#6, move router-specific event handling to the router module. In patch #7, clean up a comment.
- In patch #8, use the advantage that all router event handling is in the router code and clean up taking router lock.
- mlxsw supports L3 HW stats on tunnels as of this patchset. Patches #9 and #10 therefore add a selftest for L3 HW stats support on tunnels.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Guangbin Huang says: ==================== net: hns3: updates for -next This series includes some updates for the HNS3 ethernet driver. Change logs: V1 -> V2: - Fix some sparse warnings of patch 3# and 4#. - Add patch #6 to fix sparse warnings of incorrect type of argument. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>

Do not allow writing timestamps on RX rings while the PF is being configured. When the PF is being configured, RX rings can be freed or rebuilt. If timestamps are updated at the same time, the kernel will crash by dereferencing a null RX ring pointer.

PID: 1449 TASK: ff187d28ed658040 CPU: 34 COMMAND: "ice-ptp-0000:51"
 #0 [ff1966a94a713bb0] machine_kexec at ffffffff9d05a0be
 #1 [ff1966a94a713c08] __crash_kexec at ffffffff9d192e9d
 #2 [ff1966a94a713cd0] crash_kexec at ffffffff9d1941bd
 #3 [ff1966a94a713ce8] oops_end at ffffffff9d01bd54
 #4 [ff1966a94a713d08] no_context at ffffffff9d06bda4
 #5 [ff1966a94a713d60] __bad_area_nosemaphore at ffffffff9d06c10c
 #6 [ff1966a94a713da8] do_page_fault at ffffffff9d06cae4
 #7 [ff1966a94a713de0] page_fault at ffffffff9da0107e
    [exception RIP: ice_ptp_update_cached_phctime+91]
    RIP: ffffffffc076db8b RSP: ff1966a94a713e98 RFLAGS: 00010246
    RAX: 16e3db9c6b7ccae4 RBX: ff187d269dd3c180 RCX: ff187d269cd4d018
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ff187d269cfcc644 R8: ff187d339b9641b0 R9: 0000000000000000
    R10: 0000000000000002 R11: 0000000000000000 R12: ff187d269cfcc648
    R13: ffffffff9f128784 R14: ffffffff9d101b70 R15: ff187d269cfcc640
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #8 [ff1966a94a713ea0] ice_ptp_periodic_work at ffffffffc076dbef [ice]
 #9 [ff1966a94a713ee0] kthread_worker_fn at ffffffff9d101c1b
 #10 [ff1966a94a713f10] kthread at ffffffff9d101b4d
 #11 [ff1966a94a713f50] ret_from_fork at ffffffff9da0023f

Fixes: 77a7811 ("ice: enable receive hardware timestamping")
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Michal Schmidt <mschmidt@redhat.com>
Tested-by: Dave Cain <dcain@redhat.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
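
The guard described above (skip the timestamp update while rings may be torn down, and tolerate missing rings) might look like this in a simplified, single-threaded userspace model. All names here (`pf_model`, `cfg_busy`, `update_cached_time`) are hypothetical; the real driver needs proper atomic state bits, which this sketch does not attempt to reproduce.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_RINGS 4

/* Hypothetical RX ring carrying a cached clock value. */
struct rx_ring {
	uint64_t cached_phctime;
};

/* Hypothetical PF state; cfg_busy models "configuration in progress". */
struct pf_model {
	bool cfg_busy;
	struct rx_ring *rings[NUM_RINGS]; /* entries may be NULL */
};

/* Returns the number of rings updated; 0 when the update is skipped. */
static int update_cached_time(struct pf_model *pf, uint64_t now)
{
	int updated = 0;

	if (pf->cfg_busy)
		return 0; /* rings may be freed or rebuilt right now */

	for (int i = 0; i < NUM_RINGS; i++) {
		if (!pf->rings[i])
			continue; /* never dereference a missing ring */
		pf->rings[i]->cached_phctime = now;
		updated++;
	}
	return updated;
}
```

A skipped update is harmless in this model because the periodic worker simply tries again on its next run.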

Ido Schimmel says: ==================== mlxsw: Add PTP support for Spectrum-2 and newer ASICs This patchset adds PTP support for Spectrum-{2,3,4} switch ASICs. They all act largely the same with respect to PTP except for a workaround implemented for Spectrum-{2,3} in patch #6. Spectrum-2 and newer ASICs essentially implement a transparent clock between all the switch ports, including the CPU port. The hardware will generate the UTC time stamp for transmitted / received packets at the CPU port, but will compensate for forwarding delays in the ASIC by adjusting the correction field in the PTP header (for PTP events) at the ingress and egress ports. Specifically, the hardware will subtract the current time stamp from the correction field at the ingress port and will add the current time stamp to the correction field at the egress port. For the purpose of an ordinary or boundary clock (this patchset), the correction field will always be adjusted between the CPU port and one of the front panel ports, but never between two front panel ports. Patchset overview: Patch #1 extracts a helper to configure traps for PTP packets (event and general messages). The helper is shared between all Spectrum generations. Patch #2 transitions Spectrum-2 and newer ASICs to use a different format of Tx completions that includes the UTC time stamp of transmitted packets. Patch #3 adds basic initialization required for Spectrum-2 PTP support. It mainly invokes the helper from patch #1. Patch #4 adds helpers to read the UTC time (seconds and nanoseconds) from the device over memory-mapped I/O instead of going through firmware which is slower and therefore inaccurate. The helpers will be used to implement various PHC operations (e.g., gettimex64) and to construct the full UTC time stamp from the truncated one reported over Tx / Rx completions. Patch #5 implements the various PHC operations. Patch #6 implements the previously described workaround for Spectrum-{2,3}. 
Patch #7 adds the ability to report a hardware time stamp for a received / transmitted packet based off the associated Rx / Tx completion that includes a truncated UTC time stamp. Patches #8 and #9 implement support for the SIOCGHWTSTAMP / SIOCSHWTSTAMP ioctls and the get_ts_info ethtool callback, respectively. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
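
The correction field arithmetic described in this cover letter (subtract the time stamp at ingress, add it at egress) can be illustrated with a tiny model. The function names are hypothetical, and the correction field is treated as plain nanoseconds; the real PTP field is scaled by 2^16, which is ignored here for readability.

```c
#include <stdint.h>

/* Ingress port: subtract the current time stamp from the correction. */
static int64_t at_ingress(int64_t correction_ns, int64_t now_ns)
{
	return correction_ns - now_ns;
}

/* Egress port: add the current time stamp to the correction. */
static int64_t at_egress(int64_t correction_ns, int64_t now_ns)
{
	return correction_ns + now_ns;
}
```

Applying both steps leaves the original correction plus the time the packet spent inside the device: a packet entering at t = 1000 ns and leaving at t = 1750 ns with an initial correction of 100 ns ends up with 100 - 1000 + 1750 = 850 ns.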

Commit 6930bcb dropped the setting of the file_lock range when decoding a nlm_lock off the wire. This causes the client side grant callback to miss matching blocks and reject the lock, only to rerequest it 30s later. Add a helper function to set the file_lock range from the start and end values that the protocol uses, and have the nlm_lock decoder call that to set up the file_lock args properly. Fixes: 6930bcb ("lockd: detect and reject lock arguments that overflow") Reported-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Tested-by: Amir Goldstein <amir73il@gmail.com> Cc: stable@vger.kernel.org #6.0 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
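
A helper of the kind described above could be sketched as follows, assuming the wire format carries an offset and a length (as NLM does) and that a length of zero means "to end of file". `lock_range`, `set_lock_range` and `MODEL_OFFSET_MAX` are hypothetical names for this userspace model; it is not the kernel helper itself.

```c
#include <stdint.h>

/* Stand-in for the kernel's OFFSET_MAX in this userspace model. */
#define MODEL_OFFSET_MAX INT64_MAX

/* Hypothetical reduced form of the start/end pair kept in a file_lock. */
struct lock_range {
	int64_t start;
	int64_t end;
};

/*
 * Convert a wire (offset, length) pair to an inclusive [start, end] range.
 * Assumes the caller has already rejected arguments that overflow, so
 * 'off' fits in int64_t and off + len does not wrap.
 */
static void set_lock_range(struct lock_range *r, uint64_t off, uint64_t len)
{
	uint64_t end = off + len - 1;

	r->start = (int64_t)off;
	if (len == 0 || end > (uint64_t)MODEL_OFFSET_MAX)
		r->end = MODEL_OFFSET_MAX; /* lock extends to end of file */
	else
		r->end = (int64_t)end;
}
```

If the decoder skips this step, the range stays unset and a later comparison against the blocked lock cannot match, which is the symptom the commit message describes.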

When a system with E810 with existing VFs gets rebooted the following hang may be observed. Pid 1 is hung in iavf_remove(), part of a network driver: PID: 1 TASK: ffff965400e5a340 CPU: 24 COMMAND: "systemd-shutdow" #0 [ffffaad04005fa50] __schedule at ffffffff8b3239cb #1 [ffffaad04005fae8] schedule at ffffffff8b323e2d #2 [ffffaad04005fb00] schedule_hrtimeout_range_clock at ffffffff8b32cebc #3 [ffffaad04005fb80] usleep_range_state at ffffffff8b32c930 #4 [ffffaad04005fbb0] iavf_remove at ffffffffc12b9b4c [iavf] #5 [ffffaad04005fbf0] pci_device_remove at ffffffff8add7513 #6 [ffffaad04005fc10] device_release_driver_internal at ffffffff8af08baa #7 [ffffaad04005fc40] pci_stop_bus_device at ffffffff8adcc5fc #8 [ffffaad04005fc60] pci_stop_and_remove_bus_device at ffffffff8adcc81e #9 [ffffaad04005fc70] pci_iov_remove_virtfn at ffffffff8adf9429 #10 [ffffaad04005fca8] sriov_disable at ffffffff8adf98e4 #11 [ffffaad04005fcc8] ice_free_vfs at ffffffffc04bb2c8 [ice] #12 [ffffaad04005fd10] ice_remove at ffffffffc04778fe [ice] #13 [ffffaad04005fd38] ice_shutdown at ffffffffc0477946 [ice] #14 [ffffaad04005fd50] pci_device_shutdown at ffffffff8add58f1 #15 [ffffaad04005fd70] device_shutdown at ffffffff8af05386 #16 [ffffaad04005fd98] kernel_restart at ffffffff8a92a870 #17 [ffffaad04005fda8] __do_sys_reboot at ffffffff8a92abd6 #18 [ffffaad04005fee0] do_syscall_64 at ffffffff8b317159 #19 [ffffaad04005ff08] __context_tracking_enter at ffffffff8b31b6fc #20 [ffffaad04005ff18] syscall_exit_to_user_mode at ffffffff8b31b50d #21 [ffffaad04005ff28] do_syscall_64 at ffffffff8b317169 #22 [ffffaad04005ff50] entry_SYSCALL_64_after_hwframe at ffffffff8b40009b RIP: 00007f1baa5c13d7 RSP: 00007fffbcc55a98 RFLAGS: 00000202 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1baa5c13d7 RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead RBP: 00007fffbcc55ca0 R8: 0000000000000000 R9: 00007fffbcc54e90 R10: 00007fffbcc55050 R11: 0000000000000202 R12: 0000000000000005 R13: 0000000000000000 
R14: 00007fffbcc55af0 R15: 0000000000000000 ORIG_RAX: 00000000000000a9 CS: 0033 SS: 002b During reboot all drivers PM shutdown callbacks are invoked. In iavf_shutdown() the adapter state is changed to __IAVF_REMOVE. In ice_shutdown() the call chain above is executed, which at some point calls iavf_remove(). However iavf_remove() expects the VF to be in one of the states __IAVF_RUNNING, __IAVF_DOWN or __IAVF_INIT_FAILED. If that's not the case it sleeps forever. So if iavf_shutdown() gets invoked before iavf_remove() the system will hang indefinitely because the adapter is already in state __IAVF_REMOVE. Fix this by returning from iavf_remove() if the state is __IAVF_REMOVE, as we already went through iavf_shutdown(). Fixes: 9745780 ("iavf: Add waiting so the port is initialized in remove") Fixes: a841733 ("iavf: Fix race condition between iavf_shutdown and iavf_remove") Reported-by: Marius Cornea <mcornea@redhat.com> Signed-off-by: Stefan Assmann <sassmann@kpanic.de> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
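
The ordering problem and the early return can be modelled with a small state machine. This is only a sketch: `vf_state`, `vf_adapter`, `vf_shutdown` and `vf_remove` are hypothetical names, and the wait loop of the real driver is replaced by a comment.

```c
#include <stdbool.h>

/* Hypothetical subset of adapter states. */
enum vf_state { VF_INIT_FAILED, VF_DOWN, VF_RUNNING, VF_RESETTING, VF_REMOVE };

struct vf_adapter {
	enum vf_state state;
	int teardown_runs; /* counts how often the full teardown ran */
};

/* Shutdown callback: marks the adapter as being removed. */
static void vf_shutdown(struct vf_adapter *a)
{
	a->state = VF_REMOVE;
}

/* Returns true if teardown ran, false if it was skipped. */
static bool vf_remove(struct vf_adapter *a)
{
	if (a->state == VF_REMOVE)
		return false; /* shutdown already ran; waiting would never end */

	/*
	 * The real driver waits here until the state is RUNNING, DOWN or
	 * INIT_FAILED. That wait is omitted in this model.
	 */
	a->state = VF_REMOVE;
	a->teardown_runs++;
	return true;
}
```

Without the first check, a remove call arriving after shutdown would wait for a state that can no longer be reached.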

In xfs_buffered_write_iomap_begin, @icur is the iext cursor for the data fork and @ccur is the cursor for the cow fork. Pass in whichever cursor corresponds to allocfork, because otherwise the xfs_iext_prev_extent call can use the data fork cursor to walk off the end of the cow fork structure. Best case it returns the wrong results, worst case it does this:

stack segment: 0000 [#1] PREEMPT SMP
CPU: 2 PID: 3141909 Comm: fsstress Tainted: G W 6.3.0-rc2-xfsx #6.3.0-rc2 7bf5cc2e98997627cae5c930d890aba3aeec65dd
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
RIP: 0010:xfs_iext_prev+0x71/0x150 [xfs]
RSP: 0018:ffffc90002233aa8 EFLAGS: 00010297
RAX: 000000000000000f RBX: 000000000000000e RCX: 000000000000000c
RDX: 0000000000000002 RSI: 000000000000000e RDI: ffff8883d0019ba0
RBP: 989642409af8a7a7 R08: ffffea0000000001 R09: 0000000000000002
R10: 0000000000000000 R11: 000000000000000c R12: ffffc90002233b00
R13: ffff8883d0019ba0 R14: 989642409af8a6bf R15: 000ffffffffe0000
FS: 00007fdf8115f740(0000) GS:ffff88843fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fdf8115e000 CR3: 0000000357256000 CR4: 00000000003506e0
Call Trace:
 <TASK>
 xfs_iomap_prealloc_size.constprop.0.isra.0+0x1a6/0x410 [xfs 619a268fb2406d68bd34e007a816b27e70abc22c]
 xfs_buffered_write_iomap_begin+0xa87/0xc60 [xfs 619a268fb2406d68bd34e007a816b27e70abc22c]
 iomap_iter+0x132/0x2f0
 iomap_file_buffered_write+0x92/0x330
 xfs_file_buffered_write+0xb1/0x330 [xfs 619a268fb2406d68bd34e007a816b27e70abc22c]
 vfs_write+0x2eb/0x410
 ksys_write+0x65/0xe0
 do_syscall_64+0x2b/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

Found by xfs/538 in alwayscow mode, but this doesn't seem particular to that test.

Fixes: 590b165 ("xfs: refactor xfs_iomap_prealloc_size")
Actually-Fixes: 66ae56a ("xfs: introduce an always_cow mode")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
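
Why a cursor must be paired with its own fork can be shown with a toy model in which each fork is just an extent count and a cursor is an index. Everything here (`fork_model`, `ext_cursor`, `cursor_for_fork`, `prev_extent`) is hypothetical and far simpler than the real iext structures.

```c
#include <stddef.h>

enum fork_kind { FORK_DATA, FORK_COW };

/* Toy fork: only the number of extents matters for this sketch. */
struct fork_model {
	int nextents;
};

/* Toy cursor: a position that is only meaningful for one fork. */
struct ext_cursor {
	int pos;
};

/* Pick the cursor that belongs to the fork being allocated into. */
static struct ext_cursor *cursor_for_fork(enum fork_kind allocfork,
					  struct ext_cursor *icur,
					  struct ext_cursor *ccur)
{
	return allocfork == FORK_COW ? ccur : icur;
}

/*
 * Index of the previous extent, or -1 when the cursor does not point
 * into this fork. The model checks bounds so the mismatch is visible
 * instead of walking off the end of the structure.
 */
static int prev_extent(const struct fork_model *f, const struct ext_cursor *cur)
{
	if (cur->pos <= 0 || cur->pos > f->nextents)
		return -1;
	return cur->pos - 1;
}
```

With a data fork cursor at position 14 and a cow fork holding two extents, using the data cursor on the cow fork is out of range, while the cow cursor yields a valid index.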

Andrii Nakryiko says: ==================== Add support for open-coded (aka inline) iterators in BPF world. This is a next evolution of gradually allowing more powerful and less restrictive looping and iteration capabilities to BPF programs. We set up a framework for implementing all kinds of iterators (e.g., cgroup, task, file, etc, iterators), but this patch set only implements numbers iterator, which is used to implement ergonomic bpf_for() for-like construct (see patches #4-#5). We also add bpf_for_each(), which is a generic foreach-like construct that will work with any kind of open-coded iterator implementation, as long as we stick with bpf_iter_<type>_{new,next,destroy}() naming pattern (which we now enforce on the kernel side). Patch #1 is preparatory refactoring for easier way to check for special kfunc calls. Patch #2 is adding iterator kfunc registration and validation logic, which is mostly independent from the rest of open-coded iterator logic, so is separated out for easier reviewing. The meat of verifier-side logic is in patch #3. Patch #4 implements numbers iterator. I kept them separate to have clean reference for how to integrate new iterator types (now even simpler to do than in v1 of this patch set). Patch #5 adds bpf_for(), bpf_for_each(), and bpf_repeat() macros to bpf_misc.h, and also adds yet another pyperf test variant, now with bpf_for() loop. Patch #6 is verification tests, based on numbers iterator (as the only available right now). Patch #7 actually tests runtime behavior of numbers iterator. Finally, with changes in v2, it's possible and trivial to implement custom iterators completely in kernel modules, which we showcase and test by adding a simple iterator returning same number a given number of times to bpf_testmod. Patch #8 is where all this happens and is tested. Most of the relevant details are in corresponding commit messages or code comments. 
v4->v5:
  - fixing missed inner for() in is_iter_reg_valid_uninit, and fixed return false (kernel test robot);
  - typo fixes and comment/commit description improvements throughout the patch set;
v3->v4:
  - remove unused variable from is_iter_reg_valid_init (kernel test robot);
v2->v3:
  - remove special kfunc leftovers for bpf_iter_num_{new,next,destroy};
  - add iters/testmod_seq* to DENYLIST.s390x, it doesn't support kfuncs in modules yet (CI);
v1->v2:
  - rebased on latest, dropping previously landed preparatory patches;
  - each iterator type now has its own `struct bpf_iter_<type>` which allows each iterator implementation to use exactly as much stack space as necessary, allowing to avoid runtime allocations (Alexei);
  - reworked how iterator kfuncs are defined, no verifier changes are required when adding new iterator type;
  - added bpf_testmod-based iterator implementation;
  - address the rest of feedback, comments, commit message adjustment, etc.

Cc: Tejun Heo <tj@kernel.org>
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
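
The new/next/destroy shape that the cover letter describes can be modelled in ordinary userspace C. This is not BPF code and does not use the kernel kfuncs: `num_iter`, `num_iter_new`, `num_iter_next`, `num_iter_destroy` and `sum_range` are hypothetical names that only mirror the naming pattern, and the half-open range [start, end) is an assumption of this sketch.

```c
#include <stddef.h>

/* Iterator state lives entirely in the caller's stack frame. */
struct num_iter {
	int next; /* next value to hand out */
	int end;  /* exclusive upper bound */
	int val;  /* storage for the value returned by num_iter_next() */
};

/* Initialise the iterator; an empty iterator is left behind on error. */
static int num_iter_new(struct num_iter *it, int start, int end)
{
	if (start > end) {
		it->next = it->end = 0;
		return -1;
	}
	it->next = start;
	it->end = end;
	return 0;
}

/* Return a pointer to the next value, or NULL when exhausted. */
static int *num_iter_next(struct num_iter *it)
{
	if (it->next >= it->end)
		return NULL;
	it->val = it->next++;
	return &it->val;
}

/* Release the iterator; nothing to free in this model. */
static void num_iter_destroy(struct num_iter *it)
{
	it->next = it->end = 0;
}

/* Usage: sum all integers in [start, end) with the three-call pattern. */
static int sum_range(int start, int end)
{
	struct num_iter it;
	int *v, sum = 0;

	num_iter_new(&it, start, end);
	while ((v = num_iter_next(&it)))
		sum += *v;
	num_iter_destroy(&it);
	return sum;
}
```

Keeping the state in a fixed-size struct on the stack is what lets a loop macro expand to these three calls without any runtime allocation.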

Ido Schimmel says:

====================
bridge: Add per-{Port, VLAN} neighbor suppression

Background
==========

In order to minimize the flooding of ARP and ND messages in the VXLAN network, EVPN includes provisions [1] that allow participating VTEPs to suppress such messages in case they know the MAC-IP binding and can reply on behalf of the remote host.

In Linux, the above is implemented in the bridge driver using a per-port option called "neigh_suppress" that was added in kernel version 4.15 [2].

Motivation
==========

Some applications use ARP messages as keepalives between the application nodes in the network. This works perfectly well when two nodes are connected to the same VTEP. When a node goes down it will stop responding to ARP requests and the other node will notice it immediately. However, when the two nodes are connected to different VTEPs and neighbor suppression is enabled, the local VTEP will reply to ARP requests even after the remote node went down, until certain timers expire and the EVPN control plane decides to withdraw the MAC/IP Advertisement route for the address.

Therefore, some users would like to be able to disable neighbor suppression on VLANs where such applications reside and keep it enabled on the rest.

Implementation
==============

The proposed solution is to allow user space to control neighbor suppression on a per-{Port, VLAN} basis, in a similar fashion to other per-port options that gained per-{Port, VLAN} counterparts such as "mcast_router". This allows users to benefit from the operational simplicity and scalability associated with shared VXLAN devices (i.e., external / collect-metadata mode), while still allowing for per-VLAN/VNI neighbor suppression control.

The user interface is extended with a new "neigh_vlan_suppress" bridge port option that allows user space to enable per-{Port, VLAN} neighbor suppression on the bridge port. When enabled, the existing "neigh_suppress" option has no effect and neighbor suppression is controlled using a new "neigh_suppress" VLAN option.

Example usage:

 # bridge link set dev vxlan0 neigh_vlan_suppress on
 # bridge vlan add vid 10 dev vxlan0
 # bridge vlan set vid 10 dev vxlan0 neigh_suppress on

Testing
=======

Tested using existing bridge selftests. Added a dedicated selftest in the last patch.

Patchset overview
=================

Patches #1-#5 are preparations.

Patch #6 adds per-{Port, VLAN} neighbor suppression support to the bridge's data path.

Patches #7-#8 add the required netlink attributes to enable the feature.

Patch #9 adds a selftest.

iproute2 patches can be found here [3].

Changelog
=========

Since RFC [4]:

No changes.

[1] https://www.rfc-editor.org/rfc/rfc7432#section-10
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a42317785c898c0ed46db45a33b0cc71b671bf29
[3] https://github.com/idosch/iproute2/tree/submit/neigh_suppress_v1
[4] https://lore.kernel.org/netdev/20230413095830.2182382-1-idosch@nvidia.com/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
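
The precedence between the port option and the VLAN option described in this cover letter can be written down as a small decision function. The types and the function name are hypothetical and only model the stated rule; they are not the bridge driver's data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-port options. */
struct port_opts {
	bool neigh_suppress;      /* existing per-port option */
	bool neigh_vlan_suppress; /* new switch to per-{Port, VLAN} mode */
};

/* Hypothetical per-VLAN options on that port. */
struct vlan_opts {
	bool neigh_suppress; /* new per-VLAN option */
};

/*
 * When per-VLAN mode is enabled on the port, the port-wide option has no
 * effect and only the VLAN's own setting counts. A missing VLAN entry is
 * treated as "do not suppress" in this sketch (an assumption).
 */
static bool should_suppress(const struct port_opts *p, const struct vlan_opts *v)
{
	if (p->neigh_vlan_suppress)
		return v && v->neigh_suppress;
	return p->neigh_suppress;
}
```

In terms of the example commands, the first command flips `neigh_vlan_suppress` on the port and the third sets `neigh_suppress` for VLAN 10 only.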

Currently, the per cpu upcall counters are allocated after the vport is created and inserted into the system. This could lead to the datapath accessing the counters before they are allocated resulting in a kernel Oops. Here is an example: PID: 59693 TASK: ffff0005f4f51500 CPU: 0 COMMAND: "ovs-vswitchd" #0 [ffff80000a39b5b0] __switch_to at ffffb70f0629f2f4 #1 [ffff80000a39b5d0] __schedule at ffffb70f0629f5cc #2 [ffff80000a39b650] preempt_schedule_common at ffffb70f0629fa60 #3 [ffff80000a39b670] dynamic_might_resched at ffffb70f0629fb58 #4 [ffff80000a39b680] mutex_lock_killable at ffffb70f062a1388 #5 [ffff80000a39b6a0] pcpu_alloc at ffffb70f0594460c #6 [ffff80000a39b750] __alloc_percpu_gfp at ffffb70f05944e68 #7 [ffff80000a39b760] ovs_vport_cmd_new at ffffb70ee6961b90 [openvswitch] ... PID: 58682 TASK: ffff0005b2f0bf00 CPU: 0 COMMAND: "kworker/0:3" #0 [ffff80000a5d2f40] machine_kexec at ffffb70f056a0758 #1 [ffff80000a5d2f70] __crash_kexec at ffffb70f057e2994 #2 [ffff80000a5d3100] crash_kexec at ffffb70f057e2ad8 #3 [ffff80000a5d3120] die at ffffb70f0628234c #4 [ffff80000a5d31e0] die_kernel_fault at ffffb70f062828a8 #5 [ffff80000a5d3210] __do_kernel_fault at ffffb70f056a31f4 #6 [ffff80000a5d3240] do_bad_area at ffffb70f056a32a4 #7 [ffff80000a5d3260] do_translation_fault at ffffb70f062a9710 #8 [ffff80000a5d3270] do_mem_abort at ffffb70f056a2f74 #9 [ffff80000a5d32a0] el1_abort at ffffb70f06297dac #10 [ffff80000a5d32d0] el1h_64_sync_handler at ffffb70f06299b24 #11 [ffff80000a5d3410] el1h_64_sync at ffffb70f056812dc #12 [ffff80000a5d3430] ovs_dp_upcall at ffffb70ee6963c84 [openvswitch] #13 [ffff80000a5d3470] ovs_dp_process_packet at ffffb70ee6963fdc [openvswitch] #14 [ffff80000a5d34f0] ovs_vport_receive at ffffb70ee6972c78 [openvswitch] #15 [ffff80000a5d36f0] netdev_port_receive at ffffb70ee6973948 [openvswitch] #16 [ffff80000a5d3720] netdev_frame_hook at ffffb70ee6973a28 [openvswitch] #17 [ffff80000a5d3730] __netif_receive_skb_core.constprop.0 at ffffb70f06079f90 We moved 
the per cpu upcall counter allocation to the existing vport alloc and free functions to solve this. Fixes: 95637d9 ("net: openvswitch: release vport resources on failure") Fixes: 1933ea3 ("net: openvswitch: Add support to count upcall packets") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
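
The ordering fix, allocating the counters together with the object so that nothing can observe a port without them, can be sketched in userspace with `calloc`. `vport_model`, `upcall_stats` and the three functions are hypothetical names; per-CPU allocation is replaced by a single counter block.

```c
#include <stdlib.h>

/* Hypothetical counter block (per-CPU in the real code). */
struct upcall_stats {
	unsigned long success;
	unsigned long fail;
};

/* Hypothetical port object. */
struct vport_model {
	struct upcall_stats *stats;
};

/*
 * Allocate the port and its counters in one step. The object is only
 * returned, and so only becomes visible to a datapath, once it is complete.
 */
static struct vport_model *vport_alloc_model(void)
{
	struct vport_model *vp = calloc(1, sizeof(*vp));

	if (!vp)
		return NULL;
	vp->stats = calloc(1, sizeof(*vp->stats));
	if (!vp->stats) {
		free(vp);
		return NULL;
	}
	return vp;
}

/* Free the counters together with the port. */
static void vport_free_model(struct vport_model *vp)
{
	if (!vp)
		return;
	free(vp->stats);
	free(vp);
}

/* Datapath side: safe because stats always exists for a live port. */
static void count_upcall(struct vport_model *vp, int ok)
{
	if (ok)
		vp->stats->success++;
	else
		vp->stats->fail++;
}
```

Tying the counter lifetime to the alloc and free functions also means the error paths release them without extra bookkeeping.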

Petr Machata says: ==================== mlxsw: Cleanups in router code This patchset moves some router-related code from spectrum.c to spectrum_router.c where it should be. It also simplifies handlers of netevent notifications. - Patch #1 caches router pointer in a dedicated variable. This obviates the need to access the same as mlxsw_sp->router, making lines shorter, and permitting a future patch to add code that fits within 80 character limit. - Patch #2 moves IP / IPv6 validation notifier blocks from spectrum.c to spectrum_router, where the handlers are anyway. - In patch #3, pass router pointer to scheduler of deferred work directly, instead of having it deduce it on its own. - This makes the router pointer available in the handler function mlxsw_sp_router_netevent_event(), so in patch #4, use it directly, instead of finding it through mlxsw_sp_port. - In patch #5, extend mlxsw_sp_router_schedule_work() so that the NETEVENT_NEIGH_UPDATE handler can use it directly instead of inlining equivalent code. - In patches #6 and #7, add helpers for two common operations involving a backing netdev of a RIF. This makes it unnecessary for the function mlxsw_sp_rif_dev() to be visible outside of the router module, so in patch #8, hide it. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>

Petr Machata says: ==================== mlxsw: Preparations for out-of-order-operations patches The mlxsw driver currently makes the assumption that the user applies configuration in a bottom-up manner. Thus netdevices need to be added to the bridge before IP addresses are configured on that bridge or SVI added on top of it. Enslaving a netdevice to another netdevice that already has uppers is in fact forbidden by mlxsw for this reason. Despite this safety, it is rather easy to get into situations where the offloaded configuration is just plain wrong. As an example, take a front panel port, configure an IP address: it gets a RIF. Now enslave the port to a bridge, and the RIF is gone. Remove the port from the bridge again, but the RIF never comes back. There is a number of similar situations, where changing the configuration there and back utterly breaks the offload. Over the course of the following several patchsets, mlxsw code is going to be adjusted to diminish the space of wrongly offloaded configurations. Ideally the offload state will reflect the actual state, regardless of the sequence of operation used to construct that state. No functional changes are intended in this patchset yet. Rather the patches prepare the codebase for easier introduction of functional changes in later patchsets. - In patch #1, extract a helper to join a RIF of a given port, if there is one. In patch #2, use it in a newly-added helper to join a LAG interface. - In patches #3, #4 and #5, add helpers that abstract away the rif->dev access. This will make it simpler in the future to change the way the deduction is done. In patch #6, do this for deduction from nexthop group info to RIF. - In patch #7, add a helper to destroy a RIF. So far RIF was destroyed simply by kfree'ing it. - In patch #8, add a helper to check if any IP addresses are configured on a netdevice. This helper will be useful later. - In patch #9, add a helper to migrate a RIF. 
This will be a convenient place to put extensions later on.
- Patch #10 moves IPIP initialization up to make ipip_ops_arr available earlier.
====================

Link: https://lore.kernel.org/r/cover.1686581444.git.petrm@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Petr Machata says: ==================== mlxsw: Maintain candidate RIFs The mlxsw driver currently makes the assumption that the user applies configuration in a bottom-up manner. Thus netdevices need to be added to the bridge before IP addresses are configured on that bridge or SVI added on top of it. Enslaving a netdevice to another netdevice that already has uppers is in fact forbidden by mlxsw for this reason. Despite this safety, it is rather easy to get into situations where the offloaded configuration is just plain wrong. As an example, take a front panel port, configure an IP address: it gets a RIF. Now enslave the port to the bridge, and the RIF is gone. Remove the port from the bridge again, but the RIF never comes back. There is a number of similar situations, where changing the configuration there and back utterly breaks the offload. The situation is going to be made better by implementing a range of replays and post-hoc offloads. This patch set lays the ground for replay of next hops. The particular issue that it deals with is that currently, driver-specific bookkeeping for next hops is hooked off RIF objects, which come and go across the lifetime of a netdevice. We would rather keep these objects at an entity that mirrors the lifetime of the netdevice itself. That way they are at hand and can be offloaded when a RIF is eventually created. To that end, with this patchset, mlxsw keeps a hash table of CRIFs: candidate RIFs, persistent handles for netdevices that mlxsw deems potentially interesting. The lifetime of a CRIF matches that of the underlying netdevice, and thus a RIF can always assume a CRIF exists. A CRIF is where next hops are kept, and when RIF is created, these next hops can be easily offloaded. (Previously only the next hops created after the RIF was created were offloaded.) - Patches #1 and #2 are minor adjustments. - In patches #3 and #4, add CRIF bookkeeping. 
- In patch #5, link CRIFs to RIFs such that given a netdevice-backed RIF, the corresponding CRIF is easy to look up.
- Patch #6 is a clean-up allowed by the previous patches.
- Patches #7 and #8 move next hop tracking to CRIFs.

No observable effects are intended as of yet. This will be useful once there is support for RIF creation for netdevices that become mlxsw uppers, which will come in following patch sets.
====================

Link: https://lore.kernel.org/r/cover.1687438411.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

…tnguy/net-queue Tony Nguyen says: ==================== igc: Fix corner cases for TSN offload Florian Kauer says: The igc driver supports several different offloading capabilities relevant in the TSN context. Recent patches in this area introduced regressions for certain corner cases that are fixed in this series. Each of the patches (except the first one) addresses a different regression that can be separately reproduced. Still, they have overlapping code changes so they should not be separately applied. Especially #4 and #6 address the same observation, but both need to be applied to avoid TX hang occurrences in the scenario described in the patches. ==================== Signed-off-by: Florian Kauer <florian.kauer@linutronix.de> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Reviewed-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>

Petr Machata says: ==================== mlxsw: Manage RIF across PVID changes The mlxsw driver currently makes the assumption that the user applies configuration in a bottom-up manner. Thus netdevices need to be added to the bridge before IP addresses are configured on that bridge or SVI added on top of it. Enslaving a netdevice to another netdevice that already has uppers is in fact forbidden by mlxsw for this reason. Despite this safety, it is rather easy to get into situations where the offloaded configuration is just plain wrong. As an example, take a front panel port, configure an IP address: it gets a RIF. Now enslave the port to the bridge, and the RIF is gone. Remove the port from the bridge again, but the RIF never comes back. There is a number of similar situations, where changing the configuration there and back utterly breaks the offload. The situation is going to be made better by implementing a range of replays and post-hoc offloads. In this patch set, address the ordering issues related to creation of bridge RIFs. Currently, mlxsw has several shortcomings with regards to RIF handling due to PVID changes: - In order to cause RIF for a bridge device to be created, the user is expected first to set PVID, then to add an IP address. The reverse ordering is disallowed, which is not very user-friendly. - When such bridge gets a VLAN upper whose VID was the same as the existing PVID, and this VLAN netdevice gets an IP address, a RIF is created for this netdevice. The new RIF is then assigned to the 802.1Q FID for the given VID. This results in a working configuration. However, then, when the VLAN netdevice is removed again, the RIF for the bridge itself is never reassociated to the PVID. - PVID cannot be changed once the bridge has uppers. Presumably this is because the driver does not manage RIFs properly in face of PVID changes. However, as the previous point shows, it is still possible to get into invalid configurations. 
This patch set addresses these issues and relaxes some of the ordering requirements that mlxsw had. The patch set proceeds as follows: - In patch #1, pass extack to mlxsw_sp_br_ban_rif_pvid_change() - To relax ordering between setting PVID and adding an IP address to a bridge, mlxsw must be able to request that a RIF is created with a given VLAN ID, instead of trying to deduce it from the current netdevice settings, which do not reflect the user-requested values yet. This is done in patches #2 and #3. - Similarly, mlxsw_sp_inetaddr_bridge_event() will need to make decisions based on the user-requested value of PVID, not the current value. Thus in patches #4 and #5, add a new argument which carries the requested PVID value. - Finally in patch #6 relax the ban on PVID changes when a bridge has uppers. Instead, add the logic necessary for creation of a RIF as a result of PVID change. - Relevant selftests are presented afterwards. In patch #7 a preparatory helper is added to lib.sh. Patches #8, #9, #10 and #11 include selftests themselves. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>

…l/git/netfilter/nf Florian Westphal says: ==================== These are netfilter fixes for the *net* tree. First patch resolves a false-positive lockdep splat: rcu_dereference is used outside of rcu read lock. Let lockdep validate that the transaction mutex is locked. Second patch fixes a kdoc warning added in previous PR. Third patch fixes a memory leak: The catchall element isn't disabled correctly, this allows userspace to deactivate the element again. This results in refcount underflow which in turn prevents memory release. This was always broken since the feature was added in 5.13. Patch 4 fixes an incorrect change in the previous pull request: Adding a duplicate key to a set should work if the duplicate key has expired, restore this behaviour. All from myself. Patch #5 resolves an old historic artifact in sctp conntrack: a 300ms timeout for shutdown_ack. Increase this to 3s. From Xin Long. Patch #6 fixes a sysctl data race in ipvs, two threads can clobber the sysctl value, from Sishuai Gong. This is a day-0 bug that predates git history. Patches 7, 8 and 9, from Pablo Neira Ayuso, are also followups for the previous GC rework in nf_tables: The netlink notifier and the netns exit path must both increment the gc worker seqcount, else worker may encounter stale (free'd) pointers. ================ Signed-off-by: David S. Miller <davem@davemloft.net>

Noticed with: make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin Direct leak of 45 byte(s) in 1 object(s) allocated from: #0 0x7f213f87243b in strdup (/lib64/libasan.so.8+0x7243b) #1 0x63d15f in evsel__set_filter util/evsel.c:1371 #2 0x63d15f in evsel__append_filter util/evsel.c:1387 #3 0x63d15f in evsel__append_tp_filter util/evsel.c:1400 #4 0x62cd52 in evlist__append_tp_filter util/evlist.c:1145 #5 0x62cd52 in evlist__append_tp_filter_pids util/evlist.c:1196 #6 0x541e49 in trace__set_filter_loop_pids /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3646 #7 0x541e49 in trace__set_filter_pids /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3670 #8 0x541e49 in trace__run /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3970 #9 0x541e49 in cmd_trace /home/acme/git/perf-tools/tools/perf/builtin-trace.c:5141 #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools/tools/perf/perf.c:323 #11 0x4196da in handle_internal_command /home/acme/git/perf-tools/tools/perf/perf.c:377 #12 0x4196da in run_argv /home/acme/git/perf-tools/tools/perf/perf.c:421 #13 0x4196da in main /home/acme/git/perf-tools/tools/perf/perf.c:537 #14 0x7f213e84a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Free it on evsel__exit(). Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/lkml/20230719202951.534582-2-acme@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

To plug these leaks detected with: $ make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin ================================================================= ==473890==ERROR: LeakSanitizer: detected memory leaks Direct leak of 112 byte(s) in 1 object(s) allocated from: #0 0x7fdf19aba097 in calloc (/lib64/libasan.so.8+0xba097) #1 0x987836 in zalloc (/home/acme/bin/perf+0x987836) #2 0x5367ae in thread_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1289 #3 0x5367ae in thread__trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1307 #4 0x5367ae in trace__sys_exit /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2468 #5 0x52bf34 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3177 #6 0x52bf34 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3685 #7 0x542927 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3712 #8 0x542927 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4055 #9 0x542927 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5141 #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #11 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #12 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 #13 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 #14 0x7fdf18a4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Direct leak of 2048 byte(s) in 1 object(s) allocated from: #0 0x7f788fcba6af in __interceptor_malloc (/lib64/libasan.so.8+0xba6af) #1 0x5337c0 in trace__sys_enter /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2342 #2 0x52bfb4 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3191 #3 0x52bfb4 in __trace__deliver_event 
/home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3699 #4 0x542883 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3726 #5 0x542883 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4069 #6 0x542883 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5155 #7 0x5ef232 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #8 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #9 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 #10 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 #11 0x7f788ec4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Indirect leak of 48 byte(s) in 1 object(s) allocated from: #0 0x7fdf19aba6af in __interceptor_malloc (/lib64/libasan.so.8+0xba6af) #1 0x77b335 in intlist__new util/intlist.c:116 #2 0x5367fd in thread_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1293 #3 0x5367fd in thread__trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1307 #4 0x5367fd in trace__sys_exit /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2468 #5 0x52bf34 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3177 #6 0x52bf34 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3685 #7 0x542927 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3712 #8 0x542927 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4055 #9 0x542927 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5141 #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #11 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #12 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 #13 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 #14 
0x7fdf18a4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/lkml/20230719202951.534582-4-acme@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

In 3cb4d5e ("perf trace: Free syscall tp fields in evsel->priv") it only was freeing if strcmp(evsel->tp_format->system, "syscalls") returned zero, while the corresponding initialization of evsel->priv was being performed if it was _not_ zero, i.e. if the tp system wasn't 'syscalls'. Just stop looking for that and free it if evsel->priv was set, which should be equivalent. Also use the pre-existing evsel_trace__delete() function. This resolves these leaks, detected with: $ make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin ================================================================= ==481565==ERROR: LeakSanitizer: detected memory leaks Direct leak of 40 byte(s) in 1 object(s) allocated from: #0 0x7f7343cba097 in calloc (/lib64/libasan.so.8+0xba097) #1 0x987966 in zalloc (/home/acme/bin/perf+0x987966) #2 0x52f9b9 in evsel_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:307 #3 0x52f9b9 in evsel__syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:333 #4 0x52f9b9 in evsel__init_raw_syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:458 #5 0x52f9b9 in perf_evsel__raw_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:480 #6 0x540e8b in trace__add_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3212 #7 0x540e8b in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3891 #8 0x540e8b in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5156 #9 0x5ef262 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #10 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #11 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 #12 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 #13 0x7f7342c4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) Direct leak of 40 byte(s) in 1 
object(s) allocated from: #0 0x7f7343cba097 in calloc (/lib64/libasan.so.8+0xba097) #1 0x987966 in zalloc (/home/acme/bin/perf+0x987966) #2 0x52f9b9 in evsel_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:307 #3 0x52f9b9 in evsel__syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:333 #4 0x52f9b9 in evsel__init_raw_syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:458 #5 0x52f9b9 in perf_evsel__raw_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:480 #6 0x540dd1 in trace__add_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3205 #7 0x540dd1 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3891 #8 0x540dd1 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5156 #9 0x5ef262 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 #10 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 #11 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 #12 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 #13 0x7f7342c4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f) SUMMARY: AddressSanitizer: 80 byte(s) leaked in 2 allocation(s). [root@quaco ~]# With this we plug all leaks with "perf trace sleep 1". Fixes: 3cb4d5e ("perf trace: Free syscall tp fields in evsel->priv") Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Riccardo Mancini <rickyman7@gmail.com> Link: https://lore.kernel.org/lkml/20230719202951.534582-5-acme@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
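
The mismatch described above (allocate when the system is not "syscalls", free only when it is) can be illustrated with a minimal userspace sketch. The struct and function names below are hypothetical stand-ins, not the perf code itself; only the condition logic follows the commit message.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an evsel: only the two fields the sketch needs. */
struct evsel_sketch {
	const char *system;	/* tracepoint system name */
	void *priv;		/* per-evsel private data */
};

/* Allocation side: priv is set up only when the system is NOT "syscalls". */
int evsel_sketch_init_priv(struct evsel_sketch *e)
{
	if (strcmp(e->system, "syscalls") != 0) {
		e->priv = calloc(1, 40);
		return e->priv ? 0 : -1;
	}
	return 0;
}

/* Buggy release: frees only when the system IS "syscalls", so it never
 * matches the allocation above. Returns 1 if something was freed. */
int evsel_sketch_free_priv_buggy(struct evsel_sketch *e)
{
	if (strcmp(e->system, "syscalls") == 0 && e->priv) {
		free(e->priv);
		e->priv = NULL;
		return 1;
	}
	return 0;
}

/* Fixed release: free whenever priv was set, whatever the system name. */
int evsel_sketch_free_priv_fixed(struct evsel_sketch *e)
{
	if (e->priv) {
		free(e->priv);
		e->priv = NULL;
		return 1;
	}
	return 0;
}
```

Keying the release on the pointer itself rather than on the condition that guarded the allocation means the two paths cannot drift apart again.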

…failure to add a probe
Building perf with EXTRA_CFLAGS="-fsanitize=address" a leak is detected when trying to add a probe to a non-existent function:
    # perf probe -x ~/bin/perf dso__neW
    Probe point 'dso__neW' not found.
    Error: Failed to add events.
    =================================================================
    ==296634==ERROR: LeakSanitizer: detected memory leaks
    Direct leak of 128 byte(s) in 1 object(s) allocated from:
    #0 0x7f67642ba097 in calloc (/lib64/libasan.so.8+0xba097)
    #1 0x7f67641a76f1 in allocate_cfi (/lib64/libdw.so.1+0x3f6f1)
    Direct leak of 65 byte(s) in 1 object(s) allocated from:
    #0 0x7f67642b95b5 in __interceptor_realloc.part.0 (/lib64/libasan.so.8+0xb95b5)
    #1 0x6cac75 in strbuf_grow util/strbuf.c:64
    #2 0x6ca934 in strbuf_init util/strbuf.c:25
    #3 0x9337d2 in synthesize_perf_probe_point util/probe-event.c:2018
    #4 0x92be51 in try_to_find_probe_trace_events util/probe-event.c:964
    #5 0x93d5c6 in convert_to_probe_trace_events util/probe-event.c:3512
    #6 0x93d6d5 in convert_perf_probe_events util/probe-event.c:3529
    #7 0x56f37f in perf_add_probe_events /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:354
    #8 0x572fbc in __cmd_probe /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:738
    #9 0x5730f2 in cmd_probe /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:766
    #10 0x635d81 in run_builtin /var/home/acme/git/perf-tools-next/tools/perf/perf.c:323
    #11 0x6362c1 in handle_internal_command /var/home/acme/git/perf-tools-next/tools/perf/perf.c:377
    #12 0x63667a in run_argv /var/home/acme/git/perf-tools-next/tools/perf/perf.c:421
    #13 0x636b8d in main /var/home/acme/git/perf-tools-next/tools/perf/perf.c:537
    #14 0x7f676302950f in __libc_start_call_main (/lib64/libc.so.6+0x2950f)
    SUMMARY: AddressSanitizer: 193 byte(s) leaked in 2 allocation(s).
    #
synthesize_perf_probe_point() returns a "detached" strbuf, i.e. a malloc'ed string that needs to be free'd. An audit will be performed to find other such cases.
Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/lkml/ZM0l1Oxamr4SVjfY@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
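
The ownership rule behind this fix (a detached buffer belongs to the caller, who must free it) can be sketched as follows. The types and helpers are simplified, hypothetical stand-ins written for illustration; they are not perf's strbuf API.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified growable buffer. */
struct strbuf_sketch {
	char *buf;
	size_t len;
};

/* Detach: hand the heap buffer to the caller and reset the strbuf.
 * From here on the caller owns the memory and must free() it. */
char *strbuf_sketch_detach(struct strbuf_sketch *sb)
{
	char *ret = sb->buf;

	sb->buf = NULL;
	sb->len = 0;
	return ret;
}

/* Builds a string in a strbuf and returns it detached (malloc'ed). */
char *synthesize_point_sketch(const char *func)
{
	struct strbuf_sketch sb = { NULL, 0 };
	size_t n = strlen(func);

	sb.buf = malloc(n + 1);
	if (!sb.buf)
		return NULL;
	memcpy(sb.buf, func, n + 1);
	sb.len = n;
	return strbuf_sketch_detach(&sb);
}

/* Error path: use the synthesized string for the message, then free it.
 * Omitting the free() here is the kind of leak the report above shows. */
int report_missing_probe_sketch(const char *func, char *out, size_t outsz)
{
	char *point = synthesize_point_sketch(func);

	if (!point)
		return -1;
	snprintf(out, outsz, "Probe point '%s' not found.", point);
	free(point);
	return 0;
}
```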

While debugging a segfault on 'perf lock contention' without an available perf.data file I noticed that it was basically calling: perf_session__delete(ERR_PTR(-1)) Resulting in: (gdb) run lock contention Starting program: /root/bin/perf lock contention [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". failed to open perf.data: No such file or directory (try 'perf record' first) Initializing perf session failed Program received signal SIGSEGV, Segmentation fault. 0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858 2858 if (!session->auxtrace) (gdb) p session $1 = (struct perf_session *) 0xffffffffffffffff (gdb) bt #0 0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858 #1 0x000000000057bb4d in perf_session__delete (session=0xffffffffffffffff) at util/session.c:300 #2 0x000000000047c421 in __cmd_contention (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2161 #3 0x000000000047dc95 in cmd_lock (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2604 #4 0x0000000000501466 in run_builtin (p=0xe597a8 <commands+552>, argc=2, argv=0x7fffffffe200) at perf.c:322 #5 0x00000000005016d5 in handle_internal_command (argc=2, argv=0x7fffffffe200) at perf.c:375 #6 0x0000000000501824 in run_argv (argcp=0x7fffffffe02c, argv=0x7fffffffe020) at perf.c:419 #7 0x0000000000501b11 in main (argc=2, argv=0x7fffffffe200) at perf.c:535 (gdb) So just set it to NULL after using PTR_ERR(session) to decode the error as perf_session__delete(NULL) is supported. 
Fixes: eef4fee ("perf lock: Dynamically allocate lockhash_table") Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Ross Zwisler <zwisler@chromium.org> Cc: Sean Christopherson <seanjc@google.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Yang Jihong <yangjihong1@huawei.com> Link: https://lore.kernel.org/lkml/ZN4R1AYfsD2J8lRs@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
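
A minimal userspace sketch of the pattern described above: decode the error with PTR_ERR(), then reset the pointer to NULL so the shared cleanup path never dereferences an error-encoded pointer. The ERR_PTR/PTR_ERR/IS_ERR helpers here are simplified stand-ins for the kernel-style ones, and the session type and functions are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO_SKETCH 4095

/* Simplified stand-ins: small negative errno values are encoded
 * in the top of the pointer range. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

/* Hypothetical session object. */
struct session_sketch {
	int fd;
};

struct session_sketch *session_sketch_new(int have_data)
{
	struct session_sketch *s;

	if (!have_data)
		return ERR_PTR(-ENOENT);
	s = calloc(1, sizeof(*s));
	return s ? s : ERR_PTR(-ENOMEM);
}

/* Like the delete function in the report, NULL is accepted here,
 * but an ERR_PTR value would be dereferenced by real cleanup code. */
void session_sketch_delete(struct session_sketch *s)
{
	if (s == NULL)
		return;
	free(s);
}

int run_contention_sketch(int have_data)
{
	struct session_sketch *session = session_sketch_new(have_data);
	int err = 0;

	if (IS_ERR(session)) {
		err = (int)PTR_ERR(session);	/* decode the error first */
		session = NULL;			/* then make cleanup safe */
		goto out_delete;
	}
	/* ... real work would go here ... */
out_delete:
	session_sketch_delete(session);
	return err;
}
```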

Fix an error detected by memory sanitizer: ``` ==4033==WARNING: MemorySanitizer: use-of-uninitialized-value #0 0x55fb0fbedfc7 in read_alias_info tools/perf/util/pmu.c:457:6 #1 0x55fb0fbea339 in check_info_data tools/perf/util/pmu.c:1434:2 #2 0x55fb0fbea339 in perf_pmu__check_alias tools/perf/util/pmu.c:1504:9 #3 0x55fb0fbdca85 in parse_events_add_pmu tools/perf/util/parse-events.c:1429:32 #4 0x55fb0f965230 in parse_events_parse tools/perf/util/parse-events.y:299:6 #5 0x55fb0fbdf6b2 in parse_events__scanner tools/perf/util/parse-events.c:1822:8 #6 0x55fb0fbdf8c1 in __parse_events tools/perf/util/parse-events.c:2094:8 #7 0x55fb0fa8ffa9 in parse_events tools/perf/util/parse-events.h:41:9 #8 0x55fb0fa8ffa9 in test_event tools/perf/tests/parse-events.c:2393:8 #9 0x55fb0fa8f458 in test__pmu_events tools/perf/tests/parse-events.c:2551:15 #10 0x55fb0fa6d93f in run_test tools/perf/tests/builtin-test.c:242:9 #11 0x55fb0fa6d93f in test_and_print tools/perf/tests/builtin-test.c:271:8 #12 0x55fb0fa6d082 in __cmd_test tools/perf/tests/builtin-test.c:442:5 #13 0x55fb0fa6d082 in cmd_test tools/perf/tests/builtin-test.c:564:9 #14 0x55fb0f942720 in run_builtin tools/perf/perf.c:322:11 #15 0x55fb0f942486 in handle_internal_command tools/perf/perf.c:375:8 #16 0x55fb0f941dab in run_argv tools/perf/perf.c:419:2 #17 0x55fb0f941dab in main tools/perf/perf.c:535:3 ``` Fixes: 7b723db ("perf pmu: Be lazy about loading event info files from sysfs") Signed-off-by: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@arm.com> Cc: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20230914022425.1489035-1-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>

The following call trace shows a deadlock issue due to recursive locking of mutex "device_mutex". First lock acquire is in target_for_each_device() and second in target_free_device().
    PID: 148266 TASK: ffff8be21ffb5d00 CPU: 10 COMMAND: "iscsi_ttx"
    #0 [ffffa2bfc9ec3b18] __schedule at ffffffffa8060e7f
    #1 [ffffa2bfc9ec3ba0] schedule at ffffffffa8061224
    #2 [ffffa2bfc9ec3bb8] schedule_preempt_disabled at ffffffffa80615ee
    #3 [ffffa2bfc9ec3bc8] __mutex_lock at ffffffffa8062fd7
    #4 [ffffa2bfc9ec3c40] __mutex_lock_slowpath at ffffffffa80631d3
    #5 [ffffa2bfc9ec3c50] mutex_lock at ffffffffa806320c
    #6 [ffffa2bfc9ec3c68] target_free_device at ffffffffc0935998 [target_core_mod]
    #7 [ffffa2bfc9ec3c90] target_core_dev_release at ffffffffc092f975 [target_core_mod]
    #8 [ffffa2bfc9ec3ca0] config_item_put at ffffffffa79d250f
    #9 [ffffa2bfc9ec3cd0] config_item_put at ffffffffa79d2583
    #10 [ffffa2bfc9ec3ce0] target_devices_idr_iter at ffffffffc0933f3a [target_core_mod]
    #11 [ffffa2bfc9ec3d00] idr_for_each at ffffffffa803f6fc
    #12 [ffffa2bfc9ec3d60] target_for_each_device at ffffffffc0935670 [target_core_mod]
    #13 [ffffa2bfc9ec3d98] transport_deregister_session at ffffffffc0946408 [target_core_mod]
    #14 [ffffa2bfc9ec3dc8] iscsit_close_session at ffffffffc09a44a6 [iscsi_target_mod]
    #15 [ffffa2bfc9ec3df0] iscsit_close_connection at ffffffffc09a4a88 [iscsi_target_mod]
    #16 [ffffa2bfc9ec3df8] finish_task_switch at ffffffffa76e5d07
    #17 [ffffa2bfc9ec3e78] iscsit_take_action_for_connection_exit at ffffffffc0991c23 [iscsi_target_mod]
    #18 [ffffa2bfc9ec3ea0] iscsi_target_tx_thread at ffffffffc09a403b [iscsi_target_mod]
    #19 [ffffa2bfc9ec3f08] kthread at ffffffffa76d8080
    #20 [ffffa2bfc9ec3f50] ret_from_fork at ffffffffa8200364
Fixes: 36d4cb4 ("scsi: target: Avoid that EXTENDED COPY commands trigger lock inversion")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Link: https://lore.kernel.org/r/20230918225848.66463-1-junxiao.bi@oracle.com
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
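
The shape of the deadlock in the trace (an iterator holds a non-recursive lock, and dropping the last reference inside the callback reaches a release path that takes the same lock) can be modelled in a few lines. This is a toy model under stated assumptions: the lock is a plain flag that reports -EDEADLK where a real mutex would block forever, and all names are hypothetical. It does not show how the actual patch resolves the issue.

```c
#include <assert.h>
#include <errno.h>

/* Toy non-recursive lock: a second acquisition is reported as -EDEADLK
 * instead of blocking forever, so the sketch can run to completion. */
struct mutex_sketch {
	int held;
};

int mutex_sketch_lock(struct mutex_sketch *m)
{
	if (m->held)
		return -EDEADLK;
	m->held = 1;
	return 0;
}

void mutex_sketch_unlock(struct mutex_sketch *m)
{
	m->held = 0;
}

struct mutex_sketch device_mutex_sketch;

/* Models the second acquisition: the release path locks by itself. */
int free_device_sketch(void)
{
	int ret = mutex_sketch_lock(&device_mutex_sketch);

	if (ret)
		return ret;
	/* ... remove the device from the lookup table ... */
	mutex_sketch_unlock(&device_mutex_sketch);
	return 0;
}

/* Models the first acquisition: iterate with the lock held. If the
 * callback drops the last reference, the release path runs nested. */
int for_each_device_sketch(int callback_drops_last_ref)
{
	int ret = mutex_sketch_lock(&device_mutex_sketch);

	if (ret)
		return ret;
	if (callback_drops_last_ref)
		ret = free_device_sketch();
	mutex_sketch_unlock(&device_mutex_sketch);
	return ret;
}
```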

A skcipher_request object is made up of struct skcipher_request followed by a variable-sized trailer. The allocation of the skcipher_request and IV in crypt_iv_eboiv_gen is missing the memory for struct skcipher_request. Fix it by adding it to reqsize. Fixes: e302309 ("dm crypt: Avoid using MAX_CIPHER_BLOCKSIZE") Cc: <stable@vger.kernel.org> #6.5+ Reported-by: Tatu Heikkilä <tatu.heikkila@gmail.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
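
The size arithmetic behind this fix can be sketched as below. The struct is a hypothetical stand-in with a flexible trailer, not the crypto API's definition; the point is only that the header size has to be part of the allocation when a request and an IV share one buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request layout: a fixed header followed by a
 * variable-sized, implementation-specific trailer. */
struct request_sketch {
	void *tfm;
	unsigned int cryptlen;
	unsigned char ctx[];	/* trailer of reqsize bytes */
};

/* Buggy size: accounts for the trailer and the IV only. */
size_t alloc_size_buggy(size_t reqsize, size_t ivsize)
{
	return reqsize + ivsize;
}

/* Fixed size: header + trailer + IV. */
size_t alloc_size_fixed(size_t reqsize, size_t ivsize)
{
	return sizeof(struct request_sketch) + reqsize + ivsize;
}

/* Offset at which the IV starts when it is placed after the request. */
size_t iv_offset(size_t reqsize)
{
	return sizeof(struct request_sketch) + reqsize;
}
```

With the buggy size, the IV placed at iv_offset() would extend past the end of the allocation by exactly the size of the header.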

Petr Machata says: ==================== mlxsw: Move allocation of LAG table to the driver PGT is an in-HW table that maps addresses to sets of ports. Then when some HW process needs a set of ports as an argument, instead of embedding the actual set in the dynamic configuration, what gets configured is the address referencing the set. The HW then works with the appropriate PGT entry. Within the PGT is placed a LAG table. That is a contiguous block of PGT memory where each entry describes which ports are members of the corresponding LAG port. The PGT is split to two parts: one managed by the FW, and one managed by the driver. Historically, the FW part included also the LAG table, referred to as FW LAG mode. Giving the responsibility for placement of the LAG table to the driver, referred to as SW LAG mode, makes the whole system more flexible. The FW currently supports both FW and SW LAG modes. To shed complexity, the FW should in the future only support SW LAG mode. Hence this patchset, where support for placement of LAG is added to mlxsw. There are FW versions out there that do not support SW LAG mode, and on Spectrum-1 in particular, there is no plan to support it at all. mlxsw will therefore have to support both modes of operation. Another aspect is that at least on Spectrum-1, there are FW versions out there that claim to support driver-placed LAG table, but then reject or ignore configurations enabling the same. The driver thus has to have a say in whether an attempt to configure SW LAG mode should even be done. The feature is therefore expressed in terms of "does the driver prefer SW LAG mode?", and "what LAG mode the PCI module managed to configure the FW with". This is unlike current flood mode configuration, where the driver can give a strict value, and that's what gets configured. But it gives a chance to the driver to determine whether LAG mode should be enabled at all. The "does the driver prefer SW LAG mode?" 
bit is expressed as a boolean lag_mode_prefer_sw. The reason for this is largely another feature that will be introduced in a follow-up patchset: support for CFF flood mode. The driver currently requires that the FW be configured with what is called controlled flood mode. But on capable systems, CFF would be preferred. So there are two values in flight: the preferred flood mode, and the fallback. This could be expressed with an array of flood modes ordered by preference, but that looks like an overkill in comparison. This flag/value model is then reused for LAG mode as well, except the fallback value is absent and implied to be FW, because there are no other values to choose from. The patchset progresses as follows: - Patches #1 to #5 adjust reg.h and cmd.h with new register fields, constants and remarks. - Patches #6 and #7 add the ability to request SW LAG mode and to query the LAG mode that was actually negotiated. This is where the abovementioned lag_mode_prefer_sw flag is added. - Patches #7 to #9 generalize PGT allocations to make it possible to allocate the LAG table, which is done in patch #10. - In patch #11, toggle lag_mode_prefer_sw on Spectrum-2 and above, which makes the newly-added code live. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
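
The flag/value model for LAG mode can be reduced to a small decision function. This is a sketch with hypothetical names (only lag_mode_prefer_sw comes from the text above); it assumes the firmware either accepts the SW LAG request or the result falls back to FW mode, with no other values in play.

```c
#include <assert.h>
#include <stdbool.h>

enum lag_mode_sketch {
	LAG_MODE_SKETCH_FW,	/* LAG table placed by the firmware */
	LAG_MODE_SKETCH_SW,	/* LAG table placed by the driver */
};

/* The driver states a preference; the configured mode is SW only if the
 * driver prefers it AND the firmware accepts it. The fallback is
 * implied to be FW because there is nothing else to choose from. */
enum lag_mode_sketch negotiate_lag_mode_sketch(bool lag_mode_prefer_sw,
					       bool fw_accepts_sw)
{
	if (lag_mode_prefer_sw && fw_accepts_sw)
		return LAG_MODE_SKETCH_SW;
	return LAG_MODE_SKETCH_FW;
}

/* The driver allocates the LAG table itself only in SW mode. */
bool driver_places_lag_table_sketch(enum lag_mode_sketch mode)
{
	return mode == LAG_MODE_SKETCH_SW;
}
```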

Hou Tao says: ==================== bpf: Fix the release of inner map From: Hou Tao <houtao1@huawei.com> Hi, The patchset aims to fix the release of inner map in map array or map htab. The release of an inner map differs from that of a normal map. For a normal map, the map is released after the bpf program which uses the map is destroyed, because the bpf program tracks the used maps. However, a bpf program cannot track the used inner maps because these inner maps may be updated or deleted dynamically, and for now the ref-counter of an inner map is decreased after the inner map is removed from the outer map, so the inner map may be freed before the bpf program that is accessing it exits, and there will be a use-after-free problem as demonstrated by patch #6. The patchset fixes the problem by deferring the release of inner map. The freeing of inner map is deferred according to the sleepable attributes of the bpf programs which own the outer map. Patch #1 fixes the warning when running the newly-added selftest under interpreter mode. Patch #2 adds more parameters to .map_fd_put_ptr() to prepare for the fix. Patch #3 fixes the incorrect value of need_defer when freeing the fd array. Patch #4 fixes the potential use-after-free problem by using call_rcu_tasks_trace() and call_rcu() to wait for one tasks trace RCU GP and one RCU GP unconditionally. Patch #5 optimizes the free of inner map by removing the unnecessary RCU GP waiting. Patch #6 adds a selftest to demonstrate the potential use-after-free problem. Patch #7 updates a selftest to update outer map in syscall bpf program. Please see individual patches for more details. And comments are always welcome. 
Change Log: v5: * patch #3: rename fd_array_map_delete_elem_with_deferred_free() to __fd_array_map_delete_elem() (Alexei) * patch #5: use atomic64_t instead of atomic_t to prevent potential overflow (Alexei) * patch #7: use ptr_to_u64() helper instead of force casting to initialize pointers in bpf_attr (Alexei) v4: https://lore.kernel.org/bpf/20231130140120.1736235-1-houtao@huaweicloud.com * patch #2: don't use "deferred", use "need_defer" uniformly * patch #3: newly-added, fix the incorrect value of need_defer during fd array free. * patch #4: doesn't consider the case in which bpf map is not used by any bpf program and only use sleepable_refcnt to remove unnecessary tasks trace RCU GP (Alexei) * patch #4: remove memory barriers added due to cautiousness (Alexei) v3: https://lore.kernel.org/bpf/20231124113033.503338-1-houtao@huaweicloud.com * multiple variable renamings (Martin) * define BPF_MAP_RCU_GP/BPF_MAP_RCU_TT_GP as bit (Martin) * use call_rcu() and its variants instead of synchronize_rcu() (Martin) * remove unnecessary mask in bpf_map_free_deferred() (Martin) * place atomic_or() and the related smp_mb() together (Martin) * add patch #6 to demonstrate that updating outer map in syscall program is dead-lock free (Alexei) * update comments about the memory barrier in bpf_map_fd_put_ptr() * update commit message for patch #3 and #4 to describe more details v2: https://lore.kernel.org/bpf/20231113123324.3914612-1-houtao@huaweicloud.com * defer the invocation of ops->map_free() instead of bpf_map_put() (Martin) * update selftest to make it being reproducible under JIT mode (Martin) * remove unnecessary preparatory patches v1: https://lore.kernel.org/bpf/20231107140702.1891778-1-houtao@huaweicloud.com ==================== Link: https://lore.kernel.org/r/20231204140425.1480317-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Andrii Nakryiko says: ==================== BPF token support in libbpf's BPF object Add fuller support for BPF token in high-level BPF object APIs. This is the most frequently used way to work with BPF using libbpf, so supporting BPF token there is critical. Patch #1 is improving kernel-side BPF_TOKEN_CREATE behavior by rejecting to create "empty" BPF token with no delegation. This seems like saner behavior which also makes libbpf's caching better overall. If we ever want to create BPF token with no delegate_xxx options set on BPF FS, we can use a new flag to enable that. Patches #2-#5 refactor libbpf internals, mostly feature detection code, to prepare it for BPF token FD. Patch #6 adds options to pass BPF token into BPF object open options. It also adds implicit BPF token creation logic to BPF object load step, even without any explicit involvement of the user. If the environment is set up properly, BPF token will be created transparently and used implicitly. This allows all existing applications to gain BPF token support by just linking with the latest version of the libbpf library. No source code modifications are required. All that under the assumption that a privileged container management agent properly set up the default BPF FS instance at /sys/fs/bpf to allow BPF token creation. Patches #7-#8 add more selftests, validating BPF object APIs work as expected under unprivileged user namespaced conditions in the presence of BPF token. Patch #9 extends libbpf with LIBBPF_BPF_TOKEN_PATH envvar knowledge, which can be used to override custom BPF FS location used for implicit BPF token creation logic without needing to adjust application code. This allows admins or container managers to mount BPF token-enabled BPF FS at non-standard location without the need to coordinate with applications. LIBBPF_BPF_TOKEN_PATH can also be used to disable BPF token implicit creation by setting it to an empty value. Patch #10 tests this new envvar functionality. 
v2->v3: - move some stray feature cache refactorings into patch #4 (Alexei); - add LIBBPF_BPF_TOKEN_PATH envvar support (Alexei); v1->v2: - remove minor code redundancies (Eduard, John); - add acks and rebase. ==================== Link: https://lore.kernel.org/r/20231213190842.3844987-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
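
The LIBBPF_BPF_TOKEN_PATH semantics described for patch #9 (unset means use the default location, an empty value disables implicit token creation, anything else overrides the location) can be sketched as a small resolver. The function names are hypothetical and this is not libbpf's implementation; the default path is passed in by the caller rather than assumed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Resolve the BPF FS path used for implicit token creation.
 *   env_value == NULL  -> variable unset, use default_path
 *   env_value == ""    -> implicit creation disabled, return NULL
 *   otherwise          -> use the override as given
 */
const char *resolve_token_path_sketch(const char *env_value,
				      const char *default_path)
{
	if (env_value == NULL)
		return default_path;
	if (env_value[0] == '\0')
		return NULL;
	return env_value;
}

/* Thin wrapper that reads the real environment variable. */
const char *token_path_from_env_sketch(const char *default_path)
{
	return resolve_token_path_sketch(getenv("LIBBPF_BPF_TOKEN_PATH"),
					 default_path);
}
```

Keeping the decision in a function that takes the value as an argument makes the three cases easy to test without touching the process environment.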

Ido Schimmel says: ==================== Add MDB bulk deletion support This patchset adds MDB bulk deletion support, allowing user space to request the deletion of matching entries instead of dumping the entire MDB and issuing a separate deletion request for each matching entry. Support is added in both the bridge and VXLAN drivers in a similar fashion to the existing FDB bulk deletion support. The parameters according to which bulk deletion can be performed are similar to the FDB ones, namely: Destination port, VLAN ID, state (e.g., "permanent"), routing protocol, source / destination VNI, destination IP and UDP port. Flushing based on flags (e.g., "offload", "fast_leave", "added_by_star_ex", "blocked") is not currently supported, but can be added in the future, if a use case arises. Patch #1 adds a new uAPI attribute to allow specifying the state mask according to which bulk deletion will be performed, if any. Patch #2 adds a new policy according to which bulk deletion requests (with 'NLM_F_BULK' flag set) will be parsed. Patches #3-#4 add a new NDO for MDB bulk deletion and invoke it from the rtnetlink code when a bulk deletion request is made. Patches #5-#6 implement the MDB bulk deletion NDO in the bridge and VXLAN drivers, respectively. Patch #7 allows user space to issue MDB bulk deletion requests by no longer rejecting the 'NLM_F_BULK' flag when it is set in 'RTM_DELMDB' requests. Patches #8-#9 add selftests for both drivers, for both good and bad flows. iproute2 changes can be found here [1]. https://github.com/idosch/iproute2/tree/submit/mdb_flush_v1 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>

Andrii Nakryiko says:

====================
Enhance BPF global subprogs with argument tags

This patch set adds verifier support for annotating user's global BPF subprog arguments with a few commonly requested annotations, to improve the global subprog verification experience.

These tags are:
- ability to annotate a special PTR_TO_CTX argument;
- ability to annotate a generic PTR_TO_MEM as non-null.

We utilize the btf_decl_tag attribute for this and provide two helper macros as part of bpf_helpers.h in libbpf (patch #8).

Besides this we also add the ability to pass a pointer to dynptr into a global subprog. This is done based on type name match (struct bpf_dynptr *). This allows passing dynptrs into global subprogs, for use cases that deal with variable-sized generic memory pointers.

A big chunk of the patch set (patches #1 through #5) is various refactorings to make verifier internals around global subprog validation logic easier to extend and support long term, eliminating BTF parsing logic duplication, factoring out argument expectation definitions from BTF parsing, etc.

New functionality is added in patch #6 (ctx and non-null) and patch #7 (dynptr), extending global subprog checks with awareness for arg tags.

Patch #9 adds simple tests validating each of the added tags and dynptr argument passing.

Patch #10 adds a simple negative case for freplace programs to make sure that target BPF programs with "unreliable" BTF func proto cannot be freplaced.

v2->v3:
- patch #10 improved by checking expected verifier error (Eduard);
v1->v2:
- dropped packet args for now (Eduard);
- added back unreliable=true detection for entry BPF programs (Eduard);
- improved subprog arg validation (Eduard);
- switched dynptr arg from tag to just type name based check (Eduard).
====================

Link: https://lore.kernel.org/r/20231215011334.2307144-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

An issue occurred while reading an ELF file in libbpf.c during fuzzing:

    Program received signal SIGSEGV, Segmentation fault.
    0x0000000000958e97 in bpf_object.collect_prog_relos () at libbpf.c:4206
    4206 in libbpf.c
    (gdb) bt
    #0 0x0000000000958e97 in bpf_object.collect_prog_relos () at libbpf.c:4206
    #1 0x000000000094f9d6 in bpf_object.collect_relos () at libbpf.c:6706
    #2 0x000000000092bef3 in bpf_object_open () at libbpf.c:7437
    #3 0x000000000092c046 in bpf_object.open_mem () at libbpf.c:7497
    #4 0x0000000000924afa in LLVMFuzzerTestOneInput () at fuzz/bpf-object-fuzzer.c:16
    #5 0x000000000060be11 in testblitz_engine::fuzzer::Fuzzer::run_one ()
    #6 0x000000000087ad92 in tracing::span::Span::in_scope ()
    #7 0x00000000006078aa in testblitz_engine::fuzzer::util::walkdir ()
    #8 0x00000000005f3217 in testblitz_engine::entrypoint::main::{{closure}} ()
    #9 0x00000000005f2601 in main ()
    (gdb)

scn_data was null at this code (tools/lib/bpf/src/libbpf.c):

    if (rel->r_offset % BPF_INSN_SZ || rel->r_offset >= scn_data->d_size) {

The scn_data is derived from the code above:

    scn = elf_sec_by_idx(obj, sec_idx);
    scn_data = elf_sec_data(obj, scn);

    relo_sec_name = elf_sec_str(obj, shdr->sh_name);
    sec_name = elf_sec_name(obj, scn);
    if (!relo_sec_name || !sec_name)// don't check whether scn_data is NULL
            return -EINVAL;

In certain special scenarios, such as reading a malformed ELF file, it is possible that scn_data may be a null pointer.

Signed-off-by: Mingyi Zhang <zhangmingyi5@huawei.com>
Signed-off-by: Xin Liu <liuxin350@huawei.com>
Signed-off-by: Changye Wu <wuchangye@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231221033947.154564-1-liuxin350@huawei.com
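The shape of the missing check can be sketched in standalone C. This is not the libbpf patch itself: `struct sec_data` and `check_relo_offset()` are hypothetical stand-ins for the libelf data descriptor and the relocation loop body, and `INSN_SZ` is assumed to be 8 bytes, the size of one BPF instruction.

```c
#include <errno.h>
#include <stddef.h>

#define INSN_SZ 8 /* assumed size of one BPF instruction, in bytes */

/* Hypothetical stand-in for the section data descriptor returned by the ELF library. */
struct sec_data {
	const void *buf;
	size_t      size;
};

/*
 * Validate one relocation offset against its target section.
 * 'data' may be NULL for a malformed ELF file, so it is checked
 * before any of its fields is read.
 */
int check_relo_offset(const struct sec_data *data, size_t r_offset)
{
	if (!data)
		return -EINVAL;
	if (r_offset % INSN_SZ || r_offset >= data->size)
		return -EINVAL;
	return 0;
}
```

Rejecting the NULL descriptor up front turns the crash from the backtrace into an ordinary -EINVAL for the malformed input.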

Wen Gu says: ==================== net/smc: implement SMCv2.1 virtual ISM device support The fourth edition of SMCv2 adds the SMC version 2.1 feature updates for SMC-Dv2 with virtual ISM. Virtual ISM are created and supported mainly by OS or hypervisor software, comparable to IBM ISM which is based on platform firmware or hardware. With the introduction of virtual ISM, SMCv2.1 makes some updates: - Introduce feature bitmask to indicate supplemental features. - Reserve a range of CHIDs for virtual ISM. - Support extended GIDs (128 bits) in CLC handshake. So this patch set aims to implement these updates in Linux kernel. And it acts as the first part of SMC-D virtual ISM extension & loopback-ism [1]. [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/ v8->v7: - Patch #7: v7 mistakenly changed the type of gid_ext in smc_clc_msg_accept_confirm to u64 instead of __be64 as previous versions when fixing the rebase conflicts. So fix this mistake. v7->v6: Link: https://lore.kernel.org/netdev/20231219084536.8158-1-guwen@linux.alibaba.com/ - Collect the Reviewed-by tag in v6; - Patch #3: redefine the struct smc_clc_msg_accept_confirm; - Patch #7: Because that the Patch #3 already adds '__packed' to smc_clc_msg_accept_confirm, so Patch #7 doesn't need to do the same thing. But this is a minor change, so I kept the 'Reviewed-by' tag. Other changes in previous versions but not yet acked: - Patch #1: Some minor changes in subject and fix the format issue (length exceeds 80 columns) compared to v3. - Patch #5: removes useless ini->feature_mask assignment in __smc_connect() and smc_listen_v2_check() compared to v4. - Patch #8: new added, compared to v3. 
v6->v5: Link: https://lore.kernel.org/netdev/1702371151-125258-1-git-send-email-guwen@linux.alibaba.com/ - Add 'Reviewed-by' label given in the previous versions: * Patch #4, #6, #9, #10 have nothing changed since v3; - Patch #2: * fix the format issue (Alignment should match open parenthesis) compared to v5; * remove useless clc->hdr.length assignment in smcr_clc_prep_confirm_accept() compared to v5; - Patch #3: new added compared to v5. - Patch #7: some minor changes like aclc_v2->aclc or clc_v2->clc compared to v5 due to the introduction of Patch #3. Since there were no major changes, I kept the 'Reviewed-by' label. Other changes in previous versions but not yet acked: - Patch #1: Some minor changes in subject and fix the format issue (length exceeds 80 columns) compared to v3. - Patch #5: removes useless ini->feature_mask assignment in __smc_connect() and smc_listen_v2_check() compared to v4. - Patch #8: new added, compared to v3. v5->v4: Link: https://lore.kernel.org/netdev/1702021259-41504-1-git-send-email-guwen@linux.alibaba.com/ - Patch #6: improve the comment of SMCD_CLC_MAX_V2_GID_ENTRIES; - Patch #4: remove useless ini->feature_mask assignment; v4->v3: https://lore.kernel.org/netdev/1701920994-73705-1-git-send-email-guwen@linux.alibaba.com/ - Patch #6: use SMCD_CLC_MAX_V2_GID_ENTRIES to indicate the max gid entries in CLC proposal and using SMC_MAX_V2_ISM_DEVS to indicate the max devices to propose; - Patch #6: use i and i+1 in smc_find_ism_v2_device_serv(); - Patch #2: replace the large if-else block in smc_clc_send_confirm_accept() with 2 subfunctions; - Fix missing byte order conversion of GID and token in CLC handshake, which is in a separate patch sending to net: https://lore.kernel.org/netdev/1701882157-87956-1-git-send-email-guwen@linux.alibaba.com/ - Patch #7: add extended GID in SMC-D lgr netlink attribute; v3->v2: https://lore.kernel.org/netdev/1701343695-122657-1-git-send-email-guwen@linux.alibaba.com/ - Rename smc_clc_fill_fce as 
smc_clc_fill_fce_v2x; - Remove ISM_IDENT_MASK from drivers/s390/net/ism.h; - Add explicitly assigning 'false' to ism_v2_capable in ism_dev_init(); - Remove smc_ism_set_v2_capable() helper for now, and introduce it in later loopback-ism implementation; v2->v1: - Fix sparse complaint; - Rebase to the latest net-next; ==================== Signed-off-by: David S. Miller <davem@davemloft.net>

Petr Machata says: ==================== selftests: Fixes for kernel CI As discussed on the bi-weekly call on Jan 30, and in mailing around kernel CI effort, some changes are desirable in the suite of forwarding selftests the better to work with the CI tooling. Namely: - The forwarding selftests use a configuration file where names of interfaces are defined and various variables can be overridden. There is also forwarding.config.sample that users can use as a template to refer to when creating the config file. What happens a fair bit is that users either do not know about this at all, or simply forget, and are confused by cryptic failures about interfaces that cannot be created. In patches #1 - #3 have lib.sh just be the single source of truth with regards to which variables exist. That includes the topology variables which were previously only in the sample file, and any "tweak variables", such as what tools to use, sleep times, etc. forwarding.config.sample then becomes just a placeholder with a couple examples. Unless specific HW should be exercised, or specific tools used, the defaults are usually just fine. - Several net/forwarding/ selftests (and one net/ one) cannot be run on veth pairs, they need an actual HW interface to run on. They are generic in the sense that any capable HW should pass them, which is why they have been put to net/forwarding/ as opposed to drivers/net/, but they do not generalize to veth. The fact that these tests are in net/forwarding/, but still complaining when run, is confusing. In patches #4 - #6 move these tests to a new directory drivers/net/hw. - The following patches extend the codebase to handle well test results other than pass and fail. Patch #7 is preparatory. It converts several log_test_skip to XFAIL, so that tests do not spuriously end up returning non-0 when they are not supposed to. In patches #8 - #10, introduce some missing ksft constants, then support having those constants in RET, and then finally in EXIT_STATUS. 
- The traffic scheduler tests generate a large amount of network traffic to test the behavior of the scheduler. This demands a relatively high-performance computer. On slow machines, such as with a debugging kernel, the test would spuriously fail. It can still be useful to "go through the motions" though, to possibly catch bugs in setup of the scheduler graph and passing packets around. Thus we still want to run the tests, just with lowered demands. To that end, in patches #11 - #12, introduce an environment variable KSFT_MACHINE_SLOW, with obvious meaning. Tests can then make checks more lenient, such as mark failures as XFAIL. A helper, xfail_on_slow, is provided to mark performance-sensitive parts of the selftest. - In patch #13, use a similar mechanism to mark a NH group stats selftest to XFAIL HW stats tests when run on VETH pairs. - All these changes complicate the hitherto straightforward logging and checking logic, so in patch #14, add a selftest that checks this functionality in lib.sh. v1 (vs. an RFC circulated through linux-kselftest): - Patch #9: - Clarify intended usage by s/set_ret/ret_set_ksft_status/, s/nret/ksft_status/ ==================== Link: https://lore.kernel.org/r/cover.1711464583.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

The driver creates /sys/kernel/debug/dri/0/mob_ttm even when the corresponding ttm_resource_manager is not allocated. This leads to a crash when trying to read from this file.

Add a check to create the mob_ttm, system_mob_ttm, and gmr_ttm debug files only when the corresponding ttm_resource_manager is allocated.

    crash> bt
    PID: 3133409 TASK: ffff8fe4834a5000 CPU: 3 COMMAND: "grep"
    #0 [ffffb954506b3b20] machine_kexec at ffffffffb2a6bec3
    #1 [ffffb954506b3b78] __crash_kexec at ffffffffb2bb598a
    #2 [ffffb954506b3c38] crash_kexec at ffffffffb2bb68c1
    #3 [ffffb954506b3c50] oops_end at ffffffffb2a2a9b1
    #4 [ffffb954506b3c70] no_context at ffffffffb2a7e913
    #5 [ffffb954506b3cc8] __bad_area_nosemaphore at ffffffffb2a7ec8c
    #6 [ffffb954506b3d10] do_page_fault at ffffffffb2a7f887
    #7 [ffffb954506b3d40] page_fault at ffffffffb360116e
       [exception RIP: ttm_resource_manager_debug+0x11]
       RIP: ffffffffc04afd11 RSP: ffffb954506b3df0 RFLAGS: 00010246
       RAX: ffff8fe41a6d1200 RBX: 0000000000000000 RCX: 0000000000000940
       RDX: 0000000000000000 RSI: ffffffffc04b4338 RDI: 0000000000000000
       RBP: ffffb954506b3e08 R8: ffff8fee3ffad000 R9: 0000000000000000
       R10: ffff8fe41a76a000 R11: 0000000000000001 R12: 00000000ffffffff
       R13: 0000000000000001 R14: ffff8fe5bb6f3900 R15: ffff8fe41a6d1200
       ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
    #8 [ffffb954506b3e00] ttm_resource_manager_show at ffffffffc04afde7 [ttm]
    #9 [ffffb954506b3e30] seq_read at ffffffffb2d8f9f3
       RIP: 00007f4c4eda8985 RSP: 00007ffdbba9e9f8 RFLAGS: 00000246
       RAX: ffffffffffffffda RBX: 000000000037e000 RCX: 00007f4c4eda8985
       RDX: 000000000037e000 RSI: 00007f4c41573000 RDI: 0000000000000003
       RBP: 000000000037e000 R8: 0000000000000000 R9: 000000000037fe30
       R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4c41573000
       R13: 0000000000000003 R14: 00007f4c41572010 R15: 0000000000000003
       ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b

Signed-off-by: Jocelyn Falempe <jfalempe@redhat.com>
Fixes: af4a25b ("drm/vmwgfx: Add debugfs entries for various ttm resource managers")
Cc: <stable@vger.kernel.org>
Reviewed-by: Zack Rusin <zack.rusin@broadcom.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240312093551.196609-1-jfalempe@redhat.com
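The guard described in the commit message (create a debug file only when the manager it would dump exists) can be sketched outside the kernel as follows. The types and the function `debug_add_manager()` are hypothetical and only model the idea; they are not the TTM or vmwgfx API.

```c
#include <stddef.h>

/* Hypothetical stand-in for a resource manager; not the TTM type. */
struct res_manager {
	const char *name;
};

#define MAX_DEBUG_FILES 8

/* Hypothetical model of a debug directory holding one file per manager. */
struct debug_dir {
	const struct res_manager *files[MAX_DEBUG_FILES];
	size_t count;
};

/*
 * Register a debug file only if the manager it would dump exists.
 * Returns 1 if a file was created, 0 if it was skipped, -1 if the
 * directory is full.
 */
int debug_add_manager(struct debug_dir *dir, const struct res_manager *man)
{
	if (!man)
		return 0; /* nothing allocated: create no file */
	if (dir->count >= MAX_DEBUG_FILES)
		return -1;
	dir->files[dir->count++] = man;
	return 1;
}
```

With the check at registration time, a later read can never reach a show callback with a NULL manager, which is the situation the RDI: 0000000000000000 in the backtrace points to.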

Petr Machata says:

====================
mlxsw: Preparations for improving performance

Amit Cohen writes:

The mlxsw driver will use NAPI for event processing in a follow-up patch set. Some additional improvements will be added later. This patch set prepares the code for NAPI usage and refactors some relevant areas. See more details in the commit messages.

Patch set overview:
Patches #1-#2 are preparations for patch #3
Patch #3 sets up tasklets as part of queue initialization
Patch #4 removes handling of an unlikely scenario
Patch #5 removes unused counters
Patch #6 makes a style change in mlxsw_pci_eq_tasklet()
Patches #7-#10 poll the command interface instead of using EQ0
Patches #11-#12 make a style change and break up the function mlxsw_pci_cq_tasklet()
Patches #13-#14 remove functions which can be replaced by a stored value
Patch #15 improves access to the descriptor queue instance
====================

Link: https://lore.kernel.org/r/cover.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

…git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: Patch #1 unlike early commit path stage which triggers a call to abort, an explicit release of the batch is required on abort, otherwise mutex is released and commit_list remains in place. Patch #2 release mutex after nft_gc_seq_end() in commit path, otherwise async GC worker could collect expired objects. Patch #3 flush pending destroy work in module removal path, otherwise UaF is possible. Patch #4 and #6 restrict the table dormant flag with basechain updates to fix state inconsistency in the hook registration. Patch #5 adds missing RCU read side lock to flowtable type to avoid races with module removal. * tag 'nf-24-04-04' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: discard table flag update with pending basechain deletion netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get() netfilter: nf_tables: reject new basechain after table flag update netfilter: nf_tables: flush pending destroy work before exit_net release netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path netfilter: nf_tables: release batch on table validation from abort path ==================== Link: https://lore.kernel.org/r/20240404104334.1627-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>

In the current x1e80100 interface table, interface #3 is wrongly connected to DP controller #0 and interface #4 is wrongly connected to DP controller #2. Fix this by connecting interface #3 to DP controller #0 and interface #4 to DP controller #1. Also add the interface #6, #7 and #8 connections to the DP controller to complete the x1e80100 interface table.

Changes in V3:
-- add v2 changes log

Changes in V2:
-- add x1e80100 to subject
-- add Fixes

Fixes: e3b1f36 ("drm/msm/dpu: Add X1E80100 support")
Signed-off-by: Kuogee Hsieh <quic_khsieh@quicinc.com>
Reviewed-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
Reviewed-by: Abel Vesa <abel.vesa@linaro.org>
Patchwork: https://patchwork.freedesktop.org/patch/585549/
Link: https://lore.kernel.org/r/1711741586-9037-1-git-send-email-quic_khsieh@quicinc.com
Signed-off-by: Abhinav Kumar <quic_abhinavk@quicinc.com>

…git/netfilter/nf netfilter pull request 24-04-11 Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: Patches #1 and #2 add missing rcu read side lock when iterating over expression and object type list which could race with module removal. Patch #3 prevents promisc packet from visiting the bridge/input hook to amend a recent fix to address conntrack confirmation race in br_netfilter and nf_conntrack_bridge. Patch #4 adds and uses iterate decorator type to fetch the current pipapo set backend datastructure view when netlink dumps the set elements. Patch #5 fixes removal of duplicate elements in the pipapo set backend. Patch #6 flowtable validates pppoe header before accessing it. Patch #7 fixes flowtable datapath for pppoe packets, otherwise lookup fails and pppoe packets follow classic path. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>

vhost_worker calls the tun callbacks to receive packets. If too many illegal packets arrive, tun_do_read keeps dumping packet contents. When the console is enabled, dumping the packets costs much more CPU time, and a soft lockup is detected.

The net_ratelimit mechanism can be used to limit the dumping rate.

    PID: 33036 TASK: ffff949da6f20000 CPU: 23 COMMAND: "vhost-32980"
    #0 [fffffe00003fce50] crash_nmi_callback at ffffffff89249253
    #1 [fffffe00003fce58] nmi_handle at ffffffff89225fa3
    #2 [fffffe00003fceb0] default_do_nmi at ffffffff8922642e
    #3 [fffffe00003fced0] do_nmi at ffffffff8922660d
    #4 [fffffe00003fcef0] end_repeat_nmi at ffffffff89c01663
       [exception RIP: io_serial_in+20]
       RIP: ffffffff89792594 RSP: ffffa655314979e8 RFLAGS: 00000002
       RAX: ffffffff89792500 RBX: ffffffff8af428a0 RCX: 0000000000000000
       RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff8af428a0
       RBP: 0000000000002710 R8: 0000000000000004 R9: 000000000000000f
       R10: 0000000000000000 R11: ffffffff8acbf64f R12: 0000000000000020
       R13: ffffffff8acbf698 R14: 0000000000000058 R15: 0000000000000000
       ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
    #5 [ffffa655314979e8] io_serial_in at ffffffff89792594
    #6 [ffffa655314979e8] wait_for_xmitr at ffffffff89793470
    #7 [ffffa65531497a08] serial8250_console_putchar at ffffffff897934f6
    #8 [ffffa65531497a20] uart_console_write at ffffffff8978b605
    #9 [ffffa65531497a48] serial8250_console_write at ffffffff89796558
    #10 [ffffa65531497ac8] console_unlock at ffffffff89316124
    #11 [ffffa65531497b10] vprintk_emit at ffffffff89317c07
    #12 [ffffa65531497b68] printk at ffffffff89318306
    #13 [ffffa65531497bc8] print_hex_dump at ffffffff89650765
    #14 [ffffa65531497ca8] tun_do_read at ffffffffc0b06c27 [tun]
    #15 [ffffa65531497d38] tun_recvmsg at ffffffffc0b06e34 [tun]
    #16 [ffffa65531497d68] handle_rx at ffffffffc0c5d682 [vhost_net]
    #17 [ffffa65531497ed0] vhost_worker at ffffffffc0c644dc [vhost]
    #18 [ffffa65531497f10] kthread at ffffffff892d2e72
    #19 [ffffa65531497f50] ret_from_fork at ffffffff89c0022f

Fixes: ef3db4a ("tun: avoid BUG, dump packet on GSO errors")
Signed-off-by: Lei Chen <lei.chen@smartx.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20240415020247.2207781-1-lei.chen@smartx.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
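To show why rate limiting bounds the console time, here is a minimal user-space sketch of a windowed limiter. It is only loosely modelled on the idea behind net_ratelimit(); `struct ratelimit` and `ratelimit_ok()` are hypothetical names, and the time source is passed in explicitly so the behaviour is easy to follow.

```c
#include <stdint.h>

/* Hypothetical user-space rate limiter; not the kernel's implementation. */
struct ratelimit {
	uint64_t interval; /* window length, in arbitrary time units */
	unsigned burst;    /* messages allowed per window */
	uint64_t begin;    /* start of the current window */
	unsigned printed;  /* messages emitted in the current window */
	unsigned missed;   /* messages suppressed in the current window */
};

/* Returns 1 if the caller may print at time 'now', 0 if it should stay quiet. */
int ratelimit_ok(struct ratelimit *rs, uint64_t now)
{
	/* A new window starts once 'interval' time units have elapsed. */
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return 1;
	}
	rs->missed++;
	return 0;
}
```

A caller would wrap the expensive dump in `if (ratelimit_ok(&rs, now))`, so a flood of bad packets produces at most `burst` dumps per window instead of one per packet.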

Andrii Nakryiko says: ==================== bench: fast in-kernel triggering benchmarks Remove "legacy" triggering benchmarks which rely on syscalls (and thus syscall overhead is a noticeable part of benchmark, unfortunately). Replace them with faster versions that rely on triggering BPF programs in-kernel through another simple "driver" BPF program. See patch #2 with comparison results. raw_tp/tp/fmodret benchmarks required adding a simple kfunc in kernel to be able to trigger a simple tracepoint from BPF program (plus it is also allowed to be replaced by fmod_ret programs). This limits raw_tp/tp/fmodret benchmarks to new kernels only, but it keeps bench tool itself very portable and most of other benchmarks will still work on wide variety of kernels without the need to worry about building and deploying custom kernel module. See patches #5 and #6 for details. v1->v2: - move new TP closer to BPF test run code; - rename/move kfunc and register it for fmod_rets (Alexei); - limit --trig-batch-iters param to [1, 1000] (Alexei). ==================== Link: https://lore.kernel.org/r/20240326162151.3981687-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Wen Gu says: ==================== net/smc: SMC intra-OS shortcut with loopback-ism This patch set acts as the second part of the new version of [1] (The first part can be referred from [2]), the updated things of this version are listed at the end. - Background SMC-D is now used in IBM z with ISM function to optimize network interconnect for intra-CPC communications. Inspired by this, we try to make SMC-D available on the non-s390 architecture through a software-implemented Emulated-ISM device, that is the loopback-ism device here, to accelerate inter-process or inter-containers communication within the same OS instance. - Design This patch set includes 3 parts: - Patch #1: some prepare work for loopback-ism. - Patch #2-#7: implement loopback-ism device and adapt SMC-D for it. loopback-ism now serves only SMC and no userspace interfaces exposed. - Patch #8-#11: memory copy optimization for intra-OS scenario. The loopback-ism device is designed as an ISMv2 device and not be limited to a specific net namespace, ends of both inter-process connection (1/1' in diagram below) or inter-container connection (2/2' in diagram below) can find the same available loopback-ism and choose it during the CLC handshake. 
Container 1 (ns1) Container 2 (ns2) +-----------------------------------------+ +-------------------------+ | +-------+ +-------+ +-------+ | | +-------+ | | | App A | | App B | | App C | | | | App D |<-+ | | +-------+ +---^---+ +-------+ | | +-------+ |(2') | | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| | | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ | | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | | +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+ | | | | Kernel | | | | +----+-------v---+-----------v----------------------------------+---+----+ | | TCP | | | | | | | +--------------------------------------------------------------+ | | | | +--------------+ | | | smc loopback | | +---------------------------+--------------+-----------------------------+ loopback-ism device creates DMBs (shared memory) for each connection peer. Since data transfer occurs within the same kernel, the sndbuf of each peer is only a descriptor and point to the same memory region as peer DMB, so that the data copy from sndbuf to peer DMB can be avoided in loopback-ism case. Container 1 (ns1) Container 2 (ns2) +-----------------------------------------+ +-------------------------+ | +-------+ | | +-------+ | | | App C |-----+ | | | App D | | | +-------+ | | | +-^-----+ | | | | | | | | (2) | | | (2') | | | | | | | | +---------------|-------------------------+ +----------|--------------+ | | Kernel | | +---------------|-----------------------------------------|--------------+ | +--------+ +--v-----+ +--------+ +--------+ | | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| | | +-----|--+ +--|-----+ +-----|--+ +--------+ | | +-----|--+ | +-----|--+ | | | DMB C | +---------------------------------| DMB D | | | +--------+ +--------+ | | | | +--------------+ | | | smc loopback | | +---------------------------+--------------+-----------------------------+ - Benchmark Test * Test environments: - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem. 
- SMC sndbuf/DMB size 1MB. * Test object: - TCP: run on TCP loopback. - SMC lo: run on SMC loopback-ism. 1. ipc-benchmark (see [3]) - ./<foo> -c 1000000 -s 100 TCP SMC-lo Message rate (msg/s) 84991 151293(+78.01%) 2. sockperf - serv: <smc_run> sockperf sr --tcp - clnt: <smc_run> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30 TCP SMC-lo Bandwidth(MBps) 5033.569 7987.732(+58.69%) Latency(us) 5.986 3.398(-43.23%) 3. nginx/wrk - serv: <smc_run> nginx - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80 TCP SMC-lo Requests/s 187951.76 267107.90(+42.12%) 4. redis-benchmark - serv: <smc_run> redis-server - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024 TCP SMC-lo GET(Requests/s) 86132.64 118133.49(+37.15%) SET(Requests/s) 87374.40 122887.86(+40.65%) Change log: v7->v6 - Patch #2: minor: remove unnecessary 'return' of inline smc_loopback_exit(). - Patch #10: minor: directly return 0 instead of 'rc' in smcd_cdc_msg_send(). - all: collect the Reviewed-by tags. v6->RFC v5 Link: https://lore.kernel.org/netdev/20240414040304.54255-1-guwen@linux.alibaba.com/ - Patch #2: make the use of CONFIG_SMC_LO cleaner. - Patch #5: mark some smcd_ops that loopback-ism doesn't support as optional and check for the support when they are called. - Patch #7: keep loopback-ism at the beginning of the SMC-D device list. - Some expression changes in commit logs and comments. RFC v5->RFC v4: Link: https://lore.kernel.org/netdev/20240324135522.108564-1-guwen@linux.alibaba.com/ - Patch #2: minor changes in description of config SMC_LO and comments. - Patch #10: minor changes in comments and if(smc_ism_support_dmb_nocopy()) check in smcd_cdc_msg_send(). - Patch #3: change smc_lo_generate_id() to smc_lo_generate_ids() and SMC_LO_CHID to SMC_LO_RESERVED_CHID. - Patch #5: memcpy while holding the ldev->dmb_ht_lock. - Some expression changes in commit logs. 
RFC v4->v3: Link: https://lore.kernel.org/netdev/20240317100545.96663-1-guwen@linux.alibaba.com/ - The merge window of v6.9 is open, so post this series as an RFC. - Patch #6: since some information fed back by smc_nl_handle_smcd_dev() dose not apply to Emulated-ISM (including loopback-ism here), loopback-ism is not exposed through smc netlink for the time being. we may refactor this part when smc netlink interface is updated. v3->v2: Link: https://lore.kernel.org/netdev/20240312142743.41406-1-guwen@linux.alibaba.com/ - Patch #11: use tasklet_schedule(&conn->rx_tsklet) instead of smcd_cdc_rx_handler() to avoid possible recursive locking of conn->send_lock and use {read|write}_lock_bh() to acquire dmb_ht_lock. v2->v1: Link: https://lore.kernel.org/netdev/20240307095536.29648-1-guwen@linux.alibaba.com/ - All the patches: changed the term virtual-ISM to Emulated-ISM as defined by SMCv2.1. - Patch #3: optimized the description of SMC_LO config. Avoid exposing loopback-ism to sysfs and remove all the knobs until future definition clear. - Patch #3: try to make lockdep happy by using read_lock_bh() in smc_lo_move_data(). - Patch #6: defaultly use physical contiguous DMB buffers. - Patch #11: defaultly enable DMB no-copy for loopback-ism and free the DMB in unregister_dmb or detach_dmb when dmb_node->refcnt reaches 0, instead of using wait_event to keep waiting in unregister_dmb. v1->RFC: Link: https://lore.kernel.org/netdev/20240111120036.109903-1-guwen@linux.alibaba.com/ - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics: /sys/devices/virtual/smc/loopback-ism/xfer_bytes - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports merging sndbuf with peer DMB. - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and control of whether to merge sndbuf and DMB. 
They can be respectively set by: /sys/devices/virtual/smc/loopback-ism/dmb_type /sys/devices/virtual/smc/loopback-ism/dmb_copy The motivation for these two control is that a performance bottleneck was found when using vzalloced DMB and sndbuf is merged with DMB, and there are many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg() or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the vmap lock contention [6]. It has significant effects, but using virtual memory still has additional overhead compared to using physical memory. So this new version provides controls of dmb_type and dmb_copy to suit different scenarios. - Some minor changes and comments improvements. RFC->old version([1]): Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/ - Patch #1: improve the loopback-ism dump, it shows as follows now: # smcd d FID Type PCI-ID PCHID InUse #LGs PNET-ID 0000 0 loopback-ism ffff No 0 - Patch #3: introduce the smc_ism_set_v2_capable() helper and set smc_ism_v2_capable when ISMv2 or virtual ISM is registered, regardless of whether there is already a device in smcd device list. - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/. - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active to activate or deactivate the loopback-ism. - Patch #9: introduce the statistics of loopback-ism by /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}. - Some minor changes and comments improvements. 
[1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/ [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/ [3] https://github.com/goldsborough/ipc-bench [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/ [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/ [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/ ==================== Link: https://lore.kernel.org/r/20240428060738.60843-1-guwen@linux.alibaba.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>

…/git/pablo/gtp

Pablo Neira Ayuso says:

====================
gtp pull request 24-05-07

This v3 includes:

- fix for clang uninitialized variable per Jakub.
- address Smatch and Coccinelle reports per Simon
- remove inline in new IPv6 support per Simon
- fix memleaks in netlink control plane per Simon

-o-

The following patchset contains IPv6 GTP driver support for net-next, this also includes IPv6 over IPv4 and vice-versa:

Patch #1 removes an unnecessary stack variable initialization in the socket routine.

Patch #2 deals with GTP extension headers, handling this variable-length extension header so that packets are decapsulated accordingly. Otherwise, packets are dropped when these extension headers are present, which breaks interoperation with other non-Linux based GTP implementations.

Patch #3 prepares for IPv6 support by moving IPv4 specific fields in PDP context objects to a union.

Patch #4 adds IPv6 support while retaining backward compatibility. Three new attributes allow declaring an IPv6 GTP tunnel: GTPA_FAMILY, GTPA_PEER_ADDR6 and GTPA_MS_ADDR6, as well as IFLA_GTP_LOCAL6 to declare the IPv6 GTP UDP socket. Up to this patch, only IPv6 outer in IPv6 inner is supported.

Patch #5 uses IPv6 address /64 prefix for UE/MS in the inner headers. Unlike IPv4, which provides a 1:1 mapping between UE/MS, IPv6 tunnel encapsulates traffic for /64 address as specified by 3GPP TS. This patch has been split from Patch #4 to highlight this behaviour.

Patch #6 passes up IPv6 link-local traffic, such as IPv6 SLAAC, to userspace so that it is handled as control packets.

Patch #7 prepares to allow for GTP IPv4 over IPv6 and vice-versa by moving IP specific debugging out of the function to build IPv4 and IPv6 GTP packets.

Patch #8 generalizes TOS/DSCP handling following a similar approach as in the existing iptunnel infrastructure.

Patch #9 adds a helper function to build an IPv4 GTP packet in the outer header.

Patch #10 adds a helper function to build an IPv6 GTP packet in the outer header.

Patch #11 adds support for GTP IPv4-over-IPv6 and vice-versa.

Patch #12 allows using the same TID/TEID (tunnel identifier) for inner IPv4 and IPv6 packets for better UE/MS dual stack integration.

This series integrates with the osmocom.org project CI and TTCN-3 test infrastructure (Oliver Smith) as well as the userspace libgtpnl library.

Thanks to Harald Welte, Oliver Smith and Pau Espin for reviewing and providing feedback through the osmocom.org redmine platform to make this happen.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
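
The /64 handling described for Patch #5 can be pictured with a small user-space sketch. This is not the driver's code: the table `pdp_contexts`, the helpers `ms_prefix64()` and `lookup()`, and the tunnel label are hypothetical names used only to show that every inner address within one /64 resolves to the same context.

```python
import ipaddress

def ms_prefix64(addr: str) -> ipaddress.IPv6Network:
    """Return the /64 network that contains addr (the lookup key in this sketch)."""
    return ipaddress.ip_network(f"{addr}/64", strict=False)

# Hypothetical PDP context table keyed by the UE/MS /64 prefix
# (2001:db8::/32 is the IPv6 documentation range).
pdp_contexts = {ms_prefix64("2001:db8:1:2::"): "tunnel-A"}

def lookup(inner_addr: str):
    """Map an inner IPv6 address to its context by /64 prefix, or None."""
    return pdp_contexts.get(ms_prefix64(inner_addr))
```

Under these assumptions, any interface identifier inside 2001:db8:1:2::/64 maps to the same entry, while an address in a neighbouring /64 does not match.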

…rnel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

Patch #1 skips transaction if object type provides no .update interface.

Patch #2 skips NETDEV_CHANGENAME which is unused.

Patch #3 enables conntrack to handle Multicast Router Advertisements and Multicast Router Solicitations from the Multicast Router Discovery protocol (RFC4286) as untracked as opposed to invalid packets. From Linus Luessing.

Patch #4 updates DCCP conntracker to mark invalid packets as invalid, instead of dropping them, from Jason Xing.

Patch #5 uses NF_DROP instead of -NF_DROP since NF_DROP is 0, also from Jason.

Patch #6 removes reference in netfilter's sysctl documentation on pickup entries which were already removed by Florian Westphal.

Patch #7 removes check for IPS_OFFLOAD flag to disable early drop, which allows evicting entries from the conntrack table, also from Florian.

Patches #8 to #16 update the nf_tables pipapo set backend to allocate the datastructure copy on-demand from the preparation phase, to better deal with OOM situations where the .commit step is too late to fail. Series from Florian Westphal.

Patch #17 adds a selftest with packetdrill to cover conntrack TCP state transitions, also from Florian.

Patch #18 uses GFP_KERNEL to clone elements from control plane to avoid quick atomic reserves exhaustion with large sets; the reporter refers to sets on the order of a million entries.

* tag 'nf-next-24-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables: allow clone callbacks to sleep
  selftests: netfilter: add packetdrill based conntrack tests
  netfilter: nft_set_pipapo: remove dirty flag
  netfilter: nft_set_pipapo: move cloning of match info to insert/removal path
  netfilter: nft_set_pipapo: prepare pipapo_get helper for on-demand clone
  netfilter: nft_set_pipapo: merge deactivate helper into caller
  netfilter: nft_set_pipapo: prepare walk function for on-demand clone
  netfilter: nft_set_pipapo: prepare destroy function for on-demand clone
  netfilter: nft_set_pipapo: make pipapo_clone helper return NULL
  netfilter: nft_set_pipapo: move prove_locking helper around
  netfilter: conntrack: remove flowtable early-drop test
  netfilter: conntrack: documentation: remove reference to non-existent sysctl
  netfilter: use NF_DROP instead of -NF_DROP
  netfilter: conntrack: dccp: try not to drop skb in conntrack
  netfilter: conntrack: fix ct-state for ICMPv6 Multicast Router Discovery
  netfilter: nf_tables: remove NETDEV_CHANGENAME from netdev chain event handler
  netfilter: nf_tables: skip transaction if update object is not implemented
====================

Link: https://lore.kernel.org/r/20240512161436.168973-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

With commit c4cb231 ("iommu/amd: Add support for enable/disable IOPF") we are hitting the issue below. This happens because the IOPF enablement path holds a spin lock with IRQs disabled and then tries to take a mutex lock.

dmesg:
-----
  [ 0.938739] =============================
  [ 0.938740] [ BUG: Invalid wait context ]
  [ 0.938742] 6.10.0-rc1+ #1 Not tainted
  [ 0.938745] -----------------------------
  [ 0.938746] swapper/0/1 is trying to lock:
  [ 0.938748] ffffffff8c9f01d8 (&port_lock_key){....}-{3:3}, at: serial8250_console_write+0x78/0x4a0
  [ 0.938767] other info that might help us debug this:
  [ 0.938768] context-{5:5}
  [ 0.938769] 7 locks held by swapper/0/1:
  [ 0.938772] #0: ffff888101a91310 (&group->mutex){+.+.}-{4:4}, at: bus_iommu_probe+0x70/0x160
  [ 0.938790] #1: ffff888101d1f1b8 (&domain->lock){....}-{3:3}, at: amd_iommu_attach_device+0xa5/0x700
  [ 0.938799] #2: ffff888101cc3d18 (&dev_data->lock){....}-{3:3}, at: amd_iommu_attach_device+0xc5/0x700
  [ 0.938806] #3: ffff888100052830 (&iommu->lock){....}-{2:2}, at: amd_iommu_iopf_add_device+0x3f/0xa0
  [ 0.938813] #4: ffffffff8945a340 (console_lock){+.+.}-{0:0}, at: _printk+0x48/0x50
  [ 0.938822] #5: ffffffff8945a390 (console_srcu){....}-{0:0}, at: console_flush_all+0x58/0x4e0
  [ 0.938867] #6: ffffffff82459f80 (console_owner){....}-{0:0}, at: console_flush_all+0x1f0/0x4e0
  [ 0.938872] stack backtrace:
  [ 0.938874] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.10.0-rc1+ #1
  [ 0.938877] Hardware name: HP HP EliteBook 745 G3/807E, BIOS N73 Ver. 01.39 04/16/2019

Fix the above issue by re-arranging code in the attach device path:
  - move device PASID/IOPF enablement outside the lock in the AMD IOMMU driver. This is safe as the core layer holds the group->mutex lock before calling iommu_ops->attach_dev.

Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Fixes: c4cb231 ("iommu/amd: Add support for enable/disable IOPF")
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20240530084801.10758-1-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
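
The "Invalid wait context" splat above can be read with a deliberately simplified model. This is an assumption-laden sketch, not lockdep's algorithm: it treats the second number in each `{outer:inner}` pair as the context a held lock imposes, assumes a lower number means a stricter context, and ignores locks annotated `{0:0}`. The function name `check_wait_context()` is hypothetical; the lock names and numbers are copied from the log.

```python
def check_wait_context(held, new_outer):
    """Return the held lock that forbids taking a lock of type new_outer, or None.

    held: list of (name, inner_wait_type); 0 means the lock imposes no constraint.
    Simplified rule (assumption): the new lock's outer wait type must not
    exceed the strictest (smallest non-zero) inner type currently held.
    """
    constraining = [(wait_type, name) for name, wait_type in held if wait_type != 0]
    if not constraining:
        return None
    strictest, name = min(constraining)
    return name if new_outer > strictest else None

# Locks held at the time of the report, in acquisition order.
held_in_report = [
    ("group->mutex", 4),
    ("domain->lock", 3),
    ("dev_data->lock", 3),
    ("iommu->lock", 2),
    ("console_lock", 0),
    ("console_srcu", 0),
    ("console_owner", 0),
]

# port_lock_key is annotated {3:3} in the log.
offender = check_wait_context(held_in_report, new_outer=3)
```

In this model the {2:2} `iommu->lock` is what makes the {3:3} acquisition invalid, which is consistent with moving the enablement work out from under that lock.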

…PLES event"

This reverts commit 7d1405c.

This causes segfaults in some cases, as reported by Milian:

```
sudo /usr/bin/perf record -z --call-graph dwarf -e cycles -e raw_syscalls:sys_enter ls
...
[ perf record: Woken up 3 times to write data ]
malloc(): invalid next size (unsorted)
Aborted
```

Backtrace with GDB + debuginfod:

```
malloc(): invalid next size (unsorted)

Thread 1 "perf" received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
Downloading source file /usr/src/debug/glibc/glibc/nptl/pthread_kill.c
44            return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007ffff6ea8eb3 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:78
#2  0x00007ffff6e50a30 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007ffff6e384c3 in __GI_abort () at abort.c:79
#4  0x00007ffff6e39354 in __libc_message_impl (fmt=fmt@entry=0x7ffff6fc22ea "%s\n") at ../sysdeps/posix/libc_fatal.c:132
#5  0x00007ffff6eb3085 in malloc_printerr (str=str@entry=0x7ffff6fc5850 "malloc(): invalid next size (unsorted)") at malloc.c:5772
#6  0x00007ffff6eb657c in _int_malloc (av=av@entry=0x7ffff6ff6ac0 <main_arena>, bytes=bytes@entry=368) at malloc.c:4081
#7  0x00007ffff6eb877e in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3754
#8  0x000055555569bdb6 in perf_session.do_write_header ()
#9  0x00005555555a373a in __cmd_record.constprop.0 ()
#10 0x00005555555a6846 in cmd_record ()
#11 0x000055555564db7f in run_builtin ()
#12 0x000055555558ed77 in main ()
```

Valgrind memcheck:

```
==45136== Invalid write of size 8
==45136==    at 0x2B38A5: perf_event__synthesize_id_sample (in /usr/bin/perf)
==45136==    by 0x157069: __cmd_record.constprop.0 (in /usr/bin/perf)
==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
==45136==    by 0x142D76: main (in /usr/bin/perf)
==45136==  Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd
==45136==    at 0x4849BF3: calloc (vg_replace_malloc.c:1675)
==45136==    by 0x3574AB: zalloc (in /usr/bin/perf)
==45136==    by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf)
==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
==45136==    by 0x142D76: main (in /usr/bin/perf)
==45136==
==45136== Syscall param write(buf) points to unaddressable byte(s)
==45136==    at 0x575953D: __libc_write (write.c:26)
==45136==    by 0x575953D: write (write.c:24)
==45136==    by 0x35761F: ion (in /usr/bin/perf)
==45136==    by 0x357778: writen (in /usr/bin/perf)
==45136==    by 0x1548F7: record__write (in /usr/bin/perf)
==45136==    by 0x15708A: __cmd_record.constprop.0 (in /usr/bin/perf)
==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
==45136==    by 0x142D76: main (in /usr/bin/perf)
==45136==  Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd
==45136==    at 0x4849BF3: calloc (vg_replace_malloc.c:1675)
==45136==    by 0x3574AB: zalloc (in /usr/bin/perf)
==45136==    by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf)
==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
==45136==    by 0x142D76: main (in /usr/bin/perf)
==45136==
-----
```

Closes: https://lore.kernel.org/linux-perf-users/23879991.0LEYPuXRzz@milian-workstation/
Reported-by: Milian Wolff <milian.wolff@kdab.com>
Tested-by: Milian Wolff <milian.wolff@kdab.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: stable@kernel.org # 6.8+
Link: https://lore.kernel.org/lkml/Zl9ksOlHJHnKM70p@x1
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

We have been seeing crashes on duplicate keys in btrfs_set_item_key_safe():

  BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.c:2620!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
  RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]

With the following stack trace:

  #0  btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
  #1  btrfs_drop_extents (fs/btrfs/file.c:411:4)
  #2  log_one_extent (fs/btrfs/tree-log.c:4732:9)
  #3  btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
  #4  btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
  #5  btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
  #6  btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
  #7  btrfs_sync_file (fs/btrfs/file.c:1933:8)
  #8  vfs_fsync_range (fs/sync.c:188:9)
  #9  vfs_fsync (fs/sync.c:202:9)
  #10 do_fsync (fs/sync.c:212:9)
  #11 __do_sys_fdatasync (fs/sync.c:225:9)
  #12 __se_sys_fdatasync (fs/sync.c:223:1)
  #13 __x64_sys_fdatasync (fs/sync.c:223:1)
  #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
  #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
  #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)

So we're logging a changed extent from fsync, which is splitting an extent in the log tree. But this split part already exists in the tree, triggering the BUG().
This is the state of the log tree at the time of the crash, dumped with drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py) to get more details than btrfs_print_leaf() gives us:

  >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
  leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
  leaf 33439744 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
          item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
                  generation 7 transid 9 size 8192 nbytes 8473563889606862198
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 204 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417704.983333333 (2024-05-22 15:41:44)
                  mtime 1716417704.983333333 (2024-05-22 15:41:44)
                  otime 17592186044416.000000000 (559444-03-08 01:40:16)
          item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
                  index 195 namelen 3 name: 193
          item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 4096 ram 12288
                  extent compression 0 (none)
          item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 4096 nr 8192
          item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096
  ...

So the real problem happened earlier: notice that items 4 (4k-12k) and 5 (8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and item 5 starts at i_size.
Here is the state of the filesystem tree at the time of the crash:

  >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
  >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
  >>> print_extent_buffer(nodes[0])
  leaf 30425088 level 0 items 184 generation 9 owner 5
  leaf 30425088 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
          ...
          item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
                  generation 7 transid 7 size 4096 nbytes 12288
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 6 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417703.220000000 (2024-05-22 15:41:43)
                  mtime 1716417703.220000000 (2024-05-22 15:41:43)
                  otime 1716417703.220000000 (2024-05-22 15:41:43)
          item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
                  index 195 namelen 3 name: 193
          item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 8192 ram 12288
                  extent compression 0 (none)
          item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096

Item 5 in the log tree corresponds to item 183 in the filesystem tree, but nothing matches item 4. Furthermore, item 183 is the last item in the leaf.

btrfs_log_prealloc_extents() is responsible for logging prealloc extents beyond i_size. It first truncates any previously logged prealloc extents that start beyond i_size. Then, it walks the filesystem tree and copies the prealloc extent items to the log tree.

If it hits the end of a leaf, then it calls btrfs_next_leaf(), which unlocks the tree and does another search.
However, while the filesystem tree is unlocked, an ordered extent completion may modify the tree. In particular, it may insert an extent item that overlaps with an extent item that was already copied to the log tree.

This may manifest in several ways depending on the exact scenario, including an EEXIST error that is silently translated to a full sync, overlapping items in the log tree, or this crash. This particular crash is triggered by the following sequence of events:

- Initially, the file has i_size=4k, a regular extent from 0-4k, and a prealloc extent beyond i_size from 4k-12k. The prealloc extent item is the last item in its B-tree leaf.
- The file is fsync'd, which copies its inode item and both extent items to the log tree.
- An xattr is set on the file, which sets the BTRFS_INODE_COPY_EVERYTHING flag.
- The range 4k-8k in the file is written using direct I/O. i_size is extended to 8k, but the ordered extent is still in flight.
- The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this calls copy_inode_items_to_log(), which calls btrfs_log_prealloc_extents().
- btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the filesystem tree. Since it starts before i_size, it skips it. Since it is the last item in its B-tree leaf, it calls btrfs_next_leaf().
- btrfs_next_leaf() unlocks the path.
- The ordered extent completion runs, which converts the 4k-8k part of the prealloc extent to written and inserts the remaining prealloc part from 8k-12k.
- btrfs_next_leaf() does a search and finds the new prealloc extent 8k-12k.
- btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into the log tree. Note that it overlaps with the 4k-12k prealloc extent that was copied to the log tree by the first fsync.
- fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k extent that was written.
- This tries to drop the range 4k-8k in the log tree, which requires adjusting the start of the 4k-12k prealloc extent in the log tree to 8k.
- btrfs_set_item_key_safe() sees that there is already an extent starting at 8k in the log tree and calls BUG().

Fix this by detecting when we're about to insert an overlapping file extent item in the log tree and truncating the part that would overlap.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
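
The idea behind the fix can be sketched on plain byte ranges. This is a toy model, not the btrfs code: extents are half-open `(start, end)` tuples, the function names are hypothetical, and the choice to trim the already-logged item (rather than the incoming one) is an assumption, since the message only says the overlapping part is truncated.

```python
K = 1024

def insert_truncating_overlap(log_extents, new_start, new_end):
    """Insert [new_start, new_end) into a list of logged extents,
    first cutting away whatever part of an existing extent it would overlap."""
    result = []
    for start, end in log_extents:
        if start < new_start:
            # Keep the piece that lies before the new item.
            result.append((start, min(end, new_start)))
        if end > new_end:
            # Keep the piece that lies after the new item.
            result.append((max(start, new_end), end))
    result.append((new_start, new_end))
    return sorted(result)

def has_overlap(extents):
    """True if any two neighbouring extents in the list overlap."""
    ordered = sorted(extents)
    return any(a_end > b_start for (_, a_end), (b_start, _) in zip(ordered, ordered[1:]))

# Log tree after the first fsync: regular 0-4k and prealloc 4k-12k.
log_tree = [(0, 4 * K), (4 * K, 12 * K)]

# Naively appending the re-found 8k-12k prealloc item reproduces the bad state.
naive = log_tree + [(8 * K, 12 * K)]

# Truncating on insert leaves 0-4k, 4k-8k and 8k-12k with no overlap.
fixed = insert_truncating_overlap(log_tree, 8 * K, 12 * K)
```

With the overlap removed at insert time, the later drop of 4k-8k no longer needs to move an item key onto an offset that is already occupied.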

Petr Machata says:

====================
mlxsw: ACL fixes

Ido Schimmel writes:

Patches #1-#3 fix various spelling mistakes I noticed while working on the code base.

Patch #4 fixes a general protection fault by bailing out when the error occurs and warning.

Patch #5 fixes the warning.

Patch #6 fixes ACL scale regression and firmware errors.

See the commit messages for more info.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Petr Machata says:

====================
mlxsw: Use page pool for Rx buffers allocation

Amit Cohen writes:

After using NAPI to process events from hardware, the next step is to use page pool for Rx buffers allocation, which also enhances performance.

To simplify this change, first use page pool to allocate one continuous buffer for each packet; later, memory consumption can be improved by using fragmented buffers.

This set significantly enhances mlxsw driver performance: the CPU can handle about 370% of the packets per second it previously handled.

The next planned improvement is using XDP to optimize telemetry.

Patch set overview:
Patches #1-#2 are small preparations for page pool usage
Patch #3 initializes page pool, but does not use it
Patch #4 converts the driver to use page pool for buffer allocation
Patch #5 is an optimization for buffer access
Patch #6 cleans up an unused structure
Patch #7 uses napi_consume_skb() as part of Tx completion
====================

Link: https://lore.kernel.org/r/cover.1718709196.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

The code in ocfs2_dio_end_io_write() estimates number of necessary transaction credits using ocfs2_calc_extend_credits(). This however does not take into account that the IO could be arbitrarily large and can contain arbitrary number of extents. Extent tree manipulations do often extend the current transaction but not in all of the cases. For example if we have only single block extents in the tree, ocfs2_mark_extent_written() will end up calling ocfs2_replace_extent_rec() all the time and we will never extend the current transaction and eventually exhaust all the transaction credits if the IO contains many single block extents. Once that happens a WARN_ON(jbd2_handle_buffer_credits(handle) <= 0) is triggered in jbd2_journal_dirty_metadata() and subsequently OCFS2 aborts in response to this error. This was actually triggered by one of our customers on a heavily fragmented OCFS2 filesystem. To fix the issue make sure the transaction always has enough credits for one extent insert before each call of ocfs2_mark_extent_written(). 
Heming Zhao said:

------
PANIC: "Kernel panic - not syncing: OCFS2: (device dm-1): panic forced after error"

PID: xxx  TASK: xxxx  CPU: 5  COMMAND: "SubmitThread-CA"
  #0 machine_kexec at ffffffff8c069932
  #1 __crash_kexec at ffffffff8c1338fa
  #2 panic at ffffffff8c1d69b9
  #3 ocfs2_handle_error at ffffffffc0c86c0c [ocfs2]
  #4 __ocfs2_abort at ffffffffc0c88387 [ocfs2]
  #5 ocfs2_journal_dirty at ffffffffc0c51e98 [ocfs2]
  #6 ocfs2_split_extent at ffffffffc0c27ea3 [ocfs2]
  #7 ocfs2_change_extent_flag at ffffffffc0c28053 [ocfs2]
  #8 ocfs2_mark_extent_written at ffffffffc0c28347 [ocfs2]
  #9 ocfs2_dio_end_io_write at ffffffffc0c2bef9 [ocfs2]
  #10 ocfs2_dio_end_io at ffffffffc0c2c0f5 [ocfs2]
  #11 dio_complete at ffffffff8c2b9fa7
  #12 do_blockdev_direct_IO at ffffffff8c2bc09f
  #13 ocfs2_direct_IO at ffffffffc0c2b653 [ocfs2]
  #14 generic_file_direct_write at ffffffff8c1dcf14
  #15 __generic_file_write_iter at ffffffff8c1dd07b
  #16 ocfs2_file_write_iter at ffffffffc0c49f1f [ocfs2]
  #17 aio_write at ffffffff8c2cc72e
  #18 kmem_cache_alloc at ffffffff8c248dde
  #19 do_io_submit at ffffffff8c2ccada
  #20 do_syscall_64 at ffffffff8c004984
  #21 entry_SYSCALL_64_after_hwframe at ffffffff8c8000ba

Link: https://lkml.kernel.org/r/20240617095543.6971-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20240614145243.8837-1-jack@suse.cz
Fixes: c15471f ("ocfs2: fix sparse file & data ordering issue in direct io")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
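
The credit exhaustion and the shape of the fix can be illustrated with a toy model. Nothing here is the jbd2 or OCFS2 API: `Handle`, `ensure()`, `dirty()` and the credit numbers are hypothetical stand-ins, and "topping up" credits stands in for whatever the kernel does to extend the transaction.

```python
class Handle:
    """Toy transaction handle with a fixed budget of credits."""

    def __init__(self, credits):
        self.credits = credits
        self.extensions = 0

    def dirty(self, cost):
        # Stand-in for the point where running out of credits is detected.
        if self.credits < cost:
            raise RuntimeError("out of transaction credits")
        self.credits -= cost

    def ensure(self, needed):
        # Stand-in for extending the transaction when the budget is too low.
        if self.credits < needed:
            self.credits = needed
            self.extensions += 1

def mark_extents_written(handle, n_extents, cost_per_extent, ensure_first):
    """Process n_extents single-block extents; return how many were completed."""
    for _ in range(n_extents):
        if ensure_first:
            handle.ensure(cost_per_extent)
        handle.dirty(cost_per_extent)
    return n_extents
```

Under these assumptions, a handle sized for a handful of extents fails on a heavily fragmented write unless the budget is checked before each extent, which mirrors the approach the message describes.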

Danielle Ratson says:

====================
Add ability to flash modules' firmware

CMIS compliant modules such as QSFP-DD might be running a firmware that can be updated in a vendor-neutral way by exchanging messages between the host and the module as described in section 7.2.2 of revision 4.0 of the CMIS standard.

According to the CMIS standard, the firmware update process is done using a CDB commands sequence.

CDB (Command Data Block Message Communication) reads and writes are performed on memory map pages 9Fh-AFh according to the CMIS standard, section 8.12 of revision 4.0.

Add a pair of new ethtool messages that allow:

* User space to trigger firmware update of transceiver modules
* The kernel to notify user space about the progress of the process

The user interface is designed to be asynchronous in order to avoid RTNL being held for too long and to allow several modules to be updated simultaneously. The interface is designed with CMIS compliant modules in mind, but kept generic enough to accommodate future use cases, if these arise.

The kernel interface that will implement the firmware update using CDB commands will include 2 layers that will be added under ethtool:

* The upper layer, which will be triggered from the module layer, is cmis_fw_update.
* The lower one is cmis_cdb.

In the future there might be more operations to implement using CDB commands. Therefore, the idea is to keep the cmis_cdb interface clean and the cmis_fw_update specific to the CDB commands handling it.

The communication between the kernel and the driver will be done using two ethtool operations that enable reading and writing the transceiver module EEPROM. The operation ethtool_ops::get_module_eeprom_by_page, which is already implemented, will be used for reading the CDB reply from the EEPROM, e.g. reading module setting, state, etc.
The operation ethtool_ops::set_module_eeprom_by_page, which is added in the current patchset, will be used for writing the CDB command to the EEPROM, such as start firmware image, run firmware image, etc.

Therefore in order for a driver to implement module flashing, that driver needs to implement the two functions mentioned above.

Patchset overview:
Patch #1-#2: Implement the EEPROM writing in mlxsw.
Patch #3: Define the interface between the kernel and user space.
Patch #4: Add ability to notify the flashing firmware progress.
Patch #5: Veto operations during flashing.
Patch #6: Add extended compliance codes.
Patch #7: Add the cdb layer.
Patch #8: Add the fw_update layer.
Patch #9: Add ability to flash transceiver modules' firmware.

v8:
	Patch #7:
	* In the ethtool_cmis_wait_for_cond() evaluate the condition once more to decide if the error code should be -ETIMEDOUT or something else.
	* s/netdev_err/netdev_err_once.

v7:
	Patch #4:
	* Return -ENOMEM instead of PTR_ERR(attr) on ethnl_module_fw_flash_ntf_put_err().
	Patch #9:
	* Fix Warning for not unlocking the spin_lock in the error flow on module_flash_fw_work_list_add().
	* Avoid the fall-through on ethnl_sock_priv_destroy().

v6:
	* Squash some of the last patch to patch #5 and patch #9.
	Patch #3:
	* Add paragraph in .rst file.
	Patch #4:
	* Reserve '1' more place on SKB for NUL terminator in the error message string.
	* Add more prints on error flow, re-write the printing function and add ethnl_module_fw_flash_ntf_put_err().
	* Change the communication method so notification will be sent in unicast instead of multicast.
	* Add new 'struct ethnl_module_fw_flash_ntf_params' that holds the relevant info for unicast communication and use it to send notification to the specific socket.
	* s/nla_put_u64_64bit/nla_put_uint/
	Patch #7:
	* In ethtool_cmis_cdb_init(), Use 'const' for the 'params' parameter.
	Patch #8:
	* Add a list field to struct ethtool_module_fw_flash for module_fw_flash_work_list that will be presented in the next patch.
	* Move ethtool_cmis_fw_update() cleaning to a new function that will be represented in the next patch.
	* Move some of the fields in struct ethtool_module_fw_flash to a separate struct, so ethtool_cmis_fw_update() will get only the relevant parameters for it.
	* Edit the relevant functions to get the relevant params for them.
	* s/CMIS_MODULE_READY_MAX_DURATION_USEC/CMIS_MODULE_READY_MAX_DURATION_MSEC
	Patch #9:
	* Add a paragraph in the commit message.
	* Rename labels in module_flash_fw_schedule().
	* Add info to genl_sk_priv_*() and implement the relevant callbacks, in order to handle properly a scenario of closing the socket from user space before the work item was ended.
	* Add a list that holds all the ethtool_module_fw_flash structs that correspond to the in-progress work items.
	* Add a new enum for the socket types.
	* Use both above to identify a flashing socket, add it to the list and when closing socket affect only the flashing type.
	* Create a new function that will get the work item instead of ethtool_cmis_fw_update().
	* Edit the relevant functions to get the relevant params for them.
	* The new function will call the old ethtool_cmis_fw_update(), and do the cleaning, so the existence of the list should be completely isolated in module.c.
===================

Signed-off-by: David S. Miller <davem@davemloft.net>
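
The v8 note about evaluating the condition once more after the wait expires describes a common polling pattern, sketched here in user space. This is not `ethtool_cmis_wait_for_cond()` itself: the three-state status strings, the use of `-EIO` for a failed status, and counting polls instead of measuring time are all assumptions made to keep the example deterministic.

```python
import errno

def wait_for_cond(read_status, max_polls, pause=lambda: None):
    """Poll read_status() up to max_polls times, then check one final time.

    read_status() returns "done", "failed" or "pending" (hypothetical states).
    Returns 0 on success or a negative errno value.
    """
    for _ in range(max_polls):
        status = read_status()
        if status == "done":
            return 0
        if status == "failed":
            return -errno.EIO
        pause()  # e.g. sleep between polls; a no-op in this sketch

    # The budget is used up, but the state may have changed during the last
    # pause, so evaluate once more before choosing the error code.
    status = read_status()
    if status == "done":
        return 0
    if status == "failed":
        return -errno.EIO
    return -errno.ETIMEDOUT
```

The final read matters when the condition becomes true during the last pause: without it, such a run would be reported as a timeout even though the operation had finished.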

Petr Machata says:

====================
selftest: Clean-up and stabilize mirroring tests

The mirroring selftests work by sending ICMP traffic between two hosts. Along the way, this traffic is mirrored to a gretap netdevice, and counter taps are then installed strategically along the path of the mirrored traffic to verify the mirroring took place.

The problem with this is that besides mirroring the primary traffic, any other service traffic is mirrored as well. At the same time, because the tests need to work in HW-offloaded scenarios, the ability of the device to do arbitrary packet inspection should not be taken for granted. Most tests therefore simply use matchall, one uses flower to match on IP address.

As a result, the selftests are noisy. mirror_test() accommodated this noisiness by giving the counters an allowance of several packets. But that only works up to a point, and on busy systems won't always be enough.

In this patch set, clean up and stabilize the mirroring selftests. The original intention was to port the tests over to UDP, but the logic of ICMP ends up being so entangled in the mirroring selftests that the changes feel overly invasive. Instead, ICMP is kept, but where possible, we match on ICMP message type, thus filtering out hits by other ICMP messages.

Where this is not practical (where the counter tap is put on a device that carries encapsulated packets), switch the counter condition to _at least_ X observed packets. This is less robust, but barely so -- probably the only scenario that this would not catch is something like erroneous packet duplication, which would hopefully get caught by the numerous other tests in this extensive suite.

- Patches #1 to #3 clean up parameters at various helpers.

- Patches #4 to #6 stabilize the mirroring selftests as described above.

- Mirroring tests currently allow testing SW datapath even on HW netdevices by trapping traffic to the SW datapath. This complicates the tests a bit without a good reason: to test SW datapath, just run the selftests on the veth topology. Thus in patch #7, drop support for this dual SW/HW testing.

- At this point, some cleanups were either made possible by the previous patches, or were always possible. In patches #8 to #11, realize these cleanups.

- In patch #12, fix mlxsw mirror_gre selftest to respect setting TESTS.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
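
The two counter conditions discussed in this cover letter can be contrasted in a few lines. This is only an illustration: the selftests themselves are shell scripts, and the function names, the one-sided allowance, and the value `ALLOWANCE = 3` are hypothetical.

```python
ALLOWANCE = 3  # hypothetical "several packets" of tolerated noise

def within_allowance(observed, expected, allowance=ALLOWANCE):
    """Older style: accept the expected count plus a few stray packets."""
    return expected <= observed <= expected + allowance

def at_least(observed, expected):
    """Relaxed style for taps that see encapsulated traffic."""
    return observed >= expected
```

On a busy system that mirrors, say, four unrelated packets on top of ten expected ones, the allowance check fails while the at-least check still passes; the at-least check cannot notice duplicated packets, which is the trade-off the text acknowledges.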

[syzbot] KASAN: use-after-free Read in skb_release_head_state #6

tedd-an commented Apr 12, 2021
