It's hot and consumes a little power. The phone is still hot after turning off the screen #5

Closed
FreeLife2 opened this issue Aug 23, 2021 · 0 comments

Comments

@FreeLife2

This kernel does a great job and works well on my Apollo. It gives me a double-tap-to-wake node, but the phone runs a little hot and the kernel doesn't save much power. Please optimize it. Thank you for your work. I'm willing to donate.

myissam pushed a commit to myissam/kernel_xiaomi_sm8250 that referenced this issue Aug 30, 2021
[ Upstream commit 6cf539a87a61a4fbc43f625267dbcbcf283872ed ]

This fixes a data-race where `atomic_t dynticks` is copied by value. The
copy is performed non-atomically, resulting in a data-race if `dynticks`
is updated concurrently.

This data-race was found with KCSAN:
==================================================================
BUG: KCSAN: data-race in dyntick_save_progress_counter / rcu_irq_enter

write to 0xffff989dbdbe98e0 of 4 bytes by task 10 on cpu 3:
 atomic_add_return include/asm-generic/atomic-instrumented.h:78 [inline]
 rcu_dynticks_snap kernel/rcu/tree.c:310 [inline]
 dyntick_save_progress_counter+0x43/0x1b0 kernel/rcu/tree.c:984
 force_qs_rnp+0x183/0x200 kernel/rcu/tree.c:2286
 rcu_gp_fqs kernel/rcu/tree.c:1601 [inline]
 rcu_gp_fqs_loop+0x71/0x880 kernel/rcu/tree.c:1653
 rcu_gp_kthread+0x22c/0x3b0 kernel/rcu/tree.c:1799
 kthread+0x1b5/0x200 kernel/kthread.c:255
 <snip>

read to 0xffff989dbdbe98e0 of 4 bytes by task 154 on cpu 7:
 rcu_nmi_enter_common kernel/rcu/tree.c:828 [inline]
 rcu_irq_enter+0xda/0x240 kernel/rcu/tree.c:870
 irq_enter+0x5/0x50 kernel/softirq.c:347
 <snip>

Reported by Kernel Concurrency Sanitizer on:
CPU: 7 PID: 154 Comm: kworker/7:1H Not tainted 5.3.0+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Workqueue: kblockd blk_mq_run_work_fn
==================================================================

Signed-off-by: Marco Elver <elver@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: rcu@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Lucas Lee Jing Yi <lucasleeeeeeeee@gmail.com>
Signed-off-by: Carlos Ayrton Lopez Arroyo <15030201@itcelaya.edu.mx>
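
The race described in the commit message above comes from copying an atomic variable as a whole object instead of reading it through an atomic accessor. The following is a minimal userspace sketch of the safe pattern, assuming C11 `<stdatomic.h>` in place of the kernel's `atomic_t`; `struct cpu_state` and the helper names are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-CPU structure holding an atomic counter. */
struct cpu_state {
    atomic_int dynticks;
};

/*
 * Take one atomic snapshot of the counter. Passing the atomic object
 * itself by value would copy it with a plain access, which can race
 * with a concurrent update.
 */
static int dynticks_snapshot(struct cpu_state *st)
{
    assert(st != NULL);
    return atomic_load(&st->dynticks);
}

/* Consumers work on the plain int snapshot, never on the atomic object. */
static int snapshot_is_odd(int snap)
{
    return snap & 1;
}

/* Atomically increment the counter, then return a fresh snapshot. */
static int bump_and_snapshot(struct cpu_state *st)
{
    assert(st != NULL);
    atomic_fetch_add(&st->dynticks, 1);
    return dynticks_snapshot(st);
}
```

Code that only needs the value, such as a tracing hook, would then receive the `int` returned by `dynticks_snapshot()` rather than the atomic object.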
myissam pushed a commit to myissam/kernel_xiaomi_sm8250 that referenced this issue Sep 8, 2021
myissam pushed a commit to myissam/kernel_xiaomi_sm8250 that referenced this issue Sep 9, 2021
myissam pushed a commit to myissam/kernel_xiaomi_sm8250 that referenced this issue Sep 18, 2021
UtsavBalar1231 pushed a commit that referenced this issue Sep 29, 2021
commit 57f0ff059e3daa4e70a811cb1d31a49968262d20 upstream.

It's later supposed to be either a correct address or NULL. Without the
initialization, it may contain an undefined value which results in the
following segmentation fault:

  # perf top --sort comm -g --ignore-callees=do_idle

terminates with:

  #0  0x00007ffff56b7685 in __strlen_avx2 () from /lib64/libc.so.6
  #1  0x00007ffff55e3802 in strdup () from /lib64/libc.so.6
  #2  0x00005555558cb139 in hist_entry__init (callchain_size=<optimized out>, sample_self=true, template=0x7fffde7fb110, he=0x7fffd801c250) at util/hist.c:489
  #3  hist_entry__new (template=template@entry=0x7fffde7fb110, sample_self=sample_self@entry=true) at util/hist.c:564
  #4  0x00005555558cb4ba in hists__findnew_entry (hists=hists@entry=0x5555561d9e38, entry=entry@entry=0x7fffde7fb110, al=al@entry=0x7fffde7fb420,
      sample_self=sample_self@entry=true) at util/hist.c:657
  #5  0x00005555558cba1b in __hists__add_entry (hists=hists@entry=0x5555561d9e38, al=0x7fffde7fb420, sym_parent=<optimized out>, bi=bi@entry=0x0, mi=mi@entry=0x0,
      sample=sample@entry=0x7fffde7fb4b0, sample_self=true, ops=0x0, block_info=0x0) at util/hist.c:288
  #6  0x00005555558cbb70 in hists__add_entry (sample_self=true, sample=0x7fffde7fb4b0, mi=0x0, bi=0x0, sym_parent=<optimized out>, al=<optimized out>, hists=0x5555561d9e38)
      at util/hist.c:1056
  #7  iter_add_single_cumulative_entry (iter=0x7fffde7fb460, al=<optimized out>) at util/hist.c:1056
  #8  0x00005555558cc8a4 in hist_entry_iter__add (iter=iter@entry=0x7fffde7fb460, al=al@entry=0x7fffde7fb420, max_stack_depth=<optimized out>, arg=arg@entry=0x7fffffff7db0)
      at util/hist.c:1231
  #9  0x00005555557cdc9a in perf_event__process_sample (machine=<optimized out>, sample=0x7fffde7fb4b0, evsel=<optimized out>, event=<optimized out>, tool=0x7fffffff7db0)
      at builtin-top.c:842
  #10 deliver_event (qe=<optimized out>, qevent=<optimized out>) at builtin-top.c:1202
  #11 0x00005555558a9318 in do_flush (show_progress=false, oe=0x7fffffff80e0) at util/ordered-events.c:244
  #12 __ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP, timestamp=timestamp@entry=0) at util/ordered-events.c:323
  #13 0x00005555558a9789 in __ordered_events__flush (timestamp=<optimized out>, how=<optimized out>, oe=<optimized out>) at util/ordered-events.c:339
  #14 ordered_events__flush (how=OE_FLUSH__TOP, oe=0x7fffffff80e0) at util/ordered-events.c:341
  #15 ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP) at util/ordered-events.c:339
  #16 0x00005555557cd631 in process_thread (arg=0x7fffffff7db0) at builtin-top.c:1114
  #17 0x00007ffff7bb817a in start_thread () from /lib64/libpthread.so.0
  #18 0x00007ffff5656dc3 in clone () from /lib64/libc.so.6

If you look at the frame #2, the code is:

488	 if (he->srcline) {
489          he->srcline = strdup(he->srcline);
490          if (he->srcline == NULL)
491              goto err_rawdata;
492	 }

If he->srcline is not NULL (it is not NULL if it is uninitialized rubbish),
it gets strdupped and strdupping a rubbish random string causes the problem.

Also, if you look at the commit 1fb7d06, it adds the srcline property
into the struct, but not initializing it everywhere needed.

Committer notes:

Now I see, when using --ignore-callees=do_idle we end up here at line
2189 in add_callchain_ip():

2181         if (al.sym != NULL) {
2182                 if (perf_hpp_list.parent && !*parent &&
2183                     symbol__match_regex(al.sym, &parent_regex))
2184                         *parent = al.sym;
2185                 else if (have_ignore_callees && root_al &&
2186                   symbol__match_regex(al.sym, &ignore_callees_regex)) {
2187                         /* Treat this symbol as the root,
2188                            forgetting its callees. */
2189                         *root_al = al;
2190                         callchain_cursor_reset(cursor);
2191                 }
2192         }

And the al that doesn't have the ->srcline field initialized will be
copied to the root_al, so then, back to:

1211 int hist_entry_iter__add(struct hist_entry_iter *iter, struct addr_location *al,
1212                          int max_stack_depth, void *arg)
1213 {
1214         int err, err2;
1215         struct map *alm = NULL;
1216
1217         if (al)
1218                 alm = map__get(al->map);
1219
1220         err = sample__resolve_callchain(iter->sample, &callchain_cursor, &iter->parent,
1221                                         iter->evsel, al, max_stack_depth);
1222         if (err) {
1223                 map__put(alm);
1224                 return err;
1225         }
1226
1227         err = iter->ops->prepare_entry(iter, al);
1228         if (err)
1229                 goto out;
1230
1231         err = iter->ops->add_single_entry(iter, al);
1232         if (err)
1233                 goto out;
1234

That al at line 1221 is what hist_entry_iter__add() (called from
sample__resolve_callchain()) saw as 'root_al', and then:

        iter->ops->add_single_entry(iter, al);

will go on with al->srcline with a bogus value, I'll add the above
sequence to the cset and apply, thanks!

Signed-off-by: Michael Petlan <mpetlan@redhat.com>
CC: Milian Wolff <milian.wolff@kdab.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Fixes: 1fb7d06 ("perf report Use srcline from callchain for hist entries")
Link: https://lore.kernel.org/r/20210719145332.29747-1-mpetlan@redhat.com
Reported-by: Juri Lelli <jlelli@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
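
The failure mode in the commit message above is general: a pointer field that is "either a valid address or NULL" only works if every instance is initialized before it is copied. The following is a reduced sketch of that rule, not the perf code itself; `struct location`, `location_init()` and `location_copy()` are hypothetical names, and a local `dup_string()` helper stands in for `strdup()`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced stand-in for a record with an optional string. */
struct location {
    unsigned long addr;
    char *srcline;  /* must be a valid string or NULL, never garbage */
};

/* Zero the whole record so that optional pointers start out as NULL. */
static void location_init(struct location *loc)
{
    assert(loc != NULL);
    memset(loc, 0, sizeof(*loc));
}

/* Portable replacement for strdup(); returns NULL on allocation failure. */
static char *dup_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);

    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/*
 * Deep-copy a record. The NULL test is only meaningful because
 * location_init() guarantees srcline is never left uninitialized.
 * Returns 0 on success, -1 on allocation failure.
 */
static int location_copy(struct location *dst, const struct location *src)
{
    *dst = *src;
    if (src->srcline != NULL) {
        dst->srcline = dup_string(src->srcline);
        if (dst->srcline == NULL)
            return -1;
    }
    return 0;
}
```

Without the call to `location_init()`, the `src->srcline != NULL` test could pass on an indeterminate pointer, which is the same shape as the crash in frame #2 of the backtrace.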
mesziman pushed a commit to mesziman/kernel_xiaomi_sm8250 that referenced this issue Dec 8, 2021
… into dot11

Squashed commit of the following:

commit 3e329ba26326d4153032cc1ba51e6d0797852e09
Merge: 36141974c08e 6eb8d7f5c932
Author: KyuoFoxHuyu <88403766@qq.com>
Date:   Sat Oct 2 14:57:36 2021 +0800

    Merge branch 'android-4.19-stable' of https://android.googlesource.com/kernel/common into baseline-r

commit 36141974c08eb8cb46c0894d8800c8ca0b88b22e
Merge: 222d1f891329 647114f95dd9
Author: KyuoFoxHuyu <88403766@qq.com>
Date:   Sat Oct 2 14:47:43 2021 +0800

    Merge tag 'LA.UM.9.12.r1-12900-SMxx50.QSSI12.0' of https://source.codeaurora.cn/quic/la/kernel/msm-4.19 into baseline-r

    "LA.UM.9.12.r1-12900-SMxx50.QSSI12.0"

commit 6eb8d7f5c9329641bad380f03f6e4d0a2dfd169a
Author: Greg Kroah-Hartman <gregkh@google.com>
Date:   Tue Sep 28 14:38:19 2021 +0200

    ANDROID: GKI: rework the ANDROID_KABI_USE() macro to not use __UNIQUE()

    The __UNIQUE_ID() macro causes problems as it turns out to not be
    deterministic across different compiler runs as it relies on the
    __COUNTER__ macro which could have been used on other .h files previous
    to this .h file being included.

    This shows up specifically when building with "LTO=thin" vs. "LTO=full"
    as different build paths seem to be triggered.

    As the structure name isn't really needed at all here, we were just
    including it for older compilers that could not handle anonymous
    structures in a union, just drop the whole thing which resolves the abi
    naming issue.

    Bug: 210255585
    Reported-by: Giuliano Procida <gprocida@google.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
    Change-Id: I6b9449fa9d26ffc5d66b2f0f3b41e2d5f3003f68
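
The idea of dropping the structure name can be illustrated with a C11 anonymous struct inside an anonymous union. This is a simplified sketch only: `KABI_REPLACE_SKETCH`, `struct device_v1` and `struct device_v2` are hypothetical, and the real ANDROID_KABI macros perform additional checks that are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical simplification: overlay a new field on a reserved slot.
 * The inner struct is anonymous, so no generated (counter-based) name
 * is needed and the resulting type names are stable across builds.
 */
#define KABI_REPLACE_SKETCH(orig_decl, new_decl) \
    union {                                      \
        new_decl;                                \
        struct {                                 \
            orig_decl;                           \
        };                                       \
    }

/* Original layout with a reserved padding slot. */
struct device_v1 {
    int id;
    uint64_t reserved1;
};

/* Updated layout: the reserved slot now also carries 'flags'. */
struct device_v2 {
    int id;
    KABI_REPLACE_SKETCH(uint64_t reserved1, uint32_t flags);
};

/* The overlay must not change the size of the structure. */
_Static_assert(sizeof(struct device_v1) == sizeof(struct device_v2),
               "layout size must not change");

/* Store into the new field and read it back. */
static uint32_t device_set_flags(struct device_v2 *dev, uint32_t flags)
{
    assert(dev != NULL);
    dev->flags = flags;
    return dev->flags;
}
```

Both `flags` and `reserved1` remain directly accessible as members of `struct device_v2`, so existing users of the reserved field keep compiling.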

commit be89a6f80be6526d8b76d0e2144d3db1e4645744
Author: Martijn Coenen <maco@android.com>
Date:   Tue Aug 25 09:18:29 2020 +0200

    BACKPORT: loop: Set correct device size when using LOOP_CONFIGURE

    The device size calculation was done before processing the loop
    configuration, which meant that the we set the size on the underlying
    block device incorrectly in case lo_offset/lo_sizelimit were set in the
    configuration. Delay computing the size until we've setup the device
    parameters correctly.

    Fixes: 3448914e8cc5("loop: Add LOOP_CONFIGURE ioctl")
    Reported-by: Lennart Poettering <mzxreary@0pointer.de>
    Tested-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com>
    Signed-off-by: Martijn Coenen <maco@android.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    (cherry picked from commit 79e5dc59e2974a48764269fa9ff544ae8ffe3338)
    Bug: 187129171
    Signed-off-by: Connor O'Brien <connoro@google.com>
    Change-Id: I823aba7e482eaf347992d507c875c10469a27c16
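
The ordering problem above matters because the visible device size depends on the configured offset and size limit. A minimal sketch of such a size calculation follows, assuming 512-byte sectors; `loop_size_sectors()` is a hypothetical helper written for illustration and is not the driver's code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: size of a loop device in 512-byte sectors,
 * given the backing file size, a start offset and an optional size
 * limit (0 means "no limit"). All inputs are in bytes.
 */
static uint64_t loop_size_sectors(int64_t file_bytes, int64_t offset,
                                  int64_t sizelimit)
{
    int64_t size = file_bytes;

    assert(file_bytes >= 0);

    /* Skip the part of the file before the configured offset. */
    if (offset > 0)
        size -= offset;
    /* An offset past the end of the file leaves nothing to expose. */
    if (size < 0)
        return 0;
    /* Apply the limit only if it actually shrinks the device. */
    if (sizelimit > 0 && sizelimit < size)
        size = sizelimit;

    return (uint64_t)size >> 9;
}
```

Calling this before `offset` and `sizelimit` have been applied from the configuration would report the full backing-file size, which is the kind of mismatch the commit describes.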

commit 11156bde8db8df89d92b533ea7a12df379bf23a7
Merge: a6850bb536d1 2950c9c5e0df
Author: Greg Kroah-Hartman <gregkh@google.com>
Date:   Sat Sep 25 14:26:55 2021 +0200

    Merge 4.19.207 into android-4.19-stable

    Changes in 4.19.207
    	ext4: fix race writing to an inline_data file while its xattrs are changing
    	xtensa: fix kconfig unmet dependency warning for HAVE_FUTEX_CMPXCHG
    	gpu: ipu-v3: Fix i.MX IPU-v3 offset calculations for (semi)planar U/V formats
    	qed: Fix the VF msix vectors flow
    	net: macb: Add a NULL check on desc_ptp
    	qede: Fix memset corruption
    	perf/x86/intel/pt: Fix mask of num_address_ranges
    	perf/x86/amd/ibs: Work around erratum #1197
    	cryptoloop: add a deprecation warning
    	ARM: 8918/2: only build return_address() if needed
    	ALSA: pcm: fix divide error in snd_pcm_lib_ioctl
    	clk: fix build warning for orphan_list
    	media: stkwebcam: fix memory leak in stk_camera_probe
    	ARM: imx: add missing clk_disable_unprepare()
    	ARM: imx: fix missing 3rd argument in macro imx_mmdc_perf_init
    	igmp: Add ip_mc_list lock in ip_check_mc_rcu
    	USB: serial: mos7720: improve OOM-handling in read_mos_reg()
    	ipv4/icmp: l3mdev: Perform icmp error route lookup on source device routing table (v2)
    	SUNRPC/nfs: Fix return value for nfs4_callback_compound()
    	crypto: talitos - reduce max key size for SEC1
    	powerpc/module64: Fix comment in R_PPC64_ENTRY handling
    	powerpc/boot: Delete unneeded .globl _zimage_start
    	net: ll_temac: Remove left-over debug message
    	mm/page_alloc: speed up the iteration of max_order
    	Revert "btrfs: compression: don't try to compress if we don't have enough pages"
    	ALSA: usb-audio: Add registration quirk for JBL Quantum 800
    	usb: host: xhci-rcar: Don't reload firmware after the completion
    	usb: mtu3: use @mult for HS isoc or intr
    	usb: mtu3: fix the wrong HS mult value
    	x86/reboot: Limit Dell Optiplex 990 quirk to early BIOS versions
    	PCI: Call Max Payload Size-related fixup quirks early
    	locking/mutex: Fix HANDOFF condition
    	regmap: fix the offset of register error log
    	crypto: mxs-dcp - Check for DMA mapping errors
    	sched/deadline: Fix reset_on_fork reporting of DL tasks
    	power: supply: axp288_fuel_gauge: Report register-address on readb / writeb errors
    	crypto: omap-sham - clear dma flags only after omap_sham_update_dma_stop()
    	sched/deadline: Fix missing clock update in migrate_task_rq_dl()
    	hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns()
    	udf: Check LVID earlier
    	isofs: joliet: Fix iocharset=utf8 mount option
    	bcache: add proper error unwinding in bcache_device_init
    	nvme-rdma: don't update queue count when failing to set io queues
    	power: supply: max17042_battery: fix typo in MAx17042_TOFF
    	s390/cio: add dev_busid sysfs entry for each subchannel
    	libata: fix ata_host_start()
    	crypto: qat - do not ignore errors from enable_vf2pf_comms()
    	crypto: qat - handle both source of interrupt in VF ISR
    	crypto: qat - fix reuse of completion variable
    	crypto: qat - fix naming for init/shutdown VF to PF notifications
    	crypto: qat - do not export adf_iov_putmsg()
    	fcntl: fix potential deadlock for &fasync_struct.fa_lock
    	udf_get_extendedattr() had no boundary checks.
    	m68k: emu: Fix invalid free in nfeth_cleanup()
    	spi: spi-fsl-dspi: Fix issue with uninitialized dma_slave_config
    	spi: spi-pic32: Fix issue with uninitialized dma_slave_config
    	lib/mpi: use kcalloc in mpi_resize
    	clocksource/drivers/sh_cmt: Fix wrong setting if don't request IRQ for clock source channel
    	crypto: qat - use proper type for vf_mask
    	certs: Trigger creation of RSA module signing key if it's not an RSA key
    	spi: sprd: Fix the wrong WDG_LOAD_VAL
    	media: TDA1997x: enable EDID support
    	soc: rockchip: ROCKCHIP_GRF should not default to y, unconditionally
    	media: dvb-usb: fix uninit-value in dvb_usb_adapter_dvb_init
    	media: dvb-usb: fix uninit-value in vp702x_read_mac_addr
    	media: go7007: remove redundant initialization
    	Bluetooth: sco: prevent information leak in sco_conn_defer_accept()
    	tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos
    	net: cipso: fix warnings in netlbl_cipsov4_add_std
    	i2c: highlander: add IRQ check
    	media: em28xx-input: fix refcount bug in em28xx_usb_disconnect
    	media: venus: venc: Fix potential null pointer dereference on pointer fmt
    	PCI: PM: Avoid forcing PCI_D0 for wakeup reasons inconsistently
    	PCI: PM: Enable PME if it can be signaled from D3cold
    	soc: qcom: smsm: Fix missed interrupts if state changes while masked
    	Bluetooth: increase BTNAMSIZ to 21 chars to fix potential buffer overflow
    	drm/msm/dpu: make dpu_hw_ctl_clear_all_blendstages clear necessary LMs
    	arm64: dts: exynos: correct GIC CPU interfaces address range on Exynos7
    	Bluetooth: fix repeated calls to sco_sock_kill
    	drm/msm/dsi: Fix some reference counted resource leaks
    	usb: gadget: udc: at91: add IRQ check
    	usb: phy: fsl-usb: add IRQ check
    	usb: phy: twl6030: add IRQ checks
    	Bluetooth: Move shutdown callback before flushing tx and rx queue
    	usb: host: ohci-tmio: add IRQ check
    	usb: phy: tahvo: add IRQ check
    	mac80211: Fix insufficient headroom issue for AMSDU
    	usb: gadget: mv_u3d: request_irq() after initializing UDC
    	Bluetooth: add timeout sanity check to hci_inquiry
    	i2c: iop3xx: fix deferred probing
    	i2c: s3c2410: fix IRQ check
    	mmc: dw_mmc: Fix issue with uninitialized dma_slave_config
    	mmc: moxart: Fix issue with uninitialized dma_slave_config
    	CIFS: Fix a potencially linear read overflow
    	i2c: mt65xx: fix IRQ check
    	usb: ehci-orion: Handle errors of clk_prepare_enable() in probe
    	usb: bdc: Fix an error handling path in 'bdc_probe()' when no suitable DMA config is available
    	tty: serial: fsl_lpuart: fix the wrong mapbase value
    	ath6kl: wmi: fix an error code in ath6kl_wmi_sync_point()
    	bcma: Fix memory leak for internally-handled cores
    	ipv4: make exception cache less predictible
    	net: sched: Fix qdisc_rate_table refcount leak when get tcf_block failed
    	net: qualcomm: fix QCA7000 checksum handling
    	ipv4: fix endianness issue in inet_rtm_getroute_build_skb()
    	netns: protect netns ID lookups with RCU
    	fscrypt: add fscrypt_symlink_getattr() for computing st_size
    	ext4: report correct st_size for encrypted symlinks
    	f2fs: report correct st_size for encrypted symlinks
    	ubifs: report correct st_size for encrypted symlinks
    	tty: Fix data race between tiocsti() and flush_to_ldisc()
    	x86/resctrl: Fix a maybe-uninitialized build warning treated as error
    	KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted
    	IMA: remove -Wmissing-prototypes warning
    	IMA: remove the dependency on CRYPTO_MD5
    	fbmem: don't allow too huge resolutions
    	backlight: pwm_bl: Improve bootloader/kernel device handover
    	clk: kirkwood: Fix a clocking boot regression
    	rtc: tps65910: Correct driver module alias
    	btrfs: reset replace target device to allocation state on close
    	blk-zoned: allow zone management send operations without CAP_SYS_ADMIN
    	blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN
    	PCI/MSI: Skip masking MSI-X on Xen PV
    	powerpc/perf/hv-gpci: Fix counter value parsing
    	xen: fix setting of max_pfn in shared_info
    	include/linux/list.h: add a macro to test if entry is pointing to the head
    	9p/xen: Fix end of loop tests for list_for_each_entry
    	bpf/verifier: per-register parent pointers
    	bpf: correct slot_type marking logic to allow more stack slot sharing
    	bpf: Support variable offset stack access from helpers
    	bpf: Reject indirect var_off stack access in raw mode
    	bpf: Reject indirect var_off stack access in unpriv mode
    	bpf: Sanity check max value for var_off stack access
    	selftests/bpf: Test variable offset stack access
    	bpf: track spill/fill of constants
    	selftests/bpf: fix tests due to const spill/fill
    	bpf: Introduce BPF nospec instruction for mitigating Spectre v4
    	bpf: Fix leakage due to insufficient speculative store bypass mitigation
    	bpf: verifier: Allocate idmap scratch in verifier env
    	bpf: Fix pointer arithmetic mask tightening under state pruning
    	tools/thermal/tmon: Add cross compiling support
    	soc: aspeed: lpc-ctrl: Fix boundary check for mmap
    	arm64: head: avoid over-mapping in map_memory
    	crypto: public_key: fix overflow during implicit conversion
    	block: bfq: fix bfq_set_next_ioprio_data()
    	power: supply: max17042: handle fails of reading status register
    	dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()
    	VMCI: fix NULL pointer dereference when unmapping queue pair
    	media: uvc: don't do DMA on stack
    	media: rc-loopback: return number of emitters rather than error
    	libata: add ATA_HORKAGE_NO_NCQ_TRIM for Samsung 860 and 870 SSDs
    	ARM: 9105/1: atags_to_fdt: don't warn about stack size
    	PCI: Restrict ASMedia ASM1062 SATA Max Payload Size Supported
    	PCI: Return ~0 data on pciconfig_read() CAP_SYS_ADMIN failure
    	PCI: xilinx-nwl: Enable the clock through CCF
    	PCI: aardvark: Increase polling delay to 1.5s while waiting for PIO response
    	PCI: aardvark: Fix masking and unmasking legacy INTx interrupts
    	HID: input: do not report stylus battery state as "full"
    	RDMA/iwcm: Release resources if iw_cm module initialization fails
    	docs: Fix infiniband uverbs minor number
    	pinctrl: samsung: Fix pinctrl bank pin count
    	vfio: Use config not menuconfig for VFIO_NOIOMMU
    	powerpc/stacktrace: Include linux/delay.h
    	openrisc: don't printk() unconditionally
    	pinctrl: single: Fix error return code in pcs_parse_bits_in_pinctrl_entry()
    	scsi: qedi: Fix error codes in qedi_alloc_global_queues()
    	platform/x86: dell-smbios-wmi: Add missing kfree in error-exit from run_smbios_call
    	fscache: Fix cookie key hashing
    	f2fs: fix to account missing .skipped_gc_rwsem
    	f2fs: fix to unmap pages from userspace process in punch_hole()
    	MIPS: Malta: fix alignment of the devicetree buffer
    	userfaultfd: prevent concurrent API initialization
    	media: dib8000: rewrite the init prbs logic
    	crypto: mxs-dcp - Use sg_mapping_iter to copy data
    	PCI: Use pci_update_current_state() in pci_enable_device_flags()
    	tipc: keep the skb in rcv queue until the whole data is read
    	iio: dac: ad5624r: Fix incorrect handling of an optional regulator.
    	ARM: dts: qcom: apq8064: correct clock names
    	video: fbdev: kyro: fix a DoS bug by restricting user input
    	netlink: Deal with ESRCH error in nlmsg_notify()
    	Smack: Fix wrong semantics in smk_access_entry()
    	usb: host: fotg210: fix the endpoint's transactional opportunities calculation
    	usb: host: fotg210: fix the actual_length of an iso packet
    	usb: gadget: u_ether: fix a potential null pointer dereference
    	usb: gadget: composite: Allow bMaxPower=0 if self-powered
    	staging: board: Fix uninitialized spinlock when attaching genpd
    	tty: serial: jsm: hold port lock when reporting modem line changes
    	drm/amd/amdgpu: Update debugfs link_settings output link_rate field in hex
    	bpf/tests: Fix copy-and-paste error in double word test
    	bpf/tests: Do not PASS tests without actually testing the result
    	video: fbdev: asiliantfb: Error out if 'pixclock' equals zero
    	video: fbdev: kyro: Error out if 'pixclock' equals zero
    	video: fbdev: riva: Error out if 'pixclock' equals zero
    	ipv4: ip_output.c: Fix out-of-bounds warning in ip_copy_addrs()
    	flow_dissector: Fix out-of-bounds warnings
    	s390/jump_label: print real address in a case of a jump label bug
    	serial: 8250: Define RX trigger levels for OxSemi 950 devices
    	xtensa: ISS: don't panic in rs_init
    	hvsi: don't panic on tty_register_driver failure
    	serial: 8250_pci: make setup_port() parameters explicitly unsigned
    	staging: ks7010: Fix the initialization of the 'sleep_status' structure
    	samples: bpf: Fix tracex7 error raised on the missing argument
    	ata: sata_dwc_460ex: No need to call phy_exit() befre phy_init()
    	Bluetooth: skip invalid hci_sync_conn_complete_evt
    	bonding: 3ad: fix the concurrency between __bond_release_one() and bond_3ad_state_machine_handler()
    	ASoC: Intel: bytcr_rt5640: Move "Platform Clock" routes to the maps for the matching in-/output
    	media: imx258: Rectify mismatch of VTS value
    	media: imx258: Limit the max analogue gain to 480
    	media: v4l2-dv-timings.c: fix wrong condition in two for-loops
    	media: TDA1997x: fix tda1997x_query_dv_timings() return value
    	media: tegra-cec: Handle errors of clk_prepare_enable()
    	ARM: dts: imx53-ppd: Fix ACHC entry
    	arm64: dts: qcom: sdm660: use reg value for memory node
    	net: ethernet: stmmac: Do not use unreachable() in ipq806x_gmac_probe()
    	Bluetooth: schedule SCO timeouts with delayed_work
    	Bluetooth: avoid circular locks in sco_sock_connect
    	gpu: drm: amd: amdgpu: amdgpu_i2c: fix possible uninitialized-variable access in amdgpu_i2c_router_select_ddc_port()
    	ARM: tegra: tamonten: Fix UART pad setting
    	Bluetooth: Fix handling of LE Enhanced Connection Complete
    	serial: sh-sci: fix break handling for sysrq
    	tcp: enable data-less, empty-cookie SYN with TFO_SERVER_COOKIE_NOT_REQD
    	rpc: fix gss_svc_init cleanup on failure
    	staging: rts5208: Fix get_ms_information() heap buffer size
    	gfs2: Don't call dlm after protocol is unmounted
    	of: Don't allow __of_attached_node_sysfs() without CONFIG_SYSFS
    	mmc: sdhci-of-arasan: Check return value of non-void funtions
    	mmc: rtsx_pci: Fix long reads when clock is prescaled
    	selftests/bpf: Enlarge select() timeout for test_maps
    	mmc: core: Return correct emmc response in case of ioctl error
    	cifs: fix wrong release in sess_alloc_buffer() failed path
    	Revert "USB: xhci: fix U1/U2 handling for hardware with XHCI_INTEL_HOST quirk set"
    	usb: musb: musb_dsps: request_irq() after initializing musb
    	usbip: give back URBs for unsent unlink requests during cleanup
    	usbip:vhci_hcd USB port can get stuck in the disabled state
    	ASoC: rockchip: i2s: Fix regmap_ops hang
    	ASoC: rockchip: i2s: Fixup config for DAIFMT_DSP_A/B
    	parport: remove non-zero check on count
    	ath9k: fix OOB read ar9300_eeprom_restore_internal
    	ath9k: fix sleeping in atomic context
    	net: fix NULL pointer reference in cipso_v4_doi_free
    	net: w5100: check return value after calling platform_get_resource()
    	parisc: fix crash with signals and alloca
    	ovl: fix BUG_ON() in may_delete() when called from ovl_cleanup()
    	scsi: BusLogic: Fix missing pr_cont() use
    	scsi: qla2xxx: Sync queue idx with queue_pair_map idx
    	cpufreq: powernv: Fix init_chip_info initialization in numa=off
    	mm/hugetlb: initialize hugetlb_usage in mm_init
    	memcg: enable accounting for pids in nested pid namespaces
    	platform/chrome: cros_ec_proto: Send command again when timeout occurs
    	drm/amdgpu: Fix BUG_ON assert
    	dm thin metadata: Fix use-after-free in dm_bm_set_read_only
    	xen: reset legacy rtc flag for PV domU
    	bnx2x: Fix enabling network interfaces without VFs
    	arm64/sve: Use correct size when reinitialising SVE state
    	PM: base: power: don't try to use non-existing RTC for storing data
    	PCI: Add AMD GPU multi-function power dependencies
    	x86/mm: Fix kern_addr_valid() to cope with existing but not present entries
    	tipc: fix an use-after-free issue in tipc_recvmsg
    	net-caif: avoid user-triggerable WARN_ON(1)
    	ptp: dp83640: don't define PAGE0
    	dccp: don't duplicate ccid when cloning dccp sock
    	net/l2tp: Fix reference count leak in l2tp_udp_recv_core
    	r6040: Restore MDIO clock frequency after MAC reset
    	tipc: increase timeout in tipc_sk_enqueue()
    	perf machine: Initialize srcline string member in add_location struct
    	net/mlx5: Fix potential sleeping in atomic context
    	events: Reuse value read using READ_ONCE instead of re-reading it
    	net/af_unix: fix a data-race in unix_dgram_poll
    	net: dsa: destroy the phylink instance on any error in dsa_slave_phy_setup
    	tcp: fix tp->undo_retrans accounting in tcp_sacktag_one()
    	qed: Handle management FW error
    	ibmvnic: check failover_pending in login response
    	net: hns3: pad the short tunnel frame before sending to hardware
    	mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()
    	KVM: s390: index kvm->arch.idle_mask by vcpu_idx
    	dt-bindings: mtd: gpmc: Fix the ECC bytes vs. OOB bytes equation
    	mfd: Don't use irq_create_mapping() to resolve a mapping
    	PCI: Add ACS quirks for Cavium multi-function devices
    	net: usb: cdc_mbim: avoid altsetting toggling for Telit LN920
    	block, bfq: honor already-setup queue merges
    	ethtool: Fix an error code in cxgb2.c
    	NTB: perf: Fix an error code in perf_setup_inbuf()
    	mfd: axp20x: Update AXP288 volatile ranges
    	PCI: Fix pci_dev_str_match_path() alloc while atomic bug
    	KVM: arm64: Handle PSCI resets before userspace touches vCPU state
    	PCI: Sync __pci_register_driver() stub for CONFIG_PCI=n
    	mtd: rawnand: cafe: Fix a resource leak in the error handling path of 'cafe_nand_probe()'
    	ARC: export clear_user_page() for modules
    	net: dsa: b53: Fix calculating number of switch ports
    	netfilter: socket: icmp6: fix use-after-scope
    	fq_codel: reject silly quantum parameters
    	qlcnic: Remove redundant unlock in qlcnic_pinit_from_rom
    	ip_gre: validate csum_start only on pull
    	net: renesas: sh_eth: Fix freeing wrong tx descriptor
    	s390/bpf: Fix 64-bit subtraction of the -0x80000000 constant
    	Linux 4.19.207

    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
    Change-Id: I18108cb47ba9e95838ebe55aaabe34de345ee846

commit 2950c9c5e0df6bd34af45a5168bbee345e95eae2
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Wed Sep 22 11:48:14 2021 +0200

    Linux 4.19.207

    Link: https://lore.kernel.org/r/20210920163933.258815435@linuxfoundation.org
    Tested-by: Pavel Machek (CIP) <pavel@denx.de>
    Tested-by: Jon Hunter <jonathanh@nvidia.com>
    Tested-by: Shuah Khan <skhan@linuxfoundation.org>
    Tested-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
    Tested-by: Guenter Roeck <linux@roeck-us.net>
    Tested-by: Hulk Robot <hulkrobot@huawei.com>
    Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit e15c2fe2def24324bfdbfb7ec2837e40b2aac7fd
Author: Ilya Leoshkevich <iii@linux.ibm.com>
Date:   Tue Sep 7 13:41:16 2021 +0200

    s390/bpf: Fix 64-bit subtraction of the -0x80000000 constant

    commit 6e61dc9da0b7a0d91d57c2e20b5ea4fd2d4e7e53 upstream.

    The JIT uses agfi for subtracting constants, but -(-0x80000000) cannot
    be represented as a 32-bit signed binary integer. Fix by using algfi in
    this particular case.
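
The arithmetic issue can be illustrated in user space. The following is a minimal sketch, not the JIT code itself: the helper names `neg_fits_in_s32` and `sub_imm64` are hypothetical, and the two branches only model the roles the commit assigns to agfi (signed 32-bit immediate add) and algfi (unsigned 32-bit immediate add).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: can -imm be encoded as a signed 32-bit immediate? */
static bool neg_fits_in_s32(int32_t imm)
{
	int64_t neg = -(int64_t)imm;	/* negate in 64 bits so nothing overflows */

	return neg >= INT32_MIN && neg <= INT32_MAX;
}

/* Hypothetical model of "dst -= imm" on a 64-bit register. */
static uint64_t sub_imm64(uint64_t dst, int32_t imm)
{
	if (neg_fits_in_s32(imm))
		/* add the sign-extended negated immediate (the agfi-style path) */
		return dst + (uint64_t)(int64_t)(-imm);

	/* imm == -0x80000000: add 0x80000000 as an unsigned immediate (the algfi-style path) */
	return dst + UINT64_C(0x80000000);
}
```

Only `imm == INT32_MIN` takes the second path, because it is the one 32-bit value whose negation does not fit in a signed 32-bit immediate.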

    Reported-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
    Fixes: 054623105728 ("s390/bpf: Add s390x eBPF JIT compiler backend")
    Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
    Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
    Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 044e7097e849366dfc71cccfc5d8c8a97cb3f010
Author: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Date:   Tue Sep 7 20:29:40 2021 +0900

    net: renesas: sh_eth: Fix freeing wrong tx descriptor

    [ Upstream commit 0341d5e3d1ee2a36dd5a49b5bef2ce4ad1cfa6b4 ]

    The cur_tx counter must be incremented after the TACT bit of
    txdesc->status has been set. However, a CPU may reorder
    instructions and/or memory accesses between cur_tx and
    txdesc->status. If a TX interrupt then happens at such a
    moment, sh_eth_tx_free() may free the wrong descriptor.
    So, add wmb() before cur_tx++.
    Otherwise a NETDEV WATCHDOG timeout can occur.
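
The ordering requirement can be sketched in user space with C11 atomics. This is only an analogy under stated assumptions: `struct tx_ring`, `ring_init`, `queue_tx` and the `TD_TACT` value are hypothetical, and `atomic_thread_fence(memory_order_release)` stands in for the kernel's wmb(), which is a different primitive.

```c
#include <stdatomic.h>

#define RING_SIZE 4u
#define TD_TACT   0x80000000u	/* "descriptor active" flag; value is illustrative */

struct tx_ring {
	atomic_uint status[RING_SIZE];	/* one status word per descriptor */
	atomic_uint cur_tx;		/* producer counter */
};

static void ring_init(struct tx_ring *r)
{
	for (unsigned int i = 0; i < RING_SIZE; i++)
		atomic_init(&r->status[i], 0);
	atomic_init(&r->cur_tx, 0);
}

static void queue_tx(struct tx_ring *r)
{
	unsigned int entry =
		atomic_load_explicit(&r->cur_tx, memory_order_relaxed) % RING_SIZE;

	/* 1. mark the descriptor active */
	atomic_store_explicit(&r->status[entry], TD_TACT, memory_order_relaxed);

	/* 2. make the status store visible before the counter moves */
	atomic_thread_fence(memory_order_release);

	/* 3. only now publish the new producer index */
	atomic_fetch_add_explicit(&r->cur_tx, 1, memory_order_relaxed);
}
```

Without step 2, a consumer that observes the incremented counter could still see the old status word, which corresponds to the wrong-descriptor case the commit describes.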

    Fixes: 86a74ff21a7a ("net: sh_eth: add support for Renesas SuperH Ethernet")
    Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 774430026bd9a472d08c5d3c33351a782315771a
Author: Willem de Bruijn <willemb@google.com>
Date:   Sun Sep 5 11:21:09 2021 -0400

    ip_gre: validate csum_start only on pull

    [ Upstream commit 8a0ed250f911da31a2aef52101bc707846a800ff ]

    The GRE tunnel device can pull existing outer headers in ipgre_xmit.
    This is a rare path, apparently unique to this device. The below
    commit ensured that pulling does not move skb->data beyond csum_start.

    But it has a false positive if ip_summed is not CHECKSUM_PARTIAL and
    thus csum_start is irrelevant.

    Refine to exclude this. At the same time simplify and strengthen the
    test.

    Simplify, by moving the check next to the offending pull, making it
    more self documenting and removing an unnecessary branch from other
    code paths.

    Strengthen, by also ensuring that the transport header is correct and
    therefore the inner headers will be after skb_reset_inner_headers.
    The transport header is set to csum_start in skb_partial_csum_set.

    Link: https://lore.kernel.org/netdev/YS+h%2FtqCJJiQei+W@shredder/
    Fixes: 1d011c4803c7 ("ip_gre: add validation for csum_start")
    Reported-by: Ido Schimmel <idosch@idosch.org>
    Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
    Signed-off-by: Willem de Bruijn <willemb@google.com>
    Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 833ffc44049435ab0e9b636a671c2108acfbc72a
Author: Dinghao Liu <dinghao.liu@zju.edu.cn>
Date:   Fri Sep 3 15:35:43 2021 +0800

    qlcnic: Remove redundant unlock in qlcnic_pinit_from_rom

    [ Upstream commit 9ddbc2a00d7f63fa9748f4278643193dac985f2d ]

    Previous commit 68233c583ab4 removed the qlcnic_rom_lock() call
    in qlcnic_pinit_from_rom() but left its corresponding unlock
    call in place, which is odd. I'm not sure whether the
    lock is missing or the unlock is redundant. This bug was
    flagged by a static analysis tool; please advise.

    Fixes: 68233c583ab4 ("qlcnic: updated reset sequence")
    Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 7c113506163a1ec6157927428eddd80038d2916e
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Sep 3 15:03:43 2021 -0700

    fq_codel: reject silly quantum parameters

    [ Upstream commit c7c5e6ff533fe1f9afef7d2fa46678987a1335a7 ]

    syzbot found that forcing a big quantum attribute would crash hosts fast,
    essentially using this:

    tc qd replace dev eth0 root fq_codel quantum 4294967295

    This is because fq_codel_dequeue() would have to loop
    ~2^31 times in :

    	if (flow->deficit <= 0) {
    		flow->deficit += q->quantum;
    		list_move_tail(&flow->flowchain, &q->old_flows);
    		goto begin;
    	}

    SFQ max quantum is 2^19 (half a megabyte).
    Let's adopt a max quantum of one megabyte for FQ_CODEL.
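
A minimal user-space sketch of the bound check and of why the huge quantum is harmful. The names `QUANTUM_MAX`, `check_quantum` and `refill_once` are hypothetical and this is not the qdisc code; the cast back to `int32_t` is implementation-defined in C and is assumed to wrap as it does with common two's-complement compilers.

```c
#include <errno.h>
#include <stdint.h>

#define QUANTUM_MAX (1u << 20)	/* one megabyte, the limit named in the commit text */

/* Hypothetical validator: reject quantum values above the limit. */
static int check_quantum(uint32_t quantum)
{
	if (quantum > QUANTUM_MAX)
		return -EINVAL;
	return 0;
}

/* Models one "deficit += quantum" step with a signed deficit and an unsigned quantum. */
static int32_t refill_once(int32_t deficit, uint32_t quantum)
{
	return (int32_t)((uint32_t)deficit + quantum);
}
```

With a quantum of 4294967295, each refill step behaves like subtracting one, so a non-positive deficit needs on the order of 2^31 steps to wrap around to a positive value.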

    Fixes: 4b549a2ef4be ("fq_codel: Fair Queue Codel AQM")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit d6efada330af09253b0f81a0d836cee02192bd4f
Author: Benjamin Hesmans <benjamin.hesmans@tessares.net>
Date:   Fri Sep 3 15:23:35 2021 +0200

    netfilter: socket: icmp6: fix use-after-scope

    [ Upstream commit 730affed24bffcd1eebd5903171960f5ff9f1f22 ]

    Bug reported by KASAN:

    BUG: KASAN: use-after-scope in inet6_ehashfn (net/ipv6/inet6_hashtables.c:40)
    Call Trace:
    (...)
    inet6_ehashfn (net/ipv6/inet6_hashtables.c:40)
    (...)
    nf_sk_lookup_slow_v6 (net/ipv6/netfilter/nf_socket_ipv6.c:91
    net/ipv6/netfilter/nf_socket_ipv6.c:146)

    It seems that this bug has already been fixed by Eric Dumazet in the
    past in:
    commit 78296c97ca1f ("netfilter: xt_socket: fix a stack corruption bug")

    But a variant of the same issue has been introduced in
    commit d64d80a2cde9 ("netfilter: x_tables: don't extract flow keys on early demuxed sks in socket match")

    `daddr` and `saddr` potentially hold a reference to ipv6_var that is no
    longer in scope when the call to `nf_socket_get_sock_v6` is made.
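
The pattern can be reduced to a few lines of plain C. This sketch shows one way to avoid it, assuming hypothetical names (`struct addr6`, `first_byte`); it does not claim to reproduce the actual patch.

```c
#include <stdint.h>
#include <string.h>

struct addr6 {
	uint8_t b[16];
};

/*
 * The scratch copy is declared at function scope, so a pointer to it
 * stays valid until the function returns. Declaring `copy` inside the
 * `if` block and using `daddr` after the block would be the
 * use-after-scope pattern.
 */
static uint8_t first_byte(const struct addr6 *hdr_addr, int need_copy)
{
	const struct addr6 *daddr = hdr_addr;
	struct addr6 copy;

	if (need_copy) {
		memcpy(&copy, hdr_addr, sizeof(copy));
		daddr = &copy;	/* still valid after the block ends */
	}

	return daddr->b[0];
}
```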

    Fixes: d64d80a2cde9 ("netfilter: x_tables: don't extract flow keys on early demuxed sks in socket match")
    Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
    Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net>
    Reviewed-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit e24ffdb604179a057e9be5c9a8ed68b90a28eee5
Author: Rafał Miłecki <rafal@milecki.pl>
Date:   Thu Sep 2 10:30:50 2021 +0200

    net: dsa: b53: Fix calculating number of switch ports

    [ Upstream commit cdb067d31c0fe4cce98b9d15f1f2ef525acaa094 ]

    It isn't true that CPU port is always the last one. Switches BCM5301x
    have 9 ports (port 6 being inactive) and they use port 5 as CPU by
    default (depending on design some other may be CPU ports too).

    A more reliable way of determining number of ports is to check for the
    last set bit in the "enabled_ports" bitfield.
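
As a rough illustration, the port count can be derived from the mask with a "find last set" helper. `fls32` below is a portable stand-in written for this sketch, and the mask 0x1bf (ports 0 to 5, 7 and 8 enabled, port 6 inactive) is an illustrative value, not one taken from the driver.

```c
#include <stdint.h>

/* 1-based position of the highest set bit, or 0 if no bit is set. */
static unsigned int fls32(uint32_t x)
{
	unsigned int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}
```

For the example mask, `fls32(0x1bf)` yields 9, matching a switch with nine ports even though one bit in the middle is clear.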

    This fixes b53 internal state, it will allow providing accurate info to
    the DSA and is required to fix BCM5301x support.

    Fixes: 967dd82ffc52 ("net: dsa: b53: Add support for Broadcom RoboSwitch")
    Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
    Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit e91ae87f7e3f4e3e9b4700c498a0b0a4ce82cf1d
Author: Randy Dunlap <rdunlap@infradead.org>
Date:   Mon Aug 16 14:05:33 2021 -0700

    ARC: export clear_user_page() for modules

    [ Upstream commit 6b5ff0405e4190f23780362ea324b250bc495683 ]

    0day bot reports a build error:
      ERROR: modpost: "clear_user_page" [drivers/media/v4l2-core/videobuf-dma-sg.ko] undefined!
    so export it in arch/arc/ to fix the build error.

    In most ARCHes, clear_user_page() is a macro. OTOH, in a few
    ARCHes it is a function and needs to be exported.
    PowerPC exported it in 2004. It looks like nds32 and nios2
    still need to have it exported.

    Fixes: 4102b53392d63 ("ARC: [mm] Aliasing VIPT dcache support 2/4")
    Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
    Reported-by: kernel test robot <lkp@intel.com>
    Cc: Guenter Roeck <linux@roeck-us.net>
    Cc: linux-snps-arc@lists.infradead.org
    Signed-off-by: Vineet Gupta <vgupta@kernel.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 4a3dd774453aae49c8987decb0ab86f8eb349cc4
Author: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Date:   Sat Aug 21 09:58:45 2021 +0200

    mtd: rawnand: cafe: Fix a resource leak in the error handling path of 'cafe_nand_probe()'

    [ Upstream commit 6b430c7595e4eb95fae8fb54adc3c3ce002e75ae ]

    A successful 'init_rs_non_canonical()' call should be balanced by a
    corresponding 'free_rs()' call in the error handling path of the probe, as
    already done in the remove function.

    Update the error handling path accordingly.

    Fixes: 8c61b7a7f4d4 ("[MTD] [NAND] Use rslib for CAFÉ ECC")
    Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
    Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
    Link: https://lore.kernel.org/linux-mtd/fd313d3fb787458bcc73189e349f481133a2cdc9.1629532640.git.christophe.jaillet@wanadoo.fr
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 6bdadfff347e42b6da70a9c77bb443479781c1f3
Author: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Date:   Fri Aug 13 18:36:19 2021 +0300

    PCI: Sync __pci_register_driver() stub for CONFIG_PCI=n

    [ Upstream commit 817f9916a6e96ae43acdd4e75459ef4f92d96eb1 ]

    The CONFIG_PCI=y case got a new parameter a long time ago.  Sync the stub as
    well.

    [bhelgaas: add parameter names]
    Fixes: 725522b5453d ("PCI: add the sysfs driver name to all modules")
    Link: https://lore.kernel.org/r/20210813153619.89574-1-andriy.shevchenko@linux.intel.com
    Reported-by: kernel test robot <lkp@intel.com>
    Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit b6e5cd323d1d1537f450ab0d3e06811a868267c2
Author: Oliver Upton <oupton@google.com>
Date:   Wed Aug 18 20:21:31 2021 +0000

    KVM: arm64: Handle PSCI resets before userspace touches vCPU state

    [ Upstream commit 6826c6849b46aaa91300201213701eb861af4ba0 ]

    The CPU_ON PSCI call takes a payload that KVM uses to configure a
    destination vCPU to run. This payload is non-architectural state and not
    exposed through any existing UAPI. Effectively, we have a race between
    CPU_ON and userspace saving/restoring a guest: if the target vCPU isn't
    run again before the VMM saves its state, the requested PC and context
    ID are lost. When restored, the target vCPU will be runnable and start
    executing at its old PC.

    We can avoid this race by making sure the reset payload is serviced
    before userspace can access a vCPU's state.

    Fixes: 358b28f09f0a ("arm/arm64: KVM: Allow a VCPU to fully reset itself")
    Signed-off-by: Oliver Upton <oupton@google.com>
    Signed-off-by: Marc Zyngier <maz@kernel.org>
    Link: https://lore.kernel.org/r/20210818202133.1106786-3-oupton@google.com
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 1a091bfd11e61032b6192cf2a1ebb259889f28b3
Author: Dan Carpenter <dan.carpenter@oracle.com>
Date:   Thu Aug 12 10:00:04 2021 +0300

    PCI: Fix pci_dev_str_match_path() alloc while atomic bug

    [ Upstream commit 7eb6ea4148579b85540a41d57bcec315b8af8ff8 ]

    pci_dev_str_match_path() is often called with a spinlock held so the
    allocation has to be atomic.  The call tree is:

      pci_specified_resource_alignment() <-- takes spin_lock();
        pci_dev_str_match()
          pci_dev_str_match_path()

    Fixes: 45db33709ccc ("PCI: Allow specifying devices using a base bus and path of devfns")
    Link: https://lore.kernel.org/r/20210812070004.GC31863@kili
    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 91264ae7fceb7e30172ab7dd27e3e41751cb83b3
Author: Hans de Goede <hdegoede@redhat.com>
Date:   Tue Jun 29 19:12:39 2021 +0200

    mfd: axp20x: Update AXP288 volatile ranges

    [ Upstream commit f949a9ebce7a18005266b859a17f10c891bb13d7 ]

    On Cherry Trail devices with an AXP288 PMIC the external SD-card slot
    used the AXP's DLDO2 as card-voltage and either DLDO3 or GPIO1LDO
    (GPIO1 pin in low noise LDO mode) as signal-voltage.

    These regulators are turned on/off and in case of the signal-voltage
    also have their output-voltage changed by the _PS0 and _PS3 power-
    management ACPI methods on the MMC-controllers ACPI fwnode as well as
    by the _DSM ACPI method for changing the signal voltage.

    The AML code implementing these methods is directly accessing the
    PMIC through ACPI I2C OpRegion accesses, instead of using the special
    PMIC OpRegion handled by drivers/acpi/pmic/intel_pmic_xpower.c .

    This means that the contents of the involved PMIC registers can change
    without the change being made through the regmap interface, so regmap
    should not cache the contents of these registers.

    Mark the regulator power on/off, the regulator voltage control and the
    GPIO1 control registers as volatile, to avoid regmap caching them.

    Specifically this fixes an issue on some models where the i915 driver
    toggles another LDO using the same on/off register on/off through
    MIPI sequences (through intel_soc_pmic_exec_mipi_pmic_seq_element())
    which then writes back a cached on/off register-value where the
    card-voltage is off causing the external sdcard slot to stop working
    when the screen goes blank, or comes back on again.

    The regulator register-range now marked volatile also includes the
    buck regulator control registers. This is done on purpose: these are
    normally not touched by the AML code, but they are updated directly
    by the SoC's PUNIT, which means that they may also change without going
    through regmap.

    Note the AXP288 PMIC is only used on Bay- and Cherry-Trail platforms,
    so even though this is an ACPI specific problem there is no need to
    make the new volatile ranges conditional since these platforms always
    use ACPI.

    Fixes: dc91c3b6fe66 ("mfd: axp20x: Mark AXP20X_VBUS_IPSOUT_MGMT as volatile")
    Fixes: cd53216625a0 ("mfd: axp20x: Fix axp288 volatile ranges")
    Reported-and-tested-by: Clamshell <clamfly@163.com>
    Signed-off-by: Hans de Goede <hdegoede@redhat.com>
    Reviewed-by: Chen-Yu Tsai <wens@csie.org>
    Signed-off-by: Lee Jones <lee.jones@linaro.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit a3e968b65cd5bc1578438a5436339640f1b79342
Author: Yang Li <yang.lee@linux.alibaba.com>
Date:   Mon Jun 7 16:40:36 2021 +0800

    NTB: perf: Fix an error code in perf_setup_inbuf()

    [ Upstream commit 0097ae5f7af5684f961a5f803ff7ad3e6f933668 ]

    When the function IS_ALIGNED() returns false, the value of ret is 0.
    So, we set ret to -EINVAL to indicate this error.
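
The "missing error code" pattern looks roughly like this in isolation. The function `setup_inbuf` and the macro `IS_ALIGNED_TO` are hypothetical simplifications for this sketch, not the ntb_perf code.

```c
#include <errno.h>
#include <stdint.h>

/* True if x is a multiple of a; a must be a power of two. */
#define IS_ALIGNED_TO(x, a) (((x) & ((a) - 1)) == 0)

static int setup_inbuf(uintptr_t addr, uintptr_t align)
{
	int ret = 0;

	if (!IS_ALIGNED_TO(addr, align)) {
		ret = -EINVAL;	/* without this assignment the error path returns 0 */
		goto err_free;
	}

	return 0;

err_free:
	/* resource cleanup would go here */
	return ret;
}
```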

    Clean up smatch warning:
    drivers/ntb/test/ntb_perf.c:602 perf_setup_inbuf() warn: missing error
    code 'ret'.

    Reported-by: Abaci Robot <abaci@linux.alibaba.com>
    Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
    Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
    Signed-off-by: Jon Mason <jdmason@kudzu.us>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 86e6540a230281cc913e9e2a39953360b9924859
Author: Yang Li <yang.lee@linux.alibaba.com>
Date:   Fri Sep 3 14:42:33 2021 +0800

    ethtool: Fix an error code in cxgb2.c

    [ Upstream commit 7db8263a12155c7ae4ad97e850f1e499c73765fc ]

    When adapter->registered_device_map is NULL, the value of err is
    uncertain, so we set err to -EINVAL to avoid ambiguity.

    Clean up smatch warning:
    drivers/net/ethernet/chelsio/cxgb/cxgb2.c:1114 init_one() warn: missing
    error code 'err'

    Reported-by: Abaci Robot <abaci@linux.alibaba.com>
    Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 2c1b1848357dc69f62ce3630b850e6680b87854b
Author: Paolo Valente <paolo.valente@linaro.org>
Date:   Mon Aug 2 16:13:52 2021 +0200

    block, bfq: honor already-setup queue merges

    [ Upstream commit 2d52c58b9c9bdae0ca3df6a1eab5745ab3f7d80b ]

    The function bfq_setup_merge prepares the merging between two
    bfq_queues, say bfqq and new_bfqq. To this goal, it assigns
    bfqq->new_bfqq = new_bfqq. Then, each time some I/O for bfqq arrives,
    the process that generated that I/O is disassociated from bfqq and
    associated with new_bfqq (merging is actually a redirection). In this
    respect, bfq_setup_merge increases new_bfqq->ref in advance, adding
    the number of processes that are expected to be associated with
    new_bfqq.

    Unfortunately, the stable-merging mechanism interferes with this
    setup. After bfqq->new_bfqq has been set by bfq_setup_merge, and
    before all the expected processes have been associated with
    bfqq->new_bfqq, bfqq may happen to be stably merged with a different
    queue than the current bfqq->new_bfqq. In this case, bfqq->new_bfqq
    gets changed. So, some of the processes that have been already
    accounted for in the ref counter of the previous new_bfqq will not be
    associated with that queue.  This creates an imbalance, because those
    references will never be decremented.

    This commit fixes this issue by reestablishing the previous, natural
    behaviour: once bfqq->new_bfqq has been set, it will not be changed
    until all expected redirections have occurred.

    Signed-off-by: Davide Zini <davidezini2@gmail.com>
    Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
    Link: https://lore.kernel.org/r/20210802141352.74353-2-paolo.valente@linaro.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 052447e9d4aaf6cba23e96b26b3dec79cf63a1bd
Author: Daniele Palmas <dnlplm@gmail.com>
Date:   Thu Sep 2 12:51:22 2021 +0200

    net: usb: cdc_mbim: avoid altsetting toggling for Telit LN920

    [ Upstream commit aabbdc67f3485b5db27ab4eba01e5fbf1ffea62c ]

    Add quirk CDC_MBIM_FLAG_AVOID_ALTSETTING_TOGGLE for Telit LN920
    0x1061 composition in order to avoid bind error.

    Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 9859de9fd373ef3beaf7d845e963920cf99293a6
Author: George Cherian <george.cherian@marvell.com>
Date:   Tue Aug 10 17:54:25 2021 +0530

    PCI: Add ACS quirks for Cavium multi-function devices

    [ Upstream commit 32837d8a8f63eb95dcb9cd005524a27f06478832 ]

    Some Cavium endpoints are implemented as multi-function devices without ACS
    capability, but they actually don't support peer-to-peer transactions.

    Add ACS quirks to declare DMA isolation for the following devices:

      - BGX device found on Octeon-TX (8xxx)
      - CGX device found on Octeon-TX2 (9xxx)
      - RPM device found on Octeon-TX3 (10xxx)

    Link: https://lore.kernel.org/r/20210810122425.1115156-1-george.cherian@marvell.com
    Signed-off-by: George Cherian <george.cherian@marvell.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit d3dc49079ef7d2bfea436d1124ed92802214f1c2
Author: Marc Zyngier <maz@kernel.org>
Date:   Sun Jul 25 19:07:54 2021 +0100

    mfd: Don't use irq_create_mapping() to resolve a mapping

    [ Upstream commit 9ff80e2de36d0554e3a6da18a171719fe8663c17 ]

    Although irq_create_mapping() is able to deal with duplicate
    mappings, it really isn't supposed to be a substitute for
    irq_find_mapping(), and can result in allocations that take place
    in atomic context if the mapping didn't exist.

    Fix the handful of MFD drivers that use irq_create_mapping() in
    interrupt context by using irq_find_mapping() instead.

    Cc: Linus Walleij <linus.walleij@linaro.org>
    Cc: Lee Jones <lee.jones@linaro.org>
    Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
    Cc: Alexandre Torgue <alexandre.torgue@foss.st.com>
    Signed-off-by: Marc Zyngier <maz@kernel.org>
    Signed-off-by: Lee Jones <lee.jones@linaro.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 05e4fdd51a934d4bf51b368e5c00dd35f744637e
Author: Miquel Raynal <miquel.raynal@bootlin.com>
Date:   Thu Jun 10 16:39:45 2021 +0200

    dt-bindings: mtd: gpmc: Fix the ECC bytes vs. OOB bytes equation

    [ Upstream commit 778cb8e39f6ec252be50fc3850d66f3dcbd5dd5a ]

    "PAGESIZE / 512" is the number of ECC chunks.
    "ECC_BYTES" is the number of bytes needed to store a single ECC code.
    "2" is the space reserved by the bad block marker.

    "2 + (PAGESIZE / 512) * ECC_BYTES" should of course be lower or equal
    to the total number of OOB bytes, otherwise it won't fit.

    Fix the equation by substituting s/>=/<=/.
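
The corrected inequality can be written as a small check. The function name `ecc_fits_in_oob` is hypothetical, and the page, ECC and OOB sizes used when calling it are illustrative values rather than ones from the binding.

```c
#include <stdbool.h>

/*
 * 2 bytes for the bad block marker, plus ECC_BYTES for each 512-byte
 * chunk of the page, must fit in the OOB area.
 */
static bool ecc_fits_in_oob(unsigned int pagesize, unsigned int ecc_bytes,
			    unsigned int oobsize)
{
	return 2 + (pagesize / 512) * ecc_bytes <= oobsize;
}
```

For a 2048-byte page with a 64-byte OOB area, 14 ECC bytes per chunk need 2 + 4 * 14 = 58 bytes and fit, while 26 ECC bytes per chunk need 106 bytes and do not.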

    Suggested-by: Ryan J. Barnett <ryan.barnett@collins.com>
    Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
    Acked-by: Rob Herring <robh@kernel.org>
    Link: https://lore.kernel.org/linux-mtd/20210610143945.3504781-1-miquel.raynal@bootlin.com
    Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 3d95bdee23e92568710530f1eca161ccd63db39f
Author: Halil Pasic <pasic@linux.ibm.com>
Date:   Fri Aug 27 14:54:29 2021 +0200

    KVM: s390: index kvm->arch.idle_mask by vcpu_idx

    commit a3e03bc1368c1bc16e19b001fc96dc7430573cc8 upstream.

    While in practice vcpu->vcpu_idx == vcpu->vcpu_id is often true, it may
    not always be, and we must not rely on this. Reason is that KVM decides
    the vcpu_idx, userspace decides the vcpu_id, thus the two might not
    match.

    Currently kvm->arch.idle_mask is indexed by vcpu_id, which implies
    that code like
    	for_each_set_bit(vcpu_id, kvm->arch.idle_mask, online_vcpus) {
    		vcpu = kvm_get_vcpu(kvm, vcpu_id);
    		do_stuff(vcpu);
    	}
    is not legit. The reason is that kvm_get_vcpu expects a vcpu_idx, not a
    vcpu_id.  The trouble is, we do actually use kvm->arch.idle_mask like
    this. To fix this problem we have two options. Either use
    kvm_get_vcpu_by_id(vcpu_id), which would loop to find the right vcpu_id,
    or switch to indexing via vcpu_idx. The latter is preferable for obvious
    reasons.
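
The difference between the two lookups can be shown with a toy model. The structures and the functions `get_vcpu` and `get_vcpu_by_id` below are simplified, hypothetical stand-ins that only mirror the roles of kvm_get_vcpu() and kvm_get_vcpu_by_id() described above.

```c
#include <stddef.h>

#define MAX_VCPUS 4

struct vcpu {
	int vcpu_idx;	/* position in the array, chosen by the hypervisor */
	int vcpu_id;	/* identifier chosen by userspace */
};

struct vm {
	struct vcpu vcpus[MAX_VCPUS];
	int online_vcpus;
};

/* Constant-time lookup by array index. */
static struct vcpu *get_vcpu(struct vm *vm, int idx)
{
	if (idx < 0 || idx >= vm->online_vcpus)
		return NULL;
	return &vm->vcpus[idx];
}

/* Linear search for a userspace-chosen id. */
static struct vcpu *get_vcpu_by_id(struct vm *vm, int id)
{
	for (int i = 0; i < vm->online_vcpus; i++)
		if (vm->vcpus[i].vcpu_id == id)
			return &vm->vcpus[i];
	return NULL;
}
```

Passing an id where an index is expected either finds the wrong vCPU or nothing at all, which is why a bitmap that feeds the index-based lookup has to be indexed by vcpu_idx.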

    Let us switch from indexing kvm->arch.idle_mask by vcpu_id to
    indexing it by vcpu_idx.  To keep gisa_int.kicked_mask indexed by the
    same index as idle_mask, let's make the same change for it as well.

    Fixes: 1ee0bc559dc3 ("KVM: s390: get rid of local_int array")
    Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
    Reviewed-by: Christian Bornträger <borntraeger@de.ibm.com>
    Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: <stable@vger.kernel.org> # 3.15+
    Link: https://lore.kernel.org/r/20210827125429.1912577-1-pasic@linux.ibm.com
    [borntraeger@de.ibm.com]: change  idle mask, remove kicked_mask
    Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit c48402015e02901ff8b1fac5c112b02546364bb6
Author: David Hildenbrand <david@redhat.com>
Date:   Tue Sep 7 19:54:59 2021 -0700

    mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()

    commit 7cf209ba8a86410939a24cb1aeb279479a7e0ca6 upstream.

    Patch series "mm/memory_hotplug: preparatory patches for new online policy and memory"

    These are all cleanups and one fix previously sent as part of [1]:
    [PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory
    groups.

    These patches make sense even without the other series, therefore I pulled
    them out to make the other series easier to digest.

    [1] https://lkml.kernel.org/r/20210607195430.48228-1-david@redhat.com

    This patch (of 4):

    Checkpatch complained on a follow-up patch that we are using "unsigned"
    here, which defaults to "unsigned int" and checkpatch is correct.

    As we will search for a fitting zone using the wrong pfn, we might end
    up onlining memory to one of the special kernel zones, such as ZONE_DMA,
    which can end badly as the onlined memory does not satisfy properties of
    these zones.

    Use "unsigned long" instead, just as we do in other places when handling
    PFNs.  This can bite us once we have physical addresses in the range of
    multiple TB.
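
The truncation is easy to reproduce in user space. In this sketch, `uint32_t` models "unsigned int" and `uint64_t` models "unsigned long" on a 64-bit kernel; the function names and the 4 KiB page size are assumptions made for illustration.

```c
#include <stdint.h>

#define PAGE_SHIFT 12	/* 4 KiB pages, an assumption for this sketch */

/* PFN stored in a 32-bit type: high bits are silently dropped. */
static uint32_t pfn_as_unsigned_int(uint64_t phys)
{
	return (uint32_t)(phys >> PAGE_SHIFT);
}

/* PFN stored in a 64-bit type: the full value survives. */
static uint64_t pfn_as_unsigned_long(uint64_t phys)
{
	return phys >> PAGE_SHIFT;
}
```

A physical address of 16 TiB (2^44) has PFN 2^32, which becomes 0 in the 32-bit variant, so a zone search based on it would start from the wrong page frame.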

    Link: https://lkml.kernel.org/r/20210712124052.26491-2-david@redhat.com
    Fixes: e5e689302633 ("mm, memory_hotplug: display allowed zones in the preferred ordering")
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
    Cc: "Michael S. Tsirkin" <mst@redhat.com>
    Cc: Jason Wang <jasowang@redhat.com>
    Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
    Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
    Cc: Len Brown <lenb@kernel.org>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: virtualization@lists.linux-foundation.org
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
    Cc: Anton Blanchard <anton@ozlabs.org>
    Cc: Ard Biesheuvel <ardb@kernel.org>
    Cc: Baoquan He <bhe@redhat.com>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Christian Borntraeger <borntraeger@de.ibm.com>
    Cc: Christophe Leroy <christophe.leroy@c-s.fr>
    Cc: Dave Jiang <dave.jiang@intel.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jia He <justin.he@arm.com>
    Cc: Joe Perches <joe@perches.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Laurent Dufour <ldufour@linux.ibm.com>
    Cc: Michel Lespinasse <michel@lespinasse.org>
    Cc: Nathan Lynch <nathanl@linux.ibm.com>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Pierre Morel <pmorel@linux.ibm.com>
    Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
    Cc: Rich Felker <dalias@libc.org>
    Cc: Scott Cheloha <cheloha@linux.ibm.com>
    Cc: Sergei Trofimovich <slyfox@gentoo.org>
    Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Vishal Verma <vishal.l.verma@intel.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit e365c0137ac1bc78bfa95bd3220a89d2f5d38e89
Author: Yufeng Mo <moyufeng@huawei.com>
Date:   Mon Sep 13 21:08:21 2021 +0800

    net: hns3: pad the short tunnel frame before sending to hardware

    commit d18e81183b1cb9c309266cbbce9acd3e0c528d04 upstream.

    The hardware cannot handle tunnel frames shorter than 65 bytes,
    which causes the VLAN tag to go missing. Pad tunnel frames to
    65 bytes before sending them to hardware to fix this bug.
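
A minimal user-space sketch of the padding idea described above. It is not the driver's code: the buffer-based helper `pad_short_tunnel_frame()` and the `TUNNEL_MIN_LEN` macro are hypothetical names, and only the 65-byte threshold comes from the commit text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimum tunnel frame length; the value 65 is taken from the commit text. */
#define TUNNEL_MIN_LEN 65

/*
 * Zero-pad a frame of len bytes, held in a buffer of cap bytes, up to
 * TUNNEL_MIN_LEN. Returns the new length, or 0 if the buffer is too
 * small to pad in place. (Hypothetical helper, for illustration only.)
 */
static size_t pad_short_tunnel_frame(unsigned char *buf, size_t len, size_t cap)
{
	if (len >= TUNNEL_MIN_LEN)
		return len;             /* long enough, nothing to do */
	if (cap < TUNNEL_MIN_LEN)
		return 0;               /* cannot pad in place */
	memset(buf + len, 0, TUNNEL_MIN_LEN - len);
	return TUNNEL_MIN_LEN;
}
```

Frames that are already long enough pass through unchanged, so the extra cost is only paid on the short-frame path.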

    Fixes: 3db084d28dc0 ("net: hns3: Fix for vxlan tx checksum bug")
    Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
    Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 6323a3ec9058ae951cc968d378e5ec6eec92bc4b
Author: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Date:   Wed Sep 8 09:58:20 2021 -0700

    ibmvnic: check failover_pending in login response

    commit 273c29e944bda9a20a30c26cfc34c9a3f363280b upstream.

    If a failover occurs before a login response is received, the login
    response buffer may be undefined. Check that there was no failover
    before accessing the login response buffer.

    Fixes: 032c5e82847a ("Driver for IBM System i/p VNIC protocol")
    Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 533ef9965857cbca7a16b0440f33d109d4e6d891
Author: Shai Malin <smalin@marvell.com>
Date:   Fri Sep 10 11:33:56 2021 +0300

    qed: Handle management FW error

    commit 20e100f52730cd0db609e559799c1712b5f27582 upstream.

    Handle MFW (management FW) error response in order to avoid a crash
    during recovery flows.

    Changes from v1:
    - Add "Fixes" tag.

    Fixes: 5e7ba042fd05 ("qed: Fix reading stale configuration information")
    Signed-off-by: Ariel Elior <aelior@marvell.com>
    Signed-off-by: Shai Malin <smalin@marvell.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit dfefcc46354530c3ec1d12db0e16c740548c6229
Author: zhenggy <zhenggy@chinatelecom.cn>
Date:   Tue Sep 14 09:51:15 2021 +0800

    tcp: fix tp->undo_retrans accounting in tcp_sacktag_one()

    commit 4f884f3962767877d7aabbc1ec124d2c307a4257 upstream.

    Commit 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit
    time") may retransmit a multi-segment TSO/GSO packet directly, without
    splitting it. Since that commit, we can no longer assume that a
    retransmitted packet is a single segment.

    This patch fixes the tp->undo_retrans accounting in tcp_sacktag_one()
    to use the actual segment count (pcount) of the retransmitted packet.

    Before that commit (10d3be569243), the assumption underlying the
    tp->undo_retrans-- seems correct.
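
The accounting change can be sketched in user space as follows. `struct toy_tcp_sock` and `toy_sack_retransmitted()` are hypothetical stand-ins, and clamping the counter at zero is an assumption of this sketch rather than something stated in the commit text.

```c
#include <assert.h>

/* Hypothetical stand-in for the relevant part of struct tcp_sock. */
struct toy_tcp_sock {
	int undo_retrans;       /* retransmitted segments still outstanding */
};

/*
 * Account for one SACKed retransmitted packet carrying pcount segments.
 * Subtracting pcount, rather than 1, keeps the counter in units of
 * segments. Clamping at zero is an assumption of this sketch.
 */
static void toy_sack_retransmitted(struct toy_tcp_sock *tp, int pcount)
{
	int left = tp->undo_retrans - pcount;

	tp->undo_retrans = left > 0 ? left : 0;
}
```

With a plain decrement, a three-segment retransmission would be counted as one segment and the counter would never drain.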

    Fixes: 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit time")
    Signed-off-by: zhenggy <zhenggy@chinatelecom.cn>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Yuchung Cheng <ycheng@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 948fa0bac0840b67d7ab1b748b1eb7d9d6c032b8
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Tue Sep 14 16:43:31 2021 +0300

    net: dsa: destroy the phylink instance on any error in dsa_slave_phy_setup

    commit 6a52e73368038f47f6618623d75061dc263b26ae upstream.

    DSA supports connecting to a phy-handle, and has a fallback to a non-OF
    based method of connecting to an internal PHY on the switch's own MDIO
    bus, if no phy-handle and no fixed-link nodes were present.

    The -ENODEV error code from the first attempt (phylink_of_phy_connect)
    is what triggers the second attempt (phylink_connect_phy).

    However, when the first attempt returns an error code other than
    -ENODEV, the calls to phylink_create and phylink_destroy are
    unbalanced by the time we exit the function, and the phylink
    instance is leaked.

    There are many other error codes that can be returned by
    phylink_of_phy_connect. For example, phylink_validate returns -EINVAL.
    So this is a practical issue too.
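
The error-handling shape described above can be modelled with a small user-space sketch. The `toy_*` names are hypothetical and the two connect attempts are reduced to their return values; this only illustrates the create/destroy balance, not the actual DSA code.

```c
#include <assert.h>
#include <errno.h>

static int toy_live_instances;  /* create/destroy balance counter */

static void toy_create(void)  { toy_live_instances++; }
static void toy_destroy(void) { toy_live_instances--; }

/*
 * of_ret:       result of the first (OF-based) connect attempt
 * fallback_ret: result of the fallback attempt on the internal bus
 */
static int toy_phy_setup(int of_ret, int fallback_ret)
{
	int ret;

	toy_create();

	ret = of_ret;
	if (ret == -ENODEV)     /* only -ENODEV triggers the fallback */
		ret = fallback_ret;

	if (ret)                /* any remaining error: undo the create */
		toy_destroy();

	return ret;
}
```

Funnelling every failure through one cleanup check means a new error code from the first attempt cannot bypass the destroy call.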

    Fixes: aab9c4067d23 ("net: dsa: Plug in PHYLINK support")
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
    Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
    Link: https://lore.kernel.org/r/20210914134331.2303380-1-vladimir.oltean@nxp.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 44ba281510190e2915506016407ba5b374a3add2
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Sep 8 17:00:29 2021 -0700

    net/af_unix: fix a data-race in unix_dgram_poll

    commit 04f08eb44b5011493d77b602fdec29ff0f5c6cd5 upstream.

    syzbot reported another data-race in af_unix [1]

    Let's change __skb_insert() to use WRITE_ONCE() when changing
    skb head qlen.

    Also, change unix_dgram_poll() to use the lockless version
    of unix_recvq_full().

    It is very possible that we can switch all/most unix_recvq_full()
    callers to the lockless version; this will be done in a future
    kernel version.
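
As a rough user-space analogy, the marked accesses can be modelled with C11 relaxed atomics. This is an assumption of the sketch: the kernel's WRITE_ONCE()/READ_ONCE() are not C11 atomics, and `struct toy_queue` and the `toy_*` helpers are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_queue {
	atomic_uint qlen;               /* written under a lock, read without one */
	unsigned int max_backlog;
};

/* Writer side: the caller is assumed to hold the queue lock. */
static void toy_queue_insert(struct toy_queue *q)
{
	unsigned int n = atomic_load_explicit(&q->qlen, memory_order_relaxed);

	/* Marked store, so a concurrent lockless reader never races on it. */
	atomic_store_explicit(&q->qlen, n + 1, memory_order_relaxed);
}

/* Lockless reader: a single marked load, no lock taken. */
static bool toy_recvq_full_lockless(struct toy_queue *q)
{
	return atomic_load_explicit(&q->qlen, memory_order_relaxed) >
	       q->max_backlog;
}
```

The reader may see a slightly stale length, which is acceptable for a poll-style hint; what the marking removes is the unmarked concurrent access itself.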

    [1] HEAD commit: 8596e589b787732c8346f0482919e83cc9362db1

    BUG: KCSAN: data-race in skb_queue_tail / unix_dgram_poll

    write to 0xffff88814eeb24e0 of 4 bytes by task 25815 on cpu 0:
     __skb_insert include/linux/skbuff.h:1938 [inline]
     __skb_queue_before include/linux/skbuff.h:2043 [inline]
     __skb_queue_tail include/linux/skbuff.h:2076 [inline]
     skb_queue_tail+0x80/0xa0 net/core/skbuff.c:3264
     unix_dgram_sendmsg+0xff2/0x1600 net/unix/af_unix.c:1850
     sock_sendmsg_nosec net/socket.c:703 [inline]
     sock_sendmsg net/socket.c:723 [inline]
     ____sys_sendmsg+0x360/0x4d0 net/socket.c:2392
     ___sys_sendmsg net/socket.c:2446 [inline]
     __sys_sendmmsg+0x315/0x4b0 net/socket.c:2532
     __do_sys_sendmmsg net/socket.c:2561 [inline]
     __se_sys_sendmmsg net/socket.c:2558 [inline]
     __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2558
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    read to 0xffff88814eeb24e0 of 4 bytes by task 25834 on cpu 1:
     skb_queue_len include/linux/skbuff.h:1869 [inline]
     unix_recvq_full net/unix/af_unix.c:194 [inline]
     unix_dgram_poll+0x2bc/0x3e0 net/unix/af_unix.c:2777
     sock_poll+0x23e/0x260 net/socket.c:1288
     vfs_poll include/linux/poll.h:90 [inline]
     ep_item_poll fs/eventpoll.c:846 [inline]
     ep_send_events fs/eventpoll.c:1683 [inline]
     ep_poll fs/eventpoll.c:1798 [inline]
     do_epoll_wait+0x6ad/0xf00 fs/eventpoll.c:2226
     __do_sys_epoll_wait fs/eventpoll.c:2238 [inline]
     __se_sys_epoll_wait fs/eventpoll.c:2233 [inline]
     __x64_sys_epoll_wait+0xf6/0x120 fs/eventpoll.c:2233
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    value changed: 0x0000001b -> 0x00000001

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 25834 Comm: syz-executor.1 Tainted: G        W         5.14.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: 86b18aaa2b5b ("skbuff: fix a data race in skb_queue_len()")
    Cc: Qian Cai <cai@lca.pw>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit c09a84aea0d3902f955cc6504e1c25cddb7c48c2
Author: Baptiste Lepers <baptiste.lepers@gmail.com>
Date:   Mon Sep 6 11:53:10 2021 +1000

    events: Reuse value read using READ_ONCE instead of re-reading it

    commit b89a05b21f46150ac10a962aa50109250b56b03b upstream.

    In perf_event_addr_filters_apply, the task associated with
    the event (event->ctx->task) is read using READ_ONCE at the beginning
    of the function, checked, and then re-read from event->ctx->task,
    voiding all guarantees of the checks. Reuse the value that was read by
    READ_ONCE to ensure the consistency of the task struct throughout the
    function.
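
The pattern being fixed can be illustrated in user space: read the shared pointer once, check it, and keep using the local copy. The sketch below models the single read with a C11 relaxed atomic load, which is an assumption; `struct toy_ctx`, `struct toy_task` and `toy_ctx_pid()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_task { int pid; };

struct toy_ctx {
	_Atomic(struct toy_task *) task;        /* may change concurrently */
};

/* Returns the pid of the context's task, or -1 if there is none. */
static int toy_ctx_pid(struct toy_ctx *ctx)
{
	/* Read the shared pointer exactly once ... */
	struct toy_task *task =
		atomic_load_explicit(&ctx->task, memory_order_relaxed);

	if (task == NULL)
		return -1;

	/* ... and use the checked local copy, never ctx->task again. */
	return task->pid;
}
```

Re-reading `ctx->task` after the check could yield a different pointer than the one that was validated, which is exactly what voids the check.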

    Fixes: 375637bc52495 ("perf/core: Introduce address range filtering")
    Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20210906015310.12802-1-baptiste.lepers@gmail.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit e68a11a4324dbf7e98eeda55c16501263ce2d00c
Author: Maor Gottlieb <maorg@nvidia.com>
Date:   Wed Sep 1 11:48:13 2021 +0300

    net/mlx5: Fix potential sleeping in atomic context

    commit ee27e330a953595903979ffdb84926843595a9fe upstream.

    Fix the flow below, which sleeps in atomic context, by releasing
    the RCU lock before calling free_match_list().

    build_match_list() <- disables preempt
    -> free_match_list()
       -> tree_put_node()
          -> down_write_ref_node() <- take write lock

    Fixes: 693c6883bbc4 ("net/mlx5: Add hash table for flow groups in flow table")
    Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
    Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 44428199a3ce7ba6c3c6b9c6385cbbc9131662ea
Author: Michael Petlan <mpetlan@redhat.com>
Date:   Mon Jul 19 16:53:32 2021 +0200

    perf machine: Initialize srcline string member in add_location struct

    commit 57f0ff059e3daa4e70a811cb1d31a49968262d20 upstream.

    It's later supposed to be either a correct address or NULL. Without the
    initialization, it may contain an undefined value which results in the
    following segmentation fault:

      # perf top --sort comm -g --ignore-callees=do_idle

    terminates with:

      #0  0x00007ffff56b7685 in __strlen_avx2 () from /lib64/libc.so.6
      #1  0x00007ffff55e3802 in strdup () from /lib64/libc.so.6
      #2  0x00005555558cb139 in hist_entry__init (callchain_size=<optimized out>, sample_self=true, template=0x7fffde7fb110, he=0x7fffd801c250) at util/hist.c:489
      #3  hist_entry__new (template=template@entry=0x7fffde7fb110, sample_self=sample_self@entry=true) at util/hist.c:564
      #4  0x00005555558cb4ba in hists__findnew_entry (hists=hists@entry=0x5555561d9e38, entry=entry@entry=0x7fffde7fb110, al=al@entry=0x7fffde7fb420,
          sample_self=sample_self@entry=true) at util/hist.c:657
      #5  0x00005555558cba1b in __hists__add_entry (hists=hists@entry=0x5555561d9e38, al=0x7fffde7fb420, sym_parent=<optimized out>, bi=bi@entry=0x0, mi=mi@entry=0x0,
          sample=sample@entry=0x7fffde7fb4b0, sample_self=true, ops=0x0, block_info=0x0) at util/hist.c:288
      #6  0x00005555558cbb70 in hists__add_entry (sample_self=true, sample=0x7fffde7fb4b0, mi=0x0, bi=0x0, sym_parent=<optimized out>, al=<optimized out>, hists=0x5555561d9e38)
          at util/hist.c:1056
      #7  iter_add_single_cumulative_entry (iter=0x7fffde7fb460, al=<optimized out>) at util/hist.c:1056
      #8  0x00005555558cc8a4 in hist_entry_iter__add (iter=iter@entry=0x7fffde7fb460, al=al@entry=0x7fffde7fb420, max_stack_depth=<optimized out>, arg=arg@entry=0x7fffffff7db0)
          at util/hist.c:1231
      #9  0x00005555557cdc9a in perf_event__process_sample (machine=<optimized out>, sample=0x7fffde7fb4b0, evsel=<optimized out>, event=<optimized out>, tool=0x7fffffff7db0)
          at builtin-top.c:842
      #10 deliver_event (qe=<optimized out>, qevent=<optimized out>) at builtin-top.c:1202
      #11 0x00005555558a9318 in do_flush (show_progress=false, oe=0x7fffffff80e0) at util/ordered-events.c:244
      #12 __ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP, timestamp=timestamp@entry=0) at util/ordered-events.c:323
      #13 0x00005555558a9789 in __ordered_events__flush (timestamp=<optimized out>, how=<optimized out>, oe=<optimized out>) at util/ordered-events.c:339
      #14 ordered_events__flush (how=OE_FLUSH__TOP, oe=0x7fffffff80e0) at util/ordered-events.c:341
      #15 ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP) at util/ordered-events.c:339
      #16 0x00005555557cd631 in process_thread (arg=0x7fffffff7db0) at builtin-top.c:1114
      #17 0x00007ffff7bb817a in start_thread () from /lib64/libpthread.so.0
      #18 0x00007ffff5656dc3 in clone () from /lib64/libc.so.6

    If you look at the frame #2, the code is:

    488          if (he->srcline) {
    489                  he->srcline = strdup(he->srcline);
    490                  if (he->srcline == NULL)
    491                          goto err_rawdata;
    492          }

    If he->srcline is not NULL (and uninitialized garbage is not NULL),
    it is passed to strdup(), and duplicating a garbage pointer causes
    the crash.

    Also, commit 1fb7d06a509e added the srcline member to the struct
    but did not initialize it everywhere it is needed.
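
A small user-space sketch of the invariant involved: the pointer is always either NULL or a valid string, so the "duplicate if set" check stays safe. `struct toy_location` and the `toy_*` helpers are hypothetical and do not mirror perf's real structures.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_location {
	unsigned long addr;
	char *srcline;          /* either a valid string or NULL */
};

static void toy_location_init(struct toy_location *al, unsigned long addr)
{
	al->addr = addr;
	al->srcline = NULL;     /* never leave the pointer indeterminate */
}

/* Duplicate srcline only when one is present; returns 0 on success. */
static int toy_copy_srcline(const struct toy_location *al, char **out)
{
	*out = NULL;
	if (al->srcline) {
		size_t n = strlen(al->srcline) + 1;

		*out = malloc(n);
		if (*out == NULL)
			return -1;
		memcpy(*out, al->srcline, n);
	}
	return 0;
}
```

Without the explicit NULL in the init helper, the `if (al->srcline)` test would read an indeterminate value from a stack-allocated struct.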

    Committer notes:

    Now I see, when using --ignore-callees=do_idle we end up here at line
    2189 in add_callchain_ip():

    2181         if (al.sym != NULL) {
    2182                 if (perf_hpp_list.parent && !*parent &&
    2183                     symbol__match_regex(al.sym, &parent_regex))
    2184                         *parent = al.sym;
    2185                 else if (have_ignore_callees && root_al &&
    2186                   symbol__match_regex(al.sym, &ignore_callees_regex)) {
    2187                         /* Treat this symbol as the root,
    2188                            forgetting its callees. */
    2189                         *root_al = al;
    2190                         callchain_cursor_reset(cursor);
    2191                 }
    2192         }

    And the al that doesn't have the ->srcline field initialized will be
    copied to the root_al, so then, back to:

    1211 int hist_entry_iter__add(struct hist_entry_iter *iter, struct addr_location *al,
    1212                          int max_stack_depth, void *arg)
    1213 {
    1214         int err, err2;
    1215         struct map *alm = NULL;
    1216
    1217         if (al)
    1218                 alm = map__get(al->map);
    1219
    1220         err = sample__resolve_callchain(iter->sample, &callchain_cursor, &iter->parent,
    1221                                         iter->evsel, al, max_stack_depth);
    1222         if (err) {
    1223                 map__put(alm);
    1224                 return err;
    1225         }
    1226
    1227         err = iter->ops->prepare_entry(iter, al);
    1228         if (err)
    1229                 goto out;
    1230
    1231         err = iter->ops->add_single_entry(iter, al);
    1232         if (err)
    1233                 goto out;
    1234

    That al at line 1221 is what sample__resolve_callchain() (called from
    hist_entry_iter__add()) saw as 'root_al', and then:

            iter->ops->add_single_entry(iter, al);

    will go on with al->srcline with a bogus value, I'll add the above
    sequence to the cset and apply, thanks!

    Signed-off-by: Michael Petlan <mpetlan@redhat.com>
    Cc: Milian Wolff <milian.wolff@kdab.com>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Fixes: 1fb7d06a509e ("perf report Use srcline from callchain for hist entries")
    Link: https://lore.kernel.org/r/20210719145332.29747-1-mpetlan@redhat.com
    Reported-by: Juri Lelli <jlelli@redhat.com>
    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit df957462980e7a68ccc389ccee9c2de72e973c43
Author:…
UtsavBalar1231 pushed a commit that referenced this issue Feb 26, 2022
commit c23a9fd209bc6f8c1fa6ee303fdf037d784a1627 upstream.

Two patches listed below removed the ctnetlink_dump_helpinfo call from under
rcu_read_lock. Now its rcu_dereference generates the following warning:
=============================
WARNING: suspicious RCU usage
5.13.0+ #5 Not tainted
-----------------------------
net/netfilter/nf_conntrack_netlink.c:221 suspicious rcu_dereference_check() usage!

other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
stack backtrace:
CPU: 1 PID: 2251 Comm: conntrack Not tainted 5.13.0+ #5
Call Trace:
 dump_stack+0x7f/0xa1
 ctnetlink_dump_helpinfo+0x134/0x150 [nf_conntrack_netlink]
 ctnetlink_fill_info+0x2c2/0x390 [nf_conntrack_netlink]
 ctnetlink_dump_table+0x13f/0x370 [nf_conntrack_netlink]
 netlink_dump+0x10c/0x370
 __netlink_dump_start+0x1a7/0x260
 ctnetlink_get_conntrack+0x1e5/0x250 [nf_conntrack_netlink]
 nfnetlink_rcv_msg+0x613/0x993 [nfnetlink]
 netlink_rcv_skb+0x50/0x100
 nfnetlink_rcv+0x55/0x120 [nfnetlink]
 netlink_unicast+0x181/0x260
 netlink_sendmsg+0x23f/0x460
 sock_sendmsg+0x5b/0x60
 __sys_sendto+0xf1/0x160
 __x64_sys_sendto+0x24/0x30
 do_syscall_64+0x36/0x70
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: 49ca022bccc5 ("netfilter: ctnetlink: don't dump ct extensions of unconfirmed conntracks")
Fixes: 0b35f60 ("netfilter: Remove duplicated rcu_read_lock.")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
myissam pushed a commit to myissam/kernel_xiaomi_sm8250 that referenced this issue Apr 2, 2022
Patch series "lib/sort & lib/list_sort: faster and smaller", v2.

Because CONFIG_RETPOLINE has made indirect calls much more expensive, I
thought I'd try to reduce the number made by the library sort functions.

The first three patches apply to lib/sort.c.

Patch #1 is a simple optimization.  The built-in swap has special cases
for aligned 4- and 8-byte objects.  But those are almost never used;
most calls to sort() work on larger structures, which fall back to the
byte-at-a-time loop.  This generalizes them to aligned *multiples* of 4
and 8 bytes.  (If nothing else, it saves an awful lot of energy by not
thrashing the store buffers as much.)

Patch #2 grabs a juicy piece of low-hanging fruit.  I agree that nice
simple solid heapsort is preferable to more complex algorithms (sorry,
Andrey), but it's possible to implement heapsort with far fewer
comparisons (50% asymptotically, 25-40% reduction for realistic sizes)
than the way it's been done up to now.  And with some care, the code
ends up smaller, as well.  This is the "big win" patch.

Patch #3 adds the same sort of indirect call bypass that has been added
to the net code of late.  The great majority of the callers use the
builtin swap functions, so replace the indirect call to sort_func with a
(highly predictable) series of if() statements.  Rather surprisingly,
this decreased code size, as the swap functions were inlined and their
prologue & epilogue code eliminated.
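
The indirect call bypass described above can be sketched as follows. This is an illustration under assumptions, not the lib/sort.c code: the NULL sentinel `TOY_SWAP_BUILTIN` and all `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*toy_swap_fn)(void *a, void *b, size_t n);

/* Hypothetical sentinel: a NULL pointer selects the built-in byte swap. */
#define TOY_SWAP_BUILTIN ((toy_swap_fn)0)

static int toy_custom_calls;    /* counts calls to the custom swap below */

static void toy_builtin_swap(void *a, void *b, size_t n)
{
	unsigned char *pa = a, *pb = b;

	while (n--) {
		unsigned char t = pa[n];

		pa[n] = pb[n];
		pb[n] = t;
	}
}

/* Example of a caller-supplied swap, reached through the pointer. */
static void toy_custom_swap(void *a, void *b, size_t n)
{
	toy_custom_calls++;
	toy_builtin_swap(a, b, n);
}

static void toy_do_swap(void *a, void *b, size_t n, toy_swap_fn fn)
{
	if (fn == TOY_SWAP_BUILTIN)
		toy_builtin_swap(a, b, n);      /* direct call, can be inlined */
	else
		fn(a, b, n);                    /* indirect call for custom swaps */
}
```

The common case takes a well-predicted branch and a direct call, so the retpoline cost is paid only by callers that supply their own swap function.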

lib/list_sort.c is a bit trickier, as merge sort is already close to
optimal, and we don't want to introduce triumphs of theory over
practicality like the Ford-Johnson merge-insertion sort.

Patch #4, without changing the algorithm, chops 32% off the code size
and removes the part[MAX_LIST_LENGTH+1] pointer array (and the
corresponding upper limit on efficiently sortable input size).

Patch #5 improves the algorithm.  The previous code is already optimal
for power-of-two (or slightly smaller) size inputs, but when the input
size is just over a power of 2, there's a very unbalanced final merge.

There are, in the literature, several algorithms which solve this, but
they all depend on the "breadth-first" merge order which was replaced by
commit 835cc0c with a more cache-friendly "depth-first" order.
Some hard thinking came up with a depth-first algorithm which defers
merges as little as possible while avoiding bad merges.  This saves
0.2*n compares, averaged over all sizes.

The code size increase is minimal (64 bytes on x86-64, reducing the net
savings to 26%), but the comments expanded significantly to document the
clever algorithm.

TESTING NOTES: I have some ugly user-space benchmarking code which I
used for testing before moving this code into the kernel.  Shout if you
want a copy.

I'm running this code right now, with CONFIG_TEST_SORT and
CONFIG_TEST_LIST_SORT, but I confess I haven't rebooted since the last
round of minor edits to quell checkpatch.  I figure there will be at
least one round of comments and final testing.

This patch (of 5):

Rather than having special-case swap functions for 4- and 8-byte
objects, special-case aligned multiples of 4 or 8 bytes.  This speeds up
most users of sort() by avoiding fallback to the byte copy loop.

Despite what ca96ab8 ("lib/sort: Add 64 bit swap function") claims,
very few users of sort() sort pointers (or pointer-sized objects); most
sort structures containing at least two words.  (E.g.
drivers/acpi/fan.c:acpi_fan_get_fps() sorts an array of 40-byte struct
acpi_fan_fps.)
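
A user-space sketch of the idea, assuming only the 4-byte case for brevity: swap word by word whenever the size and both addresses are multiples of 4, and fall back to bytes otherwise. The `toy_*` names are hypothetical, and `memcpy()` is used for the word accesses to keep the sketch free of aliasing problems; the kernel code is written differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Swap n bytes one 32-bit word at a time; n must be a non-zero multiple of 4. */
static void toy_swap_words_32(void *a, void *b, size_t n)
{
	unsigned char *pa = a, *pb = b;

	do {
		uint32_t ta, tb;

		n -= 4;
		memcpy(&ta, pa + n, 4);
		memcpy(&tb, pb + n, 4);
		memcpy(pa + n, &tb, 4);
		memcpy(pb + n, &ta, 4);
	} while (n);
}

/* Byte-at-a-time fallback for any size or alignment. */
static void toy_swap_bytes(void *a, void *b, size_t n)
{
	unsigned char *pa = a, *pb = b;

	while (n--) {
		unsigned char t = pa[n];

		pa[n] = pb[n];
		pb[n] = t;
	}
}

/* Use the word path when size and both addresses are multiples of 4. */
static void toy_swap(void *a, void *b, size_t n)
{
	if (n && ((n | (uintptr_t)a | (uintptr_t)b) & 3) == 0)
		toy_swap_words_32(a, b, n);
	else
		toy_swap_bytes(a, b, n);
}
```

A 12-byte aligned struct now takes three word swaps instead of twelve byte swaps, while odd sizes such as a 3-byte record still work through the fallback.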

The functions also got renamed to reflect the fact that they support
multiple words.  In the great tradition of bikeshedding, the names were
by far the most contentious issue during review of this patch series.

x86-64 code size 872 -> 886 bytes (+14)

With feedback from Andy Shevchenko, Rasmus Villemoes and Geert
Uytterhoeven.

Link: http://lkml.kernel.org/r/f24f932df3a7fa1973c1084154f1cea596bcf341.1552704200.git.lkml@sdf.org
Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Andrey Abramov <st5pub@yandex.ru>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Daniel Wagner <daniel.wagner@siemens.com>
Cc: Don Mullis <don.mullis@gmail.com>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
UtsavBalar1231 pushed a commit that referenced this issue Apr 20, 2022
[ Upstream commit ef27324e2cb7bb24542d6cb2571740eefe6b00dc ]

Our detector found a concurrent use-after-free bug when detaching an
NCI device. The root cause is an unexpected interleaving between the
two deferred-execution mechanisms in use (timer and workqueue).

The race can be demonstrated below:

Thread-1                           Thread-2
                                 | nci_dev_up()
                                 |   nci_open_device()
                                 |     __nci_request(nci_reset_req)
                                 |       nci_send_cmd
                                 |         queue_work(cmd_work)
nci_unregister_device()          |
  nci_close_device()             | ...
    del_timer_sync(cmd_timer)[1] |
...                              | Worker
nci_free_device()                | nci_cmd_work()
  kfree(ndev)[3]                 |   mod_timer(cmd_timer)[2]

In short, the cleanup routine assumes that cmd_timer has already been
detached by [1], but mod_timer can re-attach the timer [2] even though
the device has already been released [3], resulting in a UAF.

This UAF is easy to trigger. The crash trace from a PoC looks like this:

[   66.703713] ==================================================================
[   66.703974] BUG: KASAN: use-after-free in enqueue_timer+0x448/0x490
[   66.703974] Write of size 8 at addr ffff888009fb7058 by task kworker/u4:1/33
[   66.703974]
[   66.703974] CPU: 1 PID: 33 Comm: kworker/u4:1 Not tainted 5.18.0-rc2 #5
[   66.703974] Workqueue: nfc2_nci_cmd_wq nci_cmd_work
[   66.703974] Call Trace:
[   66.703974]  <TASK>
[   66.703974]  dump_stack_lvl+0x57/0x7d
[   66.703974]  print_report.cold+0x5e/0x5db
[   66.703974]  ? enqueue_timer+0x448/0x490
[   66.703974]  kasan_report+0xbe/0x1c0
[   66.703974]  ? enqueue_timer+0x448/0x490
[   66.703974]  enqueue_timer+0x448/0x490
[   66.703974]  __mod_timer+0x5e6/0xb80
[   66.703974]  ? mark_held_locks+0x9e/0xe0
[   66.703974]  ? try_to_del_timer_sync+0xf0/0xf0
[   66.703974]  ? lockdep_hardirqs_on_prepare+0x17b/0x410
[   66.703974]  ? queue_work_on+0x61/0x80
[   66.703974]  ? lockdep_hardirqs_on+0xbf/0x130
[   66.703974]  process_one_work+0x8bb/0x1510
[   66.703974]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[   66.703974]  ? pwq_dec_nr_in_flight+0x230/0x230
[   66.703974]  ? rwlock_bug.part.0+0x90/0x90
[   66.703974]  ? _raw_spin_lock_irq+0x41/0x50
[   66.703974]  worker_thread+0x575/0x1190
[   66.703974]  ? process_one_work+0x1510/0x1510
[   66.703974]  kthread+0x2a0/0x340
[   66.703974]  ? kthread_complete_and_exit+0x20/0x20
[   66.703974]  ret_from_fork+0x22/0x30
[   66.703974]  </TASK>
[   66.703974]
[   66.703974] Allocated by task 267:
[   66.703974]  kasan_save_stack+0x1e/0x40
[   66.703974]  __kasan_kmalloc+0x81/0xa0
[   66.703974]  nci_allocate_device+0xd3/0x390
[   66.703974]  nfcmrvl_nci_register_dev+0x183/0x2c0
[   66.703974]  nfcmrvl_nci_uart_open+0xf2/0x1dd
[   66.703974]  nci_uart_tty_ioctl+0x2c3/0x4a0
[   66.703974]  tty_ioctl+0x764/0x1310
[   66.703974]  __x64_sys_ioctl+0x122/0x190
[   66.703974]  do_syscall_64+0x3b/0x90
[   66.703974]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   66.703974]
[   66.703974] Freed by task 406:
[   66.703974]  kasan_save_stack+0x1e/0x40
[   66.703974]  kasan_set_track+0x21/0x30
[   66.703974]  kasan_set_free_info+0x20/0x30
[   66.703974]  __kasan_slab_free+0x108/0x170
[   66.703974]  kfree+0xb0/0x330
[   66.703974]  nfcmrvl_nci_unregister_dev+0x90/0xd0
[   66.703974]  nci_uart_tty_close+0xdf/0x180
[   66.703974]  tty_ldisc_kill+0x73/0x110
[   66.703974]  tty_ldisc_hangup+0x281/0x5b0
[   66.703974]  __tty_hangup.part.0+0x431/0x890
[   66.703974]  tty_release+0x3a8/0xc80
[   66.703974]  __fput+0x1f0/0x8c0
[   66.703974]  task_work_run+0xc9/0x170
[   66.703974]  exit_to_user_mode_prepare+0x194/0x1a0
[   66.703974]  syscall_exit_to_user_mode+0x19/0x50
[   66.703974]  do_syscall_64+0x48/0x90
[   66.703974]  entry_SYSCALL_64_after_hwframe+0x44/0xae

To fix the UAF, this patch adds flush_workqueue() to ensure that
nci_cmd_work has finished before the following del_timer_sync.
This combination guarantees that the timer is actually detached.

Fixes: 6a2968a ("NFC: basic NCI protocol implementation")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
UtsavBalar1231 pushed a commit that referenced this issue May 19, 2022
[ Upstream commit af68656d66eda219b7f55ce8313a1da0312c79e1 ]

While handling PCI errors (AER flow), the driver tries to
disable NAPI [napi_disable()] after NAPI has been deleted
[__netif_napi_del()], which causes an unexpected system
hang/crash.

System message log shows the following:
=======================================
[ 3222.537510] EEH: Detected PCI bus error on PHB#384-PE#800000
[ 3222.537511] EEH: This PCI device has failed 2 times in the last hour and will be permanently disabled after 5 failures.
[ 3222.537512] EEH: Notify device drivers to shutdown
[ 3222.537513] EEH: Beginning: 'error_detected(IO frozen)'
[ 3222.537514] EEH: PE#800000 (PCI 0384:80:00.0): Invoking bnx2x->error_detected(IO frozen)
[ 3222.537516] bnx2x: [bnx2x_io_error_detected:14236(eth14)]IO error detected
[ 3222.537650] EEH: PE#800000 (PCI 0384:80:00.0): bnx2x driver reports: 'need reset'
[ 3222.537651] EEH: PE#800000 (PCI 0384:80:00.1): Invoking bnx2x->error_detected(IO frozen)
[ 3222.537651] bnx2x: [bnx2x_io_error_detected:14236(eth13)]IO error detected
[ 3222.537729] EEH: PE#800000 (PCI 0384:80:00.1): bnx2x driver reports: 'need reset'
[ 3222.537729] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[ 3222.537890] EEH: Collect temporary log
[ 3222.583481] EEH: of node=0384:80:00.0
[ 3222.583519] EEH: PCI device/vendor: 168e14e4
[ 3222.583557] EEH: PCI cmd/status register: 00100140
[ 3222.583557] EEH: PCI-E capabilities and status follow:
[ 3222.583744] EEH: PCI-E 00: 00020010 012c8da2 00095d5e 00455c82
[ 3222.583892] EEH: PCI-E 10: 10820000 00000000 00000000 00000000
[ 3222.583893] EEH: PCI-E 20: 00000000
[ 3222.583893] EEH: PCI-E AER capability register set follows:
[ 3222.584079] EEH: PCI-E AER 00: 13c10001 00000000 00000000 00062030
[ 3222.584230] EEH: PCI-E AER 10: 00002000 000031c0 000001e0 00000000
[ 3222.584378] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 3222.584416] EEH: PCI-E AER 30: 00000000 00000000
[ 3222.584416] EEH: of node=0384:80:00.1
[ 3222.584454] EEH: PCI device/vendor: 168e14e4
[ 3222.584491] EEH: PCI cmd/status register: 00100140
[ 3222.584492] EEH: PCI-E capabilities and status follow:
[ 3222.584677] EEH: PCI-E 00: 00020010 012c8da2 00095d5e 00455c82
[ 3222.584825] EEH: PCI-E 10: 10820000 00000000 00000000 00000000
[ 3222.584826] EEH: PCI-E 20: 00000000
[ 3222.584826] EEH: PCI-E AER capability register set follows:
[ 3222.585011] EEH: PCI-E AER 00: 13c10001 00000000 00000000 00062030
[ 3222.585160] EEH: PCI-E AER 10: 00002000 000031c0 000001e0 00000000
[ 3222.585309] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 3222.585347] EEH: PCI-E AER 30: 00000000 00000000
[ 3222.586872] RTAS: event: 5, Type: Platform Error (224), Severity: 2
[ 3222.586873] EEH: Reset without hotplug activity
[ 3224.762767] EEH: Beginning: 'slot_reset'
[ 3224.762770] EEH: PE#800000 (PCI 0384:80:00.0): Invoking bnx2x->slot_reset()
[ 3224.762771] bnx2x: [bnx2x_io_slot_reset:14271(eth14)]IO slot reset initializing...
[ 3224.762887] bnx2x 0384:80:00.0: enabling device (0140 -> 0142)
[ 3224.768157] bnx2x: [bnx2x_io_slot_reset:14287(eth14)]IO slot reset --> driver unload

Uninterruptible tasks
=====================
crash> ps | grep UN
     213      2  11  c000000004c89e00  UN   0.0       0      0  [eehd]
     215      2   0  c000000004c80000  UN   0.0       0      0  [kworker/0:2]
    2196      1  28  c000000004504f00  UN   0.1   15936  11136  wickedd
    4287      1   9  c00000020d076800  UN   0.0    4032   3008  agetty
    4289      1  20  c00000020d056680  UN   0.0    7232   3840  agetty
   32423      2  26  c00000020038c580  UN   0.0       0      0  [kworker/26:3]
   32871   4241  27  c0000002609ddd00  UN   0.1   18624  11648  sshd
   32920  10130  16  c00000027284a100  UN   0.1   48512  12608  sendmail
   33092  32987   0  c000000205218b00  UN   0.1   48512  12608  sendmail
   33154   4567  16  c000000260e51780  UN   0.1   48832  12864  pickup
   33209   4241  36  c000000270cb6500  UN   0.1   18624  11712  sshd
   33473  33283   0  c000000205211480  UN   0.1   48512  12672  sendmail
   33531   4241  37  c00000023c902780  UN   0.1   18624  11648  sshd

EEH handler hung while bnx2x sleeping and holding RTNL lock
===========================================================
crash> bt 213
PID: 213    TASK: c000000004c89e00  CPU: 11  COMMAND: "eehd"
  #0 [c000000004d477e0] __schedule at c000000000c70808
  #1 [c000000004d478b0] schedule at c000000000c70ee0
  #2 [c000000004d478e0] schedule_timeout at c000000000c76dec
  #3 [c000000004d479c0] msleep at c0000000002120cc
  #4 [c000000004d479f0] napi_disable at c000000000a06448
                                        ^^^^^^^^^^^^^^^^
  #5 [c000000004d47a30] bnx2x_netif_stop at c0080000018dba94 [bnx2x]
  #6 [c000000004d47a60] bnx2x_io_slot_reset at c0080000018a551c [bnx2x]
  #7 [c000000004d47b20] eeh_report_reset at c00000000004c9bc
  #8 [c000000004d47b90] eeh_pe_report at c00000000004d1a8
  #9 [c000000004d47c40] eeh_handle_normal_event at c00000000004da64

And the sleeping source code
============================
crash> dis -ls c000000000a06448
FILE: ../net/core/dev.c
LINE: 6702

   6697  {
   6698          might_sleep();
   6699          set_bit(NAPI_STATE_DISABLE, &n->state);
   6700
   6701          while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
* 6702                  msleep(1);
   6703          while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
   6704                  msleep(1);
   6705
   6706          hrtimer_cancel(&n->timer);
   6707
   6708          clear_bit(NAPI_STATE_DISABLE, &n->state);
   6709  }

EEH calls into bnx2x twice based on the system log above, first through
bnx2x_io_error_detected() and then bnx2x_io_slot_reset(), and executes
the following call chains:

bnx2x_io_error_detected()
  +-> bnx2x_eeh_nic_unload()
       +-> bnx2x_del_all_napi()
            +-> __netif_napi_del()

bnx2x_io_slot_reset()
  +-> bnx2x_netif_stop()
       +-> bnx2x_napi_disable()
            +->napi_disable()

Fix this by correcting the order of the NAPI API calls, that is, disable
the NAPI first and delete it only afterwards.
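
The wait loop shown in the napi_disable() listing above can be modelled in
user space. This is only a minimal sketch: the model_* names are hypothetical,
the retry bound replaces the kernel's unbounded msleep(1) loop, and it assumes
(as the backtrace suggests, but the commit does not spell out) that the SCHED
bit is still set when the second callback reaches napi_disable().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model; the bit name mirrors the listing above. */
enum { MODEL_NAPI_STATE_SCHED = 1u << 0 };

struct model_napi {
	atomic_uint state;
};

/* Returns true if the bit was already set, like test_and_set_bit(). */
static bool model_test_and_set(struct model_napi *n, unsigned int bit)
{
	return (atomic_fetch_or(&n->state, bit) & bit) != 0;
}

/*
 * Bounded version of the wait loop in napi_disable(). The real loop sleeps
 * and retries forever. Returns true if SCHED ownership was obtained within
 * max_tries attempts, false if the bit never became free.
 */
static bool model_napi_disable(struct model_napi *n, unsigned int max_tries)
{
	assert(max_tries > 0);
	for (unsigned int i = 0; i < max_tries; i++) {
		if (!model_test_and_set(n, MODEL_NAPI_STATE_SCHED))
			return true;
	}
	return false;
}
```

With an idle instance the model returns on the first try. With the bit
already set and nothing left to clear it, every retry fails, which matches
the eehd task sleeping in msleep() above.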

Fixes: 7fa6f34 ("bnx2x: AER revised")
Reported-by: David Christensen <drc@linux.vnet.ibm.com>
Tested-by: David Christensen <drc@linux.vnet.ibm.com>
Signed-off-by: Manish Chopra <manishc@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Link: https://lore.kernel.org/r/20220426153913.6966-1-manishc@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
myissam pushed a commit to myissam/kernel_xiaomi_sm8250 that referenced this issue Jun 1, 2022
[ Upstream commit 6cf539a87a61a4fbc43f625267dbcbcf283872ed ]

This fixes a data-race where `atomic_t dynticks` is copied by value. The
copy is performed non-atomically, resulting in a data-race if `dynticks`
is updated concurrently.
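
As a rough user-space illustration of the idea (not the kernel patch itself),
the snapshot can be taken with an explicit atomic load and passed on as a
plain int, so the atomic object is never copied by value. The model_* names
and the struct layout below are assumptions made for this sketch, and C11
atomics stand in for the kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-CPU structure holding `dynticks`. */
struct model_rcu_data {
	atomic_int dynticks;
};

/*
 * Take the snapshot with an explicit atomic load and hand a plain int to
 * the consumer, instead of copying the atomic object itself.
 */
static int model_dynticks_snapshot(struct model_rcu_data *rdp)
{
	return atomic_load(&rdp->dynticks);
}

/* Consumer (for example a trace hook) that only ever sees a plain int. */
static int model_trace_value(int dynticks)
{
	return dynticks;
}
```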

This data-race was found with KCSAN:
==================================================================
BUG: KCSAN: data-race in dyntick_save_progress_counter / rcu_irq_enter

write to 0xffff989dbdbe98e0 of 4 bytes by task 10 on cpu 3:
 atomic_add_return include/asm-generic/atomic-instrumented.h:78 [inline]
 rcu_dynticks_snap kernel/rcu/tree.c:310 [inline]
 dyntick_save_progress_counter+0x43/0x1b0 kernel/rcu/tree.c:984
 force_qs_rnp+0x183/0x200 kernel/rcu/tree.c:2286
 rcu_gp_fqs kernel/rcu/tree.c:1601 [inline]
 rcu_gp_fqs_loop+0x71/0x880 kernel/rcu/tree.c:1653
 rcu_gp_kthread+0x22c/0x3b0 kernel/rcu/tree.c:1799
 kthread+0x1b5/0x200 kernel/kthread.c:255
 <snip>

read to 0xffff989dbdbe98e0 of 4 bytes by task 154 on cpu 7:
 rcu_nmi_enter_common kernel/rcu/tree.c:828 [inline]
 rcu_irq_enter+0xda/0x240 kernel/rcu/tree.c:870
 irq_enter+0x5/0x50 kernel/softirq.c:347
 <snip>

Reported by Kernel Concurrency Sanitizer on:
CPU: 7 PID: 154 Comm: kworker/7:1H Not tainted 5.3.0+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Workqueue: kblockd blk_mq_run_work_fn
==================================================================

Signed-off-by: Marco Elver <elver@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: rcu@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Lucas Lee Jing Yi <lucasleeeeeeeee@gmail.com>
Signed-off-by: Carlos Ayrton Lopez Arroyo <15030201@itcelaya.edu.mx>
Signed-off-by: Little-W <1405481963@qq.com>
myissam pushed a commit to myissam/kernel_xiaomi_sm8250 that referenced this issue Jun 18, 2022
UtsavBalar1231 pushed a commit that referenced this issue Jul 20, 2022
…for migration

Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.

Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:

 #1> mkdir -p /sys/fs/cgroup/misc/a/b
 #2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
 #3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
 #4> PID=$!
 #5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
 #6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs

Step #5 moves only the group leader into a/b, and step #6 then moves the
whole process, including the group leader, back into a. In this final
migration, non-leader threads would be doing identity migration while the
group leader is doing an actual one.

After #3, let's say the whole process was in cset A, and that after #5, the
leader moves to cset B. Then, during #6, the following happens:

 1. cgroup_migrate_add_src() is called on B for the leader.

 2. cgroup_migrate_add_src() is called on A for the other threads.

 3. cgroup_migrate_prepare_dst() is called. It scans the src list.

 4. It notices that B wants to migrate to A, so it tries to add A to the dst
    list but realizes that its ->mg_preload_node is already busy.

 5. It then notices that A wants to migrate to A. As this is an identity
    migration, it culls A by list_del_init()'ing its ->mg_preload_node and
    putting references accordingly.

 6. The rest of migration takes place with B on the src list but nothing on
    the dst list.

This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.

This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.

This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.
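
The effect of sharing one node between both lists can be sketched with a
small user-space model. Everything here is hypothetical and simplified: list
membership is reduced to a flag, and the struct and function names are
invented for illustration rather than taken from the cgroup code.

```c
#include <assert.h>
#include <stdbool.h>

enum model_list { MODEL_LIST_NONE, MODEL_LIST_SRC, MODEL_LIST_DST };

/* Before: one link shared by both preload lists. */
struct model_cset_shared {
	enum model_list linked_on;	/* stands in for ->mg_preload_node */
};

/* After: one link per preload list. */
struct model_cset_split {
	bool on_src;	/* stands in for ->mg_src_preload_node */
	bool on_dst;	/* stands in for ->mg_dst_preload_node */
};

/* Adding to the dst list fails when the shared link is already busy. */
static bool shared_add_dst(struct model_cset_shared *c)
{
	if (c->linked_on != MODEL_LIST_NONE)
		return false;
	c->linked_on = MODEL_LIST_DST;
	return true;
}

/* Culling an identity migration unlinks the shared node entirely. */
static void shared_cull_src(struct model_cset_shared *c)
{
	assert(c->linked_on == MODEL_LIST_SRC);
	c->linked_on = MODEL_LIST_NONE;
}

/* With separate links, dst membership is independent of src membership. */
static bool split_add_dst(struct model_cset_split *c)
{
	c->on_dst = true;
	return true;
}

static void split_cull_src(struct model_cset_split *c)
{
	c->on_src = false;
}
```

In the shared model, a cset that is already on the src list cannot be added
to the dst list, and culling it from src leaves it on neither list. In the
split model the same sequence leaves it on the dst list, so it stays held.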

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de9 ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
(cherry picked from commit 07fd5b6cdf3cc30bfde8fe0f644771688be04447
 https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-5.19-fixes)
Bug: 235577024
Change-Id: Ieaf1c0c8fc23753570897fd6e48a54335ab939ce
Signed-off-by: Steve Muckle <smuckle@google.com>
Git-commit: d1faa010ca160216a17435b81c642daf08ecbcbd
Git-repo: https://android.googlesource.com/kernel/common/
Signed-off-by: Srinivasarao Pathipati <quic_c_spathi@quicinc.com>
UtsavBalar1231 pushed a commit that referenced this issue Aug 3, 2022
…tion

commit 07fd5b6cdf3cc30bfde8fe0f644771688be04447 upstream.

Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.

Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:

 #1> mkdir -p /sys/fs/cgroup/misc/a/b
 #2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
 #3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
 #4> PID=$!
 #5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
 #6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs

Step #5 moves only the group leader into a/b, and step #6 then moves the
whole process, including the group leader, back into a. In this final
migration, non-leader threads would be doing identity migration while the
group leader is doing an actual one.

After #3, let's say the whole process was in cset A, and that after #5, the
leader moves to cset B. Then, during #6, the following happens:

 1. cgroup_migrate_add_src() is called on B for the leader.

 2. cgroup_migrate_add_src() is called on A for the other threads.

 3. cgroup_migrate_prepare_dst() is called. It scans the src list.

 4. It notices that B wants to migrate to A, so it tries to add A to the dst
    list but realizes that its ->mg_preload_node is already busy.

 5. It then notices that A wants to migrate to A. As this is an identity
    migration, it culls A by list_del_init()'ing its ->mg_preload_node and
    putting references accordingly.

 6. The rest of migration takes place with B on the src list but nothing on
    the dst list.

This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.

This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.

This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de9 ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
UtsavBalar1231 pushed a commit that referenced this issue Aug 3, 2022
In 2019, Sergey fixed a lockdep splat with 15341b1 ("char/random:
silence a lockdep splat with printk()"), but that got reverted soon
after from 4.19 because back then it apparently caused various problems.
But the issue it was fixing is still there, and more generally, many
patches turning printk() into printk_deferred() have landed since,
making me suspect it's okay to try this out again.
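
The general idea behind deferring the output can be sketched in user space:
while the caller may be holding a lock, the message is only recorded, and the
console work happens later from a context that holds none of those locks.
This is a hypothetical model with invented names, not the printk
implementation.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_DEFERRED 8

struct model_logger {
	const char *deferred[MODEL_MAX_DEFERRED];
	size_t n_deferred;	/* recorded but not yet written out */
	size_t n_emitted;	/* already written to the "console" */
};

/* Deferred path: only record the message and touch no console state. */
static void model_log_deferred(struct model_logger *lg, const char *msg)
{
	assert(lg->n_deferred < MODEL_MAX_DEFERRED);
	lg->deferred[lg->n_deferred++] = msg;
}

/* Called later, from a context that holds none of the caller's locks. */
static size_t model_flush(struct model_logger *lg)
{
	size_t flushed = lg->n_deferred;

	lg->n_emitted += flushed;
	lg->n_deferred = 0;
	return flushed;
}
```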

This should fix the following deadlock found by the kernel test robot:

[   18.287691] WARNING: possible circular locking dependency detected
[   18.287692] 4.19.248-00165-g3d1f971aa81f #1 Not tainted
[   18.287693] ------------------------------------------------------
[   18.287712] stop/202 is trying to acquire lock:
[   18.287713] (ptrval) (console_owner){..-.}, at: console_unlock (??:?)
[   18.287717]
[   18.287718] but task is already holding lock:
[   18.287718] (ptrval) (&(&port->lock)->rlock){-...}, at: pty_write (pty.c:?)
[   18.287722]
[   18.287722] which lock already depends on the new lock.
[   18.287723]
[   18.287724]
[   18.287725] the existing dependency chain (in reverse order) is:
[   18.287725]
[   18.287726] -> #2 (&(&port->lock)->rlock){-...}:
[   18.287729] validate_chain+0x84a/0xe00
[   18.287729] __lock_acquire (lockdep.c:?)
[   18.287730] lock_acquire (??:?)
[   18.287731] _raw_spin_lock_irqsave (??:?)
[   18.287732] tty_port_tty_get (??:?)
[   18.287733] tty_port_default_wakeup (tty_port.c:?)
[   18.287734] tty_port_tty_wakeup (??:?)
[   18.287734] uart_write_wakeup (??:?)
[   18.287735] serial8250_tx_chars (??:?)
[   18.287736] serial8250_handle_irq (??:?)
[   18.287737] serial8250_default_handle_irq (8250_port.c:?)
[   18.287738] serial8250_interrupt (8250_core.c:?)
[   18.287738] __handle_irq_event_percpu (??:?)
[   18.287739] handle_irq_event_percpu (??:?)
[   18.287740] handle_irq_event (??:?)
[   18.287741] handle_edge_irq (??:?)
[   18.287742] handle_irq (??:?)
[   18.287742] do_IRQ (??:?)
[   18.287743] common_interrupt (entry_32.o:?)
[   18.287744] _raw_spin_unlock_irqrestore (??:?)
[   18.287745] uart_write (serial_core.c:?)
[   18.287746] process_output_block (n_tty.c:?)
[   18.287747] n_tty_write (n_tty.c:?)
[   18.287747] tty_write (tty_io.c:?)
[   18.287748] __vfs_write (??:?)
[   18.287749] vfs_write (??:?)
[   18.287750] ksys_write (??:?)
[   18.287750] sys_write (??:?)
[   18.287751] do_fast_syscall_32 (??:?)
[   18.287752] entry_SYSENTER_32 (??:?)
[   18.287752]
[   18.287753] -> #1 (&port_lock_key){-.-.}:
[   18.287756]
[   18.287756] -> #0 (console_owner){..-.}:
[   18.287759] check_prevs_add (lockdep.c:?)
[   18.287760] validate_chain+0x84a/0xe00
[   18.287761] __lock_acquire (lockdep.c:?)
[   18.287761] lock_acquire (??:?)
[   18.287762] console_unlock (??:?)
[   18.287763] vprintk_emit (??:?)
[   18.287764] vprintk_default (??:?)
[   18.287764] vprintk_func (??:?)
[   18.287765] printk (??:?)
[   18.287766] get_random_u32 (??:?)
[   18.287767] shuffle_freelist (slub.c:?)
[   18.287767] allocate_slab (slub.c:?)
[   18.287768] new_slab (slub.c:?)
[   18.287769] ___slab_alloc+0x6d0/0xb20
[   18.287770] __slab_alloc+0xd6/0x2e0
[   18.287770] __kmalloc (??:?)
[   18.287771] tty_buffer_alloc (tty_buffer.c:?)
[   18.287772] __tty_buffer_request_room (tty_buffer.c:?)
[   18.287773] tty_insert_flip_string_fixed_flag (??:?)
[   18.287774] pty_write (pty.c:?)
[   18.287775] process_output_block (n_tty.c:?)
[   18.287776] n_tty_write (n_tty.c:?)
[   18.287777] tty_write (tty_io.c:?)
[   18.287778] __vfs_write (??:?)
[   18.287779] vfs_write (??:?)
[   18.287780] ksys_write (??:?)
[   18.287780] sys_write (??:?)
[   18.287781] do_fast_syscall_32 (??:?)
[   18.287782] entry_SYSENTER_32 (??:?)
[   18.287783]
[   18.287783] other info that might help us debug this:
[   18.287784]
[   18.287785] Chain exists of:
[   18.287785]   console_owner --> &port_lock_key --> &(&port->lock)->rlock
[   18.287789]
[   18.287790]  Possible unsafe locking scenario:
[   18.287790]
[   18.287791]        CPU0                    CPU1
[   18.287792]        ----                    ----
[   18.287792]   lock(&(&port->lock)->rlock);
[   18.287794]                                lock(&port_lock_key);
[   18.287814]                                lock(&(&port->lock)->rlock);
[   18.287815]   lock(console_owner);
[   18.287817]
[   18.287818]  *** DEADLOCK ***
[   18.287818]
[   18.287819] 6 locks held by stop/202:
[   18.287820] #0: (ptrval) (&tty->ldisc_sem){++++}, at: ldsem_down_read (??:?)
[   18.287823] #1: (ptrval) (&tty->atomic_write_lock){+.+.}, at: tty_write_lock (tty_io.c:?)
[   18.287826] #2: (ptrval) (&o_tty->termios_rwsem/1){++++}, at: n_tty_write (n_tty.c:?)
[   18.287830] #3: (ptrval) (&ldata->output_lock){+.+.}, at: process_output_block (n_tty.c:?)
[   18.287834] #4: (ptrval) (&(&port->lock)->rlock){-...}, at: pty_write (pty.c:?)
[   18.287838] #5: (ptrval) (console_lock){+.+.}, at: console_trylock_spinning (printk.c:?)
[   18.287841]
[   18.287842] stack backtrace:
[   18.287843] CPU: 0 PID: 202 Comm: stop Not tainted 4.19.248-00165-g3d1f971aa81f #1
[   18.287843] Call Trace:
[   18.287844] dump_stack (??:?)
[   18.287845] print_circular_bug.cold+0x78/0x8b
[   18.287846] check_prev_add+0x66a/0xd20
[   18.287847] check_prevs_add (lockdep.c:?)
[   18.287848] validate_chain+0x84a/0xe00
[   18.287848] __lock_acquire (lockdep.c:?)
[   18.287849] lock_acquire (??:?)
[   18.287850] ? console_unlock (??:?)
[   18.287851] console_unlock (??:?)
[   18.287851] ? console_unlock (??:?)
[   18.287852] ? native_save_fl (??:?)
[   18.287853] vprintk_emit (??:?)
[   18.287854] vprintk_default (??:?)
[   18.287855] vprintk_func (??:?)
[   18.287855] printk (??:?)
[   18.287856] get_random_u32 (??:?)
[   18.287857] ? shuffle_freelist (slub.c:?)
[   18.287858] shuffle_freelist (slub.c:?)
[   18.287858] ? page_address (??:?)
[   18.287859] allocate_slab (slub.c:?)
[   18.287860] new_slab (slub.c:?)
[   18.287861] ? pvclock_clocksource_read (??:?)
[   18.287862] ___slab_alloc+0x6d0/0xb20
[   18.287862] ? kvm_sched_clock_read (kvmclock.c:?)
[   18.287863] ? __slab_alloc+0xbc/0x2e0
[   18.287864] ? native_wbinvd (paravirt.c:?)
[   18.287865] __slab_alloc+0xd6/0x2e0
[   18.287865] __kmalloc (??:?)
[   18.287866] ? __lock_acquire (lockdep.c:?)
[   18.287867] ? tty_buffer_alloc (tty_buffer.c:?)
[   18.287868] tty_buffer_alloc (tty_buffer.c:?)
[   18.287869] __tty_buffer_request_room (tty_buffer.c:?)
[   18.287869] tty_insert_flip_string_fixed_flag (??:?)
[   18.287870] pty_write (pty.c:?)
[   18.287871] process_output_block (n_tty.c:?)
[   18.287872] n_tty_write (n_tty.c:?)
[   18.287873] ? print_dl_stats (??:?)
[   18.287874] ? n_tty_ioctl (n_tty.c:?)
[   18.287874] tty_write (tty_io.c:?)
[   18.287875] ? n_tty_ioctl (n_tty.c:?)
[   18.287876] ? tty_write_unlock (tty_io.c:?)
[   18.287877] __vfs_write (??:?)
[   18.287877] vfs_write (??:?)
[   18.287878] ? __fget_light (file.c:?)
[   18.287879] ksys_write (??:?)

Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Lech Perczak <l.perczak@camlintechnologies.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: John Ogness <john.ogness@linutronix.de>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/lkml/Ytz+lo4zRQYG3JUR@xsang-OptiPlex-9020
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
UtsavBalar1231 pushed a commit that referenced this issue Aug 12, 2022
commit 2b1299322016731d56807aa49254a5ea3080b6b3 upstream.

tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as
documented for RET instructions after VM exits. Mitigate it with a new
one-entry RSB stuffing mechanism and a new LFENCE.

== Background ==

Indirect Branch Restricted Speculation (IBRS) was designed to help
mitigate Branch Target Injection and Speculative Store Bypass, i.e.
Spectre, attacks. IBRS prevents software run in less privileged modes
from affecting branch prediction in more privileged modes. IBRS requires
the MSR to be written on every privilege level change.

To overcome some of the performance issues of IBRS, Enhanced IBRS was
introduced.  eIBRS is an "always on" IBRS, in other words, just turn
it on once instead of writing the MSR on every privilege level change.
When eIBRS is enabled, more privileged modes should be protected from
less privileged modes, including protecting VMMs from guests.

== Problem ==

Here's a simplification of how guests are run on Linux' KVM:

void run_kvm_guest(void)
{
	// Prepare to run guest
	VMRESUME();
	// Clean up after guest runs
}

The execution flow for that would look something like this to the
processor:

1. Host-side: call run_kvm_guest()
2. Host-side: VMRESUME
3. Guest runs, does "CALL guest_function"
4. VM exit, host runs again
5. Host might make some "cleanup" function calls
6. Host-side: RET from run_kvm_guest()

Now, when back on the host, there are a couple of possible scenarios of
post-guest activity the host needs to do before executing host code:

* on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not
touched and Linux has to do a 32-entry stuffing.

* on eIBRS hardware, VM exit with IBRS enabled, or restoring the host
IBRS=1 shortly after VM exit, has a documented side effect of flushing
the RSB except in this PBRSB situation where the software needs to stuff
the last RSB entry "by hand".

IOW, with eIBRS supported, host RET instructions should no longer be
influenced by guest behavior after the host retires a single CALL
instruction.

However, if the RET instructions are "unbalanced" with CALLs after a VM
exit as is the RET in #6, it might speculatively use the address for the
instruction after the CALL in #3 as an RSB prediction. This is a problem
since the (untrusted) guest controls this address.

Balanced CALL/RET instruction pairs such as in step #5 are not affected.

== Solution ==

The PBRSB issue affects a wide variety of Intel processors which
support eIBRS. But not all of them need mitigation. Today,
X86_FEATURE_RETPOLINE triggers an RSB filling sequence that mitigates
PBRSB. Systems setting RETPOLINE need no further mitigation - i.e.,
eIBRS systems which enable retpoline explicitly.

However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RETPOLINE
and most of them need a new mitigation.

Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE
which triggers a lighter-weight PBRSB mitigation versus RSB Filling at
vmexit.

The lighter-weight mitigation performs a CALL instruction which is
immediately followed by a speculative execution barrier (INT3). This
steers speculative execution to the barrier -- just like a retpoline
-- which ensures that speculation can never reach an unbalanced RET.
Then, ensure this CALL is retired before continuing execution with an
LFENCE.

In other words, the window of exposure is opened at VM exit where RET
behavior is troublesome. While the window is open, force RSB predictions
sampling for RET targets to a dead end at the INT3. Close the window
with the LFENCE.

There is a subset of eIBRS systems which are not vulnerable to PBRSB.
Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB.
Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO.
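
The selection described above can be summarised as a small decision function.
This is a simplified reading of the commit message, not the kernel's actual
selection code. The enum, struct and function names are hypothetical, and
"pbrsb_immune" stands for a CPU that is either on the whitelist or reports
that it is not affected.

```c
#include <assert.h>
#include <stdbool.h>

enum model_vmexit_mitigation {
	MODEL_VMEXIT_NONE,	/* nothing extra needed at VM exit */
	MODEL_VMEXIT_RSB_FILL,	/* full RSB filling sequence */
	MODEL_VMEXIT_RSB_LITE,	/* single CALL to INT3, then LFENCE */
};

struct model_cpu {
	bool eibrs;		/* Enhanced IBRS in use */
	bool retpoline;		/* retpoline enabled */
	bool pbrsb_immune;	/* whitelisted or reports not affected */
};

static enum model_vmexit_mitigation model_pick(const struct model_cpu *cpu)
{
	assert(cpu != 0);
	/* Pre-eIBRS hardware and retpoline systems fill the whole RSB. */
	if (!cpu->eibrs || cpu->retpoline)
		return MODEL_VMEXIT_RSB_FILL;
	/* eIBRS parts that are not vulnerable need nothing further. */
	if (cpu->pbrsb_immune)
		return MODEL_VMEXIT_NONE;
	/* Remaining eIBRS parts get the lighter-weight sequence. */
	return MODEL_VMEXIT_RSB_LITE;
}
```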

  [ bp: Massage, incorporate review comments from Andy Cooper. ]
  [ Pawan: Update commit message to replace RSB_VMEXIT with RETPOLINE ]

Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mesziman pushed a commit to mesziman/kernel_xiaomi_sm8250 that referenced this issue Sep 16, 2022
[ Upstream commit 84a53580c5d2138c7361c7c3eea5b31827e63b35 ]

The SRv6 layer allows defining HMAC data that can later be used to sign IPv6
Segment Routing Headers. This configuration is realised via netlink through
four attributes: SEG6_ATTR_HMACKEYID, SEG6_ATTR_SECRET, SEG6_ATTR_SECRETLEN and
SEG6_ATTR_ALGID. Because the SECRETLEN attribute is decoupled from the actual
length of the SECRET attribute, it is possible to provide invalid combinations
(e.g., secret = "", secretlen = 64). This case is not checked in the code and
with an appropriately crafted netlink message, an out-of-bounds read of up
to 64 bytes (max secret length) can occur past the skb end pointer and into
skb_shared_info:

Breakpoint 1, seg6_genl_sethmac (skb=<optimized out>, info=<optimized out>) at net/ipv6/seg6.c:208
208		memcpy(hinfo->secret, secret, slen);
(gdb) bt
 #0  seg6_genl_sethmac (skb=<optimized out>, info=<optimized out>) at net/ipv6/seg6.c:208
 #1  0xffffffff81e012e9 in genl_family_rcv_msg_doit (skb=skb@entry=0xffff88800b1f9f00, nlh=nlh@entry=0xffff88800b1b7600,
    extack=extack@entry=0xffffc90000ba7af0, ops=ops@entry=0xffffc90000ba7a80, hdrlen=4, net=0xffffffff84237580 <init_net>, family=<optimized out>,
    family=<optimized out>) at net/netlink/genetlink.c:731
 #2  0xffffffff81e01435 in genl_family_rcv_msg (extack=0xffffc90000ba7af0, nlh=0xffff88800b1b7600, skb=0xffff88800b1f9f00,
    family=0xffffffff82fef6c0 <seg6_genl_family>) at net/netlink/genetlink.c:775
 #3  genl_rcv_msg (skb=0xffff88800b1f9f00, nlh=0xffff88800b1b7600, extack=0xffffc90000ba7af0) at net/netlink/genetlink.c:792
 #4  0xffffffff81dfffc3 in netlink_rcv_skb (skb=skb@entry=0xffff88800b1f9f00, cb=cb@entry=0xffffffff81e01350 <genl_rcv_msg>)
    at net/netlink/af_netlink.c:2501
 #5  0xffffffff81e00919 in genl_rcv (skb=0xffff88800b1f9f00) at net/netlink/genetlink.c:803
 #6  0xffffffff81dff6ae in netlink_unicast_kernel (ssk=0xffff888010eec800, skb=0xffff88800b1f9f00, sk=0xffff888004aed000)
    at net/netlink/af_netlink.c:1319
 #7  netlink_unicast (ssk=ssk@entry=0xffff888010eec800, skb=skb@entry=0xffff88800b1f9f00, portid=portid@entry=0, nonblock=<optimized out>)
    at net/netlink/af_netlink.c:1345
 #8  0xffffffff81dff9a4 in netlink_sendmsg (sock=<optimized out>, msg=0xffffc90000ba7e48, len=<optimized out>) at net/netlink/af_netlink.c:1921
...
(gdb) p/x ((struct sk_buff *)0xffff88800b1f9f00)->head + ((struct sk_buff *)0xffff88800b1f9f00)->end
$1 = 0xffff88800b1b76c0
(gdb) p/x secret
$2 = 0xffff88800b1b76c0
(gdb) p slen
$3 = 64 '@'

The OOB data can then be read back from userspace by dumping HMAC state. This
commit fixes this by ensuring SECRETLEN cannot exceed the actual length of
SECRET.
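
The kind of check described here can be sketched in user space as follows.
The model_* names, the struct and the parameter list are hypothetical.
"secret_attr_len" stands for the real length of the SECRET attribute payload
and "claimed_len" for the separately supplied SECRETLEN value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_SECRET_MAX 64	/* max secret length, per the text above */

struct model_hmac_info {
	unsigned char secret[MODEL_SECRET_MAX];
	size_t slen;
};

/*
 * Copy the secret only after checking the claimed length against both the
 * destination buffer and the number of bytes that were actually supplied.
 */
static int model_set_hmac(struct model_hmac_info *hinfo,
			  const unsigned char *secret, size_t secret_attr_len,
			  size_t claimed_len)
{
	assert(hinfo != NULL);
	if (claimed_len > MODEL_SECRET_MAX)
		return -EINVAL;
	if (claimed_len > secret_attr_len)
		return -EINVAL;	/* would read past the supplied data */

	memset(hinfo->secret, 0, sizeof(hinfo->secret));
	if (claimed_len > 0)
		memcpy(hinfo->secret, secret, claimed_len);
	hinfo->slen = claimed_len;
	return 0;
}
```

With this check, the combination from the example above (an empty secret
with a claimed length of 64) is rejected with -EINVAL instead of being
copied.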

Reported-by: Lucas Leong <wmliang.tw@gmail.com>
Tested: verified that EINVAL is correctly returned when secretlen > len(secret)
Fixes: 4f4853d ("ipv6: sr: implement API to control SR HMAC structure")
Signed-off-by: David Lebrun <dlebrun@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
UtsavBalar1231 pushed a commit that referenced this issue Oct 9, 2022
commit 1b513f613731e2afc05550e8070d79fac80c661e upstream.

Syzkaller reported BUG_ON as follows:

------------[ cut here ]------------
kernel BUG at fs/ntfs/dir.c:86!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 3 PID: 758 Comm: a.out Not tainted 5.19.0-next-20220808 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:ntfs_lookup_inode_by_name+0xd11/0x2d10
Code: ff e9 b9 01 00 00 e8 1e fe d6 fe 48 8b 7d 98 49 8d 5d 07 e8 91 85 29 ff 48 c7 45 98 00 00 00 00 e9 5a fb ff ff e8 ff fd d6 fe <0f> 0b e8 f8 fd d6 fe 0f 0b e8 f1 fd d6 fe 48 8b b5 50 ff ff ff 4c
RSP: 0018:ffff888079607978 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000008000 RCX: 0000000000000000
RDX: ffff88807cf10000 RSI: ffffffff82a4a081 RDI: 0000000000000003
RBP: ffff888079607a70 R08: 0000000000000001 R09: ffff88807a6d01d7
R10: ffffed100f4da03a R11: 0000000000000000 R12: ffff88800f0fb110
R13: ffff88800f0ee000 R14: ffff88800f0fb000 R15: 0000000000000001
FS:  00007f33b63c7540(0000) GS:ffff888108580000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f33b635c090 CR3: 000000000f39e005 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 load_system_files+0x1f7f/0x3620
 ntfs_fill_super+0xa01/0x1be0
 mount_bdev+0x36a/0x440
 ntfs_mount+0x3a/0x50
 legacy_get_tree+0xfb/0x210
 vfs_get_tree+0x8f/0x2f0
 do_new_mount+0x30a/0x760
 path_mount+0x4de/0x1880
 __x64_sys_mount+0x2b3/0x340
 do_syscall_64+0x38/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f33b62ff9ea
Code: 48 8b 0d a9 f4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 76 f4 0b 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd0c471aa8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f33b62ff9ea
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffd0c471be0
RBP: 00007ffd0c471c60 R08: 00007ffd0c471ae0 R09: 00007ffd0c471c24
R10: 0000000000000000 R11: 0000000000000202 R12: 000055bac5afc160
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---

Fix this by adding a sanity check on the extended system files' directory
inode to ensure that it is a directory, just as ntfs_extend_init() does
when mounting ntfs3.
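
A user-space sketch of such a check is shown below. The struct and function
names are hypothetical and only model the idea of refusing to treat a
non-directory inode as a directory. They are not the ntfs driver's own types.

```c
/* Request the S_IF* constants even under strict ISO C compiler modes. */
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>

/* Hypothetical, minimal stand-in for an inode. */
struct model_inode {
	mode_t mode;
	bool bad;	/* set when the inode failed to load */
};

/* Accept the inode as the extended system files' directory only if it is one. */
static bool model_extend_inode_ok(const struct model_inode *ino)
{
	if (ino == NULL || ino->bad)
		return false;
	return S_ISDIR(ino->mode) != 0;
}
```

A regular-file inode is rejected up front by this check, so a later directory
lookup on it is never attempted.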

Link: https://lkml.kernel.org/r/20220809064730.2316892-1-chenxiaosong2@huawei.com
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mesziman pushed a commit to mesziman/kernel_xiaomi_sm8250 that referenced this issue Nov 14, 2022
commit 1b513f613731e2afc05550e8070d79fac80c661e upstream.

Syzkaller reported BUG_ON as follows:

------------[ cut here ]------------
kernel BUG at fs/ntfs/dir.c:86!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 3 PID: 758 Comm: a.out Not tainted 5.19.0-next-20220808 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:ntfs_lookup_inode_by_name+0xd11/0x2d10
Code: ff e9 b9 01 00 00 e8 1e fe d6 fe 48 8b 7d 98 49 8d 5d 07 e8 91 85 29 ff 48 c7 45 98 00 00 00 00 e9 5a fb ff ff e8 ff fd d6 fe <0f> 0b e8 f8 fd d6 fe 0f 0b e8 f1 fd d6 fe 48 8b b5 50 ff ff ff 4c
RSP: 0018:ffff888079607978 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000008000 RCX: 0000000000000000
RDX: ffff88807cf10000 RSI: ffffffff82a4a081 RDI: 0000000000000003
RBP: ffff888079607a70 R08: 0000000000000001 R09: ffff88807a6d01d7
R10: ffffed100f4da03a R11: 0000000000000000 R12: ffff88800f0fb110
R13: ffff88800f0ee000 R14: ffff88800f0fb000 R15: 0000000000000001
FS:  00007f33b63c7540(0000) GS:ffff888108580000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f33b635c090 CR3: 000000000f39e005 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 load_system_files+0x1f7f/0x3620
 ntfs_fill_super+0xa01/0x1be0
 mount_bdev+0x36a/0x440
 ntfs_mount+0x3a/0x50
 legacy_get_tree+0xfb/0x210
 vfs_get_tree+0x8f/0x2f0
 do_new_mount+0x30a/0x760
 path_mount+0x4de/0x1880
 __x64_sys_mount+0x2b3/0x340
 do_syscall_64+0x38/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f33b62ff9ea
Code: 48 8b 0d a9 f4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 76 f4 0b 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd0c471aa8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f33b62ff9ea
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffd0c471be0
RBP: 00007ffd0c471c60 R08: 00007ffd0c471ae0 R09: 00007ffd0c471c24
R10: 0000000000000000 R11: 0000000000000202 R12: 000055bac5afc160
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---

Fix this by adding sanity check on extended system files' directory inode
to ensure that it is directory, just like ntfs_extend_init() when mounting
ntfs3.

Link: https://lkml.kernel.org/r/20220809064730.2316892-1-chenxiaosong2@huawei.com
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mesziman pushed a commit to mesziman/kernel_xiaomi_sm8250 that referenced this issue Nov 14, 2022
commit 4abc99652812a2ddf932f137515d5c5a04723538 upstream.

Syzkaller managed to trigger concurrent calls to
kernfs_remove_by_name_ns() for the same file resulting in
a KASAN detected use-after-free. The race occurs when the root
node is freed during kernfs_drain().

To prevent this, acquire an additional reference for the root
of the tree that is removed before calling __kernfs_remove().
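
The idea of pinning the root with an extra reference can be modelled in a few lines of user-space C. This is a single-threaded sketch under stated assumptions: all `demo_*` names are hypothetical, the counter is a plain int rather than an atomic, and "freeing" only sets a flag so the effect can be observed.

```c
/* Hypothetical node with a plain (non-atomic) reference count. */
struct demo_node {
	int refcount;
	int freed;	/* set instead of calling free(), so it can be inspected */
};

static void demo_get(struct demo_node *n)
{
	n->refcount++;
}

static void demo_put(struct demo_node *n)
{
	if (--n->refcount == 0)
		n->freed = 1;
}

/* Models a removal that drops the tree's own reference to the root. */
static void demo_remove_subtree(struct demo_node *root)
{
	demo_put(root);
}

/*
 * Take an extra reference before removing, so the root stays valid
 * until this function is done with it. Returns 1 if the root was
 * still alive right after the removal step.
 */
static int demo_remove_pinned(struct demo_node *root)
{
	int alive_after_remove;

	demo_get(root);				/* extra reference */
	demo_remove_subtree(root);		/* drops the tree's reference */
	alive_after_remove = !root->freed;	/* safe: we still hold one */
	demo_put(root);				/* last reference goes away here */
	return alive_after_remove;
}
```

Without the `demo_get()` call, the removal step would drop the last reference and the following read of `root->freed` would correspond to the use-after-free in the report.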

Found by syzkaller with the following reproducer (slab_nomerge is
required):

syz_mount_image$ext4(0x0, &(0x7f0000000100)='./file0\x00', 0x100000, 0x0, 0x0, 0x0, 0x0)
r0 = openat(0xffffffffffffff9c, &(0x7f0000000080)='/proc/self/exe\x00', 0x0, 0x0)
close(r0)
pipe2(&(0x7f0000000140)={0xffffffffffffffff, <r1=>0xffffffffffffffff}, 0x800)
mount$9p_fd(0x0, &(0x7f0000000040)='./file0\x00', &(0x7f00000000c0), 0x408, &(0x7f0000000280)={'trans=fd,', {'rfdno', 0x3d, r0}, 0x2c, {'wfdno', 0x3d, r1}, 0x2c, {[{@cache_loose}, {@mmap}, {@loose}, {@loose}, {@mmap}], [{@mask={'mask', 0x3d, '^MAY_EXEC'}}, {@fsmagic={'fsmagic', 0x3d, 0x10001}}, {@dont_hash}]}})

Sample report:

==================================================================
BUG: KASAN: use-after-free in kernfs_type include/linux/kernfs.h:335 [inline]
BUG: KASAN: use-after-free in kernfs_leftmost_descendant fs/kernfs/dir.c:1261 [inline]
BUG: KASAN: use-after-free in __kernfs_remove.part.0+0x843/0x960 fs/kernfs/dir.c:1369
Read of size 2 at addr ffff8880088807f0 by task syz-executor.2/857

CPU: 0 PID: 857 Comm: syz-executor.2 Not tainted 6.0.0-rc3-00363-g7726d4c3e60b #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x6e/0x91 lib/dump_stack.c:106
 print_address_description mm/kasan/report.c:317 [inline]
 print_report.cold+0x5e/0x5e5 mm/kasan/report.c:433
 kasan_report+0xa3/0x130 mm/kasan/report.c:495
 kernfs_type include/linux/kernfs.h:335 [inline]
 kernfs_leftmost_descendant fs/kernfs/dir.c:1261 [inline]
 __kernfs_remove.part.0+0x843/0x960 fs/kernfs/dir.c:1369
 __kernfs_remove fs/kernfs/dir.c:1356 [inline]
 kernfs_remove_by_name_ns+0x108/0x190 fs/kernfs/dir.c:1589
 sysfs_slab_add+0x133/0x1e0 mm/slub.c:5943
 __kmem_cache_create+0x3e0/0x550 mm/slub.c:4899
 create_cache mm/slab_common.c:229 [inline]
 kmem_cache_create_usercopy+0x167/0x2a0 mm/slab_common.c:335
 p9_client_create+0xd4d/0x1190 net/9p/client.c:993
 v9fs_session_init+0x1e6/0x13c0 fs/9p/v9fs.c:408
 v9fs_mount+0xb9/0xbd0 fs/9p/vfs_super.c:126
 legacy_get_tree+0xf1/0x200 fs/fs_context.c:610
 vfs_get_tree+0x85/0x2e0 fs/super.c:1530
 do_new_mount fs/namespace.c:3040 [inline]
 path_mount+0x675/0x1d00 fs/namespace.c:3370
 do_mount fs/namespace.c:3383 [inline]
 __do_sys_mount fs/namespace.c:3591 [inline]
 __se_sys_mount fs/namespace.c:3568 [inline]
 __x64_sys_mount+0x282/0x300 fs/namespace.c:3568
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f725f983aed
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f725f0f7028 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007f725faa3f80 RCX: 00007f725f983aed
RDX: 00000000200000c0 RSI: 0000000020000040 RDI: 0000000000000000
RBP: 00007f725f9f419c R08: 0000000020000280 R09: 0000000000000000
R10: 0000000000000408 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000006 R14: 00007f725faa3f80 R15: 00007f725f0d7000
 </TASK>

Allocated by task 855:
 kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
 kasan_set_track mm/kasan/common.c:45 [inline]
 set_alloc_info mm/kasan/common.c:437 [inline]
 __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:470
 kasan_slab_alloc include/linux/kasan.h:224 [inline]
 slab_post_alloc_hook mm/slab.h:727 [inline]
 slab_alloc_node mm/slub.c:3243 [inline]
 slab_alloc mm/slub.c:3251 [inline]
 __kmem_cache_alloc_lru mm/slub.c:3258 [inline]
 kmem_cache_alloc+0xbf/0x200 mm/slub.c:3268
 kmem_cache_zalloc include/linux/slab.h:723 [inline]
 __kernfs_new_node+0xd4/0x680 fs/kernfs/dir.c:593
 kernfs_new_node fs/kernfs/dir.c:655 [inline]
 kernfs_create_dir_ns+0x9c/0x220 fs/kernfs/dir.c:1010
 sysfs_create_dir_ns+0x127/0x290 fs/sysfs/dir.c:59
 create_dir lib/kobject.c:63 [inline]
 kobject_add_internal+0x24a/0x8d0 lib/kobject.c:223
 kobject_add_varg lib/kobject.c:358 [inline]
 kobject_init_and_add+0x101/0x160 lib/kobject.c:441
 sysfs_slab_add+0x156/0x1e0 mm/slub.c:5954
 __kmem_cache_create+0x3e0/0x550 mm/slub.c:4899
 create_cache mm/slab_common.c:229 [inline]
 kmem_cache_create_usercopy+0x167/0x2a0 mm/slab_common.c:335
 p9_client_create+0xd4d/0x1190 net/9p/client.c:993
 v9fs_session_init+0x1e6/0x13c0 fs/9p/v9fs.c:408
 v9fs_mount+0xb9/0xbd0 fs/9p/vfs_super.c:126
 legacy_get_tree+0xf1/0x200 fs/fs_context.c:610
 vfs_get_tree+0x85/0x2e0 fs/super.c:1530
 do_new_mount fs/namespace.c:3040 [inline]
 path_mount+0x675/0x1d00 fs/namespace.c:3370
 do_mount fs/namespace.c:3383 [inline]
 __do_sys_mount fs/namespace.c:3591 [inline]
 __se_sys_mount fs/namespace.c:3568 [inline]
 __x64_sys_mount+0x282/0x300 fs/namespace.c:3568
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Freed by task 857:
 kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
 kasan_set_track+0x21/0x30 mm/kasan/common.c:45
 kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:370
 ____kasan_slab_free mm/kasan/common.c:367 [inline]
 ____kasan_slab_free mm/kasan/common.c:329 [inline]
 __kasan_slab_free+0x108/0x190 mm/kasan/common.c:375
 kasan_slab_free include/linux/kasan.h:200 [inline]
 slab_free_hook mm/slub.c:1754 [inline]
 slab_free_freelist_hook mm/slub.c:1780 [inline]
 slab_free mm/slub.c:3534 [inline]
 kmem_cache_free+0x9c/0x340 mm/slub.c:3551
 kernfs_put.part.0+0x2b2/0x520 fs/kernfs/dir.c:547
 kernfs_put+0x42/0x50 fs/kernfs/dir.c:521
 __kernfs_remove.part.0+0x72d/0x960 fs/kernfs/dir.c:1407
 __kernfs_remove fs/kernfs/dir.c:1356 [inline]
 kernfs_remove_by_name_ns+0x108/0x190 fs/kernfs/dir.c:1589
 sysfs_slab_add+0x133/0x1e0 mm/slub.c:5943
 __kmem_cache_create+0x3e0/0x550 mm/slub.c:4899
 create_cache mm/slab_common.c:229 [inline]
 kmem_cache_create_usercopy+0x167/0x2a0 mm/slab_common.c:335
 p9_client_create+0xd4d/0x1190 net/9p/client.c:993
 v9fs_session_init+0x1e6/0x13c0 fs/9p/v9fs.c:408
 v9fs_mount+0xb9/0xbd0 fs/9p/vfs_super.c:126
 legacy_get_tree+0xf1/0x200 fs/fs_context.c:610
 vfs_get_tree+0x85/0x2e0 fs/super.c:1530
 do_new_mount fs/namespace.c:3040 [inline]
 path_mount+0x675/0x1d00 fs/namespace.c:3370
 do_mount fs/namespace.c:3383 [inline]
 __do_sys_mount fs/namespace.c:3591 [inline]
 __se_sys_mount fs/namespace.c:3568 [inline]
 __x64_sys_mount+0x282/0x300 fs/namespace.c:3568
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

The buggy address belongs to the object at ffff888008880780
 which belongs to the cache kernfs_node_cache of size 128
The buggy address is located 112 bytes inside of
 128-byte region [ffff888008880780, ffff888008880800)

The buggy address belongs to the physical page:
page:00000000732833f8 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8880
flags: 0x100000000000200(slab|node=0|zone=1)
raw: 0100000000000200 0000000000000000 dead000000000122 ffff888001147280
raw: 0000000000000000 0000000000150015 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888008880680: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
 ffff888008880700: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff888008880780: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                             ^
 ffff888008880800: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
 ffff888008880880: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
==================================================================

Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable <stable@kernel.org> # -rc3
Signed-off-by: Christian A. Ehrhardt <lk@c--e.de>
Link: https://lore.kernel.org/r/20220913121723.691454-1-lk@c--e.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
UtsavBalar1231 pushed a commit that referenced this issue Dec 13, 2022
commit 2b1299322016731d56807aa49254a5ea3080b6b3 upstream.

tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as
documented for RET instructions after VM exits. Mitigate it with a new
one-entry RSB stuffing mechanism and a new LFENCE.

== Background ==

Indirect Branch Restricted Speculation (IBRS) was designed to help
mitigate Branch Target Injection and Speculative Store Bypass, i.e.
Spectre, attacks. IBRS prevents software run in less privileged modes
from affecting branch prediction in more privileged modes. IBRS requires
the MSR to be written on every privilege level change.

To overcome some of the performance issues of IBRS, Enhanced IBRS was
introduced.  eIBRS is an "always on" IBRS, in other words, just turn
it on once instead of writing the MSR on every privilege level change.
When eIBRS is enabled, more privileged modes should be protected from
less privileged modes, including protecting VMMs from guests.

== Problem ==

Here's a simplification of how guests are run on Linux' KVM:

void run_kvm_guest(void)
{
	// Prepare to run guest
	VMRESUME();
	// Clean up after guest runs
}

The execution flow for that would look something like this to the
processor:

1. Host-side: call run_kvm_guest()
2. Host-side: VMRESUME
3. Guest runs, does "CALL guest_function"
4. VM exit, host runs again
5. Host might make some "cleanup" function calls
6. Host-side: RET from run_kvm_guest()

Now, when back on the host, there are a couple of possible scenarios of
post-guest activity the host needs to do before executing host code:

* on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not
touched and Linux has to do a 32-entry stuffing.

* on eIBRS hardware, VM exit with IBRS enabled, or restoring the host
IBRS=1 shortly after VM exit, has a documented side effect of flushing
the RSB except in this PBRSB situation where the software needs to stuff
the last RSB entry "by hand".

IOW, with eIBRS supported, host RET instructions should no longer be
influenced by guest behavior after the host retires a single CALL
instruction.

However, if the RET instructions are "unbalanced" with CALLs after a VM
exit as is the RET in #6, it might speculatively use the address for the
instruction after the CALL in #3 as an RSB prediction. This is a problem
since the (untrusted) guest controls this address.

Balanced CALL/RET instruction pairs such as in step #5 are not affected.

== Solution ==

The PBRSB issue affects a wide variety of Intel processors which
support eIBRS. But not all of them need mitigation. Today,
X86_FEATURE_RSB_VMEXIT triggers an RSB filling sequence that mitigates
PBRSB. Systems setting RSB_VMEXIT need no further mitigation - i.e.,
eIBRS systems which enable legacy IBRS explicitly.

However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RSB_VMEXIT
and most of them need a new mitigation.

Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE
which triggers a lighter-weight PBRSB mitigation versus RSB_VMEXIT.

The lighter-weight mitigation performs a CALL instruction which is
immediately followed by a speculative execution barrier (INT3). This
steers speculative execution to the barrier -- just like a retpoline
-- which ensures that speculation can never reach an unbalanced RET.
Then, ensure this CALL is retired before continuing execution with an
LFENCE.

In other words, the window of exposure is opened at VM exit where RET
behavior is troublesome. While the window is open, force RSB predictions
sampling for RET targets to a dead end at the INT3. Close the window
with the LFENCE.

There is a subset of eIBRS systems which are not vulnerable to PBRSB.
Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB.
Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO.

  [ bp: Massage, incorporate review comments from Andrew Cooper. ]

Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
[ bp: Adjust patch to account for kvm entry being in c ]
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
UtsavBalar1231 pushed a commit that referenced this issue Dec 13, 2022
commit 9b2f20344d450137d015b380ff0c2e2a6a170135 upstream.

btrfs_alloc_dummy_root() uses ERR_PTR rather than NULL as its error
return value, so if an error happens there will be a NULL pointer
dereference:

  BUG: KASAN: null-ptr-deref in btrfs_free_dummy_root+0x21/0x50 [btrfs]
  Read of size 8 at addr 000000000000002c by task insmod/258926

  CPU: 2 PID: 258926 Comm: insmod Tainted: G        W          6.1.0-rc2+ #5
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x34/0x44
   kasan_report+0xb7/0x140
   kasan_check_range+0x145/0x1a0
   btrfs_free_dummy_root+0x21/0x50 [btrfs]
   btrfs_test_free_space_cache+0x1a8c/0x1add [btrfs]
   btrfs_run_sanity_tests+0x65/0x80 [btrfs]
   init_btrfs_fs+0xec/0x154 [btrfs]
   do_one_initcall+0x87/0x2a0
   do_init_module+0xdf/0x320
   load_module+0x3006/0x3390
   __do_sys_finit_module+0x113/0x1b0
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
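
For readers unfamiliar with the ERR_PTR convention, the following user-space sketch shows why a plain NULL check misses an error pointer. The `demo_*` helpers are hypothetical re-implementations written for this example, not the kernel's own macros, and the guard in `demo_free_root()` is one possible way to handle both cases, not necessarily the exact change made by the patch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Decode the errno value back out of an error pointer. */
static inline long demo_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if the pointer lies in the top DEMO_MAX_ERRNO addresses. */
static inline int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_root {
	int id;
};

/* Allocator that reports failure with an error pointer, never NULL. */
static struct demo_root *demo_alloc_root(int fail)
{
	struct demo_root *r;

	if (fail)
		return demo_err_ptr(-ENOMEM);
	r = calloc(1, sizeof(*r));
	return r ? r : demo_err_ptr(-ENOMEM);
}

/* Free routine that tolerates both NULL and error pointers. */
static void demo_free_root(struct demo_root *r)
{
	if (r == NULL || demo_is_err(r))
		return;
	free(r);
}
```

An error pointer is non-NULL, so a free routine that only tests `!r` would go on to dereference it.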

Fixes: aaedb55 ("Btrfs: add tests for btrfs_get_extent")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mesziman pushed a commit to mesziman/kernel_xiaomi_sm8250 that referenced this issue Jan 19, 2023
…g the sock

[ Upstream commit 3cf7203ca620682165706f70a1b12b5194607dce ]

There is a race condition in vxlan: when a vxlan device is deleted
while packets are being received, the sock may be released after
vxlan_sock vs has been fetched from sk_user_data. Then in the later
vxlan_ecn_decapsulate(), vxlan_get_sk_family() we will get a
NULL pointer dereference, e.g.:

   #0 [ffffa25ec6978a38] machine_kexec at ffffffff8c669757
   #1 [ffffa25ec6978a90] __crash_kexec at ffffffff8c7c0a4d
   #2 [ffffa25ec6978b58] crash_kexec at ffffffff8c7c1c48
   #3 [ffffa25ec6978b60] oops_end at ffffffff8c627f2b
   #4 [ffffa25ec6978b80] page_fault_oops at ffffffff8c678fcb
   #5 [ffffa25ec6978bd8] exc_page_fault at ffffffff8d109542
   #6 [ffffa25ec6978c00] asm_exc_page_fault at ffffffff8d200b62
      [exception RIP: vxlan_ecn_decapsulate+0x3b]
      RIP: ffffffffc1014e7b  RSP: ffffa25ec6978cb0  RFLAGS: 00010246
      RAX: 0000000000000008  RBX: ffff8aa000888000  RCX: 0000000000000000
      RDX: 000000000000000e  RSI: ffff8a9fc7ab803e  RDI: ffff8a9fd1168700
      RBP: ffff8a9fc7ab803e   R8: 0000000000700000   R9: 00000000000010ae
      R10: ffff8a9fcb748980  R11: 0000000000000000  R12: ffff8a9fd1168700
      R13: ffff8aa000888000  R14: 00000000002a0000  R15: 00000000000010ae
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffffa25ec6978ce8] vxlan_rcv at ffffffffc10189cd [vxlan]
   #8 [ffffa25ec6978d90] udp_queue_rcv_one_skb at ffffffff8cfb6507
   #9 [ffffa25ec6978dc0] udp_unicast_rcv_skb at ffffffff8cfb6e45
  #10 [ffffa25ec6978dc8] __udp4_lib_rcv at ffffffff8cfb8807
  #11 [ffffa25ec6978e20] ip_protocol_deliver_rcu at ffffffff8cf76951
  #12 [ffffa25ec6978e48] ip_local_deliver at ffffffff8cf76bde
  #13 [ffffa25ec6978ea0] __netif_receive_skb_one_core at ffffffff8cecde9b
  #14 [ffffa25ec6978ec8] process_backlog at ffffffff8cece139
  #15 [ffffa25ec6978f00] __napi_poll at ffffffff8ceced1a
  #16 [ffffa25ec6978f28] net_rx_action at ffffffff8cecf1f3
  #17 [ffffa25ec6978fa0] __softirqentry_text_start at ffffffff8d4000ca
  #18 [ffffa25ec6978ff0] do_softirq at ffffffff8c6fbdc3

Reproducer: https://github.com/Mellanox/ovs-tests/blob/master/test-ovs-vxlan-remove-tunnel-during-traffic.sh

Fix this by waiting for all sk_user_data readers to finish before
releasing the sock.

Reported-by: Jianlin Shi <jishi@redhat.com>
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Fixes: 6a93cc9 ("udp-tunnel: Add a few more UDP tunnel APIs")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
mesziman pushed a commit to mesziman/kernel_xiaomi_sm8250 that referenced this issue Jan 19, 2023
commit 341097ee53573e06ab9fc675d96a052385b851fa upstream.

There's a crash in mempool_free when running the lvm test
shell/lvchange-rebuild-raid.sh.

The reason for the crash is this:
* super_written calls atomic_dec_and_test(&mddev->pending_writes) and
  wake_up(&mddev->sb_wait). Then it calls rdev_dec_pending(rdev, mddev)
  and bio_put(bio).
* so, the process that waited on sb_wait and that is woken up is racing
  with bio_put(bio).
* if the process wins the race, it calls bioset_exit before bio_put(bio)
  is executed.
* bio_put(bio) attempts to free a bio into a destroyed bio set - causing
  a crash in mempool_free.

We fix this bug by moving bio_put before atomic_dec_and_test.

We also move rdev_dec_pending before atomic_dec_and_test as suggested by
Neil Brown.

The function md_end_flush has a similar bug - we must call bio_put before
we decrement the number of in-progress bios.
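
The ordering argument can be modelled deterministically in user space. The sketch below is an assumption-laden illustration: every `demo_*` name is hypothetical, and the "waiter" is simulated by a function that runs immediately when the counter reaches zero, which stands in for the worst-case interleaving in which the woken process wins the race.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical pool that records frees made after it was destroyed. */
struct demo_pool {
	bool destroyed;
	int frees;	/* frees into a live pool */
	int bad_frees;	/* frees into a destroyed pool (the crash case) */
};

struct demo_dev {
	atomic_int pending_writes;
	struct demo_pool pool;
};

static void demo_pool_put(struct demo_pool *p)
{
	if (p->destroyed)
		p->bad_frees++;
	else
		p->frees++;
}

/* Worst case: the woken waiter tears the pool down right away. */
static void demo_wake_waiter(struct demo_dev *d)
{
	d->pool.destroyed = true;
}

/* Fixed order: release the object first, then drop the counter. */
static void demo_end_write_fixed(struct demo_dev *d)
{
	demo_pool_put(&d->pool);
	if (atomic_fetch_sub(&d->pending_writes, 1) == 1)
		demo_wake_waiter(d);
}

/* Buggy order: the counter is dropped before the object is released. */
static void demo_end_write_racy(struct demo_dev *d)
{
	if (atomic_fetch_sub(&d->pending_writes, 1) == 1)
		demo_wake_waiter(d);
	demo_pool_put(&d->pool);
}
```

In the fixed variant nothing touches the pool after the decrement, so the waiter is free to destroy it as soon as it is woken.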

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 11557f0067 P4D 11557f0067 PUD 0
 Oops: 0002 [#1] PREEMPT SMP
 CPU: 0 PID: 73 Comm: kworker/0:1 Not tainted 6.1.0-rc3 #5
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
 Workqueue: kdelayd flush_expired_bios [dm_delay]
 RIP: 0010:mempool_free+0x47/0x80
 Code: 48 89 ef 5b 5d ff e0 f3 c3 48 89 f7 e8 32 45 3f 00 48 63 53 08 48 89 c6 3b 53 04 7d 2d 48 8b 43 10 8d 4a 01 48 89 df 89 4b 08 <48> 89 2c d0 e8 b0 45 3f 00 48 8d 7b 30 5b 5d 31 c9 ba 01 00 00 00
 RSP: 0018:ffff88910036bda8 EFLAGS: 00010093
 RAX: 0000000000000000 RBX: ffff8891037b65d8 RCX: 0000000000000001
 RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8891037b65d8
 RBP: ffff8891447ba240 R08: 0000000000012908 R09: 00000000003d0900
 R10: 0000000000000000 R11: 0000000000173544 R12: ffff889101a14000
 R13: ffff8891562ac300 R14: ffff889102b41440 R15: ffffe8ffffa00d05
 FS:  0000000000000000(0000) GS:ffff88942fa00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 0000001102e99000 CR4: 00000000000006b0
 Call Trace:
  <TASK>
  clone_endio+0xf4/0x1c0 [dm_mod]
  clone_endio+0xf4/0x1c0 [dm_mod]
  __submit_bio+0x76/0x120
  submit_bio_noacct_nocheck+0xb6/0x2a0
  flush_expired_bios+0x28/0x2f [dm_delay]
  process_one_work+0x1b4/0x300
  worker_thread+0x45/0x3e0
  ? rescuer_thread+0x380/0x380
  kthread+0xc2/0x100
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>
 Modules linked in: brd dm_delay dm_raid dm_mod af_packet uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb font fbdev tun autofs4 binfmt_misc configfs ipv6 virtio_rng virtio_balloon rng_core virtio_net pcspkr net_failover failover qemu_fw_cfg button mousedev raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod sd_mod t10_pi crc64_rocksoft crc64 virtio_scsi scsi_mod evdev psmouse bsg scsi_common [last unloaded: brd]
 CR2: 0000000000000000
 ---[ end trace 0000000000000000 ]---

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mesziman pushed a commit to mesziman/kernel_xiaomi_sm8250 that referenced this issue Jan 19, 2023
[ Upstream commit b18cba09e374637a0a3759d856a6bca94c133952 ]

Commit 9130b8d ("SUNRPC: allow for upcalls for the same uid
but different gss service") introduced `auth` argument to
__gss_find_upcall(), but in gss_pipe_downcall() it was left as NULL
since it (and auth->service) was not (yet) determined.

When multiple upcalls with the same uid and different service are
ongoing, it could happen that __gss_find_upcall(), which returns the
first match found in the pipe->in_downcall list, could not find the
correct gss_msg corresponding to the downcall we are looking for.
Moreover, it might return a msg which is not sent to rpc.gssd yet.

We could see a mount.nfs process hung in D state when multiple mount.nfs
processes are executed in parallel.  The call trace below is from the
CentOS 7.9 kernel-3.10.0-1160.24.1.el7.x86_64, but we observed the same
hang with elrepo kernel-ml-6.0.7-1.el7.

PID: 71258  TASK: ffff91ebd4be0000  CPU: 36  COMMAND: "mount.nfs"
 #0 [ffff9203ca3234f8] __schedule at ffffffffa3b8899f
 #1 [ffff9203ca323580] schedule at ffffffffa3b88eb9
 #2 [ffff9203ca323590] gss_cred_init at ffffffffc0355818 [auth_rpcgss]
 #3 [ffff9203ca323658] rpcauth_lookup_credcache at ffffffffc0421ebc [sunrpc]
 #4 [ffff9203ca3236d8] gss_lookup_cred at ffffffffc0353633 [auth_rpcgss]
 #5 [ffff9203ca3236e8] rpcauth_lookupcred at ffffffffc0421581 [sunrpc]
 #6 [ffff9203ca323740] rpcauth_refreshcred at ffffffffc04223d3 [sunrpc]
 #7 [ffff9203ca3237a0] call_refresh at ffffffffc04103dc [sunrpc]
 #8 [ffff9203ca3237b8] __rpc_execute at ffffffffc041e1c9 [sunrpc]
 #9 [ffff9203ca323820] rpc_execute at ffffffffc0420a48 [sunrpc]

The scenario is like this. Let's say there are two upcalls for
services A and B, A -> B in pipe->in_downcall, B -> A in pipe->pipe.

When rpc.gssd reads pipe to get the upcall msg corresponding to
service B from pipe->pipe and then writes the response, in
gss_pipe_downcall the msg corresponding to service A will be picked
because only uid is used to find the msg and it is before the one for
B in pipe->in_downcall.  And the process waiting for the msg
corresponding to service A will be woken up.

Actual scheduling of that process might be after rpc.gssd processes the
next msg.  In rpc_pipe_generic_upcall it clears msg->errno (for A).
The process is scheduled to see gss_msg->ctx == NULL and
gss_msg->msg.errno == 0, therefore it cannot break the loop in
gss_create_upcall and is never woken up after that.

This patch adds a simple check to ensure that a msg which is not
sent to rpc.gssd yet is not chosen as the matching upcall upon
receiving a downcall.
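
A minimal user-space model of such a lookup is sketched below. It assumes a singly linked list and a hypothetical `sent` flag that marks messages already handed to the daemon; the kernel uses its own list type and its own way of telling whether a message is in flight, so the names here are illustrative only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pending upcall message. */
struct demo_upcall {
	unsigned int uid;
	int service;
	bool sent;		/* true once the daemon has read the message */
	struct demo_upcall *next;
};

/*
 * Return the first message for this uid that has already been sent
 * to the daemon. Messages that are still queued are skipped, because
 * a downcall cannot be the answer to something the daemon never saw.
 */
static struct demo_upcall *demo_find_upcall(struct demo_upcall *head,
					    unsigned int uid)
{
	struct demo_upcall *pos;

	for (pos = head; pos != NULL; pos = pos->next) {
		if (pos->uid != uid)
			continue;
		if (!pos->sent)
			continue;
		return pos;
	}
	return NULL;
}
```

With the A -> B ordering from the scenario above and only B sent, the lookup returns B even though A comes first in the list.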

Signed-off-by: minoura makoto <minoura@valinux.co.jp>
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@nec.com>
Tested-by: Hiroshi Shimamoto <h-shimamoto@nec.com>
Cc: Trond Myklebust <trondmy@hammerspace.com>
Fixes: 9130b8d ("SUNRPC: allow for upcalls for same uid but different gss service")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
markakash pushed a commit to pe-markakash/kernel_xiaomi_sm8250 that referenced this issue Feb 15, 2023
[ Upstream commit 6c4ca03bd890566d873e3593b32d034bf2f5a087 ]

During EEH error injection testing, a deadlock was encountered in the tg3
driver when tg3_io_error_detected() was attempting to cancel outstanding
reset tasks:

crash> foreach UN bt
...
PID: 159    TASK: c0000000067c6000  CPU: 8   COMMAND: "eehd"
...
 #5 [c00000000681f990] __cancel_work_timer at c00000000019fd18
 #6 [c00000000681fa30] tg3_io_error_detected at c00800000295f098 [tg3]
 #7 [c00000000681faf0] eeh_report_error at c00000000004e25c
...

PID: 290    TASK: c000000036e5f800  CPU: 6   COMMAND: "kworker/6:1"
...
 #4 [c00000003721fbc0] rtnl_lock at c000000000c940d8
 #5 [c00000003721fbe0] tg3_reset_task at c008000002969358 [tg3]
 #6 [c00000003721fc60] process_one_work at c00000000019e5c4
...

PID: 296    TASK: c000000037a65800  CPU: 21  COMMAND: "kworker/21:1"
...
 #4 [c000000037247bc0] rtnl_lock at c000000000c940d8
 #5 [c000000037247be0] tg3_reset_task at c008000002969358 [tg3]
 #6 [c000000037247c60] process_one_work at c00000000019e5c4
...

PID: 655    TASK: c000000036f49000  CPU: 16  COMMAND: "kworker/16:2"
...
 #4 [c0000000373ebbc0] rtnl_lock at c000000000c940d8
 #5 [c0000000373ebbe0] tg3_reset_task at c008000002969358 [tg3]
 #6 [c0000000373ebc60] process_one_work at c00000000019e5c4
...

Code inspection shows that both tg3_io_error_detected() and
tg3_reset_task() attempt to acquire the RTNL lock at the beginning of
their code blocks.  If tg3_reset_task() should happen to execute between
the times when tg3_io_error_detected() acquires the RTNL lock and
tg3_reset_task_cancel() is called, a deadlock will occur.

Moving tg3_reset_task_cancel() call earlier within the code block, prior
to acquiring RTNL, prevents this from happening, but also exposes another
deadlock issue where tg3_reset_task() may execute AFTER
tg3_io_error_detected() has executed:

crash> foreach UN bt
PID: 159    TASK: c0000000067d2000  CPU: 9   COMMAND: "eehd"
...
 #4 [c000000006867a60] rtnl_lock at c000000000c940d8
 #5 [c000000006867a80] tg3_io_slot_reset at c0080000026c2ea8 [tg3]
 #6 [c000000006867b00] eeh_report_reset at c00000000004de88
...
PID: 363    TASK: c000000037564000  CPU: 6   COMMAND: "kworker/6:1"
...
 #3 [c000000036c1bb70] msleep at c000000000259e6c
 #4 [c000000036c1bba0] napi_disable at c000000000c6b848
 #5 [c000000036c1bbe0] tg3_reset_task at c0080000026d942c [tg3]
 #6 [c000000036c1bc60] process_one_work at c00000000019e5c4
...

This issue can be avoided by aborting tg3_reset_task() if EEH error
recovery is already in progress.
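
The early-abort idea can be sketched as below. This is a simplified user-space model with hypothetical names (`struct demo_nic`, `recovery_pending`, `demo_reset_task()`); the real driver performs the check under the RTNL lock and against its own state, which this sketch does not reproduce.

```c
#include <stdbool.h>

/* Hypothetical device state relevant to the reset work item. */
struct demo_nic {
	bool recovery_pending;	/* error recovery owns the device */
	bool reset_pending;	/* a reset work item has been queued */
	int resets_done;
};

/*
 * Body of the reset work item. If error recovery is already in
 * progress, give up immediately instead of touching the device, so
 * the work item never waits on something the recovery path holds.
 * Returns true if a reset was actually performed.
 */
static bool demo_reset_task(struct demo_nic *nic)
{
	if (nic->recovery_pending) {
		nic->reset_pending = false;
		return false;
	}

	nic->resets_done++;
	nic->reset_pending = false;
	return true;
}
```

The recovery path is then responsible for bringing the device back up, so skipping the reset here loses nothing.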

Fixes: db84bf4 ("tg3: tg3_reset_task() needs to use rtnl_lock to synchronize")
Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://lore.kernel.org/r/20230124185339.225806-1-drc@linux.vnet.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
akash07k pushed a commit to crdroidandroid/android_kernel_xiaomi_sm8250-munch that referenced this issue Mar 31, 2023
commit 60eed1e3d45045623e46944ebc7c42c30a4350f0 upstream.

code path:

ocfs2_ioctl_move_extents
 ocfs2_move_extents
  ocfs2_defrag_extent
   __ocfs2_move_extent
    + ocfs2_journal_access_di
    + ocfs2_split_extent  //sub-paths call jbd2_journal_restart
    + ocfs2_journal_dirty //crash by jbd2 ASSERT

crash stacks:

PID: 11297  TASK: ffff974a676dcd00  CPU: 67  COMMAND: "defragfs.ocfs2"
 #0 [ffffb25d8dad3900] machine_kexec at ffffffff8386fe01
 #1 [ffffb25d8dad3958] __crash_kexec at ffffffff8395959d
 #2 [ffffb25d8dad3a20] crash_kexec at ffffffff8395a45d
 #3 [ffffb25d8dad3a38] oops_end at ffffffff83836d3f
 #4 [ffffb25d8dad3a58] do_trap at ffffffff83833205
 #5 [ffffb25d8dad3aa0] do_invalid_op at ffffffff83833aa6
 #6 [ffffb25d8dad3ac0] invalid_op at ffffffff84200d18
    [exception RIP: jbd2_journal_dirty_metadata+0x2ba]
    RIP: ffffffffc09ca54a  RSP: ffffb25d8dad3b70  RFLAGS: 00010207
    RAX: 0000000000000000  RBX: ffff9706eedc5248  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: ffff97337029ea28  RDI: ffff9706eedc5250
    RBP: ffff9703c3520200   R8: 000000000f46b0b2   R9: 0000000000000000
    R10: 0000000000000001  R11: 00000001000000fe  R12: ffff97337029ea28
    R13: 0000000000000000  R14: ffff9703de59bf60  R15: ffff9706eedc5250
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffffb25d8dad3ba8] ocfs2_journal_dirty at ffffffffc137fb95 [ocfs2]
 #8 [ffffb25d8dad3be8] __ocfs2_move_extent at ffffffffc139a950 [ocfs2]
 #9 [ffffb25d8dad3c80] ocfs2_defrag_extent at ffffffffc139b2d2 [ocfs2]

Analysis

This bug has the same root cause as 'commit 7f27ec9 ("ocfs2: call
ocfs2_journal_access_di() before ocfs2_journal_dirty() in
ocfs2_write_end_nolock()")'.  For this bug, jbd2_journal_restart() is
called by ocfs2_split_extent() during defragmenting.

How to fix

ocfs2_split_extent() can handle journal operations entirely by itself.
The caller doesn't need to call the journal access/dirty pair and only
needs to call the journal start/stop pair.  The fix is to remove
journal access/dirty from __ocfs2_move_extent().

The discussion for this patch:
https://oss.oracle.com/pipermail/ocfs2-devel/2023-February/000647.html

Link: https://lkml.kernel.org/r/20230217003717.32469-1-heming.zhao@suse.com
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
macka69 pushed a commit to macka69/kernel_xiaomi_sm8250-1 that referenced this issue Jun 4, 2023
[ Upstream commit 05bb0167c80b8f93c6a4e0451b7da9b96db990c2 ]

ACPICA commit 770653e3ba67c30a629ca7d12e352d83c2541b1e

Before this change we see the following UBSAN stack trace in Fuchsia:

  #0    0x000021e4213b3302 in acpi_ds_init_aml_walk(struct acpi_walk_state*, union acpi_parse_object*, struct acpi_namespace_node*, u8*, u32, struct acpi_evaluate_info*, u8) ../../third_party/acpica/source/components/dispatcher/dswstate.c:682 <platform-bus-x86.so>+0x233302
  #1.2  0x000020d0f660777f in ubsan_get_stack_trace() compiler-rt/lib/ubsan/ubsan_diag.cpp:41 <libclang_rt.asan.so>+0x3d77f
  #1.1  0x000020d0f660777f in maybe_print_stack_trace() compiler-rt/lib/ubsan/ubsan_diag.cpp:51 <libclang_rt.asan.so>+0x3d77f
  #1    0x000020d0f660777f in ~scoped_report() compiler-rt/lib/ubsan/ubsan_diag.cpp:387 <libclang_rt.asan.so>+0x3d77f
  #2    0x000020d0f660b96d in handlepointer_overflow_impl() compiler-rt/lib/ubsan/ubsan_handlers.cpp:809 <libclang_rt.asan.so>+0x4196d
  #3    0x000020d0f660b50d in compiler-rt/lib/ubsan/ubsan_handlers.cpp:815 <libclang_rt.asan.so>+0x4150d
  #4    0x000021e4213b3302 in acpi_ds_init_aml_walk(struct acpi_walk_state*, union acpi_parse_object*, struct acpi_namespace_node*, u8*, u32, struct acpi_evaluate_info*, u8) ../../third_party/acpica/source/components/dispatcher/dswstate.c:682 <platform-bus-x86.so>+0x233302
  #5    0x000021e4213e2369 in acpi_ds_call_control_method(struct acpi_thread_state*, struct acpi_walk_state*, union acpi_parse_object*) ../../third_party/acpica/source/components/dispatcher/dsmethod.c:605 <platform-bus-x86.so>+0x262369
  #6    0x000021e421437fac in acpi_ps_parse_aml(struct acpi_walk_state*) ../../third_party/acpica/source/components/parser/psparse.c:550 <platform-bus-x86.so>+0x2b7fac
  #7    0x000021e4214464d2 in acpi_ps_execute_method(struct acpi_evaluate_info*) ../../third_party/acpica/source/components/parser/psxface.c:244 <platform-bus-x86.so>+0x2c64d2
  #8    0x000021e4213aa052 in acpi_ns_evaluate(struct acpi_evaluate_info*) ../../third_party/acpica/source/components/namespace/nseval.c:250 <platform-bus-x86.so>+0x22a052
  #9    0x000021e421413dd8 in acpi_ns_init_one_device(acpi_handle, u32, void*, void**) ../../third_party/acpica/source/components/namespace/nsinit.c:735 <platform-bus-x86.so>+0x293dd8
  #10   0x000021e421429e98 in acpi_ns_walk_namespace(acpi_object_type, acpi_handle, u32, u32, acpi_walk_callback, acpi_walk_callback, void*, void**) ../../third_party/acpica/source/components/namespace/nswalk.c:298 <platform-bus-x86.so>+0x2a9e98
  #11   0x000021e4214131ac in acpi_ns_initialize_devices(u32) ../../third_party/acpica/source/components/namespace/nsinit.c:268 <platform-bus-x86.so>+0x2931ac
  #12   0x000021e42147c40d in acpi_initialize_objects(u32) ../../third_party/acpica/source/components/utilities/utxfinit.c:304 <platform-bus-x86.so>+0x2fc40d
  #13   0x000021e42126d603 in acpi::acpi_impl::initialize_acpi(acpi::acpi_impl*) ../../src/devices/board/lib/acpi/acpi-impl.cc:224 <platform-bus-x86.so>+0xed603

Add a simple check that avoids incrementing a pointer by zero, but
otherwise behaves as before. Note that our findings are against ACPICA
20221020, but the same code exists on master.
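
The kind of check described above can be sketched in plain C as follows. This is a minimal illustration, not the ACPICA code: the struct and function names are hypothetical, and it assumes the sanitizer fires because the buffer pointer may be NULL when the length is zero, in which case "NULL + 0" is undefined behavior in C.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a parser state; not the ACPICA structure. */
struct parser_state_sketch {
	const uint8_t *aml_start;	/* first byte of the bytecode buffer */
	const uint8_t *aml_end;		/* one past the last byte */
};

/*
 * Set up the [start, end) window over a bytecode buffer.
 *
 * Assumption: a zero-length buffer may be passed as NULL. Pointer
 * arithmetic on NULL is undefined even when the offset is zero, so the
 * addition is skipped in that case and end simply equals start.
 * For non-zero lengths the result is the same as an unconditional add.
 */
static void init_window(struct parser_state_sketch *ps,
			const uint8_t *aml_start, uint32_t aml_length)
{
	ps->aml_start = aml_start;
	ps->aml_end = aml_start;

	if (aml_length != 0)
		ps->aml_end += aml_length;
}
```

With this guard, calling init_window(&ps, NULL, 0) leaves both pointers NULL without performing any arithmetic on them, while a real buffer of length n still yields aml_end == aml_start + n.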

Link: acpica/acpica@770653e3
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
mesziman pushed a commit to mesziman/kernel_xiaomi_sm8250 that referenced this issue Sep 23, 2023
[ Upstream commit ade32bd8a738d7497ffe9743c46728db26740f78 ]

unix_tot_inflight is changed under spin_lock(unix_gc_lock), but
unix_release_sock() reads it locklessly.

Let's use READ_ONCE() for unix_tot_inflight.

Note that the writer side was marked by commit 9d6d7f1cb67c ("af_unix:
annote lockless accesses to unix_tot_inflight & gc_in_progress")
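
The pattern described above (writer under a lock, lockless reader using a marked load) can be sketched in user space as follows. This is only an illustration under stated assumptions: the READ_ONCE()/WRITE_ONCE() macros here are simplified volatile-cast stand-ins rather than the kernel definitions, they rely on the GCC/Clang __typeof__ extension, and the variable and function names are hypothetical.

```c
#include <pthread.h>

/*
 * Simplified user-space stand-ins for the kernel macros (assumption:
 * GCC or Clang, which provide __typeof__). The volatile access makes
 * the compiler emit one load or store for the marked access instead
 * of caching or refetching the value.
 */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical names: gc_lock plays the role of unix_gc_lock and
 * tot_inflight the role of unix_tot_inflight. */
static pthread_mutex_t gc_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int tot_inflight;

/* Writer: the counter only changes while the lock is held, and the
 * store is marked because lockless readers exist. */
static void inflight_inc(void)
{
	pthread_mutex_lock(&gc_lock);
	WRITE_ONCE(tot_inflight, tot_inflight + 1);
	pthread_mutex_unlock(&gc_lock);
}

/* Lockless reader: takes no lock, so the load is marked as well. */
static int gc_needed(void)
{
	return READ_ONCE(tot_inflight) != 0;
}
```

The reader may still observe a stale value; the marking documents the intentional race and keeps the compiler from tearing or repeating the access, which is what silences the KCSAN report.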

BUG: KCSAN: data-race in unix_inflight / unix_release_sock

write (marked) to 0xffffffff871852b8 of 4 bytes by task 123 on cpu 1:
 unix_inflight+0x130/0x180 net/unix/scm.c:64
 unix_attach_fds+0x137/0x1b0 net/unix/scm.c:123
 unix_scm_to_skb net/unix/af_unix.c:1832 [inline]
 unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1955
 sock_sendmsg_nosec net/socket.c:724 [inline]
 sock_sendmsg+0x148/0x160 net/socket.c:747
 ____sys_sendmsg+0x4e4/0x610 net/socket.c:2493
 ___sys_sendmsg+0xc6/0x140 net/socket.c:2547
 __sys_sendmsg+0x94/0x140 net/socket.c:2576
 __do_sys_sendmsg net/socket.c:2585 [inline]
 __se_sys_sendmsg net/socket.c:2583 [inline]
 __x64_sys_sendmsg+0x45/0x50 net/socket.c:2583
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

read to 0xffffffff871852b8 of 4 bytes by task 4891 on cpu 0:
 unix_release_sock+0x608/0x910 net/unix/af_unix.c:671
 unix_release+0x59/0x80 net/unix/af_unix.c:1058
 __sock_release+0x7d/0x170 net/socket.c:653
 sock_close+0x19/0x30 net/socket.c:1385
 __fput+0x179/0x5e0 fs/file_table.c:321
 ____fput+0x15/0x20 fs/file_table.c:349
 task_work_run+0x116/0x1a0 kernel/task_work.c:179
 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
 exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204
 __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
 syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297
 do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

value changed: 0x00000000 -> 0x00000001

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 4891 Comm: systemd-coredum Not tainted 6.4.0-rc5-01219-gfa0e21fa4443 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

Fixes: 9305cfa ("[AF_UNIX]: Make unix_tot_inflight counter non-atomic")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
macka69 pushed a commit to macka69/kernel_xiaomi_sm8250-1 that referenced this issue Nov 4, 2023
[ Upstream commit a154f5f643c6ecddd44847217a7a3845b4350003 ]

The following call trace shows a deadlock caused by recursive locking of
the mutex "device_mutex". The first lock acquisition is in
target_for_each_device() and the second is in target_free_device().

 PID: 148266   TASK: ffff8be21ffb5d00  CPU: 10   COMMAND: "iscsi_ttx"
  #0 [ffffa2bfc9ec3b18] __schedule at ffffffffa8060e7f
  #1 [ffffa2bfc9ec3ba0] schedule at ffffffffa8061224
  #2 [ffffa2bfc9ec3bb8] schedule_preempt_disabled at ffffffffa80615ee
  #3 [ffffa2bfc9ec3bc8] __mutex_lock at ffffffffa8062fd7
  #4 [ffffa2bfc9ec3c40] __mutex_lock_slowpath at ffffffffa80631d3
  #5 [ffffa2bfc9ec3c50] mutex_lock at ffffffffa806320c
  #6 [ffffa2bfc9ec3c68] target_free_device at ffffffffc0935998 [target_core_mod]
  #7 [ffffa2bfc9ec3c90] target_core_dev_release at ffffffffc092f975 [target_core_mod]
  #8 [ffffa2bfc9ec3ca0] config_item_put at ffffffffa79d250f
  #9 [ffffa2bfc9ec3cd0] config_item_put at ffffffffa79d2583
 #10 [ffffa2bfc9ec3ce0] target_devices_idr_iter at ffffffffc0933f3a [target_core_mod]
 #11 [ffffa2bfc9ec3d00] idr_for_each at ffffffffa803f6fc
 #12 [ffffa2bfc9ec3d60] target_for_each_device at ffffffffc0935670 [target_core_mod]
 #13 [ffffa2bfc9ec3d98] transport_deregister_session at ffffffffc0946408 [target_core_mod]
 #14 [ffffa2bfc9ec3dc8] iscsit_close_session at ffffffffc09a44a6 [iscsi_target_mod]
 #15 [ffffa2bfc9ec3df0] iscsit_close_connection at ffffffffc09a4a88 [iscsi_target_mod]
 #16 [ffffa2bfc9ec3df8] finish_task_switch at ffffffffa76e5d07
 #17 [ffffa2bfc9ec3e78] iscsit_take_action_for_connection_exit at ffffffffc0991c23 [iscsi_target_mod]
 #18 [ffffa2bfc9ec3ea0] iscsi_target_tx_thread at ffffffffc09a403b [iscsi_target_mod]
 #19 [ffffa2bfc9ec3f08] kthread at ffffffffa76d8080
 #20 [ffffa2bfc9ec3f50] ret_from_fork at ffffffffa8200364
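
The recursive acquisition in the trace above can be reproduced in user space with a small sketch. It is an analogy only, with hypothetical names: a POSIX error-checking mutex is used so that a second lock attempt by the owning thread returns EDEADLK instead of hanging, which is not how a kernel mutex behaves.

```c
#define _XOPEN_SOURCE 700	/* for PTHREAD_MUTEX_ERRORCHECK */
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the kernel's device_mutex. */
static pthread_mutex_t device_mutex_sketch;

/* Initialise as an error-checking mutex so that a relock by the owner
 * is reported as EDEADLK rather than blocking forever. */
static int sketch_init(void)
{
	pthread_mutexattr_t attr;
	int ret;

	ret = pthread_mutexattr_init(&attr);
	if (ret)
		return ret;
	ret = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (!ret)
		ret = pthread_mutex_init(&device_mutex_sketch, &attr);
	pthread_mutexattr_destroy(&attr);
	return ret;
}

/* Inner path: takes the mutex itself, like the release path in the
 * trace. Returns EDEADLK if the calling thread already holds it. */
static int free_device_sketch(void)
{
	int ret = pthread_mutex_lock(&device_mutex_sketch);

	if (ret)
		return ret;
	pthread_mutex_unlock(&device_mutex_sketch);
	return 0;
}

/* Outer path: holds the mutex while iterating and, through a dropped
 * reference, ends up in the inner path on the same thread. */
static int for_each_device_sketch(void)
{
	int ret;

	pthread_mutex_lock(&device_mutex_sketch);
	ret = free_device_sketch();
	pthread_mutex_unlock(&device_mutex_sketch);
	return ret;
}
```

After sketch_init(), calling free_device_sketch() on its own succeeds, while for_each_device_sketch() reports EDEADLK from the nested lock attempt. With a normal, non-recursive mutex the nested call would block forever, which matches the hung task in the trace.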

Fixes: 36d4cb4 ("scsi: target: Avoid that EXTENDED COPY commands trigger lock inversion")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Link: https://lore.kernel.org/r/20230918225848.66463-1-junxiao.bi@oracle.com
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
mesziman pushed a commit to mesziman/kernel_xiaomi_sm8250 that referenced this issue Jan 17, 2024
[ Upstream commit 14694179e561b5f2f7e56a0f590e2cb49a9cc7ab ]

Trying to suspend to RAM on SAMA5D27 EVK leads to the following lockdep
warning:

 ============================================
 WARNING: possible recursive locking detected
 6.7.0-rc5-wt+ #532 Not tainted
 --------------------------------------------
 sh/92 is trying to acquire lock:
 c3cf306c (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0xe8/0x100

 but task is already holding lock:
 c3d7c46c (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0xe8/0x100

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&irq_desc_lock_class);
   lock(&irq_desc_lock_class);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 6 locks held by sh/92:
  #0: c3aa0258 (sb_writers#6){.+.+}-{0:0}, at: ksys_write+0xd8/0x178
  #1: c4c2df44 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x138/0x284
  #2: c32684a0 (kn->active){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x148/0x284
  #3: c232b6d4 (system_transition_mutex){+.+.}-{3:3}, at: pm_suspend+0x13c/0x4e8
  #4: c387b088 (&dev->mutex){....}-{3:3}, at: __device_suspend+0x1e8/0x91c
  #5: c3d7c46c (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0xe8/0x100

 stack backtrace:
 CPU: 0 PID: 92 Comm: sh Not tainted 6.7.0-rc5-wt+ #532
 Hardware name: Atmel SAMA5
  unwind_backtrace from show_stack+0x18/0x1c
  show_stack from dump_stack_lvl+0x34/0x48
  dump_stack_lvl from __lock_acquire+0x19ec/0x3a0c
  __lock_acquire from lock_acquire.part.0+0x124/0x2d0
  lock_acquire.part.0 from _raw_spin_lock_irqsave+0x5c/0x78
  _raw_spin_lock_irqsave from __irq_get_desc_lock+0xe8/0x100
  __irq_get_desc_lock from irq_set_irq_wake+0xa8/0x204
  irq_set_irq_wake from atmel_gpio_irq_set_wake+0x58/0xb4
  atmel_gpio_irq_set_wake from irq_set_irq_wake+0x100/0x204
  irq_set_irq_wake from gpio_keys_suspend+0xec/0x2b8
  gpio_keys_suspend from dpm_run_callback+0xe4/0x248
  dpm_run_callback from __device_suspend+0x234/0x91c
  __device_suspend from dpm_suspend+0x224/0x43c
  dpm_suspend from dpm_suspend_start+0x9c/0xa8
  dpm_suspend_start from suspend_devices_and_enter+0x1e0/0xa84
  suspend_devices_and_enter from pm_suspend+0x460/0x4e8
  pm_suspend from state_store+0x78/0xe4
  state_store from kernfs_fop_write_iter+0x1a0/0x284
  kernfs_fop_write_iter from vfs_write+0x38c/0x6f4
  vfs_write from ksys_write+0xd8/0x178
  ksys_write from ret_fast_syscall+0x0/0x1c
 Exception stack(0xc52b3fa8 to 0xc52b3ff0)
 3fa0:                   00000004 005a0ae8 00000001 005a0ae8 00000004 00000001
 3fc0: 00000004 005a0ae8 00000001 00000004 00000004 b6c616c0 00000020 0059d190
 3fe0: 00000004 b6c61678 aec5a041 aebf1a26

This warning is raised because pinctrl-at91-pio4 uses chained IRQs. Whenever
a wake-up source configures an IRQ through irq_set_irq_wake(), it locks
the corresponding IRQ desc and then calls irq_set_irq_wake() on the "parent"
IRQ, which does the same on its own IRQ desc. Since those two locks
share the same class, lockdep reports this as an issue.

Fix the lockdep false positive by setting a different class for parent and
child IRQs.

Fixes: 7761808 ("pinctrl: introduce driver for Atmel PIO4 controller")
Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
Link: https://lore.kernel.org/r/20231215-lockdep_warning-v1-1-8137b2510ed5@bootlin.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>