display turns off after sporadic 'flip_done timeout error' #128

nkichukov · 2018-02-11T22:50:55Z

Hi folks,

The reported issue is sporadic and not reproducible on demand. As per the log below, it has happened 3 times in 21 days. All 3 times, the display went blank and got powered off until the Xorg session was restarted.

Details follow below:
kernel: 4.15.0-rc7-v7+ + patch *1.) which fixed memory leak issue addressed at #123
compiler: gcc-6.4
raspberry pi running in 32bit mode
CMA=256MB
OS: Gentoo Linux

Error printed in the kernel log:
[1042617.315980] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] ERROR [CRTC:53:crtc-2] flip_done timed out
[1042842.595966] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] ERROR [PLANE:52:plane-20] flip_done timed out
[1042959.331986] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] ERROR [CRTC:53:crtc-2] flip_done timed out
[1043354.595985] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] ERROR [PLANE:52:plane-20] flip_done timed out
[1761553.894440] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] ERROR [CRTC:53:crtc-2] flip_done timed out
[1762164.198463] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] ERROR [PLANE:52:plane-20] flip_done timed out

Workaround:
terminate X11 session and restart it.

Thank you,
-Nikolay

*1.) https://lists.freedesktop.org/archives/dri-devel/2018-January/161625.html

lategoodbye · 2018-02-12T09:21:38Z

Please follow this thread:
https://lists.freedesktop.org/archives/dri-devel/2018-February/165428.html

[ Upstream commit 4adfa79 ] When we dump the ip6mr mfc entries via proc, we initialize an iterator with the table to dump but we don't clear the cache pointer which might be initialized from a prior read on the same descriptor that ended. This can result in lock imbalance (an unnecessary unlock) leading to other crashes and hangs. Clear the cache pointer like ipmr does to fix the issue. Thanks for the reliable reproducer. Here's syzbot's trace: WARNING: bad unlock balance detected! 4.15.0-rc3+ #128 Not tainted syzkaller971460/3195 is trying to release lock (mrt_lock) at: [<000000006898068d>] ipmr_mfc_seq_stop+0xe1/0x130 net/ipv6/ip6mr.c:553 but there are no more locks to release! other info that might help us debug this: 1 lock held by syzkaller971460/3195: #0: (&p->lock){+.+.}, at: [<00000000744a6565>] seq_read+0xd5/0x13d0 fs/seq_file.c:165 stack backtrace: CPU: 1 PID: 3195 Comm: syzkaller971460 Not tainted 4.15.0-rc3+ #128 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:53 print_unlock_imbalance_bug+0x12f/0x140 kernel/locking/lockdep.c:3561 __lock_release kernel/locking/lockdep.c:3775 [inline] lock_release+0x5f9/0xda0 kernel/locking/lockdep.c:4023 __raw_read_unlock include/linux/rwlock_api_smp.h:225 [inline] _raw_read_unlock+0x1a/0x30 kernel/locking/spinlock.c:255 ipmr_mfc_seq_stop+0xe1/0x130 net/ipv6/ip6mr.c:553 traverse+0x3bc/0xa00 fs/seq_file.c:135 seq_read+0x96a/0x13d0 fs/seq_file.c:189 proc_reg_read+0xef/0x170 fs/proc/inode.c:217 do_loop_readv_writev fs/read_write.c:673 [inline] do_iter_read+0x3db/0x5b0 fs/read_write.c:897 compat_readv+0x1bf/0x270 fs/read_write.c:1140 do_compat_preadv64+0xdc/0x100 fs/read_write.c:1189 C_SYSC_preadv fs/read_write.c:1209 [inline] compat_SyS_preadv+0x3b/0x50 fs/read_write.c:1203 do_syscall_32_irqs_on arch/x86/entry/common.c:327 [inline] do_fast_syscall_32+0x3ee/0xf9d arch/x86/entry/common.c:389 entry_SYSENTER_compat+0x51/0x60 arch/x86/entry/entry_64_compat.S:125 RIP: 0023:0xf7f73c79 RSP: 002b:00000000e574a15c EFLAGS: 00000292 ORIG_RAX: 000000000000014d RAX: ffffffffffffffda RBX: 000000000000000f RCX: 0000000020a3afb0 RDX: 0000000000000001 RSI: 0000000000000067 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 BUG: sleeping function called from invalid context at lib/usercopy.c:25 in_atomic(): 1, irqs_disabled(): 0, pid: 3195, name: syzkaller971460 INFO: lockdep is turned off. CPU: 1 PID: 3195 Comm: syzkaller971460 Not tainted 4.15.0-rc3+ #128 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:53 ___might_sleep+0x2b2/0x470 kernel/sched/core.c:6060 __might_sleep+0x95/0x190 kernel/sched/core.c:6013 __might_fault+0xab/0x1d0 mm/memory.c:4525 _copy_to_user+0x2c/0xc0 lib/usercopy.c:25 copy_to_user include/linux/uaccess.h:155 [inline] seq_read+0xcb4/0x13d0 fs/seq_file.c:279 proc_reg_read+0xef/0x170 fs/proc/inode.c:217 do_loop_readv_writev fs/read_write.c:673 [inline] do_iter_read+0x3db/0x5b0 fs/read_write.c:897 compat_readv+0x1bf/0x270 fs/read_write.c:1140 do_compat_preadv64+0xdc/0x100 fs/read_write.c:1189 C_SYSC_preadv fs/read_write.c:1209 [inline] compat_SyS_preadv+0x3b/0x50 fs/read_write.c:1203 do_syscall_32_irqs_on arch/x86/entry/common.c:327 [inline] do_fast_syscall_32+0x3ee/0xf9d arch/x86/entry/common.c:389 entry_SYSENTER_compat+0x51/0x60 arch/x86/entry/entry_64_compat.S:125 RIP: 0023:0xf7f73c79 RSP: 002b:00000000e574a15c EFLAGS: 00000292 ORIG_RAX: 000000000000014d RAX: ffffffffffffffda RBX: 000000000000000f RCX: 0000000020a3afb0 RDX: 0000000000000001 RSI: 0000000000000067 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 WARNING: CPU: 1 PID: 3195 at lib/usercopy.c:26 _copy_to_user+0xb5/0xc0 lib/usercopy.c:26 Reported-by: syzbot <bot+eceb3204562c41a438fa1f2335e0fe4f6886d669@syzkaller.appspotmail.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

nkichukov · 2018-04-03T13:25:39Z

Thanks for the pointers Stefan,
I have attempted rpi-4.16.0-rc7 kernel which does include the clk-patches from https://patchwork.kernel.org/patch/10207161/raw/ and drm/atomic fixes from https://patchwork.kernel.org/patch/10169179/raw/ however, I am still getting the same sporadic results:

[ 5433.053889] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:68:crtc-2] flip_done timed out [ 5551.325851] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:67:plane-20] flip_done timed out
Any other pointers/patches?
-Nik

nkichukov · 2018-04-03T22:15:38Z

Reproduced with drm.debug=0x3f. A lot of repeated messages, not sure if this will help. Leaving the kern.log.gz file available for download(10MB), uncompressed 142MB.

http://www.oldum.net/~nikolay/kern.log.gz

First occurrence:
Apr 3 23:34:23 grpi kernel: [ 906.205803] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] ERROR [CRTC:68:crtc-2] flip_done timed out
Second:
Apr 3 23:38:18 grpi kernel: [ 1141.213846] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] ERROR [PLANE:67:plane-20] flip_done timed out

Cheers,
-N

nkichukov · 2018-04-14T20:07:10Z

Upgraded to rpi-4.16.1 which seems to have received multiple vc4 patches and I am unable to reproduce the flip_done timeout for the last 3 days.

nkichukov · 2018-04-18T07:47:44Z

day 6, issue reoccurred:

[516937.439594] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:68:crtc-2] flip_done timed out
[517547.743601] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:67:plane-20] flip_done timed out

nkichukov · 2018-05-28T08:54:06Z

This continues to occur, so sporadic as it was...
Changes attempted:
Running 4.16.10 kernel now, noticed loads of vc4 commits made it there, to no avail for this bug unfortunately.
Running latest version of libdrm, from https://anongit.freedesktop.org/git/mesa/drm.git, no change.
Version of mesa: mesa-18.1.0, tried with experimental from git, same results.
Xorg server: xorg-server-1.19.5-r2
compiler upgraded to: gcc-7.3.0-r3 and all of the above recompiled and with -O2, same results.

[74697.182124] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:68:crtc-2] flip_done timed out [74814.686141] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:67:plane-20] flip_done timed out [124903.134330] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:68:crtc-2] flip_done timed out [125513.438329] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:67:plane-20] flip_done timed out

Any pointers? Where to go next?
Cheers,
-N

nkichukov · 2018-06-21T11:38:14Z

This is no longer happening, so I am closing the case. It seems like it was not due to the driver/kernel side, but userland, as the few changes I've made about 20 days ago were to upgrade xorg-server and xorg-drivers to version 1.20.0 and also disable some of the enabled but unused --configure switches for mesa when compiling it. This got the system back to a stable state and the flip_done timeout is no longer present, at least not in the last 20 days...

Cheers,
-N

lategoodbye · 2019-01-25T12:19:39Z

This "flip_done timed out" still occurs e.g. next-20190125.

Unfortunately i don't know the used display on the kernelci machine (Raspberry Pi 3 B, arm64/defconfig):
https://kernelci.org/boot/id/5c4ab5b359b514f4f39ad006/

07:07:28.667279  [   26.147366] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:139:crtc-2] flip_done timed out
07:07:28.668117  [   26.147369] Bluetooth: hci0: Failed to set baudrate
07:07:28.702757  [   26.147527] Console: switching to colour frame buffer device 90x30
07:07:28.705750  [   26.259404] Bluetooth: hci0: BCM: chip id 94
07:07:28.706766  [   26.261556] Bluetooth: hci0: BCM: features 0x2e
07:07:28.707637  [   26.284543] Bluetooth: hci0: BCM43430A1
07:07:28.708475  [   26.284556] Bluetooth: hci0: BCM43430A1 (001.002.009) build 0000
07:07:28.709290  [   26.284658] bluetooth hci0: Direct firmware load for brcm/BCM43430A1.hcd failed with error -2
07:07:28.710129  [   26.284668] Bluetooth: hci0: BCM: Patch brcm/BCM43430A1.hcd not found
07:07:28.749202  [   36.387352] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:139:crtc-2] flip_done timed out
07:07:28.751899  [   45.569912] NET: Registered protocol family 10
07:07:28.752647  [   45.571503] Segment Routing with IPv6
07:07:28.753346  [   46.627353] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CONNECTOR:46:Composite-1] flip_done timed out
07:07:28.754030  [   56.867345] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:135:plane-20] flip_done timed out
07:07:28.756490  [   57.030633] vc4-drm soc:gpu: fb0: DRM emulated frame buffer device

nkichukov closed this as completed Jun 21, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

display turns off after sporadic 'flip_done timeout error' #128

display turns off after sporadic 'flip_done timeout error' #128

nkichukov commented Feb 11, 2018

lategoodbye commented Feb 12, 2018

nkichukov commented Apr 3, 2018

nkichukov commented Apr 3, 2018

nkichukov commented Apr 14, 2018

nkichukov commented Apr 18, 2018

nkichukov commented May 28, 2018

nkichukov commented Jun 21, 2018

lategoodbye commented Jan 25, 2019 •

edited

Loading

display turns off after sporadic 'flip_done timeout error' #128

display turns off after sporadic 'flip_done timeout error' #128

Comments

nkichukov commented Feb 11, 2018

lategoodbye commented Feb 12, 2018

nkichukov commented Apr 3, 2018

nkichukov commented Apr 3, 2018

nkichukov commented Apr 14, 2018

nkichukov commented Apr 18, 2018

nkichukov commented May 28, 2018

nkichukov commented Jun 21, 2018

lategoodbye commented Jan 25, 2019 • edited Loading

lategoodbye commented Jan 25, 2019 •

edited

Loading