mpeg risc op code error on Supermicro with WinTV-QuadHD-ATSC #51

Closed
MikeB2013 opened this issue Aug 30, 2018 · 39 comments

Comments

@MikeB2013

Hi Brad,

I am trying to help out a user of the mythtv application who is getting mpeg risc op code errors, which are fatal.

Operating System - Ubuntu 18.04.01 LTS (Server)
uname -a output: Linux mythtv-server 4.15.0-29201807270420-generic #0+mediatree+hauppauge-Ubuntu SMP Fri Jul 27 18:09:45 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
System - SUPERMICRO SYS-5018A-FTN4 1U Rackmount Server / 16 GB RAM / 2 TB HD WD Purple @ 5400 rpm
Hauppauge WinTV-QuadHD-ATSC [card=57,autodetected], Hauppauge model 165100, rev B4I6, serial# 4036040160

The dmesg output is http://paste.ubuntu.com/p/zGgVPGBfvk/

The mythtv forum thread is https://forum.mythtv.org/viewtopic.php?p=13661#p13661

Any thoughts on how to debug this?

Mike

@b-rad-NDi
Owner

Hey Mike. Ugh, this issue :/ I will fire my ryzen system back up and resume work on this bug. I was having problems reproducing it after the last patches I did, but I don't think all mobos are equally affected by this bug.

Has this end user updated their BIOS to the latest and greatest? That is the first thing I'd suggest.

I will install myth if that easily triggers this issue. Any tips on how to repro this via myth would be appreciated.

@MikeB2013
Author

MikeB2013 commented Aug 30, 2018

Hi Brad, the BIOS is at version 2.1, which is the latest according to the Supermicro web site. The CPU is Intel(R) Atom(TM) CPU C2758 @ 2.40GHz (family: 0x6, model: 0x4d, stepping: 0x8)

The mythtv setup is basic: it just requires either a recording or LiveTV to be active on the failing system (note that EIT scanning is enabled). I suspect that something like w_scan will reproduce the problem. Do you have a suitable incantation to use (I don't know anything about ATSC)? I can then get the mythtv user to run some tests.

I recently had one mpeg risc error (non fatal) on my ASUS STRIX B250F GAMING motherboard, BIOS 1205 05/11/2018, with CPU Intel(R) Pentium(R) CPU G4400 @ 3.30GHz (family: 0x6, model: 0x5e, stepping: 0x3)

I note that on the failing system the mpeg risc error line was not preceded by "cx23885 0000:05:00.0: dma in progress detected 0x00000001 0x00000001, clearing", which is what I see, e.g.:

[56499.363529] cx23885 0000:05:00.0: dma in progress detected 0x00000001 0x00000001, clearing
[56499.363599] cx23885: cx23885[0]: mpeg risc op code error
[56499.363603] cx23885: cx23885[0]: TS1 B - dma channel status dump
[56499.363606] cx23885: cx23885[0]:   cmds: init risc lo   : 0xbf4c5000
[56499.363608] cx23885: cx23885[0]:   cmds: init risc hi   : 0x00000000
[56499.363611] cx23885: cx23885[0]:   cmds: cdt base       : 0x00010580
[56499.363613] cx23885: cx23885[0]:   cmds: cdt size       : 0x0000000a
[56499.363616] cx23885: cx23885[0]:   cmds: iq base        : 0x00010400
[56499.363618] cx23885: cx23885[0]:   cmds: iq size        : 0x00000010
[56499.363620] cx23885: cx23885[0]:   cmds: risc pc lo     : 0xbf4c500c
[56499.363623] cx23885: cx23885[0]:   cmds: risc pc hi     : 0x00000000
[56499.363625] cx23885: cx23885[0]:   cmds: iq wr ptr      : 0x00004101
[56499.363628] cx23885: cx23885[0]:   cmds: iq rd ptr      : 0x00004100
[56499.363630] cx23885: cx23885[0]:   cmds: cdt current    : 0x00010588
[56499.363633] cx23885: cx23885[0]:   cmds: pci target lo  : 0x12266340
[56499.363635] cx23885: cx23885[0]:   cmds: pci target hi  : 0x00000000
[56499.363637] cx23885: cx23885[0]:   cmds: line / byte    : 0x000c0000
[56499.363640] cx23885: cx23885[0]:   risc0: 
[56499.363641] 0x1c0002f0 [ write sol eol count=752 ]
[56499.363644] cx23885: cx23885[0]:   risc1: 
[56499.363645] 0x12266050 [ write irq2 21 18 cnt1 14 13 count=80 ]
[56499.363649] cx23885: cx23885[0]:   risc2: 
[56499.363649] 0x00000000 [ INVALID count=0 ]
[56499.363652] cx23885: cx23885[0]:   risc3: 
[56499.363652] 0x1c0002f0 [ write sol eol count=752 ]
[56499.363656] cx23885: cx23885[0]:   (0x00010400) iq 0: 
[56499.363656] 0x70000000 [ jump count=0 ]
[56499.363659] cx23885: cx23885[0]:   iq 1: 0x00000000 [ arg #1 ]
[56499.363662] cx23885: cx23885[0]:   iq 2: 0x1c0002f0 [ arg #2 ]
[56499.363664] cx23885: cx23885[0]:   (0x0001040c) iq 3: 
[56499.363664] 0x12266050 [ write irq2 21 18 cnt1 14 13 count=80 ]
[56499.363669] cx23885: cx23885[0]:   iq 4: 0x00000000 [ arg #1 ]
[56499.363671] cx23885: cx23885[0]:   iq 5: 0x1c0002f0 [ arg #2 ]
[56499.363676] cx23885: cx23885[0]:   (0x00010418) iq 6: 
[56499.363676] 0x12266340 [ write irq2 21 18 cnt1 14 13 count=832 ]
[56499.363681] cx23885: cx23885[0]:   iq 7: 0x00000000 [ arg #1 ]
[56499.363683] cx23885: cx23885[0]:   iq 8: 0x00000000 [ arg #2 ]
[56499.363685] cx23885: cx23885[0]:   (0x00010424) iq 9: 
[56499.363686] 0x1c0002f0 [ write sol eol count=752 ]
[56499.363689] cx23885: cx23885[0]:   iq a: 0x12265780 [ arg #1 ]
[56499.363692] cx23885: cx23885[0]:   iq b: 0x00000000 [ arg #2 ]
[56499.363694] cx23885: cx23885[0]:   (0x00010430) iq c: 
[56499.363694] 0x1c0002f0 [ write sol eol count=752 ]
[56499.363698] cx23885: cx23885[0]:   iq d: 0x12265a70 [ arg #1 ]
[56499.363700] cx23885: cx23885[0]:   iq e: 0x00000000 [ arg #2 ]
[56499.363702] cx23885: cx23885[0]:   (0x0001043c) iq f: 
[56499.363703] 0x1c0002f0 [ write sol eol count=752 ]
[56499.363706] cx23885: cx23885[0]:   iq 10: 0x06d90bc0 [ arg #1 ]
[56499.363709] cx23885: cx23885[0]:   iq 11: 0x00000000 [ arg #2 ]
[56499.363709] cx23885: cx23885[0]: fifo: 0x00005000 -> 0x6000
[56499.363710] cx23885: cx23885[0]: ctrl: 0x00010400 -> 0x10460
[56499.363712] cx23885: cx23885[0]:   ptr1_reg: 0x00005390
[56499.363715] cx23885: cx23885[0]:   ptr2_reg: 0x00010598
[56499.363717] cx23885: cx23885[0]:   cnt1_reg: 0x0000000a
[56499.363719] cx23885: cx23885[0]:   cnt2_reg: 0x00000007

Mike
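
The check described above (is each "mpeg risc op code error" line immediately preceded by the "dma in progress detected ... clearing" line?) can be done mechanically on a saved log. A minimal sketch, assuming dmesg-style text on stdin; the function name `check_risc_errors` is made up for illustration and is not part of any existing tool:

```shell
#!/bin/sh
# Hypothetical helper: for every "mpeg risc op code error" line, report whether
# the line immediately before it was the "dma in progress detected" message.
check_risc_errors() {
    awk '
        /mpeg risc op code error/ {
            if (prev ~ /dma in progress detected/)
                print "preceded: " $0
            else
                print "NOT preceded: " $0
        }
        { prev = $0 }   # remember this line for the next iteration
    '
}

# Example: dmesg | check_risc_errors
```

It only looks at the single preceding line, so unrelated kernel messages interleaved between the two would show up as "NOT preceded".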

@b-rad-NDi
Owner

Wait what, this isn't ryzen :-o Shyzer.

It would be great if you could get me command lines that repro this. In my tests w_scan is usually not enough, because it doesn't generate enough interrupts during the scanning process to cause this.

@b-rad-NDi
Owner

I do have a patch where this driver was converted to the vb2 buffering system. The risc programs were changed at that point. I'm noodling over this.

@MikeB2013
Author

MikeB2013 commented Aug 30, 2018

Hi Brad, given the user is seeing a lot of these mpeg risc errors, I think w_scan is worth a try; we just need the appropriate incantation. I think it is w_scan -fa -A1 -c US -a n, where n is the adapter number.

Unfortunately mythtv is not a command line application.

Mike

@crumka

crumka commented Aug 31, 2018

@b-rad-NDi, I'm the user experiencing the issue. I ran w_scan -fa -A1 -c US -a 0 three times last night and couldn't replicate the issue.

Logs are here: http://paste.ubuntu.com/p/xWw7CtSF5F/

MikeB2013 asked me for my /proc/interrupts output. See below:

crumka@mythtv-server:~$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
  0:         11          0          0          0          0          0          0          0   IO-APIC   2-edge      timer
  8:          1          0          0          0          0          0          0          0   IO-APIC   8-edge      rtc0
  9:          0          0          0          0          0          0          0          0   IO-APIC   9-fasteoi   acpi
 20:          0          0          0          0       1376          0          0          0   IO-APIC  20-fasteoi   cx23885[1]
 23:          0          0          0          0          0          0          0        141   IO-APIC  23-fasteoi   ehci_hcd:usb1, cx23885[0]
 24:          0          0          0          0          0          0          0          0   PCI-MSI 16384-edge      aerdrv, PCIe PME
 25:          0          0          0          0          0          0          0          0   PCI-MSI 32768-edge      aerdrv, PCIe PME
 26:          0          0          0          0          0          0          0          0   PCI-MSI 49152-edge      aerdrv, PCIe PME
 29:         25          0          0          0          0          0          0          0   PCI-MSI 1572864-edge      xhci_hcd
 30:          0          0          0          0          0          0          0          0   PCI-MSI 1572865-edge      xhci_hcd
 31:          0          0          0          0          0          0          0          0   PCI-MSI 1572866-edge      xhci_hcd
 32:          0          0          0          0          0          0          0          0   PCI-MSI 1572867-edge      xhci_hcd
 33:          0          0          0          0          0          0          0          0   PCI-MSI 1572868-edge      xhci_hcd
 34:          0          0          0          0          0          0          0          0   PCI-MSI 1572869-edge      xhci_hcd
 35:          0          0          0          0          0          0          0          0   PCI-MSI 1572870-edge      xhci_hcd
 36:          0          0          0          0          0          0          0          0   PCI-MSI 1572871-edge      xhci_hcd
 37:          0          0          0          0          0          0          0          0   PCI-MSI 311296-edge      ismt-msi
 38:          0          0          0          0          0          0          0          0   PCI-MSI 376832-edge      ahci[0000:00:17.0]
 39:          0          0       2628       6656          0          0          0          0   PCI-MSI 393216-edge      ahci[0000:00:18.0]
 40:          1          0          0          0          0          0          0          0   PCI-MSI 327680-edge      enp0s20f0
 41:          0         30          0          0          0          0          0        103   PCI-MSI 327681-edge      enp0s20f0-TxRx-0
 42:          0          0         20          0          0          0        110          0   PCI-MSI 327682-edge      enp0s20f0-TxRx-1
 43:          0          0          0         13          0        101          0          0   PCI-MSI 327683-edge      enp0s20f0-TxRx-2
 44:          0          0          0          0        107          0          0          0   PCI-MSI 327684-edge      enp0s20f0-TxRx-3
 45:          0          0          0        102          0         25          0          0   PCI-MSI 327685-edge      enp0s20f0-TxRx-4
 46:          0          0        103          0          0          0         15          0   PCI-MSI 327686-edge      enp0s20f0-TxRx-5
 47:          0          0          0         96          0          0          0         13   PCI-MSI 327687-edge      enp0s20f0-TxRx-6
 48:        126          0          0          0          0          0          0          0   PCI-MSI 327688-edge      enp0s20f0-TxRx-7
 49:          0          0          0          0          0          0          0          0   PCI-MSI 329728-edge      enp0s20f1
 50:         11          0          0         90          0          0          0          0   PCI-MSI 329729-edge      enp0s20f1-TxRx-0
 51:          0         11         90          0          0          0          0          0   PCI-MSI 329730-edge      enp0s20f1-TxRx-1
 52:          0         90         11          0          0          0          0          0   PCI-MSI 329731-edge      enp0s20f1-TxRx-2
 53:         90          0          0         11          0          0          0          0   PCI-MSI 329732-edge      enp0s20f1-TxRx-3
 54:          0          0          0          0         11          0         90          0   PCI-MSI 329733-edge      enp0s20f1-TxRx-4
 55:          0          0          0          0          0         11          0         90   PCI-MSI 329734-edge      enp0s20f1-TxRx-5
 56:          0          0          0          0         90          0         11          0   PCI-MSI 329735-edge      enp0s20f1-TxRx-6
 57:          0          0          0          0          0          0          0        101   PCI-MSI 329736-edge      enp0s20f1-TxRx-7
 58:          0          0          0          0          0          0          1          0   PCI-MSI 331776-edge      enp0s20f2
 59:          0          0          0          0          0         51        182         59   PCI-MSI 331777-edge      enp0s20f2-TxRx-0
 60:         31          0          0          0        134          0          0          0   PCI-MSI 331778-edge      enp0s20f2-TxRx-1
 61:          0         23          0        279          0          0          0          0   PCI-MSI 331779-edge      enp0s20f2-TxRx-2
 62:          0          0        139          0          0          0          0          0   PCI-MSI 331780-edge      enp0s20f2-TxRx-3
 63:          0        107          0         40          0          0          0          0   PCI-MSI 331781-edge      enp0s20f2-TxRx-4
 64:        105          0          0          0         24          0          0          0   PCI-MSI 331782-edge      enp0s20f2-TxRx-5
 65:          0        142          0          0          0         39          0          0   PCI-MSI 331783-edge      enp0s20f2-TxRx-6
 66:          0          0          0          0          0          0         47        110   PCI-MSI 331784-edge      enp0s20f2-TxRx-7
 67:          0          0          0          0          0          0          0          0   PCI-MSI 333824-edge      enp0s20f3
 68:          0          0          0          0          0         90         11          0   PCI-MSI 333825-edge      enp0s20f3-TxRx-0
 69:          0          0          0          0         90          0          0         11   PCI-MSI 333826-edge      enp0s20f3-TxRx-1
 70:         11          0         90          0          0          0          0          0   PCI-MSI 333827-edge      enp0s20f3-TxRx-2
 71:          0         11          0         90          0          0          0          0   PCI-MSI 333828-edge      enp0s20f3-TxRx-3
 72:         90          0         11          0          0          0          0          0   PCI-MSI 333829-edge      enp0s20f3-TxRx-4
 73:          0         90          0         11          0          0          0          0   PCI-MSI 333830-edge      enp0s20f3-TxRx-5
 74:          0          0          0          0         11          0          0         90   PCI-MSI 333831-edge      enp0s20f3-TxRx-6
 75:          0          0          0          0          0        101          0          0   PCI-MSI 333832-edge      enp0s20f3-TxRx-7
NMI:          0          1          0          0          0          0          0          0   Non-maskable interrupts
LOC:       7490       9625       6928       6053       9394      10777       5101       5567   Local timer interrupts
SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
PMI:          0          1          0          0          0          0          0          0   Performance monitoring interrupts
IWI:       2347       2486       2085       1808       3154       5039       1370       1551   IRQ work interrupts
RTR:          0          0          0          0          0          0          0          0   APIC ICR read retries
RES:        607        232        350        320        280        273        272        248   Rescheduling interrupts
CAL:       2439       2366       1762       1701       3488       4328       2504       4050   Function call interrupts
TLB:         11         24         14         24         21         27          8         12   TLB shootdowns
TRM:          0          0          0          0          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0          0          0          0          0   Threshold APIC interrupts
DFR:          0          0          0          0          0          0          0          0   Deferred Error APIC interrupts
MCE:          0          0          0          0          0          0          0          0   Machine check exceptions
MCP:          1          1          1          1          1          1          1          1   Machine check polls
HYP:          0          0          0          0          0          0          0          0   Hypervisor callback interrupts
ERR:          0
MIS:          0
PIN:          0          0          0          0          0          0          0          0   Posted-interrupt notification event
NPI:          0          0          0          0          0          0          0          0   Nested posted-interrupt event
PIW:          0          0          0          0          0          0          0          0   Posted-interrupt wakeup event
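
With a table this long it can help to pull out only the tuner's lines. A small sketch, assuming /proc/interrupts-formatted text on stdin; `cx23885_irqs` is a made-up helper name. On the output above it would report IRQ 20 as exclusive and IRQ 23 as shared, because `cx23885[0]` is listed there together with `ehci_hcd:usb1`. Whether that sharing has anything to do with the risc error is not established in this thread.

```shell
#!/bin/sh
# Hypothetical helper: list the IRQ lines used by cx23885 and say whether
# another handler is registered on the same line (comma-separated names).
cx23885_irqs() {
    awk '
        /cx23885/ {
            irq = $1
            sub(/:$/, "", irq)                       # "23:" -> "23"
            shared = ($0 ~ /,/) ? "shared" : "exclusive"
            print "IRQ " irq ": " shared
        }
    '
}

# Example: cx23885_irqs < /proc/interrupts
```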

Please let me know if/what else you need.

@MikeB2013
Author

Hi Brad,

If you want to try setting up mythtv application here is my quick guide to setup
mythtv-quick-configure.txt

There is a whole pile of formal documentation here https://www.mythtv.org/wiki/Configuring_MythTV

Mike

@b-rad-NDi
Owner

I'm gonna do this on my idle ryzen box. Its whole point in life was fixing this issue, so I guess it's back on duty.

@crumka

crumka commented Sep 12, 2018

Brad - Would you want me to try to generate/post any more logs? I've been continuing to play with the system. Getting intermittent success with longer recordings (up to 3.3 GB) - but all fail with the same error(s).

@b-rad-NDi
Owner

OK, so I installed myth using the instructions above on my ryzen system. I've left channels playing overnight as well as setting up four simultaneous recordings.

I get risc op code errors very early at boot, but they are related to the analog audio. So far I have not triggered one on a TS port. I'm going to continue to let it sit and queue up recordings.

@crumka

crumka commented Sep 19, 2018

Brad - Thank you for the update. This is a single purpose machine (for now), so if you want me to test anything I'm happy to break/rebuild. If need be, we can discuss getting you ssh access, but that would be a new one for me so it might take some time, Just let me know.

@b-rad-NDi
Owner

120GB of recordings and no error. Mythtv isn't auto cleaning though, so I need to figure some stuff out before I can let it go long term.

@crumka

crumka commented Sep 20, 2018

Best I've ever achieved is 3.3 GB, but most fail within a few tens or hundreds of MB.

@crumka

crumka commented Sep 28, 2018

@b-rad-NDi -- Been playing around with the server and got some interesting results. When I run a minecraft server at the same time as mythtv -- the error goes away and I can record. When I turn off the minecraft server instance the error returns.

Could we have some resource going to sleep or a race condition? (I don't know much about these but have been reading. Please excuse me if they aren't productive comments.)

@b-rad-NDi
Owner

Good finding, I've often suspected this could be related to system performance level. Check out cpu load. Perhaps this issue is related to cpu power scaling.

To verify performance level:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

Note you might check all cpuN.
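
Checking every core in one go can be done with a small loop. A sketch, assuming the standard cpufreq sysfs layout used in the command above; the function name `show_governors` and its optional base-directory argument are illustrative only:

```shell
#!/bin/sh
# Print the scaling governor of every cpuN found under the given sysfs base
# (defaults to the usual /sys/devices/system/cpu).
show_governors() {
    base=${1:-/sys/devices/system/cpu}
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue      # skip if cpufreq is not exposed
        cpu=${f#"$base"/}            # strip the base prefix -> cpuN/cpufreq/...
        cpu=${cpu%%/*}               # keep only "cpuN"
        printf '%s: %s\n' "$cpu" "$(cat "$f")"
    done
}

# Example: show_governors
```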

You can spin up a few md5sum to do a different sort of verify. See how much cpu load the minecraft server adds, then tack on a couple md5sum threads.

cat /dev/null | md5sum &

and see if you encounter the risc op code error.

then eventually...

killall md5sum
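
One caveat with the md5sum command as written: /dev/null returns end-of-file immediately, so that md5sum exits right away and adds no sustained load. Reading from /dev/zero instead keeps it busy until it is killed. A sketch under that assumption; the helper names `start_load` and `stop_load` and the default worker count are made up for illustration:

```shell
#!/bin/sh
LOAD_PIDS=""

# Start N background md5sum workers that hash an endless stream of zeros.
start_load() {
    n=${1:-2}
    i=0
    while [ "$i" -lt "$n" ]; do
        md5sum /dev/zero >/dev/null 2>&1 &
        LOAD_PIDS="$LOAD_PIDS $!"
        i=$((i + 1))
    done
}

# Stop only the workers started above (narrower than killall md5sum).
stop_load() {
    for p in $LOAD_PIDS; do
        kill "$p" 2>/dev/null
    done
    LOAD_PIDS=""
}

# Example: start_load 4; ...start a recording and watch dmesg...; stop_load
```

Each worker should keep roughly one core busy, so the worker count gives a coarse way to step the load up or down while watching for the risc op code error.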

@crumka

crumka commented Oct 10, 2018

@b-rad-NDi - I've been out of town. Intending to get you this data tonight.

@b-rad-NDi
Owner

Crumka? Did you ever try out those tests? The discussion on the mailing list right now is to apply the original ryzen dma patches only on ryzen systems, or on others that exhibit the issue. This patch seems like it might be causing issues on other systems.

@b-rad-NDi
Owner

A patch has been submitted to disable the "Ryzen dma engine stall" patch on non-Ryzen systems. It is very possible you have one of the platforms that is adversely affected by that "fix". Some platforms are fine, others have issues.
Hopefully, with it disabled, the issue is no longer encountered on Intel systems.

@crumka

crumka commented Dec 18, 2018

I got distracted. Apologies.

---Without Minecraft---
Results from:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
through (it is an 8 core system)
cat /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
were all:
ondemand

---With Minecraft---
Results from:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
through (it is an 8 core system)
cat /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
were all:
ondemand

---md5sum---
These all seemed to end before I could run the 'cat' command.

@crumka

crumka commented Dec 22, 2018

@b-rad-NDi - So should I be sitting tight until 4.21 and re-test? Just want to confirm.

@b-rad-NDi
Owner

This should be fixed in the ppa as well as in mainline kernel now. Please re-open if the issue still exists.

@rg4github

Hi Brad,

I bumped into this thread after my lovely WinTV-QuadHD-ATSC stopped tuning this morning following a kernel update. Digging through kernel messages I found the exact error this fix was designed to solve. Since you mentioned changes related to this error made it into the mainline kernel, I'm wondering if those changes are what's killing my card.

A few important notes:

  1. I'm not running Ubuntu but Fedora 29.
  2. I'm not using a Ryzen CPU, but I am using an AMD one. Model name: AMD Athlon(tm) X4 740 Quad Core Processor. CPU family 21, model 16. Nothing super current, but it gets the job done.
  3. No problems with kernel 4.19.13-300.fc29.x86_64.
  4. Persistent failures with kernel 4.20.3-200.fc29.x86_64.

Could you help me figure out which change needs to be backed out, or made conditional? I'm happy to provide additional info, run tests or make changes, including compiling a fresh kernel if that helps.

Happy to open a fresh ticket for this as well, but I was hesitant to do so because I'm not running Ubuntu.

Thanks in advance for your help!

  • Richard.

@rg4github

Hello again :-)

I located the changes to the kernel module and noticed the addition of the cx23885 kernel module parameter dma_reset_workaround to force enable or disable the workaround. Nice!

Setting this parameter to forced on (options cx23885 dma_reset_workaround=2) solves the problem(?!) with kernel 4.20.3 and its kmods, so even though my AMD CPU is anything but current (late 2012) it needs the workaround as well. Interesting.
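
For anyone wanting to try the same setting, module options like this normally go in a file under /etc/modprobe.d/. A sketch, assuming that conventional location and a file name (cx23885.conf) chosen here purely for illustration; `write_cx23885_option` is a made-up helper. The value 2 (forced on) is the one reported above; the meaning of other values should be checked with `modinfo cx23885` rather than taken from this example.

```shell
#!/bin/sh
# Hypothetical helper: write the module option to a modprobe.d-style directory.
#   $1 = value for dma_reset_workaround
#   $2 = target directory (defaults to /etc/modprobe.d)
write_cx23885_option() {
    value=$1
    dir=${2:-/etc/modprobe.d}
    printf 'options cx23885 dma_reset_workaround=%s\n' "$value" > "$dir/cx23885.conf"
}

# Example (as root), followed by a driver reload or reboot:
#   write_cx23885_option 2
#   cat /sys/module/cx23885/parameters/dma_reset_workaround   # only if the parameter is exposed in sysfs
```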

Please let me know if you would like me to provide system info to possibly change the default detection rules.

Kind regards,

  • Richard.

@b-rad-NDi
Owner

Hi @rg4github, thanks for filing this information. It is very good to know. You'll have to determine your cpu/motherboard pcie id, then I can send in a patch so you can drop the module option.

https://openbenchmarking.org/system/1703021-RI-AMDZEN08075/Ryzen%207%201800X/lspci,

On Ryzen it is 0x1451; for your older system it must be different.
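
The vendor:device pair in question appears in square brackets at the end of each `lspci -nn` line. A sketch that pulls it out for lines matching a keyword; `pci_ids` is a made-up helper name, and the choice of keyword (for example IOMMU) is an assumption, since this comment does not say which device the driver keys on:

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -nn` output on stdin and print the
# [vendor:device] ID of every line that matches the given keyword.
pci_ids() {
    keyword=$1
    grep -i -- "$keyword" |
        sed -n 's/.*\[\([0-9a-fA-F]\{4\}:[0-9a-fA-F]\{4\}\)\].*/\1/p'
}

# Example: lspci -nn | pci_ids IOMMU
```

The part after the colon is the device ID, which is the number being compared against 0x1451.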

@rg4github

Hi @b-rad-NDi,

I think you only want the IOMMU info, which seems to be 1419, but I'm uploading all of the AMD elements from lspci -v -nn -k in case there's a clue hidden somewhere else ;-)

00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 10h-1fh) Processor Root Complex [1022:1410]
	Subsystem: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 10h-1fh) Processor Root Complex [1022:1410]
	Flags: bus master, 66MHz, medium devsel, latency 0

00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 10h-1fh) I/O Memory Management Unit [1022:1419]
	Subsystem: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 10h-1fh) I/O Memory Management Unit [1022:1419]
	Flags: bus master, fast devsel, latency 0, IRQ 24
	Capabilities: [40] Secure device <?>
	Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [64] HyperTransport: MSI Mapping Enable+ Fixed+

00:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 10h-1fh) Processor Root Port [1022:1412] (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 25
	Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
	I/O behind bridge: 0000e000-0000efff [size=4K]
	Memory behind bridge: fd000000-fe0fffff [size=17M]
	Prefetchable memory behind bridge: 00000000c0000000-00000000d1ffffff [size=288M]
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Root Port (Slot+), MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [b0] Subsystem: Advanced Micro Devices, Inc. [AMD] Trinity A-series APU [1022:1234]
	Capabilities: [b8] HyperTransport: MSI Mapping Enable+ Fixed+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Kernel driver in use: pcieport

00:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 10h-1fh) Processor Root Port [1022:1414] (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 26
	Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
	I/O behind bridge: 0000d000-0000dfff [size=4K]
	Memory behind bridge: fe600000-fe6fffff [size=1M]
	Prefetchable memory behind bridge: 00000000d8000000-00000000d80fffff [size=1M]
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Root Port (Slot+), MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [b0] Subsystem: Advanced Micro Devices, Inc. [AMD] Trinity A-series APU [1022:1234]
	Capabilities: [b8] HyperTransport: MSI Mapping Enable+ Fixed+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Kernel driver in use: pcieport

00:07.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 10h-1fh) Processor Root Port [1022:1417] (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 27
	Bus: primary=00, secondary=03, subordinate=06, sec-latency=0
	I/O behind bridge: None
	Memory behind bridge: fe200000-fe5fffff [size=4M]
	Prefetchable memory behind bridge: None
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Root Port (Slot+), MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [b0] Subsystem: Advanced Micro Devices, Inc. [AMD] Device [1022:1234]
	Capabilities: [b8] HyperTransport: MSI Mapping Enable+ Fixed+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Kernel driver in use: pcieport

00:10.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller [1022:7814] (rev 09) (prog-if 30 [XHCI])
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:5004]
	Flags: bus master, fast devsel, latency 0, IRQ 18
	Memory at fe106000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: [50] Power Management version 3
	Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [90] MSI-X: Enable+ Count=8 Masked-
	Capabilities: [a0] Express Root Complex Integrated Endpoint, MSI 00
	Capabilities: [100] Latency Tolerance Reporting
	Kernel driver in use: xhci_hcd

00:10.1 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller [1022:7814] (rev 09) (prog-if 30 [XHCI])
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:5004]
	Flags: bus master, fast devsel, latency 0, IRQ 17
	Memory at fe104000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: [50] Power Management version 3
	Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [90] MSI-X: Enable+ Count=8 Masked-
	Capabilities: [a0] Express Root Complex Integrated Endpoint, MSI 00
	Kernel driver in use: xhci_hcd

00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7801] (rev 40) (prog-if 01 [AHCI 1.0])
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:b002]
	Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 30
	I/O ports at f040 [size=8]
	I/O ports at f030 [size=4]
	I/O ports at f020 [size=8]
	I/O ports at f010 [size=4]
	I/O ports at f000 [size=16]
	Memory at fe10d000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: [50] MSI: Enable+ Count=1/8 Maskable- 64bit+
	Capabilities: [70] SATA HBA v1.0
	Kernel driver in use: ahci

00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller [1022:7807] (rev 11) (prog-if 10 [OHCI])
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:5004]
	Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 18
	Memory at fe10c000 (32-bit, non-prefetchable) [size=4K]
	Kernel driver in use: ohci-pci

00:12.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller [1022:7808] (rev 11) (prog-if 20 [EHCI])
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:5004]
	Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 17
	Memory at fe10b000 (32-bit, non-prefetchable) [size=256]
	Capabilities: [c0] Power Management version 2
	Capabilities: [e4] Debug port: BAR=1 offset=00e0
	Kernel driver in use: ehci-pci

00:13.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller [1022:7807] (rev 11) (prog-if 10 [OHCI])
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:5004]
	Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 18
	Memory at fe10a000 (32-bit, non-prefetchable) [size=4K]
	Kernel driver in use: ohci-pci

00:13.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller [1022:7808] (rev 11) (prog-if 20 [EHCI])
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:5004]
	Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 17
	Memory at fe109000 (32-bit, non-prefetchable) [size=256]
	Capabilities: [c0] Power Management version 2
	Capabilities: [e4] Debug port: BAR=1 offset=00e0
	Kernel driver in use: ehci-pci

00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:780b] (rev 16)
	Subsystem: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:780b]
	Flags: 66MHz, medium devsel
	Kernel driver in use: piix4_smbus
	Kernel modules: i2c_piix4, sp5100_tco

00:14.2 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] FCH Azalia Controller [1022:780d] (rev 01)
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:a002]
	Flags: bus master, slow devsel, latency 32, IRQ 16
	Memory at fe100000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [50] Power Management version 2
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel

00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:780e] (rev 11)
	Subsystem: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:780e]
	Flags: bus master, 66MHz, medium devsel, latency 0

00:14.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] FCH PCI Bridge [1022:780f] (rev 40) (prog-if 01 [Subtractive decode])
	Flags: bus master, VGA palette snoop, 66MHz, medium devsel, latency 64
	Bus: primary=00, secondary=07, subordinate=07, sec-latency=64
	I/O behind bridge: None
	Memory behind bridge: None
	Prefetchable memory behind bridge: d4000000-d7ffffff [size=64M]

00:14.5 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller [1022:7809] (rev 11) (prog-if 10 [OHCI])
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:5004]
	Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 18
	Memory at fe108000 (32-bit, non-prefetchable) [size=4K]
	Kernel driver in use: ohci-pci

00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 10h-1fh) Processor Function 0 [1022:1400]
	Flags: fast devsel

00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 10h-1fh) Processor Function 1 [1022:1401]
	Flags: fast devsel

00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 10h-1fh) Processor Function 2 [1022:1402]
	Flags: fast devsel

00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 10h-1fh) Processor Function 3 [1022:1403]
	Flags: fast devsel
	Capabilities: [f0] Secure device <?>
	Kernel driver in use: k10temp
	Kernel modules: k10temp

00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 10h-1fh) Processor Function 4 [1022:1404]
	Flags: fast devsel

00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h (Models 10h-1fh) Processor Function 5 [1022:1405]
	Flags: fast devsel

Happy to provide additional output!

Kind regards,

  • Richard.

@crumka

crumka commented Mar 30, 2019

@b-rad-NDi -- Hi. Finally got the chance to upgrade my kernel. Now on 5.0.5

uname -a

Linux mythtv-server 5.0.5-050005-generic #201903271212 SMP Wed Mar 27 16:14:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Issue is still active.

Edit: Spelling

@crumka

crumka commented May 9, 2019

I'm still super interested in a solution to this issue.

@rg4github

Hi @crumka,

Are you looking for a solution that does not require the dma_reset_workaround, or is that workaround not working for you?

@madscientist159

madscientist159 commented Jun 4, 2019

@b-rad-NDi I have some likely related information ... running the HVR-1800 in our POWER9 boxes shows the card attempts a DMA access from address 0 right before going down. POWER systems will fence off a card that attempts bad DMA, so instead of continuing with corrupted / random data (like I suspect the Ryzen boxes are) the card drops off the bus. This is good from a data integrity perspective, but bad from a continuity perspective as a reboot (or VM restart with PCIe passthrough) is needed.

Bottom line: It's not just Ryzen. It's probably more a factor of older Intel systems ignoring the bad DMA or just allowing it through with resultant undefined behavior that happened to keep things working in most cases. If we can figure out what's attempting a DMA to address 0 that would help.

I can provide full remote access to a test box if desired...

@b-rad-NDi
Owner

@madscientist159 : If you can give me the exact way to reproduce this then I will attempt to find a true fix. Thanks for this info, I have not seen anyone say this. This does sound very incorrect.

@madscientist159

@b-rad-NDi Yes, it's quite reproducible in a weak signal environment. The easiest way I have found is to install MythTV (set it up with channel scan etc.) and try to tune a weak channel -- even just setting a record rule and letting it start trying to record does the trick. Let it sit for a few hours and it'll eventually generate a bad DMA and the system will fence off the card.

I can set up a test box for you tomorrow with an antenna connected to provide the weak signals. Would that work?

@b-rad-NDi
Owner

So on weak signal mythtv closes the stream and tries again? Do you only see this on stream close? I believe others have encountered this issue midstream on a good signal, but maybe that detail was just never made super clear.
I was hoping you had some clues on where it was dying / where the null address is encountered etc. Do you have any instrumentation done? If I just had logs and info on where the null is encountered I can continue reviewing the driver. It must be a race condition in the vb2 buffer management.

@madscientist159

madscientist159 commented Jun 4, 2019

@b-rad-NDi It seems to happen either at stream open or stream close -- unfortunately a weak signal AFAIK could cause a close / reopen attempt?

Here's a log of the failure I saved earlier -- this one doesn't show the risc fault, but does show the invalid DMA. There's a bit of a race condition: either the tuner driver or the risc driver will be interrupted when the invalid DMA registers and the card is dropped, but I don't reliably see one or the other.

[ 9786.108891] dvb_frontend: dvb_frontend_get_frequency_limits: frequency interval: tuner: 48000000...860000000, frontend: 54000000...858000000
[ 9787.152742] EEH: Frozen PHB#30-PE#1fd detected
[ 9787.179679] EEH: PE location: N/A, PHB location: N/A
[ 9787.206609] CPU: 24 PID: 6161 Comm: kdvb-ad-3-fe-0 Not tainted 5.0.0-rc3+ #1
[ 9787.234136] Call Trace:
[ 9787.261290] [c0002003853075f0] [c00000000094067c] dump_stack+0xb0/0xf4 (unreliable)
[ 9787.289702] [c000200385307630] [c00000000003dba8] eeh_dev_check_failure+0x498/0x590
[ 9787.318347] [c0002003853076d0] [c00000000003dd2c] eeh_check_failure+0x8c/0xd0
[ 9787.346812] [c000200385307710] [c00800000b58ed04] i2c_wait_done+0xac/0x110 [cx23885]
[ 9787.375429] [c000200385307740] [c00800000b58f2c8] i2c_sendbytes+0x110/0x4b0 [cx23885]
[ 9787.404230] [c0002003853077f0] [c00800000b58f7b0] i2c_xfer+0x148/0x210 [cx23885]
[ 9787.432838] [c000200385307890] [c000000000729b44] __i2c_transfer+0x154/0x5d0
[ 9787.461504] [c0002003853078f0] [c00000000072a058] i2c_transfer+0x98/0x160
[ 9787.490140] [c000200385307970] [c00800000d390458] s5h1409_readreg.isra.9+0x70/0xd0 [s5h1409]
[ 9787.518928] [c000200385307a00] [c00800000d390d20] s5h1409_read_status+0x68/0x1f0 [s5h1409]
[ 9787.547828] [c000200385307a90] [c008000008cfd90c] dvb_frontend_swzigzag+0x144/0x380 [dvb_core]
[ 9787.577125] [c000200385307ca0] [c008000008cfe1e0] dvb_frontend_thread+0x698/0x8c0 [dvb_core]
[ 9787.606462] [c000200385307db0] [c000000000136080] kthread+0x160/0x1a0
[ 9787.635679] [c000200385307e20] [c00000000000bdd4] ret_from_kernel_thread+0x5c/0x68
[ 9787.665371] EEH: Detected PCI bus error on PHB#30-PE#1fd
[ 9787.667472] s5h1409_readreg: readreg error (ret == -5)
[ 9787.693882] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[ 9787.725487] s5h1409_writereg: error (reg == 0xf5, val == 0x0000, ret == -5)
[ 9787.753644] EEH: Notify device drivers to shutdown
[ 9787.787212] s5h1409_writereg: error (reg == 0xf5, val == 0x0001, ret == -5)
[ 9787.815314] EEH: Beginning: 'error_detected(IO frozen)'
[ 9787.815331] EEH: PE#1fd (PCI 0030:01:00.0): driver not EEH aware
[ 9787.849334] s5h1409_writereg: error (reg == 0xf4, val == 0x0000, ret == -5)
[ 9787.878244] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'none'
[ 9787.912396] s5h1409_writereg: error (reg == 0xf5, val == 0x0000, ret == -5)
[ 9787.942096] EEH: Collect temporary log
[ 9787.977575] s5h1409_writereg: error (reg == 0xf5, val == 0x0001, ret == -5)
[ 9788.008277] EEH: of node=0030:01:00.0
[ 9788.043877] s5h1409_writereg: error (reg == 0xf3, val == 0x0001, ret == -5)
[ 9788.074997] EEH: PCI device/vendor: 888014f1
[ 9788.111101] mt2131 I2C write failed (len=7)
[ 9788.142889] EEH: PCI cmd/status register: 10100146
[ 9788.142892] EEH: PCI-E capabilities and status follow:
[ 9788.179632] s5h1409_writereg: error (reg == 0xf3, val == 0x0000, ret == -5)
[ 9788.211238] EEH: PCI-E 00: 00018010 00000000 00002810 00015c11
[ 9788.248071] s5h1409_writereg: error (reg == 0xf5, val == 0x0000, ret == -5)
[ 9788.279767] EEH: PCI-E 10: 00110000 00000000 00000000 00000000
[ 9788.317109] s5h1409_writereg: error (reg == 0xf5, val == 0x0001, ret == -5)
[ 9788.349685] EEH: PCI-E 20: 00000000
[ 9788.349687] EEH: PCI-E AER capability register set follows:
[ 9788.349706] EEH: PCI-E AER 00: 20010001 00000000 00000000 00062010
[ 9788.563364] EEH: PCI-E AER 10: 00000000 00000000 00000140 00000000
[ 9788.598816] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 9788.633811] EEH: PCI-E AER 30: 00000000 00000000
[ 9788.668336] PHB4 PHB#48 Diag-data (Version: 1)
[ 9788.702861] brdgCtl:    00000002
[ 9788.736872] RootSts:    00060000 00402000 20110008 00100107 00000800
[ 9788.771189] PhbSts:     0000001c00000000 0000001c00000000
[ 9788.804858] Lem:        0000000100000080 0000000000000000 0000000000000080
[ 9788.838621] PhbErr:     0000028000000000 0000020000000000 2148000098000240 a008400000000000
[ 9788.872644] RxeTceErr:  2000000000000000 2000000000000000 c0000000000001fd 0000000000000000
[ 9788.906461] PblErr:     0000000000020000 0000000000020000 0000000000000000 0000000000000000
[ 9788.940101] RegbErr:    0040004000000000 0000004000000000 8800000400000000 0000000000000000
[ 9788.973461] PE[1fd] A/B: 8300b03800000000 8000000000000000
[ 9789.006612] EEH: Reset with hotplug activity
[ 9789.240961] cx23885 0030:01:00.0:  In cx23885_stop_dma()
[ 9789.451156] s5h1409_readreg: readreg error (ret == -5)
[ 9789.489073] s5h1409_writereg: error (reg == 0xf5, val == 0x0000, ret == -5)
[ 9789.526011] s5h1409_writereg: error (reg == 0xf5, val == 0x0001, ret == -5)
[ 9789.562421] s5h1409_writereg: error (reg == 0xf4, val == 0x0000, ret == -5)
[ 9789.598192] s5h1409_writereg: error (reg == 0xf5, val == 0x0000, ret == -5)
[ 9789.633428] s5h1409_writereg: error (reg == 0xf5, val == 0x0001, ret == -5)
[ 9789.667844] s5h1409_writereg: error (reg == 0xf3, val == 0x0001, ret == -5)
[ 9789.701599] mt2131 I2C write failed (len=7)
[ 9789.734416] s5h1409_writereg: error (reg == 0xf3, val == 0x0000, ret == -5)
[ 9789.767011] s5h1409_writereg: error (reg == 0xf5, val == 0x0000, ret == -5)
[ 9789.798659] s5h1409_writereg: error (reg == 0xf5, val == 0x0001, ret == -5)
[ 9790.298528] cx23885 0030:01:00.0:  delay=1000 reg1=0xffffffff reg2=0xffffffff
[ 9790.330425] cx23885 0030:01:00.0:  done!
[ 9790.396647] tda18271 16-0060: destroying instance
[ 9790.425640] iommu: Removing device 0030:01:00.0 from group 6
[ 9795.411657] EEH: Sleep 5s ahead of complete hotplug
[ 9800.714414] pci 0030:01:00.0: [14f1:8880] type 00 class 0x040000
[ 9800.714470] pci 0030:01:00.0: reg 0x10: [mem 0x620c000000000-0x620c0001fffff 64bit]
[ 9800.714657] pci 0030:01:00.0: supports D1 D2
[ 9800.714660] pci 0030:01:00.0: PME# supported from D0 D1 D2 D3hot
[ 9800.714889] pci 0030:01:00.0: disabling ASPM on pre-1.1 PCIe device.  You can enable it with 'pcie_aspm=force'
[ 9800.745726] pci 0030:01:00.0: BAR 0: assigned [mem 0x620c000000000-0x620c0001fffff 64bit]
[ 9800.775443] pci 0030:01     : [PE# 1fd] Secondary bus 1 associated with PE#1fd
[ 9800.805264] pci 0030:01     : [PE# 1fd] Setting up 32-bit TCE table at 0..80000000
[ 9800.841719] pci 0030:01     : [PE# 1fd] Setting up window#0 0..7fffffff pg=1000
[ 9800.870888] pci 0030:01     : [PE# 1fd] Enabling 64-bit DMA bypass
[ 9800.899889] pci 0030:00:00.0: PCI bridge to [bus 01]
[ 9800.928748] pci 0030:00:00.0:   bridge window [mem 0x620c000000000-0x620c07fefffff]
[ 9800.958503] cx23885 0030:01:00.0: enabling device (0140 -> 0142)
[ 9800.988458] cx23885: CORE cx23885[5]: subsystem: 0070:7801, board: Hauppauge WinTV-HVR1800 [card=2,autodetected]
[ 9801.423876] tveeprom: Hauppauge model 78521, rev C1E9, serial# 4031409315
[ 9801.457245] tveeprom: MAC address is 00:0d:fe:4a:6c:a3
[ 9801.490269] tveeprom: tuner model is Philips 18271_8295 (idx 149, type 54)
[ 9801.523149] tveeprom: TV standards NTSC(M) ATSC/DVB Digital (eeprom 0x88)
[ 9801.555905] tveeprom: audio processor is CX23887 (idx 42)
[ 9801.588891] tveeprom: decoder processor is CX23887 (idx 37)
[ 9801.621760] tveeprom: has radio
[ 9801.654894] cx23885: cx23885[5]: hauppauge eeprom: model=78521
[ 9801.697662] cx25840 17-0044: cx23887 A/V decoder found @ 0x88 (cx23885[5])
[ 9802.348899] cx25840 17-0044: loaded v4l-cx23885-avcore-01.fw firmware (16382 bytes)
[ 9802.405665] tuner: 16-0042: Tuner -1 found with type(s) Radio TV.
[ 9802.471459] tda829x 16-0042: could not clearly identify tuner address, defaulting to 60
[ 9802.527846] tda18271 16-0060: creating new instance
[ 9802.594926] tda18271: TDA18271HD/C1 detected @ 16-0060
[ 9802.958966] tda829x 16-0042: type set to tda8295+18271
[ 9803.951097] cx23885: cx23885[5]: registered device video6 [v4l2]
[ 9803.985472] cx23885: cx23885[5]: registered device vbi2
[ 9804.019208] cx23885: cx23885[5]: alsa: registered ALSA audio device
[ 9804.019298] cx23885: cx23885[5]: registered device video7 [mpeg]
[ 9804.053001] cx23885: Firmware and/or mailbox pointer not initialized or corrupted, signature = 0x14, cmd = PING_FW
[ 9806.146023] cx23885: cx23885_dvb_register() allocating 1 frontend(s)
[ 9806.183616] cx23885: cx23885[5]: cx23885 based dvb card
[ 9806.249129] MT2131: successfully identified at address 0x61
[ 9806.288213] dvbdev: DVB: registering new adapter (cx23885[5])
[ 9806.325652] cx23885 0030:01:00.0: DVB: registering adapter 3 frontend 0 (Samsung S5H1409 QAM/8VSB Frontend)...
[ 9806.365552] cx23885: cx23885_dev_checkrevision() Hardware revision = 0xb1
[ 9806.402735] cx23885: cx23885[5]/0: found at 0030:01:00.0, rev: 15, irq: 41, latency: 0, mmio: 0x620c000000000
[ 9806.441314] cx23885 0030:01:00.0: Using 32-bit DMA via iommu
[ 9806.480435] EEH: Notify device driver to resume
[ 9806.519903] EEH: Beginning: 'resume'
[ 9806.558830] EEH: PE#1fd (PCI 0030:01:00.0): driver not EEH aware
[ 9806.598040] EEH: Finished:'resume'
[ 9806.598041] EEH: Recovery successful.
[ 9817.583911] cx23885 0001:01:00.0:  In cx23885_stop_dma()
[ 9817.642985] cx23885 0001:01:00.0:  delay=0 reg1=0x00000000 reg2=0x00000000
[ 9817.682219] cx23885 0001:01:00.0:  done!
[ 9817.747968] dvb_frontend: dvb_frontend_get_frequency_limits: frequency interval: tuner: 48000000...860000000, frontend: 54000000...85800000

Decoded fault (PE[1fd] A/B: 8300b03800000000 8000000000000000) indicates attempted DMA read response to 0x0:

Transaction type: DMA Read Response
Invalid MMIO Address
TCE Page Fault
TCE Access Fault
LEM Bit Number 56
Requestor 00:0.0
MSI Data 0x0000
Fault Address = 0x0000000000000000
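
For anyone comparing several of these logs, a minimal Python sketch for pulling the EEH freeze events and the raw PE A/B words out of saved dmesg text might look like the following. The function names are made up for illustration and are not part of any EEH tooling, and the sketch deliberately does not decode the A/B bits, since that layout is not given here.

```python
import re

# Matches lines such as "[ 9787.152742] EEH: Frozen PHB#30-PE#1fd detected"
FREEZE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+EEH: Frozen PHB#([0-9a-fA-F]+)-PE#([0-9a-fA-F]+) detected"
)
# Matches lines such as "PE[1fd] A/B: 8300b03800000000 8000000000000000"
PE_AB = re.compile(
    r"PE\[([0-9a-fA-F]+)\] A/B:\s+([0-9a-fA-F]{16})\s+([0-9a-fA-F]{16})"
)


def find_eeh_freezes(dmesg_text):
    """Return (timestamp, phb, pe) for every EEH freeze line found."""
    return [(float(ts), phb, pe) for ts, phb, pe in FREEZE.findall(dmesg_text)]


def find_pe_ab(dmesg_text):
    """Return (pe, word_a, word_b) with the two words as raw integers (not decoded)."""
    return [(pe, int(a, 16), int(b, 16)) for pe, a, b in PE_AB.findall(dmesg_text)]


if __name__ == "__main__":
    import sys

    text = sys.stdin.read()  # e.g. dmesg | python3 this_script.py
    for ts, phb, pe in find_eeh_freezes(text):
        print(f"freeze at {ts:.6f}s on PHB#{phb} PE#{pe}")
    for pe, a, b in find_pe_ab(text):
        print(f"PE[{pe}] A={a:016x} B={b:016x}")
```

Lining up the freeze timestamps against stream open / close activity in the MythTV log is one way to check whether the bad DMA clusters around either event.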

@madscientist159

Here's another trace with debug enabled. Note that since the card failed with EEH all reads return 0xff (i.e. the PCIe standard response for MMIO reads with no device attached).

When I tried to debug this, the problem I ran into is that the DMA is asynchronous. So the invalid DMA read from 0x0 is initiated from the card, but where this actually ends up landing timing-wise in kernel code is somewhat random.

[  425.830238] EEH: Frozen PHB#0-PE#1 detected
[  425.830330] EEH: PE location: N/A, PHB location: N/A
[  425.830375] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.2.0-rc2+ #1
[  425.830428] Call Trace:
[  425.830455] [c00000000122b2c0] [c000000000a1d78c] dump_stack+0xb0/0xf4 (unreliable)
[  425.830520] [c00000000122b300] [c000000000045054] eeh_dev_check_failure+0x4b4/0x5e0
[  425.830585] [c00000000122b3a0] [c000000000045218] eeh_check_failure+0x98/0xe0
[  425.830677] [c00000000122b3e0] [c008000000cca244] cx23885_irq+0x96c/0xd30 [cx23885]
[  425.830742] [c00000000122b4f0] [c0000000001aa8dc] __handle_irq_event_percpu+0x9c/0x2f0
[  425.830807] [c00000000122b5b0] [c0000000001aab74] handle_irq_event_percpu+0x44/0xc0
[  425.830871] [c00000000122b5f0] [c0000000001aac64] handle_irq_event+0x74/0xc0
[  425.830935] [c00000000122b620] [c0000000001af86c] try_one_irq+0x11c/0x1a0
[  425.830988] [c00000000122b660] [c0000000001afa08] poll_spurious_irqs+0x118/0x170
[  425.831052] [c00000000122b6b0] [c0000000001ccd10] call_timer_fn+0x50/0x1f0
[  425.831106] [c00000000122b740] [c0000000001cd020] expire_timers+0x170/0x210
[  425.831159] [c00000000122b7b0] [c0000000001cd198] run_timer_softirq+0xd8/0x270
[  425.831226] [c00000000122b850] [c000000000a44204] __do_softirq+0x174/0x424
[  425.831281] [c00000000122b940] [c000000000124f78] irq_exit+0xd8/0x100
[  425.831335] [c00000000122b960] [c00000000002d558] timer_interrupt+0x128/0x2e0
[  425.831399] [c00000000122b9c0] [c000000000009498] decrementer_common+0x178/0x180
[  425.831466] --- interrupt: 901 at plpar_hcall_norets+0x1c/0x28
[  425.831466]     LR = check_and_cede_processor+0x48/0x60
[  425.831550] [c00000000122bcc0] [c00000007fe73400] 0xc00000007fe73400 (unreliable)
[  425.831614] [c00000000122bd20] [c0000000007f04d0] shared_cede_loop+0x50/0x120
[  425.831678] [c00000000122bd50] [c0000000007ecc14] cpuidle_enter_state+0xa4/0x660
[  425.831741] [c00000000122bdd0] [c0000000007ed270] cpuidle_enter+0x50/0x70
[  425.831795] [c00000000122be10] [c0000000001686a0] call_cpuidle+0x50/0x90
[  425.831849] [c00000000122be30] [c000000000168cfc] do_idle+0x35c/0x3d0
[  425.831903] [c00000000122bea0] [c000000000168fc8] cpu_startup_entry+0x38/0x40
[  425.831966] [c00000000122bed0] [c0000000000110b0] rest_init+0xe0/0xf8
[  425.832020] [c00000000122bf00] [c000000000cc41c0] start_kernel+0x674/0x6b4
[  425.832074] [c00000000122bf90] [c00000000000b774] start_here_common+0x1c/0x528
[  425.832158] cx23885: cx23885[0]: V4L mpeg risc op code error, status = 0xffffffff
[  425.832230] cx23885: cx23885[0]: TS1 B - dma channel status dump
[  425.832281] cx23885: cx23885[0]:   cmds: init risc lo   : 0xffffffff
[  425.832332] cx23885: cx23885[0]:   cmds: init risc hi   : 0xffffffff
[  425.832383] cx23885: cx23885[0]:   cmds: cdt base       : 0xffffffff
[  425.832434] cx23885: cx23885[0]:   cmds: cdt size       : 0xffffffff
[  425.832484] cx23885: cx23885[0]:   cmds: iq base        : 0xffffffff
[  425.832535] cx23885: cx23885[0]:   cmds: iq size        : 0xffffffff
[  425.832585] cx23885: cx23885[0]:   cmds: risc pc lo     : 0xffffffff
[  425.832635] cx23885: cx23885[0]:   cmds: risc pc hi     : 0xffffffff
[  425.832686] cx23885: cx23885[0]:   cmds: iq wr ptr      : 0xffffffff
[  425.832736] cx23885: cx23885[0]:   cmds: iq rd ptr      : 0xffffffff
[  425.832787] cx23885: cx23885[0]:   cmds: cdt current    : 0xffffffff
[  425.832837] cx23885: cx23885[0]:   cmds: pci target lo  : 0xffffffff
[  425.832890] cx23885: cx23885[0]:   cmds: pci target hi  : 0xffffffff
[  425.832968] cx23885: cx23885[0]:   cmds: line / byte    : 0xffffffff
[  425.833021] cx23885: cx23885[0]:   risc0:
[  425.833060] cx23885: cx23885[0]:   risc1:
[  425.833099] cx23885: cx23885[0]:   risc2:
[  425.833137] cx23885: cx23885[0]:   risc3:
[  425.833174] cx23885: cx23885[0]:   (0x00010630) iq 0:
[  425.833227] cx23885: cx23885[0]:   (0x00010634) iq 1:
[  425.833273] cx23885: cx23885[0]:   (0x00010638) iq 2:
[  425.833319] cx23885: cx23885[0]:   (0x0001063c) iq 3:
[  425.833365] cx23885: cx23885[0]:   (0x00010640) iq 4:
[  425.833412] cx23885: cx23885[0]:   (0x00010644) iq 5:
[  425.833458] cx23885: cx23885[0]:   (0x00010648) iq 6:
[  425.833504] cx23885: cx23885[0]:   (0x0001064c) iq 7:
[  425.833550] cx23885: cx23885[0]:   (0x00010650) iq 8:
[  425.833596] cx23885: cx23885[0]:   (0x00010654) iq 9:
[  425.833642] cx23885: cx23885[0]:   (0x00010658) iq a:
[  425.833688] cx23885: cx23885[0]:   (0x0001065c) iq b:
[  425.833734] cx23885: cx23885[0]:   (0x00010660) iq c:
[  425.833780] cx23885: cx23885[0]:   (0x00010664) iq d:
[  425.833826] cx23885: cx23885[0]:   (0x00010668) iq e:
[  425.833872] cx23885: cx23885[0]:   (0x0001066c) iq f:
[  425.833918] cx23885: cx23885[0]: fifo: 0x00005000 -> 0x6000
[  425.833959] cx23885: cx23885[0]: ctrl: 0x00010630 -> 0x10690
[  425.834010] cx23885: cx23885[0]:   ptr1_reg: 0xffffffff
[  425.834050] cx23885: cx23885[0]:   ptr2_reg: 0xffffffff
[  425.834091] cx23885: cx23885[0]:   cnt1_reg: 0xffffffff
[  425.834132] cx23885: cx23885[0]:   cnt2_reg: 0xffffffff
[  425.834176] cx23885: Firmware and/or mailbox pointer not initialized or corrupted, signature = 0xffffffff, cmd = GET_SEQ_END
[  425.834275] cx23885: cx23885[0]: mpeg risc op code error
[  425.834316] cx23885: cx23885[0]: TS2 C - dma channel status dump
[  425.834366] cx23885: cx23885[0]:   cmds: init risc lo   : 0xffffffff
[  425.834417] cx23885: cx23885[0]:   cmds: init risc hi   : 0xffffffff
[  425.834467] cx23885: cx23885[0]:   cmds: cdt base       : 0xffffffff
[  425.834517] cx23885: cx23885[0]:   cmds: cdt size       : 0xffffffff
[  425.834568] cx23885: cx23885[0]:   cmds: iq base        : 0xffffffff
[  425.834618] cx23885: cx23885[0]:   cmds: iq size        : 0xffffffff
[  425.834669] cx23885: cx23885[0]:   cmds: risc pc lo     : 0xffffffff
[  425.834719] cx23885: cx23885[0]:   cmds: risc pc hi     : 0xffffffff
[  425.834769] cx23885: cx23885[0]:   cmds: iq wr ptr      : 0xffffffff
[  425.834820] cx23885: cx23885[0]:   cmds: iq rd ptr      : 0xffffffff
[  425.834870] cx23885: cx23885[0]:   cmds: cdt current    : 0xffffffff
[  425.834921] cx23885: cx23885[0]:   cmds: pci target lo  : 0xffffffff
[  425.834972] cx23885: cx23885[0]:   cmds: pci target hi  : 0xffffffff
[  425.835022] cx23885: cx23885[0]:   cmds: line / byte    : 0xffffffff
[  425.835072] cx23885: cx23885[0]:   risc0:
[  425.835108] cx23885: cx23885[0]:   risc1:
[  425.835145] cx23885: cx23885[0]:   risc2:
[  425.835182] cx23885: cx23885[0]:   risc3:
[  425.835226] cx23885: cx23885[0]:   (0x00010670) iq 0:
[  425.835273] cx23885: cx23885[0]:   (0x00010674) iq 1:
[  425.835319] cx23885: cx23885[0]:   (0x00010678) iq 2:
[  425.835365] cx23885: cx23885[0]:   (0x0001067c) iq 3:
[  425.835411] cx23885: cx23885[0]:   (0x00010680) iq 4:
[  425.835457] cx23885: cx23885[0]:   (0x00010684) iq 5:
[  425.835503] cx23885: cx23885[0]:   (0x00010688) iq 6:
[  425.835550] cx23885: cx23885[0]:   (0x0001068c) iq 7:
[  425.835595] cx23885: cx23885[0]:   (0x00010690) iq 8:
[  425.835642] cx23885: cx23885[0]:   (0x00010694) iq 9:
[  425.835688] cx23885: cx23885[0]:   (0x00010698) iq a:
[  425.835734] cx23885: cx23885[0]:   (0x0001069c) iq b:
[  425.835780] cx23885: cx23885[0]:   (0x000106a0) iq c:
[  425.835826] cx23885: cx23885[0]:   (0x000106a4) iq d:
[  425.835872] cx23885: cx23885[0]:   (0x000106a8) iq e:
[  425.835919] cx23885: cx23885[0]:   (0x000106ac) iq f:
[  425.835965] cx23885: cx23885[0]: fifo: 0x00006000 -> 0x7000
[  425.836005] cx23885: cx23885[0]: ctrl: 0x00010670 -> 0x106d0
[  425.836056] cx23885: cx23885[0]:   ptr1_reg: 0xffffffff
[  425.836097] cx23885: cx23885[0]:   ptr2_reg: 0xffffffff
[  425.836152] cx23885: cx23885[0]:   cnt1_reg: 0xffffffff
[  425.836194] cx23885: cx23885[0]:   cnt2_reg: 0xffffffff
[  425.836241] cx23885: cx23885[0]: video risc op code error
[  425.836282] cx23885: cx23885[0]: VID A - dma channel status dump
[  425.836332] cx23885: cx23885[0]:   cmds: init risc lo   : 0xffffffff
[  425.836383] cx23885: cx23885[0]:   cmds: init risc hi   : 0xffffffff
[  425.836433] cx23885: cx23885[0]:   cmds: cdt base       : 0xffffffff
[  425.836484] cx23885: cx23885[0]:   cmds: cdt size       : 0xffffffff
[  425.836534] cx23885: cx23885[0]:   cmds: iq base        : 0xffffffff
[  425.836584] cx23885: cx23885[0]:   cmds: iq size        : 0xffffffff
[  425.836635] cx23885: cx23885[0]:   cmds: risc pc lo     : 0xffffffff
[  425.836685] cx23885: cx23885[0]:   cmds: risc pc hi     : 0xffffffff
[  425.836736] cx23885: cx23885[0]:   cmds: iq wr ptr      : 0xffffffff
[  425.836786] cx23885: cx23885[0]:   cmds: iq rd ptr      : 0xffffffff
[  425.836836] cx23885: cx23885[0]:   cmds: cdt current    : 0xffffffff
[  425.836887] cx23885: cx23885[0]:   cmds: pci target lo  : 0xffffffff
[  425.836937] cx23885: cx23885[0]:   cmds: pci target hi  : 0xffffffff
[  425.836988] cx23885: cx23885[0]:   cmds: line / byte    : 0xffffffff
[  425.837038] cx23885: cx23885[0]:   risc0:
[  425.837075] cx23885: cx23885[0]:   risc1:
[  425.837111] cx23885: cx23885[0]:   risc2:
[  425.837148] cx23885: cx23885[0]:   risc3:
[  425.837185] cx23885: cx23885[0]:   (0x000105b0) iq 0:
[  425.837240] cx23885: cx23885[0]:   (0x000105b4) iq 1:
[  425.837286] cx23885: cx23885[0]:   (0x000105b8) iq 2:
[  425.837332] cx23885: cx23885[0]:   (0x000105bc) iq 3:
[  425.837378] cx23885: cx23885[0]:   (0x000105c0) iq 4:
[  425.837424] cx23885: cx23885[0]:   (0x000105c4) iq 5:
[  425.837470] cx23885: cx23885[0]:   (0x000105c8) iq 6:
[  425.837516] cx23885: cx23885[0]:   (0x000105cc) iq 7:
[  425.837562] cx23885: cx23885[0]:   (0x000105d0) iq 8:
[  425.837608] cx23885: cx23885[0]:   (0x000105d4) iq 9:
[  425.837654] cx23885: cx23885[0]:   (0x000105d8) iq a:
[  425.837701] cx23885: cx23885[0]:   (0x000105dc) iq b:
[  425.837746] cx23885: cx23885[0]:   (0x000105e0) iq c:
[  425.837793] cx23885: cx23885[0]:   (0x000105e4) iq d:
[  425.837839] cx23885: cx23885[0]:   (0x000105e8) iq e:
[  425.837885] cx23885: cx23885[0]:   (0x000105ec) iq f:
[  425.837931] cx23885: cx23885[0]: fifo: 0x00000040 -> 0x2840
[  425.837971] cx23885: cx23885[0]: ctrl: 0x000105b0 -> 0x10610
[  425.838022] cx23885: cx23885[0]:   ptr1_reg: 0xffffffff
[  425.838062] cx23885: cx23885[0]:   ptr2_reg: 0xffffffff
[  425.838103] cx23885: cx23885[0]:   cnt1_reg: 0xffffffff
[  425.838144] cx23885: cx23885[0]:   cnt2_reg: 0xffffffff
[  425.838186] cx23885: cx23885[0]/1: Audio risc op code error
[  425.838231] cx23885: cx23885[0]: TV Audio - dma channel status dump
[  425.838282] cx23885: cx23885[0]:   cmds: init risc lo   : 0xffffffff
[  425.838333] cx23885: cx23885[0]:   cmds: init risc hi   : 0xffffffff
[  425.838383] cx23885: cx23885[0]:   cmds: cdt base       : 0xffffffff
[  425.838433] cx23885: cx23885[0]:   cmds: cdt size       : 0xffffffff
[  425.838484] cx23885: cx23885[0]:   cmds: iq base        : 0xffffffff
[  425.838534] cx23885: cx23885[0]:   cmds: iq size        : 0xffffffff
[  425.838585] cx23885: cx23885[0]:   cmds: risc pc lo     : 0xffffffff
[  425.838635] cx23885: cx23885[0]:   cmds: risc pc hi     : 0xffffffff
[  425.838686] cx23885: cx23885[0]:   cmds: iq wr ptr      : 0xffffffff
[  425.838736] cx23885: cx23885[0]:   cmds: iq rd ptr      : 0xffffffff
[  425.838786] cx23885: cx23885[0]:   cmds: cdt current    : 0xffffffff
[  425.838837] cx23885: cx23885[0]:   cmds: pci target lo  : 0xffffffff
[  425.838887] cx23885: cx23885[0]:   cmds: pci target hi  : 0xffffffff
[  425.838938] cx23885: cx23885[0]:   cmds: line / byte    : 0xffffffff
[  425.838988] cx23885: cx23885[0]:   risc0:
[  425.839025] cx23885: cx23885[0]:   risc1:
[  425.839061] cx23885: cx23885[0]:   risc2:
[  425.839097] cx23885: cx23885[0]:   risc3:
[  425.839134] cx23885: cx23885[0]:   (0x000106b0) iq 0:
[  425.839180] cx23885: cx23885[0]:   (0x000106b4) iq 1:
[  425.839236] cx23885: cx23885[0]:   (0x000106b8) iq 2:
[  425.839282] cx23885: cx23885[0]:   (0x000106bc) iq 3:
[  425.839328] cx23885: cx23885[0]:   (0x000106c0) iq 4:
[  425.839374] cx23885: cx23885[0]:   (0x000106c4) iq 5:
[  425.839420] cx23885: cx23885[0]:   (0x000106c8) iq 6:
[  425.839466] cx23885: cx23885[0]:   (0x000106cc) iq 7:
[  425.839525] cx23885: cx23885[0]:   (0x000106d0) iq 8:
[  425.839572] cx23885: cx23885[0]:   (0x000106d4) iq 9:
[  425.839618] cx23885: cx23885[0]:   (0x000106d8) iq a:
[  425.839664] cx23885: cx23885[0]:   (0x000106dc) iq b:
[  425.839710] cx23885: cx23885[0]:   (0x000106e0) iq c:
[  425.839756] cx23885: cx23885[0]:   (0x000106e4) iq d:
[  425.839802] cx23885: cx23885[0]:   (0x000106e8) iq e:
[  425.839849] cx23885: cx23885[0]:   (0x000106ec) iq f:
[  425.839895] cx23885: cx23885[0]: fifo: 0x00007000 -> 0x8000
[  425.839935] cx23885: cx23885[0]: ctrl: 0x000106b0 -> 0x10710
[  425.839985] cx23885: cx23885[0]:   ptr1_reg: 0xffffffff
[  425.840026] cx23885: cx23885[0]:   ptr2_reg: 0xffffffff
[  425.840067] cx23885: cx23885[0]:   cnt1_reg: 0xffffffff
[  425.840108] cx23885: cx23885[0]:   cnt2_reg: 0xffffffff
[  425.840273] EEH: Detected PCI bus error on PHB#0-PE#1
[  425.840321] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[  425.840410] EEH: Notify device drivers to shutdown
[  425.840457] EEH: Beginning: 'error_detected(IO frozen)'
[  425.840504] EEH: PE#1 (PCI 0000:00:02.0): driver not EEH aware
[  425.840556] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'none'
[  425.840628] EEH: Collect temporary log
[  425.840863] EEH: of node=0000:00:02.0
[  425.840904] EEH: PCI device/vendor: 888014f1
[  425.840956] EEH: PCI cmd/status register: 10100146
[  425.841002] EEH: PCI-E capabilities and status follow:
[  425.841104] EEH: PCI-E 00: 00018010 00000000 00002810 00015c11
[  425.841203] EEH: PCI-E 10: 00110000 00000000 00000000 00000000
[  425.841266] EEH: PCI-E 20: 00000000
[  425.841296] EEH: PCI-E AER capability register set follows:
[  425.841396] EEH: PCI-E AER 00: 20010001 00000000 00000000 00062010
[  425.841495] EEH: PCI-E AER 10: 00000000 00000000 00000154 04008001
[  425.841593] EEH: PCI-E AER 20: 0000050f 01000400 00000000 00000000
[  425.841656] EEH: PCI-E AER 30: 00000000 00000000
[  425.841705] EEH: Reset with hotplug activity
[  436.103443] tda18271 1-0060: destroying instance
[  436.103637] pci 0000:00:02.0: Removing from iommu group 0
[  436.107460] pci 0000:00:01.0: Removing from iommu group 0
[  436.114096] pci 0000:00:00.0: Removing from iommu group 0
[  443.182642] EEH: Sleep 5s ahead of complete hotplug
[  448.304631] pci 0000:00:00.0: No hypervisor support for SR-IOV on this device, IOV BARs disabled.
[  448.306728] pci 0000:00:01.0: No hypervisor support for SR-IOV on this device, IOV BARs disabled.
[  448.308235] pci 0000:00:02.0: No hypervisor support for SR-IOV on this device, IOV BARs disabled.
[  448.309547] pci 0000:00:02.0: supports D1 D2
[  448.309634] pci 0000:00:02.0: PME# supported from D0 D1 D2 D3hot
[  448.310390] pci 0000:00:00.0: Adding to iommu group 0
[  448.310513] pci 0000:00:01.0: Adding to iommu group 0
[  448.312873] pci 0000:00:02.0: Adding to iommu group 0
[  448.318524] cx23885: CORE cx23885[1]: subsystem: 0070:7801, board: Hauppauge WinTV-HVR1800 [card=2,autodetected]
[  448.724039] tveeprom: Hauppauge model 78521, rev C1E9, serial# 4029305914
[  448.724111] tveeprom: MAC address is 00:0d:fe:2a:54:3a
[  448.724152] tveeprom: tuner model is Philips 18271_8295 (idx 149, type 54)
[  448.724204] tveeprom: TV standards NTSC(M) ATSC/DVB Digital (eeprom 0x88)
[  448.724254] tveeprom: audio processor is CX23887 (idx 42)
[  448.724295] tveeprom: decoder processor is CX23887 (idx 37)
[  448.724336] tveeprom: has radio
[  448.724367] cx23885: cx23885[1]: hauppauge eeprom: model=78521
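
Since every register in the dump above reads back as 0xffffffff, those values say nothing about the card's real state at the time of the fault. A rough Python sketch for flagging such dumps when sorting through logs could look like this; the helper names are hypothetical and the regular expression only targets the `cmds:`, `ptrN_reg` and `cntN_reg` lines in the format shown in this trace.

```python
import re

# Register lines from the "dma channel status dump", e.g.
#   cx23885: cx23885[0]:   cmds: init risc lo   : 0xffffffff
#   cx23885: cx23885[0]:   ptr1_reg: 0xffffffff
REG_LINE = re.compile(
    r"cx23885\[\d+\]:\s+(cmds: [^:]+?|(?:ptr|cnt)\d_reg)\s*:\s*(0x[0-9a-f]{8})\s*$",
    re.MULTILINE,
)


def register_values(dmesg_text):
    """Return (register name, value) pairs found in the dump lines."""
    return [(name.strip(), int(value, 16)) for name, value in REG_LINE.findall(dmesg_text)]


def device_looks_gone(dmesg_text):
    """True when at least one register was dumped and all of them read all-ones."""
    values = register_values(dmesg_text)
    return bool(values) and all(v == 0xFFFFFFFF for _, v in values)
```

A dump where `device_looks_gone()` is true was most likely taken after the card had already been fenced off, so it is the EEH diagnostics rather than the RISC registers that carry the useful information.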

@crumka

crumka commented Jun 5, 2019

@rg4github -- I don't know which workaround you're referring to.

@rg4github

rg4github commented Jun 5, 2019

@crumka -- Maybe it doesn't apply to the issue you are experiencing, but the cx23885 module currently lets you force-enable or force-disable the DMA reset workaround via a module parameter:

dma_reset_workaround:periodic RiSC dma engine reset; 0-force disable, 1-driver detect (default), 2-force enable (int)

On my system I had to force-enable it, which I did by creating the file /etc/modprobe.d/Hauppauge-WinTV-QuadHD.conf with contents:

options cx23885 dma_reset_workaround=2

If I understand correctly this workaround was originally always in place, so perhaps you need the opposite using dma_reset_workaround=0. It may not help, but it's an easy thing to test.
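
To double-check which mode is actually in effect after a reboot, something like the Python sketch below may help. It assumes the parameter is exported at /sys/module/cx23885/parameters/dma_reset_workaround, which I have not verified on every kernel; the helper names are made up for illustration, and the mode descriptions are copied from the parameter text above.

```python
from pathlib import Path

# Mode meanings as given in the module parameter description.
MODES = {0: "force disable", 1: "driver detect (default)", 2: "force enable"}

# Assumed sysfs location; only present if the driver exports the parameter.
PARAM_PATH = Path("/sys/module/cx23885/parameters/dma_reset_workaround")


def modprobe_option(mode):
    """Build the modprobe.d line for the requested mode (0, 1 or 2)."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {sorted(MODES)}")
    return f"options cx23885 dma_reset_workaround={mode}"


def current_mode(param_path=PARAM_PATH):
    """Return the active mode as an int, or None if it cannot be read."""
    try:
        return int(Path(param_path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    mode = current_mode()
    if mode is None:
        print("cx23885 not loaded, or parameter not exported")
    else:
        print(f"dma_reset_workaround={mode} ({MODES.get(mode, 'unknown')})")
```

If the sysfs file is missing, `modinfo -p cx23885` should at least confirm whether the installed module knows the parameter at all.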

The latest findings by @madscientist159 may be leading us towards a real fix though.

@madscientist159

@b-rad-NDi Were the traces helpful at all, or would you like any additional information? It reliably goes down after tuning a weak signal a few times. I could see if there's a way to cause the crash with just azap, if that's easier for you to debug with.

@b-rad-NDi
Owner

These were helpful. I just have a lot of other priorities. If you have any additional logs with more info you can feel free to supply them, the more information the better. I'm going to try and prioritize this issue again.

@madscientist159

Thank you for the update -- just wanted to make sure you had something you could work with! I'm not familiar enough with this particular hardware to even know where to start looking for a bad DMA...
