mpeg risc op code error on Supermicro with WinTV-QuadHD-ATSC #51
Comments
|
Hey Mike. Ugh this issue :/ I will fire back up my ryzen system and resume this bug. I was having problems reproducing it after the last patches I did, but I don't think all mobos are equal with this bug. Has this end user updated their BIOS to the latest and greatest? That is the first thing I'd suggest. I will install myth if that easily triggers this issue. Any tips on how to repro this via myth would be appreciated. |
|
Hi Brad, BIOS is at version 2.1, which is the latest according to the Supermicro web site. The CPU is Intel(R) Atom(TM) CPU C2758 @ 2.40GHz (family: 0x6, model: 0x4d, stepping: 0x8).
The mythtv setup is basic: it just requires either a recording or LiveTV to be active on the failing system. Note that EIT scanning is enabled. I suspect that something like w_scan will reproduce the problem. Do you have a suitable incantation to use (I don't know anything about ATSC)? I can get the mythtv user to run some tests.
I recently had one mpeg risc error (non-fatal) on my ASUS STRIX B250F GAMING motherboard, BIOS 1205 05/11/2018, with CPU Intel(R) Pentium(R) CPU G4400 @ 3.30GHz (family: 0x6, model: 0x5e, stepping: 0x3).
I note that on the failing system the mpeg risc error line was not preceded by "cx23885 0000:05:00.0: dma in progress detected 0x00000001 0x00000001, clearing", which is what I see e.g. Mike |
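To compare systems on the point Mike raises (a risc error line with or without the "dma in progress detected ... clearing" line before it), a saved dmesg dump could be scanned with something like the sketch below. The two patterns are taken from the messages quoted in this thread; the exact wording may differ between kernel versions, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Substrings taken from messages quoted in this thread. The exact wording
# in other kernel versions is an assumption.
RISC_ERROR = re.compile(r"risc op code error", re.IGNORECASE)
DMA_CLEARING = re.compile(r"dma in progress detected .*clearing")


def classify_lines(lines):
    """Count how many log lines match each of the two patterns."""
    counts = Counter()
    for line in lines:
        if RISC_ERROR.search(line):
            counts["risc_error"] += 1
        if DMA_CLEARING.search(line):
            counts["dma_clearing"] += 1
    return counts


def risc_errors_without_clearing(lines):
    """Count risc errors whose immediately preceding line is not the
    'dma in progress detected ... clearing' message."""
    count = 0
    previous = ""
    for line in lines:
        if RISC_ERROR.search(line) and not DMA_CLEARING.search(previous):
            count += 1
        previous = line
    return count
```

Both functions accept any iterable of strings, so an open file holding saved `dmesg` output can be passed directly. Checking only the immediately preceding line is a simplification; unrelated messages interleaved by other drivers would be counted as "not preceded".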
|
Wait what, this isn't ryzen :-o Shyzer. It would be great if you could get me command lines that repro this. In my tests w_scan is usually not enough, because it doesn't generate enough interrupts during the scanning process to cause this. |
|
I do have a patch where this driver was converted to vb2 buffering system. The risc programs were changed at that point. I'm noodling over this. |
|
Hi Brad, given the user is seeing a lot of these mpeg risc errors, I think w_scan is worth a try; it just needs the appropriate incantation. I think it is w_scan -fa -A1 -c US -a n, where n is the adapter number. Unfortunately mythtv is not a command line application. Mike |
|
@b-rad-NDi, I'm the user experiencing the issue. I ran Logs are here: http://paste.ubuntu.com/p/xWw7CtSF5F/ MikeB2013 asked me for my Please let me know if/what else you need. |
|
Hi Brad, If you want to try setting up mythtv application here is my quick guide to setup There is a whole pile of formal documentation here https://www.mythtv.org/wiki/Configuring_MythTV Mike |
|
I'm gonna do this on my idle ryzen box. Its whole point in life was fixing this issue, so I guess it's back on duty. |
|
Brad - Would you want me to try to generate/post any more logs? I've been continuing to play with the system. Getting intermittent success with longer recordings (up to 3.3 GB) - but all fail with the same error(s). |
|
OK, so I installed myth using the instructions above on my ryzen system. I've left channels playing overnight as well as setting up four simultaneous recordings. I get risc op code error very early at boot, but they are related to the analog audio. I've not so far triggered one on a TS port. I'm going to continue to let it sit and queue up recordings. |
|
Brad - Thank you for the update. This is a single purpose machine (for now), so if you want me to test anything I'm happy to break/rebuild. If need be, we can discuss getting you ssh access, but that would be a new one for me so it might take some time, Just let me know. |
|
120GB of recordings and no error. Mythtv isn't auto cleaning though, so I need to figure some stuff out before I can let it go long term. |
|
Best I've ever achieved is 3.3 GB, but most fail within a few tens or hundreds of MB. |
|
@b-rad-NDi -- Been playing around with the server and got some interesting results. When I run a minecraft server at the same time as mythtv -- the error goes away and I can record. When I turn off the minecraft server instance the error returns. Could we have some resource going to sleep or a race condition? (I don't know much about these but have been reading. Please excuse me if these aren't productive comments.) |
|
Good finding, I've often suspected this could be related to system performance level. Check out cpu load. Perhaps this issue is related to cpu power scaling. To verify performance level: Note you might check all cpuN. You can spin up a few md5sum to do a different sort of verify. See how much cpu load the minecraft server adds, then tack on a couple of md5sum threads and see if you encounter the risc op code error. Then eventually... |
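One possible way to look at the performance level across all cpuN is to read the cpufreq entries in sysfs. This is only a sketch and not necessarily the command suggested above; it assumes the standard Linux `scaling_governor` and `scaling_cur_freq` files are present, and `read_cpufreq` is a hypothetical helper name.

```python
from pathlib import Path


def read_cpufreq(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: (governor, current_khz)} for every cpuN that
    exposes cpufreq information. CPUs without cpufreq are skipped."""
    result = {}
    for cpu_dir in sorted(Path(sysfs_root).glob("cpu[0-9]*")):
        governor_file = cpu_dir / "cpufreq" / "scaling_governor"
        current_file = cpu_dir / "cpufreq" / "scaling_cur_freq"
        if governor_file.is_file() and current_file.is_file():
            governor = governor_file.read_text().strip()
            current_khz = int(current_file.read_text().strip())
            result[cpu_dir.name] = (governor, current_khz)
    return result


if __name__ == "__main__":
    # Print one line per CPU, e.g. while a recording is running.
    for name, (governor, khz) in read_cpufreq().items():
        print(f"{name}: governor={governor} current={khz / 1000:.0f} MHz")
```

Running it once with the minecraft server active and once without would show whether the cores sit at a lower frequency or a different governor when the error appears.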
|
@b-rad-NDi - I've been out of town. Intending to get you this data tonight. |
|
Crumka? Did you ever try out those tests? The discussion on the mailing list right now is to apply the original ryzen dma patches only on ryzen systems, or others that exhibit the issue. This patch seems like it might be causing issues on other systems. |
|
A patch has been submitted to disable the "Ryzen dma engine stall" patch on non-Ryzen systems. It is very possible you have one of the platforms that is adversely affected by that "fix". Some platforms are fine, others have issues. |
|
I got distracted. Apologies. ---Without Minecraft--- ---With Minecraft--- ---md5sum--- |
|
@b-rad-NDi - So should I be sitting tight until 4.21 and re-test? Just want to confirm. |
|
This should be fixed in the ppa as well as in mainline kernel now. Please re-open if the issue still exists. |
|
Hi Brad, I bumped into this thread after my lovely WinTV-QuadHD-ATSC stopped tuning this morning following a kernel update. Digging through kernel messages I found the exact error this fix was designed to solve. Since you mentioned changes related to this error made it into the mainline kernel, I'm wondering if those changes are what's killing my card. A few important notes:
Could you help me figure out which change needs to be backed out, or made conditional? I'm happy to provide additional info, run tests or make changes, including compiling a fresh kernel if that helps. Happy to open a fresh ticket for this as well, but I was hesitant to do so because I'm not running Ubuntu. Thanks in advance for your help!
|
|
Hello again :-) I located the changes to the kernel module and noticed the addition of the cx23885 kernel module parameter dma_reset_workaround to force enable or disable the workaround. Nice! Setting this parameter to forced on (options cx23885 dma_reset_workaround=2) solves the problem(?!) with kernel 4.20.3 and its kmods, so even though my AMD CPU is anything but current (late 2012) it needs the workaround as well. Interesting. Please let me know if you would like me to provide system info to possibly change the default detection rules. Kind regards,
|
|
Hi @rg4github, thanks for filing this information. It is very good to know. You'll have to determine your cpu/motherboard pcie id, then I can send in a patch so you can drop the module option. https://openbenchmarking.org/system/1703021-RI-AMDZEN08075/Ryzen%207%201800X/lspci, On Ryzen it is 0x1451, for your older system it must be different. |
|
Hi @b-rad-NDi, I think you only want the IOMMU info, which seems to be 1419, but I'm uploading all of the AMD elements from lspci -v -nn -k in case there's a clue hidden somewhere else ;-) Happy to provide additional output! Kind regards,
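For anyone else who needs to report their platform's PCI IDs, the vendor and device IDs can also be read from sysfs instead of picking them out of lspci output. The sketch below assumes the usual `/sys/bus/pci/devices/<address>/vendor` and `device` files; 0x1022 is AMD's PCI vendor ID, and `pci_devices_by_vendor` is a hypothetical helper name. Which of the listed devices the driver should match on is for the maintainer to decide.

```python
from pathlib import Path

AMD_VENDOR_ID = 0x1022  # AMD's PCI vendor ID


def pci_devices_by_vendor(vendor_id, sysfs_root="/sys/bus/pci/devices"):
    """Return a sorted list of (pci_address, device_id) for all PCI
    devices whose vendor ID equals vendor_id."""
    found = []
    for dev_dir in sorted(Path(sysfs_root).glob("*")):
        vendor_file = dev_dir / "vendor"
        device_file = dev_dir / "device"
        if not (vendor_file.is_file() and device_file.is_file()):
            continue
        # sysfs stores the IDs as hex strings such as "0x1022".
        if int(vendor_file.read_text().strip(), 16) == vendor_id:
            device_id = int(device_file.read_text().strip(), 16)
            found.append((dev_dir.name, device_id))
    return found


if __name__ == "__main__":
    for address, device_id in pci_devices_by_vendor(AMD_VENDOR_ID):
        print(f"{address}: device id 0x{device_id:04x}")
```

On the Ryzen system mentioned above, the output would be expected to include a device with ID 0x1451; on the older AMD system, 1419 was reported for the IOMMU.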
|
|
@b-rad-NDi -- Hi. Finally got the chance to upgrade my kernel. Now on 5.0.5
Issue is still active. Edit: Spelling |
|
I'm still super interested in a solution to this issue. |
|
Hi @crumka, Are you looking for a solution that does not require the dma_reset_workaround, or is that workaround not working for you? |
|
@b-rad-NDi I have some likely related information ... running the HD-PVR 1800 in our POWER9 boxes shows the card attempts a DMA access from address 0 right before going down. POWER systems will fence off a card that attempts bad DMA, so instead of continuing with corrupted / random data (like I suspect the Ryzen boxes are) the card drops off the bus. This is good from a data integrity perspective, but bad from a continuity perspective as a reboot (or VM restart with PCIe passthrough) is needed. Bottom line: It's not just Ryzen. It's probably more a factor of older Intel systems ignoring the bad DMA or just allowing it through with resultant undefined behavior that happened to keep things working in most cases. If we can figure out what's attempting a DMA to address 0 that would help. I can provide full remote access to a test box if desired... |
|
@madscientist159 : If you can give me the exact way to reproduce this then I will attempt to find a true fix. Thanks for this info, I have not seen anyone say this. This does sound very incorrect. |
|
@b-rad-NDi Yes, it's quite reproducible in a weak signal environment. The easiest way I have found is to install MythTV (set it up with channel scan etc.) and try to tune a weak channel -- even just setting a record rule and letting it start trying to record does the trick. Let it sit for a few hours and it'll eventually generate a bad DMA and the system will fence off the card. I can set up a test box for you tomorrow with an antenna connected to provide the weak signals. Would that work? |
|
So on weak signal mythtv closes the stream and tries again? Do you only see this on stream close? I believe others have encountered this issue midstream on a good signal, but maybe that detail was just never made super clear. |
|
@b-rad-NDi It seems to happen either at stream open or stream close -- unfortunately a weak signal AFAIK could cause a close / reopen attempt? Here's a log of the failure I saved earlier -- this one doesn't show the risc fault, but does show the invalid DMA. There's a bit of a race condition, either the tuner driver or the risc driver will be interrupted when the invalid DMA registers and the card is dropped, but I don't reliably see one or the other. Decoded fault (PE[1fd] A/B: 8300b03800000000 8000000000000000) indicates attempted DMA read response to 0x0: |
|
Here's another trace with debug enabled. Note that since the card failed with EEH all reads return 0xff (i.e. the PCIe standard response for MMIO reads with no device attached). When I tried to debug this, the problem I ran into is that the DMA is asynchronous. So the invalid DMA read from 0x0 is initiated from the card, but where this actually ends up landing timing-wise in kernel code is somewhat random. |
|
@rg4github -- I don't know which work around you're referring to. |
|
@crumka -- Maybe it doesn't apply to the issue you are experiencing, but the cx23885 module currently lets you force enable or force disable the DMA reset workaround via parameter:
On my system I had to force-enable it, which I did by creating the file /etc/modprobe.d/Hauppauge-WinTV-QuadHD.conf with contents:
If I understand correctly this workaround was originally always in place, so perhaps you need the opposite using dma_reset_workaround=0. It may not help, but it's an easy thing to test. The latest findings by @madscientist159 may be leading us towards a real fix though, |
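To confirm which setting the loaded module is actually using, the parameter can be read back from sysfs. This is a sketch under assumptions: it expects the parameter to be exported at `/sys/module/cx23885/parameters/dma_reset_workaround`, the values 0 (force disable) and 2 (force enable) come from this thread, and treating 1 as "driver detect (default)" is an assumption that should be checked against `modinfo cx23885`. The helper name is hypothetical.

```python
from pathlib import Path

# 0 and 2 are described in this thread. The meaning of 1 is an assumption;
# verify it with `modinfo cx23885` on the running kernel.
MEANINGS = {
    0: "force disable",
    1: "driver detect (default)",
    2: "force enable",
}

DEFAULT_PARAM = "/sys/module/cx23885/parameters/dma_reset_workaround"


def describe_workaround(param_path=DEFAULT_PARAM):
    """Return (value, meaning) for the module parameter, or None if the
    file is missing (module not loaded, or a kernel without the option)."""
    path = Path(param_path)
    if not path.is_file():
        return None
    value = int(path.read_text().strip())
    return value, MEANINGS.get(value, "unknown")


if __name__ == "__main__":
    result = describe_workaround()
    if result is None:
        print("dma_reset_workaround not found; is cx23885 loaded?")
    else:
        print(f"dma_reset_workaround={result[0]} ({result[1]})")
```

After adding a line such as `options cx23885 dma_reset_workaround=2` under /etc/modprobe.d and reloading the module, the script should report the new value.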
|
@b-rad-NDi Were the traces helpful at all, or would you like any additional information? It reliably goes down after tuning a weak signal a few times. I could see if there's a way to cause the crash with just azap, if that's easier for you to debug with? |
|
These were helpful. I just have a lot of other priorities. If you have any additional logs with more info you can feel free to supply them, the more information the better. I'm going to try and prioritize this issue again. |
|
Thank you for the update -- just wanted to make sure you had something you could work with! I'm not familiar enough with this particular hardware to even know where to start looking for a bad DMA... |
Hi Brad,
I am trying to help out a user with mythtv application, who is getting mpeg risc op code errors, which are fatal.
Operating System - Ubuntu 18.04.01 LTS (Server)
uname -a output: Linux mythtv-server 4.15.0-29201807270420-generic #0+mediatree+hauppauge-Ubuntu SMP Fri Jul 27 18:09:45 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
System - SUPERMICRO SYS-5018A-FTN4 1U Rackmount Server / 16 GB RAM / 2 TB HD WD Purple @ 5400 rpm
Hauppauge WinTV-QuadHD-ATSC [card=57,autodetected], Hauppauge model 165100, rev B4I6, serial# 4036040160
The dmesg output is http://paste.ubuntu.com/p/zGgVPGBfvk/
The mythtv forum thread is https://forum.mythtv.org/viewtopic.php?p=13661#p13661
Any thoughts on how to debug this?
Mike