Investigate AER issues #18
Another data point: Disabling AER kills the system when doing the benchmark |
With the new PCIe stack in master right now (currently fa813ab) stress-testing reports 48 Mbit/s with and without AER. So the AER hang seems to be gone at least. lspci still reports:
|
After flashing the same build artifact and power cycling, the AER issues are back. I've tried programming, cold booting, and various combinations without success. The errors seem to be missing completions, since CmpltTO is logged. The stats on the FPGA are:
|
Trying to investigate spurious RxErr, BadTLP, and BadDLLP. Running my test Tyan S7106 board with commit 571a297 while poking the
Right now I'm running a test in Gen 2 x8 to see if that makes the RxErr etc. disappear. The card is up with two ports looped back to back (interfaces set to UP). Cooling is provided by fans blowing over the card, but the built-in fan is not active. The reported temperature appears steady at 46 degrees. |
stress.sh:

    #!/bin/bash
    i=0
    while true
    do
        v=$(cat '/sys/devices/pci0000:b2/0000:b2:00.0/0000:b3:00.0/phy_freq')
        if [[ "$v" -lt 100000000 ]] || [[ "$v" -gt 110000000 ]]; then
            echo "[$(date)] Got weird value: $v"
        fi
        i=$((i+1))
        if [[ "$((i % 100))" == "0" ]]; then
            echo -ne "Round $i\r"
        fi
    done

    #!/bin/bash
    diff -u a <(sudo lspci -s b3:00.0 -vvnn)
|
Test started at Tue Aug 18 21:15:41 2020 UTC |
Test ended successfully at Wed Aug 19 08:05:59 2020 UTC. Temperature 46 degrees. No errors reported. lspci remains stable. dmesg:
Driver at f950e02 using the following diff:
lspci was:
|
Powering off the test motherboard and powering it up again produced this lspci diff:
--- a 2020-08-18 21:16:14.688358710 +0000
+++ /dev/fd/63 2020-08-19 08:10:48.856247296 +0000
@@ -1,13 +1,13 @@
b3:00.0 Fibre Channel [0c04]: Device [f1c0:0de5] (rev 01)
Subsystem: Device [f1c0:0de5]
- Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
+ Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
- Interrupt: pin A routed to IRQ 62
+ Interrupt: pin A routed to IRQ 11
NUMA node: 0
Region 0: Memory at fbe00000 (32-bit, non-prefetchable) [size=64K]
- Capabilities: [50] MSI: Enable+ Count=16/32 Maskable- 64bit+
- Address: 00000000fee00218 Data: 0000
+ Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
+ Address: 0000000000000000 Data: 0000
Capabilities: [78] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
@@ -22,7 +22,7 @@
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
- LnkSta: Speed 5GT/s (downgraded), Width x8 (ok)
+ LnkSta: Speed 8GT/s (ok), Width x8 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range A, TimeoutDis+, NROPrPrP-, LTR-
10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
@@ -53,10 +53,9 @@
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP+ FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
+ CESta: RxErr+ BadTLP- BadDLLP+ Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
- Kernel driver in use: fejkon
It is clear that the link is now running at Gen 3, but the correctable error status (CESta) already reports RxErr and BadDLLP. |
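For comparing runs like the ones above, it can help to reduce the CESta line to just the flags that are set. The following is a minimal sketch, not part of the fejkon repository: the function name cesta_set_flags is made up for illustration, and the sample line is copied from the diff above.

```shell
#!/bin/bash
# Print the names of all flags marked "+" on a CESta line from `lspci -vv`.
# cesta_set_flags is a hypothetical helper, not an existing tool.
cesta_set_flags() {
    local line="$1" word out=()
    # Drop everything up to and including "CESta:", then split on whitespace.
    for word in ${line#*CESta:}; do
        [[ "$word" == *+ ]] && out+=("${word%+}")
    done
    echo "${out[*]}"
}

# Sample line copied from the lspci diff taken after the power cycle.
sample='CESta: RxErr+ BadTLP- BadDLLP+ Rollover- Timeout- AdvNonFatalErr-'
flags=$(cesta_set_flags "$sample")
echo "Correctable errors logged: ${flags:-none}"
```

On the live system the same function could be fed with something like cesta_set_flags "$(sudo lspci -s b3:00.0 -vv | grep CESta)", assuming the device is still at b3:00.0.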
Loading the module updates the diff to:
--- a 2020-08-18 21:16:14.688358710 +0000
+++ /dev/fd/63 2020-08-19 08:13:42.850777753 +0000
@@ -22,7 +22,7 @@
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
- LnkSta: Speed 5GT/s (downgraded), Width x8 (ok)
+ LnkSta: Speed 8GT/s (ok), Width x8 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range A, TimeoutDis+, NROPrPrP-, LTR-
10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
@@ -53,7 +53,7 @@
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP+ FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
+ CESta: RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
I.e. it seems that the speed test at the beginning triggered a BadTLP error. dmesg at this point:
Starting stress test at Wed 19 Aug 2020 08:14:50 AM UTC. Temperature still 46 degrees. Gen 3 x8 active. |
Test failed at Wed Aug 19 08:20:46 2020 UTC (after 6 min). New test started at Wed Aug 19 08:36:09 2020 UTC. |
AER Timeout + endpoint AER RxErr reported at 08:49 UTC. |
Crash at Wed Aug 19 08:57:05 2020 UTC. Host stuck in BIOS with red angry led shining. |
Consider this done. AER works and stress testing using Gen 2 x8 on commit 571a297 works. |
There is a setpci command that can clear the RxErr. Does anyone happen to know what it is? |
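One possible answer, untested on this card: in the PCIe AER capability the Correctable Error Status register sits at offset 0x10 and its bits are write-1-to-clear, with RxErr at bit 0, BadTLP at bit 6 and BadDLLP at bit 7. Reasonably recent pciutils versions let setpci address the AER extended capability by the name ECAP_AER. The sketch below only builds and prints the command rather than running it; the device address b3:00.0 is taken from the lspci output earlier in this thread, and clearing all three bits at once is just an assumption about what is wanted.

```shell
#!/bin/bash
# Bit positions in the AER Correctable Error Status register (PCIe spec).
RXERR=$((1 << 0))
BADTLP=$((1 << 6))
BADDLLP=$((1 << 7))

# Writing a 1 to a status bit clears it (RW1C), so OR together the bits to clear.
mask=$(printf '0x%08x' $((RXERR | BADTLP | BADDLLP)))

# Assumption: the card is still at b3:00.0 and setpci knows ECAP_AER.
cmd="setpci -s b3:00.0 ECAP_AER+0x10.L=$mask"
echo "$cmd"
```

Running the printed command as root and then re-running the lspci diff should show whether CESta went back to all minus signs. To clear only RxErr, the value would be 0x00000001.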
On a cold-boot, a modprobe fejkon produces this:
I.e. we have triggered BadTLP+, I guess?