Investigate AER issues #18

Open
bluecmd opened this issue Dec 15, 2019 · 3 comments

@bluecmd bluecmd commented Dec 15, 2019

On a cold boot, running modprobe fejkon changes the AER status like this:

 		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
 		UEMsk:	DLP- SDES- TLP+ FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
 		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
-		CESta:	RxErr+ BadTLP- BadDLLP+ Rollover- Timeout- NonFatalErr-
+		CESta:	RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout- NonFatalErr+
 		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+

I.e. we have triggered BadTLP (and NonFatalErr along with it), I guess?
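To make this kind of before/after comparison less error-prone, here is a minimal sketch that parses two lspci AER flag lines and reports which flags changed. The helper names `parse_flags` and `changed_flags` are hypothetical, and it assumes the `Name+`/`Name-` layout shown in the output above.

```python
import re

def parse_flags(line):
    """Parse an lspci AER line such as 'CESta:\tRxErr+ BadTLP-' into {name: bool}."""
    # Everything before the first ':' is the register name (plus any diff marker).
    _, _, rest = line.partition(":")
    return {name: sign == "+" for name, sign in re.findall(r"(\w+)([+-])", rest)}

def changed_flags(before, after):
    """Return {name: (old, new)} for every flag whose state differs."""
    old, new = parse_flags(before), parse_flags(after)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }

# The two CESta lines from the diff above (before and after modprobe).
before = "CESta:\tRxErr+ BadTLP- BadDLLP+ Rollover- Timeout- NonFatalErr-"
after = "CESta:\tRxErr+ BadTLP+ BadDLLP+ Rollover- Timeout- NonFatalErr+"

print(changed_flags(before, after))
# {'BadTLP': (False, True), 'NonFatalErr': (False, True)}
```

For the two lines above this reports BadTLP and NonFatalErr as the only flags that flipped.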

@bluecmd bluecmd commented Dec 15, 2019

Another data point: disabling AER kills the system when running the benchmark.

@bluecmd bluecmd commented Jan 7, 2020

With the new PCIe stack in master right now (currently fa813ab), stress-testing reports 48 Mbit/s both with and without AER. So the AER hang seems to be gone, at least.

lspci still reports:

	Capabilities: [800 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP+ FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
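
As a reading aid for the dump above, the following sketch lists the correctable errors that are set in CESta and not masked in CEMsk. The function names are hypothetical, and it assumes lspci's convention that `+` means the bit is set (for CEMsk, that the error is masked).

```python
import re

def parse_flags(line):
    """Parse an lspci AER line such as 'CEMsk:\tRxErr- BadTLP-' into {name: bool}."""
    _, _, rest = line.partition(":")
    return {name: sign == "+" for name, sign in re.findall(r"(\w+)([+-])", rest)}

def unmasked_errors(status_line, mask_line):
    """Return names of errors whose status bit is set and whose mask bit is clear."""
    status = parse_flags(status_line)
    mask = parse_flags(mask_line)
    return [name for name, is_set in status.items()
            if is_set and not mask.get(name, False)]

# CESta and CEMsk as reported by lspci above.
cesta = "CESta:\tRxErr+ BadTLP+ BadDLLP+ Rollover- Timeout- NonFatalErr-"
cemsk = "CEMsk:\tRxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+"

print(unmasked_errors(cesta, cemsk))
# ['RxErr', 'BadTLP', 'BadDLLP']
```

With these values, RxErr, BadTLP and BadDLLP are all set and unmasked; only NonFatalErr is masked.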

@bluecmd bluecmd commented Jan 7, 2020

After flashing the same build artifact and power cycling, the AER issues are back. I've tried programming, cold-booting, and various combinations, without success.

The errors seem to be missing completions, since CmpltTO is logged.

The stats on the FPGA are:

% pcie
 My ID                : 0xb300
 TLP RX               : 0x00000001
 TLP Unsupported RX   : 0x00000000
 TLP TX Data          : 0x00000000
 TLP TX Instant       : 0x00000000
 TLP TX Response      : 0x00000001
 Last TLP RX          : 0x00000001 0xb200000f 0xfbe00004 0x00000000 0xffffffff 0xffffffff 0xffffffff 0xffffffff
 Last TLP TX Data     : 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
 Last TLP TX Instant  : 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
 Last TLP TX Response : 0x4a000001 0xb3000004 0xb2000004 0xdeadbeef 0xffffffff 0xffffffff 0xffffffff 0xffffffff
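
To help read the two non-zero TLPs in the stats above, here is a rough decoder sketch for 3DW memory requests and completions. It assumes the stats print the header DWORDs in order (DW0, DW1, DW2) using the standard PCIe bit layout with no byte swapping; `decode_tlp_header` is a hypothetical helper, not part of fejkon.

```python
def decode_tlp_header(dwords):
    """Decode a 3DW PCIe TLP header (memory request or completion) into a dict."""
    dw0, dw1, dw2 = dwords[:3]
    fmt = (dw0 >> 29) & 0x7          # 0b000 = 3DW no data, 0b010 = 3DW with data
    typ = (dw0 >> 24) & 0x1F
    length = (dw0 & 0x3FF) or 1024   # a length field of 0 encodes 1024 DW
    info = {"fmt": fmt, "type": typ, "length_dw": length}

    if typ == 0b00000 and fmt in (0b000, 0b010):
        # Memory read/write request with a 32-bit address.
        info["kind"] = "MRd" if fmt == 0b000 else "MWr"
        info["requester_id"] = dw1 >> 16
        info["tag"] = (dw1 >> 8) & 0xFF
        info["last_be"] = (dw1 >> 4) & 0xF
        info["first_be"] = dw1 & 0xF
        info["address"] = dw2 & 0xFFFFFFFC
    elif typ == 0b01010 and fmt in (0b000, 0b010):
        # Completion, with (CplD) or without (Cpl) data.
        info["kind"] = "CplD" if fmt == 0b010 else "Cpl"
        info["completer_id"] = dw1 >> 16
        info["status"] = (dw1 >> 13) & 0x7
        info["byte_count"] = dw1 & 0xFFF
        info["requester_id"] = dw2 >> 16
        info["tag"] = (dw2 >> 8) & 0xFF
        info["lower_address"] = dw2 & 0x7F
        if fmt == 0b010:
            info["data"] = list(dwords[3:3 + length])
    else:
        info["kind"] = "other"
    return info

# First four DWORDs of "Last TLP RX" and "Last TLP TX Response" from the stats.
rx = decode_tlp_header([0x00000001, 0xB200000F, 0xFBE00004, 0x00000000])
tx = decode_tlp_header([0x4A000001, 0xB3000004, 0xB2000004, 0xDEADBEEF])

for name, tlp in (("RX", rx), ("TX response", tx)):
    fields = {k: (hex(v) if isinstance(v, int) else [hex(d) for d in v])
              for k, v in tlp.items() if k != "kind"}
    print(name, tlp["kind"], fields)
```

Under those assumptions, the received TLP decodes as a 1-DW memory read from requester 0xb200 at address 0xfbe00004 with tag 0, and the response decodes as a CplD from completer 0xb300 (matching "My ID") to requester 0xb200 with tag 0, byte count 4, lower address 0x04 and payload 0xdeadbeef. The header fields alone do not show why completion timeouts would be logged.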