Hardware testing found issue regarding Error Passive State #73

Closed
Deniz-Eren opened this issue Jul 10, 2024 · 1 comment · Fixed by #74

Deniz-Eren commented Jul 10, 2024

Describe the bug

Hardware testing with heavy message flooding results in IRQs stopping while the controller is in Error Passive State.

To Reproduce

Steps to reproduce the behaviour. Start the driver:

dev-can-linux -Ex -vvvvv

Then, from another console, run the following script to flood the driver with a large number of CAN-bus messages:

#!/bin/bash
# Write 100000 test messages to each of the two CAN devices
counter=1
while [ "$counter" -le 100000 ]
do
        echo -n test > /dev/can1/tx0
        echo -n test > /dev/can0/tx0
        ((counter++))
done
echo Test sequence complete

The driver soon gets overwhelmed with the messages and gets stuck in Error Passive State:

error warning interrupt
Controller changed from Error Active State (0) into Error Warning State (1).
netif_rx: adv_pci-can0: controller problems: 8
netif_rx: adv_pci-can0: TX error counter; tx:60, rx:0
error warning interrupt
Controller changed from Error Active State (0) into Error Warning State (1).
netif_rx: adv_pci-can1: controller problems: 8
netif_rx: adv_pci-can1: TX error counter; tx:70, rx:0
error passive interrupt
Controller changed from Error Warning State (1) into Error Passive State (2).
netif_rx: adv_pci-can0: controller problems: 20
netif_rx: adv_pci-can0: TX error counter; tx:80, rx:0
error passive interrupt
Controller changed from Error Warning State (1) into Error Passive State (2).
netif_rx: adv_pci-can1: controller problems: 20
netif_rx: adv_pci-can1: TX error counter; tx:80, rx:0
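
To confirm where each interface stopped, the last reported state can be pulled out of a captured log. The following is a minimal sketch, assuming the log has the same line layout as the excerpt above (a "Controller changed ..." line followed by a "netif_rx: <interface>: ..." line). The function name `last_can_states` is hypothetical and not part of dev-can-linux.

```shell
#!/bin/bash
# Print the last reported error state for each interface in a driver log.
# Assumes each "Controller changed ..." line is followed by a
# "netif_rx: <ifname>: ..." line naming the interface, as in the excerpt.
last_can_states() {
    awk '
        /^Controller changed from / {
            # Keep only the text between "into " and the trailing " (n)."
            state = $0
            sub(/^.* into /, "", state)
            sub(/ \([0-9]+\)\.$/, "", state)
            next
        }
        /^netif_rx: / && state != "" {
            # Second field is the interface name with a trailing colon
            ifname = $2
            sub(/:$/, "", ifname)
            last[ifname] = state
            state = ""
        }
        END { for (i in last) print i ": " last[i] }
    ' "$@" | sort
}
```

Called as `last_can_states driver.log` (or with the log piped on standard input), it prints one line per interface. For the excerpt above, both adv_pci-can0 and adv_pci-can1 would be reported as "Error Passive State".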

Expected behaviour

According to the SJA1000 documentation, a chip in Error Passive State should still receive messages and raise IRQs until the transmit error counter exceeds 255, at which point the chip should enter Bus-Off state. In our tests the chip never enters Bus-Off state: the last netif_rx message we receive reports Error Passive State, and then no further IRQs arrive.
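
For reference, the expected progression can be sketched as a small classifier over the two error counters. This is only an illustration of the standard CAN fault-confinement thresholds, not driver code: `can_error_state` is a hypothetical helper, the warning limit of 96 is the usual default and is configurable on the chip, and reading the logged values `tx:60` and `tx:80` as hexadecimal (96 and 128) is an assumption.

```shell
#!/bin/bash
# Classify a CAN controller state from its error counters.
# $1 = transmit error counter (TEC), $2 = receive error counter (REC)
can_error_state() {
    local tec=$1 rec=$2
    if [ "$tec" -gt 255 ]; then
        # Only the transmit counter can take a node to Bus-Off
        echo "bus-off"
    elif [ "$tec" -ge 128 ] || [ "$rec" -ge 128 ]; then
        echo "error-passive"
    elif [ "$tec" -ge 96 ] || [ "$rec" -ge 96 ]; then
        # 96 is the usual default warning limit (assumed here)
        echo "error-warning"
    else
        echo "error-active"
    fi
}

# Assumption: the counters in the log are printed in hexadecimal
can_error_state $((0x60)) 0   # prints "error-warning"
can_error_state $((0x80)) 0   # prints "error-passive"
can_error_state 256 0         # prints "bus-off"
```

Under that reading, the logged counters sit exactly on the warning and passive thresholds, and the missing step is the final transition once the transmit counter passes 255.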

Screenshots

dev-can-linux v1.3.4
Harmonized with Linux Kernel version 69
dev-can-linux comes with ABSOLUTELY NO WARRANTY; for details use option `-w'.
This is free software, and you are welcome to redistribute it
under certain conditions; option `-c' for details.
warning: release versions allow at max -vv option.
driver start (version: 1.3.4)
Auto detected device (13fe:00d7) successfully: (driver "adv_pci")
initializing device 13fe:00d7
read ssvid: 13fe
read ssid: 00d7
read cs: 0, slot: 0, func: 0, devfn: 0
read capability[2]: 0x10
capability 0x10 (PCIe) already enabled
PCIe version: 1
read capability[1]: 0x05
nirq: 8
capability 0x05 (MSI) Per Vector Masking (PVM) not supported
capability 0x05 (MSI) enabled
read ba[0] MEM { addr: df302000, size: 800 }
read ba[1] MEM { addr: df301000, size: 80 }
read ba[2] MEM { addr: df300000, size: 80 }
read irq[0]: 266
read irq[1]: 267
read irq[2]: 268
read irq[3]: 269
read irq[4]: 270
read irq[5]: 271
read irq[6]: 272
read irq[7]: 273
ioremap [df302000] mapping to [53da9db000] successful
reg_base=53da9db000 irq=266
setting BTR0=0x01 BTR1=0x1c
ioremap [df302400] mapping to [53da9dc400] successful
reg_base=53da9dc400 irq=266
setting BTR0=0x01 BTR1=0x1c
error warning interrupt
Controller changed from Error Active State (0) into Error Warning State (1).
netif_rx: adv_pci-can0: controller problems: 8
netif_rx: adv_pci-can0: TX error counter; tx:60, rx:0
error warning interrupt
Controller changed from Error Active State (0) into Error Warning State (1).
netif_rx: adv_pci-can1: controller problems: 8
netif_rx: adv_pci-can1: TX error counter; tx:70, rx:0
error passive interrupt
Controller changed from Error Warning State (1) into Error Passive State (2).
netif_rx: adv_pci-can0: controller problems: 20
netif_rx: adv_pci-can0: TX error counter; tx:80, rx:0
error passive interrupt
Controller changed from Error Warning State (1) into Error Passive State (2).
netif_rx: adv_pci-can1: controller problems: 20
netif_rx: adv_pci-can1: TX error counter; tx:80, rx:0

Platform

  • Target QNX architecture, x86_64
  • CAN-bus hardware device, Advantech (13fe:00d7)
  • Development environment, workspace
  • Version, QNX 7.1

Driver

  • Driver loaded, adv_pci
  • Branch: main
  • Version, 1.3.4

Additional context

If the chip had entered Bus-Off state, the current implementation would have performed a chip restart, which is the expected behaviour. The chip clearly cannot handle the amount of data provided, but we expected it to progress to Bus-Off and then be restarted in line with the restart_ms value of 50 ms that the driver was started with (the default).

We tested with a special implementation via canctl that pokes the driver to read the chip registers and check whether the chip is in Bus-Off state; it was not. The chip also reported that the IRQ system was still on. We were worried that the IRQ handler was missing an IRQ, but this test ruled that possibility out.

Other reports online suggest that people with different hardware and software have experienced the same behaviour in Error Passive State, with no recorded resolution.

@Deniz-Eren Deniz-Eren added the bug Something isn't working label Jul 10, 2024
@Deniz-Eren Deniz-Eren self-assigned this Jul 10, 2024
Deniz-Eren added a commit that referenced this issue Jul 10, 2024
- Implemented command-line option '-R' to allow for Bus error state recovery; particularly for Error Passive State issues.
- Removed legacy debug asserts in netif_wake_queue() and netif_stop_queue(); these were only active for Debug builds.
- Fixed a bug in command-line help option '-r'; it was still '-b' from prior changes.

Deniz-Eren commented Jul 10, 2024

Implemented command-line option -R to allow for Bus error state recovery; particularly for Error Passive State issues.

If this recovery via chip restart for Error Passive State is desired, then the recommended value for option -R is 128.

That is, start the driver with:

dev-can-linux -R128 ...
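
The comment does not spell out how the `-R` value is evaluated. Assuming it acts as an error-counter threshold at which the driver restarts the chip (128 being the count at which a CAN controller becomes error passive), the decision might look roughly like the sketch below. `needs_recovery` is a hypothetical helper and not part of dev-can-linux; treating a threshold of 0 as "recovery disabled" is also an assumption.

```shell
#!/bin/bash
# Hypothetical sketch: decide whether a chip restart is warranted.
# $1 = recovery threshold (assumed meaning of -R; 0 assumed to disable)
# $2 = transmit error counter, $3 = receive error counter
needs_recovery() {
    local threshold=$1 tec=$2 rec=$3
    [ "$threshold" -gt 0 ] && { [ "$tec" -ge "$threshold" ] || [ "$rec" -ge "$threshold" ]; }
}

# With a threshold of 128, a transmit counter of 128 triggers recovery
if needs_recovery 128 128 0; then
    echo "restart chip"
fi
```

With a threshold of 128, a controller that reaches Error Passive State would be restarted immediately instead of waiting for a Bus-Off transition that, in the tests above, never arrived.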
