-
Notifications
You must be signed in to change notification settings - Fork 7.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ESP32 CAN controller delivers corrupted frames on RX FIFO overrun (IDFGH-2114) #4276
Comments
Not sure where you got that information from. But the CAN controller in the ESP32 has a similar register interface to the SJA1000 but NOT fully compatible. There are a few missing features and behavioral differences.
Both are true. Basically when bytes are received, they are written to the FIFO directly, and an overflow is not detected until the 64th byte is written. The bytes of the overflowing message will remain in the FIFO. The RMC should count all messages received (up to 64 messages) regardless of whether they were overflowing or not. Basically, what should happen is that whenever you release the receiver buffer, and the buffer window shifts to an overflowed message, the Data overrun interrupt will be set. If that is the case, the message contents should be ignored, the clear data overrun command set, and the receiver buffer released again. Continue this process until RMC reaches zero, or until the buffer window rotates to a valid message. But you're right, it appears that the CAN driver doesn't handle buffer overflow case yet (it's marked as a todo). I'll push a commit to handle this case. |
Thanks for the clarification & feedback. The SJA1000 info comes from Thomas Barth (see link). Can you provide a spec sheet / register documentation for the ESP32 CAN controller, or some documentation on the missing features & differences? |
Three further questions:
|
See comment. Info source: espressif/esp-idf#4276 (comment) Further experimentation showed looping on the RMC or RBS is not sufficient, the handler needs to check the interrupt flags during the RX loop.
Hello, I had a problem a few days ago related to the CAN controller. And yesterday I found out that my problem is data overrun. But my problem went beyond duplicate messages. This is a critical problem. Failure to process invalid messages causes the system to fail completely. |
There is a DMA for CAN? |
Hi, I tried this solution without success. In fact, when clear overrun and then release the buffer, the next message still invalid (duplicated!) but there's no overrun interrupt neither overrun status. Supposedly the window was for a valid message, but not valid! Inside
where
|
@neorevx Please check my commit referenced above. I found we cannot rely on RMC, RBS or DOS, we need to check the interrupt flags after each release to avoid false duplicates. As that also clears the IR, you need to collect and return new interrupts from the RX loop back to the main ISR so they don't get lost. |
@dexterbg I tried your solution but I still getting duplicated. Check:
There is some curious thing happening when clear data overrun. |
Maybe that was too early for optimism: I just got a user report that can only be explained by a frame with content shifted by one byte to the back. Looks like a FIFO address window error. This is a new type of error we haven't had before. Some more documentation or a working overflow handling code example would be nice. |
I haven't observed that. Did you try giving the clear & release commands in sequence? |
If clear & release in sequence: BEFORE RXBUF=1 INTR1=9 INTR2=1 RMC=59 If clear & release issued together: BEFORE RXBUF=1 INTR1=9 INTR2=1 RMC=8 Note: interruped has been read before status. If you read status before interrupt, the behavior change: interrupt is restored after read status. Maybe there is some processing time. Need to check put some nops or check command register. |
Guessing here: if releasing the buffer can result in just data_overrun to be set (without a new rx interrupt), the loop terminates at that point, leaving an unresolved overrun. I've just changed that in my code. Your loop condition would need to be:
|
I have tried it. But aparently rx is set when overrun is set. I not found a case in overrun set and rx is not set. |
So far, best approch is clear buffer. It's working, but I'm losing 5% of messages.
Think about buffer. It has 64 bytes? I guess yes. Every message has 13 bytes. There's in max 4 messages in buffer. Truly, 3 messages, but I can read 4. If RMC > 4 the overrun happens, I checked it. I'll try discarcd messages until RMC <= 4. Then I read message. I don't know if it work. I don't know if there some overflow over overflow. I'll try. Edit: It don't work. But surprisely, we have overrun interrupt after loop and RMC = 4 or 5. In next loop, it's read a valid message, but in sequence it stuck in same message. |
I can read 5 valid 8 byte standard frames and more than 5 valid frames from the FIFO if there are shorter frames mixed in. If the controller follow the SJA1000 specs here, a standard frame needs 3 bytes for metadata, an extended frame 5. If Darian's info…
…is incorrect and there won't be any valid frames after an overrun, we really may just need to discard until RMC is zero to fully clear the overrun. |
I don't think so. When I clear until RMC = 4, I got a valid message. Then, there's some valid messages in buffer. The problem should be clear data overrun. Aparently don't work. Or the problem is anoted in the code: |
I also don't think so now. I've tried variations of that scheme and they all lead to new problems resulting from FIFO window address errors, i.e. frames being constructed from wrong offsets into other frame data stored in the FIFO. |
I've set up a CAN test sender and modified our framework to support single stepping through the RX process and inspect the registers. My findings so far:
I've been testing these changes to our driver since the weekend, had no more corrupted frames from FIFO overflows. @neorevx can you verify this? |
I've tried many codes. But I think the corrupted frames come from hw errors.
Yes. But I think it's not a delay. You need to read/write any registers related do CAN to "restore" interrupt flags. I just read RMC after CDO+RRB to get interrutps alive. However, I use DOS for check overrun. I don't trust in interrupt.
Yes and no.
How you handle this?
I did not know that! I'm glad you said that. I will treat. What better way to restart CAN? Calling command reset = 1 and after reset = 0? There are several error registers, need to be reset? A few more things:
For reference, I'll put my code here:
|
Sorry, I wasn't clear on this: you don't do RRB until RMC=0, you read RMC at the DOS event, then do that many RRBs. The RX buffer will have the next valid frame right after CDO, but reading from there will result in dupes. See my attached log for an extended test of this scheme with interleaved new receives.
See my full code here: https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/blob/350f5f10d37a7cc58aae6770a6e6db7c2953d29c/vehicle/OVMS.V3/components/esp32can/src/esp32can.cpp#L94 RX loop: while (MODULE_ESP32CAN->SR.B.RBS | MODULE_ESP32CAN->SR.B.DOS)
{
if (MODULE_ESP32CAN->RMC.B.RMC == 64)
{
// RMC overflow => reset controller:
MODULE_ESP32CAN->MOD.B.RM = 1;
me->InitController();
MODULE_ESP32CAN->MOD.B.RM = 0;
error_irqs = __CAN_IRQ_DATA_OVERRUN;
me->m_status.error_resets++;
}
else if (MODULE_ESP32CAN->SR.B.DOS)
{
// FIFO overflow => clear overflow & discard <RMC> messages to resync:
error_irqs = __CAN_IRQ_DATA_OVERRUN;
MODULE_ESP32CAN->CMR.B.CDO = 1;
int8_t discard = MODULE_ESP32CAN->RMC.B.RMC;
while (discard--)
{
MODULE_ESP32CAN->CMR.B.RRB = 1;
me->m_status.rxbuf_overflow++;
}
}
else
{
// Valid frame in receive buffer: record the origin
[…get frame…]
// Request next frame:
MODULE_ESP32CAN->CMR.B.RRB = 1;
// Send frame to CAN framework:
xQueueSendFromISR(MyCan.m_rxqueue, &msg, task_woken);
}
} // while (MODULE_ESP32CAN->SR.B.RBS | MODULE_ESP32CAN->SR.B.DOS) On The frame drop rate is now around 3 per mille for our application. Handling the actual frame processing in the ISR is not an option for us. Also, feeding a queue from the ISR is the standard way to handle I/O in FreeRTOS, that is supposed to absolutely work for a buffering CAN controller. I'm now trying to figure out why the ISR is sometimes blocked / delayed so long a FIFO overrun can actually occur. The issue started to be more present when we switched the TCP/IP task affinity from "*" to core 0, where our CAN ISR is handled, so that's my current suspect. |
@dexterbg @neorevx sorry for not responding earlier.
Long critical sections or other same/higher priority interrupts are the usual culprit. Try reducing the length of your critical sections, or moving the CAN ISR to a less crowded core (basically call |
…ssif#4276) The web server currently lacks the ability to send a buffer. Only strings are supported. This PR adds an overload to sendContent.
…fixes For reference, here is Michael's original issue: espressif/esp-idf#4276 (comment) The commit of interest is here: espressif/esp-idf@2f58060 Here is a reformatted description of this fix: TWAI_ERRATA_FIX_BUS_OFF_REC Add SW workaround for REC change during bus-off When the bus-off condition is reached, the REC should be reset to 0 and frozen (via LOM) by the driver's ISR. However on the ESP32, there is an edge case where the REC will increase before the driver's ISR can respond in time (e.g., due to the rapid occurrence of bus errors), thus causing the REC to be non-zero after bus-off. A non-zero REC can prevent bus-off recovery as the bus-off recovery condition is that both TEC and REC become 0. Enabling this option will add a workaround in the driver to forcibly reset REC to zero on reaching bus-off. The actual change is simple: // esp/components/hal/twai_hal_iram.c //Handle low latency events if (events & TWAI_HAL_EVENT_BUS_OFF) { twai_ll_set_mode(hal_ctx->dev, TWAI_MODE_LISTEN_ONLY); //Freeze TEC/REC by entering LOM //Errata workaround: Force REC to 0 by re-triggering bus-off (by setting TEC to 0 then 255) twai_ll_set_tec(hal_ctx->dev, 0); twai_ll_set_tec(hal_ctx->dev, 255); (void) twai_ll_get_and_clear_intrs(hal_ctx->dev); //Clear the re-triggered bus-off inter rupt } TWAI_HAL_EVENT_BUS_OFF is set in the routine above, twai_hal_decode_interrupt(): //Error Warning Interrupt set whenever Error or Bus Status bit changes if (interrupts & TWAI_LL_INTR_EI) { if (status & TWAI_LL_STATUS_BS) { //Currently in BUS OFF state if (status & TWAI_LL_STATUS_ES) { //EWL is exceeded, thus must have entered BUS OFF TWAI_HAL_SET_BITS(events, TWAI_HAL_EVENT_BUS_OFF); TWAI_HAL_SET_BITS(state_flags, TWAI_HAL_STATE_FLAG_BUS_OFF); My change looks for the __CAN_IRQ_ERR_WARNING interrupt, for the __CAN_STS_BUS_OFF bit to be on (indicating bus off), and the __CAN_STS_ERR_WARNING status bit to be on (indicating error status).
…interrupt lost For reference, here is Michael's original issue: espressif/esp-idf#4276 (comment) The commit of interest is here: espressif/esp-idf@2f58060 Here is a reformatted description of this fix: Errata workaround: TWAI_ERRATA_FIX_TX_INTR_LOST Add SW workaround for TX interrupt lost On the ESP32, when a transmit interrupt occurs, and interrupt register is read on the same APB clock cycle, the transmit interrupt could be lost. Enabling this option will add a workaround that checks the transmit buffer status bit to recover any lost transmit interrupt. The fix involves keeping track of when the tx buffer is in use; look for TWAI_HAL_STATE_FLAG_TX_BUFF_OCCUPIED in: esp/components/hal/twai_hal_iram.c My change adds a place to keep track of the tx buf state (m_state) and looks for the possible lost tx interrupt in ESP32CAN_isr().
Environment
Problem Description
On RX FIFO overrun, the ESP32 CAN controller delivers corrupted frames and
false frame repetitions.
Expected Behavior
The ESP32 CAN controller is supposed to be SJA1000 compatible. We're operating
it with driver code derived from the original CAN driver by Thomas Barth
(https://www.barth-dev.de/can-driver-esp32/), using the SJA1000 PeliCAN mode
and fetching RX frames sequentially through the receive buffer.
Quoting from the SJA1000 spec sheet:
So according to the specs:
acceptance filter), that frame should be discarded completely.
be generated, so the driver knows some frame has been lost.
Actual Behavior
RMC
register…RI
andRBS
as a valid frame when retrieving the FIFO contents.some trashed bytes up to the nominal frame length.
number of false frames containing repetitions of the first frame in the FIFO.
Example:
A BMS delivering cell voltage & temperature readings sends blocks of 8 byte
standard frames. On FIFO overflow, the CAN controller trashes bytes 7 & 8 on
the sixth frame. A standard frame needs a 3 byte header + the data bytes in the
FIFO, so the 6th frame exceeds the FIFO by two bytes. The first trash byte
normally is "08", the second "84" or "2a" or sometimes "ab", possibly some
internal SJA1000 data.
This behaviour (both the frame corruption and the false repetitions) applies to
all methods reading the standard receive buffer, i.e. using the
RMC
(as isdone by the current esp-idf can.c driver), checking the
RBS
indicator andchecking the
RI
interrupt flag.The workaround I've done for our driver is adding up the message lengths read
during an RX fetch run and discarding all frames exceeding the 64 byte border.
See function
ESP32CAN_rxframe()
inesp32can.cpp
:https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/blob/master/vehicle/OVMS.V3/components/esp32can/src/esp32can.cpp#L92
I suggest applying this workaround to the esp-idf driver as well and fixing
the hardware in the next ESP32 revision.
Steps to repropduce
It should be reproducable by connecting two units running the CAN example,
with one of the units temporarily disabling interrupts to force the FIFO
overrun.
Note: the bug may need specific circumstances to occur in addition to the
overflow, maybe the overflow happening on a specific byte position in the FIFO
-- I haven't tried to determine that.
Code to reproduce this issue
Use esp-idf CAN example.
Debug Logs
none
Other items if possible
none
Project origin
https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/
The text was updated successfully, but these errors were encountered: