ESP32 CAN controller delivers corrupted frames on RX FIFO overrun (IDFGH-2114) #4276

Closed
dexterbg opened this issue Nov 1, 2019 · 21 comments
Labels
Status: Done Issue is done internally

Comments

@dexterbg

dexterbg commented Nov 1, 2019

Environment

  • Development Kit: none / OVMS3
  • Kit version (for WroverKit/PicoKit/DevKitC): none / OVMS3
  • Module or chip used: ESP32-WROVER 16MB
  • IDF version: all / doesn't apply
  • Build System: Make
  • Compiler version: (crosstool-NG crosstool-ng-1.22.0-98-g4638c4f) 5.2.0
  • Operating System: Linux, macOS
  • Power Supply: USB, external 5V

Problem Description

On RX FIFO overrun, the ESP32 CAN controller delivers corrupted frames and
false frame repetitions.

Expected Behavior

The ESP32 CAN controller is supposed to be SJA1000 compatible. We're operating
it with driver code derived from the original CAN driver by Thomas Barth
(https://www.barth-dev.de/can-driver-esp32/), using the SJA1000 PeliCAN mode
and fetching RX frames sequentially through the receive buffer.

Quoting from the SJA1000 spec sheet:

After reading the contents of the receive buffer, the CPU can release this
memory space in the RXFIFO by setting the release receive buffer bit to logic 1.
This may result in another message becoming immediately available within the
receive buffer.

… the RXFIFO has space for 64 message bytes in total. It depends on the data
length how many messages can fit in it at one time. If there is not enough space
for a new message within the RXFIFO, the CAN controller generates a data overrun
condition the moment this message becomes valid and the acceptance test was
positive. A message which is partly written into the RXFIFO, when the data
overrun situation occurs, is deleted.

The RMC register (CAN address 29) reflects the number of messages available
within the RXFIFO. The value is incremented with each receive event and
decremented by the release receive buffer command.

So according to the specs:

  • If no space is left in the FIFO for a new frame coming in (and passing the
    acceptance filter), that frame should be discarded completely.
  • It should not be counted.
  • It should not be passed through the receive buffer.
  • Just the overflow indicator should be set and the according interrupt
    be generated, so the driver knows some frame has been lost.

Actual Behavior

  • The frame causing the overflow is added to the FIFO partially (up to the FIFO border).
  • It's also counted both in the RMC register…
  • …and indicated by RI and RBS as a valid frame when retrieving the FIFO contents.
  • On fetching the FIFO contents, the controller delivers the partial frame +
    some trashed bytes up to the nominal frame length.
  • After delivering the corrupted frame, the controller may continue delivering a
    number of false frames containing repetitions of the first frame in the FIFO.

Example:

A BMS delivering cell voltage & temperature readings sends blocks of 8 byte
standard frames. On FIFO overflow, the CAN controller trashes bytes 7 & 8 on
the sixth frame. A standard frame needs a 3 byte header + the data bytes in the
FIFO, so the 6th frame exceeds the FIFO by two bytes. The first trash byte
normally is "08", the second "84" or "2a" or sometimes "ab", possibly some
internal SJA1000 data.

inv_msg: framecnt=13, invindex=6
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 24 04 00 00 11 40 10 22 37 55 00 37 | 4..?........$....@."7U.7
inv_msg: 25 04 00 00 0a 1b 44 ff fe 4e 01 26 | 4..?........%.....D..N.&
inv_msg: 54 05 00 00 37 37 37 37 37 37 37 00 | 4..?........T...7777777.
inv_msg: 56 05 00 00 31 63 14 31 53 14 31 4a | 4..?........V...1c.1S.1J
inv_msg: 57 05 00 00 31 43 14 31 53 15 08 2a | 4..?........W...1C.1S..*
                                       ^^^^^ trashed bytes here
… following 7 repetitions of the first frame:
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
inv_msg: 55 01 00 00 07 98 50 54 54 20 00 6e | 4..?........U.....PTT .n
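
The arithmetic behind this example can be sketched as follows. This is only an illustration, assuming the FIFO layout quoted from the SJA1000 data sheet (64 bytes, 3 header bytes for a standard frame, 5 for an extended frame); the function names are made up for this sketch and are not part of any driver.

```c
#include <stdbool.h>
#include <stddef.h>

#define RX_FIFO_SIZE 64  /* RXFIFO capacity in bytes, per the SJA1000 quote above */

/* Bytes one frame occupies in the RXFIFO: header bytes plus data bytes. */
static size_t fifo_frame_bytes(bool extended, size_t dlc)
{
    size_t header = extended ? 5 : 3;  /* 1 frame info byte + 4 or 2 ID bytes */
    return header + (dlc > 8 ? 8 : dlc);
}

/* 1-based index of the first frame that no longer fits completely,
 * for a run of identical frames written into an empty FIFO. */
static size_t first_overrunning_frame(bool extended, size_t dlc)
{
    return RX_FIFO_SIZE / fifo_frame_bytes(extended, dlc) + 1;
}

/* Number of bytes by which that frame extends past the FIFO border. */
static size_t overrun_excess_bytes(bool extended, size_t dlc)
{
    size_t len = fifo_frame_bytes(extended, dlc);
    return first_overrunning_frame(extended, dlc) * len - RX_FIFO_SIZE;
}
```

For 8 byte standard frames this gives 11 bytes per frame, so the sixth frame is the first to cross the border and it does so by two bytes, which matches the two trashed bytes in the log.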

This behaviour (both the frame corruption and the false repetitions) applies to
all methods reading the standard receive buffer, i.e. using the RMC (as is
done by the current esp-idf can.c driver), checking the RBS indicator and
checking the RI interrupt flag.

The workaround I've done for our driver is adding up the message lengths read
during an RX fetch run and discarding all frames exceeding the 64 byte border.
See function ESP32CAN_rxframe() in esp32can.cpp:
https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/blob/master/vehicle/OVMS.V3/components/esp32can/src/esp32can.cpp#L92
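
A rough sketch of the length budget idea described here, not the actual code from esp32can.cpp. It assumes the frames of one fetch run have already been read into an array; `rx_frame_meta_t` and `count_trustworthy_frames` are hypothetical names introduced for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define RX_FIFO_SIZE 64  /* RXFIFO capacity in bytes */

/* Hypothetical per-frame metadata collected during one RX fetch run. */
typedef struct {
    bool   extended;  /* extended (29 bit) frame format */
    size_t dlc;       /* data length code, 0..8 */
} rx_frame_meta_t;

/* Add up the FIFO space used by the frames in read order and return how many
 * leading frames lie completely inside the 64 byte border. Frames from that
 * index on would be discarded by the workaround. */
static size_t count_trustworthy_frames(const rx_frame_meta_t *frames, size_t count)
{
    size_t used = 0;
    for (size_t i = 0; i < count; i++) {
        size_t data = frames[i].dlc > 8 ? 8 : frames[i].dlc;
        size_t len = (frames[i].extended ? 5 : 3) + data;
        if (used + len > RX_FIFO_SIZE) {
            return i;  /* this frame crosses the border */
        }
        used += len;
    }
    return count;
}
```

Applied to the example above (13 counted frames, all 8 byte standard frames), the function returns 5, so the sixth and all following frames are dropped.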

I suggest applying this workaround to the esp-idf driver as well and fixing
the hardware in the next ESP32 revision.

Steps to reproduce

It should be reproducible by connecting two units running the CAN example,
with one of the units temporarily disabling interrupts to force the FIFO
overrun.

Note: the bug may need specific circumstances to occur in addition to the
overflow, maybe the overflow happening on a specific byte position in the FIFO
-- I haven't tried to determine that.

Code to reproduce this issue

Use esp-idf CAN example.

Debug Logs

none

Other items if possible

none

Project origin

https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/

@github-actions github-actions bot changed the title ESP32 CAN controller delivers corrupted frames on RX FIFO overrun ESP32 CAN controller delivers corrupted frames on RX FIFO overrun (IDFGH-2114) Nov 1, 2019
@Dazza0
Contributor

Dazza0 commented Nov 1, 2019

The ESP32 CAN controller is supposed to be SJA1000 compatible.

Not sure where you got that information from. But the CAN controller in the ESP32 has a similar register interface to the SJA1000 but NOT fully compatible. There are a few missing features and behavioral differences.

The frame causing the overflow is added to the FIFO partially (up to the FIFO border).
It's also counted both in the RMC register

Both are true. Basically when bytes are received, they are written to the FIFO directly, and an overflow is not detected until the 64th byte is written. The bytes of the overflowing message will remain in the FIFO. The RMC should count all messages received (up to 64 messages) regardless of whether they were overflowing or not.

Basically, what should happen is that whenever you release the receiver buffer, and the buffer window shifts to an overflowed message, the Data overrun interrupt will be set. If that is the case, the message contents should be ignored, the clear data overrun command set, and the receiver buffer released again. Continue this process until RMC reaches zero, or until the buffer window rotates to a valid message.

But you're right, it appears that the CAN driver doesn't handle buffer overflow case yet (it's marked as a todo). I'll push a commit to handle this case.

@dexterbg
Author

dexterbg commented Nov 1, 2019

Thanks for the clarification & feedback. The SJA1000 info comes from Thomas Barth (see link).

Can you provide a spec sheet / register documentation for the ESP32 CAN controller, or some documentation on the missing features & differences?

@dexterbg
Author

dexterbg commented Nov 2, 2019

Three further questions:

  • Does RRB set DOS along with DOI in case of an overflown frame?
  • Does CDO clear both DOS and DOI or is reading the IR register necessary to clear DOI?
  • Can CDO and RRB be issued together?

dexterbg added a commit to openvehicles/Open-Vehicle-Monitoring-System-3 that referenced this issue Nov 3, 2019
See comment. Info source:
espressif/esp-idf#4276 (comment)

Further experimentation showed looping on the RMC or RBS is not sufficient,
the handler needs to check the interrupt flags during the RX loop.
@neorevx

neorevx commented Nov 5, 2019

Hello, I had a problem a few days ago related to the CAN controller, and yesterday I found out that my problem is data overrun. But my problem went beyond duplicate messages. This is a critical problem: failing to handle the invalid messages causes the system to fail completely.
When a data overrun happens, the RX interrupt becomes active. If no message overwrites the invalid message, for example during a period of bus silence, the interrupt handler is called continuously and never clears the data overrun (there is no message to process). All other tasks on that CPU starve and the watchdog is triggered.
One more thing: RMC may count overrun messages, but at some point (maybe CMD_RELEASE_RX_BUFF decreases RMC) RMC is zero and the RX interrupt is still active.

@neorevx

neorevx commented Nov 5, 2019

Is there a DMA for CAN?

@neorevx

neorevx commented Nov 6, 2019

The ESP32 CAN controller is supposed to be SJA1000 compatible.

Not sure where you got that information from. But the CAN controller in the ESP32 has a similar register interface to the SJA1000 but NOT fully compatible. There are a few missing features and behavioral differences.

The frame causing the overflow is added to the FIFO partially (up to the FIFO border).
It's also counted both in the RMC register

Both are true. Basically when bytes are received, they are written to the FIFO directly, and an overflow is not detected until the 64th byte is written. The bytes of the overflowing message will remain in the FIFO. The RMC should count all messages received (up to 64 messages) regardless of whether they were overflowing or not.

Basically, what should happen is that whenever you release the receiver buffer, and the buffer window shifts to an overflowed message, the Data overrun interrupt will be set. If that is the case, the message contents should be ignored, the clear data overrun command set, and the receiver buffer released again. Continue this process until RMC reaches zero, or until the buffer window rotates to a valid message.

But you're right, it appears that the CAN driver doesn't handle buffer overflow case yet (it's marked as a todo). I'll push a commit to handle this case.

Hi, I tried this solution without success. In fact, after clearing the overrun and then releasing the buffer, the next message is still invalid (duplicated!), but there is neither an overrun interrupt nor an overrun status. Supposedly the window is on a valid message, but the message is not valid!
I have tried release buffer + clear overrun + release buffer, but that doesn't work either.
So I was more radical: when there is an overrun, I discard all messages regardless of the overrun interrupt until RMC is zero.
I believe that I will be losing valid messages, but it's the only solution I've found to keep the system running.
Another thing: in some cases the overrun interrupt is not active while the overrun status is active. In these cases an overrun has occurred.

Inside the for loop in can_intr_handler_rx I put:

        status.val = can_get_status();
        if (status.data_overrun) {
            while (can_get_rx_message_counter() > 0) {
                p_can_obj->rx_total_count++;
                p_can_obj->rx_data_overrun_count++;
                status.val = can_get_status();
                if (status.data_overrun) {
                    can_set_command(CMD_CLR_DATA_OVRN);
                }
                can_set_command(CMD_RELEASE_RX_BUFF);
            }
            return;
        }

where

can_status_reg_t status;

@dexterbg
Author

dexterbg commented Nov 6, 2019

@neorevx Please check my commit referenced above. I found we cannot rely on RMC, RBS or DOS, we need to check the interrupt flags after each release to avoid false duplicates. As that also clears the IR, you need to collect and return new interrupts from the RX loop back to the main ISR so they don't get lost.
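
A minimal sketch of the "collect and return" pattern mentioned here, using a mock register instead of real hardware. The bit values follow the SJA1000 PeliCAN interrupt register layout, which is assumed (not confirmed in this thread) to apply to the ESP32; `mock_ir_t`, `mock_ir_read` and `read_and_collect` are hypothetical names for this illustration only.

```c
#include <stdint.h>

/* Interrupt register bits as in the SJA1000 PeliCAN layout (assumption). */
#define IR_RI  0x01u  /* receive interrupt */
#define IR_TI  0x02u  /* transmit interrupt */
#define IR_DOI 0x08u  /* data overrun interrupt */

/* Hypothetical stand-in for the hardware IR: a read returns the flags
 * and clears them, as described for the real register. */
typedef struct {
    uint8_t flags;
} mock_ir_t;

static uint8_t mock_ir_read(mock_ir_t *ir)
{
    uint8_t value = ir->flags;
    ir->flags = 0;  /* read-to-clear */
    return value;
}

/* Read the IR inside the RX loop and OR every flag seen into *pending,
 * so the outer ISR can still handle interrupts consumed by this read. */
static uint8_t read_and_collect(mock_ir_t *ir, uint8_t *pending)
{
    uint8_t now = mock_ir_read(ir);
    *pending |= now;
    return now;
}
```

The RX loop would act on the returned value, while the main ISR works through the accumulated `pending` set afterwards.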

@neorevx

neorevx commented Nov 6, 2019

@dexterbg I tried your solution but I'm still getting duplicates.
Neither the RX interrupt nor the overrun status is cleared on read, but the overrun interrupt is.
The overrun status is cleared after the clear overrun command.

Check:

can_intr_reg_t ir = *intr_reason; // First intr

    while (ir.rx) {
        p_can_obj->rx_total_count++;

        if (ir.data_overrun) {
            p_can_obj->rx_data_overrun_count++;
            can_set_command(CMD_CLR_DATA_OVRN | CMD_RELEASE_RX_BUFF);  // Apparently this works together
        } else {
            can_frame_t frame;
            can_get_rx_buffer_and_clear(&frame);
...
        }
        ir.val = can_get_interrupt_reason();
        intr_reason->val |= ir.val;
    }

There is a curious thing happening when clearing the data overrun.
After the clear, rx is not active in the next interrupt read, even though RMC > 0 and the RX buffer status is set. But if you read the interrupts again right afterwards, rx is active! I'll try to check whether this happens after the clear data overrun or after the release buffer command.

@dexterbg
Author

dexterbg commented Nov 6, 2019

Maybe that was too early for optimism: I just got a user report that can only be explained by a frame with content shifted by one byte to the back. Looks like a FIFO address window error. This is a new type of error we haven't had before.

Some more documentation or a working overflow handling code example would be nice.

@dexterbg
Author

dexterbg commented Nov 6, 2019

I haven't observed that. Did you try giving the clear & release commands in sequence?

@neorevx

neorevx commented Nov 6, 2019

If clear & release in sequence:

BEFORE RXBUF=1 INTR1=9 INTR2=1 RMC=59
AFTER CLR OVRN RXBUF=1 INTR=1 RMC=59
AFTER RELEASE RXBUF=1 INTR=0 RMC=58
AFTER AFTER RELEASE RXBUF=1 INTR=1 RMC=58

If clear & release issued together:

BEFORE RXBUF=1 INTR1=9 INTR2=1 RMC=8
AFTER CLR&RELEASE RXBUF=1 INTR=0 RMC=7
AFTER AFTER CLR&RELEASE RXBUF=1 INTR=1 RMC=7

Note: the interrupt register was read before the status register. If you read the status before the interrupt register, the behaviour changes: the interrupt flag is restored after reading the status. Maybe there is some processing time. I need to check this by inserting some NOPs or by reading the command register.

@dexterbg
Author

dexterbg commented Nov 6, 2019

Guessing here: if releasing the buffer can result in just data_overrun being set (without a new rx interrupt), the loop terminates at that point, leaving an unresolved overrun. I've just changed that in my code. Your loop condition would need to be:

    while (ir.rx | ir.data_overrun) {

@neorevx

neorevx commented Nov 6, 2019

I have tried it, but apparently rx is always set when overrun is set. I have not found a case where overrun is set and rx is not.
About rx not being set after clear data overrun: apparently you have to read any CAN related register to refresh the interrupt register. I read the cmd register and the status register; in both cases the interrupt register goes back to rx set.

@neorevx

neorevx commented Nov 6, 2019

So far, the best approach is clearing the buffer. It's working, but I'm losing 5% of messages.

        if (CAN.status_reg.data_overrun) {
            while (can_get_rx_message_counter()) {
                p_can_obj->rx_total_count++;
                p_can_obj->rx_data_overrun_count++;
                can_set_command(CMD_CLR_DATA_OVRN | CMD_RELEASE_RX_BUFF);
            }
        }

Think about the buffer. Does it have 64 bytes? I guess yes. Every message has 13 bytes, so at most 4 messages fit in the buffer. Strictly speaking 3 messages, but I can read 4. If RMC > 4 the overrun happens, I checked it. I'll try discarding messages until RMC <= 4 and then read a message. I don't know if it works. I don't know if there is some overflow on top of the overflow. I'll try.

Edit:

It doesn't work. But surprisingly, we have an overrun interrupt after the loop and RMC = 4 or 5. In the next loop a valid message is read, but after that it gets stuck on the same message.

@dexterbg
Author

dexterbg commented Nov 6, 2019

I can read 5 valid 8 byte standard frames and more than 5 valid frames from the FIFO if there are shorter frames mixed in. If the controller follows the SJA1000 specs here, a standard frame needs 3 bytes for metadata, an extended frame 5.

If Darian's info…

Continue this process until RMC reaches zero, or until the buffer window rotates to a valid message.

…is incorrect and there won't be any valid frames after an overrun, we really may just need to discard until RMC is zero to fully clear the overrun.

@neorevx

neorevx commented Nov 7, 2019

I don't think so. When I clear until RMC = 4, I get a valid message, so there are some valid messages in the buffer. The problem should be the clear data overrun command; apparently it doesn't work. Or the problem is the one noted in the code:
// Todo: Check data overrun bug where interrupt does not trigger even when // enabled
If the overrun interrupt isn't raised after the buffer window moves to another invalid message, we can't identify it.
Another thing: when an overrun occurs while reading duplicated messages, a new message is found. The buffer window moves when an overrun occurs.

@dexterbg
Author

dexterbg commented Nov 7, 2019

I also don't think so now. I've tried variations of that scheme and they all lead to new problems resulting from FIFO window address errors, i.e. frames being constructed from wrong offsets into other frame data stored in the FIFO.

@dexterbg
Author

I've set up a CAN test sender and modified our framework to support single stepping through the RX process and inspect the registers.

My findings so far:

  • The interrupt flags are set with a delay after RRB / CDO

    • → the RX loop must not use them (but can rely on the status flags)
  • At DOS, RMC tells us how many frames need to be discarded

    • → on DOS, issuing CDO, then RMC times RRB resyncs the RX to the next valid frame
  • DOS can become set on the last RRB (i.e. without any more message in the FIFO, RMC=0 and no RI/RBS)

    • → the RX loop must check & handle DOS independent of the other indicators
  • After an overflow of the receive message counter RMC (at 64), the controller cannot recover
    and continues to deliver wrong & corrupted frames, even if clearing the FIFO completely until RMC=0

    • → if RMC reaches 64, the controller must be reset

I've been testing these changes to our driver since the weekend, had no more corrupted frames from FIFO overflows. @neorevx can you verify this?

@neorevx

neorevx commented Nov 13, 2019

I've tried many code variants, but I think the corrupted frames come from HW errors.

The interrupt flags are set with a delay after RRB / CDO

→ the RX loop must not use them (but can rely on the status flags)

Yes. But I think it's not a delay. You need to read or write any CAN related register to "restore" the interrupt flags. I just read RMC after CDO+RRB to bring the interrupts back. However, I use DOS to check for an overrun. I don't trust the interrupt flags.

At DOS, RMC tells us how many frames need to be discarded

→ on DOS, issuing CDO, then RMC times RRB resyncs the RX to the next valid frame

Yes and no.
In fact, if you issue CDO and RRB and afterwards the RX interrupt is active and DOS is not active, you have a valid frame! Using RRB until RMC = 0 will discard this valid frame!
However, when this happens, the rest of the frames will be duplicates until RMC = 0 or DOI = 1! If you hit another DOI, the controller stops delivering duplicate frames and delivers another valid frame (after CDO and RRB), but later continues to duplicate the last frame. For this reason I believe the problem is in the HW, not that there are no valid frames in memory.
In order not to lose this valid frame without needing overly complex logic, when an overrun occurs I issue CDO and RRB and then discard the other frames until RMC = 1. As the frame is duplicated, I can use this "last" frame. But I'm thinking of discarding until RMC = 1 or DOS / DOI = 1.

DOS can become set on the last RRB (i.e. without any more message in the FIFO, RMC=0 and no RI/RBS)

→ the RX loop must check & handle DOS independent of the other indicators

How do you handle this?
Currently my read loop considers only the initial RMC, validating RI or DOI or DOS. When RMC = 0, my code returns and I do not handle this case. It will only be handled when RMC > 0.
When an overrun happens, I reload the RMC count (for the loop) after clearing the buffer. In fact, it goes back into the loop with RMC = 1.

After an overflow of the receive message counter RMC (at 64), the controller cannot recover
and continues to deliver wrong & corrupted frames, even if clearing the FIFO completely until RMC=0

→ if RMC reaches 64, the controller must be reset

I did not know that! I'm glad you said that. I will handle it. What is the best way to restart CAN? Setting reset = 1 and then reset = 0? There are several error registers; do they need to be reset?


A few more things:

  • Once, I got corrupted frames with DLC = 7! The last byte was not really valid, but the original message had DLC = 8. As all messages I use have DLC = 8, I filter these out.
  • I took another approach to message processing: I heavily optimized the code to process messages as quickly as possible. The original implementation puts the messages in a queue and processes them later. This is a correct approach, but due to the overhead, even when processing on another CPU core, I started losing messages because of a full queue or even overrun. I changed the way CAN buffer data is copied to memory (I already inverted the copy to MSB -> LSB), reduced the number of copies, and put the interrupt handler and the processing in IRAM. Now processing is done within the interrupt. I know this is not the right practice, but I basically pass the message through a switch ... case and write the data to another variable.
    Basically, I don't have overruns anymore.

For reference, I'll put my code here:

static inline void can_intr_handler_rx(can_intr_reg_t *intr_reason, BaseType_t *task_woken, int *alert_req) {
    can_intr_reg_t ir = *intr_reason;

    uint32_t rx = can_get_rx_message_counter();

    while (rx-- && (ir.rx | ir.data_overrun | CAN.status_reg.data_overrun)) {
        p_can_obj->rx_total_count++;

        if (ir.data_overrun | CAN.status_reg.data_overrun) {
            p_can_obj->rx_data_overrun_count++;
            can_set_command(CMD_RELEASE_RX_BUFF | CMD_CLR_DATA_OVRN);
            while (can_get_rx_message_counter() > 1) {
                p_can_obj->rx_total_count++;
                p_can_obj->rx_data_overrun_count++;
                can_set_command(CMD_RELEASE_RX_BUFF | CMD_CLR_DATA_OVRN);
            }
            rx = can_get_rx_message_counter();  // Besides setting rx, this line restores the CAN interrupt (must be called to restore interrupt!)
        } else {
            can_read_frame(); // Internally store message in static variable
            can_set_command(CMD_RELEASE_RX_BUFF);
            p_can_obj->rx_msg_count++;
[...]
        }

        ir.val = can_get_interrupt_reason();
        intr_reason->val |= ir.val;
    }
    intr_reason->rx = 0;
    intr_reason->data_overrun = 0;
}

@dexterbg
Author

In fact, if you call CDO and RRB and later the RX interrupt is active and DOS is not active, you have a valid frame! Using RRB until RMC = 0 will discard this valid frame!

Sorry, I wasn't clear on this: you don't do RRB until RMC=0, you read RMC at the DOS event, then do that many RRBs. The RX buffer will have the next valid frame right after CDO, but reading from there will result in dupes. See my attached log for an extended test of this scheme with interleaved new receives.
esp32can-singlestep-cdotest2.log

→ the RX loop must check & handle DOS independent of the other indicators
How you handle this?

What better way to restart CAN? Calling command reset = 1 and after reset = 0? There are several error registers, need to be reset?

See my full code here: https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/blob/350f5f10d37a7cc58aae6770a6e6db7c2953d29c/vehicle/OVMS.V3/components/esp32can/src/esp32can.cpp#L94

RX loop:

  while (MODULE_ESP32CAN->SR.B.RBS | MODULE_ESP32CAN->SR.B.DOS)
    {
    if (MODULE_ESP32CAN->RMC.B.RMC == 64)
      {
      // RMC overflow => reset controller:
      MODULE_ESP32CAN->MOD.B.RM = 1;
      me->InitController();
      MODULE_ESP32CAN->MOD.B.RM = 0;
      error_irqs = __CAN_IRQ_DATA_OVERRUN;
      me->m_status.error_resets++;
      }
    else if (MODULE_ESP32CAN->SR.B.DOS)
      {
      // FIFO overflow => clear overflow & discard <RMC> messages to resync:
      error_irqs = __CAN_IRQ_DATA_OVERRUN;
      MODULE_ESP32CAN->CMR.B.CDO = 1;
      int8_t discard = MODULE_ESP32CAN->RMC.B.RMC;
      while (discard--)
        {
        MODULE_ESP32CAN->CMR.B.RRB = 1;
        me->m_status.rxbuf_overflow++;
        }
      }
    else
      {
      // Valid frame in receive buffer: record the origin
      […get frame…]
      // Request next frame:
      MODULE_ESP32CAN->CMR.B.RRB = 1;
      // Send frame to CAN framework:
      xQueueSendFromISR(MyCan.m_rxqueue, &msg, task_woken);
      }
    } // while (MODULE_ESP32CAN->SR.B.RBS | MODULE_ESP32CAN->SR.B.DOS)

On DOS, the loop does the discarding inline, then continues to check for RBS and DOS. This scheme seems to work reliably in terms of not producing any corrupted or duplicate false frames.
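
The order of checks in this loop can be condensed into a small decision function. This is only a sketch for illustration: it takes the RBS and DOS status bits and the RMC value as plain arguments instead of reading registers, and `rx_action_t` and `next_rx_action` are hypothetical names that do not exist in the OVMS driver.

```c
#include <stdbool.h>
#include <stdint.h>

#define RMC_OVERFLOW 64  /* RMC value at which the controller cannot recover */

typedef enum {
    RX_DONE,              /* neither RBS nor DOS set: leave the RX loop */
    RX_RESET_CONTROLLER,  /* RMC overflow: reset mode, re-init, leave reset mode */
    RX_DISCARD_OVERRUN,   /* DOS set: issue CDO, then RRB <RMC> times */
    RX_READ_FRAME         /* valid frame: read it, then issue RRB */
} rx_action_t;

/* Decide the next step of the RX loop, checking in the same order as the
 * loop above: loop condition first, then RMC overflow, then DOS. */
static rx_action_t next_rx_action(bool rbs, bool dos, uint8_t rmc)
{
    if (!rbs && !dos) {
        return RX_DONE;
    }
    if (rmc == RMC_OVERFLOW) {
        return RX_RESET_CONTROLLER;
    }
    if (dos) {
        return RX_DISCARD_OVERRUN;
    }
    return RX_READ_FRAME;
}
```

Note that DOS alone keeps the loop running, which covers the case where DOS becomes set on the last RRB with RMC=0 and no RBS.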

The frame drop rate is now around 3 per mille for our application. Handling the actual frame processing in the ISR is not an option for us. Also, feeding a queue from the ISR is the standard way to handle I/O in FreeRTOS, that is supposed to absolutely work for a buffering CAN controller. I'm now trying to figure out why the ISR is sometimes blocked / delayed so long a FIFO overrun can actually occur. The issue started to be more present when we switched the TCP/IP task affinity from "*" to core 0, where our CAN ISR is handled, so that's my current suspect.

@Dazza0
Contributor

Dazza0 commented Nov 19, 2019

@dexterbg @neorevx sorry for not responding earlier.
I've tested the overflow behavior myself, and here are my findings:

  • When the RX FIFO is empty and begins receiving messages

    • Bytes are filled into the RX FIFO, RMC is incremented for every message received
  • When a message arrives with more bytes than can fit in the RX FIFO's remaining space

    • RMC is still incremented for the message
    • Whatever bytes of the message that can fit in the remaining space of the RX FIFO will be filled. The remaining bytes will be discarded.
  • When the RX FIFO is full but messages are still being received.

    • RMC is still incremented for each overrun message (up to 64).
    • None of the bytes of these overrun messages are written to RX FIFO because it is already full.
  • When RMC reaches 64, the RX FIFO becomes unrecoverable (due to an RTL bug).

    • The RX FIFO's internal read pointer becomes out of sync, and subsequent calls to release the buffer may shift the buffer window by an incorrect amount, leading to corrupt messages.
    • Entering then exiting reset mode will reset the RX FIFO
    • If the RMC reaches 63, the RX FIFO is still recoverable.
  • The DOI interrupt and DOS status bits are both set when release buffer is called and the window rotates from a valid message to an overrun one.

    • DOI is cleared by reading the interrupt register, DOS is cleared by the CDO command
    • If the next message is also overrun, DOI and DOS will not be set again if release buffer is called. The two bits are only set on a transition of the buffer window from valid to overrun message.
  • Assuming that you are clearing the RX FIFO in a single sitting (i.e. in one continuous operation).

    • If RMC is 64. The buffer is unrecoverable. Enter and exit reset mode to reset the FIFO. Whatever valid messages in the RX FIFO are lost.
    • If RMC is <64, the buffer is recoverable. Keep reading the valid messages and releasing the buffer until DOS or DOI is set (preferably DOS, because it isn't auto cleared on a register read). The remaining messages are overrun, thus release the buffer N times until RMC is 0.
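
The recovery procedure in the last bullet group can be tried out against a toy model. The sketch below is not register accurate: it only models the behaviour listed above (RMC counts valid and overrun messages, DOS is set when a release moves the window from a valid to an overrun message, RMC = 64 requires a reset). All names (`toy_fifo_t`, `toy_release`, `toy_drain`) are hypothetical, and the model assumes it starts with at least one valid message in front of any overrun ones.

```c
#include <stdbool.h>

#define RMC_LIMIT 64  /* at this count the FIFO is treated as unrecoverable */

/* Toy model of the RX FIFO state as described in the findings above. */
typedef struct {
    int  valid;    /* complete messages in front of the overrun */
    int  overrun;  /* messages counted by RMC but not stored completely */
    bool dos;      /* data overrun status bit */
} toy_fifo_t;

typedef struct {
    int  delivered;  /* valid messages read */
    int  discarded;  /* overrun messages released unread */
    bool reset;      /* controller had to be reset */
} drain_result_t;

static int toy_rmc(const toy_fifo_t *f)
{
    return f->valid + f->overrun;
}

/* Release receive buffer: DOS is set only on the valid -> overrun transition. */
static void toy_release(toy_fifo_t *f)
{
    if (f->valid > 0) {
        f->valid--;
        if (f->valid == 0 && f->overrun > 0) {
            f->dos = true;
        }
    } else if (f->overrun > 0) {
        f->overrun--;
    }
}

/* Drain the FIFO in one sitting, following the procedure described above. */
static drain_result_t toy_drain(toy_fifo_t *f)
{
    drain_result_t r = { 0, 0, false };
    if (toy_rmc(f) >= RMC_LIMIT) {
        /* Unrecoverable: enter and exit reset mode, all messages are lost. */
        f->valid = 0;
        f->overrun = 0;
        f->dos = false;
        r.reset = true;
        return r;
    }
    while (toy_rmc(f) > 0 && !f->dos) {
        r.delivered++;  /* a real driver would copy the frame here */
        toy_release(f);
    }
    if (f->dos) {
        f->dos = false;  /* clear data overrun (CDO) */
        while (toy_rmc(f) > 0) {
            toy_release(f);
            r.discarded++;
        }
    }
    return r;
}
```

With 5 valid and 8 overrun messages, as in the log from the original report, the model delivers 5 frames and discards 8.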

@dexterbg

I'm now trying to figure out why the ISR is sometimes blocked / delayed so long a FIFO overrun can actually occur.

Long critical sections or other same/higher priority interrupts are the usual culprit. Try reducing the length of your critical sections, or moving the CAN ISR to a less crowded core (basically call esp_intr_alloc() on whichever core you want to register it on).

@espressif-bot espressif-bot removed the Status: In Progress Work is in progress label Mar 31, 2021
0xFEEDC0DE64 pushed a commit to 0xFEEDC0DE64/esp-idf that referenced this issue May 5, 2021
…ssif#4276)

The web server currently lacks the ability to send a buffer. Only strings are supported.

This PR adds an overload to sendContent.
espressif-bot pushed a commit that referenced this issue May 8, 2021
This commit adds handling for FIFO overruns and
adds workarounds for HW errata on the ESP32.

Closes #2519
Closes #4276
leres added a commit to leres/Open-Vehicle-Monitoring-System-3 that referenced this issue Jun 12, 2021
…fixes

For reference, here is Michael's original issue:

    espressif/esp-idf#4276 (comment)

The commit of interest is here:

    espressif/esp-idf@2f58060

Here is a reformatted description of this fix:

   TWAI_ERRATA_FIX_BUS_OFF_REC
   Add SW workaround for REC change during bus-off

   When the bus-off condition is reached, the REC should be reset
   to 0 and frozen (via LOM) by the driver's ISR. However on the
   ESP32, there is an edge case where the REC will increase before
   the driver's ISR can respond in time (e.g., due to the rapid
   occurrence of bus errors), thus causing the REC to be non-zero
   after bus-off. A non-zero REC can prevent bus-off recovery as
   the bus-off recovery condition is that both TEC and REC become
   0. Enabling this option will add a workaround in the driver to
   forcibly reset REC to zero on reaching bus-off.

The actual change is simple:

    // esp/components/hal/twai_hal_iram.c

    //Handle low latency events
    if (events & TWAI_HAL_EVENT_BUS_OFF) {
        twai_ll_set_mode(hal_ctx->dev, TWAI_MODE_LISTEN_ONLY);  //Freeze TEC/REC by entering LOM
        //Errata workaround: Force REC to 0 by re-triggering bus-off (by setting TEC to 0 then 255)
        twai_ll_set_tec(hal_ctx->dev, 0);
        twai_ll_set_tec(hal_ctx->dev, 255);
        (void) twai_ll_get_and_clear_intrs(hal_ctx->dev);    //Clear the re-triggered bus-off interrupt
    }
    }

TWAI_HAL_EVENT_BUS_OFF is set in the routine above, twai_hal_decode_interrupt():

    //Error Warning Interrupt set whenever Error or Bus Status bit changes
    if (interrupts & TWAI_LL_INTR_EI) {
        if (status & TWAI_LL_STATUS_BS) {       //Currently in BUS OFF state
            if (status & TWAI_LL_STATUS_ES) {    //EWL is exceeded, thus must have entered BUS OFF
                TWAI_HAL_SET_BITS(events, TWAI_HAL_EVENT_BUS_OFF);
                TWAI_HAL_SET_BITS(state_flags, TWAI_HAL_STATE_FLAG_BUS_OFF);

My change looks for the __CAN_IRQ_ERR_WARNING interrupt, for the
__CAN_STS_BUS_OFF bit to be on (indicating bus off), and the
__CAN_STS_ERR_WARNING status bit to be on (indicating error status).
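
As an illustration of the three conditions just listed, here is a minimal standalone check. The bit positions are taken from the SJA1000 PeliCAN register layout and are assumed, not confirmed here, to match the ESP32; the macro and function names are made up for this sketch and are not the identifiers used in either driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions per the SJA1000 PeliCAN layout (assumed to apply to the ESP32). */
#define IR_EI (1u << 2)  /* error warning interrupt */
#define SR_ES (1u << 6)  /* error status: error warning limit reached */
#define SR_BS (1u << 7)  /* bus status: 1 = bus-off */

/* True if this interrupt/status snapshot indicates entry into bus-off:
 * error warning interrupt raised while both bus-off and error status are set. */
static bool entered_bus_off(uint8_t interrupts, uint8_t status)
{
    return (interrupts & IR_EI) && (status & SR_BS) && (status & SR_ES);
}
```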
leres added a commit to leres/Open-Vehicle-Monitoring-System-3 that referenced this issue Jul 3, 2021
…interrupt lost

For reference, here is Michael's original issue:

    espressif/esp-idf#4276 (comment)

The commit of interest is here:

    espressif/esp-idf@2f58060

Here is a reformatted description of this fix:

    Errata workaround: TWAI_ERRATA_FIX_TX_INTR_LOST

    Add SW workaround for TX interrupt lost

    On the ESP32, when a transmit interrupt occurs, and interrupt
    register is read on the same APB clock cycle, the transmit
    interrupt could be lost. Enabling this option will add a
    workaround that checks the transmit buffer status bit to
    recover any lost transmit interrupt.

The fix involves keeping track of when the tx buffer is in use;
look for TWAI_HAL_STATE_FLAG_TX_BUFF_OCCUPIED in:

    esp/components/hal/twai_hal_iram.c

My change adds a place to keep track of the tx buf state (m_state)
and looks for the possible lost tx interrupt in ESP32CAN_isr().
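
The idea of the workaround, as described above, can be sketched like this: if a transmission is known to be in flight and the transmit buffer status bit reports the buffer as released, treat the transmission as finished even when no transmit interrupt was seen. This is one reading of the description, not the code from esp-idf or OVMS. Bit positions follow the SJA1000 PeliCAN layout (an assumption), and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions per the SJA1000 PeliCAN layout (assumed to apply to the ESP32). */
#define IR_TI  (1u << 1)  /* transmit interrupt */
#define SR_TBS (1u << 2)  /* transmit buffer status: 1 = released */

/* Decide whether the TX buffer has become free.
 * tx_in_flight is driver state: set when a frame was written to the TX
 * buffer, cleared once this function has reported the buffer as free. */
static bool tx_buffer_freed(uint8_t interrupts, uint8_t status, bool tx_in_flight)
{
    bool tx_signalled    = (interrupts & IR_TI) != 0;
    bool buffer_released = (status & SR_TBS) != 0;
    /* A lost interrupt is recovered by the tx_in_flight term. */
    return (tx_signalled || tx_in_flight) && buffer_released;
}
```

The second term is what keeps the driver from waiting forever on an interrupt that was swallowed by a concurrent register read.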
espressif-bot pushed a commit that referenced this issue May 22, 2022
This commit adds handling for FIFO overruns and
adds workarounds for HW errata on the ESP32.

Closes #2519
Closes #4276