W5500 fails after 20 minutes of operation (IDFGH-10018) #11295

Closed
3 tasks done
mickeyl opened this issue Apr 28, 2023 · 27 comments
Labels
Resolution: NA Issue resolution is unavailable Status: Done Issue is done internally Type: Bug bugs in IDF

Comments

@mickeyl
Contributor

mickeyl commented Apr 28, 2023

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v5.2-dev-321-ga8b6a70620

Operating System used.

Linux

How did you build your project?

VS Code IDE

If you are using Windows, please specify command line type.

None

Development Kit.

ESP32-C6-DevKit M1

Power Supply used.

USB

What is the expected behavior?

W5500 should be stable over a long time.

What is the actual behavior?

After 20 minutes, it stops operating. See debug log.

Steps to reproduce.

Use an ethernet example.

Debug Logs.

E (5516841) w5500.mac: emac_w5500_read_phy_reg(335): read PHY register failed
E (5516841) w5500.phy: w5500_update_link_duplex_speed(69): read PHYCFG failed
E (5516841) w5500.phy: w5500_get_link(112): update link duplex speed failed
E (5516901) w5500.mac: w5500_get_rx_received_size(152): read RX RSR failed
E (5518051) w5500.mac: w5500_get_rx_received_size(152): read RX RSR failed


More Information.

I'm using the C6-Devkit w/ a W5500 connected via SPI.
@kostaond
Collaborator

kostaond commented May 2, 2023

@mickeyl thanks for reporting this potential issue. Could you please share more information about your setup?

  • how is the W5500 connected to the ESP Devkit? Via jumper wires?
  • what is your SPI frequency setting?

@mickeyl
Contributor Author

mickeyl commented May 2, 2023

@kostaond Thanks for answering. Yes, it's all on a breadboard w/ jumper wires. I set an SPI frequency of 30 MHz.

@chikichaka

@kostaond @mickeyl Same thing happened to me.
I made a PCB with an ESP32-S3. I set an SPI frequency of 100000 Hz.
In my case, this happened about every 3 hours.

@kostaond
Collaborator

kostaond commented May 4, 2023

I ran the ethernet/basic example on our setup with the W5500 at an SPI frequency of 26.6 MHz for ~20 hours without any issue. A basic ping command was running and I had zero packet loss for the whole period. Note that our setup consists of an ESP32 Devkit with the W5500 connected via a mezzanine board, so no wires are used. I'll try to find a C6 and run it with that. Could you please describe your scenario in more detail? For example, what is the traffic?

I still tend to think your issue is more likely related to an SPI connection problem. Is your Ethernet connection lost when you experience the issue, or does it continue to work properly?

@kostaond
Collaborator

kostaond commented May 9, 2023

@mickeyl I obtained ESP32C6 and ran it for more than 24 hours without any issue. Details of my test setup:

  • W5500 connected to ESP32-C6-DevKitM-1 via short wires (~5 cm) interconnected on breadboard.
  • ethernet/basic example
  • SPI CLK = 20 MHz
  • SCLK_GPIO 6
    MOSI_GPIO 7
    MISO_GPIO 2
    CS0_GPIO 15
    INT0_GPIO 4
  • traffic, just simple Linux ping in default configuration

@ndedobbeleer

I have the same problem.

After about 20 minutes:

E (2037620) w5500.mac: emac_w5500_read_phy_reg(335): read PHY register failed
E (2037620) w5500.phy: w5500_update_link_duplex_speed(69): read PHYCFG failed
E (2037620) w5500.phy: w5500_get_link(112): update link duplex speed failed

W5500 + ESP32S3 on a custom PCB with ESP IDF V5.0.1

  • Clock Speed = 36 MHz
  • SCLK_GPIO = 40
  • MOSI_GPIO = 38
  • MISO_GPIO = 39
  • ETH_CS_GPIO = 41
  • ETH_INT_GPIO = 18

I also have an ADC (MCP3462) on the same SPI bus:

  • Clock Speed = 10 MHz
  • ADC_CS_GPIO = 21

@kostaond
Collaborator

When SPI CLK is configured to 36 MHz in menuconfig, the driver re-calculates it to the nearest hardware-compatible value, so the frequency actually configured is 40 MHz (see https://docs.espressif.com/projects/esp-idf/en/latest/esp32s3/api-reference/peripherals/spi_master.html#spi-clock-frequency for more information).

This could be considered out of spec for the W5500, since its datasheet says: “The minimum guaranteed speed of the SCLK is 33.3 MHz which was tested and measured with the stable waveform.” On the other hand, it also states: “Even though theoretical design speed is 80MHz, the signal in the high speed may be distorted because of the circuit crosstalk and the length of the signal line”. Therefore it is very important to take care of the signal path. I am not a hardware engineer, so I cannot provide more details. However, consider checking the following links:
https://resources.pcb.cadence.com/blog/2019-tips-for-optimal-high-speed-spi-layout-routing
https://resources.altium.com/p/there-spi-trace-impedance-requirement
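
The clock re-calculation mentioned above can be approximated with a small host-side sketch. It assumes an 80 MHz SPI source clock and a plain integer divider; the real driver also uses a pre-divider and may round differently, so treat this as a simplified model. The names `nearest_spi_clock_hz` and `exceeds_w5500_guaranteed_sclk` are hypothetical helpers, not ESP-IDF APIs. The 33.3 MHz constant is the guaranteed figure from the datasheet quote.

```c
#include <stdbool.h>
#include <stdint.h>

/* Guaranteed SCLK figure taken from the W5500 datasheet quote above. */
#define W5500_GUARANTEED_SCLK_HZ 33300000u

/* Hypothetical helper (not an ESP-IDF API): returns the frequency closest to
 * requested_hz that can be produced as src_hz / div for an integer div in
 * 1..max_div. This is a simplified model of the driver's behavior. */
uint32_t nearest_spi_clock_hz(uint32_t src_hz, uint32_t requested_hz, uint32_t max_div)
{
    uint32_t best_hz = 0;
    uint32_t best_err = UINT32_MAX;

    for (uint32_t div = 1; div <= max_div; ++div) {
        uint32_t hz = src_hz / div;
        uint32_t err = (hz > requested_hz) ? hz - requested_hz : requested_hz - hz;
        if (err < best_err) {   /* strict '<' keeps the higher frequency on a tie */
            best_err = err;
            best_hz = hz;
        }
    }
    return best_hz;
}

/* True if the effective clock is above the datasheet's guaranteed figure. */
bool exceeds_w5500_guaranteed_sclk(uint32_t sclk_hz)
{
    return sclk_hz > W5500_GUARANTEED_SCLK_HZ;
}
```

Under these assumptions, a 36 MHz request resolves to 40 MHz, which is above the guaranteed figure, while a 30 MHz request resolves to roughly 26.67 MHz, which is below it.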

I tried to connect the W5500 to an ESP32S3 in the same configuration as provided by @ndedobbeleer. I used short wires of matched length interconnected on a breadboard, and I was able to run it without any issue for more than 2 hours so far. Frankly speaking, I was quite surprised that it works in this configuration at 40 MHz, since we have a purpose-designed PCB (ESP32 based) and it does not work unless I tune the SPI configuration:

    spi_device_interface_config_t spi_devcfg = {
...
        .input_delay_ns = <value>,
        .cs_ena_posttrans = <value>
    };

You could also try to adjust input_delay_ns to match your PCB specifics.

@ndedobbeleer

@kostaond Thanks for your response and help.

After some testing this afternoon, if I don't send a request to the device, the problem does not occur.

The previous test was done with a modbus request sent every second (did not try with a spam ping).

I will try with a lower SPI clock speed and let you know.

@ndedobbeleer

@kostaond I have lowered the SPI speed to 20 MHz, but the issue is still the same.

How quickly the problem appears depends on the type of request sent over Ethernet:

  • With a ping request (Windows command) I cannot reproduce the issue (tested for 12 hours without a problem)
  • With an HTTP request that serves a simple web page, the problem occurs after a few hours (request rate: one per second)
  • With a Modbus request sent every second, the problem occurs after about 20 minutes.

I also tried with lower speeds. The lower the speed, the faster the problem appears.

@kostaond
Collaborator

@ndedobbeleer could you please provide a code example with Modbus so I can quickly try to reproduce it?

@ndedobbeleer

@kostaond

I have pushed a project with minimal code to reproduce this error:
https://github.com/Stay-Info/EspW5500FailDemo

While testing this project, I realized that the problem did not appear if no other SPI device was used on the same bus.

For the example I used an MCP3462 ADC but I think spamming requests to any other SPI device will return the same result.

@mickeyl
Contributor Author

mickeyl commented May 18, 2023

My tests were run without any other SPI devices. I have to investigate more to provide solid debug data, but for me it seems the likelihood of a failure increased when devices were hopping on and off the Ethernet switch I used.

@ndedobbeleer

@mickeyl

The second SPI device may just increase the chances of the problem occurring.

I can confirm that for my case, but I cannot confirm that the problem never appears without the second device. I will run a test over a very long period and report the result.

Interestingly, SPI communication with the second device is also blocked when the W5500 indicates a fault. This reminds me of a problem with a mutex.

@ndedobbeleer

@kostaond

I was able to do some additional tests. And I couldn't reproduce the error without another SPI device present on the bus.

I can also say that my SPI transactions hang indefinitely at this step:

esp_err_t SPI_MASTER_ATTR spi_device_get_trans_result(spi_device_handle_t handle, spi_transaction_t **trans_desc, TickType_t ticks_to_wait)
{
    BaseType_t r;
    spi_trans_priv_t trans_buf;
    SPI_CHECK(handle!=NULL, "invalid dev handle", ESP_ERR_INVALID_ARG);
    //use the interrupt, block until return
    r=xQueueReceive(handle->ret_queue, (void*)&trans_buf, ticks_to_wait); // <=== BLOCKED HERE
    if (!r) {
        // The memory occupied by rx and tx DMA buffer destroyed only when receiving from the queue (transaction finished).
        // If timeout, wait and retry.
        // Every in-flight transaction request occupies internal memory as DMA buffer if needed.
        return ESP_ERR_TIMEOUT;
    }
    //release temporary buffers
    uninstall_priv_desc(&trans_buf);
    (*trans_desc) = trans_buf.trans;
    return ESP_OK;
}

I tried to dig around a bit in the source code of the SPI library but couldn't find where the SPI result is added to the queue.

@KaeLL
Contributor

KaeLL commented May 30, 2023

Reminds me of #6624 and #8179

@kostaond
Collaborator

@ndedobbeleer, what functions do you use to communicate with MCP3462? Could you please try to use spi_device_polling_transmit?

@ndedobbeleer

@kostaond

I was using spi_device_transmit.

I replaced all my functions with spi_device_polling_transmit and ran a test again. With this configuration, I did not encounter any problem in 16 hours.

I don't need to specifically use spi_device_transmit. So that solves the problem in my case.

@kostaond
Collaborator

kostaond commented Jun 2, 2023

@ndedobbeleer good to hear. I will pass your observation to the team responsible for SPI.

@mickeyl
Contributor Author

mickeyl commented Mar 10, 2024

I must come back to this thread here.

By now we've moved into production and I'm currently working on the production testing. Our device is based on ESP32-S3-WROOM-1-N16R8 with the W5500 via SPI and an MCP2518fd on another SPI bus, plus a bit of glue logic. I see the W5500 failing after 2 hours when the only network activity is esp_ping. Here's the relevant output before things go wrong:

I (7101678) TestClient: EMV Test Active...
I (7102065) TEST KLINE: ECHO: PING
I (7102224) TEST IO: ZGW ON
I (7102557) TEST CAN: Sending CAN message
64 bytes from 192.168.42.1 icmp_seq=7098 ttl=64 time=1 ms
I (7102678) TestClient: EMV Test Active...
I (7103070) TEST KLINE: ECHO: PONG
I (7103224) TEST IO: ZGW OFF
I (7103558) TEST CAN: Sending CAN message
64 bytes from 192.168.42.1 icmp_seq=7099 ttl=64 time=1 ms
I (7103678) TestClient: EMV Test Active...
I (7104075) TEST KLINE: ECHO: PING
I (7104224) TEST IO: ZGW ON
I (7104559) TEST CAN: Sending CAN message
64 bytes from 192.168.42.1 icmp_seq=7100 ttl=64 time=1 ms
I (7104678) TestClient: EMV Test Active...
I (7105076) TEST KLINE: ECHO: 
I (7105224) TEST IO: ZGW OFF
I (7105560) TEST CAN: Sending CAN message
64 bytes from 192.168.42.1 icmp_seq=7101 ttl=64 time=1 ms
I (7105678) TestClient: EMV Test Active...
I (7106081) TEST KLINE: ECHO: PONG
I (7106224) TEST IO: ZGW ON
I (7106561) TEST CAN: Sending CAN message
E (7106654) w5500.mac: w5500_spi_write(136): spi transmit failed
E (7106654) w5500.mac: w5500_write_buffer(254): write TX buffer failed
E (7106657) w5500.mac: emac_w5500_transmit(570): write frame failed
E (7106664) ping_sock: send error=0
I (7106678) TestClient: EMV Test Active...
I (7107086) TEST KLINE: ECHO: PING
I (7107224) TEST IO: ZGW OFF
I (7107562) TEST CAN: Sending CAN message
From 192.168.42.1 icmp_seq=7102 timeout
E (7107668) w5500.mac: w5500_spi_write(136): spi transmit failed
E (7107668) w5500.mac: w5500_write_buffer(254): write TX buffer failed
E (7107674) w5500.mac: emac_w5500_transmit(570): write frame failed
I (7107681) TestClient: EMV Test Active...
E (7107681) ping_sock: send error=0
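
The number in parentheses in each log line can be related to the reported run time. A minimal sketch, assuming the default log timestamp (milliseconds since boot) is in use; `esp_log_line_uptime_ms` and `uptime_from_ms` are hypothetical helpers written for this illustration, not ESP-IDF functions:

```c
#include <stdint.h>
#include <stdio.h>

struct uptime {
    uint32_t hours;
    uint32_t minutes;
    uint32_t seconds;
};

/* Extracts the timestamp from a line shaped like "E (7106654) tag: message".
 * Returns 0 on success, -1 if the line does not have that shape. */
int esp_log_line_uptime_ms(const char *line, uint32_t *out_ms)
{
    char level;
    unsigned long ms;

    if (sscanf(line, "%c (%lu)", &level, &ms) != 2) {
        return -1;
    }
    *out_ms = (uint32_t)ms;
    return 0;
}

/* Splits a millisecond count into hours, minutes and seconds. */
struct uptime uptime_from_ms(uint32_t ms)
{
    uint32_t total_s = ms / 1000u;
    struct uptime u = {
        .hours = total_s / 3600u,
        .minutes = (total_s % 3600u) / 60u,
        .seconds = total_s % 60u,
    };
    return u;
}
```

Applied to the first `spi transmit failed` line above, the timestamp 7106654 works out to 1 h 58 min 26 s, which matches the "after 2 hours" description.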

This repeats until the device finally crashes:

E (7314151) ping_sock: send error=0
I (7314212) TestClient: EMV Test Active...
I (7314234) TEST IO: ZGW ON
E (7314481) w5500.mac: w5500_spi_write(136): spi transmit failed
E (7314481) w5500.mac: w5500_write_buffer(254): write TX buffer failed
E (7314484) w5500.mac: emac_w5500_transmit(570): write frame failed
I (7314780) TEST CAN: Sending CAN message
I (7315040) TEST KLINE: ECHO: PONG
From 192.168.42.1 icmp_seq=7308 timeout
E (7315151) ping_sock: send error=0
I (7315212) TestClient: EMV Test Active...
I (7315234) TEST IO: ZGW OFF
I (7315781) TEST CAN: Sending CAN message
I (7316045) TEST KLINE: ECHO: PING
From 192.168.42.1 icmp_seq=7309 timeout
I (7316212) TestClient: EMV Test Active...
I (7316234) TEST IO: ZGW ON
I (7316782) TEST CAN: Sending CAN message

abort() was called at PC 0x42062227 on core 1


Backtrace: 0x40375d66:0x3fcb6cf0 0x40381109:0x3fcb6d10 0x40388762:0x3fcb6d30 0x42062227:0x3fcb6da0 0x4206225c:0x3fcb6dc0 0x42061f0e:0x3fcb6de0 0x420611d1:0x3fcb6e00 0x42016d3e:0x3fcb6e20 0x42017406:0x3fcb6e80 0x4200a7f6:0x3fcb6ea0




ELF file SHA256: 0812bdd9e

CPU halted.

I can rule out the other device; it was still up and running. What could be responsible for SPI suddenly ceasing to work?

@ndedobbeleer

If it helps: we have been using our products in production for 6 months now, and I can confirm that spi_device_polling_transmit definitively solved the problem for us. Thanks to kostaond for his help.

@kostaond
Collaborator

kostaond commented Mar 11, 2024

@mickeyl would it be possible to provide a minimal reproducible code example so that we can reproduce the issue on our side?

@mickeyl
Contributor Author

mickeyl commented Mar 11, 2024

@kostaond This is going to be tough, but I'll try eventually. At first I'm going to check whether it really doesn't happen without a second SPI transfer in place. Then I'll hammer both SPI busses with traffic and check whether I can get it to break early. If so, I'll swap the SPI device with a temperature sensor and try again (since I can't disclose the MCP2518fd driver for now).

@mickeyl
Contributor Author

mickeyl commented Mar 16, 2024

@kostaond

In an attempt to finally dive deeper into this issue, I took the weekend and carried out a lot of stress tests with the following setup:

Our product is a custom ESP32S3-based board with a W5500 (SPI3_HOST) and an MCP2518fd CAN controller (SPI2_HOST). The design follows the product recommendations from Wiznet and Microchip very closely and it has been done by an experienced EE, so I'm pretty sure there are no hardware issues. The test software configures the SPI peripherals bus speeds for W5500 at 20MHz and the MCP2518fd at 16MHz (slightly lower than the supported maximum of 17.5MHz). The software is compiled with -Os and a log level of I.

The application on the device under test is based on examples/network/bridge, where WiFi and the W5500 Ethernet are combined into one virtual interface running a DHCP server. The MCP2518fd driver simply echoes all the frames it receives back on the CAN bus.

To test the performance, there are three auxiliary systems attached:

  1. A Raspberry Pi 4 via a cross-over Ethernet cable to the W5500. On this machine, I'm running iperf -s:
iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
  2. A Raspberry Pi 4 with a CAN (gs_usb) attached to the MCP2518fd. The bus is properly terminated.
    This machine sends random can-bus frames very quickly using cangen can0 -g 0.45. This leads to a 70%
    bus load (measured with canbusload -cbr can0@500000), which is almost saturating the ISR.

  3. A Linux PC (Thinkpad X1 Carbon 2021), which connects to the ESP32S3 via WiFi. It runs the client part
    of iperf endlessly.

I let this run for several hours, at first with a CPU speed of 160MHz and DIO.
Without CAN traffic I'm getting the following speeds:

[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.5857 sec  10.9 MBytes  8.62 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 52050
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.4410 sec  10.9 MBytes  8.74 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 48290
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.5355 sec  10.9 MBytes  8.66 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 38148
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.6031 sec  11.0 MBytes  8.70 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 57326
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.5154 sec  10.9 MBytes  8.68 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 56776
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.4982 sec  10.8 MBytes  8.59 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 41376
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.3880 sec  10.8 MBytes  8.68 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 59300
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.4865 sec  10.8 MBytes  8.60 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 42276
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.4605 sec  10.8 MBytes  8.62 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 51660
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.6646 sec  11.0 MBytes  8.65 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.3 port 35296

With CAN traffic I'm getting the following speeds:

[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.8194 sec  10.0 MBytes  7.75 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 59222
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.8526 sec  10.1 MBytes  7.83 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 41688
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.4855 sec  9.63 MBytes  7.70 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 39026
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.4220 sec  9.63 MBytes  7.75 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 47770
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.9810 sec  10.1 MBytes  7.73 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 59350
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.9278 sec  10.1 MBytes  7.77 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 48874
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.7493 sec  9.88 MBytes  7.71 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 42004
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.4291 sec  9.63 MBytes  7.74 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 51532

The slight drop in performance is probably a result of the higher system load, so
I'm fine with it. For a while this went ok, including the occasional watchdog complaint...

E (1932971) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time:
E (1932971) task_wdt:  - IDLE1 (CPU 1)
E (1932971) task_wdt: Tasks currently running:
E (1932971) task_wdt: CPU 0: wifi
E (1932971) task_wdt: CPU 1: ReceiveTask
E (1932971) task_wdt: Print CPU 1 backtrace

Backtrace: 0x4037875F:0x3FC9BCB0 0x40377295:0x3FC9BCD0 0x400559DD:0x3FCB7B10 0x4037E082:0x3FCB7B20 0x4037D841:0x3FCB7B40 0x4200E146:0x3FCB7B80 0x4200EA1B:0x3FCB7BA0 0x4200B055:0x3FCB7BD0 0x4037DDE9:0x3FCB7C00

...and the occasional dropped CAN frame:

I (1933011) ESP-MCP251XFD: RX FIFO Overflow. Sorry, but we lost at least one CAN frame

But then... after some more time, I got the following error and the system restarted:

***ERROR*** A stack overflow in task w5500_tsk has been detected.


Backtrace: 0x40375b5a:0x3fca6d60 0x4037d3ed:0x3fca6d80 0x4037e0fe:0x3fca6da0 0x4037f44f:0x3fca6e20 0x4037e230:0x3fca6e50 0x4037e226:0xa082f590 |<-CORRUPTED

I guess this happened because the w5500_tsk wanted to emit a warning or error message.
I enlarged its stack size to 8192 and reran the test with 240MHz and QIO.
Bandwidth increased slightly:

[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.5194 sec  11.6 MBytes  9.27 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 42884
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.4907 sec  12.0 MBytes  9.60 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 44268
[ ID] Interval       Transfer     Bandwidth
[  5] 0.0000-10.2703 sec  11.8 MBytes  9.60 Mbits/sec
[  4] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 49958
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-10.3149 sec  11.8 MBytes  9.56 Mbits/sec
[  5] local 192.168.4.2 port 5001 connected with 192.168.4.4 port 37384

I let this run for a while and launched a bunch of ping processes in addition,
changed the iperf parallelization and even flood pinged the device, but couldn't
make it crash. Even if I completely overloaded it...

[ 25] 0.0000-20.3036 sec   289 KBytes   116 Kbits/sec
connect failed: Connection refused
recv failed: Resource temporarily unavailable
recv failed: Resource temporarily unavailable
recv failed: Resource temporarily unavailable
recv failed: Resource temporarily unavailable
recv failed: Resource temporarily unavailable
[ 52] 0.0000-21.7129 sec   102 KBytes  38.4 Kbits/sec
connect failed: Connection refused
recv failed: Resource temporarily unavailable
[ 16] 0.0000-23.2300 sec   525 KBytes   185 Kbits/sec
connect failed: Connection refused
recv failed: Resource temporarily unavailable
recv failed: Resource temporarily unavailable
[ 23] 0.0000-23.8224 sec   106 KBytes  36.5 Kbits/sec
connect failed: Connection refused

...it always recovered.

So besides that one crash beforehand, everything went really great.
Since I can still reproduce the W5500 hangups with my full application though,
it must be something else.

I'm afraid these tests didn't help much for the bug report in question, although they increased my confidence that in general the hardware combination and the included drivers are really solid.

I guess it makes most sense to call it a day with regards to this bug report for now.
When I opened it one year ago, it was referring to my work with jumper wires and a C6-devboard.
Since I have different hardware now, I think it's best to close this here and
continue to inspect what my application is doing that might destabilize the W5500 and/or LWIP stack.

I will open or contribute to another issue report with further findings. Thanks for your attention so far!

@mickeyl mickeyl closed this as completed Mar 16, 2024
@JimmyPedersen

I'm no RTOS expert but doesn't this...
E (1932971) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time:
E (1932971) task_wdt:  - IDLE1 (CPU 1)
E (1932971) task_wdt: Tasks currently running:
E (1932971) task_wdt: CPU 0: wifi
E (1932971) task_wdt: CPU 1: ReceiveTask
E (1932971) task_wdt: Print CPU 1 backtrace

... suggest that your ReceiveTask hogs the CPU for too long without yielding?
Maybe it needs a short vTaskDelay, or another yielding function, to allow the idle task to run once in a while?

@mickeyl
Contributor Author

mickeyl commented Mar 17, 2024

@JimmyPedersen Yes, usually this is correct, but in this case it shouldn't be necessary, because the receiver task is already waiting in a FreeRTOS-aware function for the next message from the queue. This only becomes a problem if the IRQs are coming in so fast that the system is starving in general -- which isn't a real problem in the field, since we rarely reach 70% bus load.

@JimmyPedersen

JimmyPedersen commented Mar 17, 2024

As long as there are no sudden spikes of data that can swamp the thread that might be fine. Personally I'm a belts and suspenders kind of guy so I would likely have added a very short vTaskDelay just in case. 😁
Good luck on your project

@mickeyl
Contributor Author

mickeyl commented Mar 17, 2024

Thanks a lot. Unfortunately, vTaskDelay's minimum delay is one millisecond (w/ CONFIG_FREERTOS_HZ=1000), and this is quite long, considering that at a (moderate) CAN speed of 500 kbit/s I might receive at least 4 frames during that time. At 1 Mbit/s, it could be 8 frames, so we can't really sleep.
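
The frames-per-tick estimate can be reproduced with a short calculation. This is a rough sketch that assumes classic CAN frames with an 11-bit identifier and 8 data bytes, sent back to back without stuff bits (108 frame bits plus 3 bits of interframe space). The helper `can_frames_per_tick` is hypothetical and only illustrates the arithmetic.

```c
#include <stdint.h>

/* Assumed frame size: classic CAN, 11-bit ID, 8 data bytes, no stuff bits.
 * 108 frame bits + 3 bits of interframe space. */
#define CAN_FRAME_BITS_8_BYTES 111u

/* Hypothetical helper: number of complete back-to-back frames that fit into
 * one RTOS tick, i.e. the shortest period a tick-based delay can sleep. */
uint32_t can_frames_per_tick(uint32_t bitrate_bps, uint32_t frame_bits, uint32_t tick_hz)
{
    if (frame_bits == 0u || tick_hz == 0u) {
        return 0u;
    }
    uint32_t bits_per_tick = bitrate_bps / tick_hz;
    return bits_per_tick / frame_bits;
}
```

With a 1000 Hz tick this gives 4 frames per tick at 500 kbit/s and 9 at 1 Mbit/s under the stated assumptions. Stuff bits lengthen real frames and lower the count, while shorter payloads raise it.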

That said, I have recently published esp-microsleep, which works around that issue.

On another note… applications like mine might be an interesting playground for FreeRTOS AMP -- where time critical stuff (like CAN-FD in our case) happens on a non-FreeRTOS core.
