W5500 fails after 20 minutes of operation (IDFGH-10018) #11295
Comments
@mickeyl thanks for reporting the potential issue. Could you please share more information about your setup?
@kostaond Thanks for answering. Yes, it's all on a breadboard with jumper wires. I set an SPI frequency of 30 MHz.
I tried to run I still tend to think your issue is more related to an SPI connection problem. Is your Ethernet connection lost when you experience the issue, or does it continue to work properly?
@mickeyl I obtained an ESP32C6 and ran it for more than 24 hours without any issue. Details of my test setup:
I have the same problem after about 20 minutes:
W5500 + ESP32S3 on a custom PCB with ESP-IDF v5.0.1
I also have an ADC (MCP3462) on the same SPI bus:
When configuring SPI CLK to 36 MHz in menuconfig, the driver recalculates it to the nearest hardware-compatible value, so the configured frequency is in fact 40 MHz (see https://docs.espressif.com/projects/esp-idf/en/latest/esp32s3/api-reference/peripherals/spi_master.html#spi-clock-frequency for more information). This could be considered out of spec for the W5500, since its datasheet says: “The minimum guaranteed speed of the SCLK is 33.3 MHz which was tested and measured with the stable waveform.” On the other hand, it also states: “Even though theoretical design speed is 80MHz, the signal in the high speed may be distorted because of the circuit crosstalk and the length of the signal line”. It is therefore very important to take care of the signal path. I am not a hardware engineer, so I cannot provide more details. However, consider checking the following links:

I tried to connect the W5500 to an ESP32S3 in the same configuration as provided by @ndedobbeleer. I used short wires of matched length on a breadboard and I have been able to run it without any issue for more than 2 hours so far. Frankly speaking, I was quite surprised that it works in this configuration at 40 MHz, since we have a specifically designed PCB (ESP32-based) and it does not work unless I tune the SPI configuration:

    spi_device_interface_config_t spi_devcfg = {
        ...
        .input_delay_ns = <value>,
        .cs_ena_posttrans = <value>
    };

You could also try to adjust input_delay_ns to match your PCB specifics.
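To illustrate why a 36 MHz request can end up as 40 MHz, here is a minimal host-side sketch of the rounding. It assumes the SPI clock is derived from an 80 MHz source by an integer divider and that the closest achievable value wins; `SPI_SOURCE_HZ` and `nearest_spi_hz` are hypothetical names for illustration, not the driver's actual API, and the real driver logic is more involved.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the SPI clock comes from an 80 MHz source through an
 * integer divider. This is a simplified model, not the driver's code. */
#define SPI_SOURCE_HZ 80000000u

/* Hypothetical helper: return the achievable frequency
 * (SPI_SOURCE_HZ / d, with d >= 1) that is closest to the request.
 * On a tie, the higher frequency (smaller divider) is kept. */
uint32_t nearest_spi_hz(uint32_t requested_hz)
{
    assert(requested_hz > 0);

    uint32_t best_hz = SPI_SOURCE_HZ;
    uint32_t best_err = UINT32_MAX;

    for (uint32_t d = 1; d <= 8192; ++d) {
        uint32_t hz = SPI_SOURCE_HZ / d;
        uint32_t err = (hz > requested_hz) ? hz - requested_hz
                                           : requested_hz - hz;
        if (err < best_err) {
            best_err = err;
            best_hz = hz;
        }
    }
    return best_hz;
}
```

Under this model a 36 MHz request lands on 40 MHz (divider 2) because the next candidate, about 26.67 MHz (divider 3), is further away. Requests of 20 MHz and 16 MHz map exactly to dividers 4 and 5.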
@kostaond Thanks for your response and help. After some testing this afternoon, I found that the problem does not occur if I don't send any requests to the device. The previous test was done with a Modbus request sent every second (I did not try with a ping flood). I will try a lower SPI clock speed and let you know.
@kostaond I have lowered the SPI speed to 20 MHz, but the issue is still the same. How often the problem appears depends on the type of request sent over Ethernet:
I also tried lower speeds. The lower the speed, the sooner the problem appears.
@ndedobbeleer could you please provide a code example with Modbus so I can quickly try to reproduce it?
I have pushed a project with minimal code to reproduce this error: While testing this project, I realized that the problem does not appear if no other SPI device is used on the same bus. For the example I used an MCP3462 ADC, but I think spamming requests to any other SPI device will produce the same result.
My tests were run without any other SPI devices. I have to investigate more to provide solid debug data, but to me it seems the likelihood of a failure increased when devices were hopping on and off the Ethernet switch I used.
The second SPI device may just increase the chances of the problem occurring. I can confirm that in my case, but I cannot confirm that the problem never appears without the second device. I will run a test over a very long period and report the result. Interestingly, SPI communication with the second device is also blocked when the W5500 indicates a fault. This reminds me of a problem with a mutex.
I was able to do some additional tests, and I couldn't reproduce the error without another SPI device present on the bus. I can also say that my SPI transactions hang indefinitely at this step:

    esp_err_t SPI_MASTER_ATTR spi_device_get_trans_result(spi_device_handle_t handle, spi_transaction_t **trans_desc, TickType_t ticks_to_wait)
    {
        BaseType_t r;
        spi_trans_priv_t trans_buf;
        SPI_CHECK(handle!=NULL, "invalid dev handle", ESP_ERR_INVALID_ARG);
        //use the interrupt, block until return
        r=xQueueReceive(handle->ret_queue, (void*)&trans_buf, ticks_to_wait); // <=== BLOCKED HERE
        if (!r) {
            // The memory occupied by rx and tx DMA buffer destroyed only when receiving from the queue (transaction finished).
            // If timeout, wait and retry.
            // Every in-flight transaction request occupies internal memory as DMA buffer if needed.
            return ESP_ERR_TIMEOUT;
        }
        //release temporary buffers
        uninstall_priv_desc(&trans_buf);
        (*trans_desc) = trans_buf.trans;
        return ESP_OK;
    }

I tried to dig around a bit in the source code of the SPI library but couldn't find where the SPI result is added to the queue.
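The listing above returns ESP_ERR_TIMEOUT when the queue receive times out, so a caller that passes a finite ticks_to_wait can turn a silent hang into a detectable error. Below is a minimal host-side sketch of that bounded-wait pattern. All names (`wait_with_deadline`, `poll_result_fn`, the `poll_*` test doubles) are hypothetical; on the target, the poll callback would presumably wrap a call to spi_device_get_trans_result with a finite timeout, which is an assumption rather than something tested in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { WAIT_OK = 0, WAIT_TIMED_OUT = 1 } wait_status_t;

/* Hypothetical callback: waits for at most one time slice and returns
 * true once the transaction result is available. */
typedef bool (*poll_result_fn)(void *ctx);

/* Poll up to max_slices times instead of blocking forever. The number
 * of slices actually spent is reported through slices_used (optional). */
wait_status_t wait_with_deadline(poll_result_fn poll, void *ctx,
                                 unsigned max_slices, unsigned *slices_used)
{
    assert(poll != NULL);

    unsigned used = 0;
    wait_status_t status = WAIT_TIMED_OUT;

    while (used < max_slices) {
        ++used;
        if (poll(ctx)) {
            status = WAIT_OK;
            break;
        }
    }
    if (slices_used != NULL) {
        *slices_used = used;
    }
    return status;
}

/* Test double: models a completion that never arrives (the hang). */
bool poll_never_ready(void *ctx)
{
    (void)ctx;
    return false;
}

/* Test double: becomes ready after *ctx unsuccessful polls. */
bool poll_ready_after(void *ctx)
{
    unsigned *remaining = ctx;
    if (*remaining == 0) {
        return true;
    }
    --*remaining;
    return false;
}
```

With a bounded wait, the application can log the failure and decide how to recover instead of stalling every device on the bus. It does not explain why the completion is lost in the first place.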
@ndedobbeleer, what functions do you use to communicate with MCP3462? Could you please try to use |
I was using I replaced all my functions with I don't need to specifically use |
@ndedobbeleer good to hear. I will pass your observation on to the team responsible for SPI.
I have to come back to this thread. By now we've moved into production and I'm currently working on the production testing. Our device is based on an ESP32-S3-WROOM-1-N16R8 with the W5500 on one SPI bus and an MCP2518fd on another, plus a bit of glue logic. I see the W5500 failing after 2 hours when the only activity is esp_ping. Here's the relevant output before things go wrong:
This repeats before finally the device crashes:
I can rule out the other device, since it was still up and running. What could be responsible for SPI suddenly ceasing to work?
If that helps, we have been using our products in production for 6 months now and I can confirm that
@mickeyl would it be possible to provide a minimal code example so that we can reproduce the issue on our side?
@kostaond This is going to be tough, but I'll try eventually. First I'm going to check whether it really doesn't happen without a second SPI transfer in place. Then I'll hammer both SPI buses with traffic and check whether I can get it to break early. If so, I'll swap the SPI device for a temperature sensor and try again (since I can't disclose the MCP2518fd driver for now).
In an attempt to finally dive deeper into this issue, I took the weekend and carried out a lot of Our product is a custom ESP32S3-based board with a W5500 (SPI3_HOST) and an MCP2518fd CAN controller (SPI2_HOST). The design follows the product recommendations from Wiznet and Microchip very closely and it has been done by an experienced EE, so I'm pretty sure there are no hardware issues. The test software configures the SPI bus speeds at 20 MHz for the W5500 and 16 MHz for the MCP2518fd (slightly lower than the supported maximum of 17.5 MHz). The software is compiled with The application on the device under test is based on To test the performance, there are three auxiliary systems attached:
I let this run for several hours, at first with a CPU speed of 160 MHz and DIO.
With CAN traffic I'm getting the following speeds:
The slight drop in performance is probably due to the higher system load, so
...and the occasional dropped CAN frame:
But then... after some more time, I got the following error and the system restarted:
I guess this happened because the
I let this run for a while and launched a bunch of ping processes in addition,
...it always recovered. So apart from that one earlier crash, everything went really well. I'm afraid these tests didn't help much with the bug report in question, although they increased my confidence that, in general, the hardware combination and the included drivers are really solid. I guess it makes most sense to call it a day with regard to this bug report for now. I will open or contribute to another issue report with further findings. Thanks for your attention so far!
I'm no RTOS expert, but doesn't this... ... suggest that your ReceiveTask hogs the CPU for too long without yielding?
@JimmyPedersen Yes, usually this is correct, but in this case it shouldn't be necessary, because the receiver task is already waiting in a FreeRTOS-aware function for the next message from the queue. This only becomes a problem if the IRQs come in so fast that the system is starving in general -- which isn't a real problem in the field, since we rarely reach 70% bus load.
As long as there are no sudden spikes of data that can swamp the thread, that might be fine. Personally I'm a belt-and-suspenders kind of guy, so I would likely have added a very short vTaskDelay just in case. 😁
Thanks a lot. Unfortunately, That said, I have recently published esp-microsleep, which works around that issue. On another note… applications like mine might be an interesting playground for FreeRTOS AMP -- where time critical stuff (like CAN-FD in our case) happens on a non-FreeRTOS core. |
Answers checklist.
IDF version.
v5.2-dev-321-ga8b6a70620
Operating System used.
Linux
How did you build your project?
VS Code IDE
If you are using Windows, please specify command line type.
None
Development Kit.
ESP32-C6-DevKit M1
Power Supply used.
USB
What is the expected behavior?
W5500 should be stable over a long time.
What is the actual behavior?
After 20 minutes, it stops operating. See debug log.
Steps to reproduce.
Use an Ethernet example.
Debug Logs.