Random infrequent TG1WDT_SYS_RESET when using single core & PSRAM (IDFGH-2240) #4388
Comments
Let me rephrase that: what is the officially supported ESP32 environment using PSRAM? Meaning, guaranteed to be stable? |
What does that function have to do with psram? |
@negativekelvin CONFIG_FREERTOS_UNICORE is required when working with psram. The message indicates this configuration is not supported, or equally, ESP32 + PSRAM is not supported. |
All it says is you can't call |
this got triggered due to a crash restart in unicore mode
It was a spurious WDT reset. Can't share any more details I'm afraid, and it's not reproducible. |
Point is, I didn't call wdt_reset_info_dump explicitly. The ESP32 crashed due to an unknown reason and the only info I got was "CPU not support!" |
Ok, it's a minor bug that it is called in unicore mode, but it shouldn't have any effect other than printing an error: esp-idf/components/bootloader_support/src/bootloader_init.c, lines 533 to 535 in 93a8603
The error was due to a legitimate interrupt WDT reset, for reasons unknown. |
I don't have any custom interrupts. This is 100% non reproducible, only happens once a day or so, under heavy load. Any tips? |
The only info there is PC=0x40092997 so look it up using addr2line |
I've tried that. It's mostly random. Sometimes it gets triggered in the task switching code. No idea how to track this down, and reproduction takes hours. |
Hi @szmodz, as @negativekelvin says, the "WDT reset info: &s CPU not support!" log line can be ignored here. We'll fix this in the bootloader. Sorry for the misleading information. A TG1WDT_SYS_RESET reset is usually caused by the interrupt WDT triggering (meaning an interrupt handler ran for too long, or interrupts were disabled by a critical section for too long). Some questions which will help any additional debugging:
The fix may be as simple as increasing the WDT timeout value to allow for delays when accessing SPIRAM. Otherwise, it probably indicates some kind of timing issue with interrupts or critical sections. These are unfortunately hard to debug. Decoding the reset PC addresses with addr2line and looking for patterns may give some clues.
@projectgus I'm currently running latest master, but it's not "clean" (based on 93a8603). I've used v3.2 before, switched to master to see if that changes anything. The compiler is currently 2019r2. I've also tried the other experimental one. I've already tried increasing the WDT timeout value to 800ms. Same problem. The project in question uses only the ethernet peripheral. I've found another bug in the master branch, related to the ethernet driver (not present in the stable SDKs), will report that elsewhere. I was actually thinking the problem may be caused by memory corruption, which is causing some other part of the system to malfunction. The WDT timeouts pretty often happen in the task switching code. I'm also investigating a different problem, which may or may not be related. Will post details as soon as I have them. |
Here comes the latest one:
Stack smashing protection, stack overflow checking using breakpoints, etc. is enabled.
Hi @szmodz, register window spill exceptions are a normal occurrence, but obviously it shouldn't be hanging or triggering an interrupt WDT during the window spill. Without a backtrace (which unfortunately we can't get after a reset) it's not really possible to know the context in which this happened or how it got stuck for long enough to trigger the INT WDT timeout. I saw in the linked issue that you're using ethernet and have all allocations defaulting to PSRAM, which is causing some crashes. This may be a symptom of the same problem, if we're lucky, depending on the sequence of other interrupts and CPU exceptions. Assuming you use the ESP32 internal Ethernet MAC, does anything change if you replace ESP_INTR_FLAG_IRAM with 0 in esp_eth_mac_esp32.c? If not this, it could really be anything in your app's firmware, so it's impossible to tell without more information. You don't have the CONFIG_SPIRAM_ALLOW_STACK_EXTERNAL_MEMORY flag enabled, do you?
SPIRAM_ALLOW_STACK_EXTERNAL_MEMORY is disabled, and so is SPIRAM_ALLOW_BSS_SEG_EXTERNAL_MEMORY. I've patched the ethernet driver to use heap_caps_calloc instead of plain calloc, and the Ethernet related crashes ("Cache disabled...") are gone. Shouldn't that be enough? Is removing ESP_INTR_FLAG_IRAM still necessary? Are there any other PSRAM usage situations that can cause corruption (or WDT timeouts), but no "Cache disabled but cached memory region accessed" exception? The firmware is unfortunately pretty big (multiple simultaneous HTTPS connections, XML and JSON parsers, and the Lua interpreter on top of that). But I think I've managed to narrow the source down to a manageable portion of the code. The problem is that this part works fine if it's used alone. It only starts causing problems when there are other activities running in parallel. Are there any known issues related to nonblocking sockets (see #4407)? I've hacked around the C library problem, and it seems to work, but perhaps there are other issues with nonblocking sockets and that's why fcntl was disabled in the first place? The code which seems to be the cause of the problems relies on nonblocking sockets. I can't see anything else that's suspicious. The socket in question is not used by multiple threads at once, but there is heavy network traffic in other threads (which all seem to work fine when this particular thread is disabled).
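For readers following along, the calloc to heap_caps_calloc change described in this comment could look roughly like the sketch below. It is not the actual esp_eth_mac_esp32.c patch. The names demo_driver_ctx_t and demo_driver_ctx_create are hypothetical, the capability flags MALLOC_CAP_INTERNAL | MALLOC_CAP_DMA are an assumption about what such a driver needs, and the #else branch is only a host-side stand-in (with placeholder flag values) so the snippet compiles off-target. The idea is that when allocations default to PSRAM, memory touched by an IRAM interrupt handler has to be requested from internal RAM explicitly.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#ifdef ESP_PLATFORM
#include "esp_heap_caps.h"   /* real heap_caps_calloc() and MALLOC_CAP_* flags */
#else
/* Host-side stand-ins so the sketch builds off-target.
 * The flag values are placeholders, not the real ESP-IDF definitions. */
#define MALLOC_CAP_DMA      (1u << 0)
#define MALLOC_CAP_INTERNAL (1u << 1)
static void *heap_caps_calloc(size_t n, size_t size, uint32_t caps)
{
    (void)caps;              /* capabilities are meaningless on the host */
    return calloc(n, size);
}
#endif

/* Hypothetical driver state that an interrupt handler would access. */
typedef struct {
    uint32_t rx_count;
    uint8_t  rx_buf[64];
} demo_driver_ctx_t;

/* Allocate zeroed driver state from internal, DMA-capable RAM instead of
 * relying on plain calloc(), which may return PSRAM when external memory
 * is the default allocation target. Returns NULL on failure. */
demo_driver_ctx_t *demo_driver_ctx_create(void)
{
    return heap_caps_calloc(1, sizeof(demo_driver_ctx_t),
                            MALLOC_CAP_INTERNAL | MALLOC_CAP_DMA);
}
```

The returned pointer is released with free(), as with plain calloc.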
Shouldn't be necessary, but it may be worth trying this change anyhow in case the WDT reset has a similar root cause but not 100% the same.
Unsure, but I wouldn't expect LWIP to ever cause a fully silent INT WDT reset as it doesn't do anything in the interrupt context. You should at least get a panic output. (I'm assuming your project uses the default panic handler settings, and hasn't set |
I can also reproduce this when using WiFi instead of Ethernet (using DevKitC, not custom hardware). Most of the time, there's no panic. Recently I saw an "Unhandled kernel exception", see below. If I disable the INT WDT entirely, there's no reboot and no crash. It just sits there spinning.
line 440 is a retw.
That would indicate some sort of TCB or stack corruption I think. |
I also noticed that the memory allocator is sometimes called inside critical sections. That too can lead to int wdt timeouts I think. |
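To illustrate the concern about allocating inside critical sections, here is a minimal sketch of the usual workaround: do the slow allocation and copy with interrupts enabled, and keep only a pointer swap inside the critical section. This is not code from ESP-IDF or from the project in this issue. demo_publish_msg, demo_current_msg, s_lock and s_shared_msg are hypothetical names, and the #else branch replaces the FreeRTOS macros with no-op stand-ins so the snippet builds on a host.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#ifdef ESP_PLATFORM
#include "freertos/FreeRTOS.h"   /* portMUX_TYPE, portENTER_CRITICAL, ... */
#else
/* Host-side no-op stand-ins; they provide no real mutual exclusion. */
typedef int portMUX_TYPE;
#define portMUX_INITIALIZER_UNLOCKED 0
#define portENTER_CRITICAL(mux) ((void)(mux))
#define portEXIT_CRITICAL(mux)  ((void)(mux))
#endif

static portMUX_TYPE s_lock = portMUX_INITIALIZER_UNLOCKED;
static char *s_shared_msg;       /* hypothetical state shared between tasks */

/* Replace the shared message. Returns 0 on success, -1 if out of memory. */
int demo_publish_msg(const char *text)
{
    size_t len = strlen(text) + 1;
    char *copy = malloc(len);    /* potentially slow: done outside the lock */
    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, text, len);

    portENTER_CRITICAL(&s_lock); /* interrupts are masked from here ... */
    char *old = s_shared_msg;
    s_shared_msg = copy;         /* ... only for a pointer swap */
    portEXIT_CRITICAL(&s_lock);

    free(old);                   /* also outside the critical section */
    return 0;
}

/* Read accessor for the sketch; a real reader would also need the lock. */
const char *demo_current_msg(void)
{
    return s_shared_msg;
}
```

With this shape the time spent with interrupts disabled no longer depends on how long the heap takes, which matters more when the heap may sit in slower external RAM.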
Agree this looks like stack corruption. I'm afraid unless you have a minimal case we can use to debug this then it's not really something we can look into from our side, memory corruption could come from almost anywhere in the firmware. |
Actually, the code which seems to cause problems is basically this: Simply calling aws_iot_mqtt_yield seems to trigger the problem (and then there's mbedtls and lwip). Yes, I know that
also relates to: |
I rewrote the suspicious code using a different approach, and now it seems stable, but I still have no idea why the original code caused problems. |
@szmodz Good to hear. Was this some code in the AWS IoT SDK component, or in your own project? |
I replaced the AWS SDK with a different library. The project is using AWS, but doesn't use any AWS-specific features, so almost any MQTT implementation supporting TLS will work in this case. |
might be related: |
Closing as it looks like this was resolved, and unclear if there's any remaining issue in ESP-IDF. The fix for the misleading error log line following a WDT reset in unicore mode is still pending, but this issue will be updated when it merges. |
related: #5227 (comment) But I'm working on part II. |
@projectgus this is the original cause |
https://github.com/espressif/esp-idf/blob/master/components/bootloader_support/src/bootloader_init.c#L461
#if !CONFIG_FREERTOS_UNICORE
…
#else
ESP_LOGE(TAG, "WDT reset info: &s CPU not support!\n", cpu_name);
return;
#endif
What? Why? ;-)
So basically, the configuration which works is not supported?
#2892
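A side note on the log line quoted above: the format string contains "&s", which is presumably a typo for "%s" (an assumption, the thread does not confirm it). Since "&s" is not a conversion specifier, the cpu_name argument is never printed and the "&s" shows up literally in the log. The host-side sketch below demonstrates the difference with snprintf. format_wdt_msg is a hypothetical helper and "PRO" is a made-up CPU name; neither comes from the bootloader source.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format a one-argument message into buf.
 * Returns the number of characters snprintf would have written. */
int format_wdt_msg(char *buf, size_t len, const char *fmt, const char *cpu_name)
{
    /* If fmt has no %s, cpu_name is evaluated but ignored by snprintf. */
    return snprintf(buf, len, fmt, cpu_name);
}
```

Calling it with "WDT reset info: &s CPU not support!" and "PRO" yields the string unchanged, "&s" included, while the same call with "%s" in place of "&s" yields "WDT reset info: PRO CPU not support!".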