Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random infrequent TG1WDT_SYS_RESET when using single core & PSRAM (IDFGH-2240) #4388

Closed
szmodz opened this issue Nov 22, 2019 · 30 comments
Closed

Comments

@szmodz
Copy link

szmodz commented Nov 22, 2019

https://github.com/espressif/esp-idf/blob/master/components/bootloader_support/src/bootloader_init.c#L461

#if !CONFIG_FREERTOS_UNICORE

#else
ESP_LOGE(TAG, "WDT reset info: &s CPU not support!\n", cpu_name);
return;
#endif

What? Why? ;-)

So basically, the configuration which works is not supported?

#2892

@github-actions github-actions bot changed the title bootloader problem bootloader problem (IDFGH-2240) Nov 22, 2019
@szmodz
Copy link
Author

szmodz commented Nov 22, 2019

Let me rephrase that: what is the officially supported ESP32 environment using PSRAM? Meaning, guaranteed to be stable?

@negativekelvin
Copy link
Contributor

What does that function have to do with psram?

@szmodz
Copy link
Author

szmodz commented Nov 22, 2019

@negativekelvin CONFIG_FREERTOS_UNICORE is required when working with psram.
see #2892

The message indicates this configuration is not supported, or equally, ESP32 + PSRAM is not supported.

@negativekelvin
Copy link
Contributor

All it says is you can't call wdt_reset_info_dump(1) in unicore mode

@szmodz
Copy link
Author

szmodz commented Nov 22, 2019

this got triggered due to a crash restart in unicore mode

rst:0x8 (TG1WDT_SYS_RESET),boot:0x1a (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0000,len:5416
load:0x3fff1528,len:4
load:0x40078000,len:15308
load:0x40080400,len:4200
entry 0x40080638
W (172) boot: PRO CPU has been reset by WDT.
W (172) boot: WDT reset info: PRO CPU PC=0x40092997

E (173) boot: WDT reset info: &s CPU not support!

It was a spurious WDT reset. Can't share any more details I'm afraid, and it's not reproducible.

@szmodz
Copy link
Author

szmodz commented Nov 22, 2019

Point is, I didn't call wdt_reset_info_dump explicitly. The ESP32 crashed due to an unknown reason and the only info I got was "CPU not support!"

@negativekelvin
Copy link
Contributor

negativekelvin commented Nov 22, 2019

Ok it's a minor bug that it is called in unicore mode but it shouldn't have any affect other than printing an error

#if CONFIG_IDF_TARGET_ESP32
wdt_reset_info_dump(1);
#endif

@szmodz
Copy link
Author

szmodz commented Nov 22, 2019

The error was due to a legitimate interrupt WDT reset, for reasons unknown.

@szmodz
Copy link
Author

szmodz commented Nov 22, 2019

I don't have any custom interrupts. This is 100% non reproducible, only happens once a day or so, under heavy load. Any tips?

@negativekelvin
Copy link
Contributor

The only info there is PC=0x40092997 so look it up using addr2line

@szmodz
Copy link
Author

szmodz commented Nov 22, 2019

I've tried that. It's mostly random. Sometimes it gets triggered in the task switching code.

No idea how to track this down, and reproduction takes hours.

@projectgus
Copy link
Contributor

Hi @szmodz ,

As @negativekelvin says, the "WDT reset info: &s CPU not support!" log line can be ignored here. We'll fix this in the bootloader. Sorry for the misleading information.

A reset caused by TG1WDT_SYS_RESET is usually caused by the interrupt WDT triggering (meaning an interrupt handler ran for too long, or interrupts were disabled by a critical section for too long).

Some questions which will help any additional debugging:

  • What ESP-IDF version do you have? (can run git describe --tags --dirty in the IDF_PATH dir to find this out)
  • Is there any relevant log output from the ESP32 before it resets?
  • What value for CONFIG_ESP_INT_WDT_TIMEOUT_MS do you have in your project sdkconfig file?

The fix may be as simple as increasing the value of the timeout to allow for delays when accessing SPIRAM.

Otherwise, it probably indicates some kind of timing issue with interrupts or critical sections. These are unfortunately hard to debug. Decoding the PC reset addressses with addr2line and looking for patterns may give some clues.

@projectgus projectgus changed the title bootloader problem (IDFGH-2240) Random infrequent TG1WDT_SYS_RESET when using PSRAM (IDFGH-2240) Nov 25, 2019
@projectgus projectgus changed the title Random infrequent TG1WDT_SYS_RESET when using PSRAM (IDFGH-2240) Random infrequent TG1WDT_SYS_RESET when using sincle core & PSRAM (IDFGH-2240) Nov 25, 2019
@projectgus projectgus changed the title Random infrequent TG1WDT_SYS_RESET when using sincle core & PSRAM (IDFGH-2240) Random infrequent TG1WDT_SYS_RESET when using single core & PSRAM (IDFGH-2240) Nov 25, 2019
@szmodz
Copy link
Author

szmodz commented Nov 26, 2019

@projectgus I'm currently running latest master, but it's not "clean" (based on 93a8603). I've used v3.2 before, switched to master to see if that changes anything. The compiler is currently 2019r2. I've also tried the other experimental one.

I've already tried increasing the WDT timeout value to 800ms. Same problem. The project in question uses only the ethernet peripheral.

I've found another bug in the master branch, related to the ethernet driver (not present in the stable SDKs), will report that elsewhere.
#4406

I was actually thinking the problem may be caused by memory corruption, which is causing some other part of the system to malfunction. The WDT timeouts pretty often happen in the task switching code.

I'm also investigating a different problem, which may or may not be related. Will post details as soon as I have them.

@szmodz
Copy link
Author

szmodz commented Nov 27, 2019

Here comes the latest one:

W (172) boot: PRO CPU has been reset by WDT.
W (172) boot: WDT reset info: PRO CPU PC=0x4008f990
0x4008f990: xthal_window_spill_nw at /Users/igrokhotkov/e/esp32/hal/hal/windowspill_asm.S:175

E (173) boot: WDT reset info: &s CPU not support!

Stack smashing protection, stack overflow checking using breakpoints, etc. is enabled.

Hardware triggers window overflow and underflow exceptions as necessary when registers outside the current window need to be spilled to preallocated space in the stack frame, or restored. Complete details are in the Xtensa manuals.

@projectgus
Copy link
Contributor

projectgus commented Nov 28, 2019

Hi @szmodz ,

Register window spill exceptions are a normal occurrence, but obviously it shouldn't be hanging or triggering an interrupt WDT during the window spill. Without a backtrace (which unfortunately we can't get after a reset) it's not really possible to know the context in which this happened or how it got stuck for long enough to trigger the INT WDT timeout.

I saw in the linked issue that you're using ethernet and have all allocations defaulting to PSRAM which is causing some crashes. This may be a symptom of the same problem, if we're lucky, depending on the sequence of other interrupts and CPU exceptions.

Assuming you use the ESP32 internal Ethernet MAC, if you replace this line in esp_eth_mac_esp32.c :

 MAC_CHECK(esp_intr_alloc(ETS_ETH_MAC_INTR_SOURCE, ESP_INTR_FLAG_IRAM, emac_esp32_isr_handler,

with

 MAC_CHECK(esp_intr_alloc(ETS_ETH_MAC_INTR_SOURCE, 0, emac_esp32_isr_handler,

(ie replace ESP_INTR_FLAG_IRAM with 0)

Does anything change?

If not this, it could really be anything in your app's firmware so it's impossible to tell without more information.

You don't have the CONFIG_SPIRAM_ALLOW_STACK_EXTERNAL_MEMORY flag enabled, do you?

@szmodz
Copy link
Author

szmodz commented Nov 28, 2019

SPIRAM_ALLOW_STACK_EXTERNAL_MEMORY is disabled, and so is SPIRAM_ALLOW_BSS_SEG_EXTERNAL_MEMORY.

I've patched the ethernet driver to use heap_caps_calloc instead of plain calloc, and the Ethernet related crashes ("Cache disabled...") are gone. Shouldn't that be enough? Is removing ESP_INTR_FLAG_IRAM still necessary?

Are there any other PSRAM usage situations that can cause corruption (or WDT timeouts), but no "Cache disabled but cached memory region accessed" exception?

The firmware is unfortunately pretty big (multiple simultaneous HTTPS connections, xml and json parsers, and the Lua interpreter on top of that). But, I think I've managed to narrow the source down to a manageable portion of the code. Problem is, that part works fine if it's used alone. It only starts causing problems when there's other activities in parallel.

Are there any known issues related to nonblocking sockets (see #4407)? I've hacked around the C library problem, and it seems to work, but perhaps there are other issues with nonblocking sockets and that's why fcntl was disabled in the first place?

The code which seems to be the cause of the problems relies on nonblocking sockets. I can't see anything else that's suspicious. The socket in question is not used by multiple threads at once, but there is heavy network traffic in other threads (which all seem to work fine when this particular thread is disabled).

@projectgus
Copy link
Contributor

I've patched the ethernet driver to use heap_caps_calloc instead of plain calloc, and the Ethernet related crashes ("Cache disabled...") are gone. Shouldn't that be enough? Is removing ESP_INTR_FLAG_IRAM still necessary?

Shouldn't be necessary, but it may be worth trying this change anyhow in case the WDT reset has a similar root cause but not 100% the same.

The code which seems to be the cause of the problems relies on nonblocking sockets. I can't see anything else that's suspicious.

Unsure, but I wouldn't expect LWIP to ever cause a fully silent INT WDT reset as it doesn't do anything in the interrupt context. You should at least get a panic output.

(I'm assuming your project uses the default panic handler settings, and hasn't set ESP32_PANIC_SILENT_REBOOT or anything.)

@szmodz
Copy link
Author

szmodz commented Dec 2, 2019

I can also reproduce this when using WiFi instead of Ethernet (using DevKitC, not custom hardware).

Most of the time, there's no panic. Recently I saw an "Unhandled kernel exception", see below.

If i disable INT WDT entirely there's no reboot, no crash. It just sits there spinning.

Guru Meditation Error: Core 0 panic'ed (Unhandled kernel exception)
Core 0 register dump:
PC : 0x4009291f PS : 0x0000001a A0 : 0x3ffb233c A1 : 0x3fff4dd0
0x4009291f: _frxt_dispatch at /Users/szmodz/xxx/esp-idf/components/freertos/portasm.S:440

A2 : 0x00000000 A3 : 0x0000000a A4 : 0x400928ff A5 : 0x3ffbf430
0x400928ff: _frxt_dispatch at /Users/szmodz/xxx/esp-idf/components/freertos/portasm.S:407

A6 : 0x3ffb2380 A7 : 0x3fff5334 A8 : 0x00000000 A9 : 0x00000000
A10 : 0x3ffb244c A11 : 0x00000008 A12 : 0x80090715 A13 : 0x3ffbf420
A14 : 0x00000003 A15 : 0x00060023 SAR : 0x0000001b EXCCAUSE: 0x00000003
EXCVADDR: 0x00000000 LBEG : 0x4008cd18 LEND : 0x4008cd26 LCOUNT : 0x00000001
0x4008cd18: esp_set_watchpoint at /Users/szmodz/xxx/esp-idf/components/esp32/panic.c:682

0x4008cd26: esp_set_watchpoint at /Users/szmodz/xxx/esp-idf/components/esp32/panic.c:674

Backtrace: 0x4009291c:0x3fff4dd0 0x3ffb2339:0x3fff4df0 |<-CORRUPTED
0x4009291c: _frxt_dispatch at /Users/szmodz/xxx/esp-idf/components/freertos/portasm.S:435

@szmodz
Copy link
Author

szmodz commented Dec 2, 2019

DISPATCH THE HIGH READY TASK
void _frxt_dispatch(void)

line 440 is a retw.

/* Solicited stack frame. Restore minimal context and return from vPortYield(). */

That would indicate some sort of TCB or stack corruption I think.

@szmodz
Copy link
Author

szmodz commented Dec 2, 2019

I also noticed that the memory allocator is sometimes called inside critical sections. That too can lead to int wdt timeouts I think.

@projectgus
Copy link
Contributor

That would indicate some sort of TCB or stack corruption I think.

Agree this looks like stack corruption. I'm afraid unless you have a minimal case we can use to debug this then it's not really something we can look into from our side, memory corruption could come from almost anywhere in the firmware.

@szmodz
Copy link
Author

szmodz commented Dec 5, 2019

actually, the code which seems to cause problems is basically this:
https://github.com/espressif/esp-aws-iot

Simply caling aws_iot_mqtt_yield seems to trigger the problem (and then there's mbedtls and lwip).

Yes, I know that

/* AWS IoT SDK isn't very RTOS friendly because it polls for "done
   timers" a lot without ever sleeping on them. So we hack in some
   amount of sleeping here: if it seems like the caller is polling
   an unexpired timer in a tight loop then we delay a tick to let
   things progress elsewhere.
*/

@szmodz
Copy link
Author

szmodz commented Dec 11, 2019

also relates to:
#1070

@szmodz
Copy link
Author

szmodz commented Dec 19, 2019

I rewrote the suspicious code using a different approach, and now it seems stable, but I still have no idea why the original code caused problems.

@projectgus
Copy link
Contributor

projectgus commented Dec 19, 2019

@szmodz Good to hear. Was this some code in the AWS IoT SDK component, or in your own project?

@szmodz
Copy link
Author

szmodz commented Dec 20, 2019

I replaced the AWS SDK with a different library. The project is using AWS, but doesn't use any AWS-specific features, so almost any MQTT implementation supporting TLS will work in this case.

@szmodz
Copy link
Author

szmodz commented Dec 23, 2019

might be related:
#4541

@projectgus
Copy link
Contributor

Closing as it looks like this was resolved, and unclear if there's any remaining issue in ESP-IDF.

The fix for the misleading error log line following a WDT reset in unicore mode is still pending, but this issue will be updated when it merges.

@szmodz
Copy link
Author

szmodz commented May 1, 2020

related: #5227 (comment)

But I'm working on part II.

@szmodz
Copy link
Author

szmodz commented May 1, 2020

@projectgus this is the original cause
#5229

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants