Random infrequent TG1WDT_SYS_RESET when using single core & PSRAM (IDFGH-2240) #4388

szmodz · 2019-11-22T00:18:38Z

https://github.com/espressif/esp-idf/blob/master/components/bootloader_support/src/bootloader_init.c#L461

#if !CONFIG_FREERTOS_UNICORE
…
#else
ESP_LOGE(TAG, "WDT reset info: &s CPU not support!\n", cpu_name);
return;
#endif

What? Why? ;-)

So basically, the configuration which works is not supported?

#2892

szmodz · 2019-11-22T01:10:57Z

Let me rephrase that: what is the officially supported ESP32 environment using PSRAM? Meaning, guaranteed to be stable?

negativekelvin · 2019-11-22T01:25:23Z

What does that function have to do with psram?

szmodz · 2019-11-22T01:33:09Z

@negativekelvin CONFIG_FREERTOS_UNICORE is required when working with psram.
see #2892

The message indicates this configuration is not supported, or equally, ESP32 + PSRAM is not supported.

negativekelvin · 2019-11-22T03:24:07Z

All it says is you can't call wdt_reset_info_dump(1) in unicore mode

szmodz · 2019-11-22T03:39:46Z

this got triggered due to a crash restart in unicore mode

rst:0x8 (TG1WDT_SYS_RESET),boot:0x1a (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0000,len:5416
load:0x3fff1528,len:4
load:0x40078000,len:15308
load:0x40080400,len:4200
entry 0x40080638
W (172) boot: PRO CPU has been reset by WDT.
W (172) boot: WDT reset info: PRO CPU PC=0x40092997

E (173) boot: WDT reset info: &s CPU not support!

It was a spurious WDT reset. Can't share any more details I'm afraid, and it's not reproducible.

szmodz · 2019-11-22T03:43:42Z

Point is, I didn't call wdt_reset_info_dump explicitly. The ESP32 crashed due to an unknown reason and the only info I got was "CPU not support!"

negativekelvin · 2019-11-22T03:46:08Z

Ok it's a minor bug that it is called in unicore mode but it shouldn't have any affect other than printing an error

esp-idf/components/bootloader_support/src/bootloader_init.c

Lines 533 to 535 in 93a8603

    
           #if CONFIG_IDF_TARGET_ESP32 
        
                   wdt_reset_info_dump(1); 
        
           #endif

szmodz · 2019-11-22T03:47:57Z

The error was due to a legitimate interrupt WDT reset, for reasons unknown.

szmodz · 2019-11-22T04:04:03Z

I don't have any custom interrupts. This is 100% non reproducible, only happens once a day or so, under heavy load. Any tips?

negativekelvin · 2019-11-22T05:41:48Z

The only info there is PC=0x40092997 so look it up using addr2line

szmodz · 2019-11-22T11:43:59Z

I've tried that. It's mostly random. Sometimes it gets triggered in the task switching code.

No idea how to track this down, and reproduction takes hours.

projectgus · 2019-11-25T01:10:04Z

Hi @szmodz ,

As @negativekelvin says, the "WDT reset info: &s CPU not support!" log line can be ignored here. We'll fix this in the bootloader. Sorry for the misleading information.

A reset caused by TG1WDT_SYS_RESET is usually caused by the interrupt WDT triggering (meaning an interrupt handler ran for too long, or interrupts were disabled by a critical section for too long).

Some questions which will help any additional debugging:

What ESP-IDF version do you have? (can run git describe --tags --dirty in the IDF_PATH dir to find this out)
Is there any relevant log output from the ESP32 before it resets?
What value for CONFIG_ESP_INT_WDT_TIMEOUT_MS do you have in your project sdkconfig file?

The fix may be as simple as increasing the value of the timeout to allow for delays when accessing SPIRAM.

Otherwise, it probably indicates some kind of timing issue with interrupts or critical sections. These are unfortunately hard to debug. Decoding the PC reset addressses with addr2line and looking for patterns may give some clues.

szmodz · 2019-11-26T20:53:51Z

@projectgus I'm currently running latest master, but it's not "clean" (based on 93a8603). I've used v3.2 before, switched to master to see if that changes anything. The compiler is currently 2019r2. I've also tried the other experimental one.

I've already tried increasing the WDT timeout value to 800ms. Same problem. The project in question uses only the ethernet peripheral.

I've found another bug in the master branch, related to the ethernet driver (not present in the stable SDKs), will report that elsewhere.
#4406

I was actually thinking the problem may be caused by memory corruption, which is causing some other part of the system to malfunction. The WDT timeouts pretty often happen in the task switching code.

I'm also investigating a different problem, which may or may not be related. Will post details as soon as I have them.

szmodz · 2019-11-27T00:16:22Z

Here comes the latest one:

W (172) boot: PRO CPU has been reset by WDT.
W (172) boot: WDT reset info: PRO CPU PC=0x4008f990
0x4008f990: xthal_window_spill_nw at /Users/igrokhotkov/e/esp32/hal/hal/windowspill_asm.S:175

E (173) boot: WDT reset info: &s CPU not support!

Stack smashing protection, stack overflow checking using breakpoints, etc. is enabled.

Hardware triggers window overflow and underflow exceptions as necessary when registers outside the current window need to be spilled to preallocated space in the stack frame, or restored. Complete details are in the Xtensa manuals.

projectgus · 2019-11-28T03:05:54Z

Hi @szmodz ,

Register window spill exceptions are a normal occurrence, but obviously it shouldn't be hanging or triggering an interrupt WDT during the window spill. Without a backtrace (which unfortunately we can't get after a reset) it's not really possible to know the context in which this happened or how it got stuck for long enough to trigger the INT WDT timeout.

I saw in the linked issue that you're using ethernet and have all allocations defaulting to PSRAM which is causing some crashes. This may be a symptom of the same problem, if we're lucky, depending on the sequence of other interrupts and CPU exceptions.

Assuming you use the ESP32 internal Ethernet MAC, if you replace this line in esp_eth_mac_esp32.c :

 MAC_CHECK(esp_intr_alloc(ETS_ETH_MAC_INTR_SOURCE, ESP_INTR_FLAG_IRAM, emac_esp32_isr_handler,

with

 MAC_CHECK(esp_intr_alloc(ETS_ETH_MAC_INTR_SOURCE, 0, emac_esp32_isr_handler,

(ie replace ESP_INTR_FLAG_IRAM with 0)

Does anything change?

If not this, it could really be anything in your app's firmware so it's impossible to tell without more information.

You don't have the CONFIG_SPIRAM_ALLOW_STACK_EXTERNAL_MEMORY flag enabled, do you?

szmodz · 2019-11-28T10:40:42Z

SPIRAM_ALLOW_STACK_EXTERNAL_MEMORY is disabled, and so is SPIRAM_ALLOW_BSS_SEG_EXTERNAL_MEMORY.

I've patched the ethernet driver to use heap_caps_calloc instead of plain calloc, and the Ethernet related crashes ("Cache disabled...") are gone. Shouldn't that be enough? Is removing ESP_INTR_FLAG_IRAM still necessary?

Are there any other PSRAM usage situations that can cause corruption (or WDT timeouts), but no "Cache disabled but cached memory region accessed" exception?

The firmware is unfortunately pretty big (multiple simultaneous HTTPS connections, xml and json parsers, and the Lua interpreter on top of that). But, I think I've managed to narrow the source down to a manageable portion of the code. Problem is, that part works fine if it's used alone. It only starts causing problems when there's other activities in parallel.

Are there any known issues related to nonblocking sockets (see #4407)? I've hacked around the C library problem, and it seems to work, but perhaps there are other issues with nonblocking sockets and that's why fcntl was disabled in the first place?

The code which seems to be the cause of the problems relies on nonblocking sockets. I can't see anything else that's suspicious. The socket in question is not used by multiple threads at once, but there is heavy network traffic in other threads (which all seem to work fine when this particular thread is disabled).

projectgus · 2019-12-02T03:26:37Z

I've patched the ethernet driver to use heap_caps_calloc instead of plain calloc, and the Ethernet related crashes ("Cache disabled...") are gone. Shouldn't that be enough? Is removing ESP_INTR_FLAG_IRAM still necessary?

Shouldn't be necessary, but it may be worth trying this change anyhow in case the WDT reset has a similar root cause but not 100% the same.

The code which seems to be the cause of the problems relies on nonblocking sockets. I can't see anything else that's suspicious.

Unsure, but I wouldn't expect LWIP to ever cause a fully silent INT WDT reset as it doesn't do anything in the interrupt context. You should at least get a panic output.

(I'm assuming your project uses the default panic handler settings, and hasn't set ESP32_PANIC_SILENT_REBOOT or anything.)

szmodz · 2019-12-02T11:39:42Z

I can also reproduce this when using WiFi instead of Ethernet (using DevKitC, not custom hardware).

Most of the time, there's no panic. Recently I saw an "Unhandled kernel exception", see below.

If i disable INT WDT entirely there's no reboot, no crash. It just sits there spinning.

Guru Meditation Error: Core 0 panic'ed (Unhandled kernel exception)
Core 0 register dump:
PC : 0x4009291f PS : 0x0000001a A0 : 0x3ffb233c A1 : 0x3fff4dd0
0x4009291f: _frxt_dispatch at /Users/szmodz/xxx/esp-idf/components/freertos/portasm.S:440

A2 : 0x00000000 A3 : 0x0000000a A4 : 0x400928ff A5 : 0x3ffbf430
0x400928ff: _frxt_dispatch at /Users/szmodz/xxx/esp-idf/components/freertos/portasm.S:407

A6 : 0x3ffb2380 A7 : 0x3fff5334 A8 : 0x00000000 A9 : 0x00000000
A10 : 0x3ffb244c A11 : 0x00000008 A12 : 0x80090715 A13 : 0x3ffbf420
A14 : 0x00000003 A15 : 0x00060023 SAR : 0x0000001b EXCCAUSE: 0x00000003
EXCVADDR: 0x00000000 LBEG : 0x4008cd18 LEND : 0x4008cd26 LCOUNT : 0x00000001
0x4008cd18: esp_set_watchpoint at /Users/szmodz/xxx/esp-idf/components/esp32/panic.c:682

0x4008cd26: esp_set_watchpoint at /Users/szmodz/xxx/esp-idf/components/esp32/panic.c:674

Backtrace: 0x4009291c:0x3fff4dd0 0x3ffb2339:0x3fff4df0 |<-CORRUPTED
0x4009291c: _frxt_dispatch at /Users/szmodz/xxx/esp-idf/components/freertos/portasm.S:435

szmodz · 2019-12-02T11:53:49Z

DISPATCH THE HIGH READY TASK
void _frxt_dispatch(void)

line 440 is a retw.

/* Solicited stack frame. Restore minimal context and return from vPortYield(). */

That would indicate some sort of TCB or stack corruption I think.

szmodz · 2019-12-02T17:07:55Z

I also noticed that the memory allocator is sometimes called inside critical sections. That too can lead to int wdt timeouts I think.

projectgus · 2019-12-02T22:24:13Z

That would indicate some sort of TCB or stack corruption I think.

Agree this looks like stack corruption. I'm afraid unless you have a minimal case we can use to debug this then it's not really something we can look into from our side, memory corruption could come from almost anywhere in the firmware.

szmodz · 2019-12-05T01:27:52Z

actually, the code which seems to cause problems is basically this:
https://github.com/espressif/esp-aws-iot

Simply caling aws_iot_mqtt_yield seems to trigger the problem (and then there's mbedtls and lwip).

Yes, I know that

/* AWS IoT SDK isn't very RTOS friendly because it polls for "done
   timers" a lot without ever sleeping on them. So we hack in some
   amount of sleeping here: if it seems like the caller is polling
   an unexpired timer in a tight loop then we delay a tick to let
   things progress elsewhere.
*/

szmodz · 2019-12-11T18:02:20Z

also relates to:
#1070

szmodz · 2019-12-19T02:12:28Z

I rewrote the suspicious code using a different approach, and now it seems stable, but I still have no idea why the original code caused problems.

projectgus · 2019-12-19T03:25:44Z

@szmodz Good to hear. Was this some code in the AWS IoT SDK component, or in your own project?

szmodz · 2019-12-20T13:03:04Z

I replaced the AWS SDK with a different library. The project is using AWS, but doesn't use any AWS-specific features, so almost any MQTT implementation supporting TLS will work in this case.

szmodz · 2019-12-23T19:18:43Z

might be related:
#4541

projectgus · 2020-04-30T07:31:27Z

Closing as it looks like this was resolved, and unclear if there's any remaining issue in ESP-IDF.

The fix for the misleading error log line following a WDT reset in unicore mode is still pending, but this issue will be updated when it merges.

szmodz · 2020-05-01T10:07:56Z

related: #5227 (comment)

But I'm working on part II.

szmodz · 2020-05-01T13:43:01Z

@projectgus this is the original cause
#5229

github-actions bot changed the title ~~bootloader problem~~ bootloader problem (IDFGH-2240) Nov 22, 2019

projectgus changed the title ~~bootloader problem (IDFGH-2240)~~ Random infrequent TG1WDT_SYS_RESET when using PSRAM (IDFGH-2240) Nov 25, 2019

projectgus changed the title ~~Random infrequent TG1WDT_SYS_RESET when using PSRAM (IDFGH-2240)~~ Random infrequent TG1WDT_SYS_RESET when using sincle core & PSRAM (IDFGH-2240) Nov 25, 2019

projectgus changed the title ~~Random infrequent TG1WDT_SYS_RESET when using sincle core & PSRAM (IDFGH-2240)~~ Random infrequent TG1WDT_SYS_RESET when using single core & PSRAM (IDFGH-2240) Nov 25, 2019

projectgus closed this as completed Apr 30, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Random infrequent TG1WDT_SYS_RESET when using single core & PSRAM (IDFGH-2240) #4388

Random infrequent TG1WDT_SYS_RESET when using single core & PSRAM (IDFGH-2240) #4388

szmodz commented Nov 22, 2019 •

edited

Loading

szmodz commented Nov 22, 2019 •

edited

Loading

negativekelvin commented Nov 22, 2019

szmodz commented Nov 22, 2019 •

edited

Loading

negativekelvin commented Nov 22, 2019

szmodz commented Nov 22, 2019 •

edited

Loading

szmodz commented Nov 22, 2019 •

edited

Loading

negativekelvin commented Nov 22, 2019 •

edited

Loading

szmodz commented Nov 22, 2019 •

edited

Loading

szmodz commented Nov 22, 2019

negativekelvin commented Nov 22, 2019

szmodz commented Nov 22, 2019

projectgus commented Nov 25, 2019

szmodz commented Nov 26, 2019 •

edited

Loading

szmodz commented Nov 27, 2019 •

edited

Loading

projectgus commented Nov 28, 2019 •

edited

Loading

szmodz commented Nov 28, 2019 •

edited

Loading

projectgus commented Dec 2, 2019

szmodz commented Dec 2, 2019

szmodz commented Dec 2, 2019

szmodz commented Dec 2, 2019

projectgus commented Dec 2, 2019

szmodz commented Dec 5, 2019 •

edited

Loading

szmodz commented Dec 11, 2019

szmodz commented Dec 19, 2019

projectgus commented Dec 19, 2019 •

edited

Loading

szmodz commented Dec 20, 2019

szmodz commented Dec 23, 2019

projectgus commented Apr 30, 2020

szmodz commented May 1, 2020

szmodz commented May 1, 2020

Random infrequent TG1WDT_SYS_RESET when using single core & PSRAM (IDFGH-2240) #4388

Random infrequent TG1WDT_SYS_RESET when using single core & PSRAM (IDFGH-2240) #4388

Comments

szmodz commented Nov 22, 2019 • edited Loading

szmodz commented Nov 22, 2019 • edited Loading

negativekelvin commented Nov 22, 2019

szmodz commented Nov 22, 2019 • edited Loading

negativekelvin commented Nov 22, 2019

szmodz commented Nov 22, 2019 • edited Loading

szmodz commented Nov 22, 2019 • edited Loading

negativekelvin commented Nov 22, 2019 • edited Loading

szmodz commented Nov 22, 2019 • edited Loading

szmodz commented Nov 22, 2019

negativekelvin commented Nov 22, 2019

szmodz commented Nov 22, 2019

projectgus commented Nov 25, 2019

szmodz commented Nov 26, 2019 • edited Loading

szmodz commented Nov 27, 2019 • edited Loading

projectgus commented Nov 28, 2019 • edited Loading

szmodz commented Nov 28, 2019 • edited Loading

projectgus commented Dec 2, 2019

szmodz commented Dec 2, 2019

szmodz commented Dec 2, 2019

szmodz commented Dec 2, 2019

projectgus commented Dec 2, 2019

szmodz commented Dec 5, 2019 • edited Loading

szmodz commented Dec 11, 2019

szmodz commented Dec 19, 2019

projectgus commented Dec 19, 2019 • edited Loading

szmodz commented Dec 20, 2019

szmodz commented Dec 23, 2019

projectgus commented Apr 30, 2020

szmodz commented May 1, 2020

szmodz commented May 1, 2020

szmodz commented Nov 22, 2019 •

edited

Loading

szmodz commented Nov 22, 2019 •

edited

Loading

szmodz commented Nov 22, 2019 •

edited

Loading

szmodz commented Nov 22, 2019 •

edited

Loading

szmodz commented Nov 22, 2019 •

edited

Loading

negativekelvin commented Nov 22, 2019 •

edited

Loading

szmodz commented Nov 22, 2019 •

edited

Loading

szmodz commented Nov 26, 2019 •

edited

Loading

szmodz commented Nov 27, 2019 •

edited

Loading

projectgus commented Nov 28, 2019 •

edited

Loading

szmodz commented Nov 28, 2019 •

edited

Loading

szmodz commented Dec 5, 2019 •

edited

Loading

projectgus commented Dec 19, 2019 •

edited

Loading