Crash and hang on coredump save (IDFGH-4710) #6519

Open
vtunr opened this issue Feb 8, 2021 · 6 comments

Comments

vtunr commented Feb 8, 2021

Environment

  • Custom PCB
  • Module or chip used: ESP32-WROVER
  • IDF version : v4.2
  • Build System: CMake
  • Compiler version : xtensa-esp32-elf-gcc (crosstool-NG esp-2020r3) 8.4.0
  • Operating System: Windows
  • (Windows only) environment type: Plain Command Prompt
  • Using an IDE: Yes, eclipse
  • Power Supply: External 3.3V

Problem Description

The ESP can crash and then hang forever.

Expected Behavior

The ESP crashes, but recovers and reboots by itself.

Actual Behavior

The ESP crashes and doesn't recover; it just hangs.

Steps to reproduce

I couldn't reproduce this in a simple project, but when I call restart with too small a stack for the LWIP thread, it crashes before restarting, and half the time it hangs and never recovers until a power cycle.

Here's the log :

I (22487) wifi:state: run -> init (0)
I (22489) wifi:pm stop, total sleep time: 13865608 us / 18122915 us

I (22491) wifi:new:<6,0>, old:<6,0>, ap:<255,255>, sta:<6,0>, prof:1
W (22
***ERROR*** A stack overflow in task tiT has been detected.

Backtrace:0x40090b72:0x3ffe4430 0x40091125:0x3ffe4450 0x40091312:0x3ffe4470 0x4009205d:0x3ffe44f0 0x40091408:0x3ffe4530 0x400913be:0xa5a5a5a5 |<-CORRUPTED


ELF file SHA256: ca747698a1c60b13

I (21263) esp_core_dump_flash: Save core dump to flash...
I (21269) esp_core_dump_elf: Found tasks: 27
I (21275) esp_core_dump_flash: Erase flash 49152 bytes @ 0x210000

Here's the SDK config : sdkconfig_debug.txt

If I debug, I hit the first stack overflow and can't see the actual problem that causes the hang afterwards.
I modified the SDK a bit so that it doesn't create a breakpoint when a crash happens.

It seems that I hit a double exception :

Thread #1 (Suspended : Signal : SIGTRAP:Trace/breakpoint trap)	
	_DoubleExceptionVector() at xtensa_vectors.S:455 0x400803c0	

I'd like to know what to do so it doesn't hang forever in case of a crash.
Let me know if you need more information.
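
Editor's note on the log above: the panic handler prints the backtrace as PC:SP pairs, and the last frame's SP reads 0xa5a5a5a5, which is the byte pattern FreeRTOS uses to pre-fill task stacks. The sketch below is a host-side illustration only (it is not part of ESP-IDF, and `first_fill_pattern_frame` is a hypothetical name) showing how such a line could be scanned for the first frame whose SP is just fill pattern.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* FreeRTOS pre-fills task stacks with 0xa5 bytes, so an SP of 0xa5a5a5a5
 * was read from stack memory the task never wrote. */
#define STACK_FILL_WORD 0xa5a5a5a5u

/* Hypothetical helper: parses "Backtrace:PC:SP PC:SP ..." and returns the
 * index of the first frame whose SP equals the fill word, or -1 if none.
 * The number of well-formed frames is stored in *frame_count. */
int first_fill_pattern_frame(const char *line, int *frame_count)
{
    assert(line != NULL && frame_count != NULL);

    /* Skip the "Backtrace:" prefix if it is present. */
    const char *cursor = strstr(line, "Backtrace:");
    cursor = (cursor != NULL) ? cursor + strlen("Backtrace:") : line;

    int index = 0;
    int first_bad = -1;
    unsigned int pc = 0, sp = 0;
    int consumed = 0;

    /* %x accepts an optional 0x prefix; %n reports how many chars were read.
     * Parsing stops at the first token that is not a PC:SP pair. */
    while (sscanf(cursor, " %x:%x%n", &pc, &sp, &consumed) == 2) {
        if (first_bad < 0 && sp == STACK_FILL_WORD) {
            first_bad = index;
        }
        index++;
        cursor += consumed;
    }

    *frame_count = index;
    return first_bad;
}
```

Fed the backtrace line from this report, the helper should count six frames and flag the last one, consistent with the `|<-CORRUPTED` marker in the log.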

github-actions bot changed the title from "Crash and hang on coredump save" to "Crash and hang on coredump save (IDFGH-4710)" on Feb 8, 2021

Collaborator

gerekon commented Feb 12, 2021

Hi @vtunr
Can you enable coredump verbose logging by inserting

#define LOG_LOCAL_LEVEL ESP_LOG_VERBOSE

before this line?

Author

vtunr commented Feb 15, 2021

Hi @gerekon,

Thanks for your answer.
Please find attached the logs coredump_crash.log

Here's my partition.csv :

# Name,   Type, SubType, Offset,   Size
# Note: if you change the phy_init or app partition offset, make sure to change the offset in Kconfig.projbuild
nvs,      data, nvs,     ,         16K
otadata,  data, ota,     ,         8K
phy_init, data, phy,     ,         4K
factory,  0,    0,       ,         2M
coredump, data, coredump,,         512K
ota_0,    0,    ota_0,   ,         2M
ota_1,    0,    ota_1,   ,         2M
nvs_factory, data, nvs,     ,      16K
sensordata, data, nvs,      ,      1456K
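
Editor's note: the table leaves every Offset blank, so the offsets are assigned automatically. The sketch below works through that arithmetic on the host, under assumptions that are not stated in the thread: the first partition starts at 0x9000 (partition table at the default 0x8000, 0x1000 long), app partitions are aligned to 0x10000, and data partitions to 0x1000 (the data alignment does not matter here because every size is a multiple of 4K). All names in the code are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumptions (ESP-IDF defaults, not confirmed in the thread). */
#define FIRST_PARTITION_OFFSET 0x9000u
#define APP_ALIGNMENT          0x10000u
#define DATA_ALIGNMENT         0x1000u

#define KIB(n) ((uint32_t)(n) * 1024u)
#define MIB(n) ((uint32_t)(n) * 1024u * 1024u)

typedef struct {
    const char *name;
    int         is_app;   /* 1 for type "app" (0 in the CSV), 0 for "data" */
    uint32_t    size;     /* bytes */
    uint32_t    offset;   /* filled in by assign_offsets() */
} partition_t;

/* The table from the comment above, in the same order. */
partition_t issue_table[] = {
    { "nvs",         0, KIB(16),   0 },
    { "otadata",     0, KIB(8),    0 },
    { "phy_init",    0, KIB(4),    0 },
    { "factory",     1, MIB(2),    0 },
    { "coredump",    0, KIB(512),  0 },
    { "ota_0",       1, MIB(2),    0 },
    { "ota_1",       1, MIB(2),    0 },
    { "nvs_factory", 0, KIB(16),   0 },
    { "sensordata",  0, KIB(1456), 0 },
};
#define ISSUE_TABLE_COUNT (sizeof(issue_table) / sizeof(issue_table[0]))

static uint32_t align_up(uint32_t value, uint32_t alignment)
{
    return (value + alignment - 1u) & ~(alignment - 1u);
}

/* Places each partition at the next suitably aligned free offset. */
void assign_offsets(partition_t *parts, size_t count)
{
    uint32_t next = FIRST_PARTITION_OFFSET;
    for (size_t i = 0; i < count; i++) {
        next = align_up(next, parts[i].is_app ? APP_ALIGNMENT : DATA_ALIGNMENT);
        parts[i].offset = next;
        next += parts[i].size;
    }
}

/* Returns the named partition, or NULL if it is not in the table. */
const partition_t *find_partition(const partition_t *parts, size_t count,
                                  const char *name)
{
    assert(parts != NULL && name != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(parts[i].name, name) == 0) {
            return &parts[i];
        }
    }
    return NULL;
}
```

Under these assumptions the coredump partition lands at 0x210000, which matches the "Erase flash 49152 bytes @ 0x210000" line in the first log, and the last partition ends at 0x800000 (8 MB). The 49152 bytes being erased are well within the 512K reserved for the core dump.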

Collaborator

gerekon commented Feb 16, 2021

@vtunr

It seems that I hit a double exception :

Without a debugger, in case of an exception the panic handler should be re-entered and you would see a special message. BTW, can you retrieve the backtrace from the point where you hit the DoubleException?

Please find attached the logs coredump_crash.log

Hmm, looks strange... In any case, if the core dump got stuck at some point, the board should be reset by the RTC watchdog.

I couldn't reproduce on a simple project, but when I call restart, with a too small stack for LWIP thread, i

The coredump code runs on the crashed task's stack and needs some extra stack space. Saving data in ELF format requires more stack (~800 bytes) than the binary format, so one possible option is to switch to the binary coredump format.
What LWIP stack size do you use when the problem happens? Can you add code (somewhere in the panic handler) to print the high water mark (uxTaskGetStackHighWaterMark) for the task before dumping the data to flash?
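
Editor's note on what the high water mark measures: FreeRTOS fills a new task's stack with the byte 0xa5 and later counts how many of those bytes are still untouched at the far end of the stack; that count is what uxTaskGetStackHighWaterMark reports (in bytes in ESP-IDF's port, as far as I know). The following is a host-side simulation of that idea, not the FreeRTOS implementation; the function names are hypothetical and the stack is assumed to grow downward, as it does on Xtensa.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill byte FreeRTOS writes into a freshly created task stack. */
#define STACK_FILL_BYTE 0xa5u

/* Prepare a simulated stack the way the kernel does at task creation. */
void stack_init(uint8_t *stack, size_t size)
{
    memset(stack, STACK_FILL_BYTE, size);
}

/* Simulate the task consuming `used` bytes. The stack grows downward,
 * so usage starts at the highest address and moves toward stack[0]. */
void stack_use(uint8_t *stack, size_t size, size_t used)
{
    assert(used <= size);
    memset(stack + size - used, 0x00, used);
}

/* Count the untouched fill bytes from the low end of the stack.
 * This is the minimum free stack the task has ever had. */
size_t stack_high_water_mark(const uint8_t *stack, size_t size)
{
    size_t untouched = 0;
    while (untouched < size && stack[untouched] == STACK_FILL_BYTE) {
        untouched++;
    }
    return untouched;
}
```

A result near zero means the task has run (almost) out of stack at some point. If the ~800 extra bytes mentioned above for the ELF format do not fit in what is left, the dump code itself would overflow the stack, which is the scenario being asked about here.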

Contributor

KaeLL commented Feb 18, 2021

@gerekon

In any case if core dump was stuck at some point the board should be reset by RTC watchdog.

Without wanting to hijack the thread but doing so anyway, I've had issues with DoubleException and the board not resetting itself at all, so much so that I had to develop a way to kind of reboot the board externally.

Author

vtunr commented Feb 18, 2021

@gerekon @KaeLL
Actually, that is my biggest issue. I can prevent the crash by extending the stack, and even if it happens, I know it should recover. But it doesn't. Now I'm worried it'll crash when deployed and somehow get stuck, so that's what I want to understand.

I'll try to get more info about the double exception and let you know.

Contributor

KaeLL commented Feb 22, 2021

@gerekon Good luck. I gave up on trying to find out what was happening and went for the radical solution.
