Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

esp_mqtt_task hit panic_abort (IDFGH-3858) #5762

Closed
AxelLin opened this issue Aug 20, 2020 · 17 comments
Closed

esp_mqtt_task hit panic_abort (IDFGH-3858) #5762

AxelLin opened this issue Aug 20, 2020 · 17 comments

Comments

@AxelLin
Copy link
Contributor

AxelLin commented Aug 20, 2020

Environment

  • Module or chip used: ESP32-WROOM-32D
  • IDF version: v4.3-dev-907-g6c17e3a64c02
  • Build System: idf.py
  • Compiler version: xtensa-esp32-elf-gcc (crosstool-NG esp-2020r2) 8.2.0
  • Operating System: Linux
  • Power Supply: USB

Problem Description

My application hit below issue with recent esp-idf update.

abort() was called at PC 0x40082c3c on core 1
0x40082c3c: lock_acquire_generic at /home/axel/esp/esp-idf/components/newlib/locks.c:142

Backtrace:0x40087d3e:0x3ffdcd40 0x400883a9:0x3ffdcd60 0x4008f04a:0x3ffdcd80 0x40082c3c:0x3ffdcdf0 0x40082d51:0x3ffdce30 0x401806b1:0x3ffdce50 0x4017892d:0x3ffdd110 0x40178875:0x3ffdd160 0x400d9827:0x3ffdd190 0x400d92d7:0x3ffdd1b0 0x40082e4b:0x3ffdd1d0 0x40082ed3:0x3ffdd1f0 0x401800a1:0x3ffdd210 0x400e5ade:0x3ffdd230 0x400e3ce4:0x3ffdd260 0x400e44e4:0x3ffdd290
0x40087d3e: panic_abort at /home/axel/esp/esp-idf/components/esp_system/panic.c:360

0x400883a9: esp_system_abort at /home/axel/esp/esp-idf/components/esp_system/system_api.c:104

0x4008f04a: abort at /home/axel/esp/esp-idf/components/newlib/abort.c:46

0x40082c3c: lock_acquire_generic at /home/axel/esp/esp-idf/components/newlib/locks.c:142

0x40082d51: _lock_acquire_recursive at /home/axel/esp/esp-idf/components/newlib/locks.c:170

0x401806b1: _vfiprintf_r at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/newlib/newlib/libc/stdio/vfprintf.c:853 (discriminator 2)

0x4017892d: fiprintf at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/newlib/newlib/libc/stdio/fiprintf.c:48

0x40178875: __assert_func at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/newlib/newlib/libc/stdlib/assert.c:58 (discriminator 8)

0x400d9827: spinlock_acquire at /home/axel/esp/esp-idf/components/soc/include/soc/spinlock.h:117
(inlined by) esp_time_impl_get_boot_time at /home/axel/esp/esp-idf/components/newlib/port/esp_time_impl.c:125

0x400d92d7: adjust_boot_time at /home/axel/esp/esp-idf/components/newlib/time.c:63

0x40082e4b: get_adjusted_boot_time at /home/axel/esp/esp-idf/components/newlib/time.c:104
(inlined by) _gettimeofday_r at /home/axel/esp/esp-idf/components/newlib/time.c:178

0x40082ed3: _gettimeofday_r at /home/axel/esp/esp-idf/components/newlib/time.c:177

0x401800a1: gettimeofday at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/newlib/newlib/libc/syscalls/sysgettod.c:11

0x400e5ade: platform_tick_get_ms at /home/axel/esp/esp-idf/components/mqtt/esp-mqtt/lib/platform_esp32_idf.c:30

0x400e3ce4: mqtt_write_data at /home/axel/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:796

0x400e44e4: mqtt_process_receive at /home/axel/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1159
(inlined by) esp_mqtt_task at /home/axel/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1273

Expected Behavior

Should not hit panic

Actual Behavior

It should work, it was fine when build with v4.3-dev-771-gc77c4ccf6c43.

Steps to reproduce

I cannot provide my application code, but from the backtrace it looks like the problem is not in my application.
My application use mqtt_client and publish data to broker every minute.

@github-actions github-actions bot changed the title esp_mqtt_task hit panic_abort esp_mqtt_task hit panic_abort (IDFGH-3858) Aug 20, 2020
@Alvin1Zhang
Copy link
Collaborator

Thanks for reporting, we will look into.

@AxelLin
Copy link
Contributor Author

AxelLin commented Aug 21, 2020

I bisect and found the first bad commit is: a395a00
Please check if you can provide fix ASAP since this seems has system-wide level impact.

[Edit: Just found it failed to revert a395a00 in master v4.3-dev-907-g6c17e3a64c02, my previous test was in bisected branch.]

@ESP-Marius
Copy link
Collaborator

@AxelLin Thanks for reporting this.

My initial suspicion would be memory corruption somewhere in the app.

I haven't been able to reproduce this error though, any chance it would be possible for you to share a minimal example that reproduces this?

How frequent/consistent is this error for you? Is it something that happens every time when you start? Or randomely after running for a while?

@AxelLin
Copy link
Contributor Author

AxelLin commented Aug 21, 2020

@AxelLin Thanks for reporting this.

My initial suspicion would be memory corruption somewhere in the app.

Hi @ESP-Marius

Everytime I hit the issue, the backtrace is the same.
It's related to
0x400e3ce4: mqtt_write_data at /home/axel/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:796
0x400e44e4: mqtt_process_receive at /home/axel/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1159
(inlined by) esp_mqtt_task at /home/axel/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1273

i.e
Calling platform_tick_get_ms(); in mqtt_write_data().

Just curious which part of the backtrace make you think it could be memory corruption somewhere in the app.
PS. My application was working fine before recently esp-idf update.

@AxelLin
Copy link
Contributor Author

AxelLin commented Aug 22, 2020

What could be the reason to trigger this assertion?
https://github.com/espressif/esp-idf/blob/master/components/soc/include/soc/spinlock.h#L117

0x400d9827: spinlock_acquire at /home/axel/esp/esp-idf/components/soc/include/soc/spinlock.h:117
(inlined by) esp_time_impl_get_boot_time at /home/axel/esp/esp-idf/components/newlib/port/esp_time_impl.c:125

I made below change and found the assertion happens when result == core_id (and the result is 0xCDCD, i.e. CORE_ID_REGVAL_PRO).

if (result == core_id)
assert((result == SPINLOCK_FREE) == (lock->count == 0)); /* we're first to lock iff count is zero /
else if (result == SPINLOCK_FREE)
assert((result == SPINLOCK_FREE) == (lock->count == 0)); /
we're first to lock iff count is zero */

@AxelLin
Copy link
Contributor Author

AxelLin commented Aug 22, 2020

@ESP-Marius

This issue does not happen when CONFIG_FREERTOS_UNICORE=y.
Any idea?

@tagunil
Copy link

tagunil commented Aug 23, 2020

@ESP-Marius
I have similar issues with MQTT task and time spinlocks introduced in a395a00. It looks almost like the task has migrated to another core while holding spinlock. Is that possible? Could these spinlocks be reliably used without disabling context switches?

@henrik-the
Copy link

I have a similar crash, if i compile with the newest version of the IDF.
I am trying to run this project. https://github.com/Pro/open62541-esp32

If i compile with IDF Version 4.1 everything is ok.

@ESP-Marius
Copy link
Collaborator

Thanks everyone for reporting this and providing more information!

You guys are correctly, it definitely seems like a395a00 introduced a change that caused this to break.

These spinlocks doesn't seem to be designed to be used directly in the code, but as a part of our criticalsection implemention. We'll make this issue a high priority, and i'll keep it updated.

@ESP-Marius
Copy link
Collaborator

We have in-progress MR internally to fix this. In the meantime you can try to apply the attached diff.
I haven't been able to reproduce the issue locally with this patch applied.

spinlock_assert.txt

@AxelLin
Copy link
Contributor Author

AxelLin commented Aug 24, 2020

Hi @ESP-Marius,

Without your patch, my device usually panic within 10 minutes.
With your patch applied, my device is already running for 12 hours.

So I think the patch fixed the issue.
Thanks for the quick fix.

@ESP-Marius
Copy link
Collaborator

That's good to hear!

I'll update the thread when a fix gets merged into master.

@ESP-Marius
Copy link
Collaborator

A fix has been merged to master internally with commit ID: 6fb996b

It will be available on github when the repo is synced the next time.

@skanico
Copy link

skanico commented Sep 5, 2020

@AxelLin @ESP-Marius I'm running in the same issue. As the repo seems not updated yet, could you explain me how to apply the patch? Thanks

EDIT: solved : git apply spinlock_assert.txt

@davidthings
Copy link

Piling on, while running the latest IDF master, I just hit this bug and spent a day and a half wondering if I was doing something wrong, trying to eliminate local causes. New code so unlike the OP, I couldn't rule out my code. A bummer to think that there was a known fix somewhere in the system not yet bubbled up into master.

@ESP-Marius
Copy link
Collaborator

Yes unfortunately syncing to Github has been a bit slow recently. Please bear with us, there is indeed a fix on the way 😄

@davidthings
Copy link

A fix to the bug? Yep. I got that. A fix to the slow github sync? That would be excellent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants