Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

esp_netif_list_lock/esp_netif_list_unlock freeing mutex creates problems (IDFGH-11088) #12261

Closed
3 tasks done
arnoutdekimo opened this issue Sep 18, 2023 · 5 comments
Closed
3 tasks done
Assignees
Labels
Resolution: NA Issue resolution is unavailable Status: Done Issue is done internally Type: Bug bugs in IDF

Comments

@arnoutdekimo
Copy link

arnoutdekimo commented Sep 18, 2023

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v4.4.4 / latest master

Operating System used.

Windows

How did you build your project?

Command line with CMake

If you are using Windows, please specify command line type.

None

Development Kit.

ESP32-wrover-kit

Power Supply used.

USB

What is the expected behavior?

No race conditions leading to crashes

What is the actual behavior?

Race conditions leading to crashes:

The issue is that the mutex protecting the esp_netif_list isn't thread-safe.

Two reasons:

  1. In theory two threads could try and create an interface at the same time, each reaching esp_netif_list_lock at the same time. The first thread could context-switch just after having detected that the current mutex is null, and planning to allocate a new one. The other could then allocate one as well, and lock it, before context switching back. When the first one is brought back up, it will overwrite the first one, causing corruption.
    The chances of this happening are small though, but it is still bad design.
    The correct way of doing this is initing the relevant mutexes once in some init function.

  2. The fact that the mutex is again optionally destroyed on esp_netif_list_unlock is really bad. (See here )
    For instance, if we call esp_netif_get_handle_from_ifkey (e.g. by mdns_init) at the same time and/or just before an interface is created, it can lead to the mutex being allocated, freed, but also simultaenously used.

This has led to constant crashes of spinlocks asserts that can be found in the debug logs.

Note that this guy https://www.esp32.com/viewtopic.php?t=20233 has clearly seen this issue before in 2021 already.

A correct design would have an init function, deinit function, and not alloc/dealloc mutexes at runtime..

Steps to reproduce.

Problem 1) Create two threads, that simultaneously try to create interfaces. (Very hard to reproduce though)

Problem 2) Also a tricky race condition, but was able to reproduce this in a constant crashloop:

  • Call mdns_init in one thread
  • In other thread, simultaneously create the FIRST interface

By slightly varying the delays between the thread execution sequences, the spinlock crash can be triggered. (But as this is a race condition, this is tricky. Anyway, just from reading the code the issue can be clearly spotted)

Debug Logs.

`
assert failed: spinlock_acquire spinlock.h:123 ((result == SPINLOCK_FREE) == (lock->count == 0))


Backtrace: 0x4008231d:0x3ffd73d0 0x40095185:0x3ffd73f0 0x4009c0ed:0x3ffd7410 0x40099441:0x3ffd7530 0x400963f5:0x3ffd7570 0x40142f55:0x3ffd75b0 0x4014310d:0x3ffd75d0 0x40143c61:0x3ffd75f0 0x401460db:0x3ffd7620 0x400f84a2:0x3ffd7650 0x40099211:0x3ffd7820
0x4008231d: panic_abort at C:/ESP32/esp-idf-eclipse-444/esp-idf/esp-idf-v4.4.4/components/esp_system/panic.c:408

0x40095185: esp_system_abort at C:/ESP32/esp-idf-eclipse-444/esp-idf/esp-idf-v4.4.4/components/esp_system/esp_system.c:137

0x4009c0ed: __assert_func at C:/ESP32/esp-idf-eclipse-444/esp-idf/esp-idf-v4.4.4/components/newlib/assert.c:85

0x40099441: spinlock_acquire at C:/ESP32/esp-idf-eclipse-444/esp-idf/esp-idf-v4.4.4/components/esp_hw_support/include/soc/spinlock.h:123
 (inlined by) xPortEnterCriticalTimeout at C:/ESP32/esp-idf-eclipse-444/esp-idf/esp-idf-v4.4.4/components/freertos/port/xtensa/port.c:301

0x400963f5: vPortEnterCritical at C:/ESP32/esp-idf-eclipse-444/esp-idf/esp-idf-v4.4.4/components/freertos/port/xtensa/include/freertos/portmacro.h:578
 (inlined by) xQueueSemaphoreTake at C:/ESP32/esp-idf-eclipse-444/esp-idf/esp-idf-v4.4.4/components/freertos/queue.c:1563

0x40142f55: esp_netif_list_lock at C:/ESP32/esp-idf-eclipse-444/esp-idf/esp-idf-v4.4.4/components/esp_netif/esp_netif_objects.c:51

0x4014310d: esp_netif_get_handle_from_ifkey at C:/ESP32/esp-idf-eclipse-444/esp-idf/esp-idf-v4.4.4/components/esp_netif/esp_netif_objects.c:175

0x40143c61: esp_netif_new at C:/ESP32/esp-idf-eclipse-444/esp-idf/esp-idf-v4.4.4/components/esp_netif/lwip/esp_netif_lwip.c:462
`

More Information.

No response

@arnoutdekimo arnoutdekimo added the Type: Bug bugs in IDF label Sep 18, 2023
@espressif-bot espressif-bot added the Status: Opened Issue is new label Sep 18, 2023
@github-actions github-actions bot changed the title esp_netif_list_lock/esp_netif_list_unlock freeing mutex creates problems esp_netif_list_lock/esp_netif_list_unlock freeing mutex creates problems (IDFGH-11088) Sep 18, 2023
@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Opened Issue is new labels Sep 22, 2023
@espressif-bot espressif-bot added Status: Done Issue is done internally Resolution: NA Issue resolution is unavailable and removed Status: In Progress Work is in progress labels Oct 17, 2023
@AxelLin
Copy link
Contributor

AxelLin commented Oct 26, 2023

Hi @david-cermak

Is it possible to change the esp_netif_next() implementation to use esp_netif_tcpip_exec()
so it won't impact with existing users?
Otherwise you are actually asking exist users to repeat similar implementation in various places.

@david-cermak
Copy link
Collaborator

Hi @AxelLin

I'm afraid that's not possible. The problem is in the API design itself (similar to the macro NETIF_FOREACH() in lwip), which could not assure thread safety between iterations.
That's why I marked the esp_netif_next() deprecated, so the existing users would get notified and should think about their usecases to do one of the following:

  1. just rename esp_netif_next() -> esp_netif_next_unsafe() if all interfaces are strictly under their control and are created once.
  2. rework the iteration, for example using esp_netif_find_if() for searching interfaces in a safe way, if the application interfaces could be created/destroyed runtime from multiple tasks

@AxelLin
Copy link
Contributor

AxelLin commented Nov 6, 2023

Then I need to wait until the fix is available in stable branches, and update application code accordingly.

@AxelLin
Copy link
Contributor

AxelLin commented Nov 30, 2023

@david-cermak

This issue is already marked as closed, but currently this fix is only available in v5.2+ , I don't find the fix in any release branch.
Would you fix this issue for release branches?

@david-cermak
Copy link
Collaborator

Yes, will fix on all release branches (down to v4.3)

movsb pushed a commit to movsb/esp-idf that referenced this issue Dec 1, 2023
This commit removes the lock from the list manipulation code in esp_netif_objects.c,
 because we already have another lock/task context for lwip.
So the list manipulation is unsafe and safety must be assured by the stack layer
(in esp_netif_lwip).
Problems with current locking:
* implementation of locking was wrong -- lazy init style of creating the mutex is not
  thread safe (and destroying it if we have no interface makes the problem exhibit very frequently)
* locking only the list won't solve issues when assessing interfaces atomically
* maintaining multiple locks is problematic, as we often switch between
lwip context and user context in internal implementation of esp_netif_lwip

Closes espressif#12261
espressif-bot pushed a commit that referenced this issue Dec 8, 2023
This commit removes the lock from the list manipulation code in esp_netif_objects.c,
 because we already have another lock/task context for lwip.
So the list manipulation is unsafe and safety must be assured by the stack layer
(in esp_netif_lwip).
Problems with current locking:
* implementation of locking was wrong -- lazy init style of creating the mutex is not
  thread safe (and destroying it if we have no interface makes the problem exhibit very frequently)
* locking only the list won't solve issues when assessing interfaces atomically
* maintaining multiple locks is problematic, as we often switch between
lwip context and user context in internal implementation of esp_netif_lwip

Closes #12261
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Resolution: NA Issue resolution is unavailable Status: Done Issue is done internally Type: Bug bugs in IDF
Projects
None yet
Development

No branches or pull requests

4 participants