Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESP-MQTT publish / disconnect race condition (IDFGH-573) #2975

Closed
ghost opened this issue Jan 21, 2019 · 3 comments
Closed

ESP-MQTT publish / disconnect race condition (IDFGH-573) #2975

ghost opened this issue Jan 21, 2019 · 3 comments

Comments

@ghost
Copy link

ghost commented Jan 21, 2019

Environment

  • Development Kit: [none]
  • Module or chip used: [ESP32]
  • IDF version (run git describe --tags to find it): v3.3-beta1-212-g1ad8b96
  • Build System: [Make]
  • Compiler version (run xtensa-esp32-elf-gcc --version to find it): 1.22.0-80-g6c4433a
  • Operating System: [Linux]
  • Power Supply: [external 5V]

Problem Description

I'm experiencing crashes on wifi disconnect in an application that uses ESP-MQTT. Digging into the implementation of ESP-MQTT and mqtt_client.c I trace these crashes down to a race condition between ESP-MQTT internal task mqtt_task and my application task.

In a simple flow the application runs:

esp_mqtt_client_init
esp_mqtt_client_start

Looking at the sources of mqtt_client.c when esp_mqtt_client_start is called an internal task called mqtt_task is started. The mqtt_task is handling initialization, mqtt connect, reconnect, states, ping etc..
When the mqtt connection is established the mqtt_task is mostly waiting in a mqtt_process_receive call.

I'm struggling to understand how the API is meant to be used without running into race conditions between the application task and the internal mqtt_task.

Simplified I have an application that looks like this:

app_main:

    esp_mqtt_client_init
    esp_mqtt_client_start   

    while (1) {
        if (mqtt_connected) {
            esp_mqtt_client_publish
        }
        wait on some condition
    }

In my scenario the wifi disconnect flow is problematic where my application task and the internal mqtt_task race for:

mqtt_task:
mqtt_process_receive returns an error due to wifi disconnect and calls esp_mqtt_abort_connection which in turn updates the internal state, closes down the transport etc.

my app task:
Checks that mqtt is connected (based on mqtt callback handler status) and calls esp_mqtt_client_publish

Having my application deployed on several devices with coredump function enabled I find lots of crashes with a callstack like:

mbedtls_internal_aes_encrypt at /home/travis/esp-idf/components/mbedtls/mbedtls/library/aes.c:1308
mbedtls_aes_crypt_ecb at /home/travis/esp-idf/components/mbedtls/mbedtls/library/aes.c:1308
aes_crypt_ecb_wrap at /home/travis/esp-idf/components/mbedtls/mbedtls/library/cipher_wrap.c:131
mbedtls_cipher_update at /home/travis/esp-idf/components/mbedtls/mbedtls/library/cipher.c:958
mbedtls_gcm_starts at /home/travis/esp-idf/components/mbedtls/mbedtls/library/gcm.c:324
mbedtls_gcm_crypt_and_tag at /home/travis/esp-idf/components/mbedtls/mbedtls/library/gcm.c:456
mbedtls_cipher_auth_encrypt at /home/travis/esp-idf/components/mbedtls/mbedtls/library/cipher.c:1004
ssl_encrypt_buf at /home/travis/esp-idf/components/mbedtls/mbedtls/library/ssl_tls.c:7765
 (inlined by) mbedtls_ssl_write_record at /home/travis/esp-idf/components/mbedtls/mbedtls/library/ssl_tls.c:3384
ssl_write_real at /home/travis/esp-idf/components/mbedtls/mbedtls/library/ssl_tls.c:7765
 (inlined by) mbedtls_ssl_write at /home/travis/esp-idf/components/mbedtls/mbedtls/library/ssl_tls.c:8644
tls_write at /home/travis/esp-idf/components/esp-tls/esp_tls.c:513
esp_tls_conn_write at /home/travis/esp-idf/components/esp-tls/esp_tls.h:214
 (inlined by) ssl_write at /home/travis/esp-idf/components/tcp_transport/transport_ssl.c:112
esp_transport_write at /home/travis/esp-idf/components/tcp_transport/transport.c:176
mqtt_write_data at /home/travis/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:968 (discriminator 6)
esp_mqtt_client_publish at /home/travis/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1004
app_task at main/app_main.c:263

From what I can tell there's no thread protection at all in mqtt_client.c. Before digging deeply into esp_transport, mbedtls etc to try and find the cause for the crashes I'd like to know how publishing using ESP-MQTT API is meant to be safely done?

All ESP-IDF example code perform the esp_mqtt_client_publish inside mqtt_event_handler meaning in the context of the internal mqtt_task so here there's no race. But that's not really feasible for any application publishing more than one message per connection.

Expected Behavior

Publishing from my application task (the one calling esp_mqtt_client_start) is safe.

Actual Behavior

Publishing from my application task (the one calling esp_mqtt_client_start) is racing with internal mqtt_task and sometimes causing crashes on wifi disconnect.

Steps to reproduce

I've not been able to create a simple example application to trigger the crash but looking at the source code I either misunderstood the API completely or there's a design flaw making it unsafe/unusable.

@Alvin1Zhang Alvin1Zhang changed the title ESP-MQTT publish / disconnect race condition [TW#28480] ESP-MQTT publish / disconnect race condition Jan 22, 2019
@david-cermak
Copy link
Collaborator

Hi @mikaelkanstrup

Current version of mqtt library does not yet support publishing from a different thread. This should be fixed in next few weeks.

@ghost
Copy link
Author

ghost commented Jan 22, 2019

Current version of mqtt library does not yet support publishing from a different thread. This should be fixed in next few weeks.

@david-cermak Oh ok. Thanks a lot for clarifying! Do you want me to close this issue as this "works as designed"?

I did a naive patch protecting just the abort_connection + publish case with a semaphore. I think that'll do for me until official support is there.

@david-cermak
Copy link
Collaborator

Please keep this open to help tracking the issue. It would be closed once a fix is merged.

Yes, adding a mutex to protect the api should be enough to workaround the issue for now.

david-cermak added a commit to espressif/esp-mqtt that referenced this issue Feb 15, 2019
@projectgus projectgus changed the title [TW#28480] ESP-MQTT publish / disconnect race condition ESP-MQTT publish / disconnect race condition (IDFGH-573) Mar 12, 2019
@igrr igrr closed this as completed in 1465f53 Mar 27, 2019
david-cermak added a commit to espressif/esp-mqtt that referenced this issue Dec 16, 2022
…ng/receiving different data and references esp-mqtt commits to pass these tests

testing conditions:
transports (tcp, ssl, ws..)
qos (0, 1, 2)
short repeated messages (packed packets)
oversized messages (fragmented packets)
publish from a different thread

Closes espressif/esp-idf#2870 by means of including commit 815623d from esp-mqtt
Closes espressif/esp-idf#2975 by means of including commit 752953d from esp-mqtt
Closes espressif/esp-idf#2850 by means of including commits df455d2 17fd713 from esp-mqtt
egnor pushed a commit to egnor/esp-mqtt that referenced this issue Dec 23, 2022
…ng/receiving different data and references esp-mqtt commits to pass these tests

testing conditions:
transports (tcp, ssl, ws..)
qos (0, 1, 2)
short repeated messages (packed packets)
oversized messages (fragmented packets)
publish from a different thread

Closes espressif/esp-idf#2870 by means of including commit 815623d from esp-mqtt
Closes espressif/esp-idf#2975 by means of including commit 752953d from esp-mqtt
Closes espressif/esp-idf#2850 by means of including commits df455d2 17fd713 from esp-mqtt
egnor pushed a commit to egnor/esp-mqtt that referenced this issue Dec 23, 2022
…ng/receiving different data and references esp-mqtt commits to pass these tests

testing conditions:
transports (tcp, ssl, ws..)
qos (0, 1, 2)
short repeated messages (packed packets)
oversized messages (fragmented packets)
publish from a different thread

Closes espressif/esp-idf#2870 by means of including commit 815623d from esp-mqtt
Closes espressif/esp-idf#2975 by means of including commit 752953d from esp-mqtt
Closes espressif/esp-idf#2850 by means of including commits df455d2 17fd713 from esp-mqtt
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant