Reconnection doesn't occur after: Error write data or timeout (IDFGH-1651) #126
Comments
Hi @no1seman From the logs it looks like an issue of internal logic of mqtt-task (race condition or a dead-lock) as the call to
|
@david-cermak,
It seems the disconnection occurs between is_mqtt_client_connected() and esp_mqtt_client_publish().
Messages like W (29641) MQTT: PUBLISHING are logged right before esp_mqtt_client_publish(). Question: does mqtt_client send MQTT_EVENT_CONNECTED after reconnection? Another log with that error: Finally got it to work with MQTT_RECONNECT_TIMEOUT_MS = 0 (after a mosquitto service restart). The error is the same; the log is here: |
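A minimal host-side sketch of the race described above, assuming a publish call that returns a non-negative message id on success and a negative value on failure. The names guarded_publish, publish_fn, pub_result_t and the two fake_publish_* functions are hypothetical stand-ins, not part of the mqtt library; the point is only that the connected check cannot be trusted on its own, so the result of the publish call has to be inspected as well.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the real publish call so the sketch builds on a host.
 * Assumed contract: message id (>= 0) on success, negative value on failure. */
typedef int (*publish_fn)(void *ctx, const char *topic, const char *data);

typedef enum { PUB_OK, PUB_SKIPPED_OFFLINE, PUB_FAILED } pub_result_t;

/* Fake transport that always accepts the message (illustration only). */
int fake_publish_ok(void *ctx, const char *topic, const char *data)
{
    (void)ctx; (void)topic; (void)data;
    return 1;
}

/* Fake transport that models a link dropping after the connected check. */
int fake_publish_dropped(void *ctx, const char *topic, const char *data)
{
    (void)ctx; (void)topic; (void)data;
    return -1;
}

pub_result_t guarded_publish(publish_fn publish, void *ctx, bool connected_flag,
                             const char *topic, const char *data)
{
    if (!connected_flag) {
        return PUB_SKIPPED_OFFLINE;   /* do not even try while offline */
    }
    /* The connection may be lost between the check above and this call,
     * so the return value is the only reliable indicator. */
    int msg_id = publish(ctx, topic, data);
    return (msg_id < 0) ? PUB_FAILED : PUB_OK;
}
```

A caller that gets PUB_FAILED can then wait for the next connected event before retrying, instead of assuming the message went out.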
@no1seman Thanks for the detailed info. Seems the mqtt_task got stuck in
Yes it does, example of the log from IDF example updated to disconnect & reconnect while publishing
|
I added vTaskDelay(MQTT_POLL_READ_TIMEOUT_MS / portTICK_PERIOD_MS); to esp_mqtt_task() but it doesn't solve the problem; here is the full log:
In a parallel issue thread (espressif/esp-idf#3851) I'm trying to solve another connection problem with http, and turning on SOCKETS_DEBUG solved that problem too! So it can be argued that the connection problems are most likely in the tcp_transport or lwip component. When I enabled sockets debug I effectively added some delay, and the locks or logic error(s) in tcp_transport or lwip went away.
Here is the full log of the esp32: Most messages are sent more than once. I can solve the problem of duplicated messages by setting QoS 2, but that only helps the receiver subscribed to those messages, because mosquitto will deduplicate them; it doesn't solve the problem of overloading the thin communication channel between the esp32 and the network. How else can I help you find the root problem, since turning on lwip debug logging is not a solution? Thanks in advance |
@no1seman Not only adding a delay but also reducing the poll timeout (sorry wasn't very clear from my previous post) esp_transport_poll_read(client->transport, 0)
Adding delays (enabling logging) might lower the chances of the issue appearing but will not solve it. These are very good pointers though, and I believe that reducing the poll timeout should also help (There is a bug in |
Adding ONLY vTaskDelay(MQTT_POLL_READ_TIMEOUT_MS / portTICK_PERIOD_MS); doesn't help, but adding vTaskDelay(MQTT_POLL_READ_TIMEOUT_MS / portTICK_PERIOD_MS); and turning on socket debug does. OK, I'll wait for the fix in tcp_transport, then I'll run the full tests again. What about the multiple resends of mqtt messages? Is that problem also related to the tcp_transport bug or not? |
Could you please test one more scenario? Please revert all these changes and apply the following patch to
The resending logic is very basic, i.e. it tries to resend everything (which was transmitted correctly) after 1 second, so when adding an additional 1 second delay it is expected to see a lot of resends. Please note that this is not a common use case, just a temporary test modification to find the root cause of the issue. |
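The resend behaviour described above can be illustrated with a small host-side model. This is only a sketch under assumptions: the struct pending_msg_t, the function resend_expired and the constant RESEND_TIMEOUT_MS are hypothetical and do not reproduce the library's outbox code; the 1000 ms value is taken from the "after 1 second" description.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RESEND_TIMEOUT_MS 1000u  /* assumed from the "after 1 second" description */

/* Hypothetical record for a message that still waits for an acknowledgement. */
typedef struct {
    int      msg_id;
    uint32_t sent_at_ms;
    bool     acked;
} pending_msg_t;

/* Re-stamps every unacknowledged message that is at least RESEND_TIMEOUT_MS old
 * (modelling a retransmission) and returns how many were resent. */
size_t resend_expired(pending_msg_t *msgs, size_t n, uint32_t now_ms)
{
    size_t resent = 0;
    for (size_t i = 0; i < n; i++) {
        if (!msgs[i].acked && (now_ms - msgs[i].sent_at_ms) >= RESEND_TIMEOUT_MS) {
            msgs[i].sent_at_ms = now_ms;
            resent++;
        }
    }
    return resent;
}
```

With this model, any extra delay of about one second in the loop makes every message that is still unacknowledged cross the timeout, which matches the expectation of seeing many resends during the test.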
Changes made:
Added instead: got the following negative result:
Full log: |
Thanks very much @no1seman for this test! This however confirms that my initial suspicion was wrong (and most probably the tcp_transport fix won't help either). May I ask you for another go with this patch (just adding more logs) against the same versions as before? |
@david-cermak, It's not easy to connect a debugger because GPIO12-GPIO15 are used for the SD card. Given the same connection problems with http, it seems that the problem is in tcp_transport or LwIP or something else, but not mqtt ( |
Thanks for the logs! To be honest I'm a bit at a loss now, since this log is pointing again to the poll timeout the same way I thought before. Don't you have some high prio thread running in the background (for example waiting for the connection flag) blocking the mqtt task? No idea beside that I am afraid... |
@david-cermak, mqtt_send uses only 2 of them: BEACONS and BLINK, and http & mqtt have the same priority on a single core. Now I set mqtt_task priority == 11: nothing has changed, the error is still present |
@david-cermak, |
@no1seman From the previous post, it looks like a healthy application, but does the picture show task usage under My main concern was how the task distribution looks when the issue appears |
@david-cermak, |
@david-cermak, During one of the tests I got the following result:
Full log: Prerequisites: How to use:
After some time you should get an error. The script http_test.sh is needed to provoke the error by increasing the load, so that you do not have to wait for days for the event that leads to the error. It would be great if you could find the problem or tell me where I'm wrong. |
Thanks very much for this test code @no1seman! I was indeed able to recreate the issue. The root cause seems to be resolved in cbae634. Could you please manually update mqtt lib to the latest master (to include this commit cbae634) |
Thank you for investigating the issue. I already did the update earlier and it had no effect: #126 (comment) item 1. I updated the mqtt lib again to 92aa01d (merge for cbae634): and I got the same error:
Maybe that is the effect of an SPI PSRAM error, but for now the problem still exists. It's very strange that in the case of MQTT_CLIENT: No PING_RESP, disconnected everything is OK:
but in the case of MQTT_CLIENT: Error write data or timeout, written len = 0, errno=0 I got:
|
alright, in that case there must be really something environmental related which I cannot reproduce. may you still try to apply this patch to 92aa01d and share the results (trying to set watchpoints to run state in order to find out who's making the mqtt_loop stop)? |
@david-cermak, So, I applied 92aa01d and got:
Full log is here: PS Tomorrow I'll try to move to release/4.0 with this patch: espressif/esp-idf@b9a5f76, maybe the problem is there!? PSPS Already done: |
@no1seman Thanks very much for all these tests! Yet another attempt to get more info (this time it should break with backtrace information) Please if you could apply this patch again to the 92aa01d and share results. |
@david-cermak, So, I made some runs on v4.0-dev-1443-g39f090a4f esp-idf, updated the mqtt lib to 92aa01d, applied the provided patch with added watchpoints, and got the following: I added the backtrace addresses decoded to source lines at the end of each file. I hope it helps. PS Just for fun I fed the ESP-IDF source code to PVS-Studio. Here are the results, and it seems it has some questions about the mqtt lib: |
Every time I look into the posted logs I'm getting even more confused than before (no exception this time). I can see that it behaves a bit differently than before (this time it tries to retrigger the connection, though after the default 10s; before, it just exited mqtt_loop), which really makes no sense to me, and I cannot think of anything but some ugly PSRAM issue. Is this reproducible on multiple boards, or are you testing with just one? |
It seems that it's my fault, I didn't look carefully at the disconnection reason! As I have already mentioned, the mqtt client successfully reconnects after MQTT_CLIENT: No PING_RESP, disconnected, and it also successfully reconnects in all other cases except: MQTT_CLIENT: Error write data or timeout, written len = 0, errno=0. I'll try to catch that case. The problem is that after applying your patch I can't see the error lines mentioned above in the log... The problem will occur if write_len == 0 in:
If write_len < 0 everything is OK. Also I should say that there are 2 places in mqtt_process_receive() where the return value of mqtt_write_data() is not checked |
Yes, that's the point. The error scenario started the same way, but continued a bit differently from previous logs. Moreover it seems to assert when a memory next to the watchpoint was modified. Therefore I asked about more boards, modules where this could be reproduced.
Correct, but that's (very simple) logic of the mqtt library. Timeouts are treated as errors and disconnection is triggered (could perhaps make this timeout configurable). Anyway it should and must always reconnect.
True, similar logic needs to be applied to these places as well! Thanks for noting, this is to be fixed! |
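A minimal host-side sketch of the point discussed above, assuming that both a timeout (write_len == 0) and a socket error (write_len < 0) should end in the same disconnect-and-reconnect path. The type client_model_t and the function handle_write_result are hypothetical and only model the decision; they are not the library's code.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the client state (illustration only). */
typedef struct {
    bool connected;
    bool wants_reconnect;
} client_model_t;

/* Returns true when the write succeeded. A timeout (0) and an error (< 0)
 * take the same path, so neither can leave the client in a half-dead state. */
bool handle_write_result(client_model_t *c, int write_len)
{
    if (write_len <= 0) {
        c->connected = false;
        c->wants_reconnect = true;
        return false;
    }
    return true;
}
```

Routing every write through one helper of this kind also covers the call sites whose return value is currently ignored.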
@david-cermak, When I catch the bug, I'll post the results |
I've got some results. Tested on a brand new board TTGO T8 v1.7 with Espressif 64Mbit PSRAM. In the first run I blocked the ESP's wifi connection on my WiFi router for 30 seconds and the watchpoint triggered; in the second run I got MQTT_CLIENT: Error write data or timeout, written len = 0, errno=0 and the watchpoint didn't trigger. So the problem is that when E (7212) MQTT_CLIENT: Error write data or timeout, written len = -1, errno=113 occurs the mqtt lib sets client->run = false, and when MQTT_CLIENT: Error write data or timeout, written len = 0, errno=0 occurs it doesn't. I put esp_backtrace_print(20); right after: ESP_LOGE(TAG, "Error write data or timeout, written len = %d, errno=%d", write_len, errno); and got:
in esp_mqtt_abort_connection():
I get return code: -1 when MQTT_CLIENT: Error write data or timeout, written len = 0, errno=0 occurs, and in all other cases (for example if I stop the mosquitto service on the server) I get return code: 0. Also I can't understand how this could be:
Backtrace:0x400E60D3:0x3FFD3C00 0x400E7D0C:0x3FFD3C30 0x400D99C2:0x3FFD3C80 0x40092D85:0x3FFD3CA0
0x400e7d0c: esp_mqtt_client_publish at /home/user/esp/esp-mdf/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1354
0x400d99c2: mqtt_task2 at /home/user/esp/mqtt_bug/build/../main/main.c:419 (discriminator 9)
0x40092d85: vPortTaskWrapper at /home/user/esp/esp-mdf/esp-idf/components/freertos/port.c:143
V (32263) MQTT_CLIENT: esp_mqtt_abort_connection: 341
So, if you have any ideas, I'm ready to help |
Thanks for posting this log. (Again no exception wrt the level of my confusion.) This looks very, very strange... I'm assuming you're using the latest patch (+ the added bt_print) with watchpoints enabled (I can see that in the log). The logs from lines 1143-1160 print values of
that doesn't mean The only trouble is to find out why client->run changes and why the watchpoint is not triggered while its state is changed. Cannot think of any reason... |
The statement "The only trouble is to find out why the client->run changes and why the watchpoint is not triggered while its state is changed. Cannot think of any reason..." means that on different boards I get the same memory problem, probably with External RAM. Here is my sdkconfig: https://github.com/no1seman/mqtt_bug/blob/master/sdkconfig Maybe the problem is there? For example, some caching errors due to wrong settings? |
@no1seman exactly. Thanks for posting the sdkconfig, can see the important option here |
Thank you very much for investigating my problem! I have to remind you about another issue, related to esp_http: espressif/esp-idf#3851. Without PSRAM the http problems also go away. The HTTP case is a little easier to reproduce and test. PS After my previous post I ran the test on the same board but without SPIRAM. Everything works fine, so the problem is 100% in SPIRAM and GCC 8.2.0, because I recall that the problems began after upgrading the toolchain from xtensa-esp32-elf-linux64-1.22.0-80-g6c4433a-5.2.0.tar.gz to xtensa-esp32-elf-gcc (crosstool-NG esp32-2019r1) 8.2.0 (installed by $IDF_PATH/install.sh) PS PS No, I was wrong, the problems are still present even when the 1.22 cNG tools are used for the build, but PSRAM is 100% the source of the problem |
this issue IMHO should be linked to this one: espressif/esp-idf#3624 (comment) and this one too: espressif/esp-idf#2892 |
@no1seman This particular issue of mqtt library not reconnecting might be even worked around with changing bools to integers in client config:
That would however consequently fail somewhere else such as LWIP... I would suggest subscribing to https://github.com/espressif/esp-idf/issues/2892 to get updates on the underlying issue related to PSRAM. |
@david-cermak, Are you planning to commit those changes to the IDF SDK? It seems that losing 6 bytes would be a small cost for making the lib more stable. PS As for #2892, I do not expect that this issue will ever be finally solved with rev 1 ESP32D. More than 2 years have passed (section #3.9 was added to the ECO bugs in June 2017!) and a final solution is still not out. Now I'm refactoring my apps to make them runnable without PSRAM, and after that I'll continue the stress tests. I will also wait for rev 2 and 3, maybe PSRAM will be fully functional there. |
@no1seman No, I don't think I will update the mqtt library to use Thank you too, for the active approach to this issue and mainly for creating the test code! |
OK, it seems that this issue should be closed. I'm busy now refactoring my app. If I face such a problem later, without SPI RAM, I'll reopen it. |
A long-running test of my app with the latest mosquitto 1.6.4 produced 3 different cases:
...
E (1177186) MQTT_CLIENT: No PING_RESP, disconnected
I (1177186) MQTT: MQTT client reconnecting...
I (1177186) MQTT_CLIENT: Client force reconnect requested
I (1177226) MQTT_CLIENT: Sending MQTT CONNECT message, type: 1, id: 0000
I (1177236) MQTT: MQTT client connected
E (3103846) MQTT_CLIENT: No PING_RESP, disconnected
I (3103846) MQTT: MQTT client reconnecting...
I (3103846) MQTT_CLIENT: Client force reconnect requested
I (3103896) MQTT_CLIENT: Sending MQTT CONNECT message, type: 1, id: 0000
I (3103976) MQTT: MQTT client connected
...
...
E (40626) MQTT_CLIENT: mqtt_message_receive: transport_read() error: errno=128
E (40626) MQTT_CLIENT: mqtt_process_receive: mqtt_message_receive() returned -1
I (40636) MQTT: MQTT client reconnecting...
I (40636) MQTT_CLIENT: Client force reconnect requested
I (40716) MQTT_CLIENT: Sending MQTT CONNECT message, type: 1, id: 0000
I (40726) MQTT: MQTT client connected
...
...
E (329776) MQTT_CLIENT: Error write data or timeout, written len = 0, errno=0
I (329776) MQTT: MQTT client reconnecting...
I (329776) MQTT_CLIENT: Client force reconnect requested
...
No connection to the MQTT server after that. What's wrong?
Here is the verbose log of the 3rd case:
log.zip
Logic of my application:
...
...
...
MQTT client connect settings (no SSL):
CONFIG_LWIP_USE_ONLY_LWIP_SELECT = n
PS Tried setting CONFIG_LWIP_USE_ONLY_LWIP_SELECT = y and increasing the buffer size: no effect, the issue is still present
PSPS I should add that after I (329776) MQTT_CLIENT: Client force reconnect requested in the third case, the mqtt client begins to eat heap and finally the app hangs due to out of memory. Also, I hadn't mentioned that I'm using 3 different tasks to publish data to the mqtt server.