Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MQTT crashes ems-esp while getting disconnected to server and reconnecting afterwards #1067

Closed
tp1de opened this issue Feb 20, 2023 · 58 comments
Labels
enhancement New feature or request
Milestone

Comments

@tp1de
Copy link
Contributor

tp1de commented Feb 20, 2023

@MichaelDvP @proddy as discussed on Discord:

When MQTT (HA) gets disconnected and reconnects after a while, all entities are removed and created new.
Michael mentioned on Discord:
"this generates a lot of messages (EMSdevice::ha_config_clear() {), because it's for all devices at once, and fills the queue".

When queue becomes too long then ems-esp crashes and reboots. Then the same procedure starts again, but due to discovery of devices / entities while reading telegrams needs some time, the start-up works.

I tested that some seconds of "unavailability" of the mqtt-server brings my ems-esp system to crash.
As a result any short-time wifi-disconnects as well as longer offline backup-sessions will result in crashes.

I recognized that ems-esp wifi disconnects and reconnects during the day and changing the wifi access point on my mesh-network.
This takes some seconds. I might believe that this is the reason for my 2-3 crashes a day.

The automatic removal of the entities should be avoided wherever possible. It takes 1-2 minutes to recreate them by mqtt-discovery on HA. During this time dashboards and automations do not work or even crash. A MQTT reconnect does not need this removal function.

@tp1de tp1de changed the title MQTT crashes ems-esp while restarting MQTT crashes ems-esp while getting disconnected to server and reconnecting afterwards Feb 20, 2023
proddy added a commit that referenced this issue Feb 20, 2023
@proddy proddy added the enhancement New feature or request label Feb 20, 2023
@proddy proddy added this to the v3.6.0 milestone Feb 20, 2023
MichaelDvP added a commit to MichaelDvP/EMS-ESP32 that referenced this issue Feb 20, 2023
@MichaelDvP
Copy link
Contributor

@tp1de Could you compile/test the tech-branch? Or shoud i add a bin, or merge to my 3.5.1-dev? With the RC310 you have more entities, i get not restarts with my system, even if i enable HA discovery.

proddy added a commit that referenced this issue Feb 20, 2023
@tp1de
Copy link
Contributor Author

tp1de commented Feb 20, 2023

Hi guys I am lost for the moment how to get the tech upgrade.

When I clone with command: git clone https://github.com/emsesp/EMS-ESP32.git
I do not see any branches anymore to select the tech branch.

So I used Michaels repository and selected the tech-upgrade branch and compiled to: EMS-ESP-3_5_0-tec_4-ESP32.bin
This bin I can't upload with the webui - error upload failed (507).

Please advice how to go forward .....

btw: What have you changed?

@proddy
Copy link
Contributor

proddy commented Feb 20, 2023

Here's a bin file : EMS-ESP-3_5_0-tec_4-ESP32_4M.zip

The tec branch has various optimizations under the hood which I'm working on. There are no breaking changes or new features.

@tp1de
Copy link
Contributor Author

tp1de commented Feb 20, 2023

thanks @proddy for bin. I get the same upload error 507. What to do?

I finally managed to compile form original repro with tech upgrade.
It was irritating for me that a file was missing in \lib\framework: WWWData.h --- I took the file from @MichaelDvP

But upload does not work for this file neither. (error 507)

@proddy
Copy link
Contributor

proddy commented Feb 20, 2023

the WWWData.h file is built when you use platformio to build the firmware. Do you have a custom pio_local.ini where you override settings? Did you use yarn to build the webui? I don't get the 507 error. This is what the filesize should look like:

RAM:   [==        ]  17.5% (used 57260 bytes from 327680 bytes)
Flash: [==========]  99.1% (used 2014113 bytes from 2031616 bytes)

@tp1de
Copy link
Contributor Author

tp1de commented Feb 20, 2023

I started with an empty directory on scratch. No own ini-files. Main I can compile and upload without errors.
While using tech-upgrade I get errors that yarn is not found.

But your bin does not work either for upload.
I will stop for today ...

@MichaelDvP
Copy link
Contributor

I get errors that yarn is not found.

See: https://yarnpkg.com/getting-started/install
In admin shell type corepack enable
I've also added this change to my v3.5.1-dev

@tp1de
Copy link
Contributor Author

tp1de commented Feb 21, 2023

@MichaelDvP thanks for reply. I was able to compile and upload from your tech-upgrade branch.

As far as I can observe after first testing the crash can be avoided by limiting the number of entries in the queue.
I think the behavior is similar to a restart. I will do some more tests.

Nevertheless the logic stays the same. The entities are removed first and then they are new created.
Can't this be avoided while reconnecting mqtt?

@MichaelDvP
Copy link
Contributor

Nevertheless the logic stays the same. The entities are removed first and then they are new created.

No. Do you have the latest commit? Only inactive (missing) entities are removed, active are rebuild. (This is the idea, maybe i've done a mistake in code).

@tp1de
Copy link
Contributor Author

tp1de commented Feb 21, 2023

No. Do you have the latest commit?

Yes. I shall use the tech-upgrade branch and not the dev - isn't it?

... and after doing some more testing I still have sometimes new reboots / crashes - maybe in 30% of all cases.

@MichaelDvP
Copy link
Contributor

Yes. I shall use the tech-upgrade branch and not the dev - isn't it?

The tech branch is updated frequently, do you have commit 050ecd9

You can also use my dev v3.5.1-dev0 https://github.com/MichaelDvP/EMS-ESP32/releases/tag/latest

@tp1de
Copy link
Contributor Author

tp1de commented Feb 21, 2023

btw: I believe that I might have found the root cause for my Wifi stability problem then causing some of the reboots. #1072

@tp1de
Copy link
Contributor Author

tp1de commented Feb 21, 2023

You can also use my dev v3.5.1-dev0 https://github.com/MichaelDvP/EMS-ESP32/releases/tag/latest

I used the bin from there. Is this ok? (I do not know how to check if I have all commits ...)

As far as I can see after first tests, the thermostat and mixer entities are not removed anymore. But the boiler entities are still deleted and made new - this takes a while. Could this be @MichaelDvP ?

@proddy
Copy link
Contributor

proddy commented Feb 21, 2023

See: https://yarnpkg.com/getting-started/install In admin shell type corepack enable I've also added this change to my v3.5.1-dev

wow, never knew that. Thanks Michael. I've updated the docs too https://emsesp.github.io/docs/Building/

MichaelDvP added a commit to MichaelDvP/EMS-ESP32 that referenced this issue Feb 21, 2023
MichaelDvP added a commit to MichaelDvP/EMS-ESP32 that referenced this issue Feb 21, 2023
@MichaelDvP
Copy link
Contributor

I'va added another PR, with this change i get for coldstart and reconnect this: (the extra logging is not included in official software)

2023-02-21 13:43:10.000 INFO 1: [emsesp] Last system reset reason Core0: Software reset CPU, Core1: Software reset CPU
2023-02-21 13:43:10.000 INFO 2: [emsesp] Loading board profile E32
2023-02-21 13:43:10.000 INFO 3: [emsesp] Starting EMS-ESP version 3.5.0-tec.3 (hostname: ems-esp-eth)
2023-02-21 13:43:10.000 INFO 4: [dallassensor] Starting Dallas sensor service
2023-02-21 13:43:10.000 INFO 5: [analogsensor] Starting Analog sensor service
2023-02-21 13:43:10.000 INFO 5: [analogsensor] Starting Analog sensor service
2023-02-21 13:43:10.000 INFO 6: [emsesp] Loaded EMS device library (113 records)
2023-02-21 13:43:17.299 INFO 7: [emsesp] Starting NTP
2023-02-21 13:43:17.299 INFO 7: [emsesp] Starting NTP
2023-02-21 13:43:17.299 INFO 8: [emsesp] WiFi connected with IP=192.168.0.34, hostname=ems-esp-eth
2023-02-21 13:43:17.332 INFO 9: [mqtt] MQTT connected
2023-02-21 13:43:18.204 INFO 10: [system] NTP connected
2023-02-21 13:43:23.741 INFO 11: [emsesp] Recognized new boiler with deviceID 0x08
2023-02-21 13:43:24.307 INFO 12: [emsesp] Recognized new controller with deviceID 0x09
2023-02-21 13:43:28.141 ERROR 13: [telegram] Last Tx Read operation failed after 3 retries. Ignoring request: 0D 8B 02 00 1B
2023-02-21 13:43:28.141 INFO 14: [emsesp] Recognized new connect with deviceID 0x0B
2023-02-21 13:43:28.460 INFO 15: [emsesp] Recognized new thermostat with deviceID 0x10
2023-02-21 13:43:29.439 INFO 16: [emsesp] Recognized new mixer with deviceID 0x21
2023-02-21 13:43:30.127 INFO 17: [emsesp] Recognized new solar with deviceID 0x30
2023-02-21 13:43:40.001 INFO 18: [emsesp] Removed 91 HA-boiler-configs
2023-02-21 13:43:40.001 INFO 19: [emsesp] Created 52 HA-boiler-configs
2023-02-21 13:43:44.649 ERROR 20: [telegram] Last Tx Read operation failed after 3 retries. Ignoring request: 0D 8B 02 00 1B
2023-02-21 13:43:48.381 INFO 21: [emsesp] Removed 23 HA-thermostat-configs
2023-02-21 13:43:48.381 INFO 22: [emsesp] Created 47 HA-thermostat-configs
2023-02-21 13:43:52.465 INFO 23: [emsesp] Created 1 HA-thermostat-configs
2023-02-21 13:43:52.583 INFO 24: [emsesp] Removed 6 HA-mixer-configs
2023-02-21 13:43:52.583 INFO 25: [emsesp] Created 6 HA-mixer-configs
2023-02-21 13:44:00.129 INFO 26: [emsesp] Created 5 HA-thermostat-configs
2023-02-21 13:44:00.443 INFO 27: [emsesp] Removed 16 HA-solar-configs
2023-02-21 13:44:00.443 INFO 28: [emsesp] Created 16 HA-solar-configs
2023-02-21 13:44:20.114 INFO 29: [emsesp] Created 2 HA-thermostat-configs
2023-02-21 13:45:17.390 WARNING 30: [mqtt] MQTT disconnected: TCP
2023-02-21 13:45:27.628 INFO 31: [mqtt] MQTT connected
2023-02-21 13:45:29.008 INFO 32: [emsesp] Created 52 HA-boiler-configs
2023-02-21 13:45:31.994 INFO 33: [emsesp] Created 55 HA-thermostat-configs
2023-02-21 13:45:34.951 INFO 34: [emsesp] Created 16 HA-solar-configs
2023-02-21 13:45:34.951 INFO 34: [emsesp] Created 16 HA-solar-configs
2023-02-21 13:45:35.869 INFO 35: [emsesp] Created 6 HA-mixer-configs

@tp1de
Copy link
Contributor Author

tp1de commented Feb 21, 2023

@MichaelDvP how to use it? It's not within your latest bin. And when I compile neither .... what it's wrong on my side?

proddy added a commit that referenced this issue Feb 21, 2023
@MichaelDvP
Copy link
Contributor

As mentioned, i've added the counting for my personal analysis, it's not in the official builds.

@proddy
Copy link
Contributor

proddy commented Feb 21, 2023

I'm getting crashes with the MQTT HA discovery enabled. I haven't had time to locate what is causing it. To reproduce compile with -DEMSESP_TEST, upload to an ESP32 with MQTT and Discovery enabled, and run test from the console. It will load a boiler with 176 entities and crashes after 5-10 seconds.

@tp1de
Copy link
Contributor Author

tp1de commented Feb 22, 2023

Let me recommend for reconnecting mqtt (HA):

  • differ between reconnecting and new startup (ems-restart or mqtt config changes)
  • do not remove mqtt entries when reconnecting
  • do not resend mqtt config when reconnecting
  • just start sending value data - with a delay of > 30 secs after reconnecting
  • any queue is not needed before starting resending values

@proddy
Copy link
Contributor

proddy commented Feb 22, 2023

I'm getting crashes with the MQTT HA discovery enabled. I haven't had time to locate what is causing it. To reproduce compile with -DEMSESP_TEST, upload to an ESP32 with MQTT and Discovery enabled, and run test from the console. It will load a boiler with 176 entities and crashes after 5-10 seconds.

it was a bug I introduced, sorry. fixing.

@tp1de
Copy link
Contributor Author

tp1de commented Feb 22, 2023

When EMS-ESP crashes you'll get a push notification.

Installed the automation - works I get 2 notifications with one minute in between. Is ems-esp restarting then twice?

@proddy
Copy link
Contributor

proddy commented Feb 22, 2023

When EMS-ESP crashes you'll get a push notification.

Installed the automation - works I get 2 notifications with one minute in between. Is ems-esp restarting then twice?

i think it's because you set the retained flag? best check with MQTTExplorer.

In any case don't take any of the tec builds for now because they ain't working!

@MichaelDvP
Copy link
Contributor

But if retained anyhow there is no need to resend config when reconnecting

retain means the server keeps the messages if client is down, but if server is down it keeps only if it has a permanent storage, that's not always, so we have to resend.

@tp1de
Copy link
Contributor Author

tp1de commented Feb 22, 2023

OK guys thanks for clarifying. I will wait until the reboot topic / crash is solved.

Since I installed the MQTT server independent from the HA / ioBroker installation now, the offline backups have no influence on ems-esp anymore and while changing from WPA2/WP3 to WPA2(fixed) Wifi seems to be stable as well (I will observe more).

@proddy
Copy link
Contributor

proddy commented Feb 22, 2023

@MichaelDvP I'm really struggling to find why it's crashing and think it's memory related. Are we adding more to the MQTT now? I solved in temporarily in Mqtt::queue_message() by checking the heap. When you have time could you take a look? My brain is fuzzy and I'm probably not thinking straight (at home in bed with covid at the moment 👎🏻 )

edit: running a performance test, with the new DeviceValueState::DV_HA_CONFIG_RECREATE logic I can see a big increase in heap usage. Before these last changes it was using 26MB of Heap and now with the new changes it's 3 times more, 68MB, even when MQTT queue is capped to 200 (instead of 300), which is why it's bombing. So we need to re-think the logic a little.

@tp1de
Copy link
Contributor Author

tp1de commented Feb 22, 2023

@proddy take it easy. I wish all the best and a good course of your recovery.

@MichaelDvP
Copy link
Contributor

This test build also crashes wth mqtt enabled and HA disabled. Heap is 103/65 and should not be the problem, Maybe a serialization buffer inside mqtt-lib?
Or the test itself? I'm not familiar with the test addition, not sure where to search.

@proddy
Copy link
Contributor

proddy commented Feb 22, 2023

I have no idea. I'm still playing around and see what I find. I wish I documented the logic!

@proddy
Copy link
Contributor

proddy commented Feb 22, 2023

I merged the HA config changes from @pswid so we don't need to remove the HA topics on a restart or reconnect anymore.

But still something is eating the heap and I can't find out why.

In any case, one good change would be to put process_queue() in it's on thread (using xTaskCreatePinnedToCore) so its always looking at the queue. This will keep the buffer queue smaller.

@MichaelDvP
Copy link
Contributor

This change from @pswid is great, so we only have to rebuild. But ony skipping the config messages is not good:

  • creating new configs eats heap for the json, that is not freed fast enough
  • some configs are skipped and will be missing.

I tried to stop generating configs on low heap, insert in EMSdevice::mqtt_ha_entity_config_create() L1726

        if (ESP.getFreeHeap() < (65 * 1024)) {
            break;
        }

The creation stops on low heap and continues on next mqtt schedule.

I'll merge the pswid PR to my 3.5.1-dev, so we have some more testers how needs a binary.

@proddy
Copy link
Contributor

proddy commented Feb 23, 2023

I was thinking maybe we're making this too complicated. An MQTT broker is essentially a queue and we're trying to code our own smart queue on top of it. The original design was based around

  • we wanted to queue messages temporarily until the Network and MQTT connects during a cold-start, so nothing is missed. With a smaller buffer that should be ok.
  • we were flooding the LWiP stack on the ESP8266 in v2 with all the messages and needed to throttle it down

I believe the rule should be to publish the Mqtt messages, i.e. clear our own queue, as fast as we can, and constantly. So a new design could be

  • don't create discovery HA config topics until we have an active MQTT connection
  • do a real MQTT publish immediately on the config topics, skipping the queue completely

@tp1de
Copy link
Contributor Author

tp1de commented Feb 23, 2023

I'll merge the pswid PR to my 3.5.1-dev, so we have some more testers how needs a binary.

Just tell when this done .... I am ready to test.
(no problem to compile myself - but I am still confused which repository and branch I should take )

@proddy
Copy link
Contributor

proddy commented Feb 23, 2023

it's already in the tech-upgrade branch.

@tp1de
Copy link
Contributor Author

tp1de commented Feb 23, 2023

it's already in the tech-upgrade branch.

If I compile und install from your tech-upgrade branch ems-esp is permanently rebooting. The version before and the actual version both do not work stable. (rebooting after approx. 15-20 seconds)

@tp1de
Copy link
Contributor Author

tp1de commented Feb 23, 2023

I compiled the dev branch from Michael including the last 7 new commits. This version seems to work. I tested 3 times until yet to stop mqtt server and restarting it. Until now no crashes / reboots anymore and the removal of the configs seems not to happen anymore. I will continue testing.

What is different from @MichaelDvP repository and yours @proddy ?

@tp1de
Copy link
Contributor Author

tp1de commented Feb 23, 2023

I tested a bit more. Everything is fine for me for reconnecting to mqtt server. 👍

I was just surprised that on restarting ems-esp there are no config removals anymore.
Is this desired? When reading #1077 I understood yes.
This is very nice and a major improvement. I am happy.

@proddy
Copy link
Contributor

proddy commented Feb 23, 2023

What is different from @MichaelDvP repository and yours @proddy ?

about 200 commits.

There's still a bug which I'm hunting down. Doesn't crash on my system but something doesn't look right.

@tp1de
Copy link
Contributor Author

tp1de commented Feb 23, 2023

I tested a bit more. When deleting configs with mqtt explorer then a restart will create them new again. (restart needed)
So from my point of view everything works fine and we could close this issue.
Or do you want to do some more testing?

btw: I recognized that 70% of the ems-esp system HA entities are renamed? Was this intended? I lost therefor some statistics (e.g. WiFi strength). Not really a problem but not expected.

@proddy
Copy link
Contributor

proddy commented Feb 23, 2023

What is different from @MichaelDvP repository and yours @proddy ?

about 200 commits.

There's still a bug which I'm hunting down. Doesn't crash on my system but something doesn't look right.

think i fixed it, changed so many things I still don't know what caused it. A nullptr somewhere I expect.

@proddy
Copy link
Contributor

proddy commented Feb 23, 2023

btw: I recognized that 70% of the ems-esp system HA entities are renamed? Was this intended? I lost therefor some statistics (e.g. WiFi strength). Not really a problem but not expected.

can you give an example or the HA entity name before and after?

@tp1de
Copy link
Contributor Author

tp1de commented Feb 23, 2023

can you give an example or the HA entity name before and after?

Not really anymore before. E.g. Uptime stays as before wifi strength or rssi not.

@MichaelDvP
Copy link
Contributor

btw: I recognized that 70% of the ems-esp system HA entities are renamed?

Have you checked/changed mqtt-settings: Entity-ID-format?

@tp1de
Copy link
Contributor Author

tp1de commented Feb 24, 2023

Have you checked/changed mqtt-settings: Entity-ID-format?

No changes in format, all other entities are the same, just those related to ems-esp (system) are changed (not all). No problem for me, I was just surprised. I recognized while watching my WiFi stability. With the changed entity I lost the long-term history....

@proddy
Copy link
Contributor

proddy commented Feb 24, 2023

I haven't noticed it, but I'm using a different branch and build. You said 70% of the entities had changed so I was worried for a minute.

@tp1de
Copy link
Contributor Author

tp1de commented Feb 24, 2023

old name ---- new name

sensor.system_wifi_rssi - sensor.system_rssi
sensor.system_wifi_strength - sensor.system_wifistrength
sensor.system_mqtt_fails - sensor.system_mqttfails
sensor.system_rx_received - sensor.system_rxreceived
sensor.system_rx_fails - sensor.system_rxfails
sensor.system_tx_reads - sensor.system_txreads
sensor.system_tx_writes - sensor.system_txwrites
sensor.system_tx_fails - sensor.system_txfails

@proddy
Copy link
Contributor

proddy commented Feb 24, 2023

well spotted! I'll do a fix.

EDIT: couldn't reproduce it. I tested all MQTT options on the official 3.5.0 release against the tech-upgrade branch and the sensor names in HA are identical. Are you using another build from somewhere?

@tp1de
Copy link
Contributor Author

tp1de commented Feb 24, 2023

Are you using another build from somewhere?

Not as far as I can remember .... it's not important for me anyhow.
Keep it as it is. I deleted the old ones from my statistics db now.

@proddy proddy closed this as completed Mar 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants