
Mqtt client does not reconnect if mqtt server is temporary down #42

Closed
tp1de opened this issue Apr 7, 2021 · 37 comments
Labels: bug (Something isn't working), enhancement (New feature or request)

Comments

@tp1de
Contributor

tp1de commented Apr 7, 2021

I did some software updates on my home automation system, so the MQTT server was temporarily down (15 minutes).
After the restart, my other MQTT clients reconnected automatically, but ems-esp32 did not. I had to reconnect manually via the web UI.
I tested it twice: automatic reconnection does not work if the MQTT server was down.

I did some more testing. If I stop the MQTT broker for a couple of minutes, reconnection works.
When I unplug the LAN cable from the server the MQTT broker runs on, ems-esp does not detect a "disconnected" status; it stays "connected" (I tested for 3-4 minutes with the IP address unreachable). When I just stop the broker service but the IP stays reachable, ems-esp detects the disconnect immediately.
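A socket that is still reported as "connected" after the broker's host has dropped off the network is the case an application-level liveness check is meant to catch, because TCP on its own may not notice a silently vanished peer for a long time. The sketch below is hypothetical and is not EMS-ESP or asyncmqttclient code: it assumes the caller supplies a millisecond clock (such as Arduino's `millis()`) and calls `on_packet()` whenever anything arrives from the broker, and it borrows the 1.5x keep-alive window that the MQTT 3.1.1 specification applies on the broker side.

```cpp
#include <cstdint>

// Hypothetical liveness tracker that does not depend on the TCP stack or an
// onDisconnect callback to notice a dead broker. Call on_packet() once when
// the session is established and again whenever any packet (e.g. PINGRESP)
// arrives. If the broker has been silent for more than 1.5x the keep-alive
// interval, treat the session as dead even if the socket still looks open.
class MqttLiveness {
public:
    explicit MqttLiveness(uint32_t keepalive_ms) : keepalive_ms_(keepalive_ms) {}

    // Record that the broker sent something at time now_ms.
    void on_packet(uint32_t now_ms) { last_rx_ms_ = now_ms; }

    // True if the broker has been silent for longer than 1.5x keep-alive.
    bool is_stale(uint32_t now_ms) const {
        // Unsigned subtraction stays correct across a 32-bit clock wrap-around.
        uint32_t silent_ms = now_ms - last_rx_ms_;
        return silent_ms > keepalive_ms_ + keepalive_ms_ / 2;
    }

private:
    uint32_t keepalive_ms_;
    uint32_t last_rx_ms_ = 0;
};
```

When `is_stale()` returns true, the caller would force-close the connection and start its normal reconnect path instead of waiting for a callback that may never fire.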

tp1de added the bug label Apr 7, 2021
@proddy
Contributor

proddy commented May 3, 2021

I've reproduced this but can't fix it. If the MQTT server is down, EMS-ESP will attempt to reconnect. That works, except that after a long wait (>10 min on my system) it fails. The bug is in the underlying asyncmqttclient code, which trips up after multiple calls and doesn't close the pipe, so the onDisconnect callback is never called and EMS-ESP assumes it's still connected. I spent some time trying to fix it, but it's hard. If this is a major problem, one solution is to rewrite the EMS-ESP logic to use what we had back in v1.9 and not use the MQTT library calls.

@tp1de
Contributor Author

tp1de commented May 3, 2021

@proddy
I switched from MQTT to the API, so it's not a problem for me anymore.
But anyone who runs MQTT on a server (like me, on Docker) might face software updates on that server that exceed the 10-minute limit. Do I understand correctly that the new API 3.0 will have a command to restart MQTT?

@tp1de
Contributor Author

tp1de commented May 3, 2021

On the other hand, MQTT was designed to support unstable network connections with delays. My other clients reconnect to the broker without a problem. In the short term I recommend focusing on the new API development, but in the medium term I would recommend finding a solution.

@proddy
Contributor

proddy commented May 3, 2021

I'll file a bug with some reproducible code with the owners of the library.

@proddy
Contributor

proddy commented May 3, 2021

and I think I'll rewrite the logic and copy what esphome does: https://github.com/esphome/esphome/blob/07b3327102f7457f960940a4f5ceae8abb4686cd/esphome/components/mqtt/mqtt_client.cpp#L289

if it still hasn't reconnected it reboots the ESP :-)
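The esphome-style approach can be sketched as a small supervisor: retry with a growing delay while disconnected, and request a reboot once the outage exceeds a timeout. This is a hypothetical illustration, not esphome's or EMS-ESP's code; the class name, the 1 s initial backoff and the timeout values in the usage note are assumptions, and the caller is expected to perform the actual reconnect or `ESP.restart()` when told to.

```cpp
#include <algorithm>
#include <cstdint>

enum class Action { None, Reconnect, Reboot };

// Hypothetical reconnect supervisor. Call tick() periodically from the main
// loop with a millisecond clock and the current connection state.
class ReconnectSupervisor {
public:
    ReconnectSupervisor(uint32_t reboot_timeout_ms, uint32_t max_backoff_ms)
        : reboot_timeout_ms_(reboot_timeout_ms), max_backoff_ms_(max_backoff_ms) {}

    Action tick(uint32_t now_ms, bool connected) {
        if (connected) {
            down_ = false;      // outage over: reset state for the next one
            backoff_ms_ = 1000;
            return Action::None;
        }
        if (!down_) {           // first tick after losing the connection
            down_ = true;
            down_since_ms_ = now_ms;
            next_try_ms_ = now_ms;  // try immediately
        }
        // Give up and reboot if we have been offline for too long (0 disables).
        if (reboot_timeout_ms_ != 0 && now_ms - down_since_ms_ >= reboot_timeout_ms_) {
            return Action::Reboot;
        }
        // Signed difference handles clock wrap-around when comparing deadlines.
        if (static_cast<int32_t>(now_ms - next_try_ms_) >= 0) {
            next_try_ms_ = now_ms + backoff_ms_;
            backoff_ms_ = std::min(backoff_ms_ * 2, max_backoff_ms_);  // exponential backoff, capped
            return Action::Reconnect;
        }
        return Action::None;
    }

private:
    uint32_t reboot_timeout_ms_;
    uint32_t max_backoff_ms_;
    uint32_t backoff_ms_ = 1000;
    uint32_t down_since_ms_ = 0;
    uint32_t next_try_ms_ = 0;
    bool down_ = false;
};
```

For example, `ReconnectSupervisor sup(15 * 60 * 1000, 30000);` would retry at most every 30 s and ask for a reboot after 15 minutes offline. Keeping the decision logic free of hardware calls makes it testable off-device.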

BTW did you try out the new API from the other branch?

@tp1de
Contributor Author

tp1de commented May 4, 2021

> BTW did you try out the new API from the other branch?

Not yet, since I'm struggling with the current API in my ioBroker adapter.
I get occasional ECONNRESET errors on HTTP GET requests from my ioBroker production machine running in Docker (about 1 error every 5 minutes).
From my Windows 10 test machine and a Raspberry Pi 4 I don't get any errors at all.
So I had to adjust the code with try/catch error handling. With a 15-second polling interval a few errors don't matter, but during the initialization phase, when I read all available data points, they hurt. I am still looking at how to handle this. Should I introduce some delay between the HTTP GET calls?

Which branch do you mean: ft_https? And is the current API command structure working?

@proddy
Contributor

proddy commented May 4, 2021

I merged the new API format into dev. It's probably not perfect but the only way to see is if people use it.

It's worth trying to reproduce the ECONNRESET. You could create a small bash script using curl and a sleep. The commands are in doc/EMS-ESP32 API.md: https://github.com/emsesp/EMS-ESP32/blob/e97f6c09e5f11426eb804c445f26921cf647b499/doc/EMS-ESP32%20API.md

@tp1de
Contributor Author

tp1de commented May 4, 2021

I now use a 100 ms delay between requests during the initialization phase, which runs at adapter start, when I read the detailed info for each field.
The init phase now takes 20 seconds for ems-esp, but there are no errors anymore.
Errors during polling (every 15 s) are ignored.
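The throttling described above (a fixed pause between sequential requests during initialization, with errors tolerated) can be sketched as follows. The ioBroker adapter is a separate program, so this C++ version only illustrates the pattern; `fetch_all_throttled` and the `Fetch` callback are hypothetical, and a transport error such as ECONNRESET is assumed to surface as `std::nullopt`.

```cpp
#include <chrono>
#include <functional>
#include <optional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for an HTTP GET: returns the body, or std::nullopt
// on a transport error such as a connection reset.
using Fetch = std::function<std::optional<std::string>(const std::string&)>;

// Read every URL one after another, pausing `gap` after each request and
// retrying up to `max_attempts` times, so one reset does not abort the scan.
std::vector<std::optional<std::string>> fetch_all_throttled(
    const std::vector<std::string>& urls, const Fetch& fetch,
    std::chrono::milliseconds gap, int max_attempts) {
    std::vector<std::optional<std::string>> results;
    results.reserve(urls.size());
    for (const auto& url : urls) {
        std::optional<std::string> body;
        for (int attempt = 0; attempt < max_attempts && !body; ++attempt) {
            body = fetch(url);
            std::this_thread::sleep_for(gap);  // pause after every request so the device is never hit back-to-back
        }
        results.push_back(body);  // stays nullopt if every attempt failed
    }
    return results;
}
```

A call such as `fetch_all_throttled(urls, fetch, std::chrono::milliseconds(100), 3)` mirrors the 100 ms spacing mentioned above; the retry count of 3 is an arbitrary choice.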

I will test the new API version now.

@dutchrazor

I also ran into this issue. Rebooting the ESP if all else fails seems like a good idea to me.

Actually, would it be possible to include an option to have the ESP reboot weekly or even nightly? It is not ideal, but it seems to me like it would solve a lot of reliability issues in the short term.

@tp1de
Contributor Author

tp1de commented May 22, 2021

> Actually, would it be possible to include the option of having the ESP rebooting weekly or even nightly?

I don't think that is a real solution. MQTT and REST API V3 now work very reliably on the ESP32. Only during MQTT broker/server maintenance (>15-20 minutes) do I have this reconnection issue. In any case, the web UI and telnet still work, so a manual MQTT restart is always possible. In the medium term this should be solved to ensure stable long-term operation.

@proddy
Contributor

proddy commented Jul 1, 2021

I spent some time trying to reproduce this, first in EMS-ESP by leaving the MQTT server off for more than 1 hour, and then with a small standalone ESP32 application to help troubleshoot. I wasn't able to get the same error; it reconnected successfully each time. I'm closing this issue, and we can reopen it if there are more cases that can be reproduced.

@daviessm

daviessm commented Sep 6, 2022

This has happened to me a couple of times in the last few days, are there any logs that could be helpful?

I have a hunch it's when the EMS-ESP device starts up before the DHCP server - once the IP address is finally available, MQTT never tries to connect.

Edit: EMS-ESP Version v3.4.1

@proddy
Contributor

proddy commented Sep 6, 2022

> This has happened to me a couple of times in the last few days, are there any logs that could be helpful?
>
> I have a hunch it's when the EMS-ESP device starts up before the DHCP server - once the IP address is finally available, MQTT never tries to connect.
>
> Edit: EMS-ESP Version v3.4.1

The problem is I've never been able to reproduce this. If you have an almost 100% reproducible case where it always fails, we can provide a debug version with extra trace information. The MQTT service in EMS-ESP will always try to reconnect. Let me know if you want a special build.

proddy reopened this Sep 6, 2022
@daviessm

daviessm commented Sep 6, 2022

> If you have an almost 100% use case where it always fails we can provide a debug version with extra trace information. The MQTT service in EMS-ESP will always try and reconnect. Let me know if you want a special build

I'll give it a bash over the next few days and let you know if I can reliably break it.

@daviessm

daviessm commented Sep 7, 2022

> I have a hunch it's when the EMS-ESP device starts up before the DHCP server - once the IP address is finally available, MQTT never tries to connect.
>
> Edit: EMS-ESP Version v3.4.1

I can reproduce it with these steps:

  1. Stop EMS-ESP from being able to get an IP address from DHCP (for testing I blocked access to the DHCP server from the EMS-ESP MAC address, in real life this happens because DHCP is not run by the WiFi access point and the DHCP server was unavailable when EMS-ESP booted)
  2. Reboot EMS-ESP device
  3. EMS-ESP will connect to the WiFi but be unable to get a DHCP lease
  4. Allow DHCP access again
  5. EMS-ESP will get an IP address but MQTT will not connect. The MQTT page in the web interface shows "Status: Disconnected" and "Disconnect Reason: TCP disconnected".

Happy to try a debug build if that would be helpful.

@proddy
Contributor

proddy commented Sep 7, 2022

ok, I'll use those steps too. I take it if you hit Save on the MQTT Settings page it does reconnect?

@daviessm

daviessm commented Sep 7, 2022

Yes it does.

@MichaelDvP
Contributor

@daviessm Do you have ipv6 enabled?

@daviessm

daviessm commented Sep 9, 2022

> @daviessm Do you have ipv6 enabled?

The network has IPv6 enabled, broadcasts SLAAC and has a DHCPv6 server, but EMS-ESP only seems to give itself a link-local address:

IP Address: 10.255.3.13, fe80:0000:0000:0000:aa03:2aff:fe21:f21c

@MichaelDvP
Contributor

Could you try to reproduce the issue with IPv6 disabled? I think it's this line: MQTT is not reconfigured if IPv6 comes up first and IPv4 comes later.

@daviessm

daviessm commented Sep 9, 2022

With IPv6 disabled I can't reproduce the issue. Good spot.

@daviessm

daviessm commented Sep 9, 2022

I suppose it's because the SYSTEM_EVENT_GOT_IP6 event fires when an interface gets any IPv6 address, including the link-local address that is always generated and probably has no route to the MQTT server. Ideally the code should only trigger on receiving an IPv6 address with what Linux calls 'global' scope.
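The distinction above can be expressed as a prefix check. The address shown earlier in this thread, fe80:0000:0000:0000:aa03:2aff:fe21:f21c, falls in the link-local range fe80::/10, so an event for it says nothing about whether the MQTT server is reachable. This is a hypothetical sketch; it treats the global unicast range 2000::/3 as "global", which is narrower than Linux's notion of global scope (Linux, for example, also reports unique local fc00::/7 addresses with global scope).

```cpp
#include <array>
#include <cstdint>

// Raw IPv6 address in network byte order.
using Ipv6Addr = std::array<uint8_t, 16>;

// fe80::/10: first byte 0xfe, top two bits of the second byte are 10 (0x80..0xbf).
bool is_link_local(const Ipv6Addr& a) {
    return a[0] == 0xfe && (a[1] & 0xc0) == 0x80;
}

// 2000::/3 global unicast: top three bits of the first byte are 001.
bool is_global_unicast(const Ipv6Addr& a) {
    return (a[0] & 0xe0) == 0x20;
}
```

A got-IPv6 handler could then skip reconfiguring MQTT while `is_link_local()` is true for the new address.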

A secondary problem is that no global IPv6 address is being requested (through DHCP) or allocated (through SLAAC) - would you like me to raise another issue for that?

Edit: SLAAC is covered in #283.

@MichaelDvP
Contributor

Global address: let's wait for the next Arduino core; there is a PR open.

Normally _EVENT_WIFI_STA_GOT_IP comes before ..GOT_IPV6, but only if IPv4 DHCP is up.
It does not hurt to reconfigure MQTT on every GOT_IPx event; it only produces a few extra mqtt-disconnect/mqtt-connect log entries on startup. The same applies to starting syslog and mDNS.
I'll make the change in dev and v3.5.0.
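The change described above amounts to restarting network-dependent services on every got-IP event, whichever address family arrives first. A minimal sketch, assuming hypothetical event names in place of the real Arduino/ESP-IDF event identifiers and callbacks that wrap whatever restart routines the firmware actually has:

```cpp
#include <functional>

// Hypothetical stand-ins for the station got-IPv4 / got-IPv6 / disconnected events.
enum class NetEvent { StaGotIp4, StaGotIp6, StaDisconnected };

struct NetworkServices {
    std::function<void()> restart_mqtt;
    std::function<void()> restart_syslog;
    std::function<void()> restart_mdns;

    void on_event(NetEvent ev) {
        switch (ev) {
        case NetEvent::StaGotIp4:
        case NetEvent::StaGotIp6:
            // Restart on *every* got-IP event; doing it twice only costs an
            // extra disconnect/connect pair in the log.
            restart_mqtt();
            restart_syslog();
            restart_mdns();
            break;
        case NetEvent::StaDisconnected:
            break;  // nothing to restart until an address is available again
        }
    }
};
```

With this shape, the "IPv6 first, IPv4 later" ordering from the repro steps still ends with MQTT reconfigured after the IPv4 lease arrives.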

MichaelDvP added a commit to MichaelDvP/EMS-ESP32 that referenced this issue Sep 9, 2022
@MichaelDvP
Contributor

Can you check the dev build from here? If it works for you, I'll make a PR.

@daviessm

daviessm commented Sep 9, 2022

Yes that seems to solve the IPv6 problem for me.

proddy added a commit that referenced this issue Sep 9, 2022
MichaelDvP added a commit to MichaelDvP/EMS-ESP32 that referenced this issue Sep 11, 2022
proddy closed this as completed in a020a48 Sep 18, 2022
@daviessm

Sorry to be a pain but I noticed this morning that my MQTT connection was disconnected again. The DHCP/DNS/MQTT server was rebooted and since then there's been no MQTT connection.

This time, there's only an IPv4 address shown in the UI - no IPv6, even though IPv6 is enabled.

@MichaelDvP
Contributor

Looks like a network IPv6 issue. If EMS-ESP gets no IPv6 address, but the MQTT host is IPv6 or DNS returns its IPv6 address, the connection can't work. Let's wait for the Arduino IPv6 update. Please keep us informed about the circumstances of every connection issue.
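The failure mode described above can be avoided by only choosing a broker address whose family the device can actually use. This is a hypothetical helper, assuming the resolver results and the device's own address state are already known; it is not how EMS-ESP or the Arduino core resolves hosts.

```cpp
#include <optional>
#include <string>
#include <vector>

enum class Family { V4, V6 };

struct ResolvedAddr {
    Family family;
    std::string text;  // printable form, e.g. "192.0.2.10"
};

// Return the first resolved broker address the device can reach: an IPv6
// address only if the device holds a global IPv6 address itself, an IPv4
// address only if it has an IPv4 lease. nullopt means "nothing usable yet".
std::optional<ResolvedAddr> pick_broker_address(const std::vector<ResolvedAddr>& candidates,
                                                bool have_ipv4, bool have_global_ipv6) {
    for (const auto& c : candidates) {
        if (c.family == Family::V6 && have_global_ipv6) return c;
        if (c.family == Family::V4 && have_ipv4) return c;
    }
    return std::nullopt;
}
```

Returning `std::nullopt` instead of attempting a doomed connection lets the caller wait for the next got-IP event and try again.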

@daviessm

Sounds good. My MQTT server is set to an IPv4 IP address (no DNS) if that makes any difference.

@MichaelEFlip

MichaelEFlip commented Feb 1, 2024

I can reproduce this issue with ems-esp 3.6.2 on an ESP32.
From time to time the MQTT connection stops working. It seems that if the MQTT broker is unavailable for several minutes (e.g. 15 minutes), there is no reconnect attempt. The HTTP web interface still works. After rebooting ems-esp through the web interface, everything works again.

My MQTT broker is connected remotely via the WireGuard VPN of the AVM Fritzbox. If the internet connection fails (my cable internet connection is unreliable) and comes back after some minutes, this seems to be related to the failing ems-esp MQTT connection.

@proddy
Contributor

proddy commented Feb 3, 2024

3.6.2 is old (about 4 months, and we move quickly!). A lot has been done to address the MQTT reconnect issues. If you're experiencing the same with 3.6.5, do let us know.

@daviessm

daviessm commented Feb 3, 2024

Looks like 3.6.5 isn't released yet and there was no 3.6.3 so 3.6.2 is only one version behind the latest!

@tp1de
Contributor Author

tp1de commented Feb 3, 2024

I do not understand your remarks:

[image]

@tp1de
Contributor Author

tp1de commented Feb 3, 2024

With this version I no longer have any stability problems, neither with MQTT nor with WiFi.
With a mesh WiFi network, I recommend using a fixed BSSID in the network settings.

@MichaelEFlip

I have still MQTT reconnection issues with 3.6.5-dev.11 :-(

@proddy
Contributor

proddy commented Feb 13, 2024

@MichaelEFlip

Yes, I set a BSSID to connect to a designated WiFi AP.
But the problem is the connection to the MQTT broker; the WiFi connection is still up.

@proddy
Contributor

proddy commented Feb 24, 2024

Can you trace with logs and show what is happening? What does the MQTT broker report in its logs? Also make sure you are using unique client IDs, as duplicates may conflict.
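One common way to guarantee unique client IDs across several devices is to derive them from the MAC address. The helper name, the `prefix-xxxxxx` format and the example MAC in the usage note are hypothetical. Note that MQTT 3.1.1 only requires brokers to accept 1 to 23 alphanumeric characters; most brokers also accept hyphens, but use a purely alphanumeric prefix if yours does not.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: append the last three MAC bytes (as lowercase hex)
// to a fixed prefix so each device gets a distinct MQTT client ID.
std::string make_client_id(const std::string& prefix, const std::array<uint8_t, 6>& mac) {
    char suffix[7];  // 6 hex digits + terminating NUL
    std::snprintf(suffix, sizeof(suffix), "%02x%02x%02x", mac[3], mac[4], mac[5]);
    return prefix + "-" + suffix;
}
```

For a made-up MAC of 24:0a:c4:12:34:56, `make_client_id("ems-esp", mac)` yields `ems-esp-123456`.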

6 participants