Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EMS-ESP status disconnected #859

Closed
HansRemmerswaal opened this issue Dec 29, 2022 · 41 comments
Closed

EMS-ESP status disconnected #859

HansRemmerswaal opened this issue Dec 29, 2022 · 41 comments
Labels
bug Something isn't working
Milestone

Comments

@HansRemmerswaal
Copy link

Question
I sometimes see some "gaps" in some measurement in HA. Checking the EMS-ESP status then I see that around those times the status was disconnected. However, when checking the logs I do not see anything strange on the EMS-ESP.

Not sure yet where is the problem. Can we, based on the attached log file, exclude that the issue is in EMS-ESP?

Screenshots
image

Device information
emsesp_info.txt

Additional context
log 20221228_153827 to 20221229_072137.zip

@HansRemmerswaal HansRemmerswaal added the question Question about something label Dec 29, 2022
@proddy
Copy link
Contributor

proddy commented Dec 29, 2022

have a look at the BUS status and uptime to see if your EMS-ESP is losing network connectivity (your RSSI is very low at -86, see https://www.metageek.com/training/resources/understanding-rssi/) or your EMS-ESP is restarting dur to out-of-memory (your available free memory at 83 and available block size is also very low).

@HansRemmerswaal
Copy link
Author

The why the RSSI is that low I do not understand, there is a WIFI mesh module nearby. But that might be related to the fact that I can't see the EMS-ESP as device connected to the network... Working on that.

Question is why it is restarting that often. It looks like that this restart frequency has increased since I installed EMS-ESP-3_5_0b12-ESP32 and higher.

In the last two graphs you see the uptime, strange that I have 2 different uptime fields from EMS-ESP, one before and one after the upgrade. Looks like the UoM has changed...

When I look at the Free Memory graph I get the impression there is enough free memory. But maybe the system did not get enough time to interface the low memory.

image

image

image

image

@HansRemmerswaal
Copy link
Author

I have done a factory reset and configured all again, now with fixed IP. Right now the RSSI is -47 so it connected to the nearby DECO M5. But I doubt that this RSSI was causing the restart of the device each time.

emsesp_info.txt

@HansRemmerswaal
Copy link
Author

HansRemmerswaal commented Dec 29, 2022

And again the EMS-ESP restarted.
Not sure if it helps but I have reduced in the System Log the Max Buffer Size from 50 to 25, maybe that will reduce the memory usage.

@HansRemmerswaal
Copy link
Author

HansRemmerswaal commented Dec 30, 2022

Keeps restarting. Is there a way to capture the reason for the restart? E.g. the reset reason is "Software reset CPU", so maybe a more detailed reason can be given for the root cause. Or maybe a dump can be stored somewhere with the needed details.

I'm using this version: https://bbqkees-electronics.nl/product/gateway-e32-ethernet-edition/

"System Info": {
"version": "3.5.0-dev.14",
"platform": "ESP32",
"uptime": "000+05:36:15.956",
"uptime (seconds)": 20175,
"free mem": 83,
"max alloc": 39,
"free app": 28,
"reset reason": "Software reset CPU / Software reset CPU"
},

image

@proddy
Copy link
Contributor

proddy commented Dec 30, 2022

hmm, it could me memory related. Your free mem and max alloc is very low (mine is 133/75, yours 83/39). The first thing I would do is change the Log Buffer max size. I updated the wiki with some other things to try out: https://emsesp.github.io/docs/Troubleshooting/#ems-esp-sometimes-crashes-and-restarts

sorry about this. I know how annoying it can be. It's hard to simulate everyone's setup but we're working on it.

@HansRemmerswaal
Copy link
Author

If the WebUI is accessible, go to System->System Status and look at the Heap. If the Free memory is below 90KB or the Max allocation below 45KB then that is an issue and you'll need to turn of services, try again and report this.

In the settings I disabled all the options (in my case Telnet Console and Analog Sensors)
System Log is INFO and Buffer Size is 25 with Compact.

Question: why do we need a Buffer Size? If we keep the screen open we keep all the messages. Would Buffer Size = 0 be an option or will that not save much memory?

Shortly after restart I have:
Heap (Free / Max Alloc)
139 KB / 95 KB

A few minutes later when all the entities are found:
Heap (Free / Max Alloc)
104 KB / 39 KB

image

With NTP and MQTT disabled:
Heap (Free / Max Alloc)
108 KB / 59 KB

With MQTT enabled and NTP and OTA disabled:
Heap (Free / Max Alloc)
105 KB / 41 KB

So not much saving by disabling services. Any suggestion how to free up memory?
Excluding entities somehow from the dashboard?

000+00:00:00.000 I 0: [emsesp] Last system reset reason Core0: Software reset CPU, Core1: Software reset CPU
000+00:00:00.000 I 1: [emsesp] Loading board profile E32
000+00:00:00.000 I 2: [system] Current version from settings is 3.5.0-dev.14

@proddy
Copy link
Contributor

proddy commented Dec 30, 2022

Question: why do we need a Buffer Size? If we keep the screen open we keep all the messages. Would Buffer Size = 0 be an option or will that not save much memory?

It's so you can go to the Log page and always see the last logs, which is useful when EMS-ESP is booting up

I'm going to try and simulate your setup by adding dummy devices so I can trace what is happening.

@HansRemmerswaal
Copy link
Author

HansRemmerswaal commented Dec 31, 2022

I don't want to be to optimistic, but the EMS-ESP didn't restarted anymore since I disabled...
Telnet Console
Analog Sensors
OTA Updates
.. and put System Log to INFO. (buffer size was already 25)

image
image
image

@proddy
Copy link
Contributor

proddy commented Dec 31, 2022

that's good but equally strange since Telnet/Analog/OTA don't really use a lot of heap memory. 100/41 is just on the edge. Michael and I are going to do some memory optimizations in 3.6.0 as soon as the current 3.5.x is out.

Out of interest, are you compiling yourself or taking the firmware .bin files from the repo? Reason I'm asking is that there was a pio library update a few days ago that is using by default espressif32 5.3.0 which adds a lot more mem

@HansRemmerswaal
Copy link
Author

HansRemmerswaal commented Dec 31, 2022

No, I'm using your dev build from https://github.com/emsesp/EMS-ESP32/releases. I have installed the one of Dec 28th. I see there is one of yesterday, will install that one.

Didn't help, 106 KB / 40 KB.

@proddy
Copy link
Contributor

proddy commented Dec 31, 2022

@MichaelDvP what do you think? It's 100+ KB free heap seem normal for 3 devices (boiler and 2 thermostats) with a total of180 entities. My system has 78 so hard to compare.

I'm planning to make some changes to test.cpp so we can simulate adding devices on a standalone ESP32 and watch how EMS-ESP handles the memory.

@MichaelDvP
Copy link
Contributor

Yes, in #857 the system is E32, boiler, heatsource, thermostat, 2x mixer, solar and also ~100k heap and 38k block.
I think the max allocable block is the main issue. We need for dashboard, customization 20k msgpackresponse and for api 16k prettyjsonresponse. I think the responses also needs buffer for serialization, so run into issues with lower ~38k block.

The heap is imo not critical, it's the fragmentation. I tried to malloc(40k) in emsesp-start and in dashboard, custoimzation and api i free the buffer, alloc the reponse and after send realloc the 40k buffer. Seems to work good. Worst what could happen is, that realloc 40k failed, then it's normal emsesp again.

Now i'm testing to malloc another extra 20k just to reduce free heap. Also unlimited 100 message log buffer.
grafik
works since 50 min. Let's see if stable till next year.

@HansRemmerswaal
Copy link
Author

I was to optimistic... Restarted twice today, at 04:00 and 10:00.

@proddy
Copy link
Contributor

proddy commented Jan 2, 2023

another to try is using the customization web page and disabling about 30-50 of the entities, or things you're not interested in seeing (from the 180 you have). And then use http://ems-esp.local/api/system/info to see the "free mem" and "max alloc"

@HansRemmerswaal
Copy link
Author

HansRemmerswaal commented Jan 2, 2023

I reduced to 4, did a restart, and there was no impact on the Heap.
I reduced them by;

  • exclude from MQTT and API
  • hide from Dashboard

emsesp_info (3).txt

image
image

@proddy
Copy link
Contributor

proddy commented Jan 2, 2023

ok, I'll make some code changes to simulate your devices so I can debug locally

@proddy proddy added the bug Something isn't working label Jan 2, 2023
@proddy
Copy link
Contributor

proddy commented Jan 2, 2023

also, I realized that it's a heap memory issue and it crashes when you're not doing anything manually (in console, in web etc). When you disable entities in the customization page it still loads all the details into memory, just sets flags etc. So sorry I led you on a bit of a goose chase. I'll try something else and send over a custom build for you to try.

@HansRemmerswaal
Copy link
Author

As a test I temporary installed EMS-ESP-3_4_4-ESP32.
So those 21 extra entities consume that much additional memory? Or is this the result of other changes?

EMS-ESP-3_4_4-ESP32
image
image
emsesp_info EMS-ESP-3_4_4-ESP32.txt

EMS-ESP-3_5_0-dev_14-ESP32 - Jan 2nd
image
image
emsesp_info EMS-ESP-3_5_0-dev_14-ESP32.txt

@HansRemmerswaal
Copy link
Author

Just thinking...

If people use MQTT to send the data to HA or any other tool, then the WebUI is most likely only used for making configuration changes. Can we think of a new option in the Settings menu to enable / disable some WebUI functionality to reduce the memory usage for those using MQTT?

E.g. for me the Dashboard can be disabled or limited when all is running fine as I will look at the values in HA. And when I need to make some changes I can then temporary enable them again.

But not sure if that will save a lot of memory.

@HansRemmerswaal
Copy link
Author

Heap (Free / Max Alloc)
103 KB / 39 KB

This sounds as if there is a lot of free memory, so why does the EMS-ESP then restarts? Is there a process that suddenly requires a lot of memory? A big backlog of MQTT messages?

And how does the following sum up to 103 KB?
"free mem": 87,
"max alloc": 39,
"free app": 29,

@MichaelDvP
Copy link
Contributor

Do you have mqtt qos set? The mqtt queue is 300 messages, needed for HA autoconfig. If qos fail it could be that the queue is filled with large data messages, consuming a lot of ram. (on mqtt disconnect output is stopped).

And how does the following sum up to 103 KB?

It does not. The output of system info takes a 16k buffer from heap, 87 + 16 ->103.
Affter sending the system info the heap is 103 again as listed in status page.
Free app is the remaining flash for future updates.

@HansRemmerswaal
Copy link
Author

QoS = 0

So if I understand correctly then if messages fail this should not result in a big queue of messages.

@HansRemmerswaal
Copy link
Author

HansRemmerswaal commented Jan 3, 2023

Thinking about something else...

I can connect to the EMS-ESP
The EMS-ESP is able to send the messages to HA
On a regular base the EMS-ESP restarts
And often I see in the app of my router (DECO M5) that the EMS-ESP is in the list of offline clients.

Checking the RSSI it's something like -77 although there is one DECO M5 (mesh network) a few meters away, meaning that it doesn't connect to that nearby one. I just disabled MESH for the EMS-ESP and fixed it to the nearby DECO M5. The RSSI is now -55.

So what if the connection is/was so bad that it takes to long to send the messages and due to that the queue is getting full?

Update:
Still not always shown in the list of Online clients. Even when I'm connected via the WebUI. But at least the RSSI is much better :)

@proddy
Copy link
Contributor

proddy commented Jan 3, 2023

could be, there is a limit of the mqtt buffer obviously and nothing more is added. But there might also be a nasty memory leak in the code somewhere which is creeping up. The best way to monitor this is in HA and see the free mem and alloc mem is falling.

You can also disable mDNS in the Network settings page. This will give you an additional 5kb of heap which may be just enough to keep it alive.

@HansRemmerswaal
Copy link
Author

In HA there is only the "Free memory" for me, not the "Alloc mem".
The "Free memory" is stable around 110 except shortly after the restarts.

After the next restart I will disable the mDNS.

What about the Dallas sensor service? Is that using a lot of memory? I can't disable it but I also do not use it.

[dallassensor] Starting Dallas sensor service

@HansRemmerswaal
Copy link
Author

mDNS disabled

Heap (Free / Max Alloc)
109 KB / 39 KB

@proddy
Copy link
Contributor

proddy commented Jan 4, 2023

Ok, let us know if it still crashes. We're doing some memory optimizations in the background (#869)

@HansRemmerswaal
Copy link
Author

The highest uptime since Dec 20th... So looks promising.

image

@HansRemmerswaal
Copy link
Author

HansRemmerswaal commented Jan 7, 2023

See #894
After disabling the ethernet (settings boardprofile, switch to custom and disable the phy) the heap increased.

Maybe something to add to the It may be memory related in
https://emsesp.github.io/docs/Troubleshooting/#ems-esp-sometimes-crashes-and-restarts

I will check if these extra KB are enough to prevent the restarts.

Heap (Free / Max Alloc)
100 KB / 39 KB

disabled OTA
Heap (Free / Max Alloc)
102 KB / 39 KB

disabled ethernet
Heap (Free / Max Alloc)
118 KB / 45 KB

@proddy
Copy link
Contributor

proddy commented Jan 7, 2023

Thanks Hans for all the time you put into benchmarking the scenarios. It's been really helpful. #869 will help us figure out how much heap and max-alloc we can safely live with before the ESP32 implodes. I'm working on some optimizations in a separate branch, one that includes this max-alloc in the heartbeat MQTT topic so I can see if it starts getting smaller over time.

@emsesp emsesp deleted a comment from JokerGermany Jan 8, 2023
@HansRemmerswaal
Copy link
Author

Do you want to keep this one open or shall we close it?
With disabling all that could be disabled, including the network connection, my EMS-ESP does not seems to disconnect anymore.

We have the #869 in place to improve the heap memory, and on top of that we have the #891 which allows to excluded entities completely if they are not relevant for someone's configuration.

After #869 I can try to enable functionality again.

@proddy
Copy link
Contributor

proddy commented Jan 11, 2023

I'd prefer we keep it open as a reference and reminder. I'm making progress on #869 - using that build I've seen an increase in the max alloc buffer by 24% on my live environment and heap by 8%. It's still not enough though, loading 200+ entities crashes and I'm almost 100% sure it's related to MQTT publishing. I'll have more time in a few weeks to pick this up again as I'm traveling extensively for work this month

@HansRemmerswaal
Copy link
Author

Do I understand that (on high level);

  • we have a certain (dynamic) buffer size to store the MQTT message that are pending to be send
  • at a given moment in time the buffer is full and the system crashes

If the above is true then is the problem related to the speed of sending the MQTT messages or is it related to free up the speed in the buffer?

On top of that, if the setting is QoS0 then there is a risk that messages will get lost. If someone doesn't want that then QoS1 or QoS2 should be used.
So if the setting is QoS0 and the buffer is full.... can't we then not just "skip" the message and not trying to put it in that buffer? When it's put in the buffer the system will fail and all the messages get lost anyway... so then better to lose only 1 or 2 message...

Don't get me wrong, finding and solving the root cause is better. But in the meantime we can try to avoid the restarts by using such tricks..

@proddy
Copy link
Contributor

proddy commented Jan 11, 2023

This is what I will need to investigate, now that I can reproduce the crash. The fix to prevent the restarts is to prevent over-allocation of buffers, and report loss of messages/data as an log ERR. Then we can optimize the memory allocation.

@zillion42
Copy link

zillion42 commented Jan 13, 2023

I have disconnection issues, the s32 drops to AP mode and disconnects from the network. I don't know whether it's related to this issue, but @proddy asked me to mention it here. I will attach my config and the screenshot of the HA Free memory and EMS Status around the time of the last disconnect.
2023-01-13 20_32_01-History – Home Assistant
image

emsesp_settings.json.txt

@zillion42
Copy link

zillion42 commented Jan 13, 2023

Looking at the graph, I just realized that it disconnected after HA Reboot that night. I set up an automation to reboot HA host OS at 5 in the morning. It maybe crashed because report loss of messages/data as an log ERR
image

@zillion42
Copy link

zillion42 commented Jan 13, 2023

I just manually rebootet HA, and it disconnected MQTT and it won't reconnect.

Jan 13 21:04:19 ems-esp telegram 000+01:35:09.631 E 987: Last Tx Read operation failed after 3 retries. Ignoring request: 0B 90 FF 00 19 01 A5
Jan 13 21:07:10 ems-esp mqtt 000+01:38:00.341 W 988: MQTT disconnected: TCP

@proddy
Copy link
Contributor

proddy commented Jan 13, 2023

if you're hosting a Mosquitto broker on HA and restarting HA every night and EMS-ESP doesn't reconnect - is that correct? If so please create a new issue and include the support info. I don't think it's related to this one, I was wrong, sorry

@zillion42
Copy link

I disabled the automation, but yes, that's what happened last time. MQTT Broker is also running on HA. I will open an issue later.

@proddy proddy added this to the v3.6.0 milestone Jan 19, 2023
@proddy
Copy link
Contributor

proddy commented Feb 12, 2023

closing this for now as we know its memory related and workarounds have been published.

@proddy proddy closed this as completed Feb 12, 2023
@proddy proddy reopened this Feb 13, 2023
@emsesp emsesp locked and limited conversation to collaborators Feb 13, 2023
@proddy proddy converted this issue into discussion #998 Feb 13, 2023
@proddy proddy removed the question Question about something label Feb 19, 2023

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants