High RX Fail count and frequent restarts #156
Comments
Too much weirdness here... Checked my HA core log and found the dreaded So, I cleared out the homeassistant tree in MQTT Explorer, restarted EMS-ESP and restarted HA. Now, after 8 minutes uptime, I have no RX Fails and no incomplete telegram warnings in the web interface log set to INFO. Is it possible that the HA core log warning above is somehow linked to the RX Fails and restarts? I'll keep monitoring it a bit longer... For some annoying reason, HA has stopped treating RX Fails as a graphable entity and now treats it as text, so clicking the value just produces a horrid rainbow blob thing rather than a graph. |
Spoke too soon. Restarted EMS-ESP32 from the telnet interface. When it came back up, it started logging RX errors - 25 already in just 2 minutes. HA core log now logging things like...
So there is some sort of link between HA logging errors like this and the RX Fails and restarts. |
The HA logging errors are a direct consequence of the reboots and nothing to worry about. After a reboot, EMS-ESP has to collect data from the bus, which takes some time, and, depending on the publish frequency, it publishes some MQTT messages with missing values. Reboots and RX errors can be caused by power issues. Please check:
|
And when do the rx-fails/reboots start? Does it depend on software version? |
I can't be sure when the problem started - I only noticed it recently. Kees suggested software downgrade too. Happy to do this. What version should I try? Any ideas on how to get HA to plot RX Fails as a line graph again? It's almost as though it's seeing the count as text rather than numeric. |
I need to add back |
This is fixed in the latest dev build
…On Wed, 13 Oct 2021 at 16:42, Norbert Fischer ***@***.***> wrote:
Why is HA now rendering RX Fails like this... [image: image]
<https://user-images.githubusercontent.com/57917148/136909170-8aabd5ea-75f6-417d-9cb5-b35e2d8165ec.png>
... instead of a line graph, as it was 2 days ago?? Graph makes it much
easier to monitor problems.
My workaround to display a line graph in a history graph card is to create
a template sensor, for example with sensor.dallas_faults:
- platform: template
  sensors:
    dallas_number_of_faults:
      value_template: >-
        {{ '%d' % states("sensor.dallas_faults") | int }}
      unit_of_measurement: 'n'
Inserting the entity "sensor.dallas_number_of_faults" in a history graph
card shows a line graph (Home Assistant 2021.10.3).
|
Clearing browser cache seems to fix the login issue. Downgraded to 3.0.1. Let's see what happens with restarts and RX Fails... |
fingers-crossed. I'd be surprised (and sad) if it's indeed software related. |
Me too. I think you'd have loads of people with issues, so I suspect it's my hardware *8( |
There's general weird stuff going on with it. It won't restart properly from the web interface. Firmware doesn't load properly - progress bar whizzes to 100% then nothing happens. Getting to the point where I will wipe the flash and start again. Loading 3.0.1 killed the Dallas sensor too. Weird. |
Reflashed. Had all sorts of hassle trying to set up fresh install through web interface / access point with both Edge and Firefox on laptop. Login page stuck for ever. WiFi network scan stuck for ever. Used Safari on an iPhone and it all worked straight off. Very weird. Tried to restart device on laptop - Restart button went grey, nothing further. Tried on Safari on iPhone, popped up restart banner and restarted immediately. Don't understand this at all. This problem has been going on for months and firmware updates have never worked properly. So there is something odd there. I think I had better luck with Chrome. I'll see how things go from now. |
Looking like this is a hardware issue, as we'd all hoped and expected. I contacted Kees and he suggested checking the output of the plug-in DC-DC converter, which should be 4.5-5 volts. Mine is sitting at 5.37V, so a bit on the high side but not quite at the danger level, 5.7V. It can be adjusted but I'd rather not tinker with it at this point. I pulled the converter out to use USB power and the issues went away. Put the converter back and they haven't recurred - so I've now got over 15 hours of uptime with 10 RX Fails. Given that the only thing I've disturbed was the converter, I have to conclude that there was a problem with the connection between it and the main board and re-plugging it has cleared this. I'll monitor for another 24 hours and then, hopefully, we can close this. I raised #159 in case this could help troubleshoot problems like this. |
are you sure it's restarting? Like are you seeing a similar saw type figure in HA in "System Uptime (sec)". If so then the error could be in the code when processing certain telegrams (out of bounds error for example). In that case you should try an earlier build as you did before. And also use something like SysLog to see the last telegram that was Rx or Tx'd that could have caused the error. |
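The saw-tooth check described above can be sketched in a few lines of Python. This is only an illustration, assuming you have exported the uptime history as (timestamp, uptime in seconds) pairs; the function name `find_restarts` and the sample data are hypothetical and not part of EMS-ESP or Home Assistant.

```python
from typing import List, Tuple


def find_restarts(samples: List[Tuple[float, float]]) -> List[float]:
    """Return the timestamps at which uptime dropped, i.e. a likely restart.

    `samples` is a list of (timestamp, uptime_seconds) pairs in time order.
    Uptime only ever grows while the device is running, so any decrease
    between two consecutive samples marks the falling edge of the saw-tooth.
    """
    restarts = []
    for (_, prev_uptime), (ts, uptime) in zip(samples, samples[1:]):
        if uptime < prev_uptime:
            restarts.append(ts)
    return restarts


# Made-up samples: uptime climbs, drops once at t=120, then climbs again.
samples = [(0, 100), (60, 160), (120, 15), (180, 75)]
print(find_restarts(samples))
```

A steadily rising line with no drops means the device has not restarted during the sampled period, whatever the RX Fail counter says.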
Yup. Uptime now points to a restart at 02:47 this morning. RX Fail count returned to zero at 02:46:59... I'll have to think about setting up a syslog server somewhere, maybe. Hmm... it could be busy. I hadn't considered the possibility of corrupt telegrams crashing the gateway. Is that possible, given the CRC checking? I suppose a prolonged burp of corrupt data could break things. That potentially points to a whole new raft of hardware issues - EMS interface circuit, duff UART on ESP32... |
To save power you can go to network settings, use lower bandwidth and reduce tx power. |
@glitter-ball are you still experiencing regular restarts (even without touching the console/webUI)? |
I am and my priority is to get a new ESP32 into the gateway ASAP. But the new MH-ET ESP32 Live I have is a different version - MiniKIT v2.0 and it won't flash as it's supplied. It needs to be put into flash mode somehow. Investigations continue... Wary of muddying the water further, but I'm seeing something else. After restarts, I'm consistently seeing HA complain about the same couple of points having no definition : rettemp and wwstorage1, I think. Looking in MQTT Explorer, rettemp appears under ems-esp but there's no matching definition under homeassistant. If I clear all the data from MQTT Explorer, it all goes away. But, the mystery is.. why is EMS-ESP sending rettemp briefly, with no definition? I'd either expect both MQTT entries to be there or neither - not just one! I'll try and capture precise details next time. |
I can't help you with the flashing if you're rolling your own. It depends on the USBTTL chip and USB speed. You could try lowering the baud or switching the flash mode from dio to dout. On the wwstoragetemp1 remark: the config topic you are seeing is the definition for Home Assistant. If you look at the contents, it says "fetch the value called 'wwstoragetemp1' from the MQTT topic called 'boiler_data_ww'", but here it seems that it's missing in the |
I checked the code and I remember I've already implemented the dynamic loading, so it will only create the HA entity if it exists. So all good there. wwstoragetemp1 = "storage intern temperature" and wwstoragetemp2 = "storage extern temperature" so yes it should also be in the MQTT topic |
wwstoragetemp1 is not in the MQTT topic, only wwstoragetemp2. 'rettemp' is another one that appears now and then - that's not there either currently. rettemp should not appear at all on my system because the boiler does not have a manufacturer's return temp sensor - I added a Dallas to do that independently. So for both points, it looks as though MQTT payload is including them when there's no matching point in the UI for the data to originate from. It feels like a bug in there somewhere... just trying to put my finger on exactly where! Unless it's linked to the ESP32 hardware issue somehow. Sample of errors...
'rettemp' didn't appear that time. |
The Home Assistant config topics are sticky, meaning they use the Retain flag to stay forever even after a restart. So when EMS-ESP boots up and it hasn't found the wwstoragetemp1 (storage intern temperature) to publish to the MQTT topic |
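To spot this kind of mismatch offline, one option is to compare the retained config topics against the data payloads they point to. The following is a rough sketch, not EMS-ESP code: the topic names, the payload contents and the helper `find_orphaned_configs` are made up for illustration, and it assumes each config uses the `state_topic` and `value_template` keys with a template of the form `{{value_json.<key>}}`.

```python
import json
import re
from typing import Dict, List, Tuple

# Assumption: the value template refers to the payload as value_json.<key>.
KEY_PATTERN = re.compile(r"value_json\.(\w+)")


def find_orphaned_configs(
    configs: Dict[str, str], payloads: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (config_topic, key) pairs whose key is absent from the payload.

    `configs` maps a retained config topic to its JSON body.
    `payloads` maps a state topic to the JSON body last published on it.
    """
    orphans = []
    for config_topic, raw in configs.items():
        config = json.loads(raw)
        match = KEY_PATTERN.search(config.get("value_template", ""))
        if match is None:
            continue  # template does not follow the assumed pattern
        key = match.group(1)
        state = json.loads(payloads.get(config.get("state_topic", ""), "{}"))
        if key not in state:
            orphans.append((config_topic, key))
    return orphans


# Hypothetical snapshot, e.g. copied by hand out of MQTT Explorer.
configs = {
    "homeassistant/sensor/ems-esp/wwstoragetemp1/config": json.dumps(
        {"state_topic": "ems-esp/boiler_data_ww",
         "value_template": "{{value_json.wwstoragetemp1}}"}
    ),
    "homeassistant/sensor/ems-esp/wwstoragetemp2/config": json.dumps(
        {"state_topic": "ems-esp/boiler_data_ww",
         "value_template": "{{value_json.wwstoragetemp2}}"}
    ),
}
payloads = {"ems-esp/boiler_data_ww": json.dumps({"wwstoragetemp2": 48.5})}

print(find_orphaned_configs(configs, payloads))
```

Any pair it reports is a retained definition with no matching value in the data topic, which is the situation that makes HA log a warning after a restart.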
Mrs G-B picking up a bottle right now *8) I've put the new ESP32 in the gateway and it's still clocking up a lot of RX Fails - 72% quality. So, it's something else in the gateway or my bus connection. Personally, I suspect the DC-DC converter. Powering it with USB stops all the RX Fails on the spot. |
@proddy @MichaelDvP - I closed this 4 days ago but here’s the final word! It was the DC-DC converter causing the RX errors and restarts. Got another off eBay and, armed with a 220uF output capacitor to ride through ESP32 startup demand, it works fine - no errors, no restarts. Rest of gateway and boiler OK. So if you get similar reports in future that’s the place to start. Thanx G-B |
ok, good to know. We spent a lot of time chasing this one. I didn't realize until later that you had designed your own circuit! 🤦🏻♂️ |
Err… not much design on my part. Standard Kees gateway. I unplugged the original converter in it which had developed a fault and plugged in a £3 similar one from eBay. It just needed the extra capacitor across its output to get the ESP32 to start reliably - it grabs a few big pulses of current early on. Kees kindly offered me a new gateway at a discount but I didn’t want to just chuck this one away without at least trying to fix the issue first. New ESP32 didn’t fix it, so power was the next thing to try. Have just 3D-printed the case shared on Discord so looking forward to tidying it all up at last 😁
|
Also realise with hindsight that the issues I was having with my Dallas sensor may well have been an early warning of power issues. At random intervals, the sensor would return large negative values. These, too, have stopped. Hurrah. |
@glitter-ball maybe it's a good idea to add to the documentation under Troubleshooting. Strange things to watch out for that could point to some hardware failure. If you want to write something up I'll add it to the wiki |
Yep, it would be good to share the learning. Here's the bottom of the Troubleshooting page with my additions in code-style... Many Rx errors or incomplete telegrams It is quite usual to see a few warnings in the log about incomplete telegrams. This could be due to interference on the line. The warnings are usually harmless as EMS-ESP will either wait for the next broadcast or keep trying to fetch the telegram. If you're seeing an Rx or Tx quality less than 80% then try:
. |
added. Really nice, much appreciated! |
This is closed, but I leave an important comment here. I was still seeing restarts every few days and RX Fails. I pretty much gave up trying to fix it, believing the issue to be some sort of power supply problem. Since upgrading to 3.4.1 the system has run for 20+ days with no restarts and the RX incompletes have dropped significantly (269 in over 1.7 million). Not sure how this is particularly helpful other than to identify that upgrading has massively improved reliability without touching the hardware, for which I am very grateful! |
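For a sense of scale, the figures above can be turned into a percentage. This is a back-of-envelope sketch only: it assumes quality is simply the share of telegrams that were not counted as failed, which may not be exactly how EMS-ESP derives the number shown in its UI, and `rx_quality` is a hypothetical helper.

```python
def rx_quality(fails: int, total: int) -> float:
    """Percentage of telegrams received without being counted as a fail.

    Assumption: quality = 100 * (1 - fails / total). Returns 100.0 when
    nothing has been received yet, to avoid dividing by zero.
    """
    if total == 0:
        return 100.0
    return 100.0 * (1 - fails / total)


# 269 incompletes in roughly 1.7 million telegrams, as reported above.
print(round(rx_quality(269, 1_700_000), 2))  # 99.98
```

On that assumption, 269 in 1.7 million is a failure rate of under 0.02%, a long way from the 72% quality seen earlier in the thread.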
@glitter-ball any idea what is causing the restarts? Are you doing multiple connections for example opening multiple sessions in a web browser or have telnet open? Rx Fails is normal, as long as it shows as 100% then all is ok. |
Nope. I never got to the bottom of it other than it was linked to the buck regulator. I changed this and things improved significantly but I still got some restarts. I'm now up to 24 days uninterrupted on 3.4.1. It could be something in the code that's changed or maybe the device is now drawing less power or has a different power profile so the power issues have receded. Really annoying, I know, but I don't know what else to suggest. |
Bug description
Unit is showing high RX Fail count and frequent restarts - raised at request of @proddy after discussion on Discord.
Steps to reproduce
Run the EMS-ESP32 and look at data in HA, telnet or web interface.
Expected behavior
Few RX Fails and stable operation.
Screenshots
Device information
Additional context
Device is connected to boiler service jack. Not clear yet whether this is hardware or software problem (or both!!)