Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High RX Fail count and frequent restarts #156

Closed
glitter-ball opened this issue Oct 11, 2021 · 42 comments
Closed

High RX Fail count and frequent restarts #156

glitter-ball opened this issue Oct 11, 2021 · 42 comments
Labels
help wanted Extra attention is needed

Comments

@glitter-ball
Copy link

Bug description
Unit is showing high RX Fail count and frequent restarts - raised at request of @proddy after discussion on Discord.

Steps to reproduce
Run the EMS-ESP32 and look at data in HA, telnet or web interface.

Expected behavior
Few RX Fails and stable operation.

Screenshots
image
image
image

Device information

   
System  
version "3.2.2b10"
uptime "000+04:51:32.053"
uptime_sec 17492
freemem 129
Status  
bus "connected"
bus protocol "HT3"
telegrams received 17581
read requests sent 193
write requests sent 0
incomplete telegrams 3569
tx fails 0
rx line quality 80
tx line quality 100
MQTT publishes 22530
MQTT publish fails 0
dallas sensors 1
dallas reads 3496
dallas fails 1
Devices  
0  
type "Boiler"
name "Worcester Logamax Plus/GB192/Condens GC9000 (DeviceID:0x08 ProductID:208, Version:01.01)"
handlers "0x10 0x11 0xC2 0x14 0x15 0x1C 0x18 0x19 0x1A 0x35 0x16 0x33 0x34 0x26 0x2A 0xD1 0xE3 0xE4 0xE5 0xE6 0xE9 0xEA"
1  
type "Thermostat"
name "RC300/RC310/Moduline 3000/1010H/CW400/Sense II (DeviceID:0x10, ProductID:158, Version:33.03)"
handlers "0xA3 0x06 0xA2 0x12 0x2A5 0x2B9 0x2AF 0x29B 0x471 0x2A6 0x2BA 0x2B0 0x29C 0x472 0x2A7 0x2BB 0x2B1 0x29D 0x473 0x2A8 0x2BC 0x2B2 0x29E 0x474 0x2F5 0x31B 0x31D 0x31E 0x23A 0x267 0x240"
2  
type "Controller"
name "ErP (DeviceID:0x09, ProductID:209, Version:01.03)"

Additional context
Device is connected to boiler service jack. Not clear yet whether this is hardware or software problem (or both!!)

@glitter-ball glitter-ball added the bug Something isn't working label Oct 11, 2021
@glitter-ball
Copy link
Author

Too much weirdness here...

Checked my HA core log and found the dreaded 2021-10-11 21:44:28 WARNING (MainThread) [homeassistant.helpers.template] Template variable warning: 'dict object' has no attribute 'servicecode' when rendering '{{value_json.servicecode}}'

So, I cleared out the homeassistant tree in MQTT Explorer, restarted EMS-ESP and restarted HA. Now, after 8 minutes uptime, I have no RX Fails and no incomplete telegram warnings in the web interface log set to INFO.

Is it possible that the HA core log warning above is somehow linked to the RX Fails and restarts? I'll keep monitoring it a bit longer... For some annoying reason, HA has stopped treating RX Fails as a graphable entity and now treats it as text, so clickign the value just produces a horrid rainbow blob thing rather than a graph.

@glitter-ball
Copy link
Author

glitter-ball commented Oct 11, 2021

Spoke too soon.

Restarted EMS-ESP32 from the telnet interface. When it came back up, it started logging RX errors - 25 already in just 2 minutes.

HA core log now logging things like...

2021-10-11 22:14:18 WARNING (MainThread) [homeassistant.helpers.template] Template variable warning: 'dict object' has no attribute 'designtemp' when rendering '{{value_json.hc1.designtemp}}'
2021-10-11 22:14:18 WARNING (MainThread) [homeassistant.helpers.template] Template variable warning: 'dict object' has no attribute 'offsettemp' when rendering '{{value_json.hc1.offsettemp}}'
2021-10-11 22:14:18 WARNING (MainThread) [homeassistant.helpers.template] Template variable warning: 'dict object' has no attribute 'minflowtemp' when rendering '{{value_json.hc1.minflowtemp}}'
2021-10-11 22:14:18 WARNING (MainThread) [homeassistant.helpers.template] Template variable warning: 'dict object' has no attribute 'roominfluence' when rendering '{{value_json.hc1.roominfluence}}'

So there is some sort of link between HA logging errors like this and the RX Fails and restarts.

@MichaelDvP
Copy link
Contributor

The HA logging errors are a direct consequence of the reboots and nothing to worry about. After reboot the emsesp has to collect data from bus, which takes some time, and, depending on the publish frequency, publishes some mqtt messages with missing values.

Reboots and rx-errors can happen on power issues. Please check:

  • connectors proper snapped in on both sides, some connectors have to big rubber parts conflicting with emsesp-housing.
  • cable not damaged
  • cable core thick enough for power delivery (not a thin audio cable).
  • jumper or dc/dc-converter set to jackpower (buspower over servicejack is possible but not sufficient for esp32)
  • Do you have reboots if powered by usb (jumper/dcdc removed)?

@glitter-ball
Copy link
Author

glitter-ball commented Oct 12, 2021

The hardware has not changed for 6 months. The last hardware change was installing the ESP32. I've re-seated the service jack cable (supplied with the gateway) at both ends on Sunday, with no obvious improvement.

So, assuming it's a hardware problem, possibilities include:

  • Developing fault with the PCB in the boiler (let's hope not...)
  • Service jack cable fault (unlikely - same cable, not moved for 6+ months)
  • Developing fault with DC-DC converter in gateway (possible - need to try USB power to see what effect this has)
  • Other fault on EMS bus side of the gateway, affecting power and data transfer to the ESP32
  • Developing fault with ESP32 which is causing comms and restart issues
  • External EM/RF interference, causing crash and errors - unlikely

Trying USB power would pick up a few of these, so that's the obvious next step, probably later this week.

Why is HA now rendering RX Fails like this...
image

... instead of a line graph, as it was 2 days ago?? Graph makes it much easier to monitor problems.

@MichaelDvP
Copy link
Contributor

The hardware has not changed for 6 months.

And when do the rx-fails/reboots start? Does it depend on software version?
Maybe it's something in the software like #78. If there are others seeing these reboots, we need to check reboot reason with serial logging. But if it is only your system, we have to search in your installation.

@glitter-ball
Copy link
Author

I can't be sure when the problem started - I only noticed it recently.

Kees suggested software downgrade too. Happy to do this. What version should I try?

Any ideas on how to get HA to plot RX Fails as a line graph again? It's almost as though it's seeing the count as text rather than numeric.

@MichaelDvP
Copy link
Contributor

What version should I try?

6 month ago it was v3.0.1, start with this. Versions v3.1.0 and v3.1.1 have reboots on tx-errors #78, i think better skip these.
All version found here .

With HA i can't help,

@proddy
Copy link
Contributor

proddy commented Oct 12, 2021

With HA i can't help,

I need to add back "state_class": "total_increasing" for HA entities that are numerical, some how

proddy added a commit that referenced this issue Oct 12, 2021
@NFis
Copy link

NFis commented Oct 13, 2021

Why is HA now rendering RX Fails like this... image

... instead of a line graph, as it was 2 days ago?? Graph makes it much easier to monitor problems.

My workaround to display a line graph in a history graph card is to create a template sensor, for example with sensor.dallas_faults:

- platform: template
  sensors:
    dallas_number_of_faults:
      value_template: >-
              {{ '%d' % states("sensor.dallas_faults") | int }}
      unit_of_measurement: 'n'

Inserting the entity "sensor.dallas_number_of_faults" in a history graph card shows a line graph (Home Assistant 2021.10.3).

@proddy
Copy link
Contributor

proddy commented Oct 13, 2021 via email

@glitter-ball
Copy link
Author

Hmm... struggling to get into the web GUI to downgrade the firmware. When I connect, I get this forever...
image

If I refresh the page, the login prompt appears very briefly with a red "Error verifying authorization: Network error when attempting to fetch resource". Restarting via telnet hasn't fixed it, nor has hardware reset.

I'll keep at it and hope I get in soon, but this doesn't bode well!

@glitter-ball
Copy link
Author

Clearing browser cache seems to fix the login issue. Downgraded to 3.0.1. Let's see what happens with restarts and RX Fails...

@proddy
Copy link
Contributor

proddy commented Oct 13, 2021

fingers-crossed. I'd be surprised (and sad) if it's indeed software related.

@glitter-ball
Copy link
Author

Me too. I think you'd have loads of people with issues, so I suspect it's my hardware *8(

@glitter-ball
Copy link
Author

glitter-ball commented Oct 13, 2021

There's general weird stuff going on with it. It won't restart properly from the web interface. Firmware doesn't load properly - progress bar whizzes to 100% then nothing happens. Getting to the point where I will wipe the flash and start again.

Loading 3.0.1 killed the Dallas sensor too. Weird.

@glitter-ball
Copy link
Author

Reflashed. Had all sorts of hassle trying to set up fresh install through web interface / access point with both Edge and Firefox on laptop. Login page stuck for ever. WiFi network scan stuck for ever. Used Safari on an iPhone and it all worked straight off. Very weird. Tried to restart device on laptop - Restart button went grey, nothing further. Tried on Safari on iPhone, popped up restart banner and restarted immediately. Don't understand this at all. This problem has been going on for months and firmware updates have never worked properly. So there is something odd there. I think I had better luck with Chrome.

I'll see how things go from now.

@proddy proddy added help wanted Extra attention is needed and removed bug Something isn't working labels Oct 14, 2021
@glitter-ball
Copy link
Author

OK... gateway has been up nearly 48 hours after a serious clear-out. Looking at the fail counts since then (yellow line)...
image
... it's better than it was but not great still - 3 crashes in 48 hours.

So I'll purse some power/connection tests next and see what effect this has.

One other question... Changing the TX Mode seemed to be very inconclusive. I settled on EMS+, which should be the right setting for my Sense II (CW400 clone). If the setting's wrong, will it affect the RX Fail count?

In the 13h13m since the last crash, stats are showing:
image
So the 2,300-odd failures in 48,300 are just within the 5% tolerance threshold of the stats display to get 100%. I guess some fails are OK - the crashes, probably aren't.

@glitter-ball
Copy link
Author

Looking like this is a hardware issue, as we'd all hoped and expected.

I contacted Kees and he suggested checking the output of the plug-in DC-DC converter, which should be 4.5-5 volts. Mine is sitting at 5.37V, so a bit on the high side but not quite at the danger level, 5.7V. It can be adjusted but I'd rather not tinker with it at this point.

I pulled the converter out to use USB power and the issues went away. Put the converter back and they haven't recurred - so I've now got over 15 hours of uptime with 10 RX Fails. Given that the only thing I've disturbed was the converter, I have to conclude that there was a problem with the connection between it and the main board and re-plugging it has cleared this.

I'll monitor for another 24 hours and then, hopefully, we can close this. I raised #159 in case this could help troubleshoot problems like this.

@glitter-ball
Copy link
Author

Hmm...
image
... re-plugged the converter at the yellow line. Looks good until about 02:49 this morning when they started ramping up again.

The flat section contains 13 RX Fails and another restart, but it's hidden by the scaling. More thinking needed...

@proddy
Copy link
Contributor

proddy commented Oct 17, 2021

are you sure it's restarting? Like are you seeing a similar saw type figure in HA in "System Uptime (sec)". If so then the error could be in the code when processing certain telegrams (out of bounds error for example). In that case you should try an earlier build as you did before. And also use something like SysLog to see the last telegram that was Rx or Tx'd that could have caused the error.

@glitter-ball
Copy link
Author

glitter-ball commented Oct 17, 2021

Yup. Uptime now points to a restart at 02:47 this morning. RX Fail count returned to zero at 02:46:59...

I'll have to think about setting up a syslog server somewhere, maybe. Hmm... it could be busy.
log.txt

I hadn't considered the possibility of corrupt telegrams crashing the gateway. Is that possible, given the CRC checking? I supposed a prolonged burp of corrupt data could break things.

That potentially points to a whole new raft of hardware issues - EMS interface circuit, duff UART on ESP32...

@glitter-ball
Copy link
Author

A few days back, I set the lower CPU frequency option. This didn't fix the RX Fails, but the system did run for 3d 15h with no restarts, which is interesting. It would be great to see some change in, for instance, CPU temp, but nothing...
image
For the sake of 7 quid, I've got hold of a new ESP32 which I'll solder up and install.

@MichaelDvP
Copy link
Contributor

To save power you can go to network settings, use lower bandwidth and reduce tx power.
A time ago i've measured current on 8266 and esp32 and medium consuption (~22 mA) is nearly the same for esp32 160MHz, low bandwidth and tx-power 10dBm. But while sending the peak current from esp32 stays a bit higher (8266 ~90mA, esp32 ~ 130 mA).

@proddy
Copy link
Contributor

proddy commented Nov 5, 2021

@glitter-ball are you still experiencing regular restarts (even without touching the console/webUI)?

@glitter-ball
Copy link
Author

I am and my priority is to get a new ESP32 into the gateway ASAP. But the new MH-ET ESP32 Live I have is a different version - MiniKIT v2.0 and it won't flash as it's supplied. It needs to be put into flash mode somehow. Investigations continue...

Wary of muddying the water further, but I'm seeing something else. After restarts, I'm consistently seeing HA complain about the same couple of points having no definition : rettemp and wwstorage1, I think. Looking in MQTT Explorer, rettemp appears under ems-esp but there's no matching definition under homeassistant. If I clear all the data from MQTT Explorer, it all goes away. But, the mystery is.. why is EMS-ESP sending rettemp briefly, with no definition? I'd either expect both MQTT entries to be there or neither - not just one! I'll try and capture precise details next time.

@glitter-ball
Copy link
Author

Here's the wwstoragetemp1 example...
image
So this appears in the ems-esp tree but not homeassistant. When I delete it from MQTT Explorer, it doesn't come back.

It looks as though this problem occurred after the last crash/restart and I think this happens regularly. What determines whether the wwstoragetemp1 point is rendered for MQTT or not? Is it from an older firmware version? Wonder if my config or flash file system is corrupted somehow?

@proddy
Copy link
Contributor

proddy commented Nov 6, 2021

I can't help you with the flashing if you're rolling your own. It depends on the USBTTL chip and USB speed. You could try lowering the baud or switching the flash mode from dio to dout.

On the wwstoragetemp1 remark: the config topic you are seeing is the definition for Home Assistant. If you look at the contents its says "fetch the value called 'wwstroragetemp1' from the MQTT topic called 'boiler_data_ww'", but here it seems that it's missing in the boiler_data MQTT topic. This is most likely because there is no value (you can check using the WebUI). Thing is, the HA interface is a bit dumb as it just adds all the sensors regardless of whether they actually exist. This has never really been an issue, but perhaps a nice enhancement for later to make it dynamic.

@glitter-ball
Copy link
Author

Thanks for coming back.

Re the MQTT stuff : OK... so this is a universal issue that others will be seeing too? Oddly I have a wwstoragetemp2 in both MQTT trees. In the Web UI I have...
image
Is there a way for EMS-ESP only to send definition data only for points that actually exist? Naively I assumed that already happened, otherwise I'd presumably have scores of these errors? That's why I wonder whether something is broken specifically for 'rettempt' and 'wwstoragetemp1' as these are the only two I routinely get?

Re : the flashing... soldering iron and capacitor coming out next.

@proddy
Copy link
Contributor

proddy commented Nov 6, 2021

I checked the code I remember I've already implemented the dynamic loading, so it will only create the HA entity if it exists. So all good there.

wwstoragetemp1 = "storage intern temperature" and wwstoragetemp2 = "storage extern temperature" so yes it should also be in the MQTT topic boiler_data_ww. Can you check this? It may be a bug.

@glitter-ball
Copy link
Author

wwstoragetemp1 is not in the MQTT topic, only wwstoragetemp2. 'rettemp' is another one that appears now and then - that's not there either currently. rettemp should not appear at all on my system because the boiler does not have a manufacturer's return temp sensor - I added a Dallas to do that independently.

So for both points, it looks as though MQTT payload is including them when there's no matching point in the UI for the data to originate from.

It feels like a bug in there somewhere... just trying to put my finger on exactly where! Unless it's linked to the ESP32 hardware issue somehow.

Sample of errors...

2021-11-06 14:09:41 WARNING (MainThread) [homeassistant.helpers.template] Template variable warning: 'dict object' has no attribute 'wwstoragetemp1' when rendering '{{value_json.wwstoragetemp1}}'
2021-11-06 14:09:48 WARNING (MainThread) [homeassistant.helpers.template] Template variable warning: 'dict object' has no attribute 'wwstoragetemp1' when rendering '{{value_json.wwstoragetemp1}}'

'rettemp' didn't appear that time.

@proddy
Copy link
Contributor

proddy commented Nov 6, 2021

The Home Assistant config topics are sticky, meaning they use the Retain flag to stay forever even after a restart. So when EMS-ESP boots up and it hasn't found the wwstoragetemp1 (storage intern temperature) to publish to the MQTT topic boiler_data_ww it will throw that warning. It may happen a few times, it's pretty harmless. If you're seeing like >300 warning messages then there is indeed something fishy afoot. But if all works and you're seeing the data then I wouldn't worry too much and have another glass of wine

@glitter-ball
Copy link
Author

Mrs G-B picking up a bottle right now *8)

I've put the new ESP32 in the gateway and it's still clocking up a lot of RX Fails - 72% quality. So, it's something else in the gateway or my bus connection. Personally, I suspect the DC-DC converter. Powering it with USB stops all the RX Fails on the spot.

@glitter-ball
Copy link
Author

@proddy @MichaelDvP - I closed this 4 days ago but here’s the final word!

It was the DC-DC converter causing the RX errors and restarts. Got another off eBay and, armed with a 220uF output capacitor to ride through ESP32 startup demand, it works fine - no errors, no restarts. Rest of gateway and boiler OK.

So if you get similar reports in future that’s the place to start.

Thanx

G-B

@proddy
Copy link
Contributor

proddy commented Nov 10, 2021

ok, good to know. We spent a lot of time chasing this one. I didn't realize until later that you had designed your own circuit! 🤦🏻‍♂️

@glitter-ball
Copy link
Author

glitter-ball commented Nov 10, 2021 via email

@glitter-ball
Copy link
Author

Also realise with hindsight that the issues I was having with my Dallas sensor may well have been an early warning of power issues. At random intervals, the sensor would return large negative values. These, too, have stopped. Hurrah.

@proddy
Copy link
Contributor

proddy commented Nov 11, 2021

@glitter-ball maybe it's a good idea to add to the documentation under Troubleshooting. Strange things to watch out for that could point to some hardware failure. If you want to write something up I'll add it to the wiki

@glitter-ball
Copy link
Author

Yep, it would be good to share the learning. Here's the bottom of the Troubleshooting page with my additions in code-style...

Many Rx errors or incomplete telegrams

It is quite usual to see a few warnings in the log about incomplete telegrams. This could be due to interference on the line. The warnings are usually harmless as EMS-ESP will either wait for the next broadcast or keep trying to fetch the telegram. If you're seeing an Rx or Tx quality less than 80% then try:

  • powering: try to power ems-esp by usb or service-jack. We've seen examples where a noisy or failing DC supply can cause RX Fail or incomplete telegrams and trying USB power (check how to switch to USB powering in the Wiki) can help track this down.

  • disruptions on the bus (emc, reflections, etc): try to connect ems-esp to another device on the bus. In general a previously unconnected bus-out on a devices like MM100 is better than a split connection on an already used connector.

Unusual Dallas sensor readings

If you're seeing unusual Dallas sensor readings (crazy negative temperatures, sudden spikes in readings etc.) check...

- Wiring to the JST connector on the gateway.
- That you're in the right powering mode for your application : parasitic or non-parasitic
- Whether powering from USB fixes the issue (the Wiki explains how to switch to/from USB power for your model of gateway). If it does, this might suggest a problem with power to the gateway from the BUS or service jack connectors.
Frequent restarts

A healthy gateway should run happily for long periods without spontaneous restarts so, if yours is restarting by itself at random intervals, then something's not right. Things to check...

> Power down the gateway and check wiring connections are secure. Check that the ESP32, DC-DC converter and any jumpers on the gateway securely seated onto their connectors.
> Try powering the gateway from the ESP32's USB socket (check the Wiki for how to do this on your particular gateway model). If the restarts stop, then you've got a problem with the external power source (BUS or service jack) or the DC-DC converter inside the gateway.
> Firmware and settings are loaded OK. Re-flashing the firmware and resetting the config might help. Make a note of the settings first if you can

.

proddy added a commit to emsesp/docs that referenced this issue Nov 11, 2021
@proddy
Copy link
Contributor

proddy commented Nov 11, 2021

added. Really nice, much appreciated!

@glitter-ball
Copy link
Author

This is closed, but I leave an important comment here.

I was still seeing restarts every few days and RX Fails. I pretty much gave up trying to fix it, believing the issue to be some sort of power supply problem.

Since upgrading to 3.4.1 the system has run for 20+ days with no restarts and the RX incompletes have dropped significantly (269 in over 1.7 million). Not sure how this is particularly helpful other than to identify that upgrading has massively improved reliability without touching the hardware, for which I am very grateful!

@proddy
Copy link
Contributor

proddy commented Jun 21, 2022

@glitter-ball any idea what is causing the restarts? Are you doing multiple connections for example opening multiple sessions in a web browser or have telnet open?

Rx Fails is normal, as long as it shows as 100% then all is ok.

@glitter-ball
Copy link
Author

Nope. I never got to the bottom of it other than it was linked to the buck regulator. I changed this and things improved significantly but I still got some restarts. I'm now up to 24 days uninterrupted on 3.4.1. It could be something in the code that's changed or maybe the device is now drawing less power or has a different power profile so the power issues have receded. Really annoying, I know, but I don't know what else to suggest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

4 participants