
Z Wave Network Healing

Chris Jackson edited this page Apr 14, 2014 · 7 revisions

The Z-Wave binding supports features that attempt to heal the network if nodes become unresponsive. Note, however, that the binding can't work miracles: if you are constantly seeing dead nodes, this is likely an issue with network topology, and you should try to rectify it by adding nodes to fill in gaps in coverage.

The STATUS information displayed for a node in HABmin provides some information on how the node is performing. The Dead line will show if the node is currently dead, but it also shows how many times the node has been dead (but subsequently recovered) since the binding started. The Packet Statistics line shows how many packets have been retried out of the total number that have been sent (retries / total sent). If there is a high number of retries, communications with the device may be considered unreliable and you should try to find ways of reducing this since it is likely that the device will occasionally be marked as DEAD.
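As a rough illustration of how to read the Packet Statistics line, the sketch below computes the retry ratio (retries / total sent). The function names and the 10% threshold are assumptions made for this example only; the binding does not define a threshold for "unreliable".

```python
def retry_ratio(retries: int, total_sent: int) -> float:
    """Return the fraction of sent packets that had to be retried."""
    if total_sent <= 0:
        return 0.0  # nothing sent yet, so there is no evidence of a problem
    return retries / total_sent


def looks_unreliable(retries: int, total_sent: int, threshold: float = 0.1) -> bool:
    # The threshold is an illustrative assumption, not a value used by the binding.
    return retry_ratio(retries, total_sent) > threshold
```

For example, a node showing 30 retries out of 100 packets sent would be flagged by this sketch, while 1 retry out of 100 would not.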

The network heal and dead node handling within the binding should, however, allow for a reasonable response to occasional network problems. This page describes what happens.

Dead node handling

The network monitor within the binding receives notifications when a device is classified as DEAD. A DEAD node is one that does not respond to a request after 3 tries. Clearly, for a radio-based system where connectivity can be affected by electrical noise etc. in the house, dropping a node out of the network when it doesn't respond to a request is a harsh punishment! However, we also need to be careful that we're not slowing down the whole network just because one node has poor connectivity, so we need to keep an eye on how all our devices are working in order to keep the network running as well as it possibly can.

The binding therefore attempts to deal with DEAD nodes in a reasonable way. It attempts to continue to communicate with them while also limiting the impact on the rest of the network. If we were to continue to send 3 retries to a DEAD node, this would block the network for 15 seconds, which could have an impact on other activities.

Therefore, the following is implemented:

  • ALIVE nodes are allowed 3 retries
  • A periodic PING is used at a slow rate to ensure a node is still ALIVE. This allows communication problems to be detected, and then rectified, before the node is actually needed. It also allows us to determine the health of communications with the node, so we can take preventative measures. The PING is sent approximately every 90 seconds to the node that has gone longest without an update, so a node that is routinely polled will not be pinged further by the network monitor. Note also that battery devices are not pinged, and pinging is disabled during a heal. Additionally, the PING will only be sent if there are no frames queued, so that it doesn't affect the normal running of the network.
  • If a node fails to respond after 3 retries it gets marked DEAD.
  • A notification will be sent to start a network heal on the DEAD node.
  • Any further commands etc sent to the DEAD node will still be sent, but no retries will be supported to avoid locking up the network.
  • As soon as a node responds, it gets marked ALIVE, and has full retry privileges restored.
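The rules above can be sketched as a small state model. This is not the binding's code: the class, field and function names are hypothetical, and two details are assumptions made for the sketch, namely that a heal is requested only when a node goes from ALIVE to DEAD, and that the ping target is the non-battery node with the oldest update time.

```python
from dataclasses import dataclass

ALIVE_RETRIES = 3  # from the text: ALIVE nodes are allowed 3 retries
DEAD_RETRIES = 0   # from the text: DEAD nodes get no retries


@dataclass
class Node:
    node_id: int
    battery: bool = False
    dead: bool = False
    dead_count: int = 0       # how many times the node has been marked DEAD
    last_update: float = 0.0  # timestamp of the last update from the node

    def retries_allowed(self) -> int:
        return DEAD_RETRIES if self.dead else ALIVE_RETRIES

    def on_no_response(self) -> bool:
        """Call when all allowed tries have failed.

        Returns True if a heal should be requested for this node
        (assumed here to happen only on the ALIVE -> DEAD transition).
        """
        if self.dead:
            return False
        self.dead = True
        self.dead_count += 1
        return True

    def on_response(self, now: float) -> None:
        # Any response marks the node ALIVE and restores full retries.
        self.dead = False
        self.last_update = now


def pick_ping_target(nodes, queue_empty: bool, healing: bool):
    """Pick the node to PING, or None if no PING should be sent."""
    if healing or not queue_empty:
        return None  # never compete with a heal or with queued frames
    candidates = [n for n in nodes if not n.battery]
    if not candidates:
        return None
    # The node that has gone longest without an update is pinged first.
    return min(candidates, key=lambda n: n.last_update)
```

In this model a periodic timer (roughly every 90 seconds, per the text) would call `pick_ping_target` and send a PING only if it returns a node.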

Nightly network heal

A nightly heal will perform a complete network heal at a specified time. It is recommended that this is done when you're not expecting a lot of network activity, since it can cause the network to slow down while the controller is updating neighbours and routes. To reduce the impact on normal operation, the network heal will not start healing a node while there is other traffic on the Z-Wave network. The nightly heal is configured by setting the healtime configuration parameter to the time (in hours) at which you want the heal to run.
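To make the timing concrete, here is a minimal sketch of how the next heal time could be derived from healtime. It assumes healtime is an hour of the day between 0 and 23 and that the heal starts on the hour; the function name is hypothetical and this is not the binding's scheduling code.

```python
from datetime import datetime, timedelta


def next_heal(now: datetime, healtime: int) -> datetime:
    """Return the next date and time at which the nightly heal would start."""
    if not 0 <= healtime <= 23:
        raise ValueError("healtime must be an hour between 0 and 23")
    candidate = now.replace(hour=healtime, minute=0, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so the heal runs tomorrow.
        candidate += timedelta(days=1)
    return candidate
```

With healtime set to 2, a call made at 22:30 would give 02:00 on the following day.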

The heal performs the following steps.

  • Ping the node to see if it's awake
  • Update all the neighbours so that all nodes know who is around them
  • Update the associations so that we know which nodes need to talk to others
  • Update the routes between devices that have associations set
  • Retrieve the neighbour list so that the binding knows who's out there
  • Ping the node to see if it's awake
  • Save the device files

If a step fails or times out, it is retried. If the step has still not completed successfully after 5 attempts, the heal fails for that node and the system moves on to the next node.
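The step sequence and retry behaviour described above can be sketched as follows. The step names and the `run_step` callback are hypothetical placeholders for whatever actually talks to the controller; only the order of the steps and the limit of 5 attempts come from the text.

```python
from typing import Callable, Dict, Iterable, Optional

MAX_ATTEMPTS = 5  # from the text: a step is attempted up to 5 times

# Hypothetical names for the steps listed above, in order.
HEAL_STEPS = [
    "PING",
    "UPDATE_NEIGHBOURS",
    "UPDATE_ASSOCIATIONS",
    "UPDATE_ROUTES",
    "GET_NEIGHBOURS",
    "PING_END",
    "SAVE",
]


def heal_node(node_id: int, run_step: Callable[[int, str], bool]) -> Optional[str]:
    """Heal one node. Returns None on success, or the name of the failed step."""
    for step in HEAL_STEPS:
        for _attempt in range(MAX_ATTEMPTS):
            if run_step(node_id, step):
                break  # step succeeded, move on to the next step
        else:
            return step  # all attempts failed, so the heal fails for this node
    return None


def heal_network(node_ids: Iterable[int],
                 run_step: Callable[[int, str], bool]) -> Dict[int, Optional[str]]:
    """Heal each node in turn; a failure on one node does not stop the others."""
    return {node_id: heal_node(node_id, run_step) for node_id in node_ids}
```

Here `run_step` is expected to return False both when a step fails and when it times out, so the two cases are retried in the same way.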

It has been observed that the update neighbours step can fail, especially when updating the controller. This is also seen with OpenZWave (OZW). As above, the step will be retried, and most of the time this fixes the issue. I have, however, seen the update neighbours step fail continuously if there has been a major reconfiguration of the network.

For mains nodes, the heal function simply runs each node in turn. For battery nodes, however, the heal function waits until it receives a wakeup from the device and then attempts to continue the heal. Battery nodes in Z-Wave are difficult to manage, and the heal is not always successful. However, this should not be a big issue since battery nodes do not participate in routing.
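One way to picture the difference between mains and battery nodes is the sketch below, in which mains nodes are healed from a queue while battery nodes are parked until a wakeup arrives. The class and method names are hypothetical and the binding's own implementation may be organised quite differently.

```python
class HealScheduler:
    """Tracks which nodes can be healed now and which must wait for a wakeup."""

    def __init__(self, mains_ids, battery_ids):
        self.mains_queue = list(mains_ids)          # healed one after another
        self.waiting_for_wakeup = set(battery_ids)  # healed only once awake

    def next_mains_node(self):
        """Return the next mains node to heal, or None when all are done."""
        return self.mains_queue.pop(0) if self.mains_queue else None

    def on_wakeup(self, node_id) -> bool:
        """Return True if the heal for this battery node should continue now."""
        if node_id in self.waiting_for_wakeup:
            self.waiting_for_wakeup.remove(node_id)
            return True
        return False
```

A battery node that never wakes up during the heal simply stays in `waiting_for_wakeup`, which mirrors the point that a heal of a battery node is not always successful.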

Z-Wave Heal Status

It is also possible to perform a manual heal by pressing the Heal button in HABmin. This will do exactly the same as the nightly heal does.

The above image shows the Z-Wave node status, which includes a line showing the heal status.