Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NTP sometimes stops working after power failure #1773

Closed
Frogmore42 opened this issue Feb 3, 2018 · 9 comments
Closed

NTP sometimes stops working after power failure #1773

Frogmore42 opened this issue Feb 3, 2018 · 9 comments
Labels
stale Action - Issue left behind - Used by the BOT to call for attention

Comments

@Frogmore42
Copy link
Contributor

So, I think I have figured out what is going on with the very rare NTP issues a few people have been having. I have seen the issue on some of my devices twice in the several months that I have been using Tasmota. Both times there was a power failure and Internet was not available for some period of time. The failure was persistent even with a reset 2 and a power cycle (and even a OTA FW updated to 5.11.1h) and I verified that the NTP server was correct (as was all other network configuration and Wi-Fi signal). I had a theory that it might be corruption in the RTC memory that was causing the problem, so I unplugged the POW for a few min to allow the RTC memory to completely lose whatever data it had. I then plugged it back in and the POW was back to being on time.

I took a look at code and it seemed to be fine and is not using the RTC memory in a bad way. I had another device that was in the bad state, so I tried unplugging it for awhile. It was still broken. I then took a closer look at the code and increased the logging level. When I restarted it I did not see any additional logging. I then noticed this in the code (code tags don't work right for me):

ntpsync = 0;
if (RtcTime.year < 2016) {
if (WL_CONNECTED == WiFi.status()) {
ntpsync = 1; // Initial NTP sync
}

This is in RtcSecond(). What it looks like is happening, is that all of my Sonoffs are trying to get time from the NTP servers every second at about the same time (when the network, or power comes back on line). This looks like DOS attack, so it is frequently filtered.

So, I unplugged the device for a long period of time (several mins) but made no other change. When I plugged it back in, I saw the following:

00:00:04 HTP: Web server active on dehumidCounter-4254.local with IP address 192.168.0.76
00:00:06 MQT: Attempting connection...
00:00:06 MQT: Connected
00:00:06 MQT: frogmore/tele/dehumidCounter/LWT = Online (retained)
00:00:06 MQT: frogmore/cmnd/dehumidCounter/POWER =
00:00:06 MQT: Subscribe to frogmore/cmnd/dehumidCounter/#
00:00:06 MQT: Subscribe to frogmore/cmnd/sonoffs/#
00:00:06 MQT: Subscribe to cmnd/DVES_AB109E/#
00:00:06 MQT: frogmore/tele/dehumidCounter/INFO1 = {"Module":"WeMos D1 mini","Version":"5.10.0b","FallbackTopic":"DVES_AB109E","GroupTopic":"sonoffs"}
00:00:06 MQT: frogmore/tele/dehumidCounter/INFO2 = {"WebServerMode":"Admin","Hostname":"dehumidCounter-4254","IPAddress":"192.168.0.76"}
00:00:06 MQT: frogmore/tele/dehumidCounter/INFO3 = {"RestartReason":"Power on"}
00:00:06 APP: (UTC) Sat Feb 03 09:11:23 2018
00:00:06 APP: (DST) Sun Mar 25 02:00:00 2018
00:00:06 APP: (STD) Sun Oct 28 03:00:00 2018
10:11:30 SHT: Sensor did not ACK command
10:11:30 I2C: HTU21 found at 0x40
10:11:30 I2C: BME280 found at 0x76

This time there was the logging message indicated it was successful in getting the time from SNTP. So, it looks like the implementation of SNTP in Tasmota is less than ideal.

From this: https://en.wikipedia.org/wiki/NTP_server_misuse_and_abuse

One particularly common software error is to generate query packets at short (less than five second) intervals until a response is received. When such an implementation finds itself behind a packet filter that refuses to pass the incoming response, this results in a never-ending stream of requests to the NTP server. Such grossly over-eager clients (particularly those polling once per second) commonly make up more than 50% of the traffic of public NTP servers, despite being a minuscule fraction of the total clients. While it is reasonable to send a few initial packets at short intervals, it is essential for the health of any connectionless network that unacknowledged packets be generated at exponentially decreasing rates. This applies to any connectionless protocol, and many portions of connection-based protocols. Examples can be found in the TCP specification for connection establishment, zero-window probing, and keepalive transmissions.

@Frogmore42
Copy link
Contributor Author

Verified on another one, that is running 5.8.0. Turned power off for many minutes, but no other change, and it got the correct time. It had been trying for over 17 hours and was still stuck in 1970.

@arendst
Copy link
Owner

arendst commented Feb 7, 2018

Will introduce ntp time slots based on chip id to spread ntp ddos in next pre-release

arendst added a commit that referenced this issue Feb 7, 2018
5.11.1j
 * Prep for release
 * Change uptime from hour to second
resulting in a display of 123T13:45:21 where 123 is days
 * Fix Arilux
RF induced exception by moving interrupt handler to iram on non
esp/arduino lib v2.3.0
 * Add NTP sync based on chip id (#1773)
@Frogmore42
Copy link
Contributor Author

I took a closer look at the code. I was mistaken about how frequently it checks the NTP servers. The Tasmota code does ask the SNTP code the time every second (if it doesn't get something in this millennium). But the SNTP code does not appear to query the NTP time severs too frequently.

If there is a site wide power failure/restore event, they do all query at the same time, which might be an issue in some cases, so your change will be good.

But, I have turned off power to a single device for a short period, which should cause the device to query the NTP servers again (but it seems it doesn't actually do that). I need to figure out a way to recreate the problem. I think I can do it by telling my router to block the MAC address of a test device from connecting to the Internet and the power cycling the device. That should put it into the bad state. With more logging on the SNTP code it should be more obvious what is going wrong. Since it only happens when there is an extended power outage (and that only happens a few times a year) and I now know how to fix the problem without hard resetting or reloading the FW, it has not been high on the priority list.

It almost seems like there is something in RTC memory that is causing the bad behavior, but I didn't see anything there that could cause the issue.

arendst added a commit that referenced this issue Feb 9, 2018
5.12.0 20180209
* Change library PubSubClient.h define MQTT_MAX_PACKET_SIZE from 512 to
1000 for Home Assistant  support
* Change relation of define MESSZ being dependent on PubSubClient.h
define MQTT_MAX_PACKET_SIZE
* Change command color parameter input checks to less strict for Home
Assistant support
* Change command Ina219Mode into command Sensor13
* Change commands HlwPCal, HlwUCal and HlwICal to PowerCal, VoltageCal
and CurrentCal to be used for both Pow and S31 calibration
* Change commands HlwPSet, HlwUSet and HlwISet to PowerSet, VoltageSet
and CurrentSet to be used for both Pow and S31 calibration
* Change uptime from hour to second resulting in a display of
123T13:45:21 where 123 is days
* Change module name Wemos D1 mini into Generic (#1220)
* Change HTML from width=100% to style=width:100% supporting HTML5
(#1358)
* Change OSWATCH_RESET_TIME (Blocked loop) from 30 to 120 seconds to
allow slow networks (#1556)
* Change WIFI_MANAGER_SEC into WIFI_CONFIG_SEC (#1616)
* Change function pointers code to save code space and memory (#1683)
* Change webserver argument processing gaining 5k code space (#1705)
* Change weblog memory usage (#1730, #1793, #1819)
* Update TasmotaSerial library to 1.1.0
* Update language files Italian (#1594), Dutch (#1723) and Spanish
(#1722)
* Fix Non-English JSON temperature unit attachement
* Fix Arilux RF induced exception by moving interrupt handler to iram on
non ESP8266/Arduino lib v2.3.0
* Fix truncated command names and wrong response for DomoticzSwitchIdx
(#1571)
* Fix %-sign issue as printf escape character in Humidity and Sonoff SC
(#1579)
* Fix DS18B20 temperature JSON decimal dot (#1561)
* Fix Energy JSON message (#1621)
* Fix IRSend parameter translation (#1636)
* Fix TSL2561 device detection (#1644, #1825)
* Fix BME680 teleperiod resistance measuring (#1647)
* Fix Energy Monitoring Energy Today and Energy Total reading after
restart (#1648)
* Fix IRReceive Data value (#1663)
* Fix Energy Monitoring Energy Period roll-over (#1688)
* Fix compiler warnings (#1774)
* Fix command PWM response if no PWM channel is configured (#1783)
* Add locale Decimal Separator to Web sensor page
* Add ColorTemperature to light status message
* Add command PowerOnState option 5 which inverts PulseTime and allows
for delayed always on after power on
* Add OtaMagic two step Web server OTA upgrade using filename-minimal
image if OTA free space is too small
* Add support for PMS5003 and PMS7003 particle concentration sensor
* Add command SetOption21 1 to allow Energy Monitoring when power is off
on Sonoff Pow and Sonoff S31 (#1420)
* Add Chinese language file (#1551)
* Add French language file (#1561)
* Add Spanish language file (#1589)
* Add HTTP Allow Cross Origin removed from ESP8266/Arduino lib v2.4.0
(#1572)
* Add Home Assistant MQTT Discovery for switch and light to be enabled
by command SetOption19 1 (#1534) or define
HOME_ASSISTANT_DISCOVERY_ENABLE in user_config.h (#1685)
* Add command State to retrieve device state information (same data as
teleperiod state and status 11 in slightly different JSON format)
* Add optional login to Webserver AP mode (#1587, #1635)
* Add command Sensor15 2 to start MHZ19(B) Zero Point Calibration
(#1643)
* Add support for Sonoff S31 Smart Socket with Power Consumption
Detection (#1626)
* Add command SetOption20 to allow update of Dimmer/Color/Ct without
turning power on (#1719, #1741)
* Add NTP sync time slot based on chip id (#1773)
* Add cursor pointer to web button (#1836)
@Frogmore42
Copy link
Contributor Author

So, the power went out yesterday morning. The generator came on a min later, but the network was down, so none of my devices could get the right time. About an hour later the network was restored. But, most of devices running Tasmota have the wrong time. I added a little debugging code and flashed a device with it. It shows the call to the SDK to get the time returning 0. So, I don't believe the issue is in the Tasmota code. The issue persists if even if I unplug the device for about 30s. However, if I unplug the device for many mins, then it gets the correct time in a few seconds. I am using a variety of versions, but it does occur on 5.12.1i even with 2.3.0 and 2.4.0.

I tried to recreate the issue by using parental controls to disable internet access for one device, but it got the time shortly after I reenabled it.

@Frogmore42
Copy link
Contributor Author

I added some debugging statements to the code, and I am now convinced that this is an issue with the underlying libraries and not a Tasmota issue (though it still impacts Tasmota). Now I have to figure out how to debug this issue in Arduino library to see if the problem is there or in a lower layer. I have a couple of TH10/16 that are at they barn yard that still have the wrong time from the power outage, so it is not self repairing. A new device I placed there got the correct time, so I don't think the NTP servers have blocked my IP. I see the NTP library returning to Tasmota that it does not know the current time.

I have found that when a device gets in this state, the way to fix it is to remove it from the network for an extended period of time. I am not sure if just removing it from the network (by turning off the router) or powering it down is what is needed.

@stale
Copy link

stale bot commented Jun 13, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Action - Issue left behind - Used by the BOT to call for attention label Jun 13, 2018
@stale
Copy link

stale bot commented Jun 28, 2018

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem.

@stale stale bot closed this as completed Jun 28, 2018
@kiat-huang
Copy link

kiat-huang commented Jul 15, 2018

It's still an issue for some of us. Recently I've flashed 6.1.0 to a Sonoff Basic and the correct time is not displayed in the monitor, though connected via WiFi to the internet via my router. Via the console over the Sonoff's Tasmota web page is there a command which can force the Sonoff to pull the correct time via NTP?

[Edit: installed an NTP server on the same subnet as the Sonoffs and they are now picking up the correct time]

@ascillato
Copy link
Contributor

ascillato commented Jul 16, 2018

Hi,

If you use static IP, you need to define also DNS and gateway so as Tasmota can access Internet for NTP Sync. I don't know if that is exactly your issue, but this is a common configuration problem.

It's still an issue for some of us.

If you think that you hit a bug, please open a bug issue with all the template completed
If you think you have a configuration problem open a troubleshooting issue with all the template completed.

Via the console over the Sonoff's Tasmota web page is there a command which can force the Sonoff to pull the correct time via NTP?

No, it is an automatic process. When Tasmota have internet (or reach the NTP Server) it will do a Time Sync.

curzon01 pushed a commit to curzon01/Tasmota that referenced this issue Sep 6, 2018
5.11.1j
 * Prep for release
 * Change uptime from hour to second
resulting in a display of 123T13:45:21 where 123 is days
 * Fix Arilux
RF induced exception by moving interrupt handler to iram on non
esp/arduino lib v2.3.0
 * Add NTP sync based on chip id (arendst#1773)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale Action - Issue left behind - Used by the BOT to call for attention
Projects
None yet
Development

No branches or pull requests

4 participants