-
-
Notifications
You must be signed in to change notification settings - Fork 438
Description
Describe the bug
We have a handful of devices whose interface graphs are gapping for one, 5-min polling cycle every 2 hours. What we have been able to determine is that the gapping starts for a device after it physically goes down then restores, so it seems to be tied to the uptime. Every two hours, all of the affected devices log that a Recache Assert Failed on all of their data queries then a NaN is written to the RRD files, even though debugging shows actual values are returned by the device.
This issue is affecting interface graphs 95% of the time, but gaps show up in other graphs (uptime, latency, etc.) every once in a great while when recache events happen back-to-back. There are roughly 15 devices doing this at the moment in our production instance and they all show Recache Assert Failed events in the remote poller logs at the same time. I have added two of these devices to our lab setup and there is no gapping or recache events every 2 hours until after a device goes physically down then restores.
This problem started after we upgraded to the one of the develop commits of Cacti 1.2.12 from 1.2.9. There are no errors logged anywhere that I have looked when the gapping occurs.
What seems to be happening is that Cacti is using an uptime assert value that is multiple times larger than the uptime reported in the last cycle when these recache events occur. After that, the assert value is correct until 2 hours later. Here are some of the logs showing the recache events in our lab:
2020/07/13 22:40:02 - SPINE: Poller[3] PID[22976] Device[8] HT[2] DQ[4] RECACHE ASSERT FAILED: '47934056<24174325'
2020/07/13 22:40:02 - SPINE: Poller[3] PID[22976] Device[8] HT[2] DQ[10] RECACHE ASSERT FAILED: '47934098<24174325'
2020/07/13 22:40:02 - SPINE: Poller[3] PID[22976] Device[8] HT[2] DQ[11] RECACHE ASSERT FAILED: '47934139<24174325'
2020/07/13 22:40:02 - SPINE: Poller[3] PID[22976] Device[8] HT[1] DQ[4] RECACHE ASSERT FAILED: '47934056<24174331'
2020/07/13 22:40:02 - SPINE: Poller[3] PID[22976] Device[8] HT[1] DQ[10] RECACHE ASSERT FAILED: '47934098<24174331'
2020/07/13 22:40:02 - SPINE: Poller[3] PID[22976] Device[8] HT[1] DQ[11] RECACHE ASSERT FAILED: '47934139<24174331'
2020/07/14 00:40:02 - SPINE: Poller[3] PID[26782] Device[8] HT[2] DQ[4] RECACHE ASSERT FAILED: '47934056<24894334'
2020/07/14 00:40:02 - SPINE: Poller[3] PID[26782] Device[8] HT[2] DQ[10] RECACHE ASSERT FAILED: '47934098<24894334'
2020/07/14 00:40:02 - SPINE: Poller[3] PID[26782] Device[8] HT[2] DQ[11] RECACHE ASSERT FAILED: '47934139<24894334'
2020/07/14 00:40:02 - SPINE: Poller[3] PID[26782] Device[8] HT[1] DQ[4] RECACHE ASSERT FAILED: '47934056<24894342'
2020/07/14 00:40:02 - SPINE: Poller[3] PID[26782] Device[8] HT[1] DQ[10] RECACHE ASSERT FAILED: '47934098<24894342'
2020/07/14 00:40:02 - SPINE: Poller[3] PID[26782] Device[8] HT[1] DQ[11] RECACHE ASSERT FAILED: '47934139<24894342'
2020/07/14 02:45:02 - SPINE: Poller[3] PID[30728] Device[8] HT[2] DQ[4] RECACHE ASSERT FAILED: '47934056<25644304'
2020/07/14 02:45:02 - SPINE: Poller[3] PID[30728] Device[8] HT[2] DQ[10] RECACHE ASSERT FAILED: '47934098<25644304'
2020/07/14 02:45:02 - SPINE: Poller[3] PID[30728] Device[8] HT[2] DQ[11] RECACHE ASSERT FAILED: '47934139<25644304'
2020/07/14 02:45:02 - SPINE: Poller[3] PID[30728] Device[8] HT[1] DQ[4] RECACHE ASSERT FAILED: '47934056<25644311'
2020/07/14 02:45:02 - SPINE: Poller[3] PID[30728] Device[8] HT[1] DQ[10] RECACHE ASSERT FAILED: '47934098<25644311'
2020/07/14 02:45:02 - SPINE: Poller[3] PID[30728] Device[8] HT[1] DQ[11] RECACHE ASSERT FAILED: '47934139<25644311'
2020/07/14 04:45:02 - SPINE: Poller[3] PID[34573] Device[8] HT[2] DQ[4] RECACHE ASSERT FAILED: '47934056<26364310'
2020/07/14 04:45:02 - SPINE: Poller[3] PID[34573] Device[8] HT[2] DQ[10] RECACHE ASSERT FAILED: '47934098<26364310'
2020/07/14 04:45:02 - SPINE: Poller[3] PID[34573] Device[8] HT[2] DQ[11] RECACHE ASSERT FAILED: '47934139<26364310'
2020/07/14 04:45:02 - SPINE: Poller[3] PID[34573] Device[8] HT[1] DQ[4] RECACHE ASSERT FAILED: '47934056<26364316'
2020/07/14 04:45:02 - SPINE: Poller[3] PID[34573] Device[8] HT[1] DQ[10] RECACHE ASSERT FAILED: '47934098<26364316'
2020/07/14 04:45:02 - SPINE: Poller[3] PID[34573] Device[8] HT[1] DQ[11] RECACHE ASSERT FAILED: '47934139<26364316'
2020/07/14 06:50:03 - SPINE: Poller[3] PID[38712] Device[8] HT[2] DQ[4] RECACHE ASSERT FAILED: '47934056<27114382'
2020/07/14 06:50:03 - SPINE: Poller[3] PID[38712] Device[8] HT[2] DQ[10] RECACHE ASSERT FAILED: '47934098<27114382'
2020/07/14 06:50:03 - SPINE: Poller[3] PID[38712] Device[8] HT[2] DQ[11] RECACHE ASSERT FAILED: '47934139<27114382'
2020/07/14 06:50:03 - SPINE: Poller[3] PID[38712] Device[8] HT[1] DQ[4] RECACHE ASSERT FAILED: '47934056<27114388'
2020/07/14 06:50:03 - SPINE: Poller[3] PID[38712] Device[8] HT[1] DQ[10] RECACHE ASSERT FAILED: '47934098<27114388'
2020/07/14 06:50:03 - SPINE: Poller[3] PID[38712] Device[8] HT[1] DQ[11] RECACHE ASSERT FAILED: '47934139<27114388'
2020/07/14 08:55:02 - SPINE: Poller[3] PID[42637] Device[8] HT[2] DQ[4] RECACHE ASSERT FAILED: '47934056<27864336'
2020/07/14 08:55:02 - SPINE: Poller[3] PID[42637] Device[8] HT[2] DQ[10] RECACHE ASSERT FAILED: '47934098<27864336'
2020/07/14 08:55:02 - SPINE: Poller[3] PID[42637] Device[8] HT[2] DQ[11] RECACHE ASSERT FAILED: '47934139<27864336'
2020/07/14 08:55:02 - SPINE: Poller[3] PID[42637] Device[8] HT[1] DQ[4] RECACHE ASSERT FAILED: '47934056<27864344'
2020/07/14 08:55:02 - SPINE: Poller[3] PID[42637] Device[8] HT[1] DQ[10] RECACHE ASSERT FAILED: '47934098<27864344'
2020/07/14 08:55:02 - SPINE: Poller[3] PID[42637] Device[8] HT[1] DQ[11] RECACHE ASSERT FAILED: '47934139<27864344'
2020/07/14 10:55:02 - SPINE: Poller[3] PID[46503] Device[8] HT[2] DQ[4] RECACHE ASSERT FAILED: '47934056<28584300'
2020/07/14 10:55:02 - SPINE: Poller[3] PID[46503] Device[8] HT[2] DQ[10] RECACHE ASSERT FAILED: '47934098<28584300'
2020/07/14 10:55:02 - SPINE: Poller[3] PID[46503] Device[8] HT[2] DQ[11] RECACHE ASSERT FAILED: '47934139<28584300'
2020/07/14 10:55:02 - SPINE: Poller[3] PID[46503] Device[8] HT[1] DQ[4] RECACHE ASSERT FAILED: '47934056<28584306'
2020/07/14 10:55:02 - SPINE: Poller[3] PID[46503] Device[8] HT[1] DQ[10] RECACHE ASSERT FAILED: '47934098<28584306'
2020/07/14 10:55:02 - SPINE: Poller[3] PID[46503] Device[8] HT[1] DQ[11] RECACHE ASSERT FAILED: '47934139<28584306'
To Reproduce
Steps to reproduce the behaviour:
-
Add a test device to Cacti, enable debug, then create a few interface statistics graphs.
-
Physically power off the device for a short time then turn it back on.
-
Let Cacti run for a few hours after the device has restored.
-
If gapping starts, check the remote poller's log. There will likely be a Recache Assert Failed event at the times of the gaps.
Expected behavior
These devices use uptime for the re-index method for data queries. They should not recache every two hours and the recache should not cause a total loss of data for that polling cycle.
Screenshots
Server Information
Cacti version: 1.2.12 (Develop Commit 5181b79)
Spine version: 1.2.12
Server OS: Ubuntu 18.04 LTS
Apache version: 2.4.29
PHP version: 7.2.24-0ubuntu0.18.04.6
MariaDB version: 10.4.13-MariaDB-1:10.4.13+maria~bionic
Number of Remote Pollers: 5
