Remote pollers may sometimes fail to replicate data back to main system #2775

Closed
eschoeller opened this issue Jun 27, 2019 · 15 comments
Labels
bug Undesired behaviour resolved A fixed issue

@eschoeller

Unfortunately, I've also just found that data queries do not appear to be re-indexing when they should. I have data queries set up with an Uptime re-index method. However, I just restarted about 50 devices (after changing information that the data query captures) and had to re-run all the data queries by hand to refresh them.
Cacti is accurately obtaining the Uptime for these devices; at least, I can see the correct Uptime listed in the Management->Devices pane.

@cigamit
Member

cigamit commented Jun 28, 2019

Which data collector?

@eschoeller
Author

Ohh, good question! They are all on 192.168.252.3.
Heh, things have been working so smoothly with the data collectors I have been forgetting about them!

@eschoeller
Author

eschoeller commented Jun 28, 2019

Device IDs 493,494,495,496,497,498,499,512,513,532,536,537,538,542,545,546,547,548,576,577,578,579,580,581,605,608,612,613,614,615,619,651,671,672,673,675,678,679,718,719,720,721,723,724,725,726,794,812
It's that pesky temperature/humidity data query again. And guess what, all the graph_text legends broke again in the legacy aggregates. But I am going through and torpedoing a bunch of those aggregates right now. They were holding onto some really old data sources that I need to purge.

@cigamit
Member

cigamit commented Jun 29, 2019

So, is the re-indexing failing only on the remote data collectors? Can you confirm?

@cigamit added the "unverified" (Some days we don't have a clue) and "bug" (Undesired behaviour) labels and removed the "unverified" label Jun 29, 2019
@cigamit
Member

cigamit commented Jun 30, 2019

Okay, found the culprit.

@cigamit changed the title from "[1.2.4] Data Queries fail to Re-Index" to "Data Queries for remotely managed devices fail to push updated data to main data collector" Jun 30, 2019
cigamit added a commit that referenced this issue Jun 30, 2019
Data Queries for remotely managed devices fail to push updated data to main data collector
@cigamit cigamit added the resolved A fixed issue label Jun 30, 2019
@cigamit
Member

cigamit commented Jun 30, 2019

Okay, just pushed two changes. To verify, check that the re-index is being forced on the remote data collector. You should then see updated information coming back to the main data collector from the remote. I'm not sure this will actually drive the automation, but this much, at least, is broken right now. There are a few ways to verify: delete all the records for the device in the host_snmp_cache table on the main data collector and confirm that they re-populate properly (which is what I did), or see if a new graph gets created. I've marked this resolved, but I still have to trace a few things.

@cigamit
Member

cigamit commented Jun 30, 2019

Just walking through things, I still see a few issues. Still checking.

@cigamit
Member

cigamit commented Jun 30, 2019

Yeah, automation is going to continue to fail. Let me see what I can do about that.

@cigamit
Member

cigamit commented Jun 30, 2019

Okay, so I just reviewed the code and found that re-indexes triggered by the reboot of a remotely polled device will not force an automation run. That's the way it's written right now. It looks like automation is only run from the main data collector. The question is how we can schedule a run when a device reboots. I'll have to think about that.

Please confirm what you are seeing on your end. For example: delete a graph, purge the host_snmp_cache for the device and data query in question on the main data collector, then reboot the device to force a recache. After the data collector runs, check that the data is repopulated in the host_snmp_cache, then re-run automation on the main data collector and check that the graph is created again.
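
For readers who want to script the purge-and-check step, here is a minimal sketch. It is an illustration only: Cacti stores host_snmp_cache in MySQL/MariaDB, while this uses an in-memory sqlite3 table so it can run anywhere. The column names (host_id, snmp_query_id, and so on) and the helper names are assumptions, not taken from this thread; check them against your own schema. The sample row mirrors output posted later in this thread.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in for the main collector's cache table.

    ASSUMPTION: column names are illustrative; verify against your schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE host_snmp_cache ("
        "host_id INTEGER, snmp_query_id INTEGER, field_name TEXT, "
        "field_value TEXT, snmp_index TEXT, oid TEXT)"
    )
    conn.execute(
        "INSERT INTO host_snmp_cache VALUES (?, ?, ?, ?, ?, ?)",
        (812, 88, "tHSensorName", "Bottom-Rack-Inlet_B5", "1",
         ".1.3.6.1.4.1.1718.4.1.9.2.1.3.1.1"),
    )
    conn.commit()
    return conn


def cached_rows(conn, host_id, query_id):
    """Count cache rows for one device and one data query."""
    cur = conn.execute(
        "SELECT COUNT(*) FROM host_snmp_cache "
        "WHERE host_id = ? AND snmp_query_id = ?",
        (host_id, query_id),
    )
    return cur.fetchone()[0]


def purge_cache(conn, host_id, query_id):
    """Delete cache rows for one device and data query; return rows removed."""
    cur = conn.execute(
        "DELETE FROM host_snmp_cache "
        "WHERE host_id = ? AND snmp_query_id = ?",
        (host_id, query_id),
    )
    conn.commit()
    return cur.rowcount
```

After purging, a non-zero `cached_rows(...)` once the remote data collector has re-indexed would suggest that the data made it back to the main data collector.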

@eschoeller
Author

Sorry for the delay. I should be able to look at this in about 2-3 hours.

@eschoeller
Author

What I will do is simply change the name of a temperature sensor and then reboot the PDU. When the uptime goes backwards, the data query should re-index and update the name of the temperature sensor without any other intervention on my part. It should only take about 5 minutes to test.
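
The "uptime goes backwards" trigger described above can be sketched roughly as follows. This is not Cacti's or spine's actual code; the class and method names are hypothetical, and the sketch only illustrates the comparison: a reading lower than the previous one suggests the device rebooted.

```python
from typing import Optional


class UptimeWatcher:
    """Tracks the last sysUpTime reading seen for one device (hypothetical helper)."""

    def __init__(self) -> None:
        # No reading yet, so the first observation can never trigger a re-index.
        self.last_ticks: Optional[int] = None

    def observe(self, ticks: int) -> bool:
        """Record a new reading; return True if uptime went backwards.

        sysUpTime is reported in hundredths of a second. Note that a 32-bit
        counter wrap (after roughly 497 days) also looks like a reset here.
        """
        went_backwards = self.last_ticks is not None and ticks < self.last_ticks
        self.last_ticks = ticks
        return went_backwards
```

For example, observing 89950103 and then 3147 (the two values that appear in the log output later in this thread) would return `False` for the first reading and `True` for the second.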

@cigamit
Member

cigamit commented Jun 30, 2019

Cool. I know that I can force an automation run, but for now, we may just want to open another issue.

@eschoeller
Author

Sorry, something went totally haywire over here and I've been trying to fix it. I think I merged too many updates and things got really quirky, so I had to back out of it.

@eschoeller
Author

OK. It worked.

|     812 |            88 | tHSensorName                 | Bottom-Rack-Inlet_B5    | 1          | .1.3.6.1.4.1.1718.4.1.9.2.1.3.1.1 |       1 | 2019-06-23 15:30:53 |
|     812 |            88 | tHSensorName                 | Bottom-Rack-Inlet_B5a   | 1          | .1.3.6.1.4.1.1718.4.1.9.2.1.3.1.1 |       1 | 2019-06-30 21:21:48 |
...
2019/06/30 21:20:02 - SPINE: Poller[3] Device[812] HT[1] DQ[86] RECACHE ASSERT FAILED: '89950103<3147'
2019/06/30 21:20:02 - SPINE: Poller[3] Device[812] HT[1] DQ[87] RECACHE ASSERT FAILED: '89950103<3147'
2019/06/30 21:20:02 - SPINE: Poller[3] Device[812] HT[1] DQ[88] RECACHE ASSERT FAILED: '89950103<3147'
2019/06/30 21:20:02 - SPINE: Poller[3] Device[812] HT[1] DQ[89] RECACHE ASSERT FAILED: '89950103<3147'
2019/06/30 21:20:02 - SPINE: Poller[3] Device[812] HT[1] DQ[92] RECACHE ASSERT FAILED: '89950103<3147'
2019/06/30 21:20:54 - SYSTEM STATS: Time:52.1684 Method:spine Processes:1 Threads:16 Hosts:77 HostsPerProcess:77 DataSources:6442 RRDsProcessed:0
2019/06/30 21:20:54 - PCOMMAND Device[812] WARNING: Recache Event Detected for Device
2019/06/30 21:20:54 - DSTRACE Running Re-Index for Device[812], DQ[83]
2019/06/30 21:21:08 - DSTRACE Running Re-Index for Device[812], DQ[84]
2019/06/30 21:21:17 - DSTRACE Running Re-Index for Device[812], DQ[85]
2019/06/30 21:21:22 - DSTRACE Running Re-Index for Device[812], DQ[86]
2019/06/30 21:21:32 - DSTRACE Running Re-Index for Device[812], DQ[87]
2019/06/30 21:21:48 - DSTRACE Running Re-Index for Device[812], DQ[88]
2019/06/30 21:21:48 - DSTRACE Running Re-Index for Device[812], DQ[89]
2019/06/30 21:21:49 - DSTRACE Running Re-Index for Device[812], DQ[92]
2019/06/30 21:21:50 - RECACHE STATS: Poller:3 RecacheTime:55.7865 DevicesRecached:1
2019/06/30 21:22:06 - SPINE: Poller[3] Device[812] Hostname[172.20.9.187] NOTICE: HOST EVENT: Device Returned from DOWN State
...
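
To pull the recache events out of a log like the one above, a small parser along these lines may help. It is a sketch that assumes the exact line format shown in this thread. The names `left` and `right` are neutral labels for the two numbers in the assertion; reading them as "stored uptime" and "fresh uptime" is an interpretation, not something the log states.

```python
import re

# Matches lines such as:
# 2019/06/30 21:20:02 - SPINE: Poller[3] Device[812] HT[1] DQ[86] RECACHE ASSERT FAILED: '89950103<3147'
ASSERT_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) - SPINE: "
    r"Poller\[(?P<poller>\d+)\] Device\[(?P<device>\d+)\] HT\[\d+\] "
    r"DQ\[(?P<dq>\d+)\] RECACHE ASSERT FAILED: "
    r"'(?P<left>\d+)(?P<op>[<>=])(?P<right>\d+)'$"
)


def parse_assert_failures(lines):
    """Return one dict per RECACHE ASSERT FAILED line; other lines are skipped."""
    failures = []
    for line in lines:
        m = ASSERT_RE.match(line.strip())
        if m:
            failures.append({
                "timestamp": m["ts"],
                "poller": int(m["poller"]),
                "device": int(m["device"]),
                "dq": int(m["dq"]),
                "left": int(m["left"]),
                "op": m["op"],
                "right": int(m["right"]),
            })
    return failures


# Sample lines copied from the log above.
SAMPLE = [
    "2019/06/30 21:20:02 - SPINE: Poller[3] Device[812] HT[1] DQ[86] RECACHE ASSERT FAILED: '89950103<3147'",
    "2019/06/30 21:20:02 - SPINE: Poller[3] Device[812] HT[1] DQ[88] RECACHE ASSERT FAILED: '89950103<3147'",
    "2019/06/30 21:20:54 - PCOMMAND Device[812] WARNING: Recache Event Detected for Device",
]

if __name__ == "__main__":
    for failure in parse_assert_failures(SAMPLE):
        print(failure)
```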

@cigamit
Member

cigamit commented Jul 1, 2019

Okay, closing.

@cigamit cigamit closed this as completed Jul 1, 2019
@cigamit cigamit added this to the v1.2.5 milestone Jul 1, 2019
@netniV changed the title from "Data Queries for remotely managed devices fail to push updated data to main data collector" to "Remote pollers may sometimes not always replicate data back to main system" Jul 14, 2019
@netniV changed the title from "Remote pollers may sometimes not always replicate data back to main system" to "Remote pollers may sometimes fail to replicate data back to main system" Jul 14, 2019
@github-actions github-actions bot locked and limited conversation to collaborators Jun 30, 2020