
Increased bulk insert size to avoid partial inserts and potential data loss. #4808

Closed
gvde opened this issue Jun 7, 2022 · 5 comments

Labels: bug (Undesired behaviour), confirmed (Bug is confirmed by dev team), resolved (A fixed issue)
Milestone: v1.2.22

Comments

@gvde (Contributor) commented Jun 7, 2022

Describe the bug

I have just upgraded from 1.2.15 to 1.2.21 on CentOS 7 using the EPEL 7 RPM.

2022/06/07 15:40:01 - POLLER: Poller[1] PID[6208] WARNING: Poller Output Table not Empty.  Issues: 8, DS[24539, 24539, 24539, 24539, 24798, 24798, 24798, 24798]

I now see this error for two data sources of two different devices on every poller run, i.e. every 5 minutes.

If I delete the graph & data source and recreate it, the error just moves to the following data source id on the same device.

The problem looks very much like issue #4687, but it persists with 1.2.21.

I ran the command suggested in #4687 and it showed something interesting:

[root@cacti cacti]# sudo -u apache php -q /usr/share/cacti/cmd.php --first=79 --last=79 --debug --mibs --poller=1
Total[0.1503] Device[79] STATUS: Device 'sw4cr.example.com' is UP.
Total[1.1050] Device[79] RECACHE: Processing 1 items in the auto reindex cache for 'sw4cr.example.com'.
Total[1.1623] Device[79] DQ[1] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert:10326104400 < output:10326118300)
Total[1.1692] Device[79] DS[20692] TT[6.44] SNMP: v2: sw4cr.example.com, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.2, output: 967480684385
Total[1.1828] Device[79] DS[20692] TT[13.22] SNMP: v2: sw4cr.example.com, dsname: traffic_out, oid: .1.3.6.1.2.1.31.1.1.1.10.2, output: 197844427214367
Total[1.1882] Device[79] DS[20693] TT[5.34] SNMP: v2: sw4cr.example.com, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.3, output: 15612290891279
Total[1.1942] Device[79] DS[20693] TT[5.82] SNMP: v2: sw4cr.example.com, dsname: traffic_out, oid: .1.3.6.1.2.1.31.1.1.1.10.3, output: 1128329598664172
Total[1.2023] Device[79] DS[20694] TT[8.08] SNMP: v2: sw4cr.example.com, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.4, output: 61918210986497
Total[1.2079] Device[79] DS[20694] TT[5.5] SNMP: v2: sw4cr.example.com, dsname: traffic_out, oid: .1.3.6.1.2.1.31.1.1.1.10.4, output: 67025955929767
Total[1.2139] Device[79] DS[20695] TT[5.93] SNMP: v2: sw4cr.example.com, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.5, output: 84245199078354
Total[1.2197] Device[79] DS[20695] TT[5.64] SNMP: v2: sw4cr.example.com, dsname: traffic_out, oid: .1.3.6.1.2.1.31.1.1.1.10.5, output: 55206823810825
Total[1.2277] Device[79] DS[20703] TT[7.9] SNMP: v2: sw4cr.example.com, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.41, output: 1158166100030
Total[1.2361] Device[79] DS[20703] TT[8.2] SNMP: v2: sw4cr.example.com, dsname: traffic_out, oid: .1.3.6.1.2.1.31.1.1.1.10.41, output: 661501084783
Total[1.2440] Device[79] DS[20704] TT[7.7] SNMP: v2: sw4cr.example.com, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.46, output: 580914728523
Total[1.2646] Device[79] DS[20704] TT[20.49] SNMP: v2: sw4cr.example.com, dsname: traffic_out, oid: .1.3.6.1.2.1.31.1.1.1.10.46, output: 683445271450
Total[1.2727] Device[79] DS[20706] TT[8.01] SNMP: v2: sw4cr.example.com, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.56, output: 251804378104363
Total[1.2802] Device[79] DS[20706] TT[7.35] SNMP: v2: sw4cr.example.com, dsname: traffic_out, oid: .1.3.6.1.2.1.31.1.1.1.10.56, output: 337430807284819
Total[1.2877] Device[79] DS[20707] TT[7.41] SNMP: v2: sw4cr.example.com, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.61, output: 4288729550307
...
Total[11.3192] Device[79] DS[24538] TT[7.24] SNMP: v2: sw4cr.example.com, dsname: discards_in, oid: .1.3.6.1.2.1.2.2.1.13.336, output: 0
Total[11.3288] Device[79] DS[24538] TT[9.49] SNMP: v2: sw4cr.example.com, dsname: discards_out, oid: .1.3.6.1.2.1.2.2.1.19.336, output: 92
Total[11.3340] Device[79] DS[24538] TT[5.07] SNMP: v2: sw4cr.example.com, dsname: errors_in, oid: .1.3.6.1.2.1.2.2.1.14.336, output: 0
Total[11.3389] Device[79] DS[24538] TT[4.85] SNMP: v2: sw4cr.example.com, dsname: errors_out, oid: .1.3.6.1.2.1.2.2.1.20.336, output: 0
Total[11.3452] Device[79] DS[24539] TT[6.2] SNMP: v2: sw4cr.example.com, dsname: discards_in, oid: .1.3.6.1.2.1.2.2.1.13.341, output: 0
Total[11.3452] Device[79] Writing 1001 items to Poller Output Table
Total[11.3628] Device[79] DS[24539] TT[8.29] SNMP: v2: sw4cr.example.com, dsname: discards_out, oid: .1.3.6.1.2.1.2.2.1.19.341, output: 100
Total[11.3978] Device[79] DS[24539] TT[34.81] SNMP: v2: sw4cr.example.com, dsname: errors_in, oid: .1.3.6.1.2.1.2.2.1.14.341, output: 0
Total[11.4045] Device[79] DS[24539] TT[6.59] SNMP: v2: sw4cr.example.com, dsname: errors_out, oid: .1.3.6.1.2.1.2.2.1.20.341, output: 0
...
Total[16.3092] Device[79] DS[24650] TT[10.53] SNMP: v2: sw4cr.example.com, dsname: discards_in, oid: .1.3.6.1.2.1.2.2.1.13.2013, output: 0
Total[16.3221] Device[79] DS[24650] TT[12.88] SNMP: v2: sw4cr.example.com, dsname: discards_out, oid: .1.3.6.1.2.1.2.2.1.19.2013, output: 0
Total[16.3352] Device[79] DS[24650] TT[12.94] SNMP: v2: sw4cr.example.com, dsname: errors_in, oid: .1.3.6.1.2.1.2.2.1.14.2013, output: 0
Total[16.3492] Device[79] DS[24650] TT[13.95] SNMP: v2: sw4cr.example.com, dsname: errors_out, oid: .1.3.6.1.2.1.2.2.1.20.2013, output: 0
Total[16.3493] Device[79] Writing 447 items to Poller Output Table
Total[16.3549] Device[79] Time[16.21] Items[1449] Errors[2]
Total[16.3639] Time: 16.3634 s, Poller: 1, Threads: N/A, Devices: 1, Items: 1449, Errors: 2

It looks the same on the other device:

Total[3.3275] Device[78] DS[24798] TT[1.35] SNMP: v2: bd1sw0.example.com, dsname: discards_in, oid: .1.3.6.1.2.1.2.2.1.13.4196996, output: 0
Total[3.3290] Device[78] DS[24798] TT[1.43] SNMP: v2: bd1sw0.example.com, dsname: discards_out, oid: .1.3.6.1.2.1.2.2.1.19.4196996, output: 0
Total[3.3290] Device[78] Writing 1001 items to Poller Output Table
Total[3.3401] Device[78] DS[24798] TT[1.93] SNMP: v2: bd1sw0.example.com, dsname: errors_in, oid: .1.3.6.1.2.1.2.2.1.14.4196996, output: 0
Total[3.3437] Device[78] DS[24798] TT[3.42] SNMP: v2: bd1sw0.example.com, dsname: errors_out, oid: .1.3.6.1.2.1.2.2.1.20.4196996, output: 0

So it seems that, exactly for the data source id during which the "Writing 1001 items to Poller Output Table" happens, the poller output table is not cleared out correctly.

I have two other devices with very many data sources. However, when I run the command on those two it looks like this:

Total[1.1513] Device[92] DS[23878] TT[0.64] SNMP: v2: firecr.example.com, dsname: discards_in, oid: .1.3.6.1.2.1.2.2.1.13.709, output: 0
Total[1.1521] Device[92] DS[23878] TT[0.57] SNMP: v2: firecr.example.com, dsname: discards_out, oid: .1.3.6.1.2.1.2.2.1.19.709, output: 0
Total[1.1527] Device[92] DS[23878] TT[0.53] SNMP: v2: firecr.example.com, dsname: errors_in, oid: .1.3.6.1.2.1.2.2.1.14.709, output: 0
Total[1.1533] Device[92] DS[23878] TT[0.5] SNMP: v2: firecr.example.com, dsname: errors_out, oid: .1.3.6.1.2.1.2.2.1.20.709, output: 0
Total[1.1533] Device[92] Writing 1001 items to Poller Output Table
Total[1.1635] Device[92] DS[23879] TT[0.71] SNMP: v2: firecr.example.com, dsname: discards_in, oid: .1.3.6.1.2.1.2.2.1.13.710, output: 0
Total[1.1642] Device[92] DS[23879] TT[0.57] SNMP: v2: firecr.example.com, dsname: discards_out, oid: .1.3.6.1.2.1.2.2.1.19.710, output: 0
Total[1.1648] Device[92] DS[23879] TT[0.54] SNMP: v2: firecr.example.com, dsname: errors_in, oid: .1.3.6.1.2.1.2.2.1.14.710, output: 0
Total[1.1654] Device[92] DS[23879] TT[0.57] SNMP: v2: firecr.example.com, dsname: errors_out, oid: .1.3.6.1.2.1.2.2.1.20.710, output: 0

Here, the "Writing 1001 items to Poller Output Table" happens exactly between two data sources and not in the middle of one.
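
To make the failure mode above concrete: cmd.php buffers polled results and bulk-inserts them into poller_output once a count threshold is crossed, and the logs show that flush landing in the middle of one data source's values. Below is a minimal, runnable PHP sketch of the count-based pattern plus a boundary-aware variant that would avoid splitting a data source; flush_output() and all names here are illustrative stand-ins, not the actual cmd.php code:

```php
<?php
// Illustrative sketch only, not the actual cmd.php code: shows why a
// pure count-based flush can split one data source across two bulk
// inserts, and a boundary-aware variant that avoids it.

// Hypothetical stand-in for the bulk INSERT into poller_output.
function flush_output(array $buffer) {
    printf("flushing %d items\n", count($buffer));
}

// Assumed shape: results arrive grouped by data source id.
$results = array(
    array('local_data_id' => 24538, 'rrd_name' => 'discards_in',  'output' => 0),
    array('local_data_id' => 24538, 'rrd_name' => 'discards_out', 'output' => 92),
    array('local_data_id' => 24539, 'rrd_name' => 'discards_in',  'output' => 0),
    array('local_data_id' => 24539, 'rrd_name' => 'discards_out', 'output' => 100),
);

$buffer       = array();
$output_count = 0;
$last_ds      = null;
$batch_limit  = 2;   // tiny limit for the demo; cmd.php uses 1000

foreach ($results as $row) {
    // A count-based flush alone could fire here mid-data-source. The
    // boundary test means we only flush once every buffered row for the
    // previous data source id is in the buffer, so no DS is ever split.
    if ($output_count >= $batch_limit && $row['local_data_id'] !== $last_ds) {
        flush_output($buffer);
        $buffer       = array();
        $output_count = 0;
    }
    $buffer[]     = $row;
    $output_count++;
    $last_ds      = $row['local_data_id'];
}

if ($output_count > 0) {
    flush_output($buffer);   // final partial batch
}
```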

@gvde added the bug (Undesired behaviour) and unverified (Some days we don't have a clue) labels on Jun 7, 2022
@TheWitness (Member) commented Jun 8, 2022

Yea, I say crap! Can you bump the number in cmd.php to something larger? I think that'll help for the time being. Spine will not do an insert until the total length of the string is 131k bytes. Assuming every insert is, say, 200 bytes, that would be around 655 records, but the max buffer for MySQL is several MB. So going to 2k or higher should not be a problem.
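
Making that arithmetic explicit (all figures are the assumptions from the comment above, not measured values):

```php
<?php
// Back-of-envelope sizing based on the figures in the comment above.
$spine_flush_bytes = 131072;  // the ~131k string-length threshold in Spine
$avg_row_bytes     = 200;     // assumed average size of one record

// Records Spine accumulates before inserting at that threshold:
echo intdiv($spine_flush_bytes, $avg_row_bytes), " records\n";  // 655

// Approximate payload of a 2000-record batch in cmd.php:
printf("%.2f MB\n", (2000 * $avg_row_bytes) / 1048576);         // ~0.38 MB
// ...comfortably below a multi-MB MySQL packet/buffer limit.
```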

@gvde (Contributor, Author) commented Jun 8, 2022

You mean the 1000 here, in cacti/cmd.php at line 391 (commit 1235955):

if ($output_count > 1000) {

to 2000?

@TheWitness (Member) commented:

Yup. I hate to do it this way; this is a complicated problem, and I'm trying to have a simplified summer from a personal perspective. Work is pretty complicated, but it only occupies 8 hours a day 👍
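
Concretely, the temporary workaround is just that one-line change to the condition quoted above (cmd.php, line 391):

```php
// was: if ($output_count > 1000) {
if ($output_count > 2000) {
```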

@gvde (Contributor, Author) commented Jun 8, 2022

O.K. I set it to 2000 as a temporary workaround. The poller_output table is now empty again, and I haven't received any notification emails since then either.
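
For anyone verifying the same workaround, a quick check that the table drains between poller runs (a sketch using PDO; the credentials are placeholders, and the poller_output column names are assumed from Cacti's stock schema):

```php
<?php
// Sketch: count leftover rows in poller_output shortly after a poll.
// Host/user/password are placeholders; adjust for your install.
$pdo = new PDO('mysql:host=localhost;dbname=cacti', 'cactiuser', 'secret');

$left = (int) $pdo->query('SELECT COUNT(*) FROM poller_output')->fetchColumn();
echo "poller_output rows: $left\n";   // expect 0 once the RRDs are updated

if ($left > 0) {
    // List the data source ids that are stuck, as in the WARNING above.
    $stuck = $pdo->query('SELECT DISTINCT local_data_id FROM poller_output');
    foreach ($stuck as $row) {
        echo 'stuck DS[', $row['local_data_id'], "]\n";
    }
}
```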

@TheWitness added the confirmed (Bug is confirmed by dev team) label and removed the unverified (Some days we don't have a clue) label on Jun 11, 2022
TheWitness added a commit that referenced this issue Jun 11, 2022
- WARNING: Poller Output Table not Empty
- This is a summer solution.  The real solution is a bit more complicated
@TheWitness (Member) commented:

I've increased the insert size for cmd.php. Going to leave it at that for now.

@TheWitness added the resolved (A fixed issue) label on Jun 11, 2022
@TheWitness added this to the v1.2.22 milestone on Jun 11, 2022
@netniV changed the title from "WARNING: Poller Output Table not Empty" to "Increased bulk insert size to avoid partial inserts and potential data loss." on Aug 14, 2022
@github-actions bot locked and limited the conversation to collaborators on Nov 28, 2022