
Boost may be blocked by overflowing poller_output table #4375

Closed
bernisys opened this issue Aug 24, 2021 · 17 comments
Labels: bug (Undesired behaviour), resolved (A fixed issue)
Milestone: 1.2.19

Comments

@bernisys
Contributor

Describe the bug

We just encountered a situation where the poller_output table on the main poller kept filling up, without any recognizable reason. In the end the (already quite generously sized) in-memory table held 1.9M entries, and from that point on everything started to go haywire.

What happened was:

  • during the past 2 months we saw a steady increase of the polling time on the main poller
  • on Sunday morning around 2:00 local time, the cycle times reached 300 seconds
  • from that point on, boost just stopped working and items accumulated in poller_output_boost
  • several hours later, when we came to work on Monday, there were already >200M entries
  • we tried to find out what had happened, and the only thing we saw was a series of "Poller table full" messages in the logs
  • we cleared that table, and then our boost stats came back, with a somewhat shocking result (see the attached image below)
  • the ~330M entries in the boost arch table took well over 24h to process
  • meanwhile another batch of >200M entries accumulated in the next boost table

We mitigated this more or less manually: I tried to understand how boost works internally, figured out the workings of the poller_output_boost_local_data_ids table, checked the lowest ID tracked there and dropped everything below it (roughly along the lines of the sketch below).
At first that made things worse - processing now took 10 times longer, presumably because of the fragmented table.
I assume this is probably also one of the main reasons why boost does not delete entries from the arch tables any more...
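
A rough sketch of that kind of cleanup, purely for illustration (the arch table name is the one from this thread, the cut-off value is an example, and the column names are assumed from the standard boost schema - do not run this blindly):

-- 1) Find the lowest local_data_id that boost still has queued for processing.
SELECT MIN(local_data_id) FROM poller_output_boost_local_data_ids;

-- 2) Drop the presumably already-processed rows below that ID from the archive
--    table (replace 120000 with the value returned above).
DELETE FROM poller_output_boost_arch_1629891683
 WHERE local_data_id < 120000;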

Restarting MySQL and waiting a bit somehow got this under control: the first 328M entries were processed in the end, and processing the new batch of 260M took a lot less time (curse you, exponential time growth ... it's like corona spreading ...)

Now we seem to have recovered with just minimal data loss (the 1.9M entries in the in-mem table, but who cares).
Still, it seems there is some major misalignment in the boost cycle handling that causes this to happen.
Ideas to fix this are below. :)

To Reproduce

Not sure how this can be reproduced 100% - in our case it just took a long time to show up, and it seems to be recurring: we already see values slowly accumulating in the poller_output table again. We saw similar behavior in the past, with the polling time always peaking exponentially (with 1.2.16 and 1.2.17 it recovered on its own, though - so this could be a side effect of some change in 1.2.18).
Not sure why this happens; the table content should be completely processed into the boost table.

Just monitor the number of entries in that table and see whether it keeps rising - for example with the query below.
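
A minimal check, just the raw row count (how you feed it into your monitoring is up to you); on a healthy system this should drop back to near zero shortly after every polling cycle:

-- Row count of the poller_output transfer table; alert if it keeps growing.
SELECT COUNT(*) AS pending_rows FROM poller_output;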

Expected behavior

Boost should run, no matter what, and always obey the configured limits. Otherwise the system will get slower and slower due to the usual O(something) database processing-time constraints: more entries mean disproportionately more processing time, which kills the whole system.

It would probably be wiser to add a safety limit on the number of boost table entries: once that limit is reached, the current table gets moved away (a new arch table with a timestamp in its name is created) and the poller keeps collecting data in a fresh table. A rough sketch of such a rotation follows below.

This would also mean that the table deletion inside boost must be altered a bit:
Currently it seems that boost deletes all the tables it finds once it finishes whatever it had been doing before. This would mean that if another process created a new arch table in the meantime, it would also be deleted at the end of the boost run, without being processed.
It would probably be a better approach to do the deletion inside the loop where the arch tables are processed sequentially, rather than at the very end. Alternatively, remember the names of the tables that were completed and delete only those at the end.
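
A minimal sketch of the proposed rotation, assuming the existing naming convention for the arch tables (the timestamp suffix is an example, and the threshold check itself is left out); boost would then pick up the timestamped archive in a later run:

-- Once poller_output_boost exceeds the safety limit, swap in an empty table so
-- the poller keeps writing, and leave the full one behind as a timestamped archive.
CREATE TABLE poller_output_boost_new LIKE poller_output_boost;
RENAME TABLE poller_output_boost     TO poller_output_boost_arch_1629900000,
             poller_output_boost_new TO poller_output_boost;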

Screenshots


Boost table content over time:
image

Additional context

@bernisys bernisys added the bug (Undesired behaviour) and unverified (Some days we don't have a clue) labels Aug 24, 2021
@bernisys bernisys changed the title from "When poller_output table is filling over time, it can wind up into a critical situation blocking boost." to "When poller_output table is filling over time on the main poller, it can wind up into a critical situation blocking boost." Aug 24, 2021
@bmfmancini
Member

I am seeing similar behaviour, but I don't think it's boost for us.

Are you seeing the polling time increase only on the main poller while the remotes are fine?
Right now the polling times increase slowly, and a restart of MariaDB rectifies the issue for about 2 months.

The polling time increases slowly, then at one point jumps up very high, causing polling timeouts for just the main poller while the remotes are just fine.

image

When I zoom into the time of the polling timeouts, you can see boost seems OK.

This is on 1.2.16

@bernisys
Contributor Author

Hi Sean,

On the remotes we usually don't see that problem - unless, of course, the main poller's DB has so much to do that it is completely stuck processing the queries; then the remotes are also affected and sometimes show slightly higher poll times, but that's not significant. We had this a couple of times in the past, and it recovered after we restarted the DB (due to upgrades, for example, or to change the DB configuration). So my intention this time was to intentionally let it peak again and check the process table during that time, to see which process behaves differently.
And sadly, this time it was really significant when it peaked - basically all pollers were jumping around like mad once the table was filled to the brim.

I had placed a few debug prints into poller.php the last time we had a similar problem; I guess I might want to add them again and open a pull request so that everyone can debug their problems a bit better. I remember that last time this happened, our Cacti spent most of the time in the data post-processing subroutines (splitting multi-value lines into the different key/value pairs after retrieval).

I'd say you should check your situation over a longer period, to see whether values are slowly accumulating in poller_output as well.
I believe if you wait some time, this might also happen on your installation.

I can add the corresponding poll time aggregate diagram for reference when I'm back at my work PC tomorrow :)
I see that your poll time peaks out at 60s - I also vaguely remember you said somewhere that you poll at 60s intervals.
We use 5min intervals, so our poll time peaks at 300s of course.

I can also attach a long-term view of the main poll time diagram; it shows the exponential poll-time peaking a few times, always rising until we restarted the DB during upgrades or to change some DB settings.

@bernisys
Contributor Author

As promised, here is the long-term view of the main poller's poll time:
image
This long-term view shows that in our case it takes about 2.5 months to reach the point where it peaks.

I assume this correlates with the total number of data sources and the size of the poller_output table: the more data sources, the less time it takes; the more RAM for the table, the more time it takes.

And here is a more detailed view on our last incident, including the remotes:

image

image

@bmfmancini
Member

Ah yeah, that's the same type of behaviour I am seeing. I think because you have more headroom polling every 5 minutes vs. every minute, it takes longer for you to see the critical impact.

I checked the poller_output table on our prod instance that has this behaviour

MariaDB [cacti]> select count(*) from poller_output;
+----------+
| count(*) |
+----------+
|    39461 |
+----------+
1 row in set (0.00 sec)

However, what is weird is the leftover values in poller_output after the polling cycle completes (a quick check for that is sketched below).
I am going to check our lab instance, which is much, much smaller, to see if the table is building up there as well.
On the small instances this issue is not seen.
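
A quick way to spot such leftovers, assuming a one-minute polling interval here (use 300 seconds for five-minute polling):

-- Rows that should already have been consumed by a previous cycle.
SELECT COUNT(*) AS stale_rows,
       MIN(time) AS oldest_entry
  FROM poller_output
 WHERE time < NOW() - INTERVAL 60 SECOND;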

@bmfmancini
Member

bmfmancini commented Aug 25, 2021

Wait a moment, look at this - check out the dates on these records.
It looks like they slowly build up over days and don't clear, and this happens only on the main poller.
The remotes' output tables are always clear afterwards.

What's also funny: I checked the DS ID below - it belongs to a device that is not on the main poller.

MariaDB [cacti]> select *  from poller_output LIMIT 10  \G;
*************************** 1. row ***************************
local_data_id: 120133
     rrd_name: traffic_in
         time: 2021-08-09 08:24:02
       output: U
*************************** 2. row ***************************
local_data_id: 120133
     rrd_name: traffic_in
         time: 2021-08-18 07:53:02
       output: U
*************************** 3. row ***************************
local_data_id: 120133
     rrd_name: traffic_in
         time: 2021-08-16 17:53:02
       output: U
*************************** 4. row ***************************
local_data_id: 120133
     rrd_name: traffic_in
         time: 2021-08-11 05:39:03
       output: U
*************************** 5. row ***************************
local_data_id: 120133
     rrd_name: traffic_in
         time: 2021-08-08 05:41:03
       output: U

I thought maybe PHP memory exhaustion, but I don't think so - the table doesn't really get that big.
I have my PHP memory limit set at 700M.

+-------------------------------------+-----------+
| Table                               | Size (MB) |
+-------------------------------------+-----------+
| host_snmp_cache                     |    969.97 |
| poller_output_boost_arch_1629891683 |    470.09 |
| poller_output                       |    388.99 |
| poller_output_boost                 |    200.98 |
| graph_templates_item                |    187.63 |
| poller_item                         |    127.05 |
| thold_data                          |    115.98 |
| poller_resource_cache               |     99.38 |
| data_input_data                     |     70.31 |

@bmfmancini
Member

Yeah, on smaller instances this is not happening, so it's a scale thing.

@bernisys wanna catch up on a call?

@bernisys
Contributor Author

@bmfmancini

That's exactly what we are observing - the table is not completely emptied, and older values accumulate over time until the table is completely filled. So it would be good if someone from the team could check as well and make it a confirmed bug.

And yeah, sure, let's have a call - just ping me a proposal on my company address, we can use Teams. I've got some time tomorrow from early afternoon, starting at 13:00 CEST.
If someone from the team has time as well, that would be great, so that we can discuss some pointers on how we can help identify the source of the problem. I haven't checked the code, but from the behavior it looks a bit like either an "off-by-one" error somewhere in the handling of the poller_output table, or maybe something timing-related in larger environments.

Looking forward! :)

@bmfmancini
Member

Sounds good to me!

I wonder if running poller_output_empty.php in the meantime, say every few weeks, would keep it at bay
until we figure out what the cause is.

It would force the RRD update for the values in the table as well, so it's not like you're losing the data.

@bernisys
Contributor Author

The idea of running poller_output_empty.php sounds like a plan, but would it work properly, inserting previously missed values seamlessly into the RRDs?
I remember that rrdtool complains if you try to update an RRD file with past data when more recent data is already present.

@bmfmancini
Member

bmfmancini commented Aug 26, 2021 via email

@bmfmancini
Member

bmfmancini commented Aug 26, 2021 via email

@bmfmancini
Member

bmfmancini commented Aug 27, 2021

Hey guys

So @bernisys and I caught up and did some more troubleshooting.
In my case it turns out a single device/data source was clogging up the poller_output table with 34k+ records.
The device itself was not having an issue, but the graphs were not populating.

Interestingly enough, the graphs for this device have never populated, which tells me something during the save may have been messed up??

MariaDB [cacti]> select local_data_id, count(1) from poller_output where time < (now() - 300) group by local_data_id order by count(1) desc limit 10;
+---------------+----------+
| local_data_id | count(1) |
+---------------+----------+
|        120133 |    34472 |
+---------------+----------+
1 row in set (0.02 sec)

MariaDB [cacti]>

thanks @bernisys for the query :)

It should be noted that this device was not on the main poller, yet it was clogging the poller_output table on the main poller while its assigned poller was just fine.

TheWitness added a commit that referenced this issue Aug 27, 2021:
When poller_output table is filling over time on the main poller, it can wind up into a critical situation blocking boost
@TheWitness TheWitness added the resolved (A fixed issue) label and removed the unverified (Some days we don't have a clue) label Aug 27, 2021
@TheWitness TheWitness added this to the 1.2.19 milestone Aug 27, 2021
@TheWitness
Member

Please check the latest commit and report back. This will ensure that the poller_output table on the main server does not explode on you due to a missing check.

@bernisys
Contributor Author

@TheWitness wow, thanks, that was quick :)

I still think that someone should take another look at the trigger for the boost mechanism, whether it can be put into a different order or started in parallel somehow. Because if the table overflows due to some other issue, boost could just as well (or as badly) get stuck. I think the start of the process is quite uncritical, as it spans multiple poll cycles anyway, but it needs to happen at some point in time, no matter what. (See my comments in the initial description.)
What's your opinion on that?

@bmfmancini can you give it a shot in your test env?
Our test environment might not be big enough to quickly see the accumulation effect. I haven't looked too deeply into test for this issue, to be honest, as the effects were much more visible in prod. Maybe our L1 can take a deeper look at it on Monday.

Though I still wonder why data is first put into poller_output on the main poller when the remotes flush it out. We have the "populate boost directly" option in the performance settings activated, so shouldn't this influence the behavior? Or are there other constraints which prohibit populating the boost table directly? I checked the documentation - the one on GitHub did not elaborate on this - but then I found a hint in https://docs.cacti.net/Settings-Performance.md. As I understand it, this will basically duplicate the data into both the output and the boost table; am I correct?

In total, let's call it a good weekend and see what next week brings! :)
(oof.. 3 in the morning .. hmm .. sleep? overrated ...)

TheWitness added a commit that referenced this issue Aug 28, 2021
@netniV netniV closed this as completed Sep 5, 2021
@gj00354347

Hi Larry,

We have applied the changes in poller.php, but unfortunately we still have the lingering data source issue in the poller_output table.
We have one DS that is still accumulating in the table, DS ID = 284832, and if you check the RRD file it is updated, but historical data is still in poller_output.

Is there perhaps an off-by-one error somewhere in the scripts which prevents proper removal of the items of one data source?

No other data sources are accumulating.

Best Regards,
Gopal

@netniV netniV changed the title from "When poller_output table is filling over time on the main poller, it can wind up into a critical situation blocking boost." to "Boost may be blocked by overflowing poller_output table" Oct 3, 2021
@gj00354347

gj00354347 commented Oct 13, 2021

Please check the latest commit and report back. This will ensure that the poller_output table on the main server does not explode on you due to a missing check.

Hi Larry,

So there are a few data sources still lingering in the poller_output table, but the strange thing is that their data is being reflected in the graphs, which means poller_output has been transferred to boost and written to the RRDs - yet we still have those items in poller_output.

We see the lingering items in the poller_output table for just 4-5 data sources. Then I checked which data sources they are:

MariaDB [cacti]> select distinct (local_data_id) from poller_output where time < now()-INTERVAL 10 DAY;
+---------------+
| local_data_id |
+---------------+
|        216336 |
|        284832 |
|        567942 |
+---------------+

Then I looked at the graphs and found that the graph is up to date, with one small break, and that only for today; I checked a one-day span for 3 to 4 examples, but no gap is found anywhere.

MariaDB [cacti]> select count(1) from poller_output where local_data_id=284832;
+----------+
| count(1) |
+----------+
|     8680 |
+----------+

MariaDB [cacti]> select name_cache from data_template_data where local_data_id=284832;
+---------------------------------+
| name_cache                      |
+---------------------------------+
| xxxxxxxxxxx - Traffic - Gi1/0/2 |
+---------------------------------+

image

Best Regards,
Gopal

@TheWitness
Member

I don't think this is the right place for that comment, Gopal. I think you should consider opening a fresh ticket.

@github-actions github-actions bot locked and limited conversation to collaborators Jan 12, 2022