Boost may be blocked by overflowing poller_output table #4375
I am seeing similar behaviour, but I don't think it's boost for us. Are you seeing the polling time increase only on the main poller while the remotes are fine? The polling time increases slowly, then at one point jumps up very high, causing polling timeouts for just the main poller while the remotes are just fine. When I zoom into the time of the polling timeouts, boost seems OK. This is on 1.2.16. |
Hi Sean, on the remotes we usually don't see that problem, unless of course the main poller's DB has so much to do that it is completely stuck processing the queries - then the remotes are also affected and sometimes show slightly higher poll times, but that's not significant. We had this a couple of times in the past, and it recovered on its own when we restarted the DB (for upgrades, for example, or to change the DB configuration). So my intention was to deliberately let it peak again this time and check the process table, to see which process behaves differently. I had placed a few debug prints into poller.php the last time we had a similar problem; I guess I might add them again and open a pull request so that everyone can debug their problems a bit better. I remember that last time this happened, our Cacti spent most of its time in the data post-processing subroutines (splitting multi-value lines into the different key/value pairs after retrieval). I'd say you should check your situation over a longer period, to see if values are slowly accumulating in "poller_output" as well. I can add the corresponding poll-time aggregate diagram for reference when I'm back at my work PC tomorrow :) I can also attach a long-term view of the main poll-time diagram; it shows the exponential poll-time peaking a few times, always rising until we restarted the DB during upgrades or to change some DB settings. |
Ah yeah, the same type of behaviour I am seeing. I think because you have more headroom, polling every 5 minutes vs. every minute, it takes longer for you to see the critical impact. I checked the poller_output table on our prod instance that shows this behaviour.
What is weird, however, are the leftover values in poller_output after the polling cycle completes. |
Wait a moment, look at this - check out the dates on these records. What's also funny is that I checked the DS ID below, and it belongs to a device that is not on the main poller.
I thought it might be PHP memory exhaustion, but I don't think so - the table doesn't really get that big.
|
Yeah, on smaller instances this is not happening, so it's a scale thing. @bernisys wanna catch up on a call? |
That's exactly what we are observing - the table is not completely emptied, and older values accumulate over time until the table is completely full. So it would be good if someone from the team could check as well and make it a confirmed bug. And yeah, sure, let's have a call - just ping me a proposal on my company address. We can use Teams; I've got some time tomorrow from early afternoon, starting from 13:00 CEST. Looking forward! :) |
Sounds good to me! I wonder if running poller_output_empty.php in the meantime, say every few weeks, would keep it at bay. It would force the RRD update for the values in the table as well, so it's not like you're losing the data. |
The idea of running poller_output_empty.php sounds like a plan, but would it work properly, inserting previously missed values seamlessly into the RRDs? |
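As a sketch of the interim workaround discussed here - periodically force-flushing the backlog with Cacti's poller_output_empty.php - a crontab fragment could look like the following. The path and the schedule are assumptions; adjust them to your installation, and lab-test first as noted below.

```shell
# Hypothetical crontab fragment (path and cadence are assumptions):
#
#   # flush lingering poller_output rows into the RRDs roughly every two weeks
#   0 3 */14 * * root php /var/www/html/cacti/poller_output_empty.php >/dev/null 2>&1
```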
Yeah, maybe something to lab-test. The other solution would be some sort of
script to remove records from any day before today, but that's a band-aid.
…On Thu., Aug. 26, 2021, 11:17 BerndP, ***@***.***> wrote:
> The idea of running poller_output_empty.php sounds like a plan, but would it work properly, inserting previously missed values seamlessly into the RRDs? I remember that rrdtool complains if you want to update an RRD file with past data when more recent data is already present.
|
Another thing would be to disable polling briefly and truncate the poller
table every few weeks or something like that.
|
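The band-aid mentioned above - removing records from any day before today - could be sketched as below. Table and column names are taken from the discussion; try it on a lab copy first, since the old samples are discarded for good.

```shell
# Band-aid sketch: build a DELETE statement that drops poller_output rows
# older than midnight today, leaving the current day's data alone.
CUTOFF="$(date +%Y-%m-%d) 00:00:00"
SQL="DELETE FROM poller_output WHERE time < '$CUTOFF';"
echo "$SQL"
# In production this would be fed to the database, e.g.:
#   mysql cacti -e "$SQL"
```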
Hey guys, so @bernisys and I caught up and did some more troubleshooting. Interestingly enough, the graphs for this device have never populated, which tells me something during the save may have been messed up.
Thanks @bernisys for the query :) It should be noted that this device was not on the main poller, yet it was clogging the poller_output table on the main poller while its residing poller was just fine. |
…oller When the poller_output table is filling over time on the main poller, it can wind up in a critical situation blocking boost
Please check the latest commit and report back. This will ensure that the poller_output table on the main server does not explode on you due to a missing check. |
@TheWitness wow, thanks, that was quick :) I still think that someone should take another look at the trigger for the boost mechanism, to see whether it can be put into a different order or started in parallel somehow. Because if the table overflows due to some other potential issue, boost could get stuck just as well (or as badly). I think the start of the process is quite uncritical, as it spans multiple poll cycles anyway, but it needs to happen at some point in time, no matter what. (See my comments in the initial description.) @bmfmancini can you give it a shot in your test env? Though I still wonder why data is first put into poller_output on the main poller, when the remotes flush it out. We have the "populate boost directly" option in the performance settings activated, so shouldn't this influence the behavior? Or are there other constraints which prohibit populating the boost table directly? I checked the documentation; the one on GitHub did not elaborate on this, but then I found a hint in https://docs.cacti.net/Settings-Performance.md - as I understand it, this will basically duplicate the data into both the output and the boost table, am I correct? In total, let's call it a good weekend and see what next week brings! :) |
Hi Larry, we have applied the changes in poller.php, but unfortunately we are still having the lingering data source issue in the poller_output table. Is there perhaps an off-by-one error somewhere in the scripts which prevents proper removal of items of one data source? No other data sources are accumulating. Best Regards, |
Hi Larry, so there are a few data sources still lingering in the poller_output table, but the strange thing is that their data is being reflected in the graphs, which means poller_output has been transferred to boost and that got written to the RRDs - yet we still have those items in poller_output. We see the lingering items in the poller_output table for just 4-5 data sources. Then I checked which data sources they are:

MariaDB [cacti]> select distinct(local_data_id) from poller_output where time < now() - INTERVAL 10 DAY;

Then I looked at the graphs and found that each graph is up to date, with one small break, and that only for today; I checked 3 to 4 examples over a one-day span, but found no gap anywhere.

MariaDB [cacti]> select count(1) from poller_output where local_data_id=284832;
MariaDB [cacti]> select name_cache from data_template_data where local_data_id=284832;

Best Regards, |
I don't think this is the right place for that comment, Gopal. I think you should consider opening a fresh ticket. |
Describe the bug
We just encountered a situation where the poller_output table was filling up more and more on the main poller, without any recognizable reason. In the end the (already quite generously dimensioned) memory table was filled with 1.9M entries, and from that point in time everything started to go haywire.
What happened was:
We mitigated manually, more or less. I tried to understand how boost works internally, figuring out the workings of the poller_output_boost_local_data_ids table, checking the lowest ID and dropping everything below it.
That made things worse at first - processing now took 10 times longer on the probably fragmented table.
I assume this is probably one main reason why boost no longer deletes entries from the arch tables...
Restarting MySQL and waiting a bit somehow got this back under control, so the first 328M was processed in the end, and processing the new 260M bunch took a lot less time (curse you, exponential time growth... it's like corona spreading...).
Now we seem to have recovered with only minimal data loss (the 1.9M entries in the in-memory table, but who cares).
Still, it seems there is some major misalignment in the boost cycle handling that causes this to happen.
Ideas to fix this are below. :)
To Reproduce
Not sure how this can be reproduced 100% - in our case it just took a lot of time to show up, and it seems to be recurring, because we see values slowly accumulating in the poller_output table again. We already saw similar behavior in the past, with polling time always peaking exponentially (the last times, on 1.2.16 and 1.2.17, it recovered on its own though - this could be a side effect of some changes in 1.2.18).
Not sure why this happens; the table content should be completely processed into the boost table.
Just put monitoring on the number of entries in that table and see if it is rising.
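The monitoring suggested above could be sketched as below. The threshold and the helper name are assumptions; size the limit to your normal per-cycle row volume.

```shell
# Monitoring sketch: warn when the poller_output backlog exceeds a threshold.
check_backlog() {   # usage: check_backlog <row_count> <threshold>
  if [ "$1" -gt "$2" ]; then
    echo "WARNING: poller_output backlog $1 exceeds $2"
  else
    echo "OK: backlog $1"
  fi
}

# In practice the row count would come from the DB, e.g.:
#   ROWS=$(mysql -N cacti -e "SELECT COUNT(*) FROM poller_output")
check_backlog 150000 100000
```

Hooked into cron or an existing monitoring system, a steadily rising count is the early-warning sign described in this issue.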
Expected behavior
Boost should run, no matter what, and always obey the configured limits. Otherwise the system will get slower and slower due to the usual O(something) DB processing-time constraints. More entries mean exponentially more processing time, which kills the whole system.
It would probably be wiser to add a safety limit on the number of boost table entries; when that limit is reached, the table gets moved away (creating a new arch table with a timestamp) and the poller continues collecting data in a fresh table.
This would also mean that inside boost, the table deletion must be altered a bit:
Currently it seems that boost deletes all the tables it finds after it finishes whatever it had done before. This means that if another process created a new arch table in the meantime, it would also be deleted at the end of the boost run, without being processed.
It would probably be a better approach to do the deletion inside the loop where the arch tables get processed sequentially, and not at the very end. Or alternatively, remember the names of the tables that were completed and delete only those at the end.
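The rotation idea proposed above could be sketched as below. All table names are assumptions in the spirit of boost's arch tables: when poller_output hits the safety limit, move it aside under a timestamped name and recreate it empty, so the poller keeps writing while boost drains the archive later.

```shell
# Rotation sketch: generate the SQL that would rename the full table away
# under a timestamped archive name and recreate an empty one in its place.
STAMP=$(date +%s)
ARCH="poller_output_arch_${STAMP}"
cat <<EOF
RENAME TABLE poller_output TO ${ARCH};
CREATE TABLE poller_output LIKE ${ARCH};
EOF
# Crucially, boost would then DROP each arch table right after it has been
# fully processed inside its loop, not all at once at the very end, so an
# arch table created mid-run can never be deleted unprocessed.
```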
Screenshots
Boost table content over time:
Additional context