Describe the bug
We just encountered a situation where the poller_output table was filling up more and more on the main poller, without any recognizable reason. In the end the (already generously sized) memory table contained 1.9M entries, and from that point on everything started to go haywire.
What happened was:
- during the past 2 months we saw a steady increase of polling time on the main poller
- on Sunday morning around 2:00 local time, the cycle times reached 300 seconds
- from that point on, boost simply stopped working and items accumulated in poller_output_boost
- after several hours, when we came to work on Monday, there were already >200M entries
- we tried to find out what had happened, and the only thing we saw was a series of "Poller table full" messages in the logs
- we cleared that table, and our boost stats came back, with a somewhat shocking result (see attached image below)
- the ~330M entries in the boost arch table took well over 24h to process
- in the meantime, another batch of >200M entries accumulated in the next boost table
We mitigated this more or less manually: I tried to understand how boost works internally, figured out the workings of the poller_output_boost_local_data_ids table, checked the lowest ID and dropped everything below it.
That made things worse at first - processing now took 10 times longer on the (probably fragmented) table.
I assume this is probably one of the main reasons why boost no longer deletes entries from the arch tables...
Restarting MySQL and waiting a bit got this somewhat under control, so the first 328M entries were processed in the end, and the new batch of 260M took a lot less time (curse you, exponential time growth... it's like corona spreading...).
Now we seem to have recovered with only minimal data loss (the 1.9M entries in the in-memory table, but who cares).
Still, it seems there is some major misalignment in the boost cycle handling that causes this to happen.
Ideas to fix this are below. :)
To Reproduce
Not sure how this can be reproduced 100% - in our case it just took a long time to show up, and it seems to be recurring: we see values slowly accumulating in the poller_output table again. We already saw similar behavior in the past, with polling time always peaking exponentially (the last times, with 1.2.16 and 1.2.17, it recovered on its own - this could be a side effect of some changes in 1.2.18).
Not sure why this happens; the table content should be processed completely into the boost table.
Just put monitoring on the number of entries in that table and see if it is rising.
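A minimal sketch of such a monitoring check, using an in-memory SQLite database as a stand-in for the real Cacti MySQL tables (the table name poller_output is real; the threshold and function name are hypothetical and would need tuning to your poller size):

```python
import sqlite3

# Stand-in for the Cacti DB; in production you would query MySQL instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE poller_output (local_data_id INTEGER, rrd_name TEXT)")
conn.executemany("INSERT INTO poller_output VALUES (?, ?)",
                 [(i, "traffic_in") for i in range(5000)])

ALERT_THRESHOLD = 1000  # hypothetical limit; tune to your environment

def poller_output_backlog(db):
    """Return the current number of unprocessed rows in poller_output."""
    return db.execute("SELECT COUNT(*) FROM poller_output").fetchone()[0]

backlog = poller_output_backlog(conn)
if backlog > ALERT_THRESHOLD:
    print(f"WARNING: poller_output backlog at {backlog} rows")
```

Wiring that count into an existing monitoring system (as a gauge metric) would have caught the steady rise long before the table filled up.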
Expected behavior
Boost should run no matter what, and always obey the configured limits. Otherwise the system will get slower and slower due to the usual O(something) DB processing time constraints: more entries mean exponentially more processing time, which kills the whole system.
It would probably be wiser to add a safety limit on the number of boost table entries; once that limit is reached, the table gets moved away (creating a new arch table with a timestamp) and the poller collects further data in a fresh table.
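The rotation idea could look roughly like this - again an SQLite sketch standing in for MySQL, with a simplified schema; SAFETY_LIMIT and rotate_if_full are hypothetical names, not anything boost currently has:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE poller_output_boost"
             " (local_data_id INTEGER, time TEXT, output TEXT)")
conn.executemany("INSERT INTO poller_output_boost VALUES (?, ?, ?)",
                 [(i, "2021-01-01 00:00:00", "42") for i in range(150)])

SAFETY_LIMIT = 100  # hypothetical cap on live boost table rows

def rotate_if_full(db, limit=SAFETY_LIMIT):
    """Move a full boost table aside as a timestamped arch table, start fresh."""
    count = db.execute("SELECT COUNT(*) FROM poller_output_boost").fetchone()[0]
    if count < limit:
        return None
    arch = f"poller_output_boost_arch_{int(time.time())}"
    db.execute(f"ALTER TABLE poller_output_boost RENAME TO {arch}")
    db.execute("CREATE TABLE poller_output_boost"
               " (local_data_id INTEGER, time TEXT, output TEXT)")
    return arch

arch_name = rotate_if_full(conn)  # 150 > 100, so the table is rotated away
```

A rename plus fresh CREATE is cheap compared to deleting millions of rows, which is exactly what you want when the table is already overloaded.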
This would also mean that inside boost, the table deletion must be altered a bit:
Currently it seems that boost deletes all the tables it finds after it finishes whatever it was doing before. This means that if another process created a new arch table in the meantime, it would also be deleted at the end of the boost run, without ever being processed.
It would probably be a better approach to do the deletion inside the loop where the arch tables are processed sequentially, rather than at the very end. Alternatively, remember the names of the tables that were completed and delete only those at the end.
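A sketch of the safer deletion order (SQLite again as a stand-in; process_arch_tables is a hypothetical name, and the RRD update step is elided): each arch table is dropped immediately after it is processed, so a table created by another process mid-run can never be swept away unprocessed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("poller_output_boost_arch_1", "poller_output_boost_arch_2",
             "poller_output_boost_arch_3"):
    conn.execute(f"CREATE TABLE {name} (local_data_id INTEGER, output TEXT)")

def process_arch_tables(db, arch_tables):
    """Drop each arch table right after processing it, never in bulk at the end."""
    completed = []
    for table in arch_tables:
        rows = db.execute(f"SELECT * FROM {table}").fetchall()
        # ... hand `rows` to the RRD update step here ...
        db.execute(f"DROP TABLE {table}")  # safe: only tables we actually processed
        completed.append(table)
    return completed

# arch_3 was "created in the meantime" and is not in this run's work list,
# so it must survive the run untouched:
done = process_arch_tables(conn, ["poller_output_boost_arch_1",
                                  "poller_output_boost_arch_2"])
```

The same guarantee falls out of the "remember completed names" variant; the key point is that the drop list is derived from what was processed, not from a fresh scan of existing tables.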
Screenshots
Boost table content over time:
