
Poller recovery starts multiple processes and fails to recover properly #4215

Closed · bernisys opened this issue Apr 8, 2021 · 21 comments
Labels: bug (Undesired behaviour), resolved (A fixed issue)
Milestone: v1.2.17

@bernisys (Contributor) commented Apr 8, 2021

Describe the bug

First of all, this probably appears only in really large setups, where the recovery is taking more time than a polling cycle.

When connectivity returns after an outage in a remote poller setup, the poller recovery does not work properly.
It spawns multiple processes and causes a lot of DB stress, because it does not take into account that a process is already running.

I also found the actual flaw in the code and I think I was able to fix it.

To Reproduce

Let a really large (in terms of data sources) remote poller go into "Heartbeat" mode and let it recover again.
After a while you will see multiple processes running, all trying to do the same thing: starting the data sync from scratch over and over again.

Expected behavior

Poller recovery should run only once.
Every subsequent call should detect that a process is running already.

Additional context

The problem is that "recovery_pid" is never inserted into the corresponding table; it is only deleted at one point in the code.
As a result, subsequent processes always see an empty value (because there is no "recovery_pid" entry in the "settings" table) and start the recovery all over again.
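For illustration, the missing guard amounts to something like the sketch below. This is not the actual poller_recovery.php code: db_fetch_cell_prepared(), db_execute_prepared() and the settings table are standard Cacti, but the placement, the messages and the posix_kill() liveness check are assumptions.

```php
// Sketch only - illustrates the guard described above, not the exact
// poller_recovery.php code.  Assumes Cacti's CLI bootstrap has already
// been included so the database helpers are available.

// Look up the PID of a possibly already running recovery process.
$recovery_pid = db_fetch_cell_prepared(
	'SELECT value FROM settings WHERE name = ?',
	array('recovery_pid')
);

// If a previous recovery run is still alive, bail out instead of starting
// the data sync from scratch a second time.
if (!empty($recovery_pid) && function_exists('posix_kill') && posix_kill((int) $recovery_pid, 0)) {
	print 'NOTE: Poller recovery is already running as PID ' . $recovery_pid . PHP_EOL;
	exit(1);
}

// Record our own PID so that subsequent invocations can see it.  This is
// the INSERT that was missing - without it the check above always sees an
// empty value and every invocation starts the recovery again.
db_execute_prepared(
	'INSERT INTO settings (name, value) VALUES (?, ?)',
	array('recovery_pid', getmypid())
);
```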

bernisys added the bug and unverified labels on Apr 8, 2021
@bmfmancini (Member) commented Apr 8, 2021 via email

@TheWitness (Member)

You need to move that pull request to the 1.2.x branch. I've closed it for now.

@bernisys (Contributor, Author) commented Apr 8, 2021

Sure, will do so. Sorry, I didn't know.

@bernisys (Contributor, Author) commented Apr 8, 2021

That should be better now.

@bmfmancini - could you please verify whether it works for you as well?

In our case I have lately seen that the PIDs written into the table differ from those of the running processes.
Still, there is now just one process running in our setup.

@bmfmancini (Member) commented Apr 8, 2021 via email

TheWitness pushed a commit that referenced this issue Apr 8, 2021
New fix for #4215 - now in correct branch.
TheWitness added the resolved label and removed the unverified label on Apr 8, 2021
TheWitness added this to the v1.2.17 milestone on Apr 8, 2021
@bmfmancini (Member)

Initial tests are looking good

@bernisys (Contributor, Author) commented Apr 8, 2021

Somehow I think this is not yet the complete truth. I have observed the recovery process for a while now, and it is still not working entirely as it should.

The INSERT statement that I added does not always work properly, especially when the process times out and is killed. (Which should not happen, IMHO - the script should keep running; otherwise the initial master->poller replication of those four tables may happen over and over again, which costs valuable time and lets the output table fill up.)

I have now added "ON DUPLICATE KEY UPDATE value=VALUES(value)" to the patch - it is looking much better now.
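For reference, the upsert described here has roughly this shape (a sketch; the exact statement in the patch may differ slightly):

```php
// Sketch of the upsert: if a 'recovery_pid' row is already present (for
// example left over from a killed run), update it instead of failing on
// the duplicate key.
db_execute_prepared(
	'INSERT INTO settings (name, value) VALUES (?, ?)
	 ON DUPLICATE KEY UPDATE value = VALUES(value)',
	array('recovery_pid', getmypid())
);
```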
Replication starts by pulling these four tables to the poller; after that it does something silently in the background ...
I would wish for more verbosity here by default. For example, seeing a line like this every once in a while would be nice:
"pushing <n> out of <max_n> values back to main"
That way we could see much better how the sync is proceeding, and it should not just be debug output but default output, since it occurs not in a debug session but after an emergency situation.
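As an illustration only, such a progress line could be emitted with Cacti's cacti_log(); the message text and the counter variables here are hypothetical:

```php
// Illustrative only - $records_pushed and $records_total are hypothetical
// counters the recovery loop would have to maintain.
cacti_log(
	sprintf('POLLER RECOVERY: pushing %d out of %d values back to main',
		$records_pushed, $records_total),
	true, 'RECOVERY'
);
```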

Also, I am still a bit worried about the initial "undefined offset" notice at the start; it should be caught and logged properly, so that we know what this kind of error really means. In fact there are a lot of other errors that should be explicitly handled. You might know the answer, but someone new to the tool is certainly lost :)

$ php ./poller_recovery.php
PHP Notice:  Undefined offset: 1 in /opt/SP/cacti/cacti/lib/database.php on line 415
2021-04-08 19:27:48 - WEBUI NOTE: Table data_local Replicated to Poller With 68208 Rows Updated
2021-04-08 19:28:07 - WEBUI NOTE: Table graph_local Replicated to Poller With 66929 Rows Updated
2021-04-08 19:29:45 - WEBUI NOTE: Table host_snmp_cache Replicated to Poller With 247247 Rows Updated
2021-04-08 19:29:45 - WEBUI NOTE: Table poller_reindex Replicated to Poller With 491 Rows Updated
PHP Fatal error:  Maximum execution time of 300 seconds exceeded in /opt/SP/cacti/cacti/poller_recovery.php on line 217

But the PID of the newly spawned process is now correct in the DB.
I will prepare that fix now and offer another pull request.

I have to get a bit more used to the internals.

TheWitness added a commit that referenced this issue Apr 8, 2021
* fix poller recovery running process detection

New fix for #4215 - now in correct branch.

* additional fix for running process detection

If the old process is killed and the PID-value is still in the DB, we need to use "update" as a fallback.

* Update poller_recovery.php

Co-authored-by: Bernd <berni@berni.lan>
Co-authored-by: TheWitness <thewitness@cacti.net>
@bernisys (Contributor, Author) commented Apr 9, 2021

An update on our situation: all pollers have now recovered after I applied all of the fixes above.
Additionally, I ran a copy of the recovery script with the following changes:

  • increased the execution time at the top of the script, so that it survives longer (see the sketch below)
  • dropped the initial table back-sync to the remote poller, because it costs a lot of time
  • increased the record limit to 150000 - this really changed a lot
  • added a few debug messages to see what is going on internally (shall I prepare a pull request? I think they are helpful in general)
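As a sketch, the tweaks from the list above amount to something like this near the top of the copied script (the values are the ones from this run, not upstream defaults):

```php
// Local tweaks in a copy of poller_recovery.php - sketch only.

// 1) Let the script survive longer than the 300 s it was getting.
ini_set('max_execution_time', '1800');

// 2) The initial table back-sync block was simply commented out for the
//    repeated manual runs (not shown here).

// 3) Push more records per loop iteration towards the main server.
$record_limit = 150000;
```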

It then took much less time to sync all the 20M records that had accumulated on the last poller.
That one never got to the point where it could actually sync data, because the initial table back-sync was already eating up the 300 seconds and the script was killed by PHP.

But from what I could observe, the sync loop itself also has its rough corners. One of the main culprits seems to be the general slowness of the first query in the loop, which determines the maximum poll time for the next batch of records out of a very large table. This query has to go through all the data in the table, and the more data there is, the slower it gets (as it needs to sort the whole lot of entries by poll time).
A second query then pulls the actual records from the table. So in principle there are two queries doing a very similar thing, except that the second one filters on the maximum poll time first and needs to sort fewer records.
I am not sure whether this can be done more efficiently, though, as the data has to be sorted anyway, one way or the other.
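For illustration, the pattern of the two queries is roughly the following sketch; the real script's table (shown here as poller_output), column list and batch boundaries may differ:

```php
// Query 1: determine the cut-off timestamp for the next batch by scanning
// a large part of the (potentially huge) output table - the slow part.
$max_time = db_fetch_cell('
	SELECT MAX(time) FROM (
		SELECT time
		FROM poller_output
		ORDER BY time ASC
		LIMIT ' . $record_limit . '
	) AS batch');

// Query 2: pull the actual records up to that timestamp so they can be
// pushed to the main server - this sorts far fewer rows thanks to the filter.
$rows = db_fetch_assoc_prepared('
	SELECT local_data_id, rrd_name, time, output
	FROM poller_output
	WHERE time <= ?
	ORDER BY time ASC',
	array($max_time));
```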

Conclusions:

With my manual runs, the most beneficial steps were (I applied them in the order below):

  • skipping the initial table sync
    This probably cannot be skipped every time, but for subsequent script runs it is beneficial:
    if the last sync was just a few minutes ago, the data will not have changed that much on the main system.
  • increasing the hard-coded record count
    50k got us going, 150k was great -
    but be careful, the main DB will see a lot more records during recovery!
  • increasing the run time
    This should be increased by default to at least 30 minutes in larger setups.

That really worked out fine. The $record_count should definitely be configurable in the GUI, ideally on a per-poller basis, so that the value can be adapted to each situation. Some pollers are more heavily loaded and need a higher value; others are under low load and hence naturally sync faster, because fewer rows accumulate on them during an outage.
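One minimal way such a setting could be consumed, assuming Cacti's read_config_option() and a hypothetical option name (a true per-poller value would additionally need per-poller storage):

```php
// Hypothetical setting name 'recovery_record_limit' - falls back to the
// current hard-coded value when nothing is configured in the GUI.
$record_limit = read_config_option('recovery_record_limit');
if (empty($record_limit)) {
	$record_limit = 50000;
}
```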

@TheWitness (Member)

What version are you running? Right now poller_recovery.php uses the entirety of the max allowed packet. It seems you are still using version 1.2.16.

@TheWitness (Member)

But dang, I do see an issue. The CLIs should not have a 300-second run limit, which is what is happening right now. That's a bigger bug.

@TheWitness (Member)

Making another change to this. It should make things better. @bernisys & @bmfmancini, please test in your labs as soon as it's in.

TheWitness added a commit that referenced this issue Apr 10, 2021
Poller recovery starts multiple processes and fails to recover properly
@bernisys (Contributor, Author) commented Apr 10, 2021

@TheWitness
Those 300 seconds ... yes, I had seen that as well, and I had fixed it by adding a corresponding line to the top of the script, like in the replication script. The timeout did not match my PHP settings, so I figured it had to come from somewhere else ... and 300 s smells strongly like a polling-cycle limitation in some upstream script or library. I see you have found the culprit already; I will adapt our setup accordingly.

BTW, we are currently using 1.2.14 in prod and 1.2.16 in test, but I have already ported a few of the not-too-complex fixes from 1.2.17 into our prod. I will add the above changes on Monday when I am back at work - I don't really want to do that when basically nobody is around ... :)

And how hard/complex would it be to add the $record_limit as a global setting?
It would be a lot more convenient to be able to adapt this value in the GUI. It would probably also help to have some additional logging in the recovery script to see how large the data packets get; that helps a lot in fine-tuning the value.
With 150k records I saw packet sizes of around 8 MB, which is still a lot less than the value recommended in the Cacti SQL configuration documentation or in the hints you see in the GUI at install time.

Let me quickly check and propose another pull request for the recovery script with those log messages; I had added them in our prod to see what is going on internally.

@TheWitness (Member)

The 300 was coming from cli_check.php, which I changed to -1 (unlimited). As far as the record limit goes, that's a good point. I set it to 150k on the slurp, and on the push, it's dynamic based upon the max_packet_size.
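In spirit, the combination described here looks something like the following sketch (not the literal cli_check.php or poller_recovery.php code; db_fetch_row() is a standard Cacti helper):

```php
// Unlimited run time for CLI scripts (the former 300 s limit).
ini_set('max_execution_time', '-1');

// Size the push dynamically from the server's max_allowed_packet instead
// of relying on a fixed record count, leaving some headroom.
$max_packet   = db_fetch_row("SHOW VARIABLES LIKE 'max_allowed_packet'");
$packet_limit = isset($max_packet['Value']) ? intval($max_packet['Value'] * 0.9) : 1000000;
```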

@bernisys (Contributor, Author)

@TheWitness - please check and comment
1.2.x...bernisys:1.2.x

I've made a few changes that make the recovery process a lot more verbose.
(I took the liberty of renaming the $count variable in that part, as it was a bit too terse for my taste.)

@TheWitness (Member)

I'm okay with more verbosity. That lib/database.php change should be backed out. Add some white-space after 197 and 199.

@TheWitness (Member)

Do a fresh pull once you've tested.

@bernisys (Contributor, Author)

Okay, I think I will start over with a more granular approach and create a branch for each change. Doing all of that in the main branch is generally a bad idea; I should have seen that coming. I just didn't anticipate the possibility of contributing more and more - it turns out this could become a new hobby ...

@netniV (Member) commented Apr 11, 2021

That's how it happens to us all.

TheWitness pushed a commit that referenced this issue Apr 11, 2021
…, code cleanup (#4224)

* removed unused variable "$end_count"

* renamed $count to the more descriptive $record_size

* renamed $inserted to $records_inserted

* code clean-up

* add log verbosity

When sending data to the main server, this happens in portions and uses
several DB queries. For optimization purposes, we want to know how much
time these queries took.
It is also good to know how fast the recovery is processing data, hence
we also log the maximum timestamp of the current loop's data portion.

* correction of recovery core loop

The previous handling was a bit off. Although this was technically not
critical, it did not calculate the data sizes correctly, which led to
wrong annotations in the logs.

Introducing the helper variable $sql_size makes it possible to keep proper
track of the size of the SQL string for the currently processed record.
Previously the size was added even when the current record was not included
in the current loop run, and after transferring the complete data frame
the size was zeroed, so the next size did not reflect this record.

Adding $record_count allows logging the actual amount of data that was
transferred to the main poller in the current loop cycle.

Co-authored-by: Bernd <berni@berni.lan>
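The corrected core loop that this commit describes could look roughly like the sketch below; the variable names follow the commit message, while push_to_main(), $rows and $packet_limit are assumed context rather than Cacti's actual code:

```php
// Sketch only - the surrounding code (building $rows, $packet_limit and
// the actual transfer in the hypothetical push_to_main()) is assumed.
$sql_parts      = array();
$record_size    = 0;   // size of the SQL accumulated for the current batch
$record_count   = 0;   // records contained in the current batch
$batch_max_time = '';

foreach ($rows as $row) {
	$value = '(' . $row['local_data_id'] . ',' . db_qstr($row['rrd_name']) . ',' .
		db_qstr($row['time']) . ',' . db_qstr($row['output']) . ')';

	// Size of *this* record only - previously it was added to the batch
	// size even when the record was deferred to the next batch.
	$sql_size = strlen($value);

	// Flush first if adding this record would overflow the batch, so the
	// record is counted in the batch it actually travels with.
	if ($record_size + $sql_size > $packet_limit && count($sql_parts)) {
		push_to_main($sql_parts);   // hypothetical helper doing the INSERT on main
		cacti_log("POLLER RECOVERY: pushed $record_count records, up to $batch_max_time", true, 'RECOVERY');

		$sql_parts    = array();
		$record_size  = 0;
		$record_count = 0;
	}

	$sql_parts[]    = $value;
	$record_size   += $sql_size;
	$record_count++;
	$batch_max_time = $row['time'];
}

// Push whatever is left over after the loop.
if (count($sql_parts)) {
	push_to_main($sql_parts);
	cacti_log("POLLER RECOVERY: pushed $record_count records, up to $batch_max_time", true, 'RECOVERY');
}
```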
@bernisys (Contributor, Author)

@netniV tell me about it ... ;)

FYI: I just finished the pull request, so please have a look at the changes in my branch.
1.2.x...bernisys:poller-recovery

While looking through the code I saw another minor flaw, which I have corrected as well, and I did a little clean-up of some of the variable names and initialization positions. It makes the code a bit more readable for people who take a fresh look at it.

I will try to be as verbose as possible with the changes I propose, so that their purpose becomes as clear as possible.
I also try to keep clean-ups separate from actual changes; this avoids people getting confused by additional changed lines that have nothing to do with the actual fix.

Hope the code is as clean as you would expect.

TheWitness added a commit that referenced this issue Apr 12, 2021
This change modifies logic slightly as follows:

1) Retry only twice for a connection (speed up the GUI in offline mode)
2) Don't update the resource cache when in offline or recovery mode, as you could downgrade yourself
3) Don't call the flush_boost() function if you are not a remote data collector
4) Better logging for automation to indicate offline or remote
TheWitness added a commit that referenced this issue Apr 12, 2021
@TheWitness (Member)

Okay, I've tested everything here. Looks good now.

@TheWitness (Member)

More testing... Still an issue.

github-actions bot locked and limited the conversation to collaborators on Jul 13, 2021