When remote poller is in offline mode, GUI can become inaccesible and poller can timeout #4896

Closed
bmfmancini opened this issue Aug 18, 2022 · 16 comments
Labels
bug (Undesired behaviour), confirmed (Bug is confirm by dev team), question (A question not a bug), remote data collection (Issue related to remote data collection), resolved (A fixed issue)
Milestone

Comments

@bmfmancini
Member

bmfmancini commented Aug 18, 2022

We have been performing testing on 1.2.22.

During failure testing we found that if you fail the master poller, the remote poller GUI is unusable and spine times out.

Spine spends a lot of time spawning script server processes. This behaviour is only seen in offline mode.

2022-08-18 14:23:37 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[1] PHP Script Server Child FORK Success
2022-08-18 14:23:55 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[1] Confirmed PHP Script Server running using readfd[17], writefd[16]
2022-08-18 14:23:55 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[2] PHP Script Server Routine Starting
2022-08-18 14:23:55 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[2] PHP Script Server About to FORK Child Process
2022-08-18 14:23:55 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[2] PHP Script Server Child FORK Success
2022-08-18 14:24:13 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[2] Confirmed PHP Script Server running using readfd[19], writefd[18]
2022-08-18 14:24:13 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[3] PHP Script Server Routine Starting
2022-08-18 14:24:13 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[3] PHP Script Server About to FORK Child Process
2022-08-18 14:24:13 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[3] PHP Script Server Child FORK Success
2022-08-18 14:24:31 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[3] Confirmed PHP Script Server running using readfd[21], writefd[20]
2022-08-18 14:24:31 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[4] PHP Script Server Routine Starting
2022-08-18 14:24:31 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[4] PHP Script Server About to FORK Child Process
2022-08-18 14:24:31 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[4] PHP Script Server Child FORK Success
2022-08-18 14:24:49 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[4] Confirmed PHP Script Server running using readfd[23], writefd[22]
2022-08-18 14:24:49 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[5] PHP Script Server Routine Starting
2022-08-18 14:24:49 - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: SS[5] PHP Script Server About to FORK Child Process
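
To put a number on how slowly the script servers come up, the gap between each "Child FORK Success" line and the matching "Confirmed PHP Script Server running" line can be computed from the log. The following is only a minimal sketch: `script_server_startup_delays()` is a hypothetical helper, not part of Cacti or spine, and it assumes the log line layout shown above.

```php
<?php
// Sketch: measure how long each spine script server took to start.
// script_server_startup_delays() is a hypothetical helper, not a Cacti function.

function script_server_startup_delays(array $lines): array {
    $forked = array(); // SS id => unix time of "Child FORK Success"
    $delays = array(); // SS id => seconds until "Confirmed ... running"
    $regex  = '/^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - SPINE: .*SS\[(\d+)\] (.*)$/';

    foreach ($lines as $line) {
        if (!preg_match($regex, trim($line), $m)) {
            continue; // not a spine script server line
        }

        // Only differences matter, so parse every timestamp as UTC.
        $time = strtotime($m[1] . ' UTC');
        $id   = (int) $m[2];

        if (strpos($m[3], 'Child FORK Success') !== false) {
            $forked[$id] = $time;
        } elseif (strpos($m[3], 'Confirmed PHP Script Server running') !== false && isset($forked[$id])) {
            $delays[$id] = $time - $forked[$id];
        }
    }

    return $delays;
}

// The first six log lines from the report above.
$prefix = ' - SPINE: Poller[1] PID[23532] PT[139764272642176] DEBUG: ';
$log = array(
    '2022-08-18 14:23:37' . $prefix . 'SS[1] PHP Script Server Child FORK Success',
    '2022-08-18 14:23:55' . $prefix . 'SS[1] Confirmed PHP Script Server running using readfd[17], writefd[16]',
    '2022-08-18 14:23:55' . $prefix . 'SS[2] PHP Script Server Routine Starting',
    '2022-08-18 14:23:55' . $prefix . 'SS[2] PHP Script Server About to FORK Child Process',
    '2022-08-18 14:23:55' . $prefix . 'SS[2] PHP Script Server Child FORK Success',
    '2022-08-18 14:24:13' . $prefix . 'SS[2] Confirmed PHP Script Server running using readfd[19], writefd[18]',
);

$delays = script_server_startup_delays($log);

foreach ($delays as $id => $seconds) {
    echo "SS[$id]: {$seconds}s\n";
}
// prints:
// SS[1]: 18s
// SS[2]: 18s
```

In this capture every script server needs 18 seconds between fork and confirmation, so starting several of them one after another already consumes a large share of a polling cycle.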

I checked via packet capture for excessive retries to the primary poller but did not find any

In the logs I am seeing the remote trying to push data to main constantly

2022-08-18 14:37:36 - CMDPHP SQL Backtrace: (/poller.php[759]:poller_push_data_to_main(), /lib/poller.php[1957]:poller_push_table(), /lib/poller.php[1995]:db_execute(), /lib/database.php[272]:db_execute_prepared())
2022-08-18 14:37:36 - CMDPHP ERROR: A DB Exec Failed!, Error: MySQL server has gone away
2022-08-18 14:37:36 - CMDPHP SQL Backtrace: (/poller.php[759]:poller_push_data_to_main(), /lib/poller.php[1957]:poller_push_table(), /lib/poller.php[1995]:db_execute(), /lib/database.php[272]:db_execute_prepared())
2022-08-18 14:37:36 - CMDPHP ERROR: A DB Exec Failed!, Error: MySQL server has gone away
2022-08-18 14:37:36 - CMDPHP SQL Backtrace: (/poller.php[759]:poller_push_data_to_main(), /lib/poller.php[1957]:poller_push_table(), /lib/poller.php[1995]:db_execute(), /lib/database.php[272]:db_execute_prepared())
2022-08-18 14:37:36 - CMDPHP ERROR: A DB Exec Failed!, Error: MySQL server has gone away
2022-08-18 14:37:36 - CMDPHP SQL Backtrace: (/poller.php[759]:poller_push_data_to_main(), /lib/poller.php[1946]:poller_push_table(), /lib/poller.php[1995]:db_execute(), /lib/database.php[272]:db_execute_prepared())
2022-08-18 14:37:36 - CMDPHP ERROR: A DB Exec Failed!, Error: MySQL server has gone away
2022-08-18 14:37:36 - CMDPHP SQL Backtrace: (/poller.php[729]:db_fetch_cell_prepared(), /lib/database.php[484]:db_execute_prepared())
2022-08-18 14:37:36 - CMDPHP ERROR: A DB Cell Failed!, Error: MySQL server has gone away
@bmfmancini added the "bug" (Undesired behaviour) and "unverified" (Some days we don't have a clue) labels on Aug 18, 2022
@TheWitness
Member

You are running out of connections.

@TheWitness added the "question" (A question not a bug) label and removed the "bug" (Undesired behaviour) and "unverified" (Some days we don't have a clue) labels on Aug 19, 2022
@bmfmancini
Member Author

Let me check. I didn't see any "too many connections" logs.

@bmfmancini
Member Author

Connections are fine:

MariaDB [cacti]> show status like '%max_used%';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Max_used_connections | 242   |
+----------------------+-------+
1 row in set (0.00 sec)

MariaDB [cacti]> show variables where variable_name =  'max_connections';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 5000  |
+-----------------+-------+
1 row in set (0.00 sec)

@bmfmancini
Member Author

OK, interesting find: if you fail the MariaDB process on the main poller, this issue is not seen. The remote poller is accessible and polling completes without issues.

When the primary server is not reachable because the network connection is down, that's when the remote starts acting up. The local GUI is not accessible and the logs show polling timing out, but the boost table has valid rows.

When the network to the primary is restored, recovery does kick in and valid records are passed to the main's boost table.

@TheWitness
Member

You should reduce your timeout and retry number. The defaults are too high.

@bmfmancini
Member Author

Yeah, I thought the same.

I'll try that on Monday

@bmfmancini
Member Author

Tested the script server timeout: dropped it from the default 40 seconds to 20, no difference.

@bmfmancini
Member Author

Brought the retry count for spine down from 5 to 3, same result.

@TheWitness
Member

I used the same process as you with 1 remote and cannot reproduce your problem. This might be a plugin issue. What plugins are you running?

@bmfmancini
Member Author

bmfmancini commented Aug 24, 2022 via email

@bmfmancini
Member Author

@TheWitness and I spoke about this. The idea is to create a setting on the remote pollers for the main server connection timeout, using a PDO timeout.

Example:

$DBH = new PDO(
    "mysql:host=$host;dbname=$dbname", 
    $username, 
    $password,
    array(
        PDO::ATTR_TIMEOUT => 5, // in seconds
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION
    )
);

The reason for this is that the MySQL/MariaDB client takes longer to time out when the server does not respond at all.
If the primary DB is simply down but the server is up, the response is "connection refused", which is immediate. When the host itself is unreachable, the client waits out the full timeout and then attempts to retry.
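
The difference between the two failure modes can be seen with a plain TCP probe that uses a short, explicit timeout. This is only a sketch of the idea, not what Cacti does: `probe_main_db()` is a hypothetical helper, and port 3306 and the 5 second default are assumptions taken from the PDO example above. The demo stays on loopback, where a closed port is refused at once; an unreachable host would instead block until the timeout expires.

```php
<?php
// Sketch: TCP probe of the main database with a short, explicit timeout.
// probe_main_db() is a hypothetical helper, not a Cacti function.

function probe_main_db(string $host, int $port = 3306, float $timeout = 5.0): array {
    $start  = microtime(true);
    $errno  = 0;
    $errstr = '';

    // '@' silences the PHP warning; the failure is reported via $errno/$errstr.
    $socket  = @fsockopen($host, $port, $errno, $errstr, $timeout);
    $elapsed = microtime(true) - $start;

    if ($socket === false) {
        return array('reachable' => false, 'elapsed' => $elapsed, 'error' => $errstr);
    }

    fclose($socket);

    return array('reachable' => true, 'elapsed' => $elapsed, 'error' => '');
}

// Loopback demo: open a listener on a free port, probe it, close it, probe again.
$server = stream_socket_server('tcp://127.0.0.1:0', $srv_errno, $srv_errstr);
$name   = stream_socket_get_name($server, false); // local address as "ip:port"
$port   = (int) substr($name, strrpos($name, ':') + 1);

$open = probe_main_db('127.0.0.1', $port, 2.0);   // something is listening
fclose($server);
$closed = probe_main_db('127.0.0.1', $port, 2.0); // nothing is listening any more

var_dump($open['reachable'], $closed['reachable']);
printf("closed port answered after %.3f seconds\n", $closed['elapsed']);
```

A remote poller could run a check like this before touching the main database, so that a dead network path costs at most the configured timeout rather than the client library's default.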

@bmfmancini
Member Author

bmfmancini commented Aug 24, 2022

Also found the following.

When the primary server is offline, you will see this in the error_log of httpd:

[Wed Aug 24 10:48:10.762636 2022] [php7:error] [pid 18633] [client :34869] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 394
[Wed Aug 24 12:23:30.729189 2022] [php7:error] [pid 7519] [client :38877] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 548
[Wed Aug 24 12:23:30.738659 2022] [php7:error] [pid 4446] [client :38763] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 548
[Wed Aug 24 12:23:30.748958 2022] [php7:error] [pid 6597] [client :38782] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 548
[Wed Aug 24 12:23:30.756671 2022] [php7:error] [pid 7965] [client :38736] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 548
[Wed Aug 24 12:23:30.769023 2022] [php7:error] [pid 10217] [client :38753] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 548
[Wed Aug 24 12:23:30.878296 2022] [php7:error] [pid 1652] [client :38732] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 548
[Wed Aug 24 12:23:30.990902 2022] [php7:error] [pid 4440] [client :38948] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 548
[Wed Aug 24 12:23:31.041406 2022] [php7:error] [pid 1425] [client :38745] PHP Fatal error:  Allowed memory size of 576716800 bytes exhausted (tried to allocate 262144 bytes) in /var/www/html/cacti/lib/database.php on line 548

Function at line 548

function db_fetch_row_prepared($sql, $params = array(), $log = true, $db_conn = false) {
        global $config;

        if (!empty($config['DEBUG_SQL_FLOW'])) {
                db_echo_sql('db_fetch_row_prepared(\'' . clean_up_lines($sql) . '\', $params = (\'' . implode('\', \'', $params) . '\'), $log = ' . $log . ', $db_conn = ' . ($db_conn ? 'true' : 'false') .')' . "\n");
        }

        return db_execute_prepared($sql, $params, $log, $db_conn, 'Row', false, 'db_fetch_row_return');
}

@TheWitness
Member

I think we need to have a setting in the GUI to keep the poller offline from the GUI perspective until it comes back up. Something that we need to save in the session. I think this is the only good way to solve this issue.
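
One way to picture the session idea is a small hold-down check: once the main poller has been seen offline, the GUI trusts that state for a while instead of retrying the connection on every page load. This is a sketch under assumptions, not the fix that was committed. `main_poller_online()`, the `main_offline_checked` session key and the 60 second recheck interval are all hypothetical, and a plain array stands in for `$_SESSION`.

```php
<?php
// Sketch: remember in the session that the main poller is offline and only
// re-probe after a hold-down interval. All names here are hypothetical.

function main_poller_online(array &$session, callable $probe, int $now, int $recheck = 60): bool {
    $last_failed = $session['main_offline_checked'] ?? null;

    // Still inside the hold-down window: trust the cached "offline" state
    // and skip the (potentially slow) connection attempt.
    if ($last_failed !== null && ($now - $last_failed) < $recheck) {
        return false;
    }

    if ($probe()) {
        unset($session['main_offline_checked']);
        return true;
    }

    $session['main_offline_checked'] = $now;

    return false;
}

// Demo with a fake probe that counts how often it is called.
$session = array(); // stand-in for $_SESSION
$calls   = 0;
$up      = false;
$probe   = function () use (&$calls, &$up) {
    $calls++;
    return $up;
};

$results = array(
    main_poller_online($session, $probe, 1000), // probes, main is down
    main_poller_online($session, $probe, 1030), // cached, no probe
    main_poller_online($session, $probe, 1061), // hold-down expired, probes again
);

$up = true;
$recovered = main_poller_online($session, $probe, 1200); // probes, main is back

var_dump($results, $calls, $recovered);
```

In a real page handler `$probe` would be whatever connection test the remote poller uses, and `$now` would come from `time()`.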

@TheWitness
Member

It's likely I'll be offline for a few days. We'll see.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions bot added the "outdated" (No recent activity) label on Oct 25, 2022
@bmfmancini
Member Author

bmfmancini commented Oct 25, 2022 via email

@github-actions bot removed the "outdated" (No recent activity) label on Oct 26, 2022
@TheWitness added the "remote data collection" (Issue related to remote data collection) label on Nov 1, 2022
@TheWitness added this to the v1.3.0 milestone on Dec 17, 2022
TheWitness added a commit that referenced this issue Dec 18, 2022
* upgrading to 1.2.22 most of the plugins break in a multipoller setup
* When remote poller is in offline mode GUI inaccesible and poller times out
* When in Recovery Mode plugins that are designed to work remotely stop working
@TheWitness added the "confirmed" (Bug is confirm by dev team), "bug" (Undesired behaviour) and "resolved" (A fixed issue) labels on Dec 18, 2022
@TheWitness removed this from the v1.3.0 milestone on Dec 18, 2022
@TheWitness added this to the v1.2.23 milestone on Dec 18, 2022
TheWitness added a commit that referenced this issue Dec 18, 2022
Porting these three fixes from the 1.2.x branch

* upgrading to 1.2.22 most of the plugins break in a multipoller setup
* When remote poller is in offline mode GUI inaccesible and poller times out
* When in Recovery Mode plugins that are designed to work remotely stop working
@netniV changed the title from "[1.2.22] - When remote poller is in offline mode GUI inaccesible and poller times out" to "When remote poller is in offline mode, GUI can become inaccesible and poller can timeout" on Dec 31, 2022
@github-actions bot locked and limited conversation to collaborators on Apr 1, 2023
2 participants