Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel boost restart due to timeout can result in errors. #4223

Closed
jdcoats opened this issue Apr 9, 2021 · 58 comments
Closed

Parallel boost restart due to timeout can result in errors. #4223

jdcoats opened this issue Apr 9, 2021 · 58 comments
Labels
bug Undesired behaviour resolved A fixed issue unreleased A bug that was introduced in development and not released.
Milestone

Comments

@jdcoats
Copy link

jdcoats commented Apr 9, 2021

Boost is hanging up and leaving a process running even after completion. When i find it, I have to manually kill it. It will continue as long as I leave it.

~ # grep 16:16:3 /var/log/cacti.log
2021/04/09 16:16:34 - SYSTEM BOOST STATS: Time:15.72 ProcessNumber:2 RRDUpdates:2731838
2021/04/09 16:16:35 - SYSTEM BOOST STATS: Time:17.54 RRDUpdates:2731838
2021/04/09 16:16:35 - CMDPHP ERROR: A DB Row Failed!, Error: Table 'cacti.poller_output_boost_arch_1617999315' doesn't exist
2021/04/09 16:16:35 - CMDPHP SQL Backtrace:  (/poller_boost.php[216]:boost_output_rrd_data(), /poller_boost.php[573]:boost_process_local_data_ids(), /poller_boost.php[673]:db_fetch_assoc(), /lib/database.php[500]:db_fetch_assoc_prepared(), /lib/database.php[514]:db_execute_prepared())
2021/04/09 16:16:35 - CMDPHP ERROR: A DB Row Failed!, Error: Table 'cacti.poller_output_boost_arch_1617999315' doesn't exist
2021/04/09 16:16:35 - CMDPHP SQL Backtrace:  (/poller_boost.php[216]:boost_output_rrd_data(), /poller_boost.php[573]:boost_process_local_data_ids(), /poller_boost.php[673]:db_fetch_assoc(), /lib/database.php[500]:db_fetch_assoc_prepared(), /lib/database.php[514]:db_execute_prepared())
2021/04/09 16:16:35 - CMDPHP ERROR: A DB Row Failed!, Error: Table 'cacti.poller_output_boost_arch_1617999315' doesn't exist
2021/04/09 16:16:35 - CMDPHP SQL Backtrace:  (/poller_boost.php[216]:boost_output_rrd_data(), /poller_boost.php[573]:boost_process_local_data_ids(), /poller_boost.php[673]:db_fetch_assoc(), /lib/database.php[500]:db_fetch_assoc_prepared(), /lib/database.php[514]:db_execute_prepared())
.....
2021/04/09 17:24:33 - CMDPHP SQL Backtrace: (/poller_boost.php[216]:boost_output_rrd_data(), /poller_boost.php[573]:boost_process_local_data_ids(), /poller_boost.php[673]:db_fetch_assoc(), /lib/database.php[500]:db_fetch_assoc_prepared(), /lib/database.php[514]:db_execute_prepared())
2021/04/09 17:24:33 - CMDPHP ERROR: A DB Row Failed!, Error: Table 'cacti.poller_output_boost_arch_1617999315' doesn't exist 

~ # ps aux| grep boost
root     12528 76.1  0.1 114380 81048 ?        R    16:16  51:47 /usr/bin/php /var/www/localhost/htdocs/cacti/poller_boost.php --child=1
root     22250  0.0  0.0   8000  2372 pts/1    S+   17:24   0:00 grep --colour=auto boost
~ # kill -9 12528
@jdcoats jdcoats added bug Undesired behaviour unverified Some days we don't have a clue labels Apr 9, 2021
@jdcoats
Copy link
Author

jdcoats commented Apr 9, 2021

I was running it at 8 processes/cpu's and i cut it back to 2. I will change to 1 and see if it continues.

@TheWitness
Copy link
Member

So, my guess is that you changed the setting while boost was running. Let me take a look at the code. There might be something around that change that would cause boost to exit before everything was done, and then dropping the table while the legacy setting process was still busy.

@TheWitness
Copy link
Member

I can possibly see how this could happen if you exceed the maximum runtime by 3x, boost will forcibly restart itself. Adding some logic to 'fix/workaround' that.

@TheWitness TheWitness changed the title 1.2.17 current boost process issue Parrallel boost restart due to timeout can result in errors. Apr 10, 2021
@TheWitness TheWitness changed the title Parrallel boost restart due to timeout can result in errors. Parallel boost restart due to timeout can result in errors. Apr 10, 2021
TheWitness added a commit that referenced this issue Apr 10, 2021
Parallel boost restart due to timeout can result in errors
@TheWitness TheWitness added resolved A fixed issue and removed unverified Some days we don't have a clue labels Apr 10, 2021
@TheWitness TheWitness added this to the v1.2.17 milestone Apr 10, 2021
@TheWitness
Copy link
Member

So, technically, you should not get the errors any longer, but you will have to check for warnings to ensure that you don't stack up a bunch of archive tables.

@jdcoats
Copy link
Author

jdcoats commented Apr 10, 2021

When it runs it get this now.


2021/04/10 12:40:17 - BOOST WARNING: Boost Poller terminated by user
--
2021/04/10 12:40:17 - BOOST WARNING: Killing Boost Child PID 29648 due to it exceeding 3x its maximum allowed runtime
2021/04/10 12:40:17 - BOOST WARNING: Boost Poller terminated by user
2021/04/10 12:40:17 - POLLER: Poller[Main Poller] PID[2116] ERROR: Process being killed due to timeout! (boost, master, 1, 29638)

No actual boost stats are displayed. Maxim run time is set to 40 minutes and it only runs for 40 seconds

@TheWitness
Copy link
Member

So, I'm wondering who is starting poller_boost.php? Did you manually run?

@jdcoats
Copy link
Author

jdcoats commented Apr 10, 2021

not a manual run. The cacti schedule every 2 hours or 7000000 records
boost

@jdcoats
Copy link
Author

jdcoats commented Apr 10, 2021


2021/04/10 14:43:06 - SYSTEM BOOST STATS: Time:48.15 RRDUpdates:6294591
--
2021/04/10 14:43:05 - SYSTEM BOOST STATS: Time:38.39 ProcessNumber:1 RRDUpdates:6294591
2021/04/10 14:42:18 - BOOST WARNING: Boost Poller terminated by user
2021/04/10 14:42:18 - BOOST WARNING: Boost Poller terminated by user
2021/04/10 14:42:18 - BOOST WARNING: Killing Boost Child PID 12567 due to it exceeding 3x its maximum allowed runtime
2021/04/10 14:42:18 - POLLER: Poller[Main Poller] PID[17472] ERROR: Process being killed due to timeout! (boost, master, 1, 12557)
2021/04/10 12:41:05 - SYSTEM BOOST STATS: Time:48.04 RRDUpdates:6242511
2021/04/10 12:41:05 - SYSTEM BOOST STATS: Time:37.54 ProcessNumber:1 RRDUpdates:6242511
2021/04/10 12:40:17 - BOOST WARNING: Boost Poller terminated by user
2021/04/10 12:40:17 - BOOST WARNING: Killing Boost Child PID 29648 due to it exceeding 3x its maximum allowed runtime
2021/04/10 12:40:17 - BOOST WARNING: Boost Poller terminated by user
2021/04/10 12:40:17 - POLLER: Poller[Main Poller] PID[2116] ERROR: Process being killed due to timeout! (boost, master, 1, 29638)
2021/04/10 10:39:58 - SYSTEM BOOST STATS: Time:42.99 RRDUpdates:12537883
2021/04/10 10:39:57 - SYSTEM BOOST STATS: Time:40.79 ProcessNumber:1 RRDUpdates:6294742
2021/04/10 10:39:15 - SYSTEM BOOST STATS: Time:59.94 ProcessNumber:1 RRDUpdates:6243141
2021/04/10 10:39:15 - BOOST WARNING: Boost Poller terminated by user
2021/04/10 10:39:15 - POLLER: Poller[Main Poller] PID[23921] ERROR: Process being killed due to timeout! (boost, master, 1, 19027)

@TheWitness
Copy link
Member

Were those pids actually still running at the time?

@TheWitness
Copy link
Member

It looks like the master pid is not being cleared from the process table.

@TheWitness
Copy link
Member

Did you update to the latest poller_boost.php?

@jdcoats
Copy link
Author

jdcoats commented Apr 10, 2021 via email

@jdcoats
Copy link
Author

jdcoats commented Apr 10, 2021

I pulled today at 10:50am est

@TheWitness
Copy link
Member

Okay, so start the boost process by hand in debug mode

./poller_boost.php --force --debug

Then from another shell, do an strace after a few minutes, if it does not end on it's own. Use strace like the following:

strace -p pid -s 2000 -tt -o /tmp/strace.out

@jdcoats
Copy link
Author

jdcoats commented Apr 10, 2021

the strace wouldn't work, even though the pids did exist, I'll try again when I have more records and itll run long enough.

# php poller_boost.php --force --debug
DEBUG: Checking if Boost is ready to run.
DEBUG: Last Runtime was 2021-04-10 14:42:18 (1618080138).
DEBUG: Next Runtime is 2021-04-10 16:42:18 (1618087338).
DEBUG: Records Found:4076217, Max Threshold:7000000.
DEBUG: Time to Run Boost, Force Run is true!
DEBUG: Parallel Process Setup Begins.
DEBUG: Data Sources:31498, Concurrent Processes:1
DEBUG: Parallel Process Setup Complete.  Ready to spawn children.
DEBUG: About to launch 1 processes.
DEBUG: Launching Boost Process Number 1
2021/04/10 16:01:29 - POLLER: Poller[1] PID[16179] DEBUG: About to Spawn a Remote Process [CMD: /usr/bin/php, ARGS: /var/www/localhost/htdocs/cacti/poller_boost.php --child=1 --debug]
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
DEBUG: Sleeping for 2 seconds.
2021/04/10 16:02:36 - SYSTEM BOOST STATS: Time:67.82 RRDUpdates:4076335

~ # strace -p pid -s 16179 -tt -o /tmp/strace.out
strace: Invalid process id: 'pid'
~ # ps aux| grep boost
root     16179  0.1  0.0  72100 37588 pts/1    S+   16:01   0:00 php poller_boost.php --force --debug
root     16181 18.1  0.1 122376 88716 pts/1    S+   16:01   0:10 /usr/bin/php /var/www/localhost/htdocs/cacti/poller_boost.php --child=1 --debug
root     21171  0.0  0.0   8000  2428 pts/2    S+   16:02   0:00 grep --colour=auto boost
~ # strace -p pid -s 16181 -tt -o /tmp/strace.out
strace: Invalid process id: 'pid'
~ # ps aux| grep boost
root     21178  0.0  0.0   7872   704 pts/2    S+   16:02   0:00 grep --colour=auto boost

@TheWitness
Copy link
Member

you have to replace the 'pid' with the pid of the boost process.

@TheWitness
Copy link
Member

strace -p 16179 -s 2000 -o /tmp/strace.out

@jdcoats
Copy link
Author

jdcoats commented Apr 10, 2021

was that not the pid ? I tried 16179 and 16181

@jdcoats
Copy link
Author

jdcoats commented Apr 10, 2021


2021/04/10 16:04:49 - SYSTEM BOOST STATS: Time:8.52 RRDUpdates:154799
--
2021/04/10 16:04:49 - SYSTEM BOOST STATS: Time:7.05 ProcessNumber:1 RRDUpdates:154799
2021/04/10 16:04:49 - BOOST CHILD DEBUG: Processing Complete for Boost Process 1.  It took 5 passed to complete.
2021/04/10 16:04:47 - BOOST CHILD DEBUG: Processing 4 of 4 for Boost Process 1
2021/04/10 16:04:45 - BOOST CHILD DEBUG: Processing 3 of 4 for Boost Process 1
2021/04/10 16:04:43 - BOOST CHILD DEBUG: Processing 2 of 4 for Boost Process 1
2021/04/10 16:04:42 - BOOST CHILD DEBUG: Processing 1 of 4 for Boost Process 1
2021/04/10 16:04:42 - BOOST CHILD DEBUG: Processes:1, TotalRows:154799
2021/04/10 16:04:41 - BOOST CHILD DEBUG: Processing RRRtool Output for Boost Process 1
2021/04/10 16:04:41 - POLLER: Poller[Main Poller]  PID[30900] DEBUG: About to Spawn a Remote Process [CMD: /usr/bin/php,  ARGS: /var/www/localhost/htdocs/cacti/poller_boost.php --child=1  --debug]
2021/04/10 16:04:41 - BOOST WARNING: Boost Poller forced by command line.
2021/04/10 16:02:36 - SYSTEM BOOST STATS: Time:67.82 RRDUpdates:4076335
2021/04/10 16:02:34 - SYSTEM BOOST STATS: Time:65.57 ProcessNumber:1 RRDUpdates:4076335
2021/04/10 16:02:34 - BOOST CHILD DEBUG: Processing Complete for Boost Process 1.  It took 83 passed to complete.
2021/04/10 16:02:34 - BOOST CHILD DEBUG: Processing 82 of 82 for Boost Process 1
2021/04/10 16:02:34 - BOOST CHILD DEBUG: Processing 81 of 82 for Boost Process 1
2021/04/10 16:02:33 - BOOST CHILD DEBUG: Processing 80 of 82 for Boost Process 1
2021/04/10 16:02:33 - BOOST CHILD DEBUG: Processing 79 of 82 for Boost Process 1
2021/04/10 16:02:33 - BOOST CHILD DEBUG: Processing 78 of 82 for Boost Process 1
2021/04/10 16:02:25 - BOOST CHILD DEBUG: Processing 77 of 82 for Boost Process 1
2021/04/10 16:02:13 - BOOST CHILD DEBUG: Processing 76 of 82 for Boost Process 1
2021/04/10 16:01:55 - BOOST CHILD DEBUG: Processing 75 of 82 for Boost Process 1
2021/04/10 16:01:55 - BOOST CHILD DEBUG: Processing 74 of 82 for Boost Process 1
2021/04/10 16:01:55 - BOOST CHILD DEBUG: Processing 73 of 82 for Boost Process 1
2021/04/10 16:01:54 - BOOST CHILD DEBUG: Processing 72 of 82 for Boost Process 1
2021/04/10 16:01:54 - BOOST CHILD DEBUG: Processing 71 of 82 for Boost Process 1
2021/04/10 16:01:53 - BOOST CHILD DEBUG: Processing 70 of 82 for Boost Process 1
2021/04/10 16:01:53 - BOOST CHILD DEBUG: Processing 69 of 82 for Boost Process 1
2021/04/10 16:01:53 - BOOST CHILD DEBUG: Processing 68 of 82 for Boost Process 1
2021/04/10 16:01:52 - BOOST CHILD DEBUG: Processing 67 of 82 for Boost Process 1
2021/04/10 16:01:52 - BOOST CHILD DEBUG: Processing 66 of 82 for Boost Process 1
2021/04/10 16:01:52 - BOOST CHILD DEBUG: Processing 65 of 82 for Boost Process 1
2021/04/10 16:01:52 - BOOST CHILD DEBUG: Processing 64 of 82 for Boost Process 1
2021/04/10 16:01:51 - BOOST CHILD DEBUG: Processing 63 of 82 for Boost Process 1
2021/04/10 16:01:51 - BOOST CHILD DEBUG: Processing 62 of 82 for Boost Process 1
2021/04/10 16:01:51 - BOOST CHILD DEBUG: Processing 61 of 82 for Boost Process 1
2021/04/10 16:01:50 - BOOST CHILD DEBUG: Processing 60 of 82 for Boost Process 1
2021/04/10 16:01:50 - BOOST CHILD DEBUG: Processing 59 of 82 for Boost Process 1
2021/04/10 16:01:50 - BOOST CHILD DEBUG: Processing 58 of 82 for Boost Process 1
2021/04/10 16:01:50 - BOOST CHILD DEBUG: Processing 57 of 82 for Boost Process 1
2021/04/10 16:01:49 - BOOST CHILD DEBUG: Processing 56 of 82 for Boost Process 1
2021/04/10 16:01:49 - BOOST CHILD DEBUG: Processing 55 of 82 for Boost Process 1
2021/04/10 16:01:49 - BOOST CHILD DEBUG: Processing 54 of 82 for Boost Process 1
2021/04/10 16:01:49 - BOOST CHILD DEBUG: Processing 53 of 82 for Boost Process 1
2021/04/10 16:01:48 - BOOST CHILD DEBUG: Processing 52 of 82 for Boost Process 1
2021/04/10 16:01:48 - BOOST CHILD DEBUG: Processing 51 of 82 for Boost Process 1
2021/04/10 16:01:48 - BOOST CHILD DEBUG: Processing 50 of 82 for Boost Process 1
2021/04/10 16:01:47 - BOOST CHILD DEBUG: Processing 49 of 82 for Boost Process 1
2021/04/10 16:01:47 - BOOST CHILD DEBUG: Processing 48 of 82 for Boost Process 1
2021/04/10 16:01:47 - BOOST CHILD DEBUG: Processing 47 of 82 for Boost Process 1
2021/04/10 16:01:47 - BOOST CHILD DEBUG: Processing 46 of 82 for Boost Process 1
2021/04/10 16:01:46 - BOOST CHILD DEBUG: Processing 45 of 82 for Boost Process 1
2021/04/10 16:01:46 - BOOST CHILD DEBUG: Processing 44 of 82 for Boost Process 1
2021/04/10 16:01:46 - BOOST CHILD DEBUG: Processing 43 of 82 for Boost Process 1
2021/04/10 16:01:45 - BOOST CHILD DEBUG: Processing 42 of 82 for Boost Process 1
2021/04/10 16:01:45 - BOOST CHILD DEBUG: Processing 41 of 82 for Boost Process 1
2021/04/10 16:01:45 - BOOST CHILD DEBUG: Processing 40 of 82 for Boost Process 1
2021/04/10 16:01:44 - BOOST CHILD DEBUG: Processing 39 of 82 for Boost Process 1
2021/04/10 16:01:44 - BOOST CHILD DEBUG: Processing 38 of 82 for Boost Process 1
2021/04/10 16:01:44 - BOOST CHILD DEBUG: Processing 37 of 82 for Boost Process 1
2021/04/10 16:01:43 - BOOST CHILD DEBUG: Processing 36 of 82 for Boost Process 1
2021/04/10 16:01:43 - BOOST CHILD DEBUG: Processing 35 of 82 for Boost Process 1
2021/04/10 16:01:43 - BOOST CHILD DEBUG: Processing 34 of 82 for Boost Process 1
2021/04/10 16:01:42 - BOOST CHILD DEBUG: Processing 33 of 82 for Boost Process 1
2021/04/10 16:01:42 - BOOST CHILD DEBUG: Processing 32 of 82 for Boost Process 1
2021/04/10 16:01:42 - BOOST CHILD DEBUG: Processing 31 of 82 for Boost Process 1
2021/04/10 16:01:41 - BOOST CHILD DEBUG: Processing 30 of 82 for Boost Process 1
2021/04/10 16:01:41 - BOOST CHILD DEBUG: Processing 29 of 82 for Boost Process 1
2021/04/10 16:01:40 - BOOST CHILD DEBUG: Processing 28 of 82 for Boost Process 1
2021/04/10 16:01:40 - BOOST CHILD DEBUG: Processing 27 of 82 for Boost Process 1
2021/04/10 16:01:40 - BOOST CHILD DEBUG: Processing 26 of 82 for Boost Process 1
2021/04/10 16:01:39 - BOOST CHILD DEBUG: Processing 25 of 82 for Boost Process 1
2021/04/10 16:01:39 - BOOST CHILD DEBUG: Processing 24 of 82 for Boost Process 1
2021/04/10 16:01:38 - BOOST CHILD DEBUG: Processing 23 of 82 for Boost Process 1
2021/04/10 16:01:38 - BOOST CHILD DEBUG: Processing 22 of 82 for Boost Process 1
2021/04/10 16:01:38 - BOOST CHILD DEBUG: Processing 21 of 82 for Boost Process 1
2021/04/10 16:01:37 - BOOST CHILD DEBUG: Processing 20 of 82 for Boost Process 1
2021/04/10 16:01:37 - BOOST CHILD DEBUG: Processing 19 of 82 for Boost Process 1
2021/04/10 16:01:36 - BOOST CHILD DEBUG: Processing 18 of 82 for Boost Process 1
2021/04/10 16:01:36 - BOOST CHILD DEBUG: Processing 17 of 82 for Boost Process 1
2021/04/10 16:01:36 - BOOST CHILD DEBUG: Processing 16 of 82 for Boost Process 1
2021/04/10 16:01:35 - BOOST CHILD DEBUG: Processing 15 of 82 for Boost Process 1
2021/04/10 16:01:35 - BOOST CHILD DEBUG: Processing 14 of 82 for Boost Process 1
2021/04/10 16:01:34 - BOOST CHILD DEBUG: Processing 13 of 82 for Boost Process 1
2021/04/10 16:01:34 - BOOST CHILD DEBUG: Processing 12 of 82 for Boost Process 1
2021/04/10 16:01:34 - BOOST CHILD DEBUG: Processing 11 of 82 for Boost Process 1
2021/04/10 16:01:33 - BOOST CHILD DEBUG: Processing 10 of 82 for Boost Process 1
2021/04/10 16:01:33 - BOOST CHILD DEBUG: Processing 9 of 82 for Boost Process 1
2021/04/10 16:01:33 - BOOST CHILD DEBUG: Processing 8 of 82 for Boost Process 1
2021/04/10 16:01:32 - BOOST CHILD DEBUG: Processing 7 of 82 for Boost Process 1
2021/04/10 16:01:32 - BOOST CHILD DEBUG: Processing 6 of 82 for Boost Process 1
2021/04/10 16:01:31 - BOOST CHILD DEBUG: Processing 5 of 82 for Boost Process 1
2021/04/10 16:01:31 - BOOST CHILD DEBUG: Processing 4 of 82 for Boost Process 1
2021/04/10 16:01:31 - BOOST CHILD DEBUG: Processing 3 of 82 for Boost Process 1
2021/04/10 16:01:30 - BOOST CHILD DEBUG: Processing 2 of 82 for Boost Process 1
2021/04/10 16:01:30 - BOOST CHILD DEBUG: Processing 1 of 82 for Boost Process 1
2021/04/10 16:01:30 - BOOST CHILD DEBUG: Processes:1, TotalRows:4076335
2021/04/10 16:01:29 - BOOST CHILD DEBUG: Processing RRRtool Output for Boost Process 1
2021/04/10 16:01:29 - POLLER: Poller[Main Poller]  PID[16179] DEBUG: About to Spawn a Remote Process [CMD: /usr/bin/php,  ARGS: /var/www/localhost/htdocs/cacti/poller_boost.php --child=1  --debug]
2021/04/10 16:01:28 - BOOST WARNING: Boost Poller forced by command line.

@TheWitness
Copy link
Member

So, boost is ending and the process is not removed from the processes table. So, run one more test. Let it run to completion and then run this query:

SELECT * FROM processes;

@jdcoats
Copy link
Author

jdcoats commented Apr 10, 2021

while running and then completion

MariaDB [cacti]> SELECT * FROM processes;
+--------+-------+----------+----------+--------+---------+---------------------+---------------------+
| id     | pid   | tasktype | taskname | taskid | timeout | started             | last_update         |
+--------+-------+----------+----------+--------+---------+---------------------+---------------------+
| 183765 | 23719 | boost    | child    |      1 |    7200 | 2021-04-10 16:23:07 | 2021-04-10 16:23:07 |
| 183764 | 23471 | boost    | master   |      1 |    7200 | 2021-04-10 16:23:07 | 2021-04-10 16:23:07 |
+--------+-------+----------+----------+--------+---------+---------------------+---------------------+
2 rows in set (0.000 sec)

MariaDB [cacti]> SELECT * FROM processes;
Empty set (0.000 sec)

@jdcoats
Copy link
Author

jdcoats commented Apr 10, 2021

I'll wait for the next full 2hr run

@jdcoats
Copy link
Author

jdcoats commented Apr 12, 2021

no remote poller

@TheWitness
Copy link
Member

Okay

@TheWitness
Copy link
Member

Well, that's weird, the master goes away, but the child stays put. How many concurrent processes was that?

@jdcoats
Copy link
Author

jdcoats commented Apr 12, 2021 via email

@jdcoats
Copy link
Author

jdcoats commented Apr 12, 2021

i grabbed the changes from Saturday afternoon and Sunday this morning. The following boost run completed without the error. I will update you in about an hour on next run.

@jdcoats
Copy link
Author

jdcoats commented Apr 12, 2021

okay, so it creates the master and child (12642 & 12631). Appears to complete then runs another child (17529)

2021/04/12 11:03:25 - SYSTEM BOOST STATS: Time:8.51 RRDUpdates:103200
--
2021/04/12 11:03:23 - SYSTEM BOOST STATS: Time:6.56 ProcessNumber:1 RRDUpdates:103200
2021/04/12 11:03:16 - BOOST WARNING: Boost Poller terminated by user
2021/04/12 11:03:16 - POLLER: Poller[Main Poller] PID[32130] ERROR: Process being killed due to timeout! (boost, master, 1, 27281)
2021/04/12 11:03:15 - SYSTEM BOOST STATS: Time:58.71 ProcessNumber:1 RRDUpdates:51599
2021/04/12 11:01:16 - SYSTEM BOOST STATS: Time:0.01 ProcessNumber:1 RRDUpdates:
2021/04/12 11:01:16 - BOOST ERROR: Failed to retrieve archive table name
2021/04/12 11:01:15 - BOOST WARNING: Boost Poller terminated by user
2021/04/12 11:01:15 - BOOST SVR WARNING: Detected Poller Boost Overrun, Possible Boost Poller Crash
2021/04/12 11:01:15 - BOOST WARNING: Killing Boost Child PID 17529 due to it exceeding 3x its maximum allowed runtime
2021/04/12 11:00:15 - BOOST WARNING: Boost Poller terminated by user
2021/04/12 11:00:15 - BOOST WARNING: Boost Poller terminated by user
2021/04/12 11:00:15 - BOOST WARNING: Killing Boost Child PID 12642 due to it exceeding 3x its maximum allowed runtime
2021/04/12 11:00:15 - POLLER: Poller[Main Poller] PID[17519] ERROR: Process being killed due to timeout! (boost, master, 1, 12631)

boost-process2.txt

@jdcoats
Copy link
Author

jdcoats commented Apr 12, 2021

straces
strace.zip
strace.txt

@jdcoats
Copy link
Author

jdcoats commented Apr 12, 2021

It thinks boost has exceeded its run time when another poller starts since I'm on 1 minute polling?

@TheWitness
Copy link
Member

That should not happen, looks like you are missing a few additional patches. Best you test using the full 1.2.x branch.

@TheWitness
Copy link
Member

-issue#4183: Process control timing out processes pre-maturely

@jdcoats
Copy link
Author

jdcoats commented Apr 13, 2021

I don't update a file at a time. I can revert to 4/8 before this started. I'm doing full 1.2.x pulls and saving +2 previous revisions.

@jdcoats
Copy link
Author

jdcoats commented Apr 13, 2021

I'm available to look yourself but I can't find anything that I've overlooked.

@jdcoats
Copy link
Author

jdcoats commented Apr 13, 2021

#4183 was a month ago and I follow close as best I can to assist with testing in a production system, this just happened in the past week

 ~ # grep strtotime /var/www/localhost/htdocs/cacti/lib/poller.php
        } elseif (strtotime($r['started']) + $r['timeout'] > time()) {

@TheWitness
Copy link
Member

Dan, I was able to reproduce the problem today. Going to update poller_boost.php and lib/poller.php shortly.

@TheWitness
Copy link
Member

Changes are in. Please test.

@jdcoats
Copy link
Author

jdcoats commented Apr 13, 2021

Thanks! I'm glad you were able to reproduce. I have updated again. Boost will run in about an hour but I have to work at 2am so may not check it, ill update with results early am sometime.

@jdcoats
Copy link
Author

jdcoats commented Apr 13, 2021

still was not smooth
boost

@jdcoats
Copy link
Author

jdcoats commented Apr 13, 2021

It does finally complete with what I think should be the normal boost stats 6434281 updates in 45-48 sec's

boost2

@TheWitness
Copy link
Member

I know the issue now. Fix it in that am.

@TheWitness
Copy link
Member

I was not able to reproduce, but I have a thought. Run this query and post the results:

SELECT value FROM settings WHERE name = 'boost_rrd_update_max_runtime';

I thought that I had written the timeout logic correctly, but upon inspection this morning, it looks good.

@jdcoats
Copy link
Author

jdcoats commented Apr 14, 2021

MariaDB [cacti]> SELECT value FROM settings WHERE name = 'boost_rrd_update_max_runtime';
+-------+
| value |
+-------+
| 2400  |
+-------+
1 row in set (0.000 sec)

@TheWitness
Copy link
Member

So, timeout is 40 minutes. Bah.

@TheWitness TheWitness added the unreleased A bug that was introduced in development and not released. label Apr 14, 2021
TheWitness added a commit that referenced this issue Apr 14, 2021
Parallel boost restart due to timeout can result in errors
@TheWitness
Copy link
Member

TheWitness commented Apr 14, 2021

Okay, pull lib/poller.php again. Note to self: Don't every buy code manufactured on a Monday, or is that a Car?

@jdcoats
Copy link
Author

jdcoats commented Apr 14, 2021

Thanks! that worked with one process :) I'll turn it up a bit and see now.

@jdcoats
Copy link
Author

jdcoats commented Apr 14, 2021

looks good now


2021/04/14 12:40:50 - SYSTEM BOOST STATS: Time:35.13 RRDUpdates:6140154
--
2021/04/14 12:40:50 - SYSTEM BOOST STATS: Time:33.28 ProcessNumber:1 RRDUpdates:671063
2021/04/14 12:40:49 - SYSTEM BOOST STATS: Time:32.99 ProcessNumber:2 RRDUpdates:762538
2021/04/14 12:40:49 - SYSTEM BOOST STATS: Time:32.92 ProcessNumber:4 RRDUpdates:752619
2021/04/14 12:40:49 - SYSTEM BOOST STATS: Time:32.64 ProcessNumber:3 RRDUpdates:762542
2021/04/14 12:40:48 - SYSTEM BOOST STATS: Time:31.79 ProcessNumber:9 RRDUpdates:693209
2021/04/14 12:40:48 - SYSTEM BOOST STATS: Time:31.63 ProcessNumber:5 RRDUpdates:569623
2021/04/14 12:40:47 - SYSTEM BOOST STATS: Time:30.80 ProcessNumber:6 RRDUpdates:537363
2021/04/14 12:40:47 - SYSTEM BOOST STATS: Time:30.37 ProcessNumber:7 RRDUpdates:455565
2021/04/14 12:40:45 - SYSTEM BOOST STATS: Time:29.05 ProcessNumber:10 RRDUpdates:472819
2021/04/14 12:40:43 - SYSTEM BOOST STATS: Time:26.91 ProcessNumber:8 RRDUpdates:462813
2021/04/14 10:40:17 - SYSTEM BOOST STATS: Time:60.72 RRDUpdates:6185933
2021/04/14 10:40:15 - SYSTEM BOOST STATS: Time:58.43 ProcessNumber:1 RRDUpdates:6185933

@jdcoats jdcoats closed this as completed Apr 14, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Jul 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Undesired behaviour resolved A fixed issue unreleased A bug that was introduced in development and not released.
Projects
None yet
Development

No branches or pull requests

2 participants