Parallel boost restart due to timeout can result in errors. #4223
Comments
I was running it at 8 processes/CPUs and I cut it back to 2. I will change to 1 and see if it continues.
So, my guess is that you changed the setting while boost was running. Let me take a look at the code. There might be something around that change that would cause boost to exit before everything was done and then drop the table while the legacy setting process was still busy.
I can possibly see how this could happen: if you exceed the maximum runtime by 3x, boost will forcibly restart itself. I'm adding some logic to fix or work around that.
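As an illustration only, the 3x rule described above can be sketched as a simple elapsed-time check. This is a minimal Python sketch and not the actual Cacti code: the function name `should_force_restart`, its parameters, and the use of seconds are all assumptions made for the example.

```python
import time


def should_force_restart(start_time, max_runtime, now=None, factor=3):
    """Return True when a run has lasted longer than factor * max_runtime.

    All values are in seconds. The name and signature are hypothetical;
    they only illustrate the "exceeded the maximum runtime by 3x" rule.
    """
    if now is None:
        now = time.time()
    elapsed = now - start_time
    return elapsed > factor * max_runtime


# Example: a 40 minute maximum run time, expressed in seconds.
MAX_RUNTIME = 40 * 60

# A run that has been going for 40 seconds is nowhere near the limit.
print(should_force_restart(0, MAX_RUNTIME, now=40))

# A run that has been going for just over two hours exceeds 3x the limit.
print(should_force_restart(0, MAX_RUNTIME, now=3 * MAX_RUNTIME + 1))
```

With a 40 minute maximum run time, a run would only be treated as hung after two hours under this rule, so a run that ends after 40 seconds should not trip it.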
So, technically, you should not get the errors any longer, but you will have to check for warnings to ensure that you don't stack up a bunch of archive tables.
When it runs, I get this now.
No actual boost stats are displayed. Maximum run time is set to 40 minutes and it only runs for 40 seconds.
So, I'm wondering who is starting poller_boost.php? Did you run it manually?
Were those pids actually still running at the time?
It looks like the master pid is not being cleared from the process table.
Did you update to the latest poller_boost.php?
I checked for a running boost process the last time I saw the error and there was not one. I didn’t check while it was running to see if there were 2.
I pulled today at 10:50am EST.
Okay, so start the boost process by hand in debug mode:
    ./poller_boost.php --force --debug
Then, from another shell, run strace after a few minutes if it does not end on its own. Use strace like the following:
    strace -p pid -s 2000 -tt -o /tmp/strace.out
The strace wouldn't work, even though the pids did exist. I'll try again when I have more records and it'll run long enough.
You have to replace the 'pid' with the pid of the boost process.
Was that not the pid? I tried 16179 and 16181.
So, boost is ending and the process is not removed from the processes table. So, run one more test. Let it run to completion and then run this query: SELECT * FROM processes;
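One way to cross-check the rows returned by that query against what is actually running is to probe each pid with signal 0. The following is a minimal Python sketch for a POSIX system, not part of Cacti: the helper names are hypothetical, and each row is assumed to be a dict carrying a `pid` key, which may not match the real table layout.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this pid currently exists (POSIX only)."""
    if pid <= 0:
        # pid 0 and negative pids address process groups, not one process.
        raise ValueError("pid must be a positive integer")
    try:
        # Signal 0 performs the existence and permission checks only;
        # nothing is delivered to the target process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def stale_entries(rows):
    """Return the rows whose pid no longer maps to a live process.

    'rows' is assumed to be a list of dicts with a 'pid' key, for example
    built by hand from the output of the query above.
    """
    return [row for row in rows if not pid_is_running(row["pid"])]


if __name__ == "__main__":
    # The current process is certainly alive, so it is not reported as stale.
    print(stale_entries([{"pid": os.getpid()}]))
```

Any row this reports after boost has finished would be a leftover entry of the kind discussed here. A pid can be reused by an unrelated process, so a "running" result is only a hint.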
While running, and then at completion:
I'll wait for the next full 2hr run.
No remote poller.
Okay.
Well, that's weird, the master goes away, but the child stays put. How many concurrent processes was that?
1
I grabbed the changes from Saturday afternoon and this morning (Sunday). The following boost run completed without the error. I will update you in about an hour on the next run.
Okay, so it creates the master and child (12642 & 12631). It appears to complete, then runs another child (17529).
straces
Does it think boost has exceeded its run time when another poller starts, since I'm on 1 minute polling?
That should not happen; it looks like you are missing a few additional patches. It is best to test using the full 1.2.x branch.
I don't update a file at a time. I can revert to 4/8, before this started. I'm doing full 1.2.x pulls and saving +2 previous revisions.
You're welcome to look yourself, but I can't find anything that I've overlooked.
#4183 was a month ago, and I follow as closely as I can to assist with testing in a production system. This just happened in the past week.
Dan, I was able to reproduce the problem today. Going to update poller_boost.php and lib/poller.php shortly.
Changes are in. Please test.
Thanks! I'm glad you were able to reproduce it. I have updated again. Boost will run in about an hour, but I have to work at 2am so I may not check it. I'll update with results sometime early in the morning.
I know the issue now. I'll fix it in the morning.
I was not able to reproduce, but I have a thought. Run this query and post the results: SELECT value FROM settings WHERE name = 'boost_rrd_update_max_runtime'; I thought that I had written the timeout logic correctly, and upon inspection this morning, it looks good.
So, timeout is 40 minutes. Bah.
Okay, pull lib/poller.php again. Note to self: Don't ever buy code manufactured on a Monday, or is that a car?
Thanks! That worked with one process :) I'll turn it up a bit and see now.
Looks good now.
Boost is hanging up and leaving a process running even after completion. When I find it, I have to manually kill it. It will continue as long as I leave it.