windows agent cpu spikes when the director stops #137
Comments
It might also be worth noting here that the Windows VM is consistently at ~25% CPU utilization, even when nothing is going on. It's not clear to me whether this is due to the Concourse worker job or the bosh agent.
@flavorjones Hrm, very interesting and definitely unexpected (cc: @davidjahn). Thanks for reporting! I've scheduled a bug investigation in our backlog here. FWIW, naively killing the bosh director on my cf deployment on GCP, using a stemcell v1200.1 release candidate, wasn't enough to reproduce this issue for me. That's pretty surprising -- I can't imagine any running process aside from the agent would be affected by the director's accessibility. I think I know the answer to this question, but just to be sure, would you happen to know whether the agent was performing any director-requested work at the time the director stopped?
Ohhh, this is interesting... not sure whether the problem is induced by the agent alone or by some interaction with Concourse. We will see if we can reproduce it on GCP.
One thing to note: if the agent dies, its jobs (the services it creates) will keep running.
Is that red line the agent's CPU usage, and if not, do we know which process it correlates to?
In investigating this bug, we discovered a completely unrelated bug where the agent will not restart after termination. Thanks! We're still trying to reproduce the behavior you're seeing. In the stemcell you're using (where the aforementioned unrelated bug is not present), the agent does try to restart when the director connection is lost. We don't see our VMs using 100% CPU, though. Could you tell us a bit more about the instance types in this deployment? Also, logs from
OK, just landed and will try to reproduce and get y'all some logs and maybe some screenshots from Task Manager. |
OK, so we do see that, with the default of restarting the agent every 5 seconds on failure, CPU usage sits at about 25%. This is a bit excessive, so we're going to go with an exponential backoff for restarting the agent. We'll back off up to 5 minutes, and then try to start the agent every 5 minutes thereafter. Note that the Linux agent is also chatty on startup when NATS is unreachable, but it probably doesn't have such an expensive bootstrap process. It's worth noting that with a backoff of 5 minutes, when you bring your director back online and the resurrector is enabled, the resurrector may recreate the VM before the agent has a chance to restart itself.
What about capping the wait at 2.5 minutes? That would give you a better chance of reconnecting with the director, and it's still 30 times longer than 5 seconds.
(Hopefully) Fixed in dc9a5b4.
Looks good. I updated. Thanks all!
@flavorjones Great to hear! We'll ping back here and close this issue once this agent change makes its way into a 2012R2 stemcell release.
@crawsible did this make it through?
@cppforlife Yes it did, thanks for the reminder.
For the record, I believe this was patched in 1200.3 (though it may have been 1200.2).
Howdy,
Given I have bosh-deployed Concourse
And that deployment includes a Windows 2012R2 VM
And that Windows VM is running the bosh-google-kvm-windows2012R2-go_agent v1200.0 stemcell
And I stop the director VM
Then I see the Windows agent spike to 100% CPU
Here's a chart from GCP/Stackdriver:
I'll try to reproduce later and provide more detailed logs from the agent.