
windows agent cpu spikes when the director stops #137

Closed
flavorjones opened this issue Aug 11, 2017 · 15 comments
@flavorjones

Howdy,

Given I have bosh-deployed concourse
And that deployment includes a windows 2012R2 VM
And that windows VM is running the bosh-google-kvm-windows2012R2-go_agent v1200.0 stemcell
And I stop the director VM
Then I see the windows agent spike to 100% CPU

Here's a chart from GCP/Stackdriver:

[screenshot from 2017-08-10 21-07-49]

I'll try to reproduce later and provide more detailed logs from the agent.

@flavorjones
Author

It might also be worth noting here that the windows VM sits consistently at ~25% CPU utilization, even when nothing is going on. It's not clear to me whether this is due to the concourse worker job or the bosh agent.

@crawsible

crawsible commented Aug 11, 2017

@flavorjones Hrm, very interesting and definitely unexpected (cc: @davidjahn). Thanks for reporting! I've scheduled a bug investigation in our backlog here.

FWIW, naively killing the bosh director on my cf deployment on GCP, using a stemcell v1200.1 release candidate, wasn't enough to reproduce this issue for me. That's pretty surprising -- I can't imagine any running process aside from the agent would be affected by the director's accessibility. I think I know the answer to this question, but just to be sure, would you happen to know whether the agent was performing any director-requested work at the time the director stopped?

@davidjahn
Contributor

Ohhh this is interesting... not sure if the problem is caused by the agent alone or by some interaction with concourse... we will see if we can reproduce it on GCP.

@charlievieth
Contributor

charlievieth commented Aug 11, 2017

One thing to note: if the agent dies, its jobs (the services it creates) will keep running.

@charlievieth
Contributor

Is that red line the agent's CPU usage, and if not do we know which process it correlates to?

@mhoran
Contributor

mhoran commented Aug 11, 2017

In investigating this bug, we discovered a completely unrelated bug where the agent will not restart after termination. Thanks!

We're still trying to reproduce the behavior you're seeing. In the stemcell you're using (where the aforementioned unrelated bug is not present), the agent does try to restart when the director connection is lost. We don't see our VMs using 100% CPU, though. Could you tell us a bit more about the instance types in this deployment?

Also, logs from /var/vcap/bosh/log would be super helpful.

@flavorjones
Author

OK, just landed and will try to reproduce and get y'all some logs and maybe some screenshots from Task Manager.

@mhoran
Contributor

mhoran commented Aug 11, 2017

OK, so we do see about 25% CPU usage with the default of restarting the agent every 5 seconds on failure. This is a bit excessive, so we're going to go with an exponential backoff for restarting the agent. We'll back off up to 5 minutes, and then try to start the agent every 5 minutes thereafter. Note that the Linux agent is also chatty on startup when NATS is unreachable, but it probably doesn't have such an expensive bootstrap process.

It's worth noting that with a backoff of 5 minutes, when you bring your director back online and the resurrector is enabled, the resurrector may recreate the VM before the agent has a chance to restart itself.


@mhoran
Contributor

mhoran commented Aug 11, 2017

(Hopefully) Fixed in dc9a5b4.
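For context on the mechanism involved: winsw, the Windows service wrapper that reads service_wrapper.xml, supports escalating restart delays through repeated `<onfailure>` elements, where entries apply in order and the last one repeats for all subsequent failures. A hypothetical sketch of that mechanism (not the actual diff in dc9a5b4):

```xml
<!-- Illustrative winsw config fragment: restart delays escalate
     across failures, then hold at the last entry (5 min). -->
<service>
  <id>bosh-agent</id>
  <onfailure action="restart" delay="5 sec"/>
  <onfailure action="restart" delay="30 sec"/>
  <onfailure action="restart" delay="5 min"/>
</service>
```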

@flavorjones
Author

Looks good. I updated service_wrapper.xml with the changes from dc9a5b4 and went through the stop/uninstall/install/start cycle. Here's what CPU util looked like while the director was stopped:

[screenshot from 2017-08-12 17-13-48]

Thanks all!

@crawsible

@flavorjones Great to hear! We'll ping back here and close this issue once this agent change makes its way into a 2012R2 stemcell release.

@cppforlife
Contributor

@crawsible did this make it through?

@crawsible

@cppforlife Yes it did, thanks for the reminder.

@flavorjones
Author

For the record, I believe this was patched in 1200.3 (though it may have been 1200.2).
