
windows agent cpu spikes when the director stops #137

Closed
flavorjones opened this issue Aug 11, 2017 · 15 comments
@flavorjones

Howdy,

Given I have bosh-deployed concourse
And that deployment includes a windows 2012R2 VM
And that windows VM is running the bosh-google-kvm-windows2012R2-go_agent v1200.0 stemcell
And I stop the director VM
Then I see the windows agent spike to 100% CPU

Here's a chart from GCP/Stackdriver:

[screenshot from 2017-08-10 21-07-49]

I'll try to reproduce later and provide more detailed logs from the agent.

@flavorjones
Author

It might also be worth noting here that the windows VM sits consistently at ~25% CPU utilization, even when nothing is going on. It's not clear to me whether this is due to the concourse worker job or the bosh agent.

@crawsible

crawsible commented Aug 11, 2017

@flavorjones Hrm, very interesting and definitely unexpected (cc: @davidjahn). Thanks for reporting! I've scheduled a bug investigation in our backlog here.

FWIW, naively killing the bosh director on my cf deployment on GCP, using a stemcell v1200.1 release candidate, wasn't enough to reproduce this issue for me. That's pretty surprising -- I can't imagine any running process aside from the agent would be affected by the director's accessibility. I think I know the answer to this question, but just to be sure, would you happen to know whether the agent was performing any director-requested work at the time the director stopped?

@davidjahn
Contributor

Ohhh this is interesting... not sure if the problem is caused by the agent alone or by some interaction with concourse... we will see if we can reproduce it on GCP.

@charlievieth
Contributor

charlievieth commented Aug 11, 2017

One thing to note: if the agent dies, its jobs (the services it creates) will keep running.

@charlievieth
Contributor

Is that red line the agent's CPU usage, and if not do we know which process it correlates to?

@mhoran
Contributor

mhoran commented Aug 11, 2017

In investigating this bug, we discovered a completely unrelated bug where the agent will not restart after termination. Thanks!

We're still trying to reproduce the behavior you're seeing. In the stemcell you're using (where the aforementioned unrelated bug is not present), the agent does try to restart when the director connection is lost. We don't see our VMs using 100% CPU, though. Could you tell us a bit more about the instance types in this deployment?

Also, logs from /var/vcap/bosh/log would be super helpful.

@flavorjones
Author

OK, just landed and will try to reproduce and get y'all some logs and maybe some screenshots from Task Manager.

@mhoran
Contributor

mhoran commented Aug 11, 2017

OK, so we do see about 25% CPU usage with the default of restarting the agent every 5 seconds on failure. This is a bit excessive, so we're going to go with an exponential backoff for restarting the agent. We'll back off up to 5 minutes, and then try to start the agent every 5 minutes thereafter. Note that the Linux agent is also chatty on startup when NATS is unreachable, but it probably doesn't have such an expensive bootstrap process.

It's worth noting that with a backoff of 5 minutes, when you bring your director back online and the resurrector is enabled, the resurrector may recreate the VM before the agent has a chance to restart itself.


@mhoran
Contributor

mhoran commented Aug 11, 2017

(Hopefully) Fixed in dc9a5b4.
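For context on the mechanism involved: winsw, the Windows service wrapper that reads service_wrapper.xml, supports escalating restart delays through repeated `<onfailure>` elements, where entries apply in order and the last one repeats for all subsequent failures. A hypothetical sketch of that mechanism (not the actual diff in dc9a5b4):

```xml
<!-- Illustrative winsw config fragment: restart delays escalate
     across failures, then hold at the last entry (5 min). -->
<service>
  <id>bosh-agent</id>
  <onfailure action="restart" delay="5 sec"/>
  <onfailure action="restart" delay="30 sec"/>
  <onfailure action="restart" delay="5 min"/>
</service>
```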

@flavorjones
Author

Looks good. I updated service_wrapper.xml with the changes from dc9a5b4 and went through the stop/uninstall/install/start cycle. Here's what CPU util looked like while the director was stopped:

[screenshot from 2017-08-12 17-13-48]

Thanks all!

@crawsible

@flavorjones Great to hear! We'll ping back here and close this issue once this agent change makes its way into a 2012R2 stemcell release.

@cppforlife
Contributor

@crawsible did this make it through?

@crawsible

@cppforlife Yes it did, thanks for the reminder.

@flavorjones
Author

For the record, I believe this was patched in 1200.3 (though it may have been 1200.2).
