unresponsive agents over time OpenStack #96
Comments
Are you still having this problem with OpenStack stemcells? If so, have you tried building a newer one that may have addressed the issue? |
I'm currently using the ones from CI.
But I still have the problem from time to time. However, I found a workaround: doing a I currently have a microbosh deployed, so what logging would be helpful to further debug this problem? |
How recent is the microbosh code ('bosh status' should show the git sha)? Do the registry logs under /var/vcap/sys/log/registry show anything of interest? @frodenas have you seen behavior like this on OpenStack? |
I'm currently running |
I found the following stack trace in
|
I encountered the problem again today. It happened when I added the echo service to my deployment. For this, packages needed to be compiled and, as a consequence, new VMs needed to be created. The creation process went fine (the VMs were created, as I could see in Horizon), but then the compilation did not start. |
It seems the registry is losing its connection to OpenStack; it tries to reauthenticate but fails. |
There's a bug in the fog gem. Once a user token has expired, it doesn't reauthenticate, because it keeps using the same token instead of asking for a new one. |
@frodenas: is there an issue and/or pull request upstream for this? |
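To illustrate the failure mode described above, here is a minimal Python sketch. It is not the fog gem's code: `FakeIdentity`, `Client`, and `Unauthorized` are hypothetical names made up for this example, and the fake service only mimics token expiry. The point is the retry path, where a rejected token is discarded and a new one is requested once instead of being resent.

```python
class Unauthorized(Exception):
    """Raised by the fake service when a token is missing or expired."""


class FakeIdentity:
    """Hypothetical stand-in for an identity service; not a real OpenStack API."""

    def __init__(self):
        self.issued = 0
        self.valid_token = None

    def authenticate(self):
        # Issue a new token and invalidate the previous one.
        self.issued += 1
        self.valid_token = f"token-{self.issued}"
        return self.valid_token

    def expire(self):
        # Simulate the token timing out on the server side.
        self.valid_token = None

    def list_servers(self, token):
        if token is None or token != self.valid_token:
            raise Unauthorized("token expired")
        return ["vm-1", "vm-2"]


class Client:
    """Caches a token and re-authenticates once when it is rejected."""

    def __init__(self, identity):
        self.identity = identity
        self.token = identity.authenticate()

    def list_servers(self):
        try:
            return self.identity.list_servers(self.token)
        except Unauthorized:
            # The buggy behaviour would resend self.token here and keep failing.
            # Instead, drop the stale token, fetch a new one, and retry once.
            self.token = self.identity.authenticate()
            return self.identity.list_servers(self.token)
```

With this shape, a caller that keeps a `Client` alive for days (as the registry does) would survive a token expiry, assuming the credentials themselves are still valid.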
Nice |
Today I had a compilation VM that couldn't authenticate with registry. Hopefully related and hopefully fixed. Has a new microbosh come out since #235 was merged? |
Having said the above, my compilation VM's {"registry":{"endpoint":"http://10.0.0.2:25777"}, ... |
I deployed the wordpress example a few days ago, but tonight it's failing as above on a new deployment. |
How do you kill a deployment when the compilation VM is hanging due to agent issues? The timeout takes forever when you know it's a glitch. |
Upgrading from 676 to 693 to see if it fixes the issue. |
@drnic, the 698 stemcell doesn't contain the #235 PR; it'll be in stemcell >=704 (not yet published). The user-data usually doesn't contain the user/pwd for the registry (except in the microbosh VM). The bosh_registry implements a security mechanism when reading settings that checks that the IP of the VM asking for settings is the same as the IP of the settings requested. It would be useful to see the VM logs to check exactly what's happening in your case. Regarding cancelling a compilation VM, it's actually not possible. We have a story in our backlog to deal with this issue. |
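The IP check mentioned above can be sketched roughly as follows. This is a hypothetical in-memory model in Python, not the real bosh_registry code; the class name `SettingsRegistry` and its methods are assumptions made for illustration only. It just encodes the rule as stated: settings are handed out only when the caller's address equals the address the settings belong to.

```python
class SettingsRegistry:
    """Hypothetical in-memory registry; not the real bosh_registry code."""

    def __init__(self):
        # Maps the IP a VM was assigned to that VM's settings.
        self._settings_by_ip = {}

    def store(self, vm_ip, settings):
        self._settings_by_ip[vm_ip] = settings

    def read(self, requested_ip, caller_ip):
        # Only the VM that owns the settings may read them: the address the
        # request comes from must equal the address the settings belong to.
        if caller_ip != requested_ip:
            raise PermissionError("caller IP does not match requested settings")
        return self._settings_by_ip[requested_ip]
```

Under this model, a VM whose address differs from the one recorded for it would be refused its settings, which is one reason the VM logs are worth checking when an agent cannot read from the registry.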
Correction: The patch is included in stemcell >= 703 (it has been published just a few minutes ago). @rkoster Can you please try the latest stemcell? |
I have deployed 703 but the director has some problems while starting.
|
Seems like the default bosh postgres user has been changed
|
Fixed the problem of the failed postgres migration with: Now have successfully deployed microbosh 703. If the problem still persists, it should manifest itself within one day. |
Can you create a ticket to add this as a migration? Dr Nic Williams On Tue, Jun 4, 2013 at 7:23 AM, Ruben Koster notifications@github.com
|
Perhaps it's not possible to migrate actually. So do we need to add properties to legacy micro_bosh.yml? Dr Nic Williams On Tue, Jun 4, 2013 at 7:23 AM, Ruben Koster notifications@github.com
|
The regression happened as our CI system unfortunately doesn't test upgrades, just clean installs. Sorry about that, I'll make sure someone looks at fixing it. |
Perhaps a ticket/feature to test n-1 -> n upgrades? Dr Nic Williams On Tue, Jun 4, 2013 at 7:30 AM, Martin Englund notifications@github.com
|
Both of those stories exist. We have the failed update bug at the top of the backlog so it will get picked up next. We have CI upgrade stories for each platform for micro and full bosh which are also prioritized in the backlog. CI improvements are the current focus of the team, so we anticipate coverage for these cases within a few weeks. |
xoxo to the ci team! |
@rkoster Did your agents lose the connection again? |
The problem did not reappear. I have tried increasing the size of the deployment and the machines were added without problems. I also don't see the connection problem anymore in the registry log. I only see the following stack trace, which does not seem to cause problems. If this stack trace is not expected, I will create a new issue for it.
|
Thanks @rkoster for reporting back! Reopen the issue if the bug appears again. |
I have deployed a micro bosh on OpenStack, and with this micro bosh a normal bosh has been deployed. I was deploying other stuff with this normal bosh on Friday. Over the weekend, however, some of the instances in the bosh deployment have become unresponsive. I get the following when running
bosh vms
while targeting the micro bosh.
I'm currently running
Version 1.5.0.pre2 (release:346bb97d bosh:346bb97d)
which was created from master one week ago. I created a micro-bosh and a normal stemcell from this same commit.