Can't delete or fix failed VM #62

Closed
danielkwinsor opened this Issue Mar 1, 2013 · 14 comments

7 participants

@danielkwinsor

I would like a way to manually delete a VM for the following use case, or perhaps even fix it with bosh cloudcheck. I'm deploying Cloud Foundry. I don't know what went wrong to get to this state, but one of my VMs ends up in an unresponsive state.

unknown/unknown | unresponsive agent | | | ed079483-17b1-4b9b-8f02-64ca9253b483 | 8d29364c-ae75-4a7f-9479-bc698d69a95a

You can see it has a cloud ID and an agent ID, and I also know it exists in my OpenStack.
A bosh deploy yields:
binding existing deployment: Timed out sending `get_state' to 8d29364c-ae75-4a7f-9479-bc698d69a95a after 30 seconds (00:01:30)
Error 450002: Timed out sending `get_state' to 8d29364c-ae75-4a7f-9479-bc698d69a95a after 30 seconds

bosh cloudcheck yields
Problem 1 of 1: Unknown VM (ed079483-17b1-4b9b-8f02-64ca9253b483) is not responding.

  1. Ignore problem
  2. Reboot VM
  3. Recreate VM using last known apply spec
  4. Delete VM reference (DANGEROUS!)

Please choose a resolution [1 - 4]:

None of options 2-4 works.
If I delete
unresponsive_agent 182: Delete VM reference (DANGEROUS!): VM `182' has a cloud id, please use a different resolution. (00:00:10)

If I recreate
unresponsive_agent 182: Recreate VM using last known apply spec: Unable to look up VM apply spec (00:00:10)

If I reboot
unresponsive_agent 182: Reboot VM: Agent still unresponsive after reboot (00:13:42)
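
For context, the guard behind the delete failure above presumably looks something like this hypothetical sketch (the names are illustrative, not the actual director source): the resolution refuses to drop the database record while a cloud ID is still attached, since that would orphan a live server in OpenStack.

Vm = Struct.new(:id, :cid)

def delete_vm_reference(vm)
  # Refuse while a cloud ID is still recorded for this VM.
  raise "VM `#{vm.id}' has a cloud id, please use a different resolution." if vm.cid
  # Otherwise: delete only the director's database record of the VM.
  puts "deleted reference for VM `#{vm.id}'"
end

delete_vm_reference(Vm.new(182, "ed079483-17b1-4b9b-8f02-64ca9253b483"))
# => RuntimeError: VM `182' has a cloud id, please use a different resolution.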

I can delete the deployment and try again, but with some 40 jobs in Cloud Foundry, it seems inevitable that something will go wrong.

As requested, here's my info
root@ubuntu:/var/vcap/deployments# bosh status
Updating director data... done

Director
Name microbosh-openstack
URL http://10.152.93.141:25555
Version 0.7 (release:b240cdfb bosh:57b31b3f)
User admin
UUID 213167e4-adb2-479a-880f-1c8d28814e8f
CPI openstack
dns enabled

Deployment
Manifest /var/vcap/deployments/Cloudfoundry-full.yml
root@ubuntu:/var/vcap/deployments# bosh version
BOSH 1.0.3
root@ubuntu:/var/vcap/deployments# bosh stemcells

+---------------+---------+--------------------------------------+
| Name | Version | CID |
+---------------+---------+--------------------------------------+
| bosh-stemcell | 0.6.7 | 335e6abd-e26a-48ff-9d30-0bd495fb2248 |
+---------------+---------+--------------------------------------+

Stemcells total: 1

The micro BOSH stemcell is 0.8.1.

@pmenglund

Did the initial VM creation fail? It might be that the recreate code assumes the VM was successfully created: it wants to fetch the stored apply_spec, but that only gets stored once the VM has been created successfully.
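
A minimal sketch of that assumption (illustrative names only, not the actual director source):

VmRecord = Struct.new(:agent_id)

def recreate_vm(vm, stored_apply_specs)
  # The apply spec is persisted only after a VM is created successfully.
  apply_spec = stored_apply_specs[vm.agent_id]
  raise "Unable to look up VM apply spec" if apply_spec.nil?
  # ... otherwise: delete the old VM, boot a replacement, re-apply the spec ...
end

recreate_vm(VmRecord.new("8d29364c-ae75-4a7f-9479-bc698d69a95a"), {})
# => RuntimeError: Unable to look up VM apply spec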

@pmenglund was assigned Mar 6, 2013
@danielkwinsor

Yes and no. The VM is created on OpenStack OK, and probably boots up and all that. What happened is that, after creating x VMs, the (x+1)th VM lands on a compute node different from the one the director VM is on. For some reason the network is set up such that VMs on different compute nodes can't ping each other. So the VM itself may have been created OK, but I get the error "Timed out pinging to b5d59cdc-a4e0-417d-977d-d5d1ca967cc8 after 600 seconds (00:10:50)" when doing a "bosh deploy". Make sense?

@rkoster

I also receive this error, although I only have one compute node. Note that the "pinging" here means the director sending a ping request to the agent on the new stemcell, which should respond with a pong, as sketched below.
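
A rough sketch of that handshake in Ruby (the NATS URI, credentials, subject names, and payload layout are assumptions about the agent protocol, not director source):

require "json"
require "nats/client" # the EventMachine-based nats gem

agent_id = "8d29364c-ae75-4a7f-9479-bc698d69a95a" # from the vms listing above
reply_to = "director.health-check"                # illustrative reply subject

NATS.start(:uri => "nats://<user>:<pass>@10.200.7.4:4222") do
  NATS.subscribe(reply_to) do |msg|
    puts "agent replied: #{msg}" # a healthy agent answers with a pong value
    NATS.stop
  end
  NATS.publish("agent.#{agent_id}",
               { "method" => "ping", "arguments" => [], "reply_to" => reply_to }.to_json)
  EM.add_timer(30) { puts "no pong"; NATS.stop } # give up eventually, as the director does
end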

I have tried to investigate further. First I had to get into the freshly created stemcell:
SSH into unresponsive stemcell

Login via console

vcap/c1oudc0w                                                                                                                            
su -                                                                                                                                     
c1oudc0w

Install nano

apt-get install nano                                                                                                                     

Enable ssh password login

nano /etc/ssh/sshd_config
ChallengeResponseAuthentication yes                                                                                                      
/etc/init.d/ssh restart                                                                                                                  

Login to stemcell

bosh-bootstrap ssh                                                                                                                       
ssh vcap@{private_ip}
c1oudc0w                                                                                                                                 
su -                                                                                                                                     
c1oudc0w

Then I found the following messages in the logs under /var/vcap/bosh/log:

INFO: got user_data: {"registry"=>{"endpoint"=>"http://10.200.7.4:25777"}, "server"=>{"name"=>"vm-be441e38-442c-4b9c-a46e-a2ffd4f8a841"}, "dns"=>{"nameserver"=>["10.200.7.4", "10.200.7.1"]}}
2013-03-15_09:02:21.03547 #[26846] INFO: failed to load infrastructure settings: Cannot read settings for `http://10.200.7.4:25777/servers/vm-be441e38-442c-4b9c-a46e-a2ffd4f8a841/settings' from registry, got HTTP 500
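
For reference, the fetch the agent is attempting boils down to something like this (the endpoint and server name are taken from the log above; the response envelope keys are an assumption about the registry's format):

require "json"
require "net/http"
require "uri"

endpoint = "http://10.200.7.4:25777"
server   = "vm-be441e38-442c-4b9c-a46e-a2ffd4f8a841"
response = Net::HTTP.get_response(URI("#{endpoint}/servers/#{server}/settings"))

if response.code == "200"
  # Assumed envelope: {"status" => "ok", "settings" => "<json string>"}
  settings = JSON.parse(JSON.parse(response.body)["settings"])
  puts settings.keys
else
  puts "Cannot read settings: got HTTP #{response.code}" # the HTTP 500 seen above
end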

When visiting http://10.200.7.4:25777/servers/vm-be441e38-442c-4b9c-a46e-a2ffd4f8a841/settings using sshuttle, I got the registry's error page (screenshot: "registry").

@rkoster

Here is the stack trace from the openstack_registry on the micro BOSH node:

NoMethodError - undefined method `body' for #<Hash:0x00000002306310>:
 /var/vcap/packages/openstack_registry/gem_home/gems/fog-1.9.0/lib/fog/openstack/compute.rb:338:in `rescue in request'
 /var/vcap/packages/openstack_registry/gem_home/gems/fog-1.9.0/lib/fog/openstack/compute.rb:326:in `request'
 /var/vcap/packages/openstack_registry/gem_home/gems/fog-1.9.0/lib/fog/openstack/requests/compute/list_servers_detail.rb:15:in `list_servers_detail'
 /var/vcap/packages/openstack_registry/gem_home/gems/fog-1.9.0/lib/fog/openstack/models/compute/servers.rb:21:in `all'
 /var/vcap/packages/openstack_registry/gem_home/gems/fog-1.9.0/lib/fog/core/collection.rb:141:in `lazy_load'
 /var/vcap/packages/openstack_registry/gem_home/gems/fog-1.9.0/lib/fog/core/collection.rb:22:in `each'
 /var/vcap/packages/openstack_registry/gem_home/gems/bosh_openstack_registry-1.5.0.pre2/lib/openstack_registry/server_manager.rb:67:in `find'
 /var/vcap/packages/openstack_registry/gem_home/gems/bosh_openstack_registry-1.5.0.pre2/lib/openstack_registry/server_manager.rb:67:in `server_ips'
 /var/vcap/packages/openstack_registry/gem_home/gems/bosh_openstack_registry-1.5.0.pre2/lib/openstack_registry/server_manager.rb:47:in `check_instance_ips'
 /var/vcap/packages/openstack_registry/gem_home/gems/bosh_openstack_registry-1.5.0.pre2/lib/openstack_registry/server_manager.rb:34:in `read_settings'
 /var/vcap/packages/openstack_registry/gem_home/gems/bosh_openstack_registry-1.5.0.pre2/lib/openstack_registry/api_controller.rb:22:in `block in <class:ApiController>'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:1175:in `call'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:1175:in `block in compile!'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:739:in `instance_eval'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:739:in `route_eval'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:723:in `block (2 levels) in route!'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:773:in `block in process_route'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:770:in `catch'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:770:in `process_route'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:722:in `block in route!'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:721:in `each'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:721:in `route!'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:857:in `dispatch!'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:659:in `block in call!'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:823:in `block in invoke'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:823:in `catch'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:823:in `invoke'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:659:in `call!'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/base.rb:644:in `call'
 /var/vcap/packages/openstack_registry/gem_home/gems/rack-1.5.2/lib/rack/head.rb:11:in `call'
 /var/vcap/packages/openstack_registry/gem_home/gems/sinatra-1.2.8/lib/sinatra/showexceptions.rb:21:in `call'
 /var/vcap/packages/openstack_registry/gem_home/gems/rack-1.5.2/lib/rack/builder.rb:138:in `call'
 /var/vcap/packages/openstack_registry/gem_home/gems/rack-1.5.2/lib/rack/urlmap.rb:65:in `block in call'
 /var/vcap/packages/openstack_registry/gem_home/gems/rack-1.5.2/lib/rack/urlmap.rb:50:in `each'
 /var/vcap/packages/openstack_registry/gem_home/gems/rack-1.5.2/lib/rack/urlmap.rb:50:in `call'
 /var/vcap/packages/openstack_registry/gem_home/gems/thin-1.5.0/lib/thin/connection.rb:81:in `block in pre_process'
 /var/vcap/packages/openstack_registry/gem_home/gems/thin-1.5.0/lib/thin/connection.rb:79:in `catch'
 /var/vcap/packages/openstack_registry/gem_home/gems/thin-1.5.0/lib/thin/connection.rb:79:in `pre_process'
 /var/vcap/packages/openstack_registry/gem_home/gems/thin-1.5.0/lib/thin/connection.rb:54:in `process'
 /var/vcap/packages/openstack_registry/gem_home/gems/thin-1.5.0/lib/thin/connection.rb:39:in `receive_data'
 /var/vcap/packages/openstack_registry/gem_home/gems/eventmachine-0.12.10/lib/eventmachine.rb:256:in `run_machine'
 /var/vcap/packages/openstack_registry/gem_home/gems/eventmachine-0.12.10/lib/eventmachine.rb:256:in `run'
 /var/vcap/packages/openstack_registry/gem_home/gems/bosh_openstack_registry-1.5.0.pre2/lib/openstack_registry/runner.rb:23:in `run'
 /var/vcap/packages/openstack_registry/gem_home/gems/bosh_openstack_registry-1.5.0.pre2/bin/openstack_registry:28:in `<top (required)>'
 /var/vcap/packages/openstack_registry/bin/openstack_registry:23:in `load'
 /var/vcap/packages/openstack_registry/bin/openstack_registry:23:in `<main>'

@rkoster

As can be seen in the stack trace, I'm using a micro BOSH and stemcell at version 1.5.0.pre2.
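
The error class itself is trivial to reproduce; fog's rescue path at compute.rb:338 evidently received a plain Hash where it expected a response object:

# Not fog's code, just the failure mode: Hash has no #body method.
{ "badRequest" => { "code" => 500 } }.body
# NoMethodError: undefined method `body' for #<Hash:0x...>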

@frodenas
Cloud Foundry member

@rkoster What version of OpenStack are you using? It seems OpenStack is not returning a servers hash when fog calls list_servers_detail. Can you please use fog to connect to your OpenStack environment and execute a simple openstack.servers?
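
For example, something along these lines (the placeholders need real values from the registry config; the option keys are the ones fog 1.9 uses for its OpenStack compute provider):

require "fog"

# Fill in the placeholders from /var/vcap/jobs/registry/config/registry.yml.
openstack = Fog::Compute.new(
  :provider           => "OpenStack",
  :openstack_auth_url => "http://<keystone-host>:5000/v2.0/tokens",
  :openstack_username => "<username>",
  :openstack_api_key  => "<password>",
  :openstack_tenant   => "<tenant>"
)

p openstack.servers # an exception here points at fog/OpenStack rather than BOSH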

@rkoster

I have just redeployed the micro BOSH with the --update flag (same stemcell), and now the problem has gone away. But I have had this problem before, so I think it will resurface. It looks like the problem builds up over time.

@tpradeep

As an extreme step, I could delete unresponsive agents that cloudcheck would not remove by deleting the deployment with "--force" (bosh delete deployment <name> --force).

@rkoster

I encountered this problem again. I'm running OpenStack Essex with micro BOSH version 1.5.0.pre2 (release:346bb97d bosh:346bb97d).

NoMethodError - undefined method `body' for #<Hash:0x000000042f8f38>:
 /var/vcap/packages/registry/gem_home/gems/fog-1.9.0/lib/fog/openstack/compute.rb:338:in `rescue in request'
 /var/vcap/packages/registry/gem_home/gems/fog-1.9.0/lib/fog/openstack/compute.rb:326:in `request'
 /var/vcap/packages/registry/gem_home/gems/fog-1.9.0/lib/fog/openstack/requests/compute/list_servers_detail.rb:15:in `list_servers_detail'
 /var/vcap/packages/registry/gem_home/gems/fog-1.9.0/lib/fog/openstack/models/compute/servers.rb:21:in `all'
 /var/vcap/packages/registry/gem_home/gems/fog-1.9.0/lib/fog/core/collection.rb:141:in `lazy_load'
 /var/vcap/packages/registry/gem_home/gems/fog-1.9.0/lib/fog/core/collection.rb:22:in `each'
 /var/vcap/packages/registry/gem_home/gems/bosh_registry-1.5.0.pre2/lib/bosh_registry/instance_manager/openstack.rb:39:in `find'
 /var/vcap/packages/registry/gem_home/gems/bosh_registry-1.5.0.pre2/lib/bosh_registry/instance_manager/openstack.rb:39:in `instance_ips'
 /var/vcap/packages/registry/gem_home/gems/bosh_registry-1.5.0.pre2/lib/bosh_registry/instance_manager.rb:45:in `check_instance_ips'
 /var/vcap/packages/registry/gem_home/gems/bosh_registry-1.5.0.pre2/lib/bosh_registry/instance_manager.rb:29:in `read_settings'
 /var/vcap/packages/registry/gem_home/gems/bosh_registry-1.5.0.pre2/lib/bosh_registry/api_controller.rb:22:in `block in <class:ApiController>'

To debug, I created a /root/.fog file with the credentials from /var/vcap/jobs/registry/config/registry.yml.
Then I ran:

export PATH=/var/vcap/packages/ruby/bin:$PATH
export GEM_HOME=/var/vcap/packages/registry/gem_home  # export so the fog console sees it
fog  # start the interactive fog console (reads /root/.fog)
# inside the console (Ruby):
openstack = Fog::Compute.new(:provider => "OpenStack")
openstack.servers

Which returns

  <Fog::Compute::OpenStack::Servers
    filters={}
    [
      <Fog::Compute::OpenStack::Server
        id="67163032-afb2-4d2f-ae11-fa28e4ed60b0",
        instance_name=nil,
        addresses={"service"=>[{"version"=>4, "addr"=>"10.200.7.5"}]},
        flavor={"id"=>"1004", "links"=>[{"href"=>"http://{ip}/7276102beb584c66a11bb6b923a4d4f1/flavors/1004", "rel"=>"bookmark"}]},
        host_id="eb4609af2272e22d5c94d87f7f66a0b9a5e960ec702b752aa8ae6aa3",
        image={"id"=>"298505df-476a-45d4-a213-18d6d0224cb3", "links"=>[{"href"=>"http://{ip}/7276102beb584c66a11bb6b923a4d4f1/images/298505df-476a-45d4-a213-18d6d0224cb3", "rel"=>"bookmark"}]},
        metadata=        <Fog::Compute::OpenStack::Metadata
          []
        >,
        links=[{"href"=>"http://{ip}/v1.1/7276102beb584c66a11bb6b923a4d4f1/servers/67163032-afb2-4d2f-ae11-fa28e4ed60b0", "rel"=>"self"}, {"href"=>"http://{ip}/7276102beb584c66a11bb6b923a4d4f1/servers/67163032-afb2-4d2f-ae11-fa28e4ed60b0", "rel"=>"bookmark"}],
        name="vm-5a7b25dc-8691-4101-ab9b-3697215b1f0d",
        personality=nil,
        progress=0,
        accessIPv4="",
        accessIPv6="",
        availability_zone=nil,
        user_data_encoded=nil,
        state="ACTIVE",
        created=2013-03-20 15:08:36 UTC,
        updated=2013-03-20 15:09:46 UTC,
        tenant_id="7276102beb584c66a11bb6b923a4d4f1",
        user_id="834d9dc2fb7d4ec5b109103e6649d19d",
        key_name="microbosh-openstack",
        fault=nil,
        os_dcf_disk_config="MANUAL",
        os_ext_srv_attr_host=nil,
        os_ext_srv_attr_hypervisor_hostname=nil,
        os_ext_srv_attr_instance_name=nil,
        os_ext_sts_power_state=1,
        os_ext_sts_task_state=nil,
        os_ext_sts_vm_state="active"
      >
   ]
  >

@rkoster

When running ps aux | grep registry I get:

vcap      2511  0.4  1.0 134732 43292 ?        S<l  Mar18  12:09 /var/vcap/data/packages/ruby/3/bin/ruby /var/vcap/packages/registry/bin/bosh_registry -c /var/vcap/jobs/registry/config/registry.yml

When I kill this process (monit then restarts it), all works fine again.

@oppegard
Cloud Foundry member

Are you still having similar issues with newer versions of BOSH?

@rkoster

The problem of the registry getting stuck over time is still an issue in newer versions of BOSH.
But reading back through this whole issue, it looks like this was not the initial problem being described.
I have no experience with running a multi-node OpenStack, so on second thought I think these should be two separate issues. The debugging information I posted above seems more related to #96. Sorry for hijacking this issue.

@frodenas
Cloud Foundry member

Regarding the original issue, it has been fixed: the BOSH cck command now allows you to delete the VM reference, so you can recreate it later. I'm closing this issue; if it happens again, feel free to reopen it.

@rkoster Regarding the registry issue, we'll follow it in #96

@frodenas frodenas closed this May 31, 2013