VM becomes unresponsive to bosh after reboot #35

gossion · 2018-10-19T08:23:57Z

Not sure whether you are aware of this issue.

I deployed a simple workload to Windows VMs on Azure as a test, and the deployment succeeded. However, when I restarted the VMs from the Azure portal, they became unresponsive to BOSH and never came back into service.

guwe@guwecf0628:~/workspace/sample-go-windows-boshrelease$ bosh -e azure vms
Using environment '10.0.0.4' as client 'admin'

Task 967. Done

Deployment 'webapp'

Instance                                      Process State       AZ  IPs  VM CID                                                                      VM Type  Active
webapp1/56d68506-4ed6-4e0c-a9bf-fa9624a29b09  unresponsive agent  z2  -    agent_id:f95268c3-a067-4b09-ba9f-65c0ae2d2fe8;resource_group_name:guwe0628  small    true
webapp1/8452e005-bb6e-49ff-b7dc-0b74b492db0e  unresponsive agent  z1  -    agent_id:066b3e8d-6c9e-4c8e-84a3-fcf4b3b262a8;resource_group_name:guwe0628  small    true

2 vms

Succeeded

My env:

cloud provider: Azure
stemcell:
https://bosh.io/stemcells/bosh-azure-hyperv-windows1803-go_agent
manifest:

---
name: webapp

instance_groups:
- name: webapp1
  azs: [z1, z2, z3]
  instances: 2
  vm_type: small
  stemcell: windows1803
  networks:
    - name: default
      default: [gateway, dns]
    #- name: network2
  jobs:
  - name: simple-go-web-app
    release: sample-go-windows
    properties:
      port: 3000


variables: []

stemcells:
- alias: windows
  os: windows2012R2
  version: latest
- alias: "windows1803"
  os: "windows1803"
  version: "1803.2"

update:
  canaries: 1
  canary_watch_time: 1000-120000
  update_watch_time: 1000-120000
  max_in_flight: 1
  serial: false

releases:
- name: sample-go-windows
  version: 1.0.0
  url: https://github.com/cloudfoundry-community/sample-go-windows-boshrelease/releases/download/v1.0.0/sample-go-windows-1.0.0.tgz
  sha1: 7d15b2bd43acf849fac5f6ec805e0b6cfa1b9bb5

I believe the VM was up because it responded to RDP. There may be an issue with bosh-agent; however, I don't have credentials to log in to the VM and check.

cf-gitbot · 2018-10-19T08:23:59Z

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/161338351

The labels on this github issue will be updated when the story is started.

thisisnotashwin · 2018-10-22T14:43:33Z

Hey @gossion

Bosh does not currently support a workflow where VM restarts are managed outside of the Bosh director. In case you do need to restart VMs, the bosh cli has commands to do the same. Restarting the VM from the portal leads to the BOSH director thinking the VMs have been torn down and it tries to recreate them.

Was there a particular use case that you had which required the VMs to be restarted from the Azure portal? It would help us understand the requirement a little better.

Thanks!!
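For readers who end up with unresponsive agents like the ones shown above, here is a minimal sketch of what recovery through the CLI might look like. The comment above does not name specific commands; `cloud-check` and `recreate` are my assumption of the relevant ones. The environment alias `azure` and deployment name `webapp` come from the original report, and `BOSH_CMD` is a hypothetical override added only so the commands can be previewed.

```shell
# BOSH_CMD is a hypothetical override: set BOSH_CMD="echo bosh" to print
# the commands instead of running them.
BOSH_CMD=${BOSH_CMD:-bosh}

# Scan the deployment for problems (such as unresponsive agents) and
# pick a resolution interactively.
scan_for_problems() {
  $BOSH_CMD -e azure -d webapp cloud-check
}

# Delete and recreate the VM behind one instance,
# e.g. webapp1/56d68506-4ed6-4e0c-a9bf-fa9624a29b09.
recreate_instance() {
  $BOSH_CMD -e azure -d webapp recreate "$1"
}
```

Running `scan_for_problems` first is probably the gentler option, since it reports what the director sees before anything is changed.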

gossion · 2018-10-23T01:13:27Z

Thanks @ashwin-venkatesh. I don't have any workload blocked at the moment; I just saw the error and thought it could be an issue.

One use case I can think of is a test environment. People would like to stop the VMs to save cost and start them again later when they need the environment, only to find that the VMs do not recover.

thisisnotashwin · 2018-10-24T14:38:15Z

Hey @gossion

That is a reasonable use case. Bosh does currently support this workflow using the bosh stop command. It also includes the --hard flag which deletes the VMs but holds onto the persistent disks.

Those would be the recommended way to achieve the desired result, and they should ensure things work seamlessly. Bosh maintains its own state of the world and does not work as intended when the state of a VM is changed via mechanisms external to it.

I hope this addresses the above concern. Do you have other questions or would it be alright if I closed this issue?

thisisnotashwin · 2018-10-24T14:38:51Z

Additionally, you can use bosh start to restart a stopped deployment.
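A minimal sketch of the stop and start workflow described above, for the cost-saving case. The environment alias `azure` and deployment name `webapp` come from the original report; `BOSH_CMD` is a hypothetical override added only for previewing, and the `instances` step assumes that `bosh instances` keeps listing instances whose VMs have been deleted.

```shell
# BOSH_CMD is a hypothetical override: set BOSH_CMD="echo bosh" to print
# the commands instead of running them.
BOSH_CMD=${BOSH_CMD:-bosh}

# Delete the VMs but keep the persistent disks.
park_deployment() {
  $BOSH_CMD -e azure -d webapp stop --hard
}

# List the instances the director still tracks (assumption: this includes
# instances whose VMs were deleted by the hard stop).
list_instances() {
  $BOSH_CMD -e azure -d webapp instances
}

# Bring the stopped deployment back.
resume_deployment() {
  $BOSH_CMD -e azure -d webapp start
}
```

Calling `park_deployment` at the end of a test session and `resume_deployment` before the next one keeps every state change inside the director.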

gossion · 2018-10-25T05:27:54Z

Thanks @ashwin-venkatesh.

The CPI has no API to stop (that is, deallocate on Azure) a VM, so to save cost I need to delete the VM (bosh stop --hard). That is inconvenient, because the instance ID is hard to find once the VM is deleted (it no longer appears in bosh vms). So I think it would be better if the VM could come back into service after an expected or unexpected reboot.

Anyway, I am not blocked. I will close this issue, but I think this would be a nice-to-have feature.

cf-gitbot added the unscheduled label Oct 19, 2018

gossion closed this as completed Oct 25, 2018

cf-gitbot added delivered accepted and removed unscheduled delivered labels Oct 25, 2018
