
Deployment locked by a vm. Error Unmarshalling monit status #1754

Closed
poblin-orange opened this issue Aug 9, 2017 · 7 comments


poblin-orange commented Aug 9, 2017

I consistently hit a problem where the BOSH director loses control of a deployment (see traces below).
A single VM under load in a deployment blocks every operation on it (instances / vms / deploy).

Tested with bosh 262.3.
I'd expect an inconsistent VM not to block director operations on the deployment.

poblin@prod:~$ bosh -d logsearch vms
Using environment '192.168.116.158' as client 'poblin'

Task 641331

09:44:02 | Error: Action Failed get_state: Getting processes status: Getting service status: Unmarshalling Monit status: read tcp 127.0.0.1:49154->127.0.0.1:2822: read: connection reset by peer

Started  Wed Aug  9 09:44:02 UTC 2017
Finished Wed Aug  9 09:44:02 UTC 2017
Duration 00:00:00

Task 641331 error

Listing deployment 'logsearch' vms infos:
  Expected task '641331' to succeed but state is 'error'

Exit code 1

Here is the related `bosh task --debug` output:

running","uptime":{"secs":133400},"mem":{"kb":14840,"percent":0.1},"cpu":{"total":0}}],"resurrection_paused":false,"az":null,"id":"1b7e4b2a-b1ee-4b5e-8782-10fd40864aef","bootstrap":true,"ignore":false}                                                                                                                                       
') WHERE ("id" = 641403)                                                                                                                                                
D, [2017-08-09 10:05:14 #12270] [] DEBUG -- DirectorJobRunner: Thread is no longer needed, cleaning up                                                                  
D, [2017-08-09 10:05:14 #12270] [] DEBUG -- DirectorJobRunner: (0.000751s) COMMIT                                                                                       
D, [2017-08-09 10:05:14 #12270] [] DEBUG -- DirectorJobRunner: Thread is no longer needed, cleaning up                                                                  
D, [2017-08-09 10:05:15 #12270] [] DEBUG -- DirectorJobRunner: RECEIVED: director.1fd9f6a4-ecda-400f-84f0-3545b8b7de95.ebcbecf1-39ac-4ee2-a8b0-79517c991493 {"exception":{"message":"Action Failed get_state: Getting processes status: Getting service status: Unmarshalling Monit status: read tcp 127.0.0.1:55068-\u003e127.0.0.1:2822: read: connection reset by peer"}}                                                                                                                                            
E, [2017-08-09 10:05:15 #12270] [] ERROR -- DirectorJobRunner: Worker thread raised exception: Action Failed get_state: Getting processes status: Getting service status: Unmarshalling Monit status: read tcp 127.0.0.1:55068->127.0.0.1:2822: read: connection reset by peer - /var/vcap/packages/director/gem_home/ruby/2.3.0/gems/bosh-director-0.0.0/lib/bosh/director/agent_client.rb:280:in `handle_method'                                                                                                      
/var/vcap/packages/director/gem_home/ruby/2.3.0/gems/bosh-director-0.0.0/lib/bosh/director/agent_client.rb:335:in `handle_message_with_retry'                           
/var/vcap/packages/director/gem_home/ruby/2.3.0/gems/bosh-director-0.0.0/lib/bosh/director/agent_client.rb:387:in `start_task'                                          
/var/vcap/packages/director/gem_home/ruby/2.3.0/gems/bosh-director-0.0.0/lib/bosh/director/agent_client.rb:353:in `send_message'                                        
/var/vcap/packages/director/gem_home/ruby/2.3.0/gems/bosh-director-0.0.0/lib/bosh/director/agent_client.rb:62:in `get_state'                                            
/var/vcap/packages/director/gem_home/ruby/2.3.0/gems/bosh-director-0.0.0/lib/bosh/director/jobs/vm_state.rb:88:in `vm_details'                                          
/var/vcap/packages/director/gem_home/ruby/2.3.0/gems/bosh-director-0.0.0/lib/bosh/director/jobs/vm_state.rb:37:in `process_instance'                                    
/var/vcap/packages/director/gem_home/ruby/2.3.0/gems/bosh-director-0.0.0/lib/bosh/director/jobs/vm_state.rb:24:in `block (3 levels) in perform'                         
/var

BOSH agent logs:

2017-08-09_09:54:57.71038 [http-client] 2017/08/09 09:54:57 DEBUG - status function called
2017-08-09_09:54:57.71038 [http-client] 2017/08/09 09:54:57 DEBUG - Monit request: url='http://127.0.0.1:2822/_status2?format=xml' body=''
2017-08-09_09:54:57.71038 [attemptRetryStrategy] 2017/08/09 09:54:57 DEBUG - Making attempt #0 for *http.requestRetryable
2017-08-09_09:54:57.71039 [clientRetryable] 2017/08/09 09:54:57 DEBUG - [requestID=2954d296-4103-4add-5742-14cb0399b8aa] Requesting (attempt=1): Request{ Method: 'GET', URL: 'http://127.0.0.1:2822/_status2?format=xml' }
2017-08-09_09:54:58.71078 [attemptRetryStrategy] 2017/08/09 09:54:58 DEBUG - Making attempt #1 for *http.requestRetryable
2017-08-09_09:54:58.71080 [clientRetryable] 2017/08/09 09:54:58 DEBUG - [requestID=2954d296-4103-4add-5742-14cb0399b8aa] Requesting (attempt=2): Request{ Method: 'GET', URL: 'http://127.0.0.1:2822/_status2?format=xml' }
2017-08-09_09:54:58.71129 [Action Dispatcher] 2017/08/09 09:54:58 ERROR - Action Failed get_state: Getting processes status: Getting service status: Unmarshalling Monit status: read tcp 127.0.0.1:52808->127.0.0.1:2822: read: connection reset by peer
2017-08-09_09:54:58.71131 [MBus Handler] 2017/08/09 09:54:58 INFO - Responding
2017-08-09_09:54:58.71131 [MBus Handler] 2017/08/09 09:54:58 DEBUG - Payload
2017-08-09_09:54:58.71131 ********************
2017-08-09_09:54:58.71134 {"exception":{"message":"Action Failed get_state: Getting processes status: Getting service status: Unmarshalling Monit status: read tcp 127.0.0.1:52808-\u003e127.0.0.1:2822: read: connection reset by peer"}}
2017-08-09_09:54:58.71134 ********************
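For illustration, this is roughly what the "Unmarshalling Monit status" step amounts to, as a minimal Ruby sketch (the actual agent is written in Go, and the XML below is a trimmed assumption of the shape monit returns from `_status2?format=xml`, not a verbatim response). The point is that the parse step is only reached if the HTTP read succeeds; an overloaded monit resets the connection mid-response, producing the "connection reset by peer" error in the log above.

```ruby
require 'rexml/document'

# Assumed shape of a monit _status2?format=xml response, trimmed to the
# fields used here (real responses carry many more elements per service).
SAMPLE_STATUS = <<~XML
  <monit>
    <service type="3">
      <name>logstash</name>
      <status>0</status>
      <monitor>1</monitor>
    </service>
  </monit>
XML

# Parse the XML into a list of {name:, running:} hashes. If monit is too
# overloaded to finish writing the response, the HTTP read fails with
# "connection reset by peer" before this parsing step is ever reached.
def parse_monit_status(xml)
  doc = REXML::Document.new(xml)
  doc.get_elements('//service').map do |svc|
    {
      name: svc.elements['name'].text,
      # status 0 plus monitor 1 is taken here to mean "running"; the exact
      # semantics of these fields are an assumption for this sketch.
      running: svc.elements['status'].text == '0' &&
               svc.elements['monitor'].text == '1',
    }
  end
end

parse_monit_status(SAMPLE_STATUS).each do |svc|
  puts "#{svc[:name]}: #{svc[:running] ? 'running' : 'not running'}"
end
```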


voelzmo commented Aug 14, 2017

I saw the same thing a while ago. The Director's error handling isn't great in these situations, so we should definitely fix this.


voelzmo commented Sep 26, 2017

In case others stumble upon this:

  • Finding the blocking VMs worked nicely with `bosh ssh -d <DEPLOYMENT> -c 'http://vcap:random-password@127.0.0.1:2822/_status2?format=xml' -r --tty > /tmp/ssh-out`
  • The root cause in our case was that the VM was overloaded in terms of CPU. Killing the CPU-heavy process made monit responsive again.

Still to do: make the Director able to cope with that error and show the unresponsive VM to the user.
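That probe can be sketched as a standalone check: an HTTP GET against monit's status endpoint with hard timeouts, so an overloaded monit can't hang the check itself. The port (2822) and the `vcap` user come from this thread; passing the password and treating any reset/refusal/timeout as "unresponsive" are assumptions for this sketch, not the Director's actual logic.

```ruby
require 'net/http'

# Returns true if monit answers its status endpoint within `timeout`
# seconds. Any connection reset, refusal, or timeout is treated as
# "unresponsive" -- the condition that blocked the deployment in this issue.
def monit_responsive?(host, port: 2822, user: 'vcap', password: nil, timeout: 5)
  http = Net::HTTP.new(host, port)
  http.open_timeout = timeout
  http.read_timeout = timeout
  request = Net::HTTP::Get.new('/_status2?format=xml')
  request.basic_auth(user, password) if password
  http.request(request).is_a?(Net::HTTPSuccess)
rescue Errno::ECONNRESET, Errno::ECONNREFUSED, Errno::EHOSTUNREACH,
       Net::OpenTimeout, Net::ReadTimeout, SocketError
  false
end

# Example: probing a port nothing listens on reports an unresponsive monit
# instead of raising, so the check can run across a whole deployment.
puts monit_responsive?('127.0.0.1', port: 1, timeout: 2)
```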


voelzmo commented Feb 6, 2018

Fixed with c9cc3ff


alext commented May 25, 2018

@voelzmo As far as I can tell, that fix is only included in the 262.x series. It doesn't appear to be included in 265.x. We're currently hitting this issue running v265.2.0.

I see that https://www.pivotaltracker.com/n/projects/1456570/stories/151434806 is marked as accepted. What needs to happen to get this fix included in 265?


voelzmo commented May 28, 2018

Good catch, @alext! I think we might have missed this when branching off the 265.x branch. Version 266 should contain the fix again:

```ruby
rescue Bosh::Director::RpcTimeout, Bosh::Director::RpcRemoteException => e
  if fix
    state = {'job_state' => 'unresponsive'}
  else
    raise e, "#{existing_instance.name}: #{e.message}"
  end
```

@cppforlife can you backport c9cc3ff also to 265.x? It's also not in 264.x and 263.x, not sure what your long maintenance plans are for those.


voelzmo commented May 29, 2018

Turns out the fix was not included in 262.x either: https://github.com/cloudfoundry/bosh/blob/262.x/src/bosh-director/lib/bosh/director/deployment_plan/assembler.rb#L90 It's just an incorrect GitHub tag suggesting this commit is in a 262 release.

266 will be the first version containing this fix.

bandesz pushed a commit to alphagov/paas-bootstrap that referenced this issue Jun 22, 2018
We have encountered an issue [1] of a process putting a VM under load that causes
monit to drop the connections. This is fixed in the 266.x series, therefore we
should upgrade to that.

We bump the stemcell to the latest versions because:
* we want to have the latest security updates
* the precompiled BOSH release is only created for the latest stemcell versions

[1] cloudfoundry/bosh#1754
henrytk added a commit to alphagov/paas-bootstrap that referenced this issue Jul 12, 2018
We have encountered an issue [1] of a process putting a VM under load that causes
monit to drop the connections. This is fixed in the 266.x series, therefore we
should upgrade to that.

Version 266.6.0 may have fixed another issue as well[2].

[1] cloudfoundry/bosh#1754
[2] cloudfoundry/bosh#1990

For reference, the Precompiled releases are available in the public bucket:
https://s3.amazonaws.com/bosh-compiled-release-tarballs

Version IDs can be extracted using something like:

```bash
aws s3api list-object-versions --bucket bosh-compiled-release-tarballs --prefix bosh-266.6
```
@jfmyers9

Closing this issue as it has been fixed in v266+.
