Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connection frozen for hours #542

Closed
jogaco opened this issue Dec 15, 2015 · 24 comments
Closed

connection frozen for hours #542

jogaco opened this issue Dec 15, 2015 · 24 comments

Comments

@jogaco
Copy link

jogaco commented Dec 15, 2015

In an upload to S3 storage, I got a frozen connection in a rake task. It stood like this for hours and had to kill it.

/home/my/project/shared/bundle/ruby/2.1.0/gems/excon-0.45.4/lib/excon/ssl_socket.rb:129:in `connect'
/home/my/project/shared/bundle/ruby/2.1.0/gems/excon-0.45.4/lib/excon/ssl_socket.rb:129:in `initialize'
/home/my/project/shared/bundle/ruby/2.1.0/gems/excon-0.45.4/lib/excon/connection.rb:387:in `new'
/home/my/project/shared/bundle/ruby/2.1.0/gems/excon-0.45.4/lib/excon/connection.rb:387:in `socket'
/home/my/project/shared/bundle/ruby/2.1.0/gems/excon-0.45.4/lib/excon/connection.rb:106:in `request_call'
/home/my/project/shared/bundle/ruby/2.1.0/gems/excon-0.45.4/lib/excon/middlewares/mock.rb:47:in `request_call'
/home/my/project/shared/bundle/ruby/2.1.0/gems/excon-0.45.4/lib/excon/middlewares/instrumentor.rb:22:in `request_call
'
/home/my/project/shared/bundle/ruby/2.1.0/gems/excon-0.45.4/lib/excon/middlewares/base.rb:15:in `request_call'
/home/my/project/shared/bundle/ruby/2.1.0/gems/excon-0.45.4/lib/excon/middlewares/base.rb:15:in `request_call'
/home/my/project/shared/bundle/ruby/2.1.0/gems/excon-0.45.4/lib/excon/middlewares/base.rb:15:in `request_call'
/home/my/project/shared/bundle/ruby/2.1.0/gems/excon-0.45.4/lib/excon/connection.rb:233:in `request'
/home/my/project/shared/bundle/ruby/2.1.0/gems/fog-core-1.32.1/lib/fog/core/connection.rb:81:in `request'
/home/my/project/shared/bundle/ruby/2.1.0/gems/fog-xml-0.1.2/lib/fog/xml/connection.rb:9:in `request'
/home/my/project/shared/bundle/ruby/2.1.0/gems/fog-aws-0.7.6/lib/fog/aws/storage.rb:592:in `_request'
/home/my/project/shared/bundle/ruby/2.1.0/gems/fog-aws-0.7.6/lib/fog/aws/storage.rb:587:in `request'
/home/my/project/shared/bundle/ruby/2.1.0/gems/fog-aws-0.7.6/lib/fog/aws/requests/storage/put_object.rb:43:in `put_ob
ject'

I have this in my config:

#http://sortedarray.com/handling-s3-incompletebody-error-when-using-asset_sync-gem/
Excon.defaults[:nonblock] = false
@geemus
Copy link
Contributor

geemus commented Dec 21, 2015

Any particular reason for turning nonblock off? I think with it on the timeout behavior might be a little more predictable/reliable?

@jogaco
Copy link
Author

jogaco commented Dec 26, 2015

I was getting Bad Request errors with IncompleteBody error code and followed advice on the mentioned blog post. Since then, I'm not getting these.

@geemus
Copy link
Contributor

geemus commented Jan 4, 2016

Thanks for the update.

@geemus geemus closed this as completed Jan 4, 2016
@jogaco
Copy link
Author

jogaco commented Mar 14, 2016

I think I was not clear here.
Without Excon.defaults[:nonblock] = false I was getting some Bad Request errors with IncompleteBody error code.

After adding Excon.defaults[:nonblock] = false, following advise in http://sortedarray.com/handling-s3-incompletebody-error-when-using-asset_sync-gem/, I stopped having Bad Request errors.

Now, sometimes I get a frozen connection for hours. Quite infrequently but sometimes. Once a week or less for a task that runs every day.

Shouldn't there be a timeout?

@geemus geemus reopened this Mar 14, 2016
@geemus
Copy link
Contributor

geemus commented Mar 14, 2016

Yes, I would expect a timeout. The one possible exception would be a weird edge case. Basically the timeouts are largely setup around individual reads/writes/etc. So if you kept getting back, say one byte per read, with a very large request, it might not timeout. We have this (vs a more globally tracked timeout limit that the whole request would be tracked against) because it is much simpler to implement/understand and almost always has worked in a manner close enough to what we wanted to not worry about it. So it could be possible that this is being problematic.

That said, the stacktrace makes it look a bit like this is while connecting rather than reading/writing, which is not a case I have seen before. I'm also not confident I agree with the analysis in that blog post. It should still use chunked encoding with nonblock disabled. In fact disabling nonblock seems more likely to cause the hangs you mention, vs leaving it enabled. That said, it sounds like this solved a problem (though I'm at a bit of a loss as to how/why that would solve the issue). You may want to try updating to a newer excon, I'm by no means certain it will fix the problem, but it shouldn't hurt.

Sorry for this difficulty, I'm not certain of the root cause here, but hopefully we can figure it out in time. Always more difficult to pin down these more infrequent errors.

@jogaco
Copy link
Author

jogaco commented Mar 14, 2016

The files in question are images, and not very large: 700 x 700, which need to be stored in s3. I would expect after a few minutes I would get some sort of Timeout Exception then decide if I retry or not.

I have commented out Excon.defaults[:nonblock] = false because I am getting too many frozen connections lately. Will monitor if I get many Bad Request errors.

@geemus
Copy link
Contributor

geemus commented Mar 14, 2016

Hmm. Are the images coming from memory or are they on disk before uploading? Just wondering if there are other places where the contention may be coming from.

@jogaco
Copy link
Author

jogaco commented Mar 14, 2016

From local hard disk.
Images are first fetched from remote url, then resized to maximum 700x700 and thumb, and then uploaded to s3

@geemus
Copy link
Contributor

geemus commented Mar 14, 2016

Is it in parallel? I just wonder if there might not be issues with local disk access somewhere in here instead of over-the-wire, but could be grasping at straws.

@jogaco
Copy link
Author

jogaco commented Mar 14, 2016

It's just sequential and in an underused machine with lots of cpu, memory and disk.

@geemus
Copy link
Contributor

geemus commented Mar 14, 2016

K. Thanks for the update

@jogaco
Copy link
Author

jogaco commented Mar 15, 2016

Finding more problems. With default configuration, commented out the line Excon.defaults[:nonblock] = false, I had another frozen process for a couple hours until I realized and SIGTERMed it.

rake aborted!
SignalException: SIGTERM
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/ssl_socket.rb:122:in `select'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/ssl_socket.rb:122:in `rescue in initialize'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/ssl_socket.rb:119:in `initialize'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/connection.rb:387:in `new'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/connection.rb:387:in `socket'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/connection.rb:106:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/middlewares/mock.rb:47:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/middlewares/instrumentor.rb:22:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/middlewares/base.rb:15:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/middlewares/base.rb:15:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/middlewares/base.rb:15:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/connection.rb:233:in `request'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/fog-core-1.32.1/lib/fog/core/connection.rb:81:in `request'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/fog-xml-0.1.2/lib/fog/xml/connection.rb:9:in `request'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/fog-aws-0.7.6/lib/fog/aws/storage.rb:592:in `_request'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/fog-aws-0.7.6/lib/fog/aws/storage.rb:587:in `request'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/fog-aws-0.7.6/lib/fog/aws/requests/storage/put_object.rb:43:in `put_object'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-s3_data_store-1.2/lib/dragonfly/s3_data_store.rb:53:in `block (2 levels) in write'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-1.0.11/lib/dragonfly/temp_object.rb:92:in `file'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-s3_data_store-1.2/lib/dragonfly/s3_data_store.rb:52:in `block in write'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-s3_data_store-1.2/lib/dragonfly/s3_data_store.rb:170:in `rescuing_socket_errors'

This is a rake task that fetches remote information and images and stores it, and each item should not take more than 2 seconds. If there's a problem at any endpoint, I'd like to have a Timeout Exception in a reasonable time: 20 seconds maybe. This task must store between 10K-20K items per day.

To be precise about my latest comments, sometimes there will be two or three jobs in parallel uploading images. All of them are sequential though.

@jogaco
Copy link
Author

jogaco commented Mar 15, 2016

Just got another couple freezes. This time, this was the only task connecting to s3. Default configuration. I realized of first one after two hours. Killed, then re-launch and another one.
The rest of 6000 uploads where fine. Both freezes have same stack trace.

SignalException: SIGTERM
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/ssl_socket.rb:122:in `select'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/ssl_socket.rb:122:in `rescue in initialize'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/ssl_socket.rb:119:in `initialize'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/connection.rb:387:in `new'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/connection.rb:387:in `socket'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/connection.rb:106:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/middlewares/mock.rb:47:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/middlewares/instrumentor.rb:22:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/middlewares/base.rb:15:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/middlewares/base.rb:15:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/middlewares/base.rb:15:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/connection.rb:233:in `request'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/fog-core-1.32.1/lib/fog/core/connection.rb:81:in `request'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/fog-xml-0.1.2/lib/fog/xml/connection.rb:9:in `request'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/fog-aws-0.7.6/lib/fog/aws/storage.rb:592:in `_request'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/fog-aws-0.7.6/lib/fog/aws/storage.rb:587:in `request'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/fog-aws-0.7.6/lib/fog/aws/requests/storage/put_object.rb:43:in `put_object'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-s3_data_store-1.2/lib/dragonfly/s3_data_store.rb:53:in `block (2 levels) in write'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-1.0.11/lib/dragonfly/temp_object.rb:92:in `file'

@geemus
Copy link
Contributor

geemus commented Mar 15, 2016

@jogaco hey, thanks for working through this with me and providing extra details. Sorry it is taking so long for me to sort it out. That said, I think I made progress this morning. It looks like there were a couple selects in the nonblock case for ssl specifically that were hidden deep in things that I had overlooked previously. I just pushed some changes to master to cleanup a bit of code and make the timeout behavior around these more consistent, which should hopefully fix your issue. Additionally, I think this would explain why disabling non-block would prevent the issue. Any chance you could try the current master and see if that fixes the issue? All tests pass, but would be nice to know if it helps you too before I do a release with no real effect. Thanks!

@jogaco
Copy link
Author

jogaco commented Mar 15, 2016

@geemus I really appreciate this component is working for most of the time: 99.99%. Just providing more details so this can be ironed out for me and others. On the contrary, your response is extremely timely!
Not to disappoint you, but my last freezes were produced with non-block disabled. So perhaps there is more to be checked? My last two stack traces are for this case: non-block disabled.
I will definitely try your version, but I can have 2 weeks working perfect then find a freeze.
Will report back any finds. Feel free to ask me for further testing of any additional changes, either with or without non-blocking.

@jogaco
Copy link
Author

jogaco commented Mar 15, 2016

Another freeze with non-block disabled.
Here stack-trace, this time deeper. Shows some SSLErrorWaitReadable which is perhaps relevant.

SignalException: SIGTERM
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/ssl_socket.rb:122:in `select'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/ssl_socket.rb:122:in `rescue in initialize'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/ssl_socket.rb:119:in `initialize'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/connection.rb:387:in `new'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/connection.rb:387:in `socket'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/connection.rb:106:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/middlewares/mock.rb:47:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/middlewares/instrumentor.rb:22:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/middlewares/base.rb:15:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/middlewares/base.rb:15:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/middlewares/base.rb:15:in `request_call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/connection.rb:233:in `request'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/fog-core-1.32.1/lib/fog/core/connection.rb:81:in `request'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/fog-xml-0.1.2/lib/fog/xml/connection.rb:9:in `request'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/fog-aws-0.7.6/lib/fog/aws/storage.rb:592:in `_request'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/fog-aws-0.7.6/lib/fog/aws/storage.rb:587:in `request'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/fog-aws-0.7.6/lib/fog/aws/requests/storage/put_object.rb:43:in `put_object'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-s3_data_store-1.2/lib/dragonfly/s3_data_store.rb:53:in `block (2 levels) in write'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-1.0.11/lib/dragonfly/temp_object.rb:92:in `file'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-s3_data_store-1.2/lib/dragonfly/s3_data_store.rb:52:in `block in write'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-s3_data_store-1.2/lib/dragonfly/s3_data_store.rb:170:in `rescuing_socket_errors'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-s3_data_store-1.2/lib/dragonfly/s3_data_store.rb:51:in `write'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-1.0.11/lib/dragonfly/content.rb:180:in `store'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-1.0.11/lib/dragonfly/model/attachment.rb:179:in `store_job!'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-1.0.11/lib/dragonfly/model/attachment.rb:77:in `save!'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-1.0.11/lib/dragonfly/model/instance_methods.rb:16:in `block in save_dragonfly_attachments'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-1.0.11/lib/dragonfly/model/instance_methods.rb:15:in `each'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/dragonfly-1.0.11/lib/dragonfly/model/instance_methods.rb:15:in `save_dragonfly_attachments'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/activesupport-4.2.4/lib/active_support/callbacks.rb:432:in `block in make_lambda'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/activesupport-4.2.4/lib/active_support/callbacks.rb:164:in `call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/activesupport-4.2.4/lib/active_support/callbacks.rb:164:in `block in halting'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/activesupport-4.2.4/lib/active_support/callbacks.rb:504:in `call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/activesupport-4.2.4/lib/active_support/callbacks.rb:504:in `block in call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/activesupport-4.2.4/lib/active_support/callbacks.rb:504:in `each'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/activesupport-4.2.4/lib/active_support/callbacks.rb:504:in `call'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/activesupport-4.2.4/lib/active_support/callbacks.rb:92:in `__run_callbacks__'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/activesupport-4.2.4/lib/active_support/callbacks.rb:778:in `_run_save_callbacks'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/activesupport-4.2.4/lib/active_support/callbacks.rb:81:in `run_callbacks'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/elasticsearch-persistence-0.1.8/lib/elasticsearch/persistence/model/store.rb:51:in `save'
my-code here
/home/my/.rvm/gems/ruby-2.1.7@project/bin/ruby_executable_hooks:15:in `eval'
/home/my/.rvm/gems/ruby-2.1.7@project/bin/ruby_executable_hooks:15:in `<main>'
OpenSSL::SSL::SSLErrorWaitReadable: read would block
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/ssl_socket.rb:120:in `connect_nonblock'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/ssl_socket.rb:120:in `initialize'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/connection.rb:387:in `new'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/connection.rb:387:in `socket'
/home/my/.rvm/gems/ruby-2.1.7@project/gems/excon-0.45.4/lib/excon/connection.rb:106:in `request_call'

@geemus
Copy link
Contributor

geemus commented Mar 15, 2016

Curious. Are you certain non-block is disabled though? I think that select should be within a if @nonblock guard, which I would think should never be reached if it is disabled, see: https://github.com/excon/excon/blob/v0.45.4/lib/excon/ssl_socket.rb#L118

@jogaco
Copy link
Author

jogaco commented Mar 15, 2016

Non-block is not disabled. It is the default mode, right? As opposed to my first stack traces.

This is my current initializer relevant line, commented out:

#http://sortedarray.com/handling-s3-incompletebody-error-when-using-asset_sync-gem/
# Excon.defaults[:nonblock] = false

@jogaco
Copy link
Author

jogaco commented Mar 15, 2016

Seems to be working fine. I'll let you know of any issues.

@geemus
Copy link
Contributor

geemus commented Mar 15, 2016

Cool, appreciated. Hopefully this will solve it, but since it is intermittent it may be hard to know for sure.

@jogaco
Copy link
Author

jogaco commented Mar 15, 2016

Yeah. I know. Watching closely for the next days...
Thanks a bunch!

@jogaco
Copy link
Author

jogaco commented May 3, 2016

A short note to report I haven't had any issues since the fix. Thanks!

@geemus
Copy link
Contributor

geemus commented May 3, 2016

Nice, glad to hear it!

@geemus
Copy link
Contributor

geemus commented Dec 12, 2016

Closing due to inactivity (looks like it was likely fixed).

@geemus geemus closed this as completed Dec 12, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants