Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolv.rb issues - threads hanging #747

Closed
zerobearing2 opened this issue Apr 26, 2021 · 15 comments
Closed

Resolv.rb issues - threads hanging #747

zerobearing2 opened this issue Apr 26, 2021 · 15 comments
Labels

Comments

@zerobearing2
Copy link

zerobearing2 commented Apr 26, 2021

We're noticing issues with our highly-concurrent sidekiq jobs after the bump to Excon v0.80.1 this weekend. Threads are hanging in the resolv.rb code. We're still investigating, but wanted to post notice here incase others experience similar issue. Below is an example of the stack trace associated with one of our "hung" threads... Any thoughts or suggestions on how/why this might be happening would be appreciated.

Current workaround, downgrade back to v0.79.0.

ruby 2.6.7p197 (2021-04-05 revision 67941) [x86_64-linux]
Sidekiq 6.1.3

Thanks!

Apr 26 07:20:33  | pid=213 tid=go5pkq979 WARN: Thread TID-go5tnitml processor
Apr 26 07:20:33  | pid=213 tid=go5pkq979 WARN: /usr/local/lib/ruby/2.6.0/resolv.rb:617:in `synchronize'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:617:in `allocate_request_id'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:837:in `sender'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:522:in `block in fetch_resource'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:1120:in `block (3 levels) in resolv'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:1118:in `each'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:1118:in `block (2 levels) in resolv'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:1117:in `each'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:1117:in `block in resolv'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:1115:in `each'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:1115:in `resolv'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:517:in `fetch_resource'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:507:in `each_resource'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:404:in `each_address'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:116:in `block in each_address'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:115:in `each'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:115:in `each_address'
Apr 26 07:20:33  | /usr/local/lib/ruby/2.6.0/resolv.rb:58:in `each_address'
Apr 26 07:20:33  | /usr/local/bundle/gems/excon-0.80.1/lib/excon/socket.rb:110:in `connect'
Apr 26 07:20:33  | /usr/local/bundle/gems/excon-0.80.1/lib/excon/ssl_socket.rb:166:in `connect'
Apr 26 07:20:33  | /usr/local/bundle/gems/excon-0.80.1/lib/excon/socket.rb:49:in `initialize'
Apr 26 07:20:33  | /usr/local/bundle/gems/excon-0.80.1/lib/excon/ssl_socket.rb:10:in `initialize'
Apr 26 07:20:33  | /usr/local/bundle/gems/excon-0.80.1/lib/excon/connection.rb:471:in `new'
Apr 26 07:20:33  | /usr/local/bundle/gems/excon-0.80.1/lib/excon/connection.rb:471:in `socket'
Apr 26 07:20:33  | /usr/local/bundle/gems/excon-0.80.1/lib/excon/connection.rb:118:in `request_call'
Apr 26 07:20:33  | /usr/local/bundle/gems/excon-0.80.1/lib/excon/middlewares/decompress.rb:12:in `request_call'
Apr 26 07:20:33  | /usr/local/bundle/gems/excon-0.80.1/lib/excon/middlewares/mock.rb:57:in `request_call'
Apr 26 07:20:33  | /usr/local/bundle/gems/excon-0.80.1/lib/excon/middlewares/instrumentor.rb:34:in `request_call'
Apr 26 07:20:33  | /usr/local/bundle/gems/excon-0.80.1/lib/excon/middlewares/idempotent.rb:19:in `request_call'
Apr 26 07:20:33  | /usr/local/bundle/gems/excon-0.80.1/lib/excon/middlewares/base.rb:22:in `request_call'
Apr 26 07:20:33  | /usr/local/bundle/gems/ddtrace-0.42.0/lib/ddtrace/contrib/excon/middleware.rb:41:in `request_call'
Apr 26 07:20:33  | /usr/local/bundle/gems/excon-0.80.1/lib/excon/middlewares/base.rb:22:in `request_call'
Apr 26 07:20:33  | /usr/local/bundle/gems/excon-0.80.1/lib/excon/connection.rb:283:in `request'
Apr 26 07:20:33  | /usr/local/bundle/gems/excon-0.80.1/lib/excon/connection.rb:369:in `get'
...
@geemus
Copy link
Contributor

geemus commented Apr 26, 2021

Thanks for the detailed report, sorry to hear this impacted you (I had hoped it would be zero impact).

I wonder if there could be fixes in newer ruby and resolv versions, since 2.6.0 is a bit on the older side. I suspect updating to a newer Ruby might be difficult, but newer versions of Resolv are also released as a gem. So one, hopefully easier, thing to try would be to install a newer, updated Resolv gem to replace the stdlib and see if you still have the same behavior. Would one or both of these be plausible with your setup?

Beyond that I'm not sure. I'll keep an eye out for additional reports or ideas, but it's a bit hard to dig deeper on this kind of thing without a clear reproducible case. And with concurrent issues like this, getting a repro-case can be really challenging/problematic.

Otherwise, happy to keep chatting and working on this to see if we can figure out what is wrong or if we need to revert or take a different track.

@stevenharman
Copy link

We've seen similar hung threads, resulting in timeouts, in the same code:

File /app/vendor/ruby-2.7.3/lib/ruby/2.7.0/resolv.rb line 622 in synchronize

We are running Ruby 2.7.3 with whichever version of resolv ships with that Ruby.

@geemus
Copy link
Contributor

geemus commented Apr 30, 2021

Ok, digging in to this, it looks like the timeout is coming from a mutex around access/updates to local hostfile.

By default, it looks like Resolve uses this as the default resolvers: [Resolv::Hosts.new, Resolv::DNS.new]

Maybe we can reverse the ordering to alleviate this? Are any of you in a position to add the following to setup and see if it alleviates the timeouts?

Resolv::DefaultResolver.replace_resolvers([Resolv::DNS.new, Resolv::Hosts.new])

I think this and/or updating the version of resolv (using gem) are good next steps.

@geemus geemus added the pinned label Apr 30, 2021
@geemus
Copy link
Contributor

geemus commented Apr 30, 2021

Note, you could also omit the Hosts portion of that (so that it would always use nameserver based resolution). The nameserver based one still has a mutex, but it looks like the window for it is much smaller, so it seems unlikely you would run into the same issues.

It's plausible that we should consider setting this for excon usage in general, but it would be good to know if it solves your problem or not first.

@stevenharman
Copy link

stevenharman commented May 1, 2021 via email

@stevenharman
Copy link

Just saw this: https://bugs.ruby-lang.org/issues/17781 So confirmed, it is a bug in the Resolv library.

@geemus
Copy link
Contributor

geemus commented May 3, 2021

Thanks for letting me know. Unless anyone on here needs the particular other changes that have happened in between, I'll hope the related fixes can be merged (rather that temporarily reverting this).

@stevenharman
Copy link

The issue has been fixed upstream, but we're still waiting for a new version of the Resolv Gem to be cut. Once a version > 0.20.0 is out we can explicitly add that to a Gemfile to work around this. And eventually the upstream fix will make its way into Ruby stdlib patches as well.

Good luck, everyone!

@geemus
Copy link
Contributor

geemus commented May 13, 2021

@stevenharman Thanks for closing the loop, I had also been monitoring the upstream issue but it unfortunately slipped my mind to note the change here for everyone. I'm looking forward to (hopefully) having this fix this issue for us so that we can keep rolling forward and take advantage of the other benefits we had originally hoped to gain.

@dentarg
Copy link
Contributor

dentarg commented Jun 10, 2021

Since June 1 there’s a new version of the resolv gem: https://rubygems.org/gems/resolv/versions/0.2.1

@geemus
Copy link
Contributor

geemus commented Jun 11, 2021

@dentarg Thanks for the update. Hopefully that will get things working for those who had issues previously.

@nathansamson
Copy link

I am not using the resolv gem, just the built in ruby version.

I think when loading the resolv gem in my gemfile my isuse is fixed, but I expect the issue also to be fixed in pure ruby without this gem at some point right?

@geemus
Copy link
Contributor

geemus commented Jun 23, 2021

@nathansamson Good question. I believe that any updates such as this will eventually be released in supported versions of Ruby, it just will take longer as those releases are less frequent than gem releases.

@stevenharman
Copy link

This has now been back ported to Ruby 3.0.2, 2.7.4, and 2.6.8. Yay! ruby/ruby@v2_7_3...v2_7_4#diff-73ed31bea018f3a994e757faebca1dbf4edb3027ca4a1ecdd623abc955b8e2f6

@geemus
Copy link
Contributor

geemus commented Jul 7, 2021

@stevenharman awesome, thanks for the update!

I'm going to go ahead and close this now as, to the best of my knowledge, these upstream fixes should remove the problem.

@geemus geemus closed this as completed Jul 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants