getaddrinfo #482

Closed
Cervenka opened this issue Feb 24, 2020 · 5 comments

@Cervenka

Hello,

When deploying to roughly 100 servers, we occasionally run into the following issue. It happens rarely and seemingly at random, each time for one of the servers being deployed to. People on both macOS and Linux have been affected.

SSHKit::Runner::ExecuteError: Exception while executing as user@www46.oursite.com: getaddrinfo: nodename nor servname provided, or not known

Are there any known limitations, perhaps?

Thanks!

@leehambley
Member

We recently accepted some PRs to deal with large numbers of servers, so your ~100 count isn't exceptional in that regard.

Might I suggest you add a simple ping task and run it in a loop to get a harmless reproduction case? With that in place you can work through some debugging options, such as clearing your DNS cache beforehand or hard-coding the IPs in your /etc/hosts file.

Your RUBY_VERSION can be significant here too. Older Rubies are, as a rule, less good at networking, but all Rubies have been very good for at least 3-4 years, if not since the 2.0 release.

@Cervenka
Author

Cervenka commented Feb 25, 2020

TLDR: I think I will try to reproduce this issue in Ruby (without sshkit) next.

Thank you for your input so far!

Having the IPs hard-coded in /etc/hosts does help. That has been my workaround for a while.
When I previously tried to replicate the DNS resolution issue with other tools, I was not able to.
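
For anyone who wants to try the same workaround, the /etc/hosts entries can be generated instead of typed by hand. This is only a minimal sketch: `hosts_file_lines` and the injectable `resolver` are hypothetical names made up for illustration (they are not part of SSHKit or Capistrano), and it assumes one IPv4 address per host is enough.

```ruby
require 'socket'

# Default resolver: IPv4 addresses for a hostname via the system resolver.
DEFAULT_RESOLVER = lambda do |host|
  Addrinfo.getaddrinfo(host, nil, :INET, :STREAM).map(&:ip_address).uniq
end

# Build "IP<TAB>hostname" lines suitable for appending to /etc/hosts.
# Only the first address per host is used. Hosts that fail to resolve
# are skipped and reported on stderr.
def hosts_file_lines(hostnames, resolver: DEFAULT_RESOLVER)
  hostnames.flat_map do |host|
    begin
      ip = resolver.call(host).first
      ip ? ["#{ip}\t#{host}"] : []
    rescue SocketError => e
      warn "#{host}: #{e.message}"
      []
    end
  end
end

# Example (performs real DNS lookups, so it is left commented out):
# puts hosts_file_lines((1..100).map { |i| "www#{i}.oursite.com" })
```

The resolver is passed in as a lambda so the line-building logic can be checked without touching DNS.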

I have also had this bash script running since yesterday without any failures.

while true
do
  date
  seq 1 100 | parallel --tag ping -c 1 www{}.oursite.com | grep 'Unknown'
  sleep 15
done

I did run into the same issue again just now when trying to deploy. The script above was running at the time, so the resolved IPs should still have been cached by the OS. This time two lookups failed at once.

#<Thread:0x00007fa45e2d4f60@/Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/runners/parallel.rb:10 run> terminated with exception (report_on_exception is true):
Traceback (most recent call last):
	17: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/runners/parallel.rb:12:in `block (2 levels) in execute'
	16: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:31:in `run'
	15: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:31:in `instance_exec'
	14: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/capistrano-3.11.0/lib/capistrano/scm/tasks/git.rake:8:in `block (3 levels) in eval_rakefile'
	13: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:80:in `execute'
	12: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:148:in `create_command_and_execute'
	11: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:148:in `tap'
	10: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:148:in `block in create_command_and_execute'
	 9: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/netssh.rb:130:in `execute_command'
	 8: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/netssh.rb:177:in `with_ssh'
	 7: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/connection_pool.rb:63:in `with'
	 6: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/connection_pool.rb:63:in `call'
	 5: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/net-ssh-5.2.0/lib/net/ssh.rb:246:in `start'
	 4: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/net-ssh-5.2.0/lib/net/ssh.rb:246:in `new'
	 3: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/net-ssh-5.2.0/lib/net/ssh/transport/session.rb:73:in `initialize'
	 2: from /Users/flo/.rvm/rubies/ruby-2.6.3/lib/ruby/2.6.0/socket.rb:631:in `tcp'
	 1: from /Users/flo/.rvm/rubies/ruby-2.6.3/lib/ruby/2.6.0/socket.rb:227:in `foreach'
/Users/flo/.rvm/rubies/ruby-2.6.3/lib/ruby/2.6.0/socket.rb:227:in `getaddrinfo': getaddrinfo: nodename nor servname provided, or not known (SocketError)
	1: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/runners/parallel.rb:11:in `block (2 levels) in execute'
/Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/runners/parallel.rb:15:in `rescue in block (2 levels) in execute': Exception while executing as user@www6.oursite.com: getaddrinfo: nodename nor servname provided, or not known (SSHKit::Runner::ExecuteError)
#<Thread:0x00007fa45e426fd0@/Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/runners/parallel.rb:10 run> terminated with exception (report_on_exception is true):
Traceback (most recent call last):
	17: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/runners/parallel.rb:12:in `block (2 levels) in execute'
	16: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:31:in `run'
	15: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:31:in `instance_exec'
	14: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/capistrano-3.11.0/lib/capistrano/scm/tasks/git.rake:8:in `block (3 levels) in eval_rakefile'
	13: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:80:in `execute'
	12: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:148:in `create_command_and_execute'
	11: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:148:in `tap'
	10: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:148:in `block in create_command_and_execute'
	 9: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/netssh.rb:130:in `execute_command'
	 8: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/netssh.rb:177:in `with_ssh'
	 7: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/connection_pool.rb:63:in `with'
	 6: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/connection_pool.rb:63:in `call'
	 5: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/net-ssh-5.2.0/lib/net/ssh.rb:246:in `start'
	 4: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/net-ssh-5.2.0/lib/net/ssh.rb:246:in `new'
	 3: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/net-ssh-5.2.0/lib/net/ssh/transport/session.rb:73:in `initialize'
	 2: from /Users/flo/.rvm/rubies/ruby-2.6.3/lib/ruby/2.6.0/socket.rb:631:in `tcp'
	 1: from /Users/flo/.rvm/rubies/ruby-2.6.3/lib/ruby/2.6.0/socket.rb:227:in `foreach'
/Users/flo/.rvm/rubies/ruby-2.6.3/lib/ruby/2.6.0/socket.rb:227:in `getaddrinfo': getaddrinfo: nodename nor servname provided, or not known (SocketError)
	1: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/runners/parallel.rb:11:in `block (2 levels) in execute'
/Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/runners/parallel.rb:15:in `rescue in block (2 levels) in execute': Exception while executing as user@www30.oursite.com: getaddrinfo: nodename nor servname provided, or not known (SSHKit::Runner::ExecuteError)

I think I will try to reproduce this issue in Ruby (without sshkit) next.

@Cervenka
Author

Cervenka commented Mar 3, 2020

I have tried to reproduce this in other ways (including the script below), but the only time I can trigger the issue is when deploying with Capistrano (which uses sshkit).

I have also tried switching DNS servers. Hard-coding the hosts in /etc/hosts seems to fix the issue for me.

# frozen_string_literal: true

require 'socket'

# Every 20 seconds, resolve www1..www100 in parallel threads and
# print any lookup that raises.
loop do
  puts Time.now

  threads = (1..100).map do |i|
    Thread.new do
      addr = "www#{i}.oursite.com"
      begin
        Socket.getaddrinfo(addr, 'https', nil, Socket::SOCK_STREAM)
      rescue StandardError => e
        # SocketError (raised on resolution failure) is a StandardError.
        puts "#{addr} #{Time.now}", e, ''
      end
    end
  end

  threads.each(&:join)

  sleep 20
end
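
The traceback above fails inside `Socket.tcp`, which resolves through `Addrinfo.foreach` rather than `Socket.getaddrinfo`, so a reproduction that follows that path may be worth trying. This is a rough sketch under that assumption, not a confirmed reproduction: `failed_lookups` and `TCP_STYLE_RESOLVER` are hypothetical names, and port 22 is assumed because the connections are SSH.

```ruby
require 'socket'

# Resolve the way Socket.tcp does in the traceback: through
# Addrinfo.foreach with a port and a stream socket type.
TCP_STYLE_RESOLVER = lambda do |host, port|
  addresses = []
  Addrinfo.foreach(host, port, nil, :STREAM) { |ai| addresses << ai.ip_address }
  addresses
end

# Resolve every host in its own thread and return a hash of
# hostname => error message for the lookups that failed.
def failed_lookups(hostnames, port: 22, resolver: TCP_STYLE_RESOLVER)
  failures = {}
  lock = Mutex.new
  threads = hostnames.map do |host|
    Thread.new do
      begin
        resolver.call(host, port)
      rescue SocketError => e
        # Guard the shared hash, since many threads may fail at once.
        lock.synchronize { failures[host] = e.message }
      end
    end
  end
  threads.each(&:join)
  failures
end

# Example (performs real DNS lookups, so it is left commented out):
# p failed_lookups((1..100).map { |i| "www#{i}.oursite.com" })
```

Collecting the failures into a hash, instead of printing from each thread, makes it easier to see whether several lookups fail in the same round.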

@creadone

creadone commented Oct 7, 2022

I have the same problem, but with a single server. I tried a few tests:

Success

require 'socket'
Socket.getaddrinfo('subdomain.domain.com', 80, nil, Socket::SOCK_STREAM)

Success

require 'net/ssh'

Net::SSH.start('subdomain.domain.com', 'sshuser') do |ssh|
  ssh.exec 'touch ~/test.txt'
end

Fails with the same stack trace as Cervenka's

require 'sshkit'
require 'sshkit/dsl'
include SSHKit::DSL

SSHKit::Backend::Netssh.configure do |ssh|
  ssh.connection_timeout = 5
  ssh.ssh_options = {
    user:         'sshuser',
    keys:         %w[~/.ssh/id_rsa],
    auth_methods: %w[ publickey ]
  }
end

nodes = %w[ 'subdomain.domain.com' ]

on nodes do |node|
  output = capture :ls, '-l'
  puts output
end

Also:

  1. Flushed and checked DNS in a loop: everything is OK, nothing suspicious.
  2. Tried with the IP instead of the hostname: fails with the same exception.
  3. Hard-coding the host in /etc/hosts does not fix it.

Do you have any ideas where to dig deeper?

ruby 3.1.1p18 (2022-02-18 revision 53f5fc4236) [x86_64-darwin21]
sshkit (1.21.3)

@creadone

creadone commented Oct 8, 2022

Solved. I need more sleep.

nodes = %w[ 'subdomain.domain.com' ] => nodes = %w[ subdomain.domain.com ]
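
The difference between the two literals is easy to check in plain Ruby: `%w[]` splits on whitespace only, so quote characters inside it end up in the string itself. The guard at the end is a hypothetical helper added purely for illustration; it is not part of SSHKit.

```ruby
# %w splits on whitespace only; quote characters are kept as literal text.
quoted = %w[ 'subdomain.domain.com' ]
plain  = %w[ subdomain.domain.com ]

p quoted # prints ["'subdomain.domain.com'"]
p plain  # prints ["subdomain.domain.com"]

# Hypothetical guard: list hostnames that contain quote characters or
# whitespace, which a resolver will never accept as a valid name.
def suspicious_hostnames(hostnames)
  hostnames.select { |h| h.match?(/['"\s]/) }
end

p suspicious_hostnames(quoted + plain) # prints ["'subdomain.domain.com'"]
```

This also fits the earlier observation that using the IP failed with the same exception: a quoted IP is passed to the resolver as a name that cannot resolve.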
