
deploy gets stuck and doesn't print anything #2066

Open
anyelopetit opened this issue Sep 8, 2020 · 29 comments

@anyelopetit

anyelopetit commented Sep 8, 2020

Steps to reproduce

cap production deploy
and that's it, just gets stuck forever until I abort it

Expected behavior

  • Capistrano should print a timeout error (or some other message) when it can't connect to the server

Actual behavior

When I run cap production deploy, it shows nothing and just keeps waiting.
But when I run cap production deploy --trace, it shows this:

** Invoke production (first_time)
** Execute production
** Invoke load:defaults (first_time)
** Execute load:defaults
** Invoke rbenv:validate (first_time)
** Execute rbenv:validate

System configuration

https://gist.github.com/anyelopetit/7320fc8141554ea4b75ec88bb02f8132

@leehambley
Member

I would suggest you reopen this on capistrano-rbenv; that's the last plugin doing something when this locks up. Perhaps there's an exhaustive debug mode you can run.

I'll leave this issue open for now, but I don't think we can assist really.

@mattbrictson
Member

Yes, @anyelopetit, please follow up with us when you track down what was causing the freeze. There may be something Capistrano can do better in terms of error messaging.

My assumption is that rbenv:validate is the very first task that actually requires an SSH connection to the remote host, and that SSH connection is failing to open. You could try changing the connection timeout to see if that changes anything.

set :connection_timeout, 5

We probably can't help you fix the reason for the timeout (it is likely due to a firewall between you and the remote host, or an incorrect port number). But we may be able to improve Capistrano to provide better error messaging, or maybe a more reasonable default timeout, so Capistrano doesn't appear to hang indefinitely.
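
If it does turn out to be a port issue, here is a minimal sketch of how the port could be overridden in config/deploy.rb (the value below is just a placeholder, not your actual port):

set :ssh_options, {
  port: 22   # replace with whatever port sshd actually listens on
}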

@anyelopetit
Author

Hey guys! @leehambley @mattbrictson Thank you for answering so quickly!
I just tried changing the connection timeout with:

set :connection_timeout, 5

but it doesn't change anything.

Do you have any idea what is causing this? How can I find out why it gets stuck?
This is preventing me from deploying.

@mattbrictson
Member

Sorry, I don't know what would cause this. My first test would be to try using ssh directly (i.e. take Capistrano out of the picture) and verify that you can connect to the server you specified in deploy/production.rb.

Perhaps the hang is because ssh is waiting for something on stdin? Like asking for a password? Make sure you have the necessary public/private key and ssh-agent setup so that you aren't challenged for a passphrase when authenticating. There is information about that here: https://capistranorb.com/documentation/getting-started/authentication-and-authorisation/
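
If it does turn out to be SSH itself stalling, one way to see where it stops without leaving Capistrano is to raise Net::SSH's log verbosity. A sketch, assuming the option passes through ssh_options as usual:

# config/deploy.rb (debugging sketch): Net::SSH accepts a :verbose option,
# so the handshake/auth steps get logged and you can see where it stalls.
set :ssh_options, fetch(:ssh_options, {}).merge(verbose: :debug)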

@anyelopetit
Author

Hi @mattbrictson! I just verified that I can access the server specified in deploy/production.rb by running ssh deploy@[server_ip], and it works! I also verified that my ssh key is in the ~/.ssh/authorized_keys file and in my Bitbucket account.
There isn't any prompt or anything else waiting when I access the server. Do you have any other ideas about this weird bug?

@leehambley
Member

You can take a look at https://capistranorb.com/documentation/faq/why-does-something-work-in-my-ssh-session-but-not-in-capistrano/ and try to simulate more and more of what the rvm plugin is doing by tweaking the way you use ssh.

@leehambley
Member

To be clear, are you using https://github.com/rvm/rvm1-capistrano3 or capistrano/rvm#83 or something else?

@anyelopetit
Author

@leehambley I'm using capistrano-rbenv; these are the Capistrano gems in my Gemfile:

group :development do
  gem 'capistrano'
  gem 'capistrano-bundler'
  gem 'capistrano-passenger'
  gem 'capistrano-rails'
  gem 'capistrano-rbenv', github: 'capistrano/rbenv'
  gem 'capistrano-sidekiq'
end

@anyelopetit
Author

anyelopetit commented Sep 11, 2020

Hi guys! @mattbrictson @leehambley I tried what is described in https://capistranorb.com/documentation/faq/why-does-something-work-in-my-ssh-session-but-not-in-capistrano/ and noticed this:

In staging, my shell is Interactive and Not Login, but in production my shell is Interactive and Login.

So I tried running these commands in staging:

me@localhost $ ssh user@staging_ip "[[ $- == *i* ]] && echo 'Interactive' || echo 'Not interactive'"
Interactive
me@localhost $ ssh user@staging_ip "shopt -q login_shell && echo 'Login shell' || echo 'Not login shell'"
Not login shell

It works!

But when I tried the same in production:

me@localhost $ ssh user@production_ip "[[ $- == *i* ]] && echo 'Interactive' || echo 'Not interactive'"
# NO ANSWER, it keeps waiting for something
me@localhost $ ssh user@production_ip "shopt -q login_shell && echo 'Login shell' || echo 'Not login shell'"
# SAME HERE, NO ANSWER

How can I change my shell to a Not Login shell?

@leehambley
Member

Great bit of research. I'm not sure how that would change the behaviour of rbenv much, but at least we have something to work on.

I'd suggest checking with the people who configured the servers; there are a number of settings on the SSH daemon, and on the default shell, that could change this behaviour.

What do you know about the configuration of both environments' servers?

@leehambley
Member

How can I change my shell to a Not Login shell?

A "log-in" shell, by the way, isn't necessarily the problem; it just means that the server has been configured to treat connections like a log-in shell, i.e. loading your "dot files", if any. It's not impossible that one of them is acting weird when you connect without a TTY (without a terminal).

@Miicky

Miicky commented Oct 1, 2020

I have the same issue, but I can only reproduce it with one internet provider. With another, everything works perfectly. I'm not sure if this info helps you.

@mamantoha

I have the same problem. Deploy gets stuck on macOS. In my case, there are two different fixes:

  1. Using another internet provider fixes the issue on macOS.
  2. Deploying from Linux works well on both internet providers.

@leehambley
Member

I suspect something is blocking, waiting for input that you are not able to provide. If some part of the deploy is, say, requesting a password, maybe on Linux it works "by accident" for you because of some combination of shared environment vars or something in your Ruby env.

Can you please clarify, @mamantoha, what you mean by "another provider"? As in internet provider?

@mamantoha

@leehambley yes, internet provider.

@JonTheStone

I've also noticed this at the end of all my tasks, where it's waiting for the connection pool to close. When I have Activity Monitor open while running tasks, I notice a ruby process starts, followed by three SSH processes (for my servers); then another three open and never close. I tried restarting my machine and the same thing happens. Strange.

@tsrivishnu

tsrivishnu commented Dec 22, 2022

We are facing a similar issue. Deploy runs fine on Linux on all servers. However, on macOS with Docker, the deploy runs fine on our production server but not on the staging server.

The deployment is stuck at:

0:28 deploy:assets:precompile
      01 $HOME/.rbenv/bin/rbenv exec bundle exec rake assets:precompile

I can see on the server that the pre-compilation is running but Capistrano is stuck.

However, when it is run on a Linux machine or a CI server, it works fine.

00:09 deploy:assets:precompile
      01 $HOME/.rbenv/bin/rbenv exec bundle exec rake assets:precompile
      01 yarn install v1.22.17
      01 [1/4] Resolving packages...
      01 [2/4] Fetching packages...
      01 [3/4] Linking dependencies...
      01 warning " > @rails/actiontext@6.1.7" has incorrect peer dependency "trix@^1.2.0".
      01 [4/4] Building fresh packages...
      01 Done in 22.10s.
      01 yarn run v1.22.17
      01 $ RAILS_ENV=${RAILS_ENV:-production} NODE_ENV=${NODE_ENV:-production} && webpack --config webpack.prod.js --env NODE_ENV=${NODE_ENV} --env RAILS_ENV=${RAILS_ENV} --progress
      01 <s> [webpack.Progress] 0% 
      01 
      01 <s> [webpack.Progress] 1% setup before run
      01 <s> [webpack.Progress] 1% setup before run NodeEnvironmentPlugin
      01 <s> [webpack.Progress] 1% setup before run

In all scenarios, Capistrano is running from inside a Docker container.

The problem is only on macOS and only for one server. The exact same Capistrano config and setup works for production deployments on macOS.

I'm not sure how to continue debugging this further, and I would be very happy if someone could help and point me in the right direction.

Additionally, if I run Capistrano for only the deploy:assets:precompile task, it runs without any issues and prints the output.

@mattbrictson
Member

mattbrictson commented Jan 8, 2023

There is a chance this could be an SSH or network timeout issue. See the keep-alive solutions discussed in capistrano/sshkit#503. Edit: fixed issue link
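
As a sketch, the keep-alive settings from that thread would translate to something like this in config/deploy.rb (the interval below is illustrative, not a recommendation):

set :ssh_options, fetch(:ssh_options, {}).merge(
  keepalive: true,          # have Net::SSH send keep-alive packets
  keepalive_interval: 30    # seconds between packets
)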

@stoicAlchemist

I don't know if this is still an issue for people, but it is for me. Findings:

  1. The linked solution doesn't work.
  2. Using watch -n 1 'ps aux | grep ssh | grep -v grep' to monitor the server's ssh connections, I can see the ssh connection stays alive and well, so the server doesn't seem to be doing anything wrong.
  3. For me it happens at the end of the deployment, after the deploy:log_revision task.
  4. It would seem like Capistrano just doesn't close the connection.

To give more information, the Capistrano gems I use are:

  gem "capistrano", "~> 3.6.1", :require => false
  gem "capistrano-rails", "~> 1.1", :require => false
  gem "capistrano-bundler", "~> 1.1", :require => false
  gem "capistrano-faster-assets", "~> 1.0", :require => false

@mattbrictson
Member

@stoicAlchemist sorry to hear you are running into this. Have you tried tweaking the SSHKit connection pool? I would try disabling connection pooling entirely and seeing if that makes a difference.

SSHKit::Backend::Netssh.pool.idle_timeout = 0 # disabled

https://github.com/capistrano/sshkit?tab=readme-ov-file#connection-pooling
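
For example (a sketch; 0 disables pooling entirely, while a larger value keeps pooling but tolerates longer idle gaps between commands):

# config/deploy.rb (sketch)
SSHKit::Backend::Netssh.pool.idle_timeout = 0     # open/close a connection per command
# ...or keep pooling, but with a longer idle window:
# SSHKit::Backend::Netssh.pool.idle_timeout = 300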

Also, you could check your log/capistrano.log file and see what the last lines look like prior to the hang. Maybe there are some clues there as to what Capistrano is waiting on.

@stoicAlchemist

@mattbrictson thanks for the fast response. Using the pool config you mentioned made the problem worse 😅; on the first task it just got "frozen" and didn't go through. I'm trying to deploy to our QA environment and it just hangs:

** Invoke qa (first_time)
** Execute qa
** Invoke load:defaults (first_time)
** Execute load:defaults
** Invoke bundler:map_bins (first_time)
** Execute bundler:map_bins
** Invoke deploy:set_rails_env (first_time)
** Execute deploy:set_rails_env
** Invoke deploy:set_linked_dirs (first_time)
** Execute deploy:set_linked_dirs
** Invoke deploy (first_time)
** Execute deploy
** Invoke deploy:starting (first_time)
** Execute deploy:starting
** Invoke deploy:check (first_time)
** Execute deploy:check
** Invoke git:check (first_time)
** Invoke git:wrapper (first_time)
** Execute git:wrapper
00:00 git:wrapper
      01 mkdir -p /tmp

This is the output using the --trace flag. The log is stuck there too; the remote host did create the ssh connection, and it just hangs there.

Without the config, the log file ends with:

 DEBUG [333093e5] Command: cd /data/*********/releases && ( export PATH="/opt/rubies/2.2.3/bin:$PATH" ; /usr/bin/env echo "Branch qa (at b7fa7386bd87238a8bb8ac7ca0abebd656351eb1) deployed as release 20240415214107 by *******" >> /data/*********/revisions.log )
  INFO [9e29ffab] Finished in 0.306 seconds with exit status 0 (successful).
  INFO [8d9834e3] Finished in 0.308 seconds with exit status 0 (successful).
  INFO [333093e5] Finished in 0.358 seconds with exit status 0 (successful).
  INFO [aaf2a6f8] Finished in 0.389 seconds with exit status 0 (successful).

** I had to anonymize the user and project name; other than that, this is the actual output.

It seems that it is hanging after deploy:log_revision completes successfully.

@mattbrictson
Member

Using the pool config you mentioned made the problem worse

@stoicAlchemist that seems to suggest that closing the connection is indeed the problem. With the pool enabled, a single connection is reused for the entire deploy, then closed at the end. With the pool disabled, it closes the connection after each step. And sure enough, the hang happens right after the first step.

I am a bit stumped, but if I were to make a wild guess, I would say that there is something in the shell init script (e.g. .bashrc) of your remote deploy user that is causing problems. Maybe there is a process being started on login that is running in the background, and thus the shell can't disconnect until the background job exits? I would debug by figuring out what rc file is being sourced when the cap user connects over ssh (it might be in the user's home directory, or it might be a global one coming from somewhere in /etc), and then start removing stuff. 😄

@stoicAlchemist

That sounds perfect. I'll start debugging that part on the server; it might take some time, as I may need a hand from my ops team. I will update here with results.

@terrafied

@stoicAlchemist Did you ever find a root cause for this? I'm experiencing it now and can't find the cause.

@JonTheStone

JonTheStone commented May 3, 2024

For me, we were using a bastion host, so this was part of our ssh_options:

ssh_options[:proxy] = Net::SSH::Proxy::Command.new(
	"ssh -i #{ssh_options[:keys].first} -l #{fetch(:user)} #{jump_server_ip} nc -w 7200 %h %p"
)

Removing the nc -w 7200 and reworking our ssh_options to something like the following worked:

set :ssh_options, {
	user: fetch(:user),
	keys: KEY_HERE,
	keepalive: true,
	keepalive_interval: 60,
	timeout: 1800
}
fetch(:ssh_options)[:proxy] = Net::SSH::Proxy::Command.new("ssh -i #{fetch(:ssh_options)[:keys].first} -l #{fetch(:user)} #{fetch(:jump_server_ip)} -W %h:%p")

Best I could tell, it was preventing the connection pool from closing and thus completing the command.

@terrafied

This makes a lot of sense. I'm using EC2 connect to avoid a bastion host, but there's still an underlying hop. I'll look into this. Thank you so much, @JonTheStone.

@JonTheStone

This makes a lot of sense. I'm using EC2 connect to avoid a bastion host, but there's still an underlying hop. I'll look into this. Thank you so much, @JonTheStone.

Good luck!

@terrafied

The proxy command for an EC2 connection is aws ec2-instance-connect open-tunnel --instance-id <INSTANCEID>, but unfortunately I think it's the AWS CLI itself that's preventing the pool from closing in this case, so it's not as simple as I thought. I will continue debugging and post any solution I find here.

@terrafied

Quickfix to address this for EC2 Instance Connect Endpoint

For those who are using (or considering) the newer EC2 Instance Connect Endpoints instead of bastion hosts, I wanted to provide details on an issue relating to this as well as a workaround.

Background

EC2 Instance Connect allows you to connect to a private EC2 instance using a local AWS CLI command
instead of hopping over a bastion host. Once implemented, you can ssh using a command such as:

ssh ec2-user@[INSTANCE] \
    -i [SSH-KEY] \
    -o ProxyCommand='aws ec2-instance-connect open-tunnel --instance-id %h'

Where [INSTANCE] is the instance-id of our EC2 server.

Our Setup

Our keys and users were already managed for Capistrano, so we simplified this to adding the following
to our [REPO]/.ssh/config file:

Host test-server
  HostName [INSTANCE]
  ProxyCommand aws ec2-instance-connect open-tunnel --instance-id %h

This allows our [REPO]/config/deploy/test.rb to have the following server entry:

server "test-server", user: "httpuser", roles: %w{web}

The Problem

This worked, but I immediately saw the behavior described in this ticket. The deployment would hang following the deploy:log_revision task, which is the final task in the Capistrano flow.

As noted in this comment, setting SSHKit::Backend::Netssh.pool.idle_timeout = 0 caused a hang after the first task. So the behavior for EC2 Instance Connect Endpoints seemed similar to that for bastion hosts and other scenarios.

After some extensive debugging, I figured out that the aws command was remaining open, which made it impossible
for socket.close to finish here, causing the hang.

What's notable is that ssh -l USER test-server would work as expected, because somehow the aws command can see that the SSH connection has been broken and closes when using a TTY, but not when using a socket connection via SSHKit. This strikes me as a problem with SSHKit, but I can't confirm that right now.

Less a solution than a hack

I found that if I manually kill the aws process while the deploy is stuck, the connection pool closes and the deploy
finishes as I'd expect. So I created the following task:

…
  task :kill_aws do
    sh 'killall aws'
  end
…
after 'deploy:finished', 'deploy:kill_aws'

Note, this is a pretty rough hack, and that task can be fleshed out quite a bit to make it more robust, but this gives
you the idea. Basically, we need a final task that shuts down anything that may be artificially holding the connection open.
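
For instance, here is a hedged sketch of a slightly more targeted variant, assuming pkill is available on the machine running Capistrano and that the custom tasks directory below matches your Capfile's import glob:

# lib/capistrano/tasks/kill_aws.rake (a sketch; the path, task name, and
# pattern are assumptions based on the setup described above)
namespace :deploy do
  desc "Kill lingering EC2 Instance Connect tunnels held open by the aws CLI"
  task :kill_aws do
    # Argument form avoids a wrapper shell, so the pattern can't match our own
    # command line; the block swallows pkill's non-zero exit when nothing matches.
    sh "pkill", "-f", "ec2-instance-connect open-tunnel" do |ok, _status|
      puts "No lingering aws tunnels found" unless ok
    end
  end
end

after "deploy:finished", "deploy:kill_aws"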
