
deploy gets stuck and doesn't print anything #2066

Open
anyelopetit opened this issue Sep 8, 2020 · 29 comments

@anyelopetit

anyelopetit commented Sep 8, 2020

Steps to reproduce

cap production deploy
and that's it, just gets stuck forever until I abort it

Expected behavior

  • Capistrano should print a timeout error (or some other message) when it can't connect to the server

Actual behavior

When I run cap production deploy, it shows nothing and just keeps waiting.
But when I run cap production deploy --trace, it shows this:

** Invoke production (first_time)
** Execute production
** Invoke load:defaults (first_time)
** Execute load:defaults
** Invoke rbenv:validate (first_time)
** Execute rbenv:validate

System configuration

https://gist.github.com/anyelopetit/7320fc8141554ea4b75ec88bb02f8132

@leehambley
Member

I would suggest you reopen this on capistrano-rbenv; that's the last plugin doing something when this locks up. Perhaps there's an exhaustive debug mode you can run.

I'll leave this issue open for now, but I don't think we can assist really.

@mattbrictson
Member

Yes, @anyelopetit, please follow up with us when you track down what was causing the freeze. There may be something Capistrano can do better in terms of error messaging.

My assumption is that rbenv:validate is the very first task that actually requires an SSH connection to the remote host, and that SSH connection is failing to open. You could try changing the connection timeout to see if that changes anything.

set :connection_timeout, 5

We probably can't help you fix the reason for the timeout (it is likely due to a firewall between you and the remote host, or an incorrect port number). But we may be able to improve Capistrano to provide better error messaging, or maybe a more reasonable default timeout, so Capistrano doesn't appear to hang indefinitely.
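
If it does turn out to be a port issue, here is a minimal sketch of how the port could be overridden in config/deploy.rb (the value below is just a placeholder, not your actual port):

set :ssh_options, {
  port: 22   # replace with whatever port sshd actually listens on
}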

@anyelopetit
Author

Hey guys! @leehambley @mattbrictson Thank you for answering so quickly!
I just tried changing the connection timeout with:

set :connection_timeout, 5

but it doesn't change anything.

Do you have any idea what is causing this? How can I find out why it gets stuck?
This is preventing me from deploying.

@mattbrictson
Member

Sorry, I don't know what would cause this. My first test would be to try using ssh directly (i.e. take Capistrano out of the picture) and verify that you can connect to the server you specified in deploy/production.rb.

Perhaps the hang is because ssh is waiting for something on stdin? Like asking for a password? Make sure you have the necessary public/private key and ssh-agent setup so that you aren't challenged for a passphrase when authenticating. There is information about that here: https://capistranorb.com/documentation/getting-started/authentication-and-authorisation/
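
If it does turn out to be SSH itself stalling, one way to see where it stops without leaving Capistrano is to raise Net::SSH's log verbosity. A sketch, assuming the option passes through ssh_options as usual:

# config/deploy.rb (debugging sketch): Net::SSH accepts a :verbose option,
# so the handshake/auth steps get logged and you can see where it stalls.
set :ssh_options, fetch(:ssh_options, {}).merge(verbose: :debug)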

@anyelopetit
Author

Hi @mattbrictson! I just verified that I can access the server specified in deploy/production.rb by running ssh deploy@[server_ip], and it works! I also verified that my ssh key is in the ~/.ssh/authorized_keys file and in my Bitbucket account.
There isn't any prompt or anything else waiting when I access the server. Do you have any other ideas about this weird bug?

@leehambley
Member

You can take a look at https://capistranorb.com/documentation/faq/why-does-something-work-in-my-ssh-session-but-not-in-capistrano/ and try to simulate more and more of what the rvm plugin is doing by tweaking the way you use ssh.

@leehambley
Member

To be clear, are you using https://github.com/rvm/rvm1-capistrano3 or capistrano/rvm#83 or something else?

@anyelopetit
Author

@leehambley I'm using capistrano-rbenv; these are the Capistrano gems in my Gemfile:

group :development do
  gem 'capistrano'
  gem 'capistrano-bundler'
  gem 'capistrano-passenger'
  gem 'capistrano-rails'
  gem 'capistrano-rbenv', github: 'capistrano/rbenv'
  gem 'capistrano-sidekiq'
end

@anyelopetit
Author

anyelopetit commented Sep 11, 2020

Hi guys! @mattbrictson @leehambley I tried what is described in https://capistranorb.com/documentation/faq/why-does-something-work-in-my-ssh-session-but-not-in-capistrano/ and noticed this:

In staging, my shell is Interactive and Not Login, but in production my shell is Interactive and Login.

So I tried running these commands in staging:

me@localhost $ ssh user@staging_ip "[[ $- == *i* ]] && echo 'Interactive' || echo 'Not interactive'"
Interactive
me@localhost $ ssh user@staging_ip "shopt -q login_shell && echo 'Login shell' || echo 'Not login shell'"
Not login shell

It works!

But when I tried the same in production:

me@localhost $ ssh user@production_ip "[[ $- == *i* ]] && echo 'Interactive' || echo 'Not interactive'"
# NO ANSWER, it keeps waiting for something
me@localhost $ ssh user@production_ip "shopt -q login_shell && echo 'Login shell' || echo 'Not login shell'"
# SAME HERE, NO ANSWER

How can I change my shell to a Not Login shell?

@leehambley
Member

Great bit of research. I'm not sure how that would change the behaviour of rbenv much, but at least we have something to work on.

I'd suggest checking with the people who configured the servers; there are a number of settings on the SSH daemon, and on the default shell, that could change this behaviour.

What do you know about the configuration of both environments' servers?

@leehambley
Member

How can I change my shell to a Not Login shell?

A "log-in" shell, by the way, isn't necessarily the problem; it just means that the server has been configured to treat connections like a log-in shell, i.e. loading your "dot files", if any. It's not impossible that one of them is acting weird when you connect without a TTY (without a terminal).

@Miicky

Miicky commented Oct 1, 2020

I have the same issue, but I can only reproduce it with one internet provider. With another, everything works perfectly. I'm not sure if this info helps you.

@mamantoha

I have the same problem. Deploy gets stuck on macOS. In my case, there are two different fixes:

  1. Using another internet provider fixes the issue on macOS.
  2. Deploying from Linux works well on both internet providers.

@leehambley
Member

I suspect something is blocking, waiting for input that you are not able to provide. If some part of the deploy is, say, requesting a password, maybe on Linux it works "by accident" for you because of some combination of shared environment vars or something in your Ruby env.

Can you please clarify, @mamantoha, what you mean by "another provider"? As in internet provider?

@mamantoha

@leehambley yes, internet provider.

@JonTheStone

I've also noticed this at the end of all my tasks, where it's waiting for the connection pool to close. When I have Activity Monitor open while running tasks, I notice a ruby process starts, followed by three SSH processes (for my servers); then another three open and never close. I tried restarting my machine and the same thing happens. Strange.

@tsrivishnu

tsrivishnu commented Dec 22, 2022

We are facing a similar issue. Deploy runs fine on Linux on all servers. However, on macOS with Docker, the deploy runs fine on our production server but not on the staging server.

The deployment is stuck at:

0:28 deploy:assets:precompile
      01 $HOME/.rbenv/bin/rbenv exec bundle exec rake assets:precompile

I can see on the server that the pre-compilation is running but Capistrano is stuck.

However, when it is run on a Linux machine or a CI server, it works fine.

00:09 deploy:assets:precompile
      01 $HOME/.rbenv/bin/rbenv exec bundle exec rake assets:precompile
      01 yarn install v1.22.17
      01 [1/4] Resolving packages...
      01 [2/4] Fetching packages...
      01 [3/4] Linking dependencies...
      01 warning " > @rails/actiontext@6.1.7" has incorrect peer dependency "trix@^1.2.0".
      01 [4/4] Building fresh packages...
      01 Done in 22.10s.
      01 yarn run v1.22.17
      01 $ RAILS_ENV=${RAILS_ENV:-production} NODE_ENV=${NODE_ENV:-production} && webpack --config webpack.prod.js --env NODE_ENV=${NODE_ENV} --env RAILS_ENV=${RAILS_ENV} --progress
      01 <s> [webpack.Progress] 0% 
      01 
      01 <s> [webpack.Progress] 1% setup before run
      01 <s> [webpack.Progress] 1% setup before run NodeEnvironmentPlugin
      01 <s> [webpack.Progress] 1% setup before run

In all scenarios, Capistrano is running from inside a Docker container.

The problem is only on macOS and only for one server. The exact same Capistrano config and setup works for production deployments on macOS.

I'm not sure how to continue debugging this further, and I would be very happy if someone could help and point me in the right direction.

Additionally, if I run Capistrano for only the deploy:assets:precompile task, it runs without any issues and prints the output.

@mattbrictson
Member

mattbrictson commented Jan 8, 2023

There is a chance this could be an SSH or network timeout issue. See the keep-alive solutions discussed in capistrano/sshkit#503. Edit: fixed issue link
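
As a sketch, the keep-alive settings from that thread would translate to something like this in config/deploy.rb (the interval below is illustrative, not a recommendation):

set :ssh_options, fetch(:ssh_options, {}).merge(
  keepalive: true,          # have Net::SSH send keep-alive packets
  keepalive_interval: 30    # seconds between packets
)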

@stoicAlchemist

I don't know if this is still an issue for people, but it is for me. Findings:

  1. The linked solution doesn't work.
  2. Using watch -n 1 'ps aux | grep ssh | grep -v grep' to monitor the server's ssh connections, I can see the ssh connection stays alive and well, so the server doesn't seem to be doing anything wrong.
  3. For me it happens at the end of the deployment, after the deploy:log_revision task.
  4. It would seem like Capistrano just doesn't close the connection.

To give more information, the Capistrano gems I use are:

  gem "capistrano", "~> 3.6.1", :require => false
  gem "capistrano-rails", "~> 1.1", :require => false
  gem "capistrano-bundler", "~> 1.1", :require => false
  gem "capistrano-faster-assets", "~> 1.0", :require => false

@mattbrictson
Member

@stoicAlchemist sorry to hear you are running into this. Have you tried tweaking the SSHKit connection pool? I would try disabling connection pooling entirely and seeing if that makes a difference.

SSHKit::Backend::Netssh.pool.idle_timeout = 0 # disabled

https://github.com/capistrano/sshkit?tab=readme-ov-file#connection-pooling
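
For example (a sketch; 0 disables pooling entirely, while a larger value keeps pooling but tolerates longer idle gaps between commands):

# config/deploy.rb (sketch)
SSHKit::Backend::Netssh.pool.idle_timeout = 0     # open/close a connection per command
# ...or keep pooling, but with a longer idle window:
# SSHKit::Backend::Netssh.pool.idle_timeout = 300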

Also, you could check your log/capistrano.log file and see what the last lines look like prior to the hang. Maybe there are some clues there as to what Capistrano is waiting on.

@stoicAlchemist

@mattbrictson thanks for the fast response. Using the pool config you mentioned made the problem worse 😅; on the first task it just got "frozen" and didn't go through. I'm trying to deploy to our QA environment and it just hangs:

** Invoke qa (first_time)
** Execute qa
** Invoke load:defaults (first_time)
** Execute load:defaults
** Invoke bundler:map_bins (first_time)
** Execute bundler:map_bins
** Invoke deploy:set_rails_env (first_time)
** Execute deploy:set_rails_env
** Invoke deploy:set_linked_dirs (first_time)
** Execute deploy:set_linked_dirs
** Invoke deploy (first_time)
** Execute deploy
** Invoke deploy:starting (first_time)
** Execute deploy:starting
** Invoke deploy:check (first_time)
** Execute deploy:check
** Invoke git:check (first_time)
** Invoke git:wrapper (first_time)
** Execute git:wrapper
00:00 git:wrapper
      01 mkdir -p /tmp

This is the output using the --trace flag. The log is stuck there too; the remote host did create the ssh connection, and it just hangs there.

Without the config, the log file ends with:

 DEBUG [333093e5] Command: cd /data/*********/releases && ( export PATH="/opt/rubies/2.2.3/bin:$PATH" ; /usr/bin/env echo "Branch qa (at b7fa7386bd87238a8bb8ac7ca0abebd656351eb1) deployed as release 20240415214107 by *******" >> /data/*********/revisions.log )
  INFO [9e29ffab] Finished in 0.306 seconds with exit status 0 (successful).
  INFO [8d9834e3] Finished in 0.308 seconds with exit status 0 (successful).
  INFO [333093e5] Finished in 0.358 seconds with exit status 0 (successful).
  INFO [aaf2a6f8] Finished in 0.389 seconds with exit status 0 (successful).

** I had to anonymize the user and project name; other than that, this is the actual output.

It seems that it is hanging after deploy:log_revision completes successfully.

@mattbrictson
Member

Using the pool config you mentioned made the problem worse

@stoicAlchemist that seems to suggest that closing the connection is indeed the problem. With the pool enabled, a single connection is reused for the entire deploy, then closed at the end. With the pool disabled, it closes the connection after each step. And sure enough, the hang happens right after the first step.

I am a bit stumped, but if I were to make a wild guess, I would say that there is something in the shell init script (e.g. .bashrc) of your remote deploy user that is causing problems. Maybe there is a process being started on login that is running in the background, and thus the shell can't disconnect until the background job exits? I would debug by figuring out what rc file is being sourced when the cap user connects over ssh (it might be in the user's home directory, or it might be a global one coming from somewhere in /etc), and then start removing stuff. 😄

@stoicAlchemist

That sounds perfect. I'll start debugging that part on the server; it might take some time, as I may need a hand from my ops team. I will update here with results.

@terrafied

@stoicAlchemist Did you ever find a root cause for this? I'm experiencing it now and can't find the cause.

@JonTheStone

JonTheStone commented May 3, 2024

For me, we were using a bastion host, so this was part of our ssh_options:

ssh_options[:proxy] = Net::SSH::Proxy::Command.new(
	"ssh -i #{ssh_options[:keys].first} -l #{fetch(:user)} #{jump_server_ip} nc -w 7200 %h %p"
)

Removing the nc -w 7200 and reworking our ssh_options to something like the following worked:

set :ssh_options, {
	user: fetch(:user),
	keys: KEY_HERE,
	keepalive: true,
	keepalive_interval: 60,
	timeout: 1800
}
fetch(:ssh_options)[:proxy] = Net::SSH::Proxy::Command.new("ssh -i #{fetch(:ssh_options)[:keys].first} -l #{fetch(:user)} #{fetch(:jump_server_ip)} -W %h:%p")

Best I could tell, it was preventing the connection pool from closing and thus completing the command.

@terrafied

This makes a lot of sense. I'm using EC2 connect to avoid a bastion host, but there's still an underlying hop. I'll look into this. Thank you so much, @JonTheStone.

@JonTheStone

This makes a lot of sense. I'm using EC2 connect to avoid a bastion host, but there's still an underlying hop. I'll look into this. Thank you so much, @JonTheStone.

Good luck!

@terrafied

The proxy command for an EC2 connection is aws ec2-instance-connect open-tunnel --instance-id <INSTANCEID>, but unfortunately I think it's the AWS CLI itself that's preventing the pool from closing in this case, so it's not as simple as I thought. I will continue debugging and post any solution I find here.

@terrafied

Quickfix to address this for EC2 Instance Connect Endpoint

For those who are using (or considering) the newer EC2 Instance Connect Endpoints instead of bastion hosts, I wanted to provide details on an issue relating to this as well as a workaround.

Background

EC2 Instance Connect allows you to connect to a private EC2 instance using a local AWS CLI command
instead of hopping over a bastion host. Once implemented, you can ssh using a command such as:

ssh ec2-user@[INSTANCE] \
    -i [SSH-KEY] \
    -o ProxyCommand='aws ec2-instance-connect open-tunnel --instance-id %h'

Where [INSTANCE] is the instance-id of our EC2 server.

Our Setup

Our keys and users were already managed for Capistrano, so we simplified this to adding the following
to our [REPO]/.ssh/config file:

Host test-server
  HostName [INSTANCE]
  ProxyCommand aws ec2-instance-connect open-tunnel --instance-id %h

This allows our [REPO]/config/deploy/test.rb to have the following server entry:

server "test-server", user: "httpuser", roles: %w{web}

The Problem

This worked, but I immediately saw the behavior described in this ticket. The deployment would hang following the deploy:log_revision task, which is the final task in the Capistrano flow.

As noted in this comment, setting SSHKit::Backend::Netssh.pool.idle_timeout = 0 caused a hang after the first task. So the behavior for EC2 Instance Connect Endpoints seemed similar to that for bastion hosts and other scenarios.

After some extensive debugging, I figured out that the aws command was remaining open, which made it impossible
for socket.close to finish here, causing the hang.

What's notable is that ssh -l USER test-server would work as expected, because somehow the aws command can see that the SSH connection has been broken and closes when using a TTY, but not when using a socket connection via SSHKit. This strikes me as a problem with SSHKit, but I can't confirm that right now.

Less a solution than a hack

I found that if I manually kill the aws process while the deploy is stuck, the connection pool closes and the deploy
finishes as I'd expect. So I created the following task:

…
  task :kill_aws do
    sh 'killall aws'
  end
…
after 'deploy:finished', 'deploy:kill_aws'

Note, this is a pretty rough hack, and that task can be fleshed out quite a bit to make it more robust, but this gives
you the idea. Basically, we need a final task that shuts down anything that may be artificially holding the connection open.
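
For instance, here is a hedged sketch of a slightly more targeted variant, assuming pkill is available on the machine running Capistrano and that the custom tasks directory below matches your Capfile's import glob:

# lib/capistrano/tasks/kill_aws.rake (a sketch; the path, task name, and
# pattern are assumptions based on the setup described above)
namespace :deploy do
  desc "Kill lingering EC2 Instance Connect tunnels held open by the aws CLI"
  task :kill_aws do
    # Argument form avoids a wrapper shell, so the pattern can't match our own
    # command line; the block swallows pkill's non-zero exit when nothing matches.
    sh "pkill", "-f", "ec2-instance-connect open-tunnel" do |ok, _status|
      puts "No lingering aws tunnels found" unless ok
    end
  end
end

after "deploy:finished", "deploy:kill_aws"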
