
[BUG] Can not delete Container over ssh if there is more than one #10117

Open
ugal1 opened this issue Dec 21, 2022 · 14 comments

@ugal1

ugal1 commented Dec 21, 2022

Description

Running "down" on a compose project over SSH fails.
It only happens with two or more services in the compose file (a single service works).

Similar issue here: #9185

Steps To Reproduce

1: docker-compose.yml

version: '3.8'
services:

  busybox1:
    image: busybox
    command: "sleep 1d"
    
  busybox2:
    image: busybox
    command: "sleep 1d"

2: Run services, then stop

docker-compose -H ssh://USER@remote up -d && docker-compose -H ssh://USER@remote down

3: An error occurs

[+] Running 2/2
 - Container composev2-busybox2-1  Started                                                                            1.2s
 - Container composev2-busybox1-1  Started                                                                            2.8s
[+] Running 1/2
 - Container composev2-busybox2-1  Removed                                                                           10.5s
 - Container composev2-busybox1-1  Error while Removing                                                              11.8s
error during connect: Delete "http://docker/v1.41/containers/9ec831308cba7cd8c68a698c7e1db9985974033918f58def885d18ba9b9059e4?force=1": command [ssh -l USER -- docker-vm docker system dial-stdio] has exited with exit status 1, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=

Compose Version

Docker version 20.10.12, build 20.10.12-0ubuntu4

Docker Compose version v2.14.0

Docker Environment

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 125
  Running: 6
  Paused: 0
  Stopped: 119
 Images: 90
 Server Version: 20.10.12
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 
 runc version: 
 init version: 
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.0-56-generic
 Operating System: Ubuntu 22.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 30.99GiB
 Name: hugues-tribu
 ID: OLST:TX4F:WU5G:FY2S:6OSC:HPJ6:E33R:UP6P:OAQU:74T7:VPR2:UEBB
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false


@ugal1
Author

ugal1 commented Dec 21, 2022

Some interesting insights here: #8856 (comment)

@ugal1
Author

ugal1 commented Dec 29, 2022

version: '3.8'
services:

  busybox1:
    image: busybox
    command: "sleep 1d"
    
  busybox2:
    image: busybox
    command: "sleep 1d"
    depends_on: [busybox1]

With this depends_on chain it works fine: the services stop correctly without the connection dropping.
This is a working workaround, but it makes no sense to have to do it this way.

@ugal1
Author

ugal1 commented Feb 20, 2023

👀

@ndeloof
Contributor

ndeloof commented May 10, 2023

This is probably caused by docker/cli#3900, which is included in the latest release.
Can you please confirm whether the issue persists with docker compose v2.17.3?

@husjon

husjon commented May 20, 2023

Hi, I'm seeing this occasionally as well with my own docker compose files using an SSH context.
Even a simple compose file like this one fails every now and then:
https://gist.github.com/husjon/0d6aff7e726073dc00259ef39b3d9907#file-docker-compose-yaml

What I've found (and didn't realize earlier) is that docker-compose opens a separate SSH connection per defined service, so for the attached compose file that means 16 individual connections.

The service that fails usually differs from run to run (in the example below it was service-13), but it always fails with the same error, stderr=kex_exchange_identification: read: Connection reset by peer.
It happens on both docker-compose up and docker-compose down:

error during connect: Post "http://docker.example.com/v1.42/containers/create?name=playground-service-13-1": command [ssh -- homeserver docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=kex_exchange_identification: read: Connection reset by peer
Connection reset by 192.168.1.253 port 22

Information about my server:
https://gist.github.com/husjon/0d6aff7e726073dc00259ef39b3d9907#file-docker-information-homeserver

Information about my workstation:
https://gist.github.com/husjon/0d6aff7e726073dc00259ef39b3d9907#file-docker-information-workstation

@ndeloof I originally tested it with v2.17.3, updated today to v2.18.1 and it is unfortunately still happening.

Edit 1:
After playing around with the workaround mentioned in #8856 (comment), setting each service to depend on the previous one, I find that the attached compose file works (with caveats).
It works because docker-compose now only opens a new SSH connection once the previous one has finished.
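For a project with many services, the chained depends_on workaround can be generated instead of written by hand. This is a minimal Python sketch, not something Compose provides: the service-1 to service-N names are hypothetical placeholders, the busybox image and "sleep 1d" command are borrowed from the original report, and it relies on JSON being valid YAML so the output can be saved as a compose file. As described above, the chain trades away parallelism for fewer simultaneous SSH connections.

```python
import json


def chained_compose(count: int, image: str = "busybox") -> dict:
    """Build a compose model in which every service depends on the previous one.

    Chaining serialises start/stop, so Compose waits for one service before
    touching the next (and, per this thread, before opening the next SSH connection).
    """
    services = {}
    for i in range(1, count + 1):
        service = {"image": image, "command": "sleep 1d"}
        if i > 1:
            # Hypothetical naming scheme: service-1, service-2, ...
            service["depends_on"] = [f"service-{i - 1}"]
        services[f"service-{i}"] = service
    return {"services": services}


if __name__ == "__main__":
    # JSON is a subset of YAML, so this can be written straight to a compose file.
    print(json.dumps(chained_compose(16), indent=2))
```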

Edit 2:
I played around a bit further and found that even when limiting parallelism with docker-compose --parallel 4, docker-compose still opens an SSH connection to the remote server for every service at once: with the 16 services in the linked docker-compose file, it opens all 16 connections when it should be waiting.

@husjon

husjon commented Jun 7, 2023

I have not been able to look further into this until now.
Seen from sshd's side, the connections are being throttled because too many are opened within a short window:

Jun 07 20:37:42 homeserver sshd[2268386]: pam_unix(sshd:session): session closed for user husjon
Jun 07 20:37:42 homeserver sshd[493]: exited MaxStartups throttling after 00:00:01, 1 connections dropped
Jun 07 20:37:42 homeserver sshd[493]: error: beginning MaxStartups throttling
Jun 07 20:37:42 homeserver sshd[493]: drop connection #11 from [192.168.1.201]:49814 on [192.168.1.253]:22 past MaxStartups
Jun 07 20:37:42 homeserver sshd[2268377]: pam_unix(sshd:session): session closed for user husjon
Jun 07 20:37:42 homeserver sshd[2268525]: Connection closed by 192.168.1.201 port 49706 [preauth]
Jun 07 20:37:42 homeserver sshd[2268388]: pam_unix(sshd:session): session closed for user husjon
Jun 07 20:37:42 homeserver sshd[2268526]: ssh_dispatch_run_fatal: Connection from 192.168.1.201 port 49712: Broken pipe [preauth]

This is caused by the throttling feature in OpenSSH (https://man.openbsd.org/sshd_config#MaxStartups).

Increasing MaxStartups in /etc/ssh/sshd_config makes the issue go away, but at the cost of security.

@d-ph

d-ph commented Nov 29, 2023

sshd's default MaxStartups value (10:30:100, i.e. allow up to 10 concurrent unauthenticated SSH connections, then randomly close extra ones until there are 100 concurrent attempts, at which point hard-reject every extra one) should honestly be mentioned in the Docker Compose docs. Everyone starts googling for a solution the moment they have more than 10 containers defined in their docker-compose.yml file.
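To make the 10:30:100 numbers concrete: per the sshd_config man page, once `start` unauthenticated connections are pending, sshd refuses new attempts with probability rate/100, rising linearly until every attempt is refused at `full`. The Python below is a small model of that documented rule (not sshd's own source), using integer arithmetic. With 16 handshakes already pending, as a 16-service project can produce, the next one is refused roughly a third of the time.

```python
def drop_probability(unauthenticated: int, start: int = 10, rate: int = 30, full: int = 100) -> int:
    """Percent chance that a new connection is refused under MaxStartups start:rate:full.

    `unauthenticated` is the number of connections still in the pre-auth phase
    when the new one arrives. Defaults are OpenSSH's 10:30:100.
    """
    if unauthenticated < start:
        return 0  # below `start`, nothing is dropped
    if unauthenticated >= full:
        return 100  # at `full`, everything is dropped
    # Linear ramp from rate% at `start` to 100% at `full`.
    return rate + (100 - rate) * (unauthenticated - start) // (full - start)


for pending in (9, 10, 16, 55, 100):
    print(pending, "pending ->", drop_probability(pending), "% chance of refusal")
```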

I'm pasting my specific CLI error so that Google indexes it and the next person looking for it doesn't spend hours finding it:

unable to get image 'nginx:1.15-alpine': error during connect: Get "http://docker.example.com/v1.42/images/nginx:1.15-alpine/json": command [ssh -o ConnectTimeout=30 -l vagrant -- 192.168.33.100 docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=kex_exchange_identification: read: Connection reset by peer

@ndeloof
Contributor

ndeloof commented Nov 29, 2023

Docker Compose by nature runs multiple Docker API calls concurrently and can indeed quickly reach the SSH limits.
Docker's SSH support is implemented by running docker system dial-stdio on the remote host, and it doesn't offer multiplexing, which would let multiple API calls share a single SSH connection. I don't expect a short-term workaround for this limitation.
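As a rough illustration of that mechanism, the Python sketch below sends one Docker API request (GET /_ping) by spawning the same kind of ssh command that appears in the error messages in this thread and writing raw HTTP to its stdin. It is not the Docker CLI's code. It assumes the remote user can run docker, and that dial-stdio exits once the daemon closes the connection (requested here via "Connection: close"). The point is that every call starts its own ssh process, so N concurrent calls mean N concurrent SSH handshakes.

```python
import subprocess
from typing import List, Optional


def dial_stdio_argv(host: str, user: Optional[str] = None) -> List[str]:
    """Build an ssh command line of the shape shown in this thread's error messages."""
    argv = ["ssh"]
    if user:
        argv += ["-l", user]
    return argv + ["--", host, "docker", "system", "dial-stdio"]


def ping(host: str, user: Optional[str] = None) -> bytes:
    """Send a single GET /_ping over a fresh SSH connection and return the raw reply."""
    request = (
        b"GET /_ping HTTP/1.1\r\n"
        b"Host: docker\r\n"
        b"Connection: close\r\n"
        b"\r\n"
    )
    # One ssh process (and therefore one SSH connection) per API call.
    with subprocess.Popen(
        dial_stdio_argv(host, user), stdin=subprocess.PIPE, stdout=subprocess.PIPE
    ) as proc:
        proc.stdin.write(request)
        proc.stdin.flush()
        # The daemon closes the connection after replying, dial-stdio exits,
        # and stdout reaches EOF; stdin stays open until then.
        return proc.stdout.read()
```

Calling ping("homeserver") against a reachable host should return the raw HTTP response; if the handshake is dropped by MaxStartups, ssh exits with status 255 and the reply is empty.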

@d-ph

d-ph commented Dec 4, 2023

@ndeloof

Thanks for explaining this.

Do you know whether it would be possible for Docker Compose to respect the --parallel parameter (e.g. --parallel 1) and not open more SSH connections than that? I'm just wondering what sysadmins could do if they can't increase the sshd limits in the sshd_config file.
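Purely as an illustration of the kind of client-side cap being asked about (not how Compose is implemented), this Python sketch runs one call per service through a thread pool whose size bounds how many calls are in flight at once. If each call opened its own SSH connection, the pool size would also bound the number of simultaneous handshakes. PeakCounter is a hypothetical stand-in for such a call that only records peak concurrency.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_all(services: Iterable[T], call: Callable[[T], R], max_connections: int) -> List[R]:
    """Run call(service) for every service with at most max_connections in flight."""
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        return list(pool.map(call, services))  # results keep the input order


class PeakCounter:
    """Hypothetical stand-in for an SSH-backed API call; records peak concurrency."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __call__(self, service: str) -> str:
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # pretend to do the SSH handshake and the API request
        with self._lock:
            self.active -= 1
        return service


counter = PeakCounter()
run_all([f"service-{i}" for i in range(1, 17)], counter, max_connections=4)
print("peak concurrent calls:", counter.peak)
```

With 16 services and max_connections=4, the recorded peak never exceeds 4.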

@ndeloof
Contributor

ndeloof commented Dec 6, 2023

Have you tried enabling SSH multiplexing on your client? This would let the docker compose command rely on a single SSH session to reach the remote Docker engine.

see #8191 (comment) for context

@ndeloof
Contributor

ndeloof commented Dec 6, 2023

I tried to have the docker CLI enable SSH multiplexing automatically again, but my PR fails for some non-obvious reason. It will need more eyes to help diagnose this :)
docker/cli#4699

@husjon

husjon commented Dec 6, 2023

@ndeloof I just tried with multiplexing enabled in my SSH config for one of my Docker nodes:

ControlPath ~/.ssh/controlmasters-%r@%h:%p
ControlMaster auto
ControlPersist 10m

Note: compared to the SSH multiplexing example, I changed the path to ~/.ssh/controlmasters-%r@%h:%p, because the example's path requires the folder to exist beforehand.

I see that the server only receives 1 SSH connection, as expected.
Correction: it seems it still receives each connection individually.

Correction 2: I jumped to conclusions too quickly; I hadn't switched the context.
Using the example compose file below, it seems to work fine with multiplexing.

https://gist.github.com/husjon/0d6aff7e726073dc00259ef39b3d9907#file-docker-compose-yaml
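Since it is easy to end up testing against the wrong context, one way to confirm that a multiplexing master is actually up for a host is `ssh -O check <host>`, which asks the control master for that host whether it is running and exits with status 0 if it is. A small Python wrapper, assuming the host alias passed in (e.g. the placeholder "homeserver") is the same one the Docker context's ssh:// URL uses:

```python
import subprocess
from typing import List


def check_argv(host: str) -> List[str]:
    """ssh control command that queries the multiplexing master for `host`."""
    return ["ssh", "-O", "check", host]


def has_master(host: str) -> bool:
    """Return True if a ControlMaster for `host` is currently running."""
    # ssh prints its status message on stderr; only the exit code matters here.
    result = subprocess.run(check_argv(host), capture_output=True, text=True)
    return result.returncode == 0
```

Running has_master("homeserver") right after a docker compose command should return True while ControlPersist keeps the master alive.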

@d-ph

d-ph commented Dec 7, 2023

@ndeloof

I reverted the /etc/sshd_config::MaxStartups config change, and enabled ssh multiplexing using the .ssh/config snippet that husjon mentioned, and I can confirm that enabling ssh multiplexing for my "remote" docker machine "ssh hosts" works. As an added bonus: the docker compose up -d seems to run marginally faster -- the docker containers reach the "Created" state all-at-once. This is most likely due to not having to go through the same ssh-handshaking over 10 times.

The bottom line is that the PR you proposed not only fixes the problem but also makes things run in a more proper way (one could argue that spamming a remote sshd with logins is not entirely proper, since it resembles a minor DDoS attack).

Have a good day.

@LaXiS96

LaXiS96 commented Jan 8, 2024

I'd like to add that the multiplexing trick does not work on Windows since its OpenSSH implementation does not support the feature. Windows users can therefore only resort to the MaxStartups server-side config.
I don't know how Docker Desktop users are doing, but in my case with remote Linux Engines this issue is highly impactful as it consistently breaks multi-container deployments.

Since switching from certificate-based to SSH authentication (RIP RancherOS), I've noticed a substantial slowdown in all docker commands (both the plain CLI and compose), along with server-side SSH log entries like kex_exchange_identification: connection reset by peer. I don't know whether it's a bug in Windows' OpenSSH implementation or in Docker, but it's quite an inconvenience.

Is there a Docker-side code change that can resolve the situation? If so, can we expect to find it soon in an upcoming release? (maybe after #11165 is also fixed)
