This repository has been archived by the owner on Sep 26, 2021. It is now read-only.

Many Docker-Machine Failures - Root Cause with Workaround #4483

Open
CollinChaffin opened this issue May 15, 2018 · 1 comment


CollinChaffin commented May 15, 2018

Yesterday I tweeted and posted a video showing the root cause of this error, and of almost every other Docker-Machine error I have encountered on Windows.

Below is a recap of the primary issue, which I experienced again recently and, frankly, have been experiencing for YEARS. It has been frustrating to watch these Github issues get closed erroneously again and again, leaving us to try a vast range of workarounds without the root cause ever being determined and addressed.

I have personally wasted what probably totals hundreds of hours troubleshooting this and similar Docker on Windows issues, and in all that time I have never seen this resolution posted. It is certainly possible I missed it. If so, please feel free to point me to a dated write-up showing this as the root cause and I will absolutely stand corrected. Otherwise, I believe this is the first time the actual root cause has been fully demonstrated together with a solution.

Also of note: please read to the end, because a common response may be that my findings only apply to recent releases. As I show below, this root cause has been in place for YEARS (I tested all the way back to the initial release of Docker and the results are the same!).

In my three decades in the industry, uncovering this one still felt pretty significant but then again when you've been banging your head against the wall for years on something, it usually does. :)

Background

Github issues have been opened and subsequently closed dating all the way back to ISSUE #66, filed very soon after the initial release. Many of these appear to be reproducible under this issue's cause or to be related to it in some way:

And the Docker forums:

And let's not forget the StackOverflow sites (I'm only posting a couple, but there are MANY):

This issue's root cause is also responsible for all these Kitematic issues, including:

And if that is not enough, just run the two Google searches below. You can bet that most, if not all, of the hits describing these "mystery" 255 return codes, timeouts, and nondescript errors are also this issue:

Issue(s)

  • Management of Docker using Docker-Machine on Windows is impossible using the native shells of Powershell and CMD under certain conditions.

  • Management of Docker using Docker-Machine on any OS may fail with recurring SSH errors.

Cause

Incompatible SSH client implementation.

Tests were run against versions ranging from the current beta release (OpenSSH_for_Windows_7.6p1, LibreSSL 2.6.4, OpenSSH-Win32) all the way back to OpenSSH_7.1p1 Microsoft Pragma Win32 port Oct 7 2015, OpenSSL 1.0.2d 9 Jul 2015. All tested versions exhibited this behavior.

Something in the call to "ip addr show", and possibly in other operations issued by Docker-Machine, is being interpreted incorrectly, resulting in a terminating error. This prevents any successful Docker operations from either native shell in Windows.

This is becoming a bigger and bigger issue now that the incompatible client is automatically installed with Powershell and Chocolatey and added to the system path. Its presence and priority in the system path can vary from machine to machine, which is why this issue does not appear to plague everyone on Windows and why it has been so difficult to troubleshoot.

Because the MinGW bash shell relies on the separately installed Git SSH client, the QuickStart Terminal (usually) works, which has also added to the difficulty in troubleshooting. However, once you move to the native shells to manage the containers created with the QuickStart Terminal, the system path will prioritize the problematic SSH client and cause failures.
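Since the failure depends on which SSH client wins the PATH lookup, it can help to list every client on the path in lookup order. The following is a minimal sketch, not part of Docker-Machine or OpenSSH; the function name `find_ssh_clients` is hypothetical, and it only inspects the PATH of the current process, which on Windows is the SYSTEM value followed by the USER value.

```python
import os


def find_ssh_clients(path_value, names=("ssh.exe", "ssh"), sep=os.pathsep):
    """Return every SSH client found on the given PATH string, in lookup order.

    Hypothetical helper for illustration: the first element of the result is
    the client a native shell would resolve when a tool simply calls "ssh".
    """
    found = []
    for directory in path_value.split(sep):
        # PATH entries are sometimes quoted or padded with spaces.
        directory = directory.strip().strip('"')
        if not directory:
            continue
        for name in names:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                found.append(candidate)
                break  # one hit per directory is enough
    return found


if __name__ == "__main__":
    clients = find_ssh_clients(os.environ.get("PATH", ""))
    for position, client in enumerate(clients, start=1):
        print(position, client)
```

If the first entry printed is the OpenSSH-Win32 client rather than the Git one, the machine is in the configuration described above.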

Based on my testing, and until this is fully resolved, I would suggest the following one-paragraph statement as an accurate summary of the current status of this issue:

Due to current incompatibilities, it is impossible for a user running any version of OpenSSH-Win32, which is installed by default with Powershell & Chocolatey, to install any version of Docker-Toolbox and use the two native shells (Powershell and CMD) to manage Docker without first taking action to disable OpenSSH-Win32 as the system default SSH client.

Note that the real solution (and I am also immediately opening a Github issue there as well) is for the Powershell team, which now maintains the OpenSSH for Windows project, to work with the Docker team. Together they need to capture the low-level debug output from OpenSSH for Docker-Machine's call to "ip addr show" (and other commands) and determine what on earth is being interpreted as a bad return. As I demonstrated, the command clearly returns successfully with the exact same output as the versions using OpenSSL instead of LibreSSL, and the connection via TLS succeeds in all versions.

Issue Reproduction

Screencap video GIF

Workaround

Until the root cause of the incompatibility with the OpenSSH-Win32 client can be addressed, there is only one workaround that has been successfully tested.

Both the SYSTEM and USER PATH environment variables must be edited, and any reference to OpenSSH-Win32 (default path of C:\Program Files\OpenSSH-Win64) must be moved BELOW a compatible SSH client. For example, the SSH client installed with Git is compatible, so if Git is installed, its path of C:\Program Files\Git\usr\bin should be moved above the OpenSSH entry.

After making this change, a reboot is recommended.
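The reordering itself can be sketched as a pure string operation. This is a minimal illustration under stated assumptions: `prioritize_entry` is a hypothetical helper, it only computes the new PATH value and does not write anything to the SYSTEM or USER environment, and the two directories are the default paths mentioned above.

```python
def prioritize_entry(path_value, preferred, demoted, sep=";"):
    """Return path_value with `preferred` placed ahead of `demoted`.

    Hypothetical helper: entries are compared ignoring case, quotes and
    trailing slashes. If `demoted` is absent, or `preferred` already comes
    first, the entries are returned in their existing order. If `preferred`
    is missing from the path, it is added just before `demoted`.
    """
    def normalize(entry):
        return entry.strip().strip('"').rstrip("\\/").lower()

    entries = [e for e in path_value.split(sep) if e.strip()]
    keys = [normalize(e) for e in entries]
    pref_key, dem_key = normalize(preferred), normalize(demoted)

    if dem_key not in keys:
        return sep.join(entries)
    if pref_key in keys and keys.index(pref_key) < keys.index(dem_key):
        return sep.join(entries)

    # Remove any later copy of the preferred entry, then re-insert it
    # directly in front of the first occurrence of the demoted entry.
    kept = [e for e in entries if normalize(e) != pref_key]
    target = [normalize(e) for e in kept].index(dem_key)
    kept.insert(target, preferred)
    return sep.join(kept)


if __name__ == "__main__":
    sample = r"C:\Windows\system32;C:\Program Files\OpenSSH-Win64;C:\Program Files\Git\usr\bin"
    print(prioritize_entry(sample,
                           r"C:\Program Files\Git\usr\bin",
                           r"C:\Program Files\OpenSSH-Win64"))
```

The same function would need to be applied to the SYSTEM and the USER values separately, since both are edited in the workaround.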

@CollinChaffin
Author

UPDATE:

As I indicated, I also opened this issue over at the OpenSSH GH issues in hopes that they can collaborate in determining why this is happening. I am not able to (easily) debug much further, but the one very visible difference is the SSL library used. I can't fathom why that would impact this, especially when you can see that all actual SSH connectivity is successful, even with the OpenSSH client. I realize 255 is a rather generic return code, but the error does state that it is being caused on or around the command "ip addr show". Yet you can watch in my demo that the OpenSSH client also handles that command and prints (unless there is some extra unprintable char) the exact same output when I run it manually, immediately before it is called internally by the "env" command and fails with the terminating error.
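One way to rule the "extra unprintable char" possibility in or out is to capture the raw bytes each client returns for "ip addr show" and scan them. The sketch below is only an assumption about how such a check could look: `find_unprintable` is a hypothetical helper, and the sample bytes are made up for illustration rather than taken from a real capture.

```python
def find_unprintable(data: bytes):
    """Return (offset, byte value) pairs for bytes a terminal would not show.

    Hypothetical helper: printable ASCII, tab and line feed are treated as
    expected; everything else (carriage returns, escape sequences, NULs,
    non-ASCII bytes) is reported.
    """
    expected_controls = {0x09, 0x0A}  # tab, line feed
    return [
        (offset, value)
        for offset, value in enumerate(data)
        if not (0x20 <= value <= 0x7E or value in expected_controls)
    ]


if __name__ == "__main__":
    # Made-up sample lines, not real captured output.
    clean = b"inet 192.168.99.100/24\n"
    suspect = b"inet 192.168.99.100/24\r\n"
    print(find_unprintable(clean))
    print(find_unprintable(suspect))
```

Running it over the output captured from both SSH clients and comparing the two results would show whether the outputs really are byte-for-byte equivalent.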

Here is the issue # over at OpenSSH for collaboration/reference on this:

PowerShell/Win32-OpenSSH#1155
