[BUG] On some (nested) host VMs, crc VMs that were working fail to start: crc log shows "failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain" #3366

Closed
rscher opened this issue Sep 28, 2022 · 5 comments

rscher commented Sep 28, 2022

General information

  • OS: Linux - rhel8.6
  • Hypervisor: KVM - libvirt/qemu
  • Did you run crc setup before starting it (Yes/No)? Yes
  • Running CRC on: VM (nested)

CRC version

```bash
$ crc version
CRC version: 2.9.0+9591a8f
OpenShift version: 4.11.3
Podman version: 4.2.0
```

## CRC status
```bash
$ crc status --log-level debug
DEBU CRC version: 2.9.0+9591a8f                   
DEBU OpenShift version: 4.11.3                    
DEBU Podman version: 4.2.0                        
DEBU Running 'crc status'                         
DEBU Checking file: /home/crcuser/.crc/machines/crc/.crc-exist 
DEBU Checking file: /home/crcuser/.crc/machines/crc/.crc-exist 
DEBU Found binary path at /home/crcuser/.crc/bin/crc-driver-libvirt 
DEBU Launching plugin server for driver libvirt   
DEBU Plugin server listening at address 127.0.0.1:43527 
DEBU () Calling .GetVersion                       
DEBU Using API Version 1                          
DEBU () Calling .SetConfigRaw                     
DEBU () Calling .GetMachineName                   
DEBU (crc) Calling .GetBundleName                 
DEBU (crc) Calling .GetState                      
DEBU (crc) DBG | time="2022-09-27T23:37:57-04:00" level=debug msg="Getting current state..." 
DEBU (crc) DBG | time="2022-09-27T23:37:57-04:00" level=debug msg="Fetching VM..." 
DEBU (crc) Calling .GetIP                         
DEBU (crc) DBG | time="2022-09-27T23:37:57-04:00" level=debug msg="GetIP called for crc" 
DEBU (crc) DBG | time="2022-09-27T23:37:57-04:00" level=debug msg="Getting current state..." 
DEBU (crc) Calling .GetIP                         
DEBU (crc) DBG | time="2022-09-27T23:37:57-04:00" level=debug msg="GetIP called for crc" 
DEBU (crc) DBG | time="2022-09-27T23:37:57-04:00" level=debug msg="Getting current state..." 
DEBU Running SSH command: df -B1 --output=size,used,target /sysroot | tail -1 
DEBU Using ssh private keys: [/home/crcuser/.crc/machines/crc/id_ecdsa /home/crcuser/.crc/cache/crc_libvirt_4.11.3_amd64/id_ecdsa_crc] 
DEBU SSH command results: err: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain, output:  
DEBU Cannot get root partition usage: ssh command error:
command : df -B1 --output=size,used,target /sysroot | tail -1
err     : ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain 
DEBU cannot get OpenShift status: stat /home/crcuser/.crc/machines/crc/kubeconfig: no such file or directory 
DEBU Making call to close driver server           
DEBU (crc) Calling .Close                         
DEBU Successfully made call to close driver server 
DEBU Making call to close connection to plugin binary 
CRC VM:          Running
OpenShift:       Unreachable (v4.11.3)
Podman:          
Disk Usage:      0B of 0B (Inside the CRC VM)
Cache Usage:     16.57GB
Cache Directory: /home/crcuser/.crc/cache
```
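
For reference, the "Using ssh private keys" debug line above names the two keys crc tries. A minimal, hypothetical way to compare their fingerprints (paths copied from that line; the `ssh-keygen -yf` step assumes the bundle key is unencrypted):

```bash
# Fingerprint of the per-machine public key
ssh-keygen -lf ~/.crc/machines/crc/id_ecdsa.pub
# Fingerprint of the bundle key (derive its public half, then fingerprint it from stdin)
ssh-keygen -yf ~/.crc/cache/crc_libvirt_4.11.3_amd64/id_ecdsa_crc | ssh-keygen -lf -
```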

## CRC config
```bash
$ crc config view
- consent-telemetry          : no
- cpus                       : 16
- disk-size                  : 100
- enable-cluster-monitoring  : true
- kubeadmin-password         : *******
- memory                     : 57344
- pull-secret-file           : /home/crcuser/pull-secret.txt
```
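
Worth noting: the delete script later in this issue wipes ~/.crc entirely, so this config is lost and has to be re-applied before the next crc setup. A sketch, using the values shown above (the masked kubeadmin-password is omitted):

```bash
# Re-apply the crc configuration shown above after a full ~/.crc wipe
crc config set consent-telemetry no
crc config set cpus 16
crc config set disk-size 100
crc config set enable-cluster-monitoring true
crc config set memory 57344
crc config set pull-secret-file /home/crcuser/pull-secret.txt
```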

## Host Operating System
```bash
$ cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.6 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.6"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.6 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.6
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.6"
```

### Steps to reproduce

Notes: this is a nested VM within IBM's Castle/Fyre environment.
The latest crc/OCP 4.11.3 generally works great in this environment, but
the environment sometimes gets into this unknown state after an OS or kernel update/reboot,
and I can't get it back. This is not specific to this version of crc; it also occurred on OCP 4.10.x.

  1. Initially create the cluster: crc setup; crc start
     Works fine; I can stop/start crc and reboot the host VM. All is swell (see expected results).
  2. Update the OS/kernel: $ sudo dnf update ; reboot
  3. crc gets stuck in this state and won't start.
     crc log: unable to authenticate, attempted methods [none publickey], no supported methods remain
     DEBU cannot get OpenShift status: stat /home/crcuser/.crc/machines/crc/kubeconfig: no such file or directory
  4. Run the delete_crc.sh script (see below).
  5. crc setup --log-level debug ; crc start --log-level debug
  6. Same result: the cluster won't start. (A consolidated sketch of these steps follows the list.)
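
A consolidated sketch of the steps above (delete_crc.sh is shown later in the issue):

```bash
# Reproduction, condensed from steps 1-6 above
crc setup && crc start                                        # 1: cluster comes up fine
sudo dnf update -y && sudo reboot                             # 2: OS/kernel update
crc start --log-level debug                                   # 3: fails with the ssh publickey error
./delete_crc.sh                                               # 4: full teardown
crc setup --log-level debug && crc start --log-level debug    # 5-6: same result, cluster won't start
```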

### Expected
[crcuser@sno .kube]$ crc status
CRC VM:          Running
OpenShift:       Degraded (v4.11.3)   <--- ok, due to cluster monitoring enabled
Podman:          
Disk Usage:      28.51GB of 106.8GB (Inside the CRC VM)
Cache Usage:     16.57GB
Cache Directory: /home/crcuser/.crc/cache

[crcuser@sno .kube]$ oc get nodes 
NAME                 STATUS   ROLES           AGE   VERSION
crc-wkzjw-master-0   Ready    master,worker   19d   v1.24.0+b62823b

[crcuser@sno .kube]$ oc version 
Client Version: 4.11.3
Kustomize Version: v4.5.4
Server Version: 4.11.3
Kubernetes Version: v1.24.0+b62823b

[crcuser@sno .kube]$ ll ~/.crc/machines/crc/
total 36721692
-rw-r--r-- 1 qemu    qemu    37601017856 Sep 28 00:10 crc.qcow2
-rw------- 1 crcuser crcuser            8 Sep 27 06:40 kubeadmin-password
-rw------- 1 crcuser crcuser         889 Sep 27 06:38 config.json
-rw------- 1 crcuser crcuser       15271 Sep 26 22:44 kubeconfig
-rw------- 1 crcuser crcuser         253 Sep 26 22:39 id_ecdsa.pub
-rw------- 1 crcuser crcuser         384 Sep 26 22:39 id_ecdsa

### Actual
See "Steps to reproduce".
Log:
  "Waiting for machine to come up 59/60 ...Unable to determine VM's IP address, did it fail to boot?"
      err     : ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
     DEBU cannot get OpenShift status: stat /home/crcuser/.crc/machines/crc/kubeconfig: no such file or directory

-- The files in machines/crc are invalid: kubeconfig is missing.
-- crc.qcow2 should be 37601017856 bytes (see the Expected section above for a working env).

[crcuser@sno ~]$ cd ~/.crc/machines/crc/
[crcuser@sno crc]$ ls -l
total 276
-rw------- 1 crcuser crcuser    875 Sep 27 14:09 config.json
-rw------- 1 crcuser crcuser     23 Sep 27 14:09 kubeadmin-password    <--- kubeconfig is missing from this directory
-rw------- 1 crcuser crcuser    384 Sep 27 14:09 id_ecdsa
-rw------- 1 crcuser crcuser    253 Sep 27 14:09 id_ecdsa.pub
-rw-r--r-- 1 qemu    qemu    263744 Sep 27 14:09 crc.qcow2    <---- invalid size

### Logs
https://gist.github.com/rscher/14a4090076bb458fcfb7778008f1e31a

My crc delete script:

```bash
$ cat delete_crc.sh
#!/bin/bash
crc stop -f
crc  delete -f
crc cleanup
virsh -c qemu:///system destroy crc
virsh -c qemu:///system undefine crc
rm -rf ~/.crc
```



rscher commented Sep 28, 2022

Hey guys, Russ here at IBM again.
Recent crc versions, i.e. 4.11.x, have been working well overall in our "Castle/Fyre" infrastructure (yes, it's a nested VM),
but there is some environment issue (details above) that renders a working crc VM environment
inoperable, even after recreating the crc VM with a crc delete/cleanup.
It seems related to the id_ecdsa ssh key or the pull-secret in the keyring, i.e. /run/user/1001/keyring/?
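
A couple of hypothetical sanity checks on the two suspects named above (paths taken from this comment and the listings earlier in the issue):

```bash
# Keyring path mentioned above
ls -la /run/user/1001/keyring/
# Per-machine key pair crc generates, and its fingerprint
ls -l ~/.crc/machines/crc/id_ecdsa*
ssh-keygen -lf ~/.crc/machines/crc/id_ecdsa.pub
```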

I'd really like to get this one solved ... once it occurs on a VM, it's a show stopper.
Let me know any ideas I can try out; I have 2 dead host VMs, 1 with the latest crc.

thanks !
-Russ

cfergeau (Contributor) commented

If crc start fails, then kubeconfig and id_ecdsa will also be missing. Most of what you put in the issue is expected when crc start has failed, but you omitted this step from the issue.
If I understood correctly, you provision a VM in your environment where you will use crc nested. Then you can successfully use crc; crc start, crc status, ... all behave as expected. However, after something happens, crc delete && crc start no longer gives you a working crc instance?

> Let me know any ideas I can try out; I have 2 dead host VMs, 1 with the latest crc.

I would focus on crc delete && crc start, and look at crc logs, try to ssh into crc's vm, ... when crc start fails.
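
A minimal sketch of trying that ssh step by hand when crc start fails; it assumes the libvirt driver defaults seen elsewhere in this issue (domain name crc, key under ~/.crc/machines/crc/) plus the usual core user, so adjust as needed:

```bash
# Ask libvirt for the VM's IP, then attempt the same key-based login crc performs
CRC_IP=$(virsh -c qemu:///system domifaddr crc | awk '/ipv4/ {print $4}' | cut -d/ -f1)
echo "crc VM IP: ${CRC_IP}"
ssh -i ~/.crc/machines/crc/id_ecdsa \
    -o IdentitiesOnly=yes -o StrictHostKeyChecking=no \
    core@"${CRC_IP}"
```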


rscher commented Oct 1, 2022

Hi @cfergeau, mystery solved. The crc instances die when the hosting VPC infrastructure
(aka Castle) uses AMD EPYC processors, and work when Castle uses Intel processors. Resources are allocated dynamically, so it's somewhat random when it allocates Intel vs AMD.
I know we discussed this many times, but I didn't put 2 and 2 together until I ran sos reports.
This is a RHEL/libvirt limitation running crc as a nested VM on AMD procs, and it does not occur on Ubuntu hosts running crc as a nested VM on AMD procs.
It's unfortunate that, as an IBMer heavily invested in RHEL and OpenShift technologies, I am forced to move to Ubuntu in order for crc to run on both Intel and AMD procs.
The culpability can now be placed on RHEL and not Castle, as I originally thought.
SNO has the exact same issue as crc, otherwise I would've moved to SNO.
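
(For reference, a few generic checks, standard KVM/libvirt tooling rather than anything crc-specific, that show whether a nested guest actually got usable virtualization extensions:)

```bash
# Which virt extension does the nested guest's CPU expose? (svm = AMD, vmx = Intel)
grep -oE 'svm|vmx' /proc/cpuinfo | sort -u
# Is a KVM module loaded inside the guest?
lsmod | grep kvm
# libvirt's own host capability check
sudo virt-host-validate qemu
# On the bare-metal AMD host (not the guest): is nested virt enabled?
cat /sys/module/kvm_amd/parameters/nested 2>/dev/null
```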

So we can put this to bed, finally.
Thanks for all your help.
-Russ
IBM Cloud

rscher closed this as completed Oct 1, 2022
praveenkumar (Member) commented

> This is a RHEL/libvirt limitation running crc as a nested VM on AMD procs, and it does not occur on Ubuntu hosts running crc as a nested VM on AMD procs.

@rscher Is there any doc/BZ around this limitation on the RHEL/libvirt side?


apevec commented Dec 8, 2022

> This is a RHEL/libvirt limitation running crc as a nested VM on AMD procs, and it does not occur on Ubuntu hosts running crc as a nested VM on AMD procs.
>
> @rscher Is there any doc/BZ around this limitation on the RHEL/libvirt side?

There is now: https://bugzilla.redhat.com/show_bug.cgi?id=2151878
