Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Live PXE: Ignition failure: failed to set SSH key: permission denied #496

Closed
dustymabe opened this issue May 24, 2020 · 14 comments
Closed

Live PXE: Ignition failure: failed to set SSH key: permission denied #496

dustymabe opened this issue May 24, 2020 · 14 comments

Comments

@dustymabe
Copy link
Member

@dustymabe dustymabe commented May 24, 2020

This was orinally reported in the discussion forum.

Boots of the Live environment via PXE fail because Ignition hits an error: failed to set SSH key: permission denied.

-------------------------------------------------------------------------------
Ignition has failed. Please ensure your config is valid. Note that only Ignition spec
v3.0.0+ configs are accepted.
A CLI validation tool to check this called ignition-validate can be downloaded from GitHub:
    https://github.com/coreos/ignition/releases
Here are the Ignition logs:
Ignition 2.2.1
Stage: fetch
reading system config file "/usr/lib/ignition/base.ign"
parsing config with SHA512: ff6a5153be363997e4d5d3ea8cc4048373a457c48c4a5b134a08a30aacd167c1e0f099f0bdf1e24c99ad180628cd02b767b863b5fe3a8fce3fe1886847eb8e2e
parsed url from cmdline: "https://dustymabe.fedorapeople.org/user-systemd.ign"
GET https://dustymabe.fedorapeople.org/user-systemd.ign: attempt #1
GET error: Get https://dustymabe.fedorapeople.org/user-systemd.ign: dial tcp: lookup dustymabe.fedorapeople.org on [::1]:53: read udp [::1]:35888->[::1]:53: read: connection refused
GET https://dustymabe.fedorapeople.org/user-systemd.ign: attempt #2
GET error: Get https://dustymabe.fedorapeople.org/user-systemd.ign: dial tcp: lookup dustymabe.fedorapeople.org on [::1]:53: read udp [::1]:33563->[::1]:53: read: connection refused
GET https://dustymabe.fedorapeople.org/user-systemd.ign: attempt #3
GET error: Get https://dustymabe.fedorapeople.org/user-systemd.ign: dial tcp: lookup dustymabe.fedorapeople.org on [::1]:53: read udp [::1]:52499->[::1]:53: read: connection refused
GET https://dustymabe.fedorapeople.org/user-systemd.ign: attempt #4
GET error: Get https://dustymabe.fedorapeople.org/user-systemd.ign: dial tcp: lookup dustymabe.fedorapeople.org on [::1]:53: read udp [::1]:57190->[::1]:53: read: connection refused
GET https://dustymabe.fedorapeople.org/user-systemd.ign: attempt #5
GET error: Get https://dustymabe.fedorapeople.org/user-systemd.ign: dial tcp: lookup dustymabe.fedorapeople.org on [::1]:53: read udp [::1]:34506->[::1]:53: read: connection refused
GET https://dustymabe.fedorapeople.org/user-systemd.ign: attempt #6
GET result: OK
parsing config with SHA512: 047ac3b529d55803bb023e1d805a11dc27d862371cefb336e36452027d96a789e5abe3b934070ed1737390ab3e85109a7bf9cd5a88688bbd1e5743b516611a21
fetch: fetch complete
fetch: fetch passed
Ignition finished successfully
Ignition 2.2.1
Stage: disks
reading system config file "/usr/lib/ignition/base.ign"
parsing config with SHA512: ff6a5153be363997e4d5d3ea8cc4048373a457c48c4a5b134a08a30aacd167c1e0f099f0bdf1e24c99ad180628cd02b767b863b5fe3a8fce3fe1886847eb8e2e
disks: disks passed
Ignition finished successfully
INFO     : Ignition 2.2.1
INFO     : Stage: mount
INFO     : reading system config file "/usr/lib/ignition/base.ign"
DEBUG    : parsing config with SHA512: ff6a5153be363997e4d5d3ea8cc4048373a457c48c4a5b134a08a30aacd167c1e0f099f0bdf1e24c99ad180628cd02b767b863b5fe3a8fce3fe1886847eb8e2e
INFO     : mount: mount passed
INFO     : Ignition finished successfully
INFO     : Ignition 2.2.1
INFO     : Stage: files
INFO     : reading system config file "/usr/lib/ignition/base.ign"
DEBUG    : parsing config with SHA512: ff6a5153be363997e4d5d3ea8cc4048373a457c48c4a5b134a08a30aacd167c1e0f099f0bdf1e24c99ad180628cd02b767b863b5fe3a8fce3fe1886847eb8e2e
INFO     : files: createUsers: op(1): [started]  creating or modifying user "core"
DEBUG    : files: createUsers: op(1): executing: "useradd" "--root" "/sysroot" "--create-home" "--password" "*" "--comment" "CoreOS Admin" "--groups" "adm,sudo,systemd-journal,wheel" "core"
INFO     : files: createUsers: op(1): [finished] creating or modifying user "core"
INFO     : files: createUsers: op(2): [started]  adding ssh keys to user "core"
CRITICAL : files: createUsers: op(2): [failed]   adding ssh keys to user "core": failed to set SSH key: permission denied
files failedFull config:
{
  "ignition": {
    "config": {
      "replace": {
        "source": null,
        "verification": {}
      }
    },
    "proxy": {},
    "security": {
      "tls": {}
    },
    "timeouts": {},
    "version": "3.1.0-experimental"
  },
  "passwd": {
    "users": [
      {
        "gecos": "CoreOS Admin",
        "groups": [
          "adm",
          "sudo",
          "systemd-journal",
          "wheel"
        ],
        "name": "core",
        "sshAuthorizedKeys": [
          "ssh-rsa ..."
        ]
      }
    ]
  },
  "storage": {},
  "systemd": {
    "units": [
      {
        "contents": "[Service]\nType=oneshot\nExecStart=/usr/bin/false\n\n[Install]\nWantedBy=multi-user.target",
        "enabled": true,
        "name": "example.service"
      }
    ]
  }
}CRITICAL : Ignition failed: failed to create users/groups: failed to create users: failed to add keys to user "core": failed to set SSH key: permission denied
INFO     : Ignition 2.2.1
INFO     : Stage: umount
INFO     : reading system config file "/usr/lib/ignition/base.ign"
DEBUG    : parsing config with SHA512: ff6a5153be363997e4d5d3ea8cc4048373a457c48c4a5b134a08a30aacd167c1e0f099f0bdf1e24c99ad180628cd02b767b863b5fe3a8fce3fe1886847eb8e2e
INFO     : umount: umount passed
INFO     : Ignition finished successfully

This exists in the latest stable release 31.20200505.3.0. It does not exist in the prior stable release 31.20200420.3.0.

I did a bisect of our testing-devel stream to try to narrow down when the problem was introduced. Here is what I found:

  • 31.20200521.20.0 - failed
  • 31.20200511.20.0 - failed
  • 31.20200507.20.0 - failed
  • 31.20200506.20.1 - passed
  • 31.20200506.20.0 - passed
  • 31.20200505.20.2 - passed
  • 31.20200427.20.1 - passed

So focusing on the 31.20200506.20.1 -> 31.20200507.20.0 we have:

Upgraded:
    ignition 2.2.1-5.git2d3ff58.fc31.x86_64 → 2.3.0-1.gitee616d5.fc31.x86_64
    libldb 2.0.9-1.fc31.x86_64 → 2.0.10-1.fc31.x86_64
    libsmbclient 2:4.11.7-1.fc31.x86_64 → 2:4.11.8-0.fc31.x86_64
    libsolv 0.7.11-1.fc31.x86_64 → 0.7.12-1.fc31.x86_64
    libwbclient 2:4.11.7-1.fc31.x86_64 → 2:4.11.8-0.fc31.x86_64
    samba-client-libs 2:4.11.7-1.fc31.x86_64 → 2:4.11.8-0.fc31.x86_64
    samba-common 2:4.11.7-1.fc31.noarch → 2:4.11.8-0.fc31.noarch
    samba-common-libs 2:4.11.7-1.fc31.x86_64 → 2:4.11.8-0.fc31.x86_64 

One might think the ignition upgrade would be a smoking gun, but the latest stable release doesn't yet have the new ignition and it does still fail. So we need to dig deeper and look at the COSA and FCOS configs repos:

  • 31.20200507.20.0 - failed
    • cosa - 44bdfbbccb0c7fa73fe10d4bba2a41edeb802b8d
    • conf - 5af3d8e8cf0dcf892c7e5616c6487fec7a595d9b
  • 31.20200506.20.1 - passed
    • cosa - 5a07e8a5aef1bdca2272e22cbd9aaed142819f8b
    • conf - 7b1434d2ace9d680016562d4f27de15c37f52fa9

The commits for that range are:

  • FCOS configs from 7b1434d2ace9d680016562d4f27de15c37f52fa9^..5af3d8e8cf0dcf892c7e5616c6487fec7a595d9b
5af3d8e o lockfiles: bump to latest                                       
f4d27ed o overrides: fasttrack ignition-2.3.0-1.gitee616d5.fc31
7b1434d o overlay: 14NetworkManager-plugins: update comment in conf dropin
  • COSA from 5a07e8a5aef1bdca2272e22cbd9aaed142819f8b^..44bdfbbccb0c7fa73fe10d4bba2a41edeb802b8d
44bdfbb o installer: Use consistent variable to refer to built initramfs
6724a4c o installer: Refactor initramfs code
4e60560 o cmdlib: Lower cost of cosa RPM overrides repo
5a07e8a o mantle: bump Ignition to 2.3.0

Out of these it appears the installer: Refactor initramfs code (from coreos/coreos-assembler#1423) seems the most likely suspect. I confirmed that if reset to src/cmd-buildextend-installer before that change (git checkout 6724a4c^ src/cmd-buildextend-installer) and do a local build on top of the testing-devel branch then my system boots normally.

@cgruver
Copy link

@cgruver cgruver commented May 24, 2020

For what it is worth, version 31.20200517.2.0 in the testing stream is working for bare-metal OKD4 builds with iPXE.

I don't know if that helps you narrow down the issue, but there it is. :-)

@cgwalters
Copy link
Member

@cgwalters cgwalters commented May 24, 2020

Nice investigation! I am not offhand understanding why that commit could cause this. Something like a race in when we mount the rootfs?

Clearly though it'd help to refactor testiso into kola run -p qemu-pxe basic so we actually inject SSH keys for example.

@dustymabe
Copy link
Member Author

@dustymabe dustymabe commented May 24, 2020

Nice investigation! I am not offhand understanding why that commit could cause this.

Nothing jumped out at me either. My late night self couldn't quite grok the change in that diff that would cause this.

Something like a race in when we mount the rootfs?

Yeah seems like maybe a timing issue. At least it is consistent so investigating this was straightforward.

Clearly though it'd help to refactor testiso into kola run -p qemu-pxe basic so we actually inject SSH keys for example.

+1000 for CI coverage so we don't regress here in the future.

@dustymabe
Copy link
Member Author

@dustymabe dustymabe commented May 24, 2020

For what it is worth, version 31.20200517.2.0 in the testing stream is working for bare-metal OKD4 builds with iPXE.

In my testing that version is affected (fails the test). Are you providing an ignition config when you PXE boot are are you just doing an install using the kargs (coreos.inst.install_dev, etc)? The former should fail (at least if you set an SSH key), while the latter should work I think.

@cgruver
Copy link

@cgruver cgruver commented May 25, 2020

Yes, the iPXE boot is passing an ignition config for one of bootstrap, master, or worker. I'm using kargs to set IP info and hostname, but everything else is in the ignition configs created by openshift-install.

@jlebon jlebon self-assigned this May 26, 2020
@jlebon jlebon added the jira label May 26, 2020
@jlebon
Copy link
Member

@jlebon jlebon commented May 26, 2020

One can also reproduce this on the live ISO with an embedded Ignition config that tries to add an SSH key.

jlebon added a commit to jlebon/coreos-assembler that referenced this issue May 28, 2020
openshift-merge-robot added a commit to coreos/coreos-assembler that referenced this issue May 28, 2020
Wanted to sanity-check this while debugging
coreos/fedora-coreos-tracker#496.
jlebon added a commit to jlebon/coreos-assembler that referenced this issue May 28, 2020
Super subtle but evil regression from coreos#1423. By switching to `find` to
feed filenames to `cpio`, we were passing it `.`. And `cpio` dutifully
made note of the permissions of `.`, which at the time it is run are
those of the temporary directory we allocated, which is naturally 0700.

This in turn breaks Ignition's `files` stage, which purposely runs
things like adding SSH keys as the target user, and thus it can't even
access the directories in `/` when doing the equivalent of `mkdir -p
/sysroot/var/home/core/.ssh/...`.

Instead, use `-mindepth 1` here so we always skip the root of the
temporary directory itself.

This resolves coreos/fedora-coreos-tracker#496
but not marking as `Closes:` because we should really add a basic live
PXE + SSH key test.
@jlebon
Copy link
Member

@jlebon jlebon commented May 28, 2020

This took me way too long to hunt down. I was looking at squashfs'es, splitting initrds, and doing all sorts of weird diffs. The actual issue turned out to be quite simple: coreos/coreos-assembler#1487

jlebon added a commit to jlebon/coreos-assembler that referenced this issue May 28, 2020
Super subtle but evil regression from coreos#1423. By switching to `find` to
feed filenames to `cpio`, we were passing it `.`. And `cpio` dutifully
made note of the permissions of `.`, which at the time it is run are
those of the temporary directory we allocated, which is naturally 0700.

This in turn breaks Ignition's `files` stage, which purposely runs
things like adding SSH keys as the target user, and thus it can't even
access the directories in `/` when doing the equivalent of `mkdir -p
/sysroot/var/home/core/.ssh/...`.

Instead, use `-mindepth 1` here so we always skip the root of the
temporary directory itself.

This resolves coreos/fedora-coreos-tracker#496
but not marking as `Closes:` because we should really add a basic live
PXE + SSH key test.
@dustymabe dustymabe reopened this May 29, 2020
@dustymabe
Copy link
Member Author

@dustymabe dustymabe commented May 29, 2020

This took me way too long to hunt down. I was looking at squashfs'es, splitting initrds, and doing all sorts of weird diffs. The actual issue turned out to be quite simple: coreos/coreos-assembler#1487

Wow, what a needle in a haystack. At least the haystack was much smaller due to our copious builds and rich history information that allowed us to narrow it down to a single PR.

The merged fix does seem to work for me in a local build:

[   12.439035] ignition[628]: INFO     : files: createUsers: op(1): [started]  creating or modifying user "core"                                                                                      
[   12.442604] ignition[628]: DEBUG    : files: createUsers: op(1): executing: "useradd" "--root" "/sysroot" "--create-home" "--password" "*" "--comment" "CoreOS Admin" "--groups" "adm,sudo,systemd-
journal,wheel" "core"                                                                              
[   12.523767] ignition[628]: INFO     : files: createUsers: op(1): [finished] creating or modifying user "core"                                                                                      
[   12.526146] ignition[628]: INFO     : files: createUsers: op(2): [started]  adding ssh keys to user "core"                                                                                         
[   12.529887] ignition[628]: wrote ssh authorized keys file for user: core                        
[   12.531789] ignition[628]: INFO     : files: createUsers: op(2): [finished] adding ssh keys to user "core"
@cgruver
Copy link

@cgruver cgruver commented May 29, 2020

Excellent! Let us know when this hits the testing stream. I had to regress my OKD4 single-node-cluster build to an earlier testing stream build. So, when this drops I will rerun it to help validate the fix.

@ratzrattillo
Copy link

@ratzrattillo ratzrattillo commented May 30, 2020

Looking forward to seeing it in Stable soon. Today I wanted to set up a productive system with FCOS and unfortunately stumbled upon this error, when booting from a modified installer with embedded ignition config, containing an SSH key. Push it guys! :)

@dustymabe
Copy link
Member Author

@dustymabe dustymabe commented May 31, 2020

hey @ratzrattillo - if you're just using the media to do an install (i.e., you aren't running workloads from a live environment) then you can use the previous set of media from the 31.20200420.3.0 release and complete an install. You can use that media to install the latest release or you can install the previous release and it will auto-update.

@ratzrattillo
Copy link

@ratzrattillo ratzrattillo commented May 31, 2020

@dustymabe Thank you, that helped. Now my systems are up and running :)

@jlebon
Copy link
Member

@jlebon jlebon commented Jun 8, 2020

This is fixed now in the latest releases, and we have a test to watch for regressions.

@jlebon jlebon closed this Jun 8, 2020
@dustymabe
Copy link
Member Author

@dustymabe dustymabe commented Jun 8, 2020

This was fixed in coreos/coreos-assembler#1487

The fix for this went into testing stream release 32.20200601.2.1.

The fix for this went into stable stream release 31.20200517.3.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

6 participants