This repository has been archived by the owner. It is now read-only.

oem-gce.service crashlooping on version 2191.4.1 #2608

Closed
george-angel opened this issue Aug 29, 2019 · 10 comments

Comments

@george-angel george-angel commented Aug 29, 2019

Provider: GCE
CoreOS Container Linux version: 2191.4.1

$ rkt list
UUID            APP     IMAGE NAME                      STATE           CREATED         STARTED         NETWORKS
02c3d817        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  21 minutes ago  21 minutes ago
07b06c28        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  13 minutes ago  13 minutes ago
09314175        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  5 minutes ago   5 minutes ago
0be9b554        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  2 minutes ago   2 minutes ago
0ea0572f        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  22 minutes ago  22 minutes ago
11d1439d        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  16 minutes ago  16 minutes ago
130ecdf9        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  30 minutes ago  30 minutes ago
15fff556        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  27 minutes ago  27 minutes ago
16d68799        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  16 minutes ago  16 minutes ago

Aug 29 09:58:53 etcd-0-k8s-rlpv.c.uw-prod.internal systemd[1]: Starting GCE Linux Agent...
Aug 29 09:59:30 etcd-0-k8s-rlpv.c.uw-prod.internal rkt[11438]: + '[' -e /etc/default/instance_configs.cfg.template ']'
Aug 29 09:59:30 etcd-0-k8s-rlpv.c.uw-prod.internal rkt[11438]: + /usr/bin/google_instance_setup
Aug 29 09:59:30 etcd-0-k8s-rlpv.c.uw-prod.internal rkt[11438]: /init.sh: /usr/bin/google_instance_setup: /usr/lib/python-exec/python2.7/python: bad interpreter: No such file or directory
Aug 29 09:59:31 etcd-0-k8s-rlpv.c.uw-prod.internal systemd[1]: oem-gce.service: Main process exited, code=exited, status=126/n/a
Aug 29 09:59:31 etcd-0-k8s-rlpv.c.uw-prod.internal rkt[11486]: gc: moving pod "3de37879-8f93-4d1c-9717-997fd56715e2" to garbage
Aug 29 09:59:31 etcd-0-k8s-rlpv.c.uw-prod.internal systemd[1]: oem-gce.service: Failed with result 'exit-code'.
Aug 29 09:59:31 etcd-0-k8s-rlpv.c.uw-prod.internal systemd[1]: Failed to start GCE Linux Agent.
Aug 29 09:59:36 etcd-0-k8s-rlpv.c.uw-prod.internal systemd[1]: oem-gce.service: Service RestartSec=5s expired, scheduling restart.
Aug 29 09:59:36 etcd-0-k8s-rlpv.c.uw-prod.internal systemd[1]: oem-gce.service: Scheduled restart job, restart counter is at 285.
Aug 29 09:59:36 etcd-0-k8s-rlpv.c.uw-prod.internal systemd[1]: Stopped GCE Linux Agent.

This seems to be pushing up the load average on the node, and it resulted in high etcd sync durations on the nodes.

2135.6.0 does not appear to have this problem.
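
For anyone triaging the same symptom, here is a minimal sketch of pulling the two telling facts out of journal lines like the ones quoted above: the systemd restart counter and the missing interpreter. The regular expressions are based only on the log lines in this report, and `summarize` is a hypothetical helper name, so treat it as an illustration rather than a general journal parser.

```python
import re

# Patterns derived from the journal lines quoted in this report (assumption:
# other systemd or shell versions may word these messages differently).
RESTART_RE = re.compile(
    r"(?P<unit>\S+\.service): Scheduled restart job, "
    r"restart counter is at (?P<count>\d+)\."
)
BAD_INTERP_RE = re.compile(
    r": (?P<script>\S+): (?P<interp>\S+): "
    r"bad interpreter: No such file or directory"
)


def summarize(lines):
    """Return (latest restart count per unit, [(script, interpreter), ...])."""
    restarts = {}
    missing = []
    for line in lines:
        match = RESTART_RE.search(line)
        if match:
            # Later lines overwrite earlier ones, so the last count wins.
            restarts[match.group("unit")] = int(match.group("count"))
        match = BAD_INTERP_RE.search(line)
        if match:
            missing.append((match.group("script"), match.group("interp")))
    return restarts, missing
```

The input could be the text output of `journalctl -u oem-gce.service`, split into lines. A high restart counter together with a "bad interpreter" entry points at a script whose shebang names a binary that is absent from the image, which matches the exit status 126 in the log.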

@zmarano zmarano commented Aug 29, 2019

FYI: This is broken in all current GCE CoreOS images published yesterday.
coreos-stable-2191-4-1-v20190828
coreos-beta-2219-2-1-v20190828
coreos-alpha-2247-0-0-v20190828
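
The three names above appear to follow a `coreos-<channel>-<version with dashes>-v<build date>` pattern. A minimal sketch, assuming that pattern holds (it is inferred from these three names only, and `gce_image_name` is a hypothetical helper, not part of any CoreOS or Google tooling):

```python
def gce_image_name(channel, version, build_date):
    """Build an image name such as coreos-stable-2191-4-1-v20190828.

    channel:    "alpha", "beta" or "stable"
    version:    dotted release version, e.g. "2191.4.1"
    build_date: publication date as YYYYMMDD, e.g. "20190828"
    The naming pattern is an assumption inferred from the names in this thread.
    """
    return "coreos-{}-{}-v{}".format(channel, version.replace(".", "-"), build_date)
```

The build date is not derivable from the version, so it has to be looked up for any other release.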

@ajeddeloh ajeddeloh commented Aug 29, 2019

Can repro, looking into this.

@matalo33 matalo33 commented Aug 30, 2019

Hiya. Setting aside how this affected all three channels (alpha, beta, and stable), I'd like to know why steps haven't been taken to remove the broken images from circulation.

Currently, anyone launching a stable CoreOS image on Google Compute Engine will be unable to SSH to their instances, because the affected service is responsible for retrieving SSH keys from GCP Project Metadata. Additionally, the constant CPU thrashing caused by systemd trying to start the service every 5 seconds starves small (1 vCPU) instances of resources, so they cannot support their intended function.

Why have the affected images on GCP not been marked as deprecated, and the previous known working images marked as the active members of the image families?

@bgilbert bgilbert (Member) commented Aug 30, 2019

It turns out that the problem was introduced in alpha 2163.0.0. So the issue has been present in the alpha channel since June 4 and the beta channel since June 25. No one has reported it before now, and obviously our CI didn't catch it either.

Upgrades aren't affected, because the agent is in the OEM partition which is not updated. As a workaround, you can launch 2135.6.0 and allow it to update normally.

As a policy, we don't remove released artifacts. We'll revert the coreos-stable image family to 2135.6.0, but the alpha and beta channels have progressed too far to revert. We're working on tracking this down and hope to have a fixed release soon.
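
To make the version boundary concrete, here is a minimal sketch that checks whether a release predates the regression, using the 2163.0.0 figure from the comment above. The helper names are hypothetical, and the check only covers the lower bound: it says nothing about later releases that carry the fix.

```python
# First release known to carry the regression, per the comment above.
REGRESSION_INTRODUCED = (2163, 0, 0)


def parse_version(text):
    """Turn a dotted version such as '2135.6.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def predates_regression(version):
    """True if the release is older than the one that introduced the bug."""
    return parse_version(version) < REGRESSION_INTRODUCED
```

For example, `predates_regression("2135.6.0")` is true, which is why launching that image and letting it update is a usable workaround, while `predates_regression("2191.4.1")` is false.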

@matalo33 matalo33 commented Aug 30, 2019

Thank you for the update! This affected us in production, so there are lessons learnt on our side too.

Interesting how no one noticed since June. We'll switch our lower environments to use coreos-alpha and beta :)

@bgilbert bgilbert (Member) commented Aug 30, 2019

We've switched coreos-stable back to 2135.6.0. Thanks for your patience.

@bgilbert bgilbert (Member) commented Aug 30, 2019

We've applied a fix to all three branches, updated CI to detect the issue, and expect to issue new releases on Wednesday. Thanks for reporting.

@HeikoOnnebrink HeikoOnnebrink commented Aug 31, 2019

Is there any link between this and #2601?

I opened it some time ago but did not get any feedback.

dongsupark added a commit to flatcar-linux/coreos-overlay that referenced this issue Sep 2, 2019
To fix the recent issues about oem-gce.service crashlooping, we need to
bring back python 2.7 instead of 3.6.5 in package.provided. Otherwise
oem-gce.service will not start.

See also coreos/bugs#2608,
coreos/coreos-overlay#3746

@bgilbert bgilbert (Member) commented Sep 3, 2019

@HeikoOnnebrink Nope, this issue wasn't introduced until 2163.0.0.

@george-angel george-angel (Author) commented Sep 10, 2019

The latest stable release, https://coreos.com/releases/#2191.5.0, seems to fix this issue. Closing.

Thank you for the prompt fix!
