oem-gce.service crashlooping on version 2191.4.1 #2608

Closed
george-angel opened this issue Aug 29, 2019 · 10 comments

@george-angel

commented Aug 29, 2019

Provider: GCE
CoreOS Container Linux version: 2191.4.1

$ rkt list
UUID            APP     IMAGE NAME                      STATE           CREATED         STARTED         NETWORKS
02c3d817        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  21 minutes ago  21 minutes ago
07b06c28        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  13 minutes ago  13 minutes ago
09314175        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  5 minutes ago   5 minutes ago
0be9b554        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  2 minutes ago   2 minutes ago
0ea0572f        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  22 minutes ago  22 minutes ago
11d1439d        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  16 minutes ago  16 minutes ago
130ecdf9        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  30 minutes ago  30 minutes ago
15fff556        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  27 minutes ago  27 minutes ago
16d68799        oem-gce coreos.com/oem-gce:2191.4.1     exited garbage  16 minutes ago  16 minutes ago

Aug 29 09:58:53 etcd-0-k8s-rlpv.c.uw-prod.internal systemd[1]: Starting GCE Linux Agent...
Aug 29 09:59:30 etcd-0-k8s-rlpv.c.uw-prod.internal rkt[11438]: + '[' -e /etc/default/instance_configs.cfg.template ']'
Aug 29 09:59:30 etcd-0-k8s-rlpv.c.uw-prod.internal rkt[11438]: + /usr/bin/google_instance_setup
Aug 29 09:59:30 etcd-0-k8s-rlpv.c.uw-prod.internal rkt[11438]: /init.sh: /usr/bin/google_instance_setup: /usr/lib/python-exec/python2.7/python: bad interpreter: No such file or directory
Aug 29 09:59:31 etcd-0-k8s-rlpv.c.uw-prod.internal systemd[1]: oem-gce.service: Main process exited, code=exited, status=126/n/a
Aug 29 09:59:31 etcd-0-k8s-rlpv.c.uw-prod.internal rkt[11486]: gc: moving pod "3de37879-8f93-4d1c-9717-997fd56715e2" to garbage
Aug 29 09:59:31 etcd-0-k8s-rlpv.c.uw-prod.internal systemd[1]: oem-gce.service: Failed with result 'exit-code'.
Aug 29 09:59:31 etcd-0-k8s-rlpv.c.uw-prod.internal systemd[1]: Failed to start GCE Linux Agent.
Aug 29 09:59:36 etcd-0-k8s-rlpv.c.uw-prod.internal systemd[1]: oem-gce.service: Service RestartSec=5s expired, scheduling restart.
Aug 29 09:59:36 etcd-0-k8s-rlpv.c.uw-prod.internal systemd[1]: oem-gce.service: Scheduled restart job, restart counter is at 285.
Aug 29 09:59:36 etcd-0-k8s-rlpv.c.uw-prod.internal systemd[1]: Stopped GCE Linux Agent.

This seems to be pushing up the load average on the node, and it has resulted in high etcd sync durations on the nodes.

Version 2135.6.0 does not appear to have this problem.
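
For anyone triaging a similar crash loop, here is a minimal Python sketch that pulls the two telling facts out of journal lines like the ones above: the highest restart counter and any interpreter path reported as missing. The sample lines are copied from this report; the function name `summarize` and the regular expressions are illustrative assumptions, not part of any CoreOS tooling.

```python
import re

# Sample journal lines copied verbatim from the report above.
SAMPLE_LINES = [
    "Aug 29 09:59:30 etcd-0-k8s-rlpv.c.uw-prod.internal rkt[11438]: /init.sh: "
    "/usr/bin/google_instance_setup: /usr/lib/python-exec/python2.7/python: "
    "bad interpreter: No such file or directory",
    "Aug 29 09:59:31 etcd-0-k8s-rlpv.c.uw-prod.internal systemd[1]: "
    "oem-gce.service: Main process exited, code=exited, status=126/n/a",
    "Aug 29 09:59:36 etcd-0-k8s-rlpv.c.uw-prod.internal systemd[1]: "
    "oem-gce.service: Scheduled restart job, restart counter is at 285.",
]

# Patterns written against the message formats seen in this report (assumption:
# other systemd or shell versions may word these messages differently).
RESTART_RE = re.compile(
    r"(\S+\.service): Scheduled restart job, restart counter is at (\d+)\."
)
BAD_INTERP_RE = re.compile(r": (\S+): bad interpreter: ")


def summarize(lines):
    """Collect the unit name, highest restart counter, and missing interpreters."""
    summary = {"unit": None, "restarts": 0, "missing_interpreters": set()}
    for line in lines:
        restart = RESTART_RE.search(line)
        if restart:
            summary["unit"] = restart.group(1)
            summary["restarts"] = max(summary["restarts"], int(restart.group(2)))
        bad = BAD_INTERP_RE.search(line)
        if bad:
            summary["missing_interpreters"].add(bad.group(1))
    return summary


if __name__ == "__main__":
    print(summarize(SAMPLE_LINES))
```

On a live node, the same function could be fed the text output of `journalctl -u oem-gce.service` split into lines.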

@zmarano

commented Aug 29, 2019

FYI: This is broken in all current GCE CoreOS images published yesterday.
coreos-stable-2191-4-1-v20190828
coreos-beta-2219-2-1-v20190828
coreos-alpha-2247-0-0-v20190828

@ajeddeloh

commented Aug 29, 2019

Can repro, looking into this.

@matalo33

commented Aug 30, 2019

Hiya. Leaving aside the question of how this reached all three channels (alpha, beta, and stable), I'd like to know why steps haven't been taken to remove the broken images from circulation.

Currently, anyone launching a stable CoreOS image on Google Compute Engine will be unable to SSH to their instances, because the affected service is responsible for retrieving SSH keys from GCP project metadata. Additionally, the constant CPU thrashing caused by systemd trying to start the service every 5 seconds starves small (1 vCPU) instances of resources, so they cannot support their intended function.

Why have the affected images on GCP not been marked as deprecated, and the previous known-working images made the active members of the image families?

@bgilbert

Member

commented Aug 30, 2019

It turns out that the problem was introduced in alpha 2163.0.0. So the issue has been present in the alpha channel since June 4 and the beta channel since June 25. No one has reported it before now, and obviously our CI didn't catch it either.

Upgrades aren't affected, because the agent is in the OEM partition which is not updated. As a workaround, you can launch 2135.6.0 and allow it to update normally.

As a policy, we don't remove released artifacts. We'll revert the coreos-stable image family to 2135.6.0, but the alpha and beta channels have progressed too far to revert. We're working on tracking this down and hope to have a fixed release soon.
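
The version reasoning above can be sketched as a small Python check. This is only an illustration: the names `fresh_image_possibly_affected` and `known_fixed` are hypothetical, and the only threshold taken from this thread is the 2163.0.0 introduction point. It applies to freshly launched GCE images, since upgraded nodes keep their existing OEM partition, and the caller has to supply any releases known to contain the fix.

```python
# Version in which the regression was introduced, per the comment above.
INTRODUCED_IN = (2163, 0, 0)


def parse_version(text):
    """Turn a dotted version such as '2191.4.1' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def fresh_image_possibly_affected(version, known_fixed=()):
    """Return True if a freshly launched image of this version may be affected.

    known_fixed is a caller-supplied collection of version strings that are
    known to contain the fix (hypothetical parameter, not an official list).
    """
    parsed = parse_version(version)
    if any(parsed == parse_version(fixed) for fixed in known_fixed):
        return False
    return parsed >= INTRODUCED_IN


if __name__ == "__main__":
    for candidate in ("2135.6.0", "2191.4.1"):
        print(candidate, fresh_image_possibly_affected(candidate))
```

Tuples of integers compare element by element, which avoids the pitfalls of comparing version strings lexically.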

@matalo33

commented Aug 30, 2019

Thank you for the update! This affected us in production, so there are lessons learnt on our side too.

Interesting how no one noticed since June. We'll switch our lower environments to use coreos-alpha and beta :)

@bgilbert

Member

commented Aug 30, 2019

We've switched coreos-stable back to 2135.6.0. Thanks for your patience.

@bgilbert

Member

commented Aug 30, 2019

We've applied a fix to all three branches, updated CI to detect the issue, and expect to issue new releases on Wednesday. Thanks for reporting.

@HeikoOnnebrink

commented Aug 31, 2019

Is there any link between this issue and #2601?

I opened it some time ago but did not get any feedback.

dongsupark added a commit to flatcar-linux/coreos-overlay that referenced this issue Sep 2, 2019
profiles/oem-aci: bring back python 2.7 to fix oem-gce crashlooping
To fix the recent issues about oem-gce.service crashlooping, we need to
bring back python 2.7 instead of 3.6.5 in package.provided. Otherwise
oem-gce.service will not start.

See also coreos/bugs#2608,
coreos/coreos-overlay#3746

@bgilbert

Member

commented Sep 3, 2019

@HeikoOnnebrink Nope, this issue wasn't introduced until 2163.0.0.

@george-angel

Author

commented Sep 10, 2019

The latest stable release, https://coreos.com/releases/#2191.5.0, seems to fix this issue. Closing.

Thank you for the prompt fix!
