This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

update-engine and locksmith configuration #1982

Closed
cemo opened this issue May 29, 2017 · 7 comments

@cemo

cemo commented May 29, 2017

Hi,

My facts:

  • I am using cloud-init (since I cannot download configuration files from private S3 buckets).
  • I don't want to update my system in place, since I update by rolling out new AMIs.
  • I want to use locksmith since coordinated restarts are safer. (A restart can happen for other reasons, not only because of updates.)
  • I cannot mask update-engine.service since cloud-init is buggy.
  • I cannot use /usr/.noupdate since the file system is read-only.

Please provide me an easy way to disable update-engine without stopping other services. I just want to disable updates.

Thanks in advance

@crawford
Contributor

If you just want to disable updates, you can mask update-engine.service. Locksmith will still run, but it effectively won't do anything. You are right that coreos-cloudinit races against update engine. Your best bet is to both stop and mask it. That way, update engine will be stopped if it is already running on the first boot and it will be masked on subsequent boots.

#cloud-config

coreos:
  units:
    - name: update-engine.service
      mask: true
      command: stop

@cemo
Author

cemo commented May 31, 2017

@crawford, I had tried this option too, but I tried it once more. Here are the relevant logs:

Container Linux by CoreOS stable (1353.8.0)
Update Strategy: No Reboots
Failed Units: 1
  update-engine.service
core@ip-192-168-230-117 ~ $ journalctl -u update-engine.service
-- Logs begin at Wed 2017-05-31 07:13:11 UTC, end at Wed 2017-05-31 07:13:57 UTC. --
May 31 07:13:41 localhost systemd[1]: Starting Update Engine...
May 31 07:13:42 localhost update_engine[783]: I0531 07:13:42.710914   783 main.cc:89] CoreOS Update Engine starting
May 31 07:13:42 localhost systemd[1]: Started Update Engine.
May 31 07:13:42 localhost update_engine[783]: I0531 07:13:42.956463   783 update_check_scheduler.cc:74] Next update check in 8m25s
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal systemd[1]: Stopping update-engine.service...
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal systemd[1]: update-engine.service: Main process exited, code=exited, status=1/FAILURE
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal systemd[1]: Stopped update-engine.service.
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal systemd[1]: update-engine.service: Unit entered failed state.
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal systemd[1]: update-engine.service: Failed with result 'exit-code'.

and locksmithd.service:

core@ip-192-168-230-117 ~ $ journalctl -u locksmithd.service
-- Logs begin at Wed 2017-05-31 07:13:11 UTC, end at Wed 2017-05-31 07:15:06 UTC. --
May 31 07:13:42 localhost systemd[1]: Started Cluster reboot manager.
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal locksmithd[824]: Reboot strategy "best-effort" is deprecated and will be removed in the future.
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal locksmithd[824]: Please explicitly set the reboot strategy to one of [off reboot etcd-lock]
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal locksmithd[824]: See https://coreos.com/os/docs/latest/update-strategies.html for details on configuring reboot strategies.
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal locksmithd[824]: No configured reboot window
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal systemd[1]: Stopping Cluster reboot manager...
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal locksmithd[824]: Received interrupt/termination signal - locksmithd is exiting.
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal systemd[1]: Stopped Cluster reboot manager.
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal systemd[1]: Started Cluster reboot manager.
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal locksmithd[926]: No configured reboot window
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal locksmithd[926]: Cannot get update engine status: The name com.coreos.update1 was not provided by any .service files
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal systemd[1]: locksmithd.service: Main process exited, code=exited, status=1/FAILURE
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal systemd[1]: locksmithd.service: Unit entered failed state.
May 31 07:13:43 ip-192-168-230-117.eu-west-1.compute.internal systemd[1]: locksmithd.service: Failed with result 'exit-code'.
May 31 07:13:54 ip-192-168-230-117.eu-west-1.compute.internal systemd[1]: locksmithd.service: Service hold-off time over, scheduling restart.
May 31 07:13:54 ip-192-168-230-117.eu-west-1.compute.internal systemd[1]: Stopped Cluster reboot manager.
May 31 07:13:54 ip-192-168-230-117.eu-west-1.compute.internal systemd[1]: Started Cluster reboot manager.
May 31 07:13:54 ip-192-168-230-117.eu-west-1.compute.internal locksmithd[997]: No configured reboot window
May 31 07:13:54 ip-192-168-230-117.eu-west-1.compute.internal locksmithd[997]: Cannot get update engine status: The name com.coreos.update1 was not provided by any .service files

Locksmith is looking for update engine. As far as I know, a reboot manager should not stop just because it cannot get a status from update engine; I may need to restart the cluster for another reason. Here is the log when I remove the part you suggested adding:

Container Linux by CoreOS stable (1353.8.0)
core@ip-192-168-230-147 ~ $ systemctl status -l locksmithd.service
● locksmithd.service - Cluster reboot manager
   Loaded: loaded (/usr/lib/systemd/system/locksmithd.service; disabled; vendor preset: disabled)
  Drop-In: /run/systemd/system/locksmithd.service.d
           └─20-cloudinit.conf
   Active: active (running) since Wed 2017-05-31 07:09:20 UTC; 15s ago
 Main PID: 946 (locksmithd)
    Tasks: 6 (limit: 32768)
   Memory: 3.3M (limit: 32.0M)
      CPU: 34ms
   CGroup: /system.slice/locksmithd.service
           └─946 /usr/lib/locksmith/locksmithd

May 31 07:09:20 ip-192-168-230-147.eu-west-1.compute.internal systemd[1]: Stopped Cluster reboot manager.
May 31 07:09:20 ip-192-168-230-147.eu-west-1.compute.internal systemd[1]: Started Cluster reboot manager.
May 31 07:09:20 ip-192-168-230-147.eu-west-1.compute.internal locksmithd[946]: No configured reboot window
May 31 07:09:20 ip-192-168-230-147.eu-west-1.compute.internal locksmithd[946]: locksmithd starting currentOperation="UPDATE_STATUS_IDLE" strategy="etcd-lock"
core@ip-192-168-230-147 ~ $ systemctl status -l update-engine.service
● update-engine.service - Update Engine
   Loaded: loaded (/usr/lib/systemd/system/update-engine.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2017-05-31 07:09:19 UTC; 39s ago
 Main PID: 818 (update_engine)
    Tasks: 1 (limit: 32768)
   Memory: 6.1M
      CPU: 10ms
   CGroup: /system.slice/update-engine.service
           └─818 /usr/sbin/update_engine -foreground -logtostderr
May 31 07:09:18 localhost systemd[1]: Starting Update Engine...
May 31 07:09:19 localhost update_engine[818]: I0531 07:09:19.634905   818 main.cc:89] CoreOS Update Engine starting
May 31 07:09:19 localhost systemd[1]: Started Update Engine.
May 31 07:09:19 localhost update_engine[818]: I0531 07:09:19.759829   818 update_check_scheduler.cc:74] Next update check in 5m13s
May 31 07:10:05 ip-192-168-230-147.eu-west-1.compute.internal update_engine[818]: I0531 07:10:05.283128   818 update_attempter.cc:493] Updating boot flags...

@cemo
Author

cemo commented Jun 2, 2017

@crawford Don't you think that this is a bug?

@crawford
Contributor

crawford commented Jun 2, 2017

Ah, I forgot that Locksmith would do this. You'll want to stop and mask both of them. Otherwise, Locksmith will keep attempting to reconnect. Since Locksmith's primary use case is to facilitate automatic updates, it doesn't work well when update engine is stopped. We don't want to support cases where automatic updates are disabled (since that is the thesis of our security model), so I don't think we'll want to change this behavior.
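
For reference, a minimal sketch of what stopping and masking both units could look like. It extends the earlier cloud-config with a second entry for locksmithd.service, on the assumption that the same `mask` and `command` unit options behave the same way for that unit. It has not been verified against this release:

```yaml
#cloud-config

coreos:
  units:
    # Stop update engine if it is already running on first boot,
    # and mask it so it does not start on subsequent boots.
    - name: update-engine.service
      mask: true
      command: stop
    # Assumption: apply the same treatment to Locksmith so it does not
    # keep restarting while trying to reach update engine.
    - name: locksmithd.service
      mask: true
      command: stop
```

With both units masked, nothing on the machine coordinates reboots, so any restart has to be handled outside of Locksmith.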

Can I ask why you want to disable automatic updates?

@cemo
Author

cemo commented Jun 3, 2017

@crawford please forgive my ignorance. I am on the side of immutable architectures: instead of updating in place, I prefer launching a new cluster. I also had some trouble with CoreOS updates last month, and it was a real problem for us.
Honestly, I still believe that locksmith should not depend on update-engine. It can of course use it, but if there is a problem with it, locksmith should not stop. This looks like a bug to me. There is also no indication of this dependency.

There are also legacy systems that are strictly bound to a specific version of CoreOS. In these cases update-engine must not run. But since Locksmith is a reboot manager, such systems can still benefit from it.

What do you think?

@crawford
Contributor

crawford commented Jun 6, 2017

> There are also legacy systems that are strictly bound to a specific version of CoreOS. In these cases update-engine must not run. But since Locksmith is a reboot manager, such systems can still benefit from it.

Locksmith only exists to facilitate automatic reboots. If you have the updates disabled, the machine won't ever reboot and therefore doesn't need to coordinate using Locksmith. What is your use case for Locksmith?

@bgilbert
Contributor

I'll close. @crawford is right; locksmith can only coordinate reboots requested by update-engine, and not other types of reboots.

@cemo By the by, Ignition can now use IAM roles to fetch from private S3 buckets.
