
Documentation for production deployments #1658

Closed
wants to merge 6 commits

Conversation

lazzarello (Contributor)

Add a section for production installation describing snap version pinning and the HA option available starting with 1.19.

I had some old commits in my local repo; these can be squashed, since the diff only touches the README.
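For reference, the HA option being documented here looks roughly like this. This is a minimal sketch; the channel, hostnames, and IP/token are illustrative, and `microk8s add-node` prints the exact join command to use.

```bash
# On the first node: install from a pinned channel and generate a join token
sudo snap install microk8s --classic --channel=1.19/stable
sudo microk8s add-node   # prints a `microk8s join <ip>:25000/<token>` command

# On each additional node: install the same channel, then join the cluster
sudo snap install microk8s --classic --channel=1.19/stable
sudo microk8s join 10.0.0.1:25000/<token>   # use the command printed by add-node

# Check cluster status; HA is reported once three or more nodes have joined (1.19+)
microk8s status --wait-ready
```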

@balchua (Collaborator) commented Oct 20, 2020

@lazzarello the docs can now be contributed from https://discuss.kubernetes.io/c/general-discussions/microk8s/26

Hi @evilnick, what do you think about adding the option mentioned in this PR to the docs?

@lazzarello (Contributor, Author) commented Oct 20, 2020

the docs can now be contributed from https://discuss.kubernetes.io/c/general-discussions/microk8s/26

Thanks for the update. The only language I found referencing a production deployment with no auto-update is under the Alternative installs section. My reasoning for the editorial change to the README in this PR is based on over a year of operating a production deployment of microk8s with industrial control software. This is a production environment where an unattended restart of microk8s could, at worst, destroy physical machines. At best, it would mean losing a day of work and/or filing an incident report with an outside regulatory agency.

I feel strongly that version pinning as described here is not merely an alternative installation method but a strong requirement. I'd suggest that anyone else running a production system where availability is a priority do the same.
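Concretely, the pinning described here amounts to sideloading a specific snap revision so that snapd no longer refreshes it automatically. A minimal sketch, with the channel purely illustrative:

```bash
# Download a specific revision of the snap (also fetches its assertion file)
snap download microk8s --channel=1.19/stable

# Install the downloaded .snap directly; a sideloaded snap has no store
# association, so snapd will not auto-refresh it
sudo snap install ./microk8s_*.snap --classic --dangerous
```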

Unrelated, but I'm not sure why the build failed, since it's just a diff to the README.md file 🤷

@ktsakalozos (Member) commented Oct 21, 2020

@lazzarello, the easy part of your question first: the tests failed due to flakiness.

The hard part now. It is not easy to accept this PR. It is true that with what you recommend you will not get updates, and in setups where you cannot tolerate restarts this is a way to stop snap refreshes. However, we cannot recommend that users not apply patches; think, for example, of security patches. Also consider the cost of supporting users on every possible snap revision. This is why snaps let you schedule when updates reach you but not disable them completely [1].

Feel free to mention your approach in the respective discourse topic as a workaround to this limitation.

[1] https://snapcraft.io/docs/keeping-snaps-up-to-date
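The scheduling described in [1] is controlled through system-wide snapd options. A minimal sketch; the window and hold period are illustrative:

```bash
# Only allow refreshes during a weekly maintenance window
sudo snap set system refresh.timer=sat,01:00-03:00

# Or postpone refreshes until a later date (snapd caps how long a hold can last)
sudo snap set system refresh.hold="$(date -d '+30 days' --iso-8601=seconds)"

# Inspect the current schedule and when the next refresh is due
snap refresh --time
```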

@lazzarello (Contributor, Author)

we cannot recommend that users not apply patches.

No one is recommending that users not apply patches. By disabling auto-update, users control when patches are applied and which ones to apply. Control shifts from the upstream authors to the users. Trusting the user might be a change in philosophy, so I understand the controversy.
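Under that model, applying a patch becomes an explicit operator action rather than a background event. A minimal sketch, with the channel again illustrative:

```bash
# See which updates are pending without applying them
snap refresh --list

# Apply the update during a planned maintenance window
sudo snap refresh microk8s --channel=1.19/stable
```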

I will continue to advocate for this philosophy because microk8s has been a good platform for me and my colleagues, though the dependence on snapcraft is the weakest link. It has broken safety-critical systems during their operation. I've personally had to report to an external regulatory agency after an unattended update broke a high-pressure control system during operation.

I'll make some edits to the text in the README about a distinction between Getting Started, Production, and a third, yet-to-be-named type of deployment. Maybe Mission Critical?

@lazzarello (Contributor, Author)

@ktsakalozos How about this latest diff?

@Bessonov

I think the critical part is what users expect. Auto-update of snaps is very confusing, but from my point of view there are enough red flags to say "select a channel and you get all updates from that channel."

Another way to see it is to ask what the goal of microk8s is. One of those goals is zero ops. True zero ops means, for me, that I install something and just forget about it, like booking a SaaS offering that has SLAs and regularly updates to the latest stable version. That's not possible with Kubernetes, but pinning a channel seems like a very good compromise for production use cases.

But there are enough use cases for pinning an exact version of microk8s, so the information from this PR is very important.

From my point of view, two things are needed. The first is official documentation on how to set up microk8s for production use in a way compatible with the project's goals, which to me is the channel-pinned install. The second is documentation covering other use cases such as exact version pinning (and other topics like production storage). For version pinning I would suggest not the README but the Channels, releases and upgrades page.
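The channel-pinned install mentioned above is just the standard install with an explicit track. A minimal sketch; the track is illustrative:

```bash
# Track a specific release series; snapd will deliver 1.19.x patch releases
# from this channel but will not move the node to 1.20 on its own
sudo snap install microk8s --classic --channel=1.19/stable

# Confirm which channel the node is tracking
snap info microk8s | grep tracking
```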

@lazzarello (Contributor, Author)

Another way to see it is to ask what the goal of microk8s is. One of those goals is zero ops. True zero ops means, for me, that I install something and just forget about it.

I agree that Zero Ops is a goal we can all rally behind. I'm rallying for it right now. I'll give an example of why I'm motivated to put this information in an obvious place in this PR.

Back in The Before Times, when we could all be closer to each other, I was responsible for maintaining infrastructure for a group of people using a custom platform built on top of microk8s to load high-pressure cryogenic gas. I installed microk8s and tracked a stable channel, 1.15/stable, if I remember correctly. This application provided a UI to people at workstations in a control room, like the ones you see in the movies but...like...purchased from Best Buy. These people could see real-time metrics of things like PSI, ullage, etc. The application was required to keep running to maintain control over these values. If it stopped, we no longer knew the current PSI, which could lead to physical damage to materials. Given the Zero Ops value of microk8s, I expected to "set it and forget it" so I wouldn't have to worry about this application restarting without my attention. The principle of least surprise, as some call it.

But one day, there was a surprise. A considerable amount of high-pressure liquid nitrogen was loaded and the operators could no longer view or control the system. We scrambled to troubleshoot the issue. Network connectivity to the servers running the control room application was established, but all our data and UI screens had been down for about 5 minutes so far. A high-pressure tank was sitting in a fenced-in yard while no person could get near it. About 10 minutes in, we saw logs indicating that microk8s had stopped because an auto-update checkpoint had been detected to bump the stable channel to a minor point release. This required shutting down microk8s, saving the old snap, and refreshing to the newer version. About 15 minutes after the outage began, microk8s had started again. Fortunately, everything was safely shut down and the work for the day was cut short, at considerable expense.

Later that day I had to file an incident report with a federal regulatory agency in my home state. microk8s was mentioned in the report.

I'd like the language in the diff to the README in this PR to address production (mission-critical?) scenarios like this one. Currently, my solution is to disable auto-update by pinning a version of the snap. I update when I know it's safe to do so. This information should be front-and-center in the README.md file and included in deeper documentation. I'll give it another go.

@Bessonov

I think you're trying to address a different issue. If it was a single node, then:

The principle of least surprise, as some call it.

isn't a surprise, given the documentation. If it was an HA setup, then the issue is an uncoordinated update. They are two different beasts. Which of them led to the situation?

Does anyone know if the second issue still persists? I assume the snap package is updated without coordinating with microk8s, and therefore an HA setup is impossible without exact version pinning, right? Well, then I agree with @lazzarello.

@lazzarello (Contributor, Author)

I've updated the language to provide a list of options for production and mission-critical deployments. I think it's a good compromise not to recommend version pinning but to suggest it for a niche use case. I've tried to rank the options from simplest to most complicated.

  1. Install the snap from a stable channel with an explicit version
  2. Same as above, plus a custom update schedule targeting non-critical times
  3. HA with >= 3 nodes, each node with a staggered update schedule so that no two nodes are down at the same time (see the sketch below)
  4. Explicit version pinning to disable auto-updates
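A minimal sketch of option 3, assuming a three-node cluster; the windows are illustrative and are set per node so that refreshes never overlap:

```bash
# node-1: refresh only in the first window
sudo snap set system refresh.timer=sat,01:00-02:00
# node-2: refresh only after node-1's window has closed
sudo snap set system refresh.timer=sat,03:00-04:00
# node-3: refresh last
sudo snap set system refresh.timer=sat,05:00-06:00
```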

What do y'all think?

@ktsakalozos (Member)

@lazzarello thank you for this work. I appreciate your focus and persistence in improving MicroK8s and tackling this real life problem.

It is almost impossible to accept a PR suggesting doing a snap download and snap install with --dangerous.

Production deployments where control over refreshes is needed should consider the snap store proxy [1, 2]. The snap store proxy allows for caching, snap revision overriding, and air-gapped deployments.

The LXD snap does a good job of enumerating all the available options [3], including the snap store proxy.

The automatic refresh behavior has been discussed many times at the snapcraft forum [4].

[1] https://snapcraft.io/snap-store-proxy
[2] https://docs.ubuntu.com/snap-store-proxy/en/
[3] https://discuss.linuxcontainers.org/t/managing-the-lxd-snap/8178
[4] https://forum.snapcraft.io/
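For anyone evaluating the proxy route, pointing a machine at a snap store proxy looks roughly like this. This is a minimal sketch based on the proxy documentation [2]; the hostname and store ID are illustrative, and revision overrides are managed on the proxy itself:

```bash
# On each MicroK8s node: trust the proxy's store assertion and switch to it
curl -sL http://snap-proxy.internal/v2/auth/store/assertions | sudo snap ack /dev/stdin
sudo snap set core proxy.store=<store-id>

# Refreshes now come through the proxy, where an administrator can pin which
# microk8s revision the devices are allowed to see
```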

@lazzarello (Contributor, Author)

It is almost impossible to accept a PR suggesting doing a snap download and snap install with --dangerous.

Wow, it's been a crazy month! I'm revisiting this PR because I want to say thank you to all the microk8s devs and also give a video demonstration of why I'm pushing for a mission-critical category in the documentation. The following video is of a rocket launch I've been working on with a team at Astra. As of two days ago, microk8s has helped get a vehicle into space.

https://twitter.com/Astra/status/1338999451893915649?s=20

If this project had front-and-center documentation about disabling auto-updates, the road to this launch would have been much smoother. I hope this example is compelling enough to justify a few sentences of text covering other mission-critical applications, for future projects that are required to control their downtime.

@Bessonov

Coordinated auto upgrade for k3s: https://rancher.com/docs/k3s/latest/en/upgrades/automated/

From my point of view, this seems like a better approach than relying on snap to get it right more by accident than by strategy.

stale bot commented Nov 23, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the inactive label Nov 23, 2021
@lazzarello (Contributor, Author)

Thanks to whoever within Canonical documented how to do an offline install of microk8s! Closing this PR as a duplicate!

lazzarello closed this Dec 16, 2021