
Documentation for production deployments #1658

Closed
wants to merge 6 commits

Conversation

lazzarello (Contributor)

Add a section for production installation describing snap version pinning and the HA option available starting with 1.19.

I had some old commits in my local repo; these can be squashed, since the diff only touches the README.
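For reference, the HA option being documented here looks roughly like this. This is a minimal sketch; the channel, hostnames, and IP/token are illustrative, and `microk8s add-node` prints the exact join command to use.

```bash
# On the first node: install from a pinned channel and generate a join token
sudo snap install microk8s --classic --channel=1.19/stable
sudo microk8s add-node   # prints a `microk8s join <ip>:25000/<token>` command

# On each additional node: install the same channel, then join the cluster
sudo snap install microk8s --classic --channel=1.19/stable
sudo microk8s join 10.0.0.1:25000/<token>   # use the command printed by add-node

# Check cluster status; HA is reported once three or more nodes have joined (1.19+)
microk8s status --wait-ready
```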

@balchua (Collaborator) commented Oct 20, 2020

@lazzarello the docs can now be contributed from https://discuss.kubernetes.io/c/general-discussions/microk8s/26

Hi @evilnick, what do you think about adding the option mentioned in this PR to the docs?

@lazzarello (Contributor, Author) commented Oct 20, 2020

the docs can now be contributed from https://discuss.kubernetes.io/c/general-discussions/microk8s/26

Thanks for the update. The only language I found referencing a production deployment with no auto-update is under the Alternative installs section. My reasoning for the editorial change to the README in this PR is based on over a year of operating a production deployment of microk8s with industrial control software. This is a production environment where an unattended restart of microk8s could, at worst, destroy physical machines. At best, it would mean losing a day of work and/or filing an incident report with an outside regulatory agency.

I feel strongly that version pinning as described here is not merely an alternative installation method but a strong requirement. I'd suggest that anyone else running a production system where availability is a priority do the same.
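Concretely, the pinning described here amounts to sideloading a specific snap revision so that snapd no longer refreshes it automatically. A minimal sketch, with the channel purely illustrative:

```bash
# Download a specific revision of the snap (also fetches its assertion file)
snap download microk8s --channel=1.19/stable

# Install the downloaded .snap directly; a sideloaded snap has no store
# association, so snapd will not auto-refresh it
sudo snap install ./microk8s_*.snap --classic --dangerous
```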

Unrelated, but I'm not sure why the build failed, since it's just a diff to the README.md file 🤷

@ktsakalozos (Member) commented Oct 21, 2020

@lazzarello, the easy part of your question first: the tests failed due to flakiness.

The hard part now. It is not easy to accept this PR. It is true that with what you recommend you will not get updates, and in setups where you cannot tolerate restarts this is a way to stop snap refreshes. However, we cannot recommend that users not apply patches; think, for example, of security patches. Also consider the cost of supporting users on every possible snap revision. This is why snaps let you schedule when updates reach you but not disable them completely [1].

Feel free to mention your approach in the respective discourse topic as a workaround to this limitation.

[1] https://snapcraft.io/docs/keeping-snaps-up-to-date
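The scheduling described in [1] is controlled through system-wide snapd options. A minimal sketch; the window and hold period are illustrative:

```bash
# Only allow refreshes during a weekly maintenance window
sudo snap set system refresh.timer=sat,01:00-03:00

# Or postpone refreshes until a later date (snapd caps how long a hold can last)
sudo snap set system refresh.hold="$(date -d '+30 days' --iso-8601=seconds)"

# Inspect the current schedule and when the next refresh is due
snap refresh --time
```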

@lazzarello (Contributor, Author)

we cannot recommend that users not apply patches.

No one is recommending that users not apply patches. By disabling auto-update, users control when patches are applied and which ones to apply. Control shifts from the upstream authors to the users. Trusting the user might be a change in philosophy, so I understand the controversy.
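Under that model, applying a patch becomes an explicit operator action rather than a background event. A minimal sketch, with the channel again illustrative:

```bash
# See which updates are pending without applying them
snap refresh --list

# Apply the update during a planned maintenance window
sudo snap refresh microk8s --channel=1.19/stable
```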

I will continue to advocate for this philosophy because microk8s has been a good platform for me and my colleagues, though the dependence on snapcraft is the weakest link. It has broken safety-critical systems during their operation. I've personally had to report to an external regulatory agency after an unattended update broke a high-pressure control system during operation.

I'll make some edits to the text in the README about a distinction between Getting Started, Production, and a third, yet-to-be-named type of deployment. Maybe Mission Critical?

@lazzarello (Contributor, Author)

@ktsakalozos How about this latest diff?

@Bessonov

I think the critical part is what users expect. Auto-update of snaps is very confusing, but from my point of view there are enough red flags to say "select a channel and you get all updates from that channel."

Another way to see it is to ask what the goal of microk8s is. One of those goals is zero ops. True zero ops means, for me, that I install something and just forget about it, like booking a SaaS offering that has SLAs and regularly updates to the latest stable version. That's not possible with Kubernetes, but pinning a channel seems like a very good compromise for production use cases.

But there are enough use cases for pinning an exact version of microk8s, so the information from this PR is very important.

From my point of view, two things are needed. The first is official documentation on how to set up microk8s for production use in a way compatible with the project's goals, which to me is the channel-pinned install. The second is documentation covering other use cases such as exact version pinning (and other topics like production storage). For version pinning I would suggest not the README but the Channels, releases and upgrades page.
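The channel-pinned install mentioned above is just the standard install with an explicit track. A minimal sketch; the track is illustrative:

```bash
# Track a specific release series; snapd will deliver 1.19.x patch releases
# from this channel but will not move the node to 1.20 on its own
sudo snap install microk8s --classic --channel=1.19/stable

# Confirm which channel the node is tracking
snap info microk8s | grep tracking
```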

@lazzarello (Contributor, Author)

Another way to see it is to ask what the goal of microk8s is. One of those goals is zero ops. True zero ops means, for me, that I install something and just forget about it.

I agree that Zero Ops is a goal we can all rally behind. I'm rallying for it right now. I'll give an example of why I'm motivated to put this information in an obvious place in this PR.

Back in The Before Times, when we could all be closer to each other, I was responsible for maintaining infrastructure for a group of people using a custom platform built on top of microk8s to load high-pressure cryogenic gas. I installed microk8s and tracked a stable channel, 1.15/stable, if I remember correctly. This application provided a UI to people at workstations in a control room, like the ones you see in the movies but...like...purchased from Best Buy. These people could see real-time metrics of things like PSI, ullage, etc. The application was required to keep running to maintain control over these values. If it stopped, we no longer knew the current PSI, which could lead to physical damage to materials. Given the Zero Ops value of microk8s, I expected to "set it and forget it" so I wouldn't have to worry about this application restarting without my attention. The principle of least surprise, as some call it.

But one day, there was a surprise. A considerable amount of high-pressure liquid nitrogen was loaded and the operators could no longer view or control the system. We scrambled to troubleshoot the issue. Network connectivity to the servers running the control room application was established, but all our data and UI screens had been down for about 5 minutes so far. A high-pressure tank was sitting in a fenced-in yard while no person could get near it. About 10 minutes in, we saw logs indicating that microk8s had stopped because an auto-update checkpoint had been detected to bump the stable channel to a minor point release. This required shutting down microk8s, saving the old snap, and refreshing to the newer version. About 15 minutes after the outage began, microk8s had started again. Fortunately, everything was safely shut down and the work for the day was cut short, at considerable expense.

Later that day I had to file an incident report with a federal regulatory agency in my home state. microk8s was mentioned in the report.

I'd like the language in the diff to the README in this PR to address production (mission-critical?) scenarios like this one. Currently, my solution is to disable auto-update by pinning a version of the snap. I update when I know it's safe to do so. This information should be front-and-center in the README.md file and included in deeper documentation. I'll give it another go.

@Bessonov

I think you're trying to address a different issue. If it was a single node, then:

The principle of least surprise, as some call it.

isn't a surprise, given the documentation. If it was an HA setup, then the issue is an uncoordinated update. They are two different beasts. Which of them led to the situation?

Does anyone know if the second issue still persists? I assume the snap package is updated without coordinating with microk8s, and therefore an HA setup is impossible without exact version pinning, right? Well, then I agree with @lazzarello.

@lazzarello (Contributor, Author)

I've updated the language to provide a list of options for production and mission-critical deployments. I think it's a good compromise not to recommend version pinning but to suggest it for a niche use case. I've tried to rank the options from simplest to most complicated.

  1. Install the snap from a stable channel with an explicit version
  2. Same as above, plus a custom update schedule targeting non-critical times
  3. HA with >= 3 nodes, each node with a staggered update schedule so that no two nodes are down at the same time (see the sketch below)
  4. Explicit version pinning to disable auto-updates
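A minimal sketch of option 3, assuming a three-node cluster; the windows are illustrative and are set per node so that refreshes never overlap:

```bash
# node-1: refresh only in the first window
sudo snap set system refresh.timer=sat,01:00-02:00
# node-2: refresh only after node-1's window has closed
sudo snap set system refresh.timer=sat,03:00-04:00
# node-3: refresh last
sudo snap set system refresh.timer=sat,05:00-06:00
```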

What do y'all think?

@ktsakalozos (Member)

@lazzarello thank you for this work. I appreciate your focus and persistence in improving MicroK8s and tackling this real life problem.

It is almost impossible to accept a PR suggesting doing a snap download and snap install with --dangerous.

Production deployments where control over refreshes is needed should consider the snap store proxy [1, 2]. The snap store proxy allows for caching, snap revision overriding, and air-gapped deployments.

The LXD snap does a good job of enumerating all the available options [3], including the snap store proxy.

The automatic refresh behavior has been discussed many times at the snapcraft forum [4].

[1] https://snapcraft.io/snap-store-proxy
[2] https://docs.ubuntu.com/snap-store-proxy/en/
[3] https://discuss.linuxcontainers.org/t/managing-the-lxd-snap/8178
[4] https://forum.snapcraft.io/
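For anyone evaluating the proxy route, pointing a machine at a snap store proxy looks roughly like this. This is a minimal sketch based on the proxy documentation [2]; the hostname and store ID are illustrative, and revision overrides are managed on the proxy itself:

```bash
# On each MicroK8s node: trust the proxy's store assertion and switch to it
curl -sL http://snap-proxy.internal/v2/auth/store/assertions | sudo snap ack /dev/stdin
sudo snap set core proxy.store=<store-id>

# Refreshes now come through the proxy, where an administrator can pin which
# microk8s revision the devices are allowed to see
```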

@lazzarello (Contributor, Author)

It is almost impossible to accept a PR suggesting doing a snap download and snap install with --dangerous.

Wow, it's been a crazy month! I'm revisiting this PR because I want to say thank you to all the microk8s devs and also give a video demonstration of why I'm pushing for a mission-critical category in the documentation. The following video is of a rocket launch I've been working on with a team at Astra. As of two days ago, microk8s has helped get a vehicle into space.

https://twitter.com/Astra/status/1338999451893915649?s=20

If this project had front-and-center documentation about disabling auto-updates, the road to this launch would have been much smoother. I hope this example is compelling enough to justify a few sentences of text covering other mission-critical applications, for future projects that are required to control their downtime.

@Bessonov

Coordinated auto upgrade for k3s: https://rancher.com/docs/k3s/latest/en/upgrades/automated/

From my point of view, this seems like a better approach than relying on snap to get it right more by accident than by strategy.

stale bot commented Nov 23, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the inactive label Nov 23, 2021
@lazzarello (Contributor, Author)

Thanks to whoever within Canonical documented how to do an offline install of microk8s! Closing this PR as a duplicate!

lazzarello closed this Dec 16, 2021