
experiment with NixOS #3221

Open
billimek opened this issue Jul 17, 2023 · 6 comments
Labels
exploration Something to explore

Comments

@billimek
Owner

Background

Related, in theme, to #2865

After adopting Nix & NixOS for other uses, I thought it would be fun to try using NixOS as the basis for running k3s nodes in this cluster.

Motivation

Ubuntu

There's nothing 'wrong' with running k3s on headless Ubuntu Server nodes:

  • Ubuntu self-updates with security updates
  • Most of the prerequisites for running k3s are already present in Ubuntu
  • Ubuntu is so widely used that common problems and solutions are well documented

NixOS

However, the desire to try new things, the benefits of a single git-based declarative configuration, and a consistent shell experience make it appealing to try NixOS for running kubernetes nodes.

Approach

Will document the process of using NixOS for a k3s node in this issue, along with blockers, solutions, and an eventual conclusion.

@billimek billimek added the exploration Something to explore label Jul 17, 2023
@billimek
Owner Author

billimek commented Jul 17, 2023

Leveraging an intel N100 'T9 Plus mini PC' from China as the node.

Installing NixOS via USB-based installer. Most of this is documented here, and I won't repeat it in this write-up.

Using Nix to install and configure k3s via this configuration.
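The linked configuration isn't reproduced here, but a minimal NixOS k3s server module looks roughly like this sketch (option names are from the nixpkgs `services.k3s` module; the token path and the `--disable traefik` flag are hypothetical examples, not necessarily what the linked config uses):

```nix
# Sketch of a single k3s server node on NixOS.
# Assumption: /run/secrets/k3s-token is a hypothetical secret path.
{ config, pkgs, ... }:
{
  services.k3s = {
    enable = true;
    role = "server";                      # or "agent" for worker nodes
    tokenFile = "/run/secrets/k3s-token"; # shared cluster join token
    extraFlags = "--disable traefik";     # example: skip the bundled ingress
  };

  # the kubernetes API port, if the firewall is left enabled
  networking.firewall.allowedTCPPorts = [ 6443 ];
}
```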

@billimek
Owner Author

billimek commented Jul 17, 2023

Issues encountered and resolutions:

| Issue | Resolution |
| --- | --- |
| Password problems | This is more of a NixOS issue, but when initially creating a user, the password was being 'unset', and it was only possible to access the node via SSH with the associated SSH keys; logging in at the physical terminal wasn't possible. The resolution was to make the user 'mutable', set a default password, and then set the real password (using `passwd`) at the time of initial bootstrapping (ref) |
| Swap device exists | `swapDevices = lib.mkForce [ ];` doesn't seem to work as expected (ref). Following additional documentation here by having the GPT partition not automount fixed the issue |
| VLAN configuration | Intended to have the host configure itself to use a VLAN without requiring switch configuration. Spent a lot of time trying different configurations, but was ultimately not successful in having the host set the VLAN and had to configure it on the switch instead. Will likely revisit this in a more controlled environment like a VM |
| ceph rbd issues | csi-cephfsplugin pods were crashlooping. Eventually determined the cause to be the missing `rbd` kernel module. Resolved by configuring `boot.kernelModules = [ "kvm-intel" "rbd" ];` (ref) |
| No logs | Oddly, couldn't view any logs from pods running on this node (i.e. `kubectl logs -f <some pod>`); no issues from pods running on other nodes. Also saw entries in `dmesg` output suggesting that a firewall was blocking traffic. Eventually disabled the firewall (via `networking.firewall.enable = false;`) and this issue was resolved (ref) |
| NFS issues | Workloads requiring NFS access (e.g. plex) complained about mounting an NFS volume. Investigation led me to understand that `rpcbind` is required. Resolved by setting `services.rpcbind.enable = true;` (ref) |
| system-upgrade-controller | system-upgrade-controller was failing to operate properly on the new node. Determined that when Nix controls the k3s installation (as is the case here), the k3s binary cannot be modified in place. Since system-upgrade-controller tries to do an in-place replacement of the k3s binary on the host, this fails. The resolution was to switch to the 'unstable' nix branch for k3s so that the k3s version will (currently) match the version that system-upgrade-controller manages on the other nodes (ref). Long-term, this probably needs to be handled consistently; two options are: 1. use NixOS for all nodes, or 2. install and manage k3s out-of-band from Nix |
| No automated OS updates | Ubuntu can do auto upgrades for security vulnerabilities, and it even flags the node for reboot when required via kured. NixOS can do this too via `system.autoUpgrade`, but I need to understand how to configure it properly and, if possible, support kured for safe reboots as well (ref). Resolved partly via this and this |
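Gathered into one place, the NixOS-side fixes from the table above amount to only a few lines of configuration. A sketch (option names are the standard NixOS options referenced in the table; this is illustrative, not the exact module used here):

```nix
# Sketch of the host-level fixes described above.
{ lib, ... }:
{
  boot.kernelModules = [ "kvm-intel" "rbd" ];  # rbd module for the ceph CSI pods
  networking.firewall.enable = false;          # firewall was blocking pod log streaming
  services.rpcbind.enable = true;              # required for NFS-backed workloads
  users.mutableUsers = true;                   # allows setting the real password via passwd
}
```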

@billimek
Owner Author

billimek commented Aug 1, 2023

This is going well so far. Pretty easily added two more nodes running NixOS.

However, the issue of k3s upgrades became apparent again when I discovered that system-upgrade-controller upgraded the rest of the cluster to v1.27.4+k3s1 while the NixOS nodes were still running v1.27.3+k3s1. I was able to run nixos-rebuild switch to get them upgraded, but it would be better if this was more automated.

Still don't have a good solution for automated nixos-rebuilds, especially with secrets involved. Still pondering.
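For reference, the NixOS `system.autoUpgrade` module can pull and apply a flake on a schedule; a sketch of one possible shape (the flake reference and schedule below are hypothetical placeholders):

```nix
# Sketch: periodic pull-based upgrades via system.autoUpgrade.
{
  system.autoUpgrade = {
    enable = true;
    flake = "github:example/home-ops#k3s-a";  # hypothetical flake reference
    dates = "04:00";                          # systemd calendar expression
    allowReboot = false;                      # let kured handle drains/reboots instead
  };
}
```

This doesn't solve the secrets problem on its own, since the node still needs access to whatever the flake's secret management expects at activation time.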

@billimek
Owner Author

billimek commented Nov 5, 2023

Updates are currently being handled from a base host by 'pushing' new config to the k3s nodes, for example:

```fish
for node in f g h; NIX_SSHOPTS="-A" nixos-rebuild switch --flake .#k3s-$node --target-host nix@k3s-$node --use-remote-sudo; end
```

This, in conjunction with a scheduled reboot checker and associated /usr/bin symlinks, ensures that kured will properly drain and reboot the nodes when required.

@szinn

szinn commented May 1, 2024

For the initial user password, `hashedPasswordFile` could be a possibility
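A sketch of what that suggestion looks like (the username and secret path below are hypothetical examples; the hash itself can be generated with `mkpasswd -m sha-512`):

```nix
# Sketch: set the initial password from a file containing a password hash,
# avoiding the 'unset password' problem without making users mutable.
{
  users.users.billimek = {
    isNormalUser = true;
    # path to a file containing a hashed password (e.g. from mkpasswd -m sha-512);
    # /run/secrets is a hypothetical secrets location
    hashedPasswordFile = "/run/secrets/billimek-password";
  };
}
```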

@superherointj

superherointj commented May 8, 2024

> However, the issue of k3s upgrades became apparent again when I discovered that system-upgrade-controller upgraded the rest of the cluster to v1.27.4+k3s1 while the NixOS nodes were still running v1.27.3+k3s1. I was able to run nixos-rebuild switch to get them upgraded, but it would be better if this was more automated.
>
> Still don't have a good solution for automated nixos-rebuilds, especially with secrets involved. Still pondering.

I'm the NixOS k3s maintainer. I use ansible to trigger syncing of all hosts and for whatever random upkeep; it can trigger nixos-rebuild and do anything else.
There are 2 unpublished abstractions that I use that are helpful but aren't in nixpkgs. I need to polish them before I can publish.
My experience of using NixOS for k3s is quite nice: cluster provisioning, destruction, reset, and upkeep (such as updates) are all automated.
