host: add cgroups v2 support and fix Debian 13 compatibility #3
Merged
Add full cgroups v2 support to enable Flynn to run on modern Linux distributions (Debian 13+) where cgroups v1 is compiled out entirely.

Changes:
- host: dual v1/v2 cgroup setup with cpu.shares-to-cpu.weight conversion, unified hierarchy controller enablement, and per-container CpuWeight
- host: cgroups v2 OOM notification via inotify on memory.events
- host: guard CheckCpushares to only run on cgroups v1
- host: FIEMAP fallback to sequential copy when tmpfs doesn't support it
- postgres: disable TimescaleDB/ExtWhitelist (unavailable in packages layer)
- postgres: pre-install uuid-ossp and pgcrypto extensions in template1 so non-superuser app database users can use them without pgextwlist
- dns: fix off-by-one panic in clientconfig.go (len>=8 but slice [:9])

These changes, combined with the rebuilt TUF images (3-layer postgres with PostgreSQL 11 packages, controller with JSON schemas), enable successful single-node Flynn cluster bootstrap on Debian 13 (Trixie) with cgroups v2 and ZFS 2.3.
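To illustrate the memory.events item above: on cgroups v2 the file is a list of `key value` lines, and the `oom_kill` counter increases when a process in the cgroup is killed by the OOM killer. The following is only a minimal sketch of reading that counter after a change notification. The function name `oomKillCount` and the sample text are hypothetical, and the inotify watch itself is not shown.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// oomKillCount extracts the oom_kill counter from the text of a cgroups v2
// memory.events file. It returns 0 if the key is absent.
// Hypothetical helper for illustration, not taken from the Flynn source.
func oomKillCount(events string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(events))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, sc.Err()
}

func main() {
	// Sample content in the memory.events layout (values are made up).
	sample := "low 0\nhigh 0\nmax 3\noom 1\noom_kill 1\n"
	n, err := oomKillCount(sample)
	fmt.Println(n, err)
}
```

A watcher would compare the counter before and after each file-modified event and report an OOM only when it increased.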
…MACs for multi-node

Fix two bugs blocking 3-node cluster bootstrap:
1. PostgreSQL primary crash-loops when sync replica exists because installExtensionsInTemplate() runs CREATE EXTENSION against template1 without overriding default_transaction_read_only=on (set in postgresql.conf when downstream != nil). Add SET default_transaction_read_only=off before extension DDL.
2. Flannel VXLAN overlay is broken on cloned VMs because all nodes get identical flannel.1 MAC addresses (kernel derives MAC deterministically from VNI + machine state). Add netlink.LinkSetHardwareAddr() after device creation to set a unique MAC derived from the VTEP IP (02:42:IP[0]:IP[1]:IP[2]:IP[3]).

Also increase bootstrap wait timeouts from 5 to 10 minutes and add configurable timeout field to WaitAction for multi-node clusters where service startup takes longer.
The router registered services (router-api, router-http) with discoverd using LISTEN_IP (typically 0.0.0.0), which is not a routable address. Other services (status aggregator, scheduler) could not reach the router, causing the cluster to report unhealthy and the scheduler to loop with "route not found" errors. Now uses EXTERNAL_IP for registration while keeping LISTEN_IP for binding.
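The split between the bind address and the registered address could look roughly like the sketch below. It assumes the two values arrive as plain strings (for example from the LISTEN_IP and EXTERNAL_IP environment variables). The helper name `advertiseAddr` is hypothetical and is not the router's actual code.

```go
package main

import (
	"fmt"
	"net"
)

// advertiseAddr returns the host:port to register with service discovery.
// It rejects unspecified addresses such as 0.0.0.0, which other services
// cannot route to. Hypothetical helper for illustration only.
func advertiseAddr(externalIP, port string) (string, error) {
	ip := net.ParseIP(externalIP)
	if ip == nil {
		return "", fmt.Errorf("invalid external IP %q", externalIP)
	}
	if ip.IsUnspecified() {
		return "", fmt.Errorf("external IP %q is not routable", externalIP)
	}
	return net.JoinHostPort(externalIP, port), nil
}

func main() {
	listenIP, externalIP, port := "0.0.0.0", "10.0.0.5", "80"

	// Bind on the listen address, which may be a wildcard.
	bind := net.JoinHostPort(listenIP, port)

	// Register the routable address with discovery.
	adv, err := advertiseAddr(externalIP, port)
	fmt.Println(bind, adv, err)
}
```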
Replace NeighAdd with NeighSet for FDB entries to avoid 'file exists' errors when entries already exist (idempotent upsert vs exclusive create).

Derive a unique MAC address for each node's flannel.1 device from its VTEP IP (02:42:xx:xx:xx:xx) instead of using the default MAC from the base image. Without this, nodes cloned from the same image share identical MACs, which breaks VXLAN forwarding.

Only set the MAC when it differs from the current one to avoid flushing ARP neighbor entries. Add retry logic that brings the link down/up if setting the MAC on a running interface fails.
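The MAC derivation described here can be sketched with the standard library alone. The function names `macFromVTEP` and `needsUpdate` are hypothetical. Applying the address to the device (via netlink.LinkSetHardwareAddr(), as the commit message states) and the down/up retry are left out.

```go
package main

import (
	"bytes"
	"fmt"
	"net"
)

// macFromVTEP builds the MAC 02:42:IP[0]:IP[1]:IP[2]:IP[3] from an IPv4
// VTEP address. The 0x02 first octet marks it as a locally administered
// unicast address. Hypothetical helper for illustration only.
func macFromVTEP(vtep string) (net.HardwareAddr, error) {
	ip := net.ParseIP(vtep).To4()
	if ip == nil {
		return nil, fmt.Errorf("not an IPv4 address: %q", vtep)
	}
	return net.HardwareAddr{0x02, 0x42, ip[0], ip[1], ip[2], ip[3]}, nil
}

// needsUpdate reports whether the device's current MAC differs from the
// desired one, so an unchanged MAC is never rewritten.
func needsUpdate(current, want net.HardwareAddr) bool {
	return !bytes.Equal(current, want)
}

func main() {
	want, err := macFromVTEP("10.0.0.1")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(want) // 02:42:0a:00:00:01

	current := net.HardwareAddr{0x02, 0x42, 0x0a, 0x00, 0x00, 0x01}
	fmt.Println(needsUpdate(current, want)) // false
}
```

Because each VTEP IP is unique within the overlay, the derived MACs are unique as well, even on nodes cloned from one image.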
The primary starts PostgreSQL in read-only mode when a downstream (sync standby) exists. But assumePrimary needs read-write access to create the superuser and install extensions in the freshly-initialized database. The session-level SET default_transaction_read_only=off was insufficient: CREATE EXTENSION still failed with 'cannot execute in a read-only transaction', causing assumePrimary to fail, which called p.stop(), killing postgres and all replication connections, creating an infinite loop.

Fix: Start read-write during initial setup (the database was just created with initdb, there is no user data to protect). After setup completes, switch to read-only mode and SIGHUP postgres before calling waitForSync. Remove the now-unnecessary SET TRANSACTION READ WRITE hack.
On cgroups v2, each container's OOM notification uses an inotify instance. With many containers (89+ from bootstrap), the default max_user_instances=128 is exhausted, causing NotifyOOM to fail. The watch() goroutine previously returned this error, which triggered Destroy() and killed the container within ~1 second with no user-visible error message.

Make the OOM notification failure non-fatal: log a warning and continue watching for state changes. The container still functions correctly without OOM monitoring.

Also fix DNS resolver detection for systemd-resolved environments. On Debian 13, /etc/resolv.conf points to the stub resolver at 127.0.0.53, which is unreachable from containers in separate network namespaces. Fall back to /run/systemd/resolve/resolv.conf, which contains the real upstream resolver IPs.
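The resolver fallback can be pictured as follows. This is a sketch under assumptions: it works on the text of the two files rather than reading them from disk, it treats "all nameservers are loopback" as the trigger for the fallback, and the function names are hypothetical. The actual detection logic in the commit may differ.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// nameservers returns the addresses on "nameserver" lines of resolv.conf text.
func nameservers(conf string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(conf))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) >= 2 && f[0] == "nameserver" {
			out = append(out, f[1])
		}
	}
	return out
}

// onlyLoopback reports whether the list is non-empty and every entry is a
// loopback address such as the systemd-resolved stub at 127.0.0.53.
func onlyLoopback(servers []string) bool {
	for _, s := range servers {
		ip := net.ParseIP(s)
		if ip == nil || !ip.IsLoopback() {
			return false
		}
	}
	return len(servers) > 0
}

// pickResolvers prefers the primary file's servers, but switches to the
// fallback file when the primary lists only loopback resolvers.
func pickResolvers(primary, fallback string) []string {
	servers := nameservers(primary)
	if onlyLoopback(servers) {
		if alt := nameservers(fallback); len(alt) > 0 {
			return alt
		}
	}
	return servers
}

func main() {
	stub := "# stub resolver\nnameserver 127.0.0.53\noptions edns0\n"
	upstream := "nameserver 192.0.2.1\nnameserver 192.0.2.2\n"
	fmt.Println(pickResolvers(stub, upstream)) // [192.0.2.1 192.0.2.2]
}
```

In this sketch, `primary` would hold the contents of /etc/resolv.conf and `fallback` the contents of /run/systemd/resolve/resolv.conf.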
Update resource limit tests to work on both cgroups v1 and v2:
- resourceCmd: auto-detect cgroup version inside containers. On v2, read memory.max and cpu.weight from the container's cgroup path (discovered via /proc/1/cgroup) instead of v1's fixed paths.
- Add cpuSharesToWeight() helper matching the kernel's conversion formula: weight = 1 + ((shares - 2) * 9999) / 262142.
- Add isCgroupV2() detection based on /sys/fs/cgroup/cgroup.controllers.
- Set DisableLog: true on test jobs that capture output via attach streams. This avoids a race condition in the log mux where short-lived jobs complete before StreamLog sets up its subscription, causing the attach client to block forever.
- Make setupGitreceive() conditional on the -run filter matching git-related tests, so non-git tests don't block on broken deployments.
- Update slugbuilder-limit test app to read v2 cgroup files.

All 4 resource limit tests pass:
- CLISuite.TestRunLimits
- HostSuite.TestResourceLimits
- ControllerSuite.TestResourceLimitsOneOffJob
- ControllerSuite.TestResourceLimitsReleaseJob
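The two helpers named above might look like the sketch below. It uses the quoted formula with integer division. The signatures, and the clamping of shares to the range [2, 262144], are assumptions made for this example and may not match the helpers in the test suite.

```go
package main

import (
	"fmt"
	"os"
)

// cpuSharesToWeight maps a cgroups v1 cpu.shares value onto the v2
// cpu.weight range using weight = 1 + ((shares - 2) * 9999) / 262142.
// Assumption: out-of-range inputs are clamped to [2, 262144] so that the
// unsigned subtraction cannot underflow.
func cpuSharesToWeight(shares uint64) uint64 {
	if shares < 2 {
		shares = 2
	}
	if shares > 262144 {
		shares = 262144
	}
	return 1 + ((shares-2)*9999)/262142
}

// isCgroupV2 reports whether the unified hierarchy is mounted, based on
// the presence of /sys/fs/cgroup/cgroup.controllers.
func isCgroupV2() bool {
	_, err := os.Stat("/sys/fs/cgroup/cgroup.controllers")
	return err == nil
}

func main() {
	fmt.Println(cpuSharesToWeight(2))      // 1
	fmt.Println(cpuSharesToWeight(1024))   // 39
	fmt.Println(cpuSharesToWeight(262144)) // 10000
	fmt.Println("cgroups v2:", isCgroupV2())
}
```

With this formula the v1 default of 1024 shares becomes a weight of 39, not the v2 default of 100, so tests need to compare against the converted value.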
This project is actively maintained again — remove the warning that was added when Flynn was abandoned.