host: add cgroups v2 support and fix Debian 13 compatibility #3

Merged
phaus merged 9 commits into master from debian13-cgroups-v2-bootstrap
Apr 15, 2026

phaus commented Apr 13, 2026

Add full cgroups v2 support to enable Flynn to run on modern Linux distributions (Debian 13+) where cgroups v1 is compiled out entirely.

Changes:

  • host: dual v1/v2 cgroup setup with cpu.shares-to-cpu.weight conversion, unified hierarchy controller enablement, and per-container CpuWeight
  • host: cgroups v2 OOM notification via inotify on memory.events
  • host: guard CheckCpushares to only run on cgroups v1
  • host: FIEMAP fallback to sequential copy when tmpfs doesn't support it
  • postgres: disable TimescaleDB/ExtWhitelist (unavailable in packages layer)
  • postgres: pre-install uuid-ossp and pgcrypto extensions in template1 so non-superuser app database users can use them without pgextwlist
  • dns: fix off-by-one panic in clientconfig.go (len>=8 but slice [:9])

These changes, combined with the rebuilt TUF images (3-layer postgres with PostgreSQL 11 packages, controller with JSON schemas), enable successful single-node Flynn cluster bootstrap on Debian 13 (Trixie) with cgroups v2 and ZFS 2.3.
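
To illustrate the class of bug behind the dns fix listed above (a length check of `len>=8` guarding a `[:9]` slice), here is a minimal sketch. The `parseAttempts` function and the `attempts:` option name are assumptions chosen for illustration; the actual code in clientconfig.go is not shown in this PR text.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseAttempts parses a resolv.conf-style "attempts:N" option.
// Hypothetical helper: the name and option are illustrative assumptions.
//
// The buggy pattern looks like:
//
//	len(s) >= 8 && s[:9] == "attempts:"
//
// which panics for any 8-byte input because s[:9] is out of range.
// strings.HasPrefix performs the length check itself, so it cannot panic.
func parseAttempts(s string) (int, bool) {
	const prefix = "attempts:" // 9 bytes
	if !strings.HasPrefix(s, prefix) {
		return 0, false
	}
	n, err := strconv.Atoi(s[len(prefix):])
	if err != nil || n < 1 {
		return 0, false
	}
	return n, true
}

func main() {
	// "attempt:" is exactly 8 bytes: the input that would trip the old check.
	for _, opt := range []string{"attempt:", "attempts:3", "ndots:1"} {
		n, ok := parseAttempts(opt)
		fmt.Printf("%q -> %d %v\n", opt, n, ok)
	}
}
```

Matching on the prefix constant keeps the guard and the slice bound tied to the same length, so they cannot drift apart again.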

phaus and others added 9 commits April 13, 2026 23:15
…MACs for multi-node

Fix two bugs blocking 3-node cluster bootstrap:

1. PostgreSQL primary crash-loops when sync replica exists because
   installExtensionsInTemplate() runs CREATE EXTENSION against template1
   without overriding default_transaction_read_only=on (set in
   postgresql.conf when downstream != nil). Add SET default_transaction_read_only=off
   before extension DDL.

2. Flannel VXLAN overlay is broken on cloned VMs because all nodes get
   identical flannel.1 MAC addresses (kernel derives MAC deterministically
   from VNI + machine state). Add netlink.LinkSetHardwareAddr() after
   device creation to set a unique MAC derived from the VTEP IP
   (02:42:IP[0]:IP[1]:IP[2]:IP[3]).

Also increase bootstrap wait timeouts from 5 to 10 minutes and add a
configurable timeout field to WaitAction for multi-node clusters, where
service startup takes longer.

The router registered its services (router-api, router-http) with discoverd
using LISTEN_IP (typically 0.0.0.0), which is not a routable address.
Other services (status aggregator, scheduler) could not reach the router,
causing the cluster to report unhealthy and the scheduler to loop with
"route not found" errors. The router now registers with EXTERNAL_IP while
still binding to LISTEN_IP.
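
A minimal sketch of the bind/advertise split described above. The `serviceAddrs` helper, the port, and the fallback values in `main` are assumptions for illustration; only the LISTEN_IP and EXTERNAL_IP variable names come from the commit message.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// serviceAddrs returns the address to bind the listener to and the address
// to advertise in service discovery. Hypothetical helper: it refuses to
// advertise an unspecified address such as 0.0.0.0, since peers cannot
// connect to it.
func serviceAddrs(listenIP, externalIP, port string) (bind, advertise string, err error) {
	ip := net.ParseIP(externalIP)
	if ip == nil || ip.IsUnspecified() {
		return "", "", fmt.Errorf("EXTERNAL_IP %q cannot be used for registration", externalIP)
	}
	return net.JoinHostPort(listenIP, port), net.JoinHostPort(externalIP, port), nil
}

func main() {
	listenIP := os.Getenv("LISTEN_IP")
	if listenIP == "" {
		listenIP = "0.0.0.0" // bind on all interfaces by default
	}
	externalIP := os.Getenv("EXTERNAL_IP")
	if externalIP == "" {
		externalIP = "192.0.2.10" // documentation-range placeholder, not a real host
	}
	bind, advertise, err := serviceAddrs(listenIP, externalIP, "80")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("bind:", bind)
	fmt.Println("advertise:", advertise)
}
```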

Replace NeighAdd with NeighSet for FDB entries to avoid 'file exists'
errors when entries already exist (idempotent upsert vs exclusive create).

Derive a unique MAC address for each node's flannel.1 device from its
VTEP IP (02:42:xx:xx:xx:xx) instead of using the default MAC from the
base image. Without this, nodes cloned from the same image share
identical MACs which breaks VXLAN forwarding.

Only set the MAC when it differs from the current one to avoid flushing
ARP neighbor entries. Add retry logic that brings the link down/up if
setting the MAC on a running interface fails.
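
The MAC derivation and the "only set when different" check can be sketched with the standard library alone. The function names here are hypothetical, and the netlink call that applies the address to the device is omitted.

```go
package main

import (
	"bytes"
	"fmt"
	"net"
)

// macFromVTEP builds the MAC 02:42:a:b:c:d from the IPv4 VTEP address
// a.b.c.d. The leading 0x02 marks the address as locally administered
// and unicast.
func macFromVTEP(ip net.IP) (net.HardwareAddr, error) {
	ip4 := ip.To4()
	if ip4 == nil {
		return nil, fmt.Errorf("VTEP address %v is not IPv4", ip)
	}
	return net.HardwareAddr{0x02, 0x42, ip4[0], ip4[1], ip4[2], ip4[3]}, nil
}

// needsUpdate reports whether the device MAC has to change. Skipping the
// update when the MAC already matches avoids flushing neighbor entries.
func needsUpdate(current, want net.HardwareAddr) bool {
	return !bytes.Equal(current, want)
}

func main() {
	want, err := macFromVTEP(net.ParseIP("10.0.1.5"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(want) // 02:42:0a:00:01:05

	// Simulated MAC inherited from a cloned base image.
	current := net.HardwareAddr{0x52, 0x54, 0x00, 0x12, 0x34, 0x56}
	fmt.Println("update needed:", needsUpdate(current, want))
}
```

Because the MAC is a pure function of the VTEP IP, two nodes collide only if they share a VTEP IP, which would already be a configuration error.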

The primary starts PostgreSQL in read-only mode when a downstream
(sync standby) exists, but assumePrimary needs read-write access to
create the superuser and install extensions in the freshly initialized
database. The session-level SET default_transaction_read_only=off was
insufficient: CREATE EXTENSION still failed with 'cannot execute in a
read-only transaction'. That made assumePrimary fail and call p.stop(),
which killed postgres and all replication connections and left the
primary in an infinite crash loop.

Fix: Start read-write during initial setup (the database was just
created with initdb, there is no user data to protect). After setup
completes, switch to read-only mode and SIGHUP postgres before calling
waitForSync. Remove the now-unnecessary SET TRANSACTION READ WRITE hack.

On cgroups v2, each container's OOM notification uses its own inotify
instance. With many containers (89+ from bootstrap), the default
fs.inotify.max_user_instances limit of 128 is exhausted and NotifyOOM
fails. The watch() goroutine previously returned this error, which
triggered Destroy() and killed the container within ~1 second with no
user-visible error message.

Make the OOM notification failure non-fatal: log a warning and continue
watching for state changes. The container still functions correctly
without OOM monitoring.
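
A rough sketch of the two pieces involved: reading the `oom_kill` counter from a cgroups v2 memory.events file (the file the inotify watch re-reads on change), and treating a failed watch setup as a warning. `oomKillCount` and `startOOMWatch` are hypothetical names, and the setup error in `main` is simulated rather than taken from Flynn.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"strconv"
	"strings"
)

// oomKillCount returns the oom_kill counter from the contents of a
// memory.events file, which holds one "key value" pair per line.
func oomKillCount(memoryEvents string) (uint64, error) {
	for _, line := range strings.Split(memoryEvents, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, errors.New("oom_kill entry not found")
}

// startOOMWatch shows the non-fatal pattern: if the watch cannot be set up
// (for example because inotify instances are exhausted), log a warning and
// report false so the caller keeps the container running.
func startOOMWatch(setup func() error) bool {
	if err := setup(); err != nil {
		log.Printf("warning: OOM notification unavailable, continuing without it: %v", err)
		return false
	}
	return true
}

func main() {
	events := "low 0\nhigh 0\nmax 3\noom 1\noom_kill 1\n"
	n, err := oomKillCount(events)
	fmt.Println(n, err) // 1 <nil>

	// Simulated failure standing in for an exhausted inotify instance limit.
	enabled := startOOMWatch(func() error { return errors.New("too many open files") })
	fmt.Println("OOM monitoring enabled:", enabled)
}
```

Raising fs.inotify.max_user_instances on the host would avoid the failure in the first place; the non-fatal path covers hosts where that has not been done.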

Also fix DNS resolver detection for systemd-resolved environments. On
Debian 13, /etc/resolv.conf points to the stub resolver at 127.0.0.53
which is unreachable from containers in separate network namespaces.
Fall back to /run/systemd/resolve/resolv.conf which contains the real
upstream resolver IPs.
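
The fallback can be sketched as "use the first candidate file that lists a non-loopback nameserver". Whether the actual change tests for loopback addresses or specifically for 127.0.0.53 is an assumption here, and all function names are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strings"
)

// nameservers returns the IPs found on "nameserver" lines of resolv.conf
// content.
func nameservers(conf string) []net.IP {
	var ips []net.IP
	for _, line := range strings.Split(conf, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == "nameserver" {
			if ip := net.ParseIP(fields[1]); ip != nil {
				ips = append(ips, ip)
			}
		}
	}
	return ips
}

// hasReachableNameserver reports whether any nameserver is non-loopback.
// Loopback resolvers such as 127.0.0.53 are unreachable from a container
// in its own network namespace.
func hasReachableNameserver(ips []net.IP) bool {
	for _, ip := range ips {
		if !ip.IsLoopback() {
			return true
		}
	}
	return false
}

// resolverConfig returns the first candidate path whose content lists a
// reachable nameserver. The read function is injected so the logic can be
// exercised without touching the filesystem.
func resolverConfig(read func(path string) (string, error), candidates ...string) (string, bool) {
	for _, path := range candidates {
		conf, err := read(path)
		if err != nil {
			continue
		}
		if hasReachableNameserver(nameservers(conf)) {
			return path, true
		}
	}
	return "", false
}

func main() {
	read := func(path string) (string, error) {
		b, err := os.ReadFile(path)
		return string(b), err
	}
	path, ok := resolverConfig(read, "/etc/resolv.conf", "/run/systemd/resolve/resolv.conf")
	fmt.Println(path, ok)
}
```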

Update resource limit tests to work on both cgroups v1 and v2:

- resourceCmd: auto-detect cgroup version inside containers. On v2,
  read memory.max and cpu.weight from the container's cgroup path
  (discovered via /proc/1/cgroup) instead of v1's fixed paths.

- Add cpuSharesToWeight() helper matching the kernel's conversion
  formula: weight = 1 + ((shares - 2) * 9999) / 262142.

- Add isCgroupV2() detection based on /sys/fs/cgroup/cgroup.controllers.

- Set DisableLog: true on test jobs that capture output via attach
  streams. This avoids a race condition in the log mux where short-lived
  jobs complete before StreamLog sets up its subscription, causing the
  attach client to block forever.

- Make setupGitreceive() conditional on the -run filter matching
  git-related tests, so non-git tests don't block on broken deployments.

- Update slugbuilder-limit test app to read v2 cgroup files.

All 4 resource limit tests pass:
  CLISuite.TestRunLimits
  HostSuite.TestResourceLimits
  ControllerSuite.TestResourceLimitsOneOffJob
  ControllerSuite.TestResourceLimitsReleaseJob
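
For reference, a sketch of the helpers named in this commit. The conversion formula and the detection path are taken from the commit message; the clamping of out-of-range inputs and the `cgroupV2Path` parser are assumptions added for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// cpuSharesToWeight maps a cgroups v1 cpu.shares value (2..262144) onto the
// cgroups v2 cpu.weight range (1..10000) using the formula quoted above.
func cpuSharesToWeight(shares uint64) uint64 {
	// Assumption: clamp out-of-range input so the subtraction cannot
	// underflow and the result stays within 1..10000.
	if shares < 2 {
		shares = 2
	}
	if shares > 262144 {
		shares = 262144
	}
	return 1 + ((shares-2)*9999)/262142
}

// isCgroupV2 reports whether the unified hierarchy is mounted, based on the
// presence of cgroup.controllers at the cgroup root.
func isCgroupV2() bool {
	_, err := os.Stat("/sys/fs/cgroup/cgroup.controllers")
	return err == nil
}

// cgroupV2Path extracts the unified-hierarchy path from the contents of a
// /proc/<pid>/cgroup file, where the v2 entry has the form "0::/some/path".
func cgroupV2Path(procCgroup string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(procCgroup))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "0::") {
			return strings.TrimPrefix(line, "0::"), true
		}
	}
	return "", false
}

func main() {
	for _, shares := range []uint64{2, 1024, 262144} {
		fmt.Printf("cpu.shares=%d -> cpu.weight=%d\n", shares, cpuSharesToWeight(shares))
	}
	fmt.Println("cgroups v2:", isCgroupV2())
	if path, ok := cgroupV2Path("0::/example/job\n"); ok {
		fmt.Println("cgroup path:", path)
	}
}
```

With integer division, the v1 default of 1024 shares maps to a weight of 39, and the endpoints 2 and 262144 map to 1 and 10000.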

This project is actively maintained again. Remove the warning that was
added when Flynn was abandoned.

phaus merged commit 18cbedf into master Apr 15, 2026
2 checks passed