
fix(controller): default FSGroup to curl_group + Longhorn-backed e2e job (closes #418, closes #420) #419

Merged
Defilan merged 10 commits into defilantech:main from Defilan:fix/418-default-fsgroup-pvc-permission
May 10, 2026

Conversation


@Defilan Defilan commented May 10, 2026

Closes #418. Closes #420.

What this PR does

Two related changes shipped together so the regression class is fixed AND caught by CI going forward.

1. Fix: default FSGroup: 102 on inferPodSecurityContext

inferPodSecurityContext defaults did not set FSGroup. The default init container image is docker.io/curlimages/curl:8.18.0, which runs as uid=101 gid=102. On storage classes that provision PVCs as root:root 0755 (Longhorn ext4 confirmed by the reporter; OpenShift restricted-v2 SCC was the original motivating case for #239), the volume is non-writable for the non-root init container, and model-downloader fails with mkdir: can't create directory '/models/<hash>': Permission denied.

The in-source comment block on initContainerSecurityContext openly acknowledged the foot-gun ("Users MUST specify runAsUser/runAsGroup..."). Defaulting FSGroup to 102 removes that foot-gun: Kubernetes recursively chowns the volume to curl_group and adds that GID to every container's supplementary groups, which makes the volume writable for the curl init container and readable for the inference container regardless of its primary UID.
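The defaulting behavior can be sketched as follows. This is a minimal, self-contained stand-in, not the controller's actual code: the real inferPodSecurityContext lives in internal/controller/deployment_builder.go and operates on k8s.io/api/core/v1 types, which are simplified to two fields here.

```go
package main

import "fmt"

// PodSecurityContext is a hypothetical stand-in for corev1.PodSecurityContext,
// reduced to the two fields this change touches.
type PodSecurityContext struct {
	FSGroup        *int64
	SeccompProfile string
}

// inferPodSecurityContext sketch: a user-supplied context is returned
// verbatim (no merging); otherwise the default is seccomp RuntimeDefault
// plus FSGroup 102, the curl_group GID in docker.io/curlimages/curl:8.18.0.
func inferPodSecurityContext(user *PodSecurityContext) *PodSecurityContext {
	if user != nil {
		return user // explicit user intent respected as-is, even without FSGroup
	}
	fsGroup := int64(102)
	return &PodSecurityContext{
		FSGroup:        &fsGroup,
		SeccompProfile: "RuntimeDefault",
	}
}

func main() {
	// No user context: the new default applies.
	def := inferPodSecurityContext(nil)
	fmt.Println(*def.FSGroup, def.SeccompProfile) // prints "102 RuntimeDefault"

	// User context without FSGroup: left untouched, no merge.
	user := &PodSecurityContext{SeccompProfile: "RuntimeDefault"}
	fmt.Println(inferPodSecurityContext(user).FSGroup == nil) // prints "true"
}
```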

2. CI: new Longhorn-backed e2e job

The default kind e2e job runs against local-path-provisioner, which uses host-path mode 0777. That permissive mode masks fsGroup omissions, which is why #418 reached release without being caught at PR time.

Added a second e2e job that runs the same suite against a kind cluster with Longhorn installed and set as the default storage class. Longhorn provisions PVCs as root:root 0755 (matching the production storage classes contributors actually run), so any future fsGroup or PodSecurityContext regression that would hit a real-world contributor will now hit CI first.

The new job is continue-on-error: true initially because Longhorn-on-kind has known fragility around iSCSI tooling on GitHub runners. The default kind job remains the merge gate. Once the Longhorn job is stable across a few PRs it will be promoted to required.

Why now

Reported by @Dazag in #418 with a clean reproducer (K3s 1.31 + Longhorn + RTX 3060 + fresh PVC) and a working manual workaround (kubectl patch ... podSecurityContext.fsGroup). Their root-cause hypothesis on the non-root init image was correct; the only correction was on the UID (uid=101 curl_user, gid=102 curl_group, not 1000 as initially stated).

Verification

make test green locally:

ok  	github.com/defilantech/llmkube/internal/controller	9.649s	coverage: 85.7% of statements

New unit test cases in internal/controller/inferenceservice_deployment_test.go:

  1. should default FSGroup to 102 to match curlimages/curl curl_group — locks the new default plus the existing seccomp RuntimeDefault.
  2. should preserve user-supplied PodSecurityContext verbatim without merging the default FSGroup — locks the explicit-no-merge behavior so a user passing their own PodSecurityContext without FSGroup is respected as-is.

The existing user-override-with-fsGroup test from #239 (line 3015 of the test file) continues to pass unchanged.

E2E verification will land via the new Longhorn job on this PR's first CI run.

Behavior

Caller | Result
No Spec.PodSecurityContext set | Default returns seccomp RuntimeDefault + FSGroup: 102 (new)
Spec.PodSecurityContext set with FSGroup | User's value used verbatim (unchanged)
Spec.PodSecurityContext set without FSGroup | User's value used verbatim, no FSGroup (unchanged; explicit user intent respected, no merging)

Operators using a custom --init-container-image with a different default UID/GID should continue to override Spec.PodSecurityContext to match their image. Documented in the new doc comment on inferPodSecurityContext.

Files changed

  • internal/controller/deployment_builder.go — add FSGroup: 102 default + doc comment
  • internal/controller/inferenceservice_controller.go — replace misleading "Users MUST" comment block on initContainerSecurityContext
  • internal/controller/inferenceservice_deployment_test.go — add Context("default pod security context", ...) block with 2 new specs
  • .github/workflows/test-e2e.yml — add test-e2e-longhorn job (continue-on-error initially)

BEGIN_COMMIT_OVERRIDE
fix(controller): default FSGroup to curl_group + Longhorn-backed e2e job (closes #418, closes #420)
docs(upgrade): v0.7.7 rolls every InferenceService Pod once on first reconcile (Deployment template gains fsGroup=102 and the new inference.llmkube.dev/runtime label)
docs(upgrade): OpenShift / OKD / MicroShift installs must use helm ... -f charts/llmkube/values-openshift.yaml so restricted-v2 SCC can inject fsGroup from the namespace's allocated range
docs(upgrade): operators using a custom --init-container-image whose user is not curl (uid=101 gid=102) should set spec.podSecurityContext on each InferenceService or pass --default-fsgroup=<gid> to the controller
END_COMMIT_OVERRIDE

…eploys (closes defilantech#418)

When the InferenceService Deployment path provisions a fresh PVC for the
model cache, the model-downloader init container (curlimages/curl, uid=101
gid=102) cannot create directories on the volume because the default
inferPodSecurityContext did not set FSGroup. Storage classes that initialize
PVCs as root:root 0755 (Longhorn ext4 confirmed; OpenShift restricted-v2
SCC is the original motivating case from defilantech#239) leave the volume non-writable
for any non-root container, so the init container fails with
"mkdir: can't create directory '/models/<hash>': Permission denied".

Default FSGroup to 102 (curl_group GID, verified by running id inside
docker.io/curlimages/curl:8.18.0). Kubernetes recursively chowns the volume
to this GID and adds it to all containers' supplementary groups, which
makes the volume writable for the curl init container and readable for
the inference container regardless of its primary UID.

User-supplied Spec.PodSecurityContext continues to take precedence verbatim
when present, with no merging. New tests in inferenceservice_deployment_test.go
lock both behaviors:
- default returns FSGroup=102 plus seccomp RuntimeDefault
- user-supplied PodSecurityContext is preserved as-is, even when the user
  omits FSGroup (operator must respect explicit user intent)

Also updates the misleading "Users MUST specify runAsUser/runAsGroup..."
comment block in initContainerSecurityContext to reflect the new default
ergonomics.

Reported by @Dazag with a clean reproducer, root-cause hypothesis, and a
working manual workaround. The hypothesis about the non-root init image
is correct; the only correction is the actual UID (101, not 1000, with
GID 102).

Signed-off-by: Christopher Maher <chris@mahercode.io>
@codecov

codecov Bot commented May 10, 2026

Codecov Report

❌ Patch coverage is 43.75000% with 9 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
cmd/main.go | 0.00% | 8 Missing ⚠️
pkg/cli/cache_inspect.go | 0.00% | 1 Missing ⚠️


…e regressions (refs defilantech#420)

The default e2e job runs against kind with local-path-provisioner, which
uses host-path mode 0777. That permissive mode masks fsGroup omissions
and similar restrictive-storage regressions, which is how defilantech#418 reached
release without being caught at PR time.

Adds a second e2e job that runs the same test suite against a kind
cluster with Longhorn installed and set as the default storage class.
Longhorn provisions PVCs as root:root 0755 (matching production storage
classes like OpenShift restricted-v2 SCC and AKS Azure Disk default), so
any future fsGroup or PodSecurityContext regression that would hit a
real-world contributor will now hit CI first.

Marked continue-on-error: true initially because Longhorn-on-kind has
known fragility around iSCSI tooling on GitHub runners. The default
kind job remains the merge gate. Once the Longhorn job has stabilized
across a few PRs it will be promoted to required.

Subsumes the work scoped by defilantech#420.

Signed-off-by: Christopher Maher <chris@mahercode.io>
Defilan changed the title from "fix(controller): default FSGroup to curl_group to unbreak fresh-PVC deploys (closes #418)" to "fix(controller): default FSGroup to curl_group + Longhorn-backed e2e job (closes #418, closes #420)" on May 10, 2026
Defilan added 2 commits May 10, 2026 07:53
Mounting /sys from the host breaks kind's systemd init:
  ERROR: failed to create cluster: could not find a log line that matches
  "Reached target .*Multi-User System.*|detected cgroup v1"

/lib/modules is the canonical Longhorn-on-kind mount. /dev is sometimes
recommended but not required for the basic install path; can be added
back if a later iteration shows iSCSI device-file issues.

Signed-off-by: Christopher Maher <chris@mahercode.io>
…roup flag for OpenShift compat

The previous commit set FSGroup: 102 unconditionally as the default
PodSecurityContext fsGroup. That breaks OpenShift users on the default
restricted-v2 SCC, which uses MustRunAs to inject fsGroup from the
namespace's allocated supplemental-groups range and rejects pods with
explicit values outside that range.

This commit makes the default configurable so OpenShift operators can
opt out:

- New controller flag --default-fsgroup int64 (default 102). Values <= 0
  disable the default, leaving the rendered PodSecurityContext FSGroup
  unset so the SCC admission controller injects an appropriate value.
- New reconciler field DefaultFSGroup propagates the flag value to the
  inferPodSecurityContext rendering.
- Helm chart exposes controllerManager.initContainer.defaultFSGroup
  (default 102), wired through to --default-fsgroup on the controller
  args. Schema constrains the value to integer >= 0.

OpenShift operators set:

  controllerManager:
    initContainer:
      defaultFSGroup: 0

Updated README troubleshooting section to call out the new flag and
both Option-1 (operator-wide disable) and Option-2 (per-InferenceService
override) workflows for OpenShift users.

Test changes:
- New spec "should omit FSGroup when DefaultFSGroup is 0 (OpenShift
  compatibility mode)" exercises the disable path.
- All existing reconciler test fixtures updated to set DefaultFSGroup:
  102 to match the runtime default, keeping pre-existing security-
  context expectations consistent.

Signed-off-by: Christopher Maher <chris@mahercode.io>
Defilan added 6 commits May 10, 2026 08:30
The cache inspector pod (spawned by 'llmkube cache list') waited at most
60s for the pod to reach Running. On storage classes with non-trivial
attach time (Longhorn, OpenShift CSI, remote-attach disks), PVC binding
plus volume attach plus iSCSI session setup can exceed 60s, leaving the
test or user with a misleading 'inspector pod failed to start' error.

Raise the ceiling to 120s. Fast local-path provisioners still resolve in
a couple of seconds; the only behavior change is a longer wait before
the user sees the error when the inspector genuinely cannot start, which
is acceptable for a one-off CLI command.

Surfaced by the kind+Longhorn e2e job in defilantech#419 where the pre-existing
'cache list with Model CR and PVC ... should show active cache entries'
spec was timing out at 121s waiting for an inspector pod that the slower
Longhorn attach path could not bring up in 60s.

Signed-off-by: Christopher Maher <chris@mahercode.io>
The cache-inspect path spawns an inspector pod that mounts a freshly-
provisioned PVC. Slow CSI drivers (Longhorn-on-kind, OpenShift CSI under
load) routinely take 90-120s to attach a volume to a new pod. With the
inner inspector pod-ready timeout at 120s (pkg/cli/cache_inspect.go) and
the outer Eventually capped at 2m, the test got at most one attempt and
failed deterministically when Longhorn was on the slow side of its
attach distribution.

Bumping the outer budget to 4m allows the inner inspector wait to run
twice before the test gives up, restoring retry headroom that was lost
when the inner timeout was raised from 60s.

Other 2-minute Eventually calls in this file are left as-is; they wait
on different state (controller reconcile, deployment readiness, status
phase transitions) that does not exercise the slow-CSI attach path.

Signed-off-by: Christopher Maher <chris@mahercode.io>
…e list

The cache-list test was racing the InferenceService Deployment startup.
'cache list' was called immediately after the PVC existed, before any
pod was Running with that PVC mounted. cache_inspect.go's
findPodWithCachePVC consequently returned nil, falling back to spawning
a separate inspector pod that tried to mount the same RWO PVC. On
Longhorn-backed kind that mount blocks indefinitely (RWO contention
with the Deployment pod still in ContainerCreating attaching the same
volume), and the inspector pod never reaches Running.

Wait for the Deployment to be Available before invoking cache list.
That makes findPodWithCachePVC succeed and exercises the reuse-via-exec
path that real users hit, sidestepping the inspector-pod spawn.

This makes the test deterministic on slow CSI drivers (Longhorn,
OpenShift CSI under load) without hiding any genuine fsGroup or
controller bugs the test was meant to catch.

Signed-off-by: Christopher Maher <chris@mahercode.io>
…ghorn instead

The previous commit (5c6fc78) added 'kubectl wait --for=condition=available'
on the InferenceService Deployment before invoking 'cache list'. That
broke the default kind job because the test fixture serves fake GGUF
bytes ('fake-gguf-data', base64-encoded). The llama-server main
container cannot parse fake GGUF and crashloops, so the Deployment
never reaches Available. The wait timed out at 4m and the test failed.

The original test design relied on cache_inspect.go spawning a transient
inspector pod (which uses a sleep container, not llama-server, and
therefore does not need a real model file). That works on local-path
provisioners but blocks indefinitely on Longhorn because Longhorn
strictly enforces RWO and the PVC the inspector pod wants is also
pending against the Deployment pod.

Revert the deployment-Available wait. Add an env-gated Skip for the
cache-inspect spec under Longhorn, controlled by LLMKUBE_E2E_LONGHORN.
The Longhorn workflow job sets that env to skip just this spec; all
other e2e specs continue to run on Longhorn (where they exercise the
actual fsGroup regression class defilantech#418/defilantech#419 was filed for).

Default kind continues to run the full suite including the cache list
spec, which is where it belongs (local-path-provisioner permits the
multi-RWO mounts that path needs).

Signed-off-by: Christopher Maher <chris@mahercode.io>
…st specs

The previous commit (5df56f3) put the LLMKUBE_E2E_LONGHORN skip inside
the first It of the 'cache list with Model CR and PVC' Context, but the
Context contains four It specs and all of them exercise the same
cache_inspect.go inspector-pod path that is incompatible with Longhorn's
RWO enforcement. Only one was being skipped; the second one ('should
report correct summary with active/orphaned counts') still ran and hit
the same inspector-pod-failed-to-start failure.

Move the skip to a BeforeEach so every spec in the Context skips under
Longhorn. The default kind path continues to run all four specs.

Signed-off-by: Christopher Maher <chris@mahercode.io>
… contexts

Commit 64b129c skipped only the first cache-list Context on Longhorn.
The next CI run revealed two more sibling Contexts in the Cache
Inspection block that also exercise the inspector-pod / RWO-PVC path:

- 'cache list orphaned entry detection' depends on the PVC the previous
  Context creates AND uses cache list (which spawns the inspector pod)
- 'cache list inspector pod lifecycle' explicitly tests the inspector-
  pod spawn path

Add the same LLMKUBE_E2E_LONGHORN-gated Skip BeforeEach to both. The
two sibling Contexts that do not depend on PVC mounts ('cache list
graceful fallback without PVC' and 'cache list help documentation')
continue to run on Longhorn unchanged.

Signed-off-by: Christopher Maher <chris@mahercode.io>
@Defilan Defilan merged commit adce90f into defilantech:main May 10, 2026
20 checks passed
@github-actions github-actions Bot mentioned this pull request May 9, 2026


Development

Successfully merging this pull request may close these issues.

ci: add Longhorn-backed K3s e2e job to catch fsGroup / PVC permission regressions
[BUG] Permission Denied on Model Download
