fix(controller): default FSGroup to curl_group + Longhorn-backed e2e job (closes #418, closes #420) #419
Merged
Defilan merged 10 commits into defilantech:main on May 10, 2026
Conversation
…eploys (closes defilantech#418)

When the InferenceService Deployment path provisions a fresh PVC for the model cache, the model-downloader init container (curlimages/curl, uid=101 gid=102) cannot create directories on the volume because the default inferPodSecurityContext did not set FSGroup. Storage classes that initialize PVCs as root:root 0755 (Longhorn ext4 confirmed; OpenShift restricted-v2 SCC is the original motivating case from defilantech#239) leave the volume non-writable for any non-root container, so the init container fails with "mkdir: can't create directory '/models/<hash>': Permission denied".

Default FSGroup to 102 (curl_group GID, verified by running id inside docker.io/curlimages/curl:8.18.0). Kubernetes recursively chowns the volume to this GID and adds it to all containers' supplementary groups, which makes the volume writable for the curl init container and readable for the inference container regardless of its primary UID. A user-supplied Spec.PodSecurityContext continues to take precedence verbatim when present, with no merging.

New tests in inferenceservice_deployment_test.go lock both behaviors:
- the default returns FSGroup=102 plus seccomp RuntimeDefault
- a user-supplied PodSecurityContext is preserved as-is, even when the user omits FSGroup (the operator must respect explicit user intent)

Also updates the misleading "Users MUST specify runAsUser/runAsGroup..." comment block in initContainerSecurityContext to reflect the new default ergonomics.

Reported by @Dazag with a clean reproducer, a root-cause hypothesis, and a working manual workaround. The hypothesis about the non-root init image is correct; the only correction is the actual UID (101, not 1000, with GID 102).

Signed-off-by: Christopher Maher <chris@mahercode.io>
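Illustratively, the locked-in behavior reduces to a function like the following. This is a minimal sketch, assuming a free-function shape; the real rendering lives in deployment_builder.go and its exact signature may differ.

```go
package controller

import corev1 "k8s.io/api/core/v1"

// inferPodSecurityContext sketches the behavior the new tests lock in:
// a user-supplied context wins verbatim (no merging); otherwise the
// default is FSGroup 102 plus seccomp RuntimeDefault.
func inferPodSecurityContext(user *corev1.PodSecurityContext) *corev1.PodSecurityContext {
	if user != nil {
		// Explicit user intent is respected as-is, even if FSGroup is omitted.
		return user
	}
	fsGroup := int64(102) // curl_group GID in docker.io/curlimages/curl:8.18.0
	return &corev1.PodSecurityContext{
		FSGroup: &fsGroup,
		SeccompProfile: &corev1.SeccompProfile{
			Type: corev1.SeccompProfileTypeRuntimeDefault,
		},
	}
}
```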
…e regressions (refs defilantech#420)

The default e2e job runs against kind with local-path-provisioner, which uses host-path mode 0777. That permissive mode masks fsGroup omissions and similar restrictive-storage regressions, which is how defilantech#418 reached release without being caught at PR time.

Adds a second e2e job that runs the same test suite against a kind cluster with Longhorn installed and set as the default storage class. Longhorn provisions PVCs as root:root 0755 (matching production storage classes like the OpenShift restricted-v2 SCC and the AKS Azure Disk default), so any future fsGroup or PodSecurityContext regression that would hit a real-world contributor will now hit CI first.

Marked continue-on-error: true initially because Longhorn-on-kind has known fragility around iSCSI tooling on GitHub runners. The default kind job remains the merge gate. Once the Longhorn job has stabilized across a few PRs it will be promoted to required.

Subsumes the work scoped by defilantech#420.

Signed-off-by: Christopher Maher <chris@mahercode.io>
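A sketch of the job shape, abridged. The job name `test-e2e-longhorn`, `continue-on-error: true`, and the default-storage-class swap are from this PR; the Longhorn version pin, kind config path, and suite invocation are assumptions (the `LLMKUBE_E2E_LONGHORN` env var is added by a later commit in this PR).

```yaml
test-e2e-longhorn:
  runs-on: ubuntu-latest
  continue-on-error: true  # Longhorn-on-kind iSCSI fragility; default kind job stays the merge gate
  steps:
    - uses: actions/checkout@v4
    - name: Create kind cluster
      run: kind create cluster --config hack/kind-longhorn.yaml  # hypothetical config path
    - name: Install Longhorn and make it the default storage class
      run: |
        kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.7.2/deploy/longhorn.yaml
        kubectl -n longhorn-system wait --for=condition=ready pod --all --timeout=10m
        kubectl annotate storageclass longhorn storageclass.kubernetes.io/is-default-class=true --overwrite
        kubectl annotate storageclass standard storageclass.kubernetes.io/is-default-class- || true
    - name: Run e2e suite
      env:
        LLMKUBE_E2E_LONGHORN: "true"
      run: make test-e2e
```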
Mounting /sys from the host breaks kind's systemd init:

ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"

/lib/modules is the canonical Longhorn-on-kind mount. /dev is sometimes recommended but not required for the basic install path; it can be added back if a later iteration shows iSCSI device-file issues.

Signed-off-by: Christopher Maher <chris@mahercode.io>
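The mount constraints translate to a kind config along these lines (a sketch; the file name is an assumption):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraMounts:
      # Canonical Longhorn-on-kind mount: the node needs the host's
      # kernel modules for iSCSI.
      - hostPath: /lib/modules
        containerPath: /lib/modules
        readOnly: true
      # /sys is deliberately NOT mounted: it breaks kind's systemd init
      # ("could not find a log line that matches 'Reached target ...'").
      # /dev can be added later if iSCSI device-file issues appear.
```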
…roup flag for OpenShift compat
The previous commit set FSGroup: 102 unconditionally as the default
PodSecurityContext fsGroup. That breaks OpenShift users on the default
restricted-v2 SCC, which uses MustRunAs to inject fsGroup from the
namespace's allocated supplemental-groups range and rejects pods with
explicit values outside that range.
This commit makes the default configurable so OpenShift operators can
opt out:
- New controller flag --default-fsgroup int64 (default 102). Values <= 0
disable the default, leaving the rendered PodSecurityContext FSGroup
unset so the SCC admission controller injects an appropriate value.
- New reconciler field DefaultFSGroup propagates the flag value to the
inferPodSecurityContext rendering.
- Helm chart exposes controllerManager.initContainer.defaultFSGroup
(default 102), wired through to --default-fsgroup on the controller
args. Schema constrains the value to integer >= 0.
OpenShift operators set:

  controllerManager:
    initContainer:
      defaultFSGroup: 0
Updated README troubleshooting section to call out the new flag and
both Option-1 (operator-wide disable) and Option-2 (per-InferenceService
override) workflows for OpenShift users.
Test changes:
- New spec "should omit FSGroup when DefaultFSGroup is 0 (OpenShift
compatibility mode)" exercises the disable path.
- All existing reconciler test fixtures updated to set DefaultFSGroup:
102 to match the runtime default, keeping pre-existing security-
context expectations consistent.
Signed-off-by: Christopher Maher <chris@mahercode.io>
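A compilable sketch of the wiring under this commit's stated semantics. The flag name, the default of 102, the `DefaultFSGroup` reconciler field, and the <= 0 disable behavior are from the message; the function and variable names here are assumptions.

```go
package main

import (
	"flag"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// --default-fsgroup: values <= 0 disable the default so the SCC
// admission controller can inject an appropriate fsGroup instead.
var defaultFSGroup = flag.Int64("default-fsgroup", 102,
	"fsGroup applied when an InferenceService sets no PodSecurityContext; <= 0 leaves FSGroup unset")

// renderDefaultPodSecurityContext mirrors how the reconciler's
// DefaultFSGroup field could gate the rendered default.
func renderDefaultPodSecurityContext(fsGroup int64) *corev1.PodSecurityContext {
	psc := &corev1.PodSecurityContext{
		SeccompProfile: &corev1.SeccompProfile{Type: corev1.SeccompProfileTypeRuntimeDefault},
	}
	if fsGroup > 0 {
		psc.FSGroup = &fsGroup // omitted entirely under --default-fsgroup=0 (OpenShift mode)
	}
	return psc
}

func main() {
	flag.Parse()
	fmt.Printf("rendered: %+v\n", renderDefaultPodSecurityContext(*defaultFSGroup))
}
```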
The cache inspector pod (spawned by 'llmkube cache list') waited at most 60s for the pod to reach Running. On storage classes with non-trivial attach time (Longhorn, OpenShift CSI, remote-attach disks), PVC binding plus volume attach plus iSCSI session setup can exceed 60s, leaving the test or user with a misleading 'inspector pod failed to start' error.

Raise the ceiling to 120s. Fast local-path provisioners still resolve in a couple of seconds; the only behavior change is a longer wait before the user sees the error when the inspector genuinely cannot start, which is acceptable for a one-off CLI command.

Surfaced by the kind+Longhorn e2e job in defilantech#419, where the pre-existing 'cache list with Model CR and PVC ... should show active cache entries' spec was timing out at 121s waiting for an inspector pod that the slower Longhorn attach path could not bring up in 60s.

Signed-off-by: Christopher Maher <chris@mahercode.io>
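For instance, the raised ceiling could look like this. A sketch only: the helper name and polling interval are assumptions; the 120s budget is this commit's change.

```go
package cli

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForInspectorRunning polls until the inspector pod is Running,
// with the ceiling raised from 60s to 120s for slow CSI attach paths.
func waitForInspectorRunning(ctx context.Context, cs kubernetes.Interface, ns, pod string) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 120*time.Second, true,
		func(ctx context.Context) (bool, error) {
			p, err := cs.CoreV1().Pods(ns).Get(ctx, pod, metav1.GetOptions{})
			if err != nil {
				return false, nil // tolerate transient API errors; keep polling
			}
			return p.Status.Phase == corev1.PodRunning, nil
		})
}
```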
The cache-inspect path spawns an inspector pod that mounts a freshly provisioned PVC. Slow CSI drivers (Longhorn-on-kind, OpenShift CSI under load) routinely take 90-120s to attach a volume to a new pod. With the inner inspector pod-ready timeout at 120s (pkg/cli/cache_inspect.go) and the outer Eventually capped at 2m, the test got at most one attempt and failed deterministically when Longhorn was on the slow side of its attach distribution.

Bumping the outer budget to 4m allows the inner inspector wait to run twice before the test gives up, restoring the retry headroom that was lost when the inner timeout was raised from 60s.

Other 2-minute Eventually calls in this file are left as-is; they wait on different state (controller reconcile, deployment readiness, status phase transitions) that does not exercise the slow-CSI attach path.

Signed-off-by: Christopher Maher <chris@mahercode.io>
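In Gomega terms the budget change is roughly the following fragment; the spec body and the `runCacheList` helper are hypothetical, and only the 2m-to-4m bump is from this commit.

```go
// Before: }, 2*time.Minute, 10*time.Second).Should(Succeed())
// After: one extra full pass of the 120s inner inspector wait now fits.
Eventually(func(g Gomega) {
	out, err := runCacheList() // hypothetical e2e helper wrapping 'llmkube cache list'
	g.Expect(err).NotTo(HaveOccurred())
	g.Expect(out).To(ContainSubstring("active"))
}, 4*time.Minute, 10*time.Second).Should(Succeed())
```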
…e list

The cache-list test was racing the InferenceService Deployment startup. 'cache list' was called immediately after the PVC existed, before any pod was Running with that PVC mounted. cache_inspect.go's findPodWithCachePVC consequently returned nil, falling back to spawning a separate inspector pod that tried to mount the same RWO PVC. On Longhorn-backed kind that mount blocks indefinitely (RWO contention with the Deployment pod, still in ContainerCreating, attaching the same volume), and the inspector pod never reaches Running.

Wait for the Deployment to be Available before invoking cache list. That makes findPodWithCachePVC succeed and exercises the reuse-via-exec path that real users hit, sidestepping the inspector-pod spawn. This makes the test deterministic on slow CSI drivers (Longhorn, OpenShift CSI under load) without hiding any genuine fsGroup or controller bugs the test was meant to catch.

Signed-off-by: Christopher Maher <chris@mahercode.io>
…ghorn instead

The previous commit (5c6fc78) added 'kubectl wait --for=condition=available' on the InferenceService Deployment before invoking 'cache list'. That broke the default kind job because the test fixture serves fake GGUF bytes ('fake-gguf-data', base64-encoded). The llama-server main container cannot parse fake GGUF and crashloops, so the Deployment never reaches Available. The wait timed out at 4m and the test failed.

The original test design relied on cache_inspect.go spawning a transient inspector pod (which uses a sleep container, not llama-server, and therefore does not need a real model file). That works on local-path provisioners but blocks indefinitely on Longhorn because Longhorn strictly enforces RWO and the PVC the inspector pod wants is also pending against the Deployment pod.

Revert the deployment-Available wait. Add an env-gated Skip for the cache-inspect spec under Longhorn, controlled by LLMKUBE_E2E_LONGHORN. The Longhorn workflow job sets that env var to skip just this spec; all other e2e specs continue to run on Longhorn (where they exercise the actual fsGroup regression class defilantech#418/defilantech#419 was filed for). Default kind continues to run the full suite including the cache-list spec, which is where it belongs (local-path-provisioner permits the multi-RWO mounts that path needs).

Signed-off-by: Christopher Maher <chris@mahercode.io>
…st specs

The previous commit (5df56f3) put the LLMKUBE_E2E_LONGHORN skip inside the first It of the 'cache list with Model CR and PVC' Context, but the Context contains four It specs, and all of them exercise the same cache_inspect.go inspector-pod path that is incompatible with Longhorn's RWO enforcement. Only one was being skipped; the second ('should report correct summary with active/orphaned counts') still ran and hit the same inspector-pod-failed-to-start failure.

Move the skip to a BeforeEach so every spec in the Context skips under Longhorn. The default kind path continues to run all four specs.

Signed-off-by: Christopher Maher <chris@mahercode.io>
… contexts

Commit 64b129c skipped only the first cache-list Context on Longhorn. The next CI run revealed two more sibling Contexts in the Cache Inspection block that also exercise the inspector-pod / RWO-PVC path:

- 'cache list orphaned entry detection' depends on the PVC the previous Context creates AND uses cache list (which spawns the inspector pod)
- 'cache list inspector pod lifecycle' explicitly tests the inspector-pod spawn path

Add the same LLMKUBE_E2E_LONGHORN-gated Skip BeforeEach to both, as sketched below. The two sibling Contexts that do not depend on PVC mounts ('cache list graceful fallback without PVC' and 'cache list help documentation') continue to run on Longhorn unchanged.

Signed-off-by: Christopher Maher <chris@mahercode.io>
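Putting the three skip commits together, the final shape is roughly as follows. A sketch: the Context titles and env var are from the commits; the shared helper is an assumption (the commits may inline the check), and the It specs are elided.

```go
package e2e

import (
	"os"

	. "github.com/onsi/ginkgo/v2"
)

var _ = Describe("Cache Inspection", func() {
	// Shared gate: skip any spec that exercises the inspector-pod /
	// RWO-PVC path when running against Longhorn.
	skipOnLonghorn := func() {
		if os.Getenv("LLMKUBE_E2E_LONGHORN") != "" {
			Skip("inspector pod mounts an RWO PVC already attached to the Deployment pod; Longhorn enforces RWO strictly")
		}
	}

	Context("cache list with Model CR and PVC", func() {
		BeforeEach(skipOnLonghorn)
		// ... four It specs exercising the cache_inspect.go inspector-pod path
	})

	Context("cache list orphaned entry detection", func() {
		BeforeEach(skipOnLonghorn) // depends on the PVC above and calls 'cache list'
		// ... It specs elided
	})

	Context("cache list inspector pod lifecycle", func() {
		BeforeEach(skipOnLonghorn) // explicitly tests the inspector-pod spawn path
		// ... It specs elided
	})

	// 'cache list graceful fallback without PVC' and 'cache list help
	// documentation' run unchanged on Longhorn.
})
```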
Closes #418. Closes #420.
**What this PR does**
Two related changes shipped together so the regression class is fixed AND caught by CI going forward.
**1. Fix: default `FSGroup: 102` on `inferPodSecurityContext`**

The `inferPodSecurityContext` defaults did not set `FSGroup`. The default init container image is `docker.io/curlimages/curl:8.18.0`, which runs as `uid=101 gid=102`. On storage classes that provision PVCs as `root:root 0755` (Longhorn ext4 confirmed by the reporter; OpenShift `restricted-v2` SCC was the original motivating case for #239), the volume is non-writable for the non-root init container, and `model-downloader` fails with `mkdir: can't create directory '/models/<hash>': Permission denied`.

The in-source comment block on `initContainerSecurityContext` openly acknowledged the foot-gun ("Users MUST specify runAsUser/runAsGroup..."). Defaulting `FSGroup: 102` removes it: K8s recursively chowns the volume to `curl_group` and adds it to every container's supplementary groups, which makes the volume writable for the curl init container and readable for the inference container regardless of its primary UID.

**2. CI: new Longhorn-backed e2e job**
The default kind e2e job runs against `local-path-provisioner`, which uses host-path mode `0777`. That permissive mode masks `fsGroup` omissions, which is why #418 reached release without being caught at PR time.

Added a second e2e job that runs the same suite against a kind cluster with Longhorn installed and set as the default storage class. Longhorn provisions PVCs as `root:root 0755` (matching the production storage classes contributors actually run), so any future fsGroup or PodSecurityContext regression that would hit a real-world contributor will now hit CI first.

The new job is `continue-on-error: true` initially because Longhorn-on-kind has known fragility around iSCSI tooling on GitHub runners. The default kind job remains the merge gate. Once the Longhorn job is stable across a few PRs it will be promoted to required.

**Why now**
Reported by @Dazag in #418 with a clean reproducer (K3s 1.31 + Longhorn + RTX 3060 + fresh PVC) and a working manual workaround (`kubectl patch ... podSecurityContext.fsGroup`). Their root-cause hypothesis on the non-root init image was correct; the only correction was on the UID (`uid=101 curl_user`, `gid=102 curl_group`, not `1000` as initially stated).

**Verification**
`make test` green locally. New unit test cases in `internal/controller/inferenceservice_deployment_test.go`:

- `should default FSGroup to 102 to match curlimages/curl curl_group` — locks the new default plus the existing seccomp `RuntimeDefault`.
- `should preserve user-supplied PodSecurityContext verbatim without merging the default FSGroup` — locks the explicit-no-merge behavior so a user passing their own `PodSecurityContext` without `FSGroup` is respected as-is.

The existing user-override-with-fsGroup test from #239 (line 3015 of the test file) continues to pass unchanged.
E2E verification will land via the new Longhorn job on this PR's first CI run.
**Behavior**

| `Spec.PodSecurityContext` | Rendered pod security context |
| --- | --- |
| not set | seccomp `RuntimeDefault` + `FSGroup: 102` (new) |
| set with `FSGroup` | user value used verbatim |
| set without `FSGroup` | user value used verbatim; the default `FSGroup` is not merged in |

Operators using a custom `--init-container-image` with a different default UID/GID should continue to override `Spec.PodSecurityContext` to match their image; see the example manifest below. Documented in the new doc comment on `inferPodSecurityContext`.
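For the per-InferenceService route, a hypothetical manifest; the apiVersion is inferred from the `inference.llmkube.dev/runtime` label used elsewhere in this PR and is a guess, as is the example GID.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1  # version is an assumption
kind: InferenceService
metadata:
  name: my-model
spec:
  podSecurityContext:
    fsGroup: 1000  # match the GID of your custom --init-container-image user
```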
**Files changed**

- `internal/controller/deployment_builder.go` — add `FSGroup: 102` default + doc comment
- `internal/controller/inferenceservice_controller.go` — replace misleading "Users MUST" comment block on `initContainerSecurityContext`
- `internal/controller/inferenceservice_deployment_test.go` — add `Context("default pod security context", ...)` block with 2 new specs
- `.github/workflows/test-e2e.yml` — add `test-e2e-longhorn` job (continue-on-error initially)

BEGIN_COMMIT_OVERRIDE
fix(controller): default FSGroup to curl_group + Longhorn-backed e2e job (closes #418, closes #420)
docs(upgrade): v0.7.7 rolls every InferenceService Pod once on first reconcile (Deployment template gains fsGroup=102 and the new `inference.llmkube.dev/runtime` label)

docs(upgrade): OpenShift / OKD / MicroShift installs must use `helm ... -f charts/llmkube/values-openshift.yaml` so restricted-v2 SCC can inject fsGroup from the namespace's allocated range

docs(upgrade): operators using a custom `--init-container-image` whose user is not curl (uid=101 gid=102) should set `spec.podSecurityContext` on each InferenceService or pass `--default-fsgroup=<gid>` to the controller

END_COMMIT_OVERRIDE