
fix(controller): default FSGroup to curl_group + Longhorn-backed e2e job (closes #418, closes #420) #419

Merged
Defilan merged 10 commits into defilantech:main from Defilan:fix/418-default-fsgroup-pvc-permission
May 10, 2026

Conversation


@Defilan Defilan commented May 10, 2026

Closes #418. Closes #420.

What this PR does

Two related changes shipped together so the regression class is fixed AND caught by CI going forward.

1. Fix: default FSGroup: 102 on inferPodSecurityContext

inferPodSecurityContext defaults did not set FSGroup. The default init container image is docker.io/curlimages/curl:8.18.0, which runs as uid=101 gid=102. On storage classes that provision PVCs as root:root 0755 (Longhorn ext4 confirmed by the reporter; OpenShift restricted-v2 SCC was the original motivating case for #239), the volume is non-writable for the non-root init container, and model-downloader fails with mkdir: can't create directory '/models/<hash>': Permission denied.

The in-source comment block on initContainerSecurityContext openly acknowledged the foot-gun ("Users MUST specify runAsUser/runAsGroup..."). Defaulting FSGroup to 102 removes that foot-gun: Kubernetes recursively chowns the volume to curl_group and adds that GID to every container's supplementary groups, which makes the volume writable for the curl init container and readable for the inference container regardless of its primary UID.
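The defaulting behavior can be sketched as follows. This is a minimal, self-contained stand-in, not the controller's actual code: the real inferPodSecurityContext lives in internal/controller/deployment_builder.go and operates on k8s.io/api/core/v1 types, which are simplified to two fields here.

```go
package main

import "fmt"

// PodSecurityContext is a hypothetical stand-in for corev1.PodSecurityContext,
// reduced to the two fields this change touches.
type PodSecurityContext struct {
	FSGroup        *int64
	SeccompProfile string
}

// inferPodSecurityContext sketch: a user-supplied context is returned
// verbatim (no merging); otherwise the default is seccomp RuntimeDefault
// plus FSGroup 102, the curl_group GID in docker.io/curlimages/curl:8.18.0.
func inferPodSecurityContext(user *PodSecurityContext) *PodSecurityContext {
	if user != nil {
		return user // explicit user intent respected as-is, even without FSGroup
	}
	fsGroup := int64(102)
	return &PodSecurityContext{
		FSGroup:        &fsGroup,
		SeccompProfile: "RuntimeDefault",
	}
}

func main() {
	// No user context: the new default applies.
	def := inferPodSecurityContext(nil)
	fmt.Println(*def.FSGroup, def.SeccompProfile) // prints "102 RuntimeDefault"

	// User context without FSGroup: left untouched, no merge.
	user := &PodSecurityContext{SeccompProfile: "RuntimeDefault"}
	fmt.Println(inferPodSecurityContext(user).FSGroup == nil) // prints "true"
}
```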

2. CI: new Longhorn-backed e2e job

The default kind e2e job runs against local-path-provisioner, which uses host-path mode 0777. That permissive mode masks fsGroup omissions, which is why #418 reached release without being caught at PR time.

Added a second e2e job that runs the same suite against a kind cluster with Longhorn installed and set as the default storage class. Longhorn provisions PVCs as root:root 0755 (matching the production storage classes contributors actually run), so any future fsGroup or PodSecurityContext regression that would hit a real-world contributor will now hit CI first.

The new job is continue-on-error: true initially because Longhorn-on-kind has known fragility around iSCSI tooling on GitHub runners. The default kind job remains the merge gate. Once the Longhorn job is stable across a few PRs it will be promoted to required.

Why now

Reported by @Dazag in #418 with a clean reproducer (K3s 1.31 + Longhorn + RTX 3060 + fresh PVC) and a working manual workaround (kubectl patch ... podSecurityContext.fsGroup). Their root-cause hypothesis on the non-root init image was correct; the only correction was on the UID (uid=101 curl_user, gid=102 curl_group, not 1000 as initially stated).

Verification

make test green locally:

ok  	github.com/defilantech/llmkube/internal/controller	9.649s	coverage: 85.7% of statements

New unit test cases in internal/controller/inferenceservice_deployment_test.go:

  1. should default FSGroup to 102 to match curlimages/curl curl_group — locks the new default plus the existing seccomp RuntimeDefault.
  2. should preserve user-supplied PodSecurityContext verbatim without merging the default FSGroup — locks the explicit-no-merge behavior so a user passing their own PodSecurityContext without FSGroup is respected as-is.

The existing user-override-with-fsGroup test from #239 (line 3015 of the test file) continues to pass unchanged.

E2E verification will land via the new Longhorn job on this PR's first CI run.

Behavior

Caller | Result
No Spec.PodSecurityContext set | Default returns seccomp RuntimeDefault + FSGroup: 102 (new)
Spec.PodSecurityContext set with FSGroup | User's value used verbatim (unchanged)
Spec.PodSecurityContext set without FSGroup | User's value used verbatim, no FSGroup (unchanged; explicit user intent respected, no merging)

Operators using a custom --init-container-image with a different default UID/GID should continue to override Spec.PodSecurityContext to match their image. Documented in the new doc comment on inferPodSecurityContext.

Files changed

  • internal/controller/deployment_builder.go — add FSGroup: 102 default + doc comment
  • internal/controller/inferenceservice_controller.go — replace misleading "Users MUST" comment block on initContainerSecurityContext
  • internal/controller/inferenceservice_deployment_test.go — add Context("default pod security context", ...) block with 2 new specs
  • .github/workflows/test-e2e.yml — add test-e2e-longhorn job (continue-on-error initially)

BEGIN_COMMIT_OVERRIDE
fix(controller): default FSGroup to curl_group + Longhorn-backed e2e job (closes #418, closes #420)
docs(upgrade): v0.7.7 rolls every InferenceService Pod once on first reconcile (Deployment template gains fsGroup=102 and the new inference.llmkube.dev/runtime label)
docs(upgrade): OpenShift / OKD / MicroShift installs must use helm ... -f charts/llmkube/values-openshift.yaml so restricted-v2 SCC can inject fsGroup from the namespace's allocated range
docs(upgrade): operators using a custom --init-container-image whose user is not curl (uid=101 gid=102) should set spec.podSecurityContext on each InferenceService or pass --default-fsgroup=<gid> to the controller
END_COMMIT_OVERRIDE

…eploys (closes defilantech#418)

When the InferenceService Deployment path provisions a fresh PVC for the
model cache, the model-downloader init container (curlimages/curl, uid=101
gid=102) cannot create directories on the volume because the default
inferPodSecurityContext did not set FSGroup. Storage classes that initialize
PVCs as root:root 0755 (Longhorn ext4 confirmed; OpenShift restricted-v2
SCC is the original motivating case from defilantech#239) leave the volume non-writable
for any non-root container, so the init container fails with
"mkdir: can't create directory '/models/<hash>': Permission denied".

Default FSGroup to 102 (curl_group GID, verified by running id inside
docker.io/curlimages/curl:8.18.0). Kubernetes recursively chowns the volume
to this GID and adds it to all containers' supplementary groups, which
makes the volume writable for the curl init container and readable for
the inference container regardless of its primary UID.

User-supplied Spec.PodSecurityContext continues to take precedence verbatim
when present, with no merging. New tests in inferenceservice_deployment_test.go
lock both behaviors:
- default returns FSGroup=102 plus seccomp RuntimeDefault
- user-supplied PodSecurityContext is preserved as-is, even when the user
  omits FSGroup (operator must respect explicit user intent)

Also updates the misleading "Users MUST specify runAsUser/runAsGroup..."
comment block in initContainerSecurityContext to reflect the new default
ergonomics.

Reported by @Dazag with a clean reproducer, root-cause hypothesis, and a
working manual workaround. The hypothesis about the non-root init image
is correct; the only correction is the actual UID (101, not 1000, with
GID 102).

Signed-off-by: Christopher Maher <chris@mahercode.io>
@codecov

codecov Bot commented May 10, 2026

Codecov Report

❌ Patch coverage is 43.75000% with 9 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
cmd/main.go | 0.00% | 8 Missing ⚠️
pkg/cli/cache_inspect.go | 0.00% | 1 Missing ⚠️


…e regressions (refs defilantech#420)

The default e2e job runs against kind with local-path-provisioner, which
uses host-path mode 0777. That permissive mode masks fsGroup omissions
and similar restrictive-storage regressions, which is how defilantech#418 reached
release without being caught at PR time.

Adds a second e2e job that runs the same test suite against a kind
cluster with Longhorn installed and set as the default storage class.
Longhorn provisions PVCs as root:root 0755 (matching production storage
classes like OpenShift restricted-v2 SCC and AKS Azure Disk default), so
any future fsGroup or PodSecurityContext regression that would hit a
real-world contributor will now hit CI first.

Marked continue-on-error: true initially because Longhorn-on-kind has
known fragility around iSCSI tooling on GitHub runners. The default
kind job remains the merge gate. Once the Longhorn job has stabilized
across a few PRs it will be promoted to required.

Subsumes the work scoped by defilantech#420.

Signed-off-by: Christopher Maher <chris@mahercode.io>
Defilan changed the title from "fix(controller): default FSGroup to curl_group to unbreak fresh-PVC deploys (closes #418)" to "fix(controller): default FSGroup to curl_group + Longhorn-backed e2e job (closes #418, closes #420)" on May 10, 2026
Defilan added 2 commits May 10, 2026 07:53
Mounting /sys from the host breaks kind's systemd init:
  ERROR: failed to create cluster: could not find a log line that matches
  "Reached target .*Multi-User System.*|detected cgroup v1"

/lib/modules is the canonical Longhorn-on-kind mount. /dev is sometimes
recommended but not required for the basic install path; can be added
back if a later iteration shows iSCSI device-file issues.

Signed-off-by: Christopher Maher <chris@mahercode.io>
…roup flag for OpenShift compat

The previous commit set FSGroup: 102 unconditionally as the default
PodSecurityContext fsGroup. That breaks OpenShift users on the default
restricted-v2 SCC, which uses MustRunAs to inject fsGroup from the
namespace's allocated supplemental-groups range and rejects pods with
explicit values outside that range.

This commit makes the default configurable so OpenShift operators can
opt out:

- New controller flag --default-fsgroup int64 (default 102). Values <= 0
  disable the default, leaving the rendered PodSecurityContext FSGroup
  unset so the SCC admission controller injects an appropriate value.
- New reconciler field DefaultFSGroup propagates the flag value to the
  inferPodSecurityContext rendering.
- Helm chart exposes controllerManager.initContainer.defaultFSGroup
  (default 102), wired through to --default-fsgroup on the controller
  args. Schema constrains the value to integer >= 0.

OpenShift operators set:

  controllerManager:
    initContainer:
      defaultFSGroup: 0

Updated README troubleshooting section to call out the new flag and
both Option-1 (operator-wide disable) and Option-2 (per-InferenceService
override) workflows for OpenShift users.

Test changes:
- New spec "should omit FSGroup when DefaultFSGroup is 0 (OpenShift
  compatibility mode)" exercises the disable path.
- All existing reconciler test fixtures updated to set DefaultFSGroup:
  102 to match the runtime default, keeping pre-existing security-
  context expectations consistent.

Signed-off-by: Christopher Maher <chris@mahercode.io>
Defilan added 6 commits May 10, 2026 08:30
The cache inspector pod (spawned by 'llmkube cache list') waited at most
60s for the pod to reach Running. On storage classes with non-trivial
attach time (Longhorn, OpenShift CSI, remote-attach disks), PVC binding
plus volume attach plus iSCSI session setup can exceed 60s, leaving the
test or user with a misleading 'inspector pod failed to start' error.

Raise the ceiling to 120s. Fast local-path provisioners still resolve in
a couple of seconds; the only behavior change is a longer wait before
the user sees the error when the inspector genuinely cannot start, which
is acceptable for a one-off CLI command.

Surfaced by the kind+Longhorn e2e job in defilantech#419 where the pre-existing
'cache list with Model CR and PVC ... should show active cache entries'
spec was timing out at 121s waiting for an inspector pod that the slower
Longhorn attach path could not bring up in 60s.

Signed-off-by: Christopher Maher <chris@mahercode.io>
The cache-inspect path spawns an inspector pod that mounts a freshly-
provisioned PVC. Slow CSI drivers (Longhorn-on-kind, OpenShift CSI under
load) routinely take 90-120s to attach a volume to a new pod. With the
inner inspector pod-ready timeout at 120s (pkg/cli/cache_inspect.go) and
the outer Eventually capped at 2m, the test got at most one attempt and
failed deterministically when Longhorn was on the slow side of its
attach distribution.

Bumping the outer budget to 4m allows the inner inspector wait to run
twice before the test gives up, restoring retry headroom that was lost
when the inner timeout was raised from 60s.

Other 2-minute Eventually calls in this file are left as-is; they wait
on different state (controller reconcile, deployment readiness, status
phase transitions) that does not exercise the slow-CSI attach path.

Signed-off-by: Christopher Maher <chris@mahercode.io>
…e list

The cache-list test was racing the InferenceService Deployment startup.
'cache list' was called immediately after the PVC existed, before any
pod was Running with that PVC mounted. cache_inspect.go's
findPodWithCachePVC consequently returned nil, falling back to spawning
a separate inspector pod that tried to mount the same RWO PVC. On
Longhorn-backed kind that mount blocks indefinitely (RWO contention
with the Deployment pod still in ContainerCreating attaching the same
volume), and the inspector pod never reaches Running.

Wait for the Deployment to be Available before invoking cache list.
That makes findPodWithCachePVC succeed and exercises the reuse-via-exec
path that real users hit, sidestepping the inspector-pod spawn.

This makes the test deterministic on slow CSI drivers (Longhorn,
OpenShift CSI under load) without hiding any genuine fsGroup or
controller bugs the test was meant to catch.

Signed-off-by: Christopher Maher <chris@mahercode.io>
…ghorn instead

The previous commit (5c6fc78) added 'kubectl wait --for=condition=available'
on the InferenceService Deployment before invoking 'cache list'. That
broke the default kind job because the test fixture serves fake GGUF
bytes ('fake-gguf-data', base64-encoded). The llama-server main
container cannot parse fake GGUF and crashloops, so the Deployment
never reaches Available. The wait timed out at 4m and the test failed.

The original test design relied on cache_inspect.go spawning a transient
inspector pod (which uses a sleep container, not llama-server, and
therefore does not need a real model file). That works on local-path
provisioners but blocks indefinitely on Longhorn because Longhorn
strictly enforces RWO and the PVC the inspector pod wants is also
pending against the Deployment pod.

Revert the deployment-Available wait. Add an env-gated Skip for the
cache-inspect spec under Longhorn, controlled by LLMKUBE_E2E_LONGHORN.
The Longhorn workflow job sets that env to skip just this spec; all
other e2e specs continue to run on Longhorn (where they exercise the
actual fsGroup regression class defilantech#418/defilantech#419 was filed for).

Default kind continues to run the full suite including the cache list
spec, which is where it belongs (local-path-provisioner permits the
multi-RWO mounts that path needs).

Signed-off-by: Christopher Maher <chris@mahercode.io>
…st specs

The previous commit (5df56f3) put the LLMKUBE_E2E_LONGHORN skip inside
the first It of the 'cache list with Model CR and PVC' Context, but the
Context contains four It specs and all of them exercise the same
cache_inspect.go inspector-pod path that is incompatible with Longhorn's
RWO enforcement. Only one was being skipped; the second one ('should
report correct summary with active/orphaned counts') still ran and hit
the same inspector-pod-failed-to-start failure.

Move the skip to a BeforeEach so every spec in the Context skips under
Longhorn. The default kind path continues to run all four specs.

Signed-off-by: Christopher Maher <chris@mahercode.io>
… contexts

Commit 64b129c skipped only the first cache-list Context on Longhorn.
The next CI run revealed two more sibling Contexts in the Cache
Inspection block that also exercise the inspector-pod / RWO-PVC path:

- 'cache list orphaned entry detection' depends on the PVC the previous
  Context creates AND uses cache list (which spawns the inspector pod)
- 'cache list inspector pod lifecycle' explicitly tests the inspector-
  pod spawn path

Add the same LLMKUBE_E2E_LONGHORN-gated Skip BeforeEach to both. The
two sibling Contexts that do not depend on PVC mounts ('cache list
graceful fallback without PVC' and 'cache list help documentation')
continue to run on Longhorn unchanged.

Signed-off-by: Christopher Maher <chris@mahercode.io>
@Defilan Defilan merged commit adce90f into defilantech:main May 10, 2026
20 checks passed
@github-actions github-actions Bot mentioned this pull request May 9, 2026


Development

Successfully merging this pull request may close these issues.

ci: add Longhorn-backed K3s e2e job to catch fsGroup / PVC permission regressions
[BUG] Permission Denied on Model Download
