Skip to content

chore: code review fixes 2026-05 (with breaking API tightenings)#2

Merged
CMGS merged 20 commits into
mainfrom
chore-code-review-202605
May 13, 2026
Merged

chore: code review fixes 2026-05 (with breaking API tightenings)#2
CMGS merged 20 commits into
mainfrom
chore-code-review-202605

Conversation

@CMGS
Copy link
Copy Markdown
Contributor

@CMGS CMGS commented May 13, 2026

Summary

Round of code-review fixes across auth, k8s, apis/v1, meta, and log — a mix of bug fixes, API tightenings, and structural refactors.

Breaking API changes (coordinated downstream PRs required)

The following signatures or names changed. Each is annotated in the commit with a ! and a body explaining the migration. Downstream projects (vk-cocoon, cocoon-operator, cocoon-webhook, glance, epoch, cocoon-net) need follow-up PRs.

  • auth.VerifySession — now also rejects expired sessions (when Exp != 0 and in the past). Downstream epoch/server/auth_session.go and glance/auth/sso/session.go perform a redundant sess.Exp < time.Now().Unix() check that can be deleted in their respective PRs.
  • log.Setup(ctx, envVar) error — was func Setup(ctx, envVar) and called Fatalf internally. Callers now choose the failure policy. Consumers: vk-cocoon/main.go, cocoon-operator/main.go, cocoon-webhook/main.go, glance/main.go, epoch/main.go, cocoon-net/cmd/root.go.
  • meta.OwnerDeploymentName(ownerRefs) (string, bool) — was string. The bool disambiguates "no owner" from "owner present but no parseable hash suffix".
  • k8s.DetectNodeIP() (string, error) — was string. Previously returned "127.0.0.1" on failure, masking misconfigured network namespaces. vk-cocoon/main.go:77 needs to handle the new return.
  • meta.HasCocoonToleration -> meta.HasCocoonTolerationKey — rename only, semantics unchanged. The new name makes the legacy looseness (only Key is matched) explicit. Consumers: cocoon-webhook/admission/pod_mutator.go, cocoon-webhook/admission/workload_validator.go, glance/discovery/kubernetes/metadata.go.

Additive / soft-deprecation

  • meta.ExtractAgentSlot(ns, cs, vmName), meta.MainAgentVMNameFor(ns, cs), meta.InferRoleFromAgentSlot(slot), meta.AgentVMNamePrefix(ns, cs) are new. The legacy ExtractSlotFromVMName / MainAgentVMName / InferRoleFromVMName trio is now godoc-deprecated because it misclassifies toolbox VM names whose pod-name ends in -N (e.g. toolbox db-2). Downstream call sites in glance/discovery/kubernetes/ownership.go and cocoon-common/meta/vmspec.go::ShouldSnapshotVM still work today but should migrate.
  • auth.RandomState now panics (with a wrapped message) on crypto/rand.Read failure instead of silently returning an all-zero state. Behavior change only on the previously-unreachable failure path.
  • k8s.LoadOrGenerateCert now falls back to self-signed when the on-disk cert is expired. A tls.LoadX509KeyPair failure still surfaces as fatal.
  • apis/v1.CocoonSetStatus.{Agents,Toolboxes} gained +listType=map markers. CRD YAML regenerated.

Non-API improvements

  • meta package split: meta.go is now the package godoc; constants moved to keys.go (with one-line docs), helpers split into owner.go, vmname.go, connection.go.
  • meta.LabelManagedBy removed (no consumers in any cocoonstack project).
  • apis/v1/enums.go modernized: cmp.Or for Default, slices.Contains against per-type value tables for IsValid.
  • meta.LifecycleState.IsTerminal now table-driven; const block moved above the type to match the package convention.

Test plan

  • make lint 0 issues on GOOS=linux and GOOS=darwin
  • go test -race -count=1 ./... clean (all packages green)
  • CRD diff inspected — both list-map markers emitted correctly
  • Coordinated downstream PRs to follow for each breaking change above

CMGS added 16 commits May 13, 2026 15:36
VerifySession parsed Exp but never validated it, leaving caller correctness
as the only gate against expired cookies. Reject sessions where Exp is set
and in the past; Exp == 0 still means "no expiry" so callers that opt out
of expiration keep working.

Downstream consumers in epoch and glance already perform a redundant Exp
check at the call site; those will be cleaned up in coordinated PRs.
Swallowing rand.Read's error meant a degraded OS RNG would silently
return all-zero state, defeating CSRF protection. Panic instead — the
syscall is documented to never fail on supported platforms, and if it
ever does, halting is the only safe option.
LoadOrGenerateCert previously accepted any keypair that parsed, so a
kubelet whose mounted secret rotated out from under it would silently
keep serving an expired cert until the API server refused TLS — a
diagnosis-hostile failure mode.

Parse the leaf cert after a successful keypair load; if NotAfter is in
the past, log a warning and mint a self-signed cert instead. Parse
errors on the cert chain are logged but non-fatal so a working keypair
is never gratuitously discarded.

Keypair load failures (corrupt PEM, mismatched key) continue to surface
as fatal — those signal operator misconfiguration and silent substitution
would hide them.
Status arrays without an x-kubernetes-list-type were treated as atomic
by Server-Side Apply, so two controllers (e.g. the operator and vk)
patching different slots would clobber each other on every write.

Tag Agents with listMapKey=slot and Toolboxes with listMapKey=name so
apiserver merges entries by their natural identifier. Regenerate the
CRD YAML.
Move the const block above the type declaration to match the rest of
the package, and back IsTerminal with a package-private terminalStates
map so adding or removing a terminal state is a single-line edit
instead of a switch-case maintenance burden.
Replace each per-type Default's if-empty-return ladder with cmp.Or, and
each IsValid's hand-rolled OR chain with slices.Contains against a
package-private value table. Adding a new enum member is now a single
edit to the const block plus one entry in the validity table — no
parallel switch to keep in sync.

Behavior is unchanged: all existing IsValid/Default tests pass without
modification.
meta.go was a 160-line grab-bag of every constant and helper in the
package. Split by concern so each file matches one responsibility:

  keys.go        const vocabulary (annotations, labels, roles)
  owner.go       toleration + owner-reference helpers
  vmname.go      VM name builders + slot extraction
  connection.go  connection-protocol inference

meta.go keeps only the package godoc. No exported symbol changes;
godoc strings on each annotation/label const are new.
LabelManagedBy was declared but never referenced — not by cocoon-common,
not by any downstream consumer (vk-cocoon, cocoon-operator,
cocoon-webhook, glance, epoch). Dead vocabulary in the public API
costs maintenance with no payoff.
Setup used to call Fatalf on failure, which killed the process before
the caller could run defers or surface a structured exit. The new
signature returns the wrapped error so callers in main packages can
log via their own logger and choose the exit policy.

Breaking change: all downstream consumers must check the returned
error. Each consumer (vk-cocoon, cocoon-operator, cocoon-webhook,
glance, epoch, cocoon-net) will be updated in a coordinated PR.
The previous string-only return conflated three different signals into
"": no owners at all, no ReplicaSet among the owners, and a
ReplicaSet whose name had no hash suffix. Callers had no way to tell
those apart and either treated empty as benign (silently dropping the
"no owner" case) or wrote a duplicate check on the input.

Return a comma-ok pair instead so the caller decides. Add table-driven
tests for each branch.

Breaking change: downstream callers (none today outside cocoon-common,
mentioned only in cocoon-specs docs) must adopt the comma-ok form.
Returning "127.0.0.1" when net.InterfaceAddrs failed (or returned only
loopback / IPv6) made misconfigured hosts look healthy: a vk-cocoon
that couldn't see its own primary IP would advertise localhost,
register fine with the API server, then break every kubelet TLS
handshake from outside.

Surface the failure instead. Callers that want the previous fallback
write `if err != nil { ip = "127.0.0.1" }` and own the decision.
ErrNoNodeIP is exported so callers can branch on the "no IPv4
interface" case specifically.

Breaking change: vk-cocoon main.go must adopt the new signature in a
coordinated PR.
The function only checks the Toleration Key; Operator, Value, and
Effect are ignored. The previous name implied a stronger match than
the implementation gave, so any caller wanting strict semantics had
to read the body to find out. Rename to make the looseness explicit:
the cocoon-webhook gate is intentionally permissive and a stricter
check belongs at the call site.

Breaking change: downstream callers (cocoon-webhook, glance) must
adopt the new name.
The legacy ExtractSlotFromVMName / MainAgentVMName / InferRoleFromVMName
trio split on the last dash, which misclassifies a toolbox whose
pod-name happens to end in -N (e.g. toolbox "db-2" produces vmName
vk-NS-CS-db-2 and reads as slot 2). The bug surfaces in glance role
inference and in vmspec.ShouldSnapshotVM under MainOnly policy.

Add a (namespace, cocoonSet)-anchored variant family:

  AgentVMNamePrefix      shared agent-VM-name prefix builder
  ExtractAgentSlot       parses the slot only when the prefix matches
                         and the suffix has no inner dash
  MainAgentVMNameFor     constant function — no parsing involved
  InferRoleFromAgentSlot pure int -> role mapping

Existing helpers are kept and marked Deprecated in the godoc so
downstream consumers can migrate one call site at a time.
The PR-context framing ("the previous implementation returned X")
belongs in the commit message, not in code that has to live with
every future reader.
Trim narrative godocs that restate the signature or describe historical
behavior. Keep only the contract and non-obvious WHY.
k8s/tls.go: LoadOrGenerateCert now takes ctx as its first argument and
threads it through tryLoadDiskCert and isCertExpired so library
logging no longer fabricates context.Background(). Drop the redundant
os.Stat pre-check in tryLoadDiskCert; LoadX509KeyPair's fs.ErrNotExist
covers the "missing cert" path and stops swallowing permission errors.
Capture time.Now() once in GenerateSelfSignedCert.

meta/lifecycle.go: inline LifecycleStatus.Apply so the per-pod
reconcile path no longer allocates a throwaway map[string]any with
boxed values just to iterate it.

meta/connection.go: compare against string(cocoonv1.OS*) instead of
raw "android"/"windows" literals so renames break the build.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR applies a set of cross-package refactors and API tightenings across auth, k8s, meta, log, and apis/v1 to improve correctness (e.g., session expiry handling, node IP detection), make failure modes explicit (returning errors instead of fatal exits), and clarify/extend metadata helpers (agent/toolbox VM naming disambiguation, owner parsing).

Changes:

  • meta: split helpers/keys into focused files, add agent-slot-aware VM naming helpers, rename toleration helper, and tighten OwnerDeploymentName to return (string, bool).
  • k8s: make DetectNodeIP return (string, error) and enhance TLS cert loading to fall back to self-signed when disk certs are expired.
  • auth/log/apis/v1: session verification now rejects expired sessions, log setup returns an error, and enum validation/defaulting is modernized; CRD list-map markers added for status arrays.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
meta/vmname.go Introduces agent-vs-toolbox-safe VM naming helpers and deprecates legacy parsing helpers.
meta/owner.go Renames toleration helper and changes OwnerDeploymentName to return (string, bool).
meta/meta.go Converts meta.go into package-level documentation after splitting helpers/keys into new files.
meta/meta_test.go Adds tests for new VM naming helpers and updated owner/toleration behaviors.
meta/lifecycle.go Refactors IsTerminal to use a lookup table.
meta/keys.go Moves metadata keys/constants into a dedicated file with one-line docs.
meta/connection.go Moves ConnectionType helper into a dedicated file.
log/log.go Changes Setup to return an error instead of fatal logging.
k8s/tls.go Refactors cert loading to fall back to self-signed on expired disk certs; adds logging.
k8s/tls_test.go Adds coverage for expired disk cert fallback and updates node IP test for new error return.
k8s/netip.go Changes DetectNodeIP to return (string, error) and introduces ErrNoNodeIP.
auth/state.go Makes RandomState panic on crypto/rand.Read failure.
auth/session.go Updates VerifySession to reject expired sessions.
auth/session_test.go Adds tests for session expiry and Exp == 0 behavior.
apis/v1/enums.go Modernizes enum IsValid/Default implementations using slices + cmp.Or.
apis/v1/crds/cocoonset.cocoonstack.io_cocoonsets.yaml Regenerates CRD with list-map markers for status lists.
apis/v1/cocoonset_types.go Adds +listType=map / +listMapKey markers to status fields for stable patch semantics.
Comments suppressed due to low confidence (3)

meta/vmname.go:105

  • This deprecation guidance suggests InferRoleFromAgentSlot(ExtractAgentSlot(ns, cs, vmName)), but ExtractAgentSlot returns -1 for non-agent names and InferRoleFromAgentSlot currently maps that to RoleSubAgent. Update the guidance to include a slot<0 branch (or adjust InferRoleFromAgentSlot) so the suggested migration path can’t misclassify toolboxes.
	if idx < 0 {
		return s, "", false
	}
	return s[:idx], s[idx+len(sep):], true
}

meta/owner.go:41

  • OwnerDeploymentName claims to return ok=false when the ReplicaSet name has no recognizable hash suffix, but the implementation treats any name containing a '-' as OK (e.g. "demo-" or "demo-foo" would return ok=true). Either validate the suffix matches the expected Deployment ReplicaSet hash format or adjust the doc/comment to match the actual behavior.
	for _, ref := range ownerRefs {
		if ref.Kind != "ReplicaSet" {
			continue
		}
		if before, _, ok := lastCut(ref.Name, "-"); ok {
			return before, true
		}
	}
	return "", false
}

auth/state.go:20

  • The doc says RandomState panics with a wrapped message on rand.Read failure, but the code panics with a formatted string, losing the original error type/chain. Prefer panicking with an error value (e.g. fmt.Errorf("crypto/rand.Read: %w", err)) or update the comment to avoid implying wrapping.
// Panics on crypto/rand failure — a weak nonce silently breaks CSRF.
func RandomState() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(fmt.Sprintf("crypto/rand.Read: %v", err))
	}
	return hex.EncodeToString(b)
}


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread meta/vmname.go
Comment thread k8s/tls.go
Without this defensive branch, the suggested migration chain
InferRoleFromAgentSlot(ExtractAgentSlot(ns, cs, vmName)) misclassified
toolbox VM names as RoleSubAgent — ExtractAgentSlot returns -1 for
non-agent names but the legacy slot != 0 fall-through still hit
RoleSubAgent. Also drop the unused MainAgentVMName helper.
@CMGS
Copy link
Copy Markdown
Contributor Author

CMGS commented May 13, 2026

@copilot review

Copy link
Copy Markdown

Copilot AI commented May 13, 2026

@copilot review

Reviewed current PR state and CI status; no new issues found and no additional code changes are needed at this time (latest relevant commit: 27035ab).

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

Comment thread auth/session.go
Comment thread k8s/tls.go Outdated
CMGS added 2 commits May 13, 2026 17:18
Previously, tryLoadDiskCert silently fell back to self-signed for any
fs.ErrNotExist from LoadX509KeyPair — including the asymmetric case
where the cert exists but the key is missing (or vice versa). Stat
the cert path explicitly so only "cert path missing" triggers the
fallback; other errors (e.g. key missing while cert exists, permission
denied) propagate so operators can detect the misconfiguration.
@CMGS
Copy link
Copy Markdown
Contributor Author

CMGS commented May 13, 2026

@copilot review

Copy link
Copy Markdown

Copilot AI commented May 13, 2026

@copilot review

Reviewed current PR head d15dd3c: latest lint/test workflow runs are green, and no new issues were found in this pass. UI changes: none (screenshot N/A).

Copilot finished work on behalf of CMGS May 13, 2026 09:26
@CMGS CMGS merged commit fbadeee into main May 13, 2026
5 checks passed
@CMGS CMGS deleted the chore-code-review-202605 branch May 13, 2026 09:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants