Skip to content

chore: code review fixes for DHCP races, atomic state, and platform robustness#3

Merged
CMGS merged 18 commits into
mainfrom
chore-code-review-202605
May 13, 2026
Merged

chore: code review fixes for DHCP races, atomic state, and platform robustness#3
CMGS merged 18 commits into
mainfrom
chore-code-review-202605

Conversation

@CMGS
Copy link
Copy Markdown
Contributor

@CMGS CMGS commented May 13, 2026

Summary

A batch of correctness and resilience fixes surfaced by code review. The headline change is the DHCP REQUEST race (a TOCTOU between isFree and markUsed that let two clients win the same IP); the rest are smaller robustness fixes on the state, route, and platform paths.

DHCP (race + ordering)

  • pool.tryClaim folds the validate-and-commit into one critical section so two concurrent REQUESTs for the same free IP cannot both pass. Also drop the IP4ToUint32(v4.String()) round-trip and key the pool directly with binary.BigEndian.Uint32 on the 4-byte form.
  • Delete /32 route before releasing the IP in cleanupLoop and handleRelease; releasing first opened a window where a concurrent DISCOVER could install its own route and have ours wipe it.
  • leaseStore.add now returns []evictedLease (MAC + IP) so the caller can clean the orphaned route + pool slot when the same MAC rebinds to a different IP. Previously the entry was silently overwritten and the old IP was stranded.

State (atomic + cleanup)

  • pool.State.Save writes to pool.json.tmp and renames over pool.json — mirrors the lease store. A crash mid-write no longer leaves a half-written file Load refuses to parse.
  • cmd teardown removes leases.json alongside pool.json so re-init doesn't restore stale leases.

Platform

  • GKE assignAliasIP reads existing nic0 aliases, returns early if our entry is already bound, otherwise appends — gcloud --aliases=... is a full replacement, so previously a re-run clobbered operator-configured aliases.
  • Volcengine: drop the env field that was only ever written; loadEnv is now a side-effect-only function that exports env vars for the ve child binary (its only consumer).
  • platform.SubnetIPs validates the gateway (errors on empty/malformed) and stops iteration before the broadcast.

CLI / misc

  • splitTrim filters empty entries so an empty --dns doesn't slip through as [""].
  • node.setupSysctl logs-and-continues on per-key write failure (per-interface keys vanish when ENIs are detached out-of-band; aborting startup on those was painful).
  • Drive-by: check commonlog.Setup's error in cmd.root (pre-existing lint issue surfaced when the cache was cleared).

Tests

0 unit tests existed before; added focused coverage for the most race-prone and untested areas:

  • dhcp/pool_test.gotryClaim concurrency + idempotency
  • dhcp/lease_test.go — eviction paths (same-MAC-rebind, other-MAC-same-IP, expired-other-MAC), save/load
  • pool/pool_test.go — Save/Load roundtrip, atomic-write recovery (bogus .tmp must be ignored)
  • cmd/helpers_test.gosplitTrim table tests
  • platform/ip_test.goSubnetIPs for /24 + /28 with explicit broadcast / gateway / network exclusions

Test plan

  • make lint passes on GOOS=linux and GOOS=darwin (verified locally — 0 issues both targets)
  • go test -race -count=1 ./... clean (verified locally)
  • Cross-compile build clean for GOOS=linux and GOOS=darwin
  • Manual smoke on a GKE node: init → daemon → teardown leaves no pool.json / leases.json behind
  • Manual smoke on a Volcengine node: ENI detach during daemon run no longer panics startup

CMGS added 15 commits May 13, 2026 15:40
handleRequest used to call pool.isFree() and then pool.markUsed() under
two separate lock acquisitions, so two concurrent REQUESTs for the same
free IP could both pass the free check and both commit, producing a
double-allocated IP and a phantom lease until eviction. tryClaim folds
the check-and-commit into one critical section so exactly one racer
wins; the loser is NAKed.

While here, drop the platform.IP4ToUint32(v4.String()) round-trip on the
hot pool path and pack the 4-byte IPv4 directly with
binary.BigEndian.Uint32, and switch pool.used to map[uint32]struct{}
since the bool value was unused. Add pool concurrency tests covering
the race, idempotency, and out-of-pool input.
The previous Save was a single os.WriteFile, so a crash mid-write would
leave behind a half-written pool.json that Load would refuse to parse,
forcing a manual recovery. Mirror dhcp/lease.go's pattern: write to
pool.json.tmp first and rename over pool.json so the final file is
either complete or unchanged.

Add a pool_test.go covering Save/Load roundtrip, the atomic-write
recovery path (a bogus .tmp must be ignored by Load), the missing-file
case, and idempotent Delete.
cleanupLoop and handleRelease used to call pool.release before delRoute.
Releasing first opens a tight window: a concurrent DISCOVER can claim
the freed IP and install its own /32 route, only for our late delRoute
to remove that brand-new route and blackhole the new lease's traffic.
Swap the order so the route teardown completes against the lease that
owned it, then the IP returns to the free pool.
leaseStore.add used to silently overwrite an existing entry when the
same MAC ACKed a different IP. The old IP's /32 route and pool.used
slot were left dangling — the route blackholed nothing useful but
locked the IP out of reallocation until cleanupLoop noticed the
original expiry, which never happens because the entry was overwritten.

Replace the []string return with []evictedLease that carries both the
displaced MAC and IP. handleRequest now distinguishes the two cases:
same-MAC-different-IP triggers delRoute + pool.release for the old IP
(orphan cleanup); other-MAC-same-IP keeps the route (still valid for
the new lease) and just logs. Add lease_test.go covering both eviction
paths, renewal (no eviction), and the save/load roundtrip.
`gcloud compute instances network-interfaces update --aliases=...` is
a full replacement, so assignAliasIP used to overwrite any aliases the
operator had configured on nic0 (or that a previous run of cocoon-net
left in place). Mirror Teardown's read-modify-write: describe nic0
first, return early if our entry is already bound, otherwise append
to the existing list and write back the union. This makes init/adopt
safe to re-run and safe on shared-NIC topologies.
Volcengine.env was only ever written. The `ve` child binary is the
sole consumer and reads credentials/region from the process
environment loadEnv exports, not from the parent struct. Strip the
field and rewrite loadEnv to return just an error: the side effect
(os.Setenv) is now its declared purpose, which matches what callers
actually rely on.
teardown deleted pool.json but left the daemon's leases.json behind in
the same state dir. A subsequent init that recreated pool.json would
then have the daemon restore stale leases (and stale /32 routes) for
the new pool. Remove leases.json alongside pool.json; NotFound is
ignored so teardown stays idempotent.
A bare or whitespace-only --dns flag (or trailing commas like "a,")
used to yield []string{""} or a slice with empty entries, which then
slipped through parseIPs as zero valid IPs and was silently treated
as "no DNS configured" instead of failing or falling back. Filter
empties after TrimSpace and return nil for the all-empty case so the
existing len()==0 check in daemon.go takes the fallback path.

Add helpers_test.go covering the empty, whitespace-only, mixed-empty,
and well-formed cases.
SubnetIPs used to swallow gateway parse errors (an empty or malformed
gateway just meant "every host IP is fair game") and would happily
hand out the broadcast address as a usable host. Reject an
unparseable gateway up front so misconfigured callers fail fast, and
stop iteration before the broadcast so /24 and /28 pools never lease
out network|~mask.

Add ip_test.go with /24 and /28 round-trips that pin the gateway,
broadcast, and network-address exclusions plus error coverage for
empty/invalid gateway, invalid CIDR, and IPv6 input.
Previously, the first sysctl write that failed (typically a
per-interface rp_filter entry for an ENI that had been detached
out-of-band) returned an error and aborted node.Setup, leaving the
daemon unrunnable until an operator cleared the stale interface
reference. Log the failure and continue so transient stale entries
do not block startup; the iptables/route layer surfaces real
connectivity problems if any. Drop the now-unused error return.

Drive-by: check commonlog.Setup's error in cmd.root so the lint
suite (which surfaced the pre-existing errcheck violation when the
cache was cleared) stays clean.
CI was failing because the pinned cocoon-common v0.2.1 (eaa6e94) ships
log.Setup as void, but cmd/root.go in this branch now checks its
returned error. The newer pseudo-version (4c7f78a) carries the
fix(log)! breaking change that returns the error, so cocoon-net builds
again on CI.
Drop comments that restate the code or narrate the fix:
- evictedLease godoc collapses to bullet-only WHY (no rehash of cleanup
  rules already encoded in request.go)
- request.go inline now refers to evictedLease instead of duplicating
  the orphan-cleanup explanation
- pool_test/lease_test test-name doc comments lose the 'with the old
  isFree+markUsed split' / 'finding flags' narration
- volcengine env.go loadEnv godoc trims the redundant 'returns no
  value because' clause
Picks up the latest chore-code-review-202605 head from cocoon-common so
cocoon-net tracks the same code-review pass (commonlog.Setup error
return, DetectNodeIP error return, HasCocoonTolerationKey rename,
OwnerDeploymentName (string, bool)). No source-level changes are needed
here — cmd/root.go and the platform helpers in this branch already
match the new signatures.
Split sysctl tuning into required global keys (ip_forward, all/default/cni0
rp_filter) and best-effort per-interface keys. A write failure on a required
key now returns an error from Setup instead of being logged-and-continued,
so the daemon won't come up "successfully" with broken forwarding.

Also trim verbose comments across node/cmd packages.
@CMGS CMGS requested a review from Copilot May 13, 2026 09:51
@CMGS CMGS requested review from Copilot and removed request for Copilot May 13, 2026 09:52
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR applies a set of correctness and robustness fixes across the DHCP server, persisted state handling, and cloud-platform integration, with new unit tests added to cover previously untested, race-prone paths.

Changes:

  • Fix DHCP REQUEST race conditions by atomically claiming pool IPs and improving lease eviction cleanup (including route/pool orphan cleanup on same-MAC rebind).
  • Make pool state persistence crash-safe via atomic tmp+rename writes and expand teardown cleanup to remove persisted leases.
  • Improve platform behavior (GKE alias IP updates without clobbering existing aliases; Volcengine env loading as a side-effect-only path; safer SubnetIPs generation) and add focused tests.

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pool/pool.go Make pool state Save atomic using tmp + rename; clarify Load behavior with leftover tmp files.
pool/pool_test.go Add Save/Load roundtrip + atomic tmp ignore tests and Delete idempotency coverage.
platform/volcengine/volcengine.go Remove unused env field; New() now loads env once and returns an empty handle.
platform/volcengine/env.go Convert loadEnv into side-effect-only env exporter (no returned config struct).
platform/ip.go Harden SubnetIPs (gateway parsing validation, IPv4-only, skip broadcast).
platform/ip_test.go Add coverage for SubnetIPs (broadcast/gateway/network exclusions, error cases) + existing helpers.
platform/gke/gcloud.go Prevent alias clobbering by read-modify-write merge of nic0 aliases; no-op if already present.
node/sysctl.go Split required vs best-effort sysctl writes; log-and-continue for per-interface sysctls.
node/node.go Adjust Setup doc comments to reflect ordering constraints more succinctly.
go.mod Bump cocoon-common dependency version.
go.sum Update checksums to match the bumped cocoon-common version.
dhcp/request.go Use atomic pool.tryClaim to eliminate REQUEST TOCTOU; release claims on build/route failure; cleanup orphaned IPs on same-MAC rebind.
dhcp/release.go Fix ordering: delete /32 route before releasing IP back to pool.
dhcp/pool.go Use uint32 key directly from packed IPv4; add tryClaim; change used map to set semantics.
dhcp/pool_test.go Add concurrency/idempotency tests for tryClaim and allocation behavior.
dhcp/lease.go Return structured eviction info (MAC+IP) from add() to enable cleanup on rebind/conflict.
dhcp/lease_test.go Add tests for eviction cases and save/load roundtrip behavior.
dhcp/cleanup.go Fix ordering in cleanup: delete /32 route before releasing IP for expired leases.
cmd/teardown.go Remove leases.json alongside pool.json during teardown.
cmd/root.go Handle and report commonlog.Setup errors instead of ignoring them.
cmd/helpers.go Make splitTrim drop empty entries and return nil for empty/whitespace-only input.
cmd/helpers_test.go Add table tests for splitTrim behavior.
cmd/daemon.go Minor comment adjustments (no functional change).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread dhcp/pool.go
Comment thread platform/ip.go
ipKey: a 16-byte net.IP doesn't read out-of-bounds; it reads from the
wrong offset and yields a junk key. Rewording reflects the real failure.

SubnetIPs: comment claimed iteration stops "one address before broadcast";
actual code iterates through and skips bcast via `continue`. Trim the doc
to describe what the code does.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 22 out of 23 changed files in this pull request and generated 3 comments.

Comment thread platform/gke/gcloud.go
Comment thread platform/ip.go
Comment thread cmd/teardown.go
assignAliasIP: re-running with a different --subnet would leave the old
cocoon-pods alias in place and add a second one. Drop existing entries
under our range name before merging so a new cidr replaces the old; keep
operator-managed entries under other names.

SubnetIPs: reject a gateway outside the CIDR or equal to network /
broadcast. The skip-on-equality only mattered when the gateway was a
real host; misconfigs now fail fast instead of silently producing a
pool that includes the actual gateway.
@CMGS CMGS merged commit 29297d0 into main May 13, 2026
4 checks passed
@CMGS CMGS deleted the chore-code-review-202605 branch May 13, 2026 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants