reftracker: return a snapshot from AppsForRef / RefsForApp to fix concurrent-map panic #1817

Open
SAY-5 wants to merge 1 commit into carvel-dev:develop from SAY-5:fix/reftracker-return-snapshot-1812

SAY-5 commented Apr 20, 2026

Fixes #1812.

AppRefTracker protects its internal maps with a sync.Mutex, but AppsForRef and RefsForApp returned a.refsToApps[refKey] / a.appsToRefs[appKey] directly. The caller (e.g. SecretHandler.enqueueAppsForUpdate) then iterates that returned map without holding the tracker lock:

    apps, err := sch.appRefTracker.AppsForRef(reftracker.NewSecretKey(...))
    ...
    for refKey := range apps {  // concurrent modifier can fire here
        ...
    }

Under the reported production load (1,680+ namespaces, rapid Secret and ConfigMap churn, many reconcile goroutines), a parallel ReconcileRefs or RemoveAppFromAllRefs mutates the very same inner map the handler is ranging over, and the Go runtime aborts with

    fatal error: concurrent map iteration and map write

crashing the kapp-controller pod.

This PR returns a shallow copy of the inner set from both lookup methods so callers can iterate without holding the tracker lock. The copy is cheap (the number of refs per app is small in practice; the outer app-to-refs map stays unbounded either way), and concurrent writers keep exclusive ownership of the originals. A small cloneRefKeySet helper keeps the two call sites in sync.
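For readers who want to see the shape of the change, here is a minimal, self-contained sketch of the snapshot pattern. It is not the actual kapp-controller code: `Tracker`, `RefKey`, `RefKeySet`, `NewTracker`, and `AddAppForRef` are hypothetical stand-ins, keys are plain strings, and the error return of the real lookup methods is dropped for brevity. Only the names `AppsForRef`, `RefsForApp`, `refsToApps`, `appsToRefs`, and `cloneRefKeySet` come from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// RefKey and RefKeySet are hypothetical stand-ins for the tracker's key types.
type RefKey string
type RefKeySet map[RefKey]struct{}

// Tracker is a simplified stand-in for AppRefTracker (assumption, not the real type).
type Tracker struct {
	mu         sync.Mutex
	refsToApps map[RefKey]RefKeySet
	appsToRefs map[RefKey]RefKeySet
}

func NewTracker() *Tracker {
	return &Tracker{
		refsToApps: map[RefKey]RefKeySet{},
		appsToRefs: map[RefKey]RefKeySet{},
	}
}

// cloneRefKeySet returns a shallow copy of the set; a nil set stays nil.
func cloneRefKeySet(in RefKeySet) RefKeySet {
	if in == nil {
		return nil
	}
	out := make(RefKeySet, len(in))
	for k := range in {
		out[k] = struct{}{}
	}
	return out
}

// AddAppForRef is a hypothetical writer that mutates both inner maps under the lock.
func (t *Tracker) AddAppForRef(ref, app RefKey) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.refsToApps[ref] == nil {
		t.refsToApps[ref] = RefKeySet{}
	}
	t.refsToApps[ref][app] = struct{}{}
	if t.appsToRefs[app] == nil {
		t.appsToRefs[app] = RefKeySet{}
	}
	t.appsToRefs[app][ref] = struct{}{}
}

// AppsForRef copies the inner set while holding the lock, so the caller
// never ranges over a map that a writer can still touch.
func (t *Tracker) AppsForRef(ref RefKey) RefKeySet {
	t.mu.Lock()
	defer t.mu.Unlock()
	return cloneRefKeySet(t.refsToApps[ref])
}

// RefsForApp mirrors AppsForRef for the reverse index.
func (t *Tracker) RefsForApp(app RefKey) RefKeySet {
	t.mu.Lock()
	defer t.mu.Unlock()
	return cloneRefKeySet(t.appsToRefs[app])
}

func main() {
	t := NewTracker()
	t.AddAppForRef("secret/ns/a", "app/ns/one")

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // writer goroutine, standing in for a concurrent reconcile
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			t.AddAppForRef("secret/ns/a", RefKey(fmt.Sprintf("app/ns/%d", i)))
		}
	}()

	// Reader iterates snapshots without holding the tracker lock.
	for i := 0; i < 1000; i++ {
		for range t.AppsForRef("secret/ns/a") {
		}
	}
	wg.Wait()
	fmt.Println("apps tracked:", len(t.AppsForRef("secret/ns/a")))
}
```

Returning the inner map directly from `AppsForRef` in this sketch would let the reader loop race with the writer goroutine; `go run -race` would be one way to check that. The trade-off of the copy is that a caller sees the set as it was at lookup time, which is normally acceptable for enqueueing work.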


Development

Successfully merging this pull request may close these issues.

Race Condition in AppRefTracker: Concurrent Map Iteration and Write
