Skip to content

fix(operator): remove per-session monitorPod goroutines to fix memory leak#1004

Merged
Gkrumbach07 merged 2 commits intomainfrom
fix/operator-remove-monitorpod-goroutines
Mar 24, 2026
Merged

fix(operator): remove per-session monitorPod goroutines to fix memory leak#1004
Gkrumbach07 merged 2 commits intomainfrom
fix/operator-remove-monitorpod-goroutines

Conversation

@Gkrumbach07
Copy link
Copy Markdown
Contributor

Summary

  • Remove legacy monitorPod goroutine-per-session pattern that caused memory to grow linearly with session count and never release
  • Move inactivity auto-stop check into the existing controller-runtime reconcileRunning reconciler
  • Net -275 lines — the 10-worker controller-runtime pool already handled all pod status monitoring, phase transitions, and token refresh

Context

On UAT with 214 running sessions, operator memory was 161Mi. After stopping 145 sessions (down to 69), memory stayed at 162Mi. Root cause: 214 monitorPod goroutines (each polling every 5s) duplicated work already done by the controller-runtime reconciler. Go's runtime retained the heap high-water mark even after goroutines exited.

Test plan

  • go build ./... passes
  • go test ./internal/handlers/ -run "TestShouldAutoStop|TestTriggerInactivityStop" — all pass
  • Deploy to UAT, verify memory drops proportionally when sessions are stopped
  • Verify inactivity auto-stop still triggers (check logs for idle beyond inactivity timeout)
  • Verify session lifecycle (create → run → stop) works end-to-end

🤖 Generated with Claude Code

… leak

The legacy monitorPod function spawned a long-lived goroutine per session
(5s ticker polling API server for pod/session status). With 214 sessions
this meant 214 goroutines duplicating work already handled by the
controller-runtime reconciler, causing memory that never released after
sessions stopped.

Remove monitorPod entirely and move its only unique functionality
(inactivity auto-stop) into reconcileRunning. The 10-worker controller-runtime
pool already handles pod status, phase transitions, and token refresh.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 704ce6f2-a81c-48e4-a20a-20faac9c961b

📥 Commits

Reviewing files that changed from the base of the PR and between 9890bee and f493054.

📒 Files selected for processing (1)
  • components/operator/internal/handlers/sessions.go

Walkthrough

Controller reconcile for running sessions now performs an inactivity check after refreshing runner tokens and can trigger an automatic stop; legacy in-process pod-monitoring goroutines were removed, delegating steady-state monitoring to controller-runtime reconcilers.

Changes

Cohort / File(s) Summary
Inactivity handler exports
components/operator/internal/handlers/inactivity.go, components/operator/internal/handlers/inactivity_test.go
Renamed and exported shouldAutoStopShouldAutoStop and triggerInactivityStopTriggerInactivityStop; updated tests and package comments to reflect reconciler-driven invocation.
Reconciler inactivity integration
components/operator/internal/controller/reconcile_phases.go
Added post-token-refresh inactivity check in reconcileRunning that logs idle sessions, calls TriggerInactivityStop(namespace, name), and returns requeue results: immediate requeue on successful trigger or delayed requeue on failure.
Legacy monitoring removal
components/operator/internal/handlers/sessions.go
Removed monitoredPods map, its mutex, and the entire monitorPod goroutine/ticker loop and related bookkeeping; controller-runtime reconciler now handles monitoring responsibilities.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Controller as Controller (reconciler)
  participant Handlers as Handlers (ShouldAutoStop/TriggerInactivityStop)
  participant K8s as Kubernetes API / Pod
  Controller->>K8s: refresh runner token / inspect session & pod
  Controller->>Handlers: ShouldAutoStop(session)
  alt session idle
    Controller->>Handlers: TriggerInactivityStop(namespace,name)
    Handlers->>K8s: update session/request stop
    Handlers-->>Controller: success / error
    alt success
      Controller-->>Controller: return Requeue: true
    else error
      Controller-->>Controller: return RequeueAfter: 30s
    end
  else not idle
    Controller-->>Controller: continue normal reconcile (requeue after interval)
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The PR title accurately describes the main change: removing per-session monitorPod goroutines to fix a memory leak, which aligns with the primary objective of eliminating 254 lines of redundant goroutine-based polling.
Description check ✅ Passed The PR description clearly explains the changeset: removing legacy monitorPod goroutines, moving inactivity checks to the reconciler, and providing context on the memory leak root cause. It is directly related to all file changes.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/operator-remove-monitorpod-goroutines

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
components/operator/internal/controller/reconcile_phases.go (1)

1-17: ⚠️ Potential issue | 🟡 Minor

Fix gofmt formatting to pass CI.

Run gofmt -w . to fix the formatting issues flagged by CI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/operator/internal/controller/reconcile_phases.go` around lines 1 -
17, The file in package controller (reconcile_phases.go) has formatting issues;
run gofmt -w . (or gofmt on this file) to reformat the imports and code so it
conforms to gofmt rules and passes CI, then re-stage the changed file (ensure
package name "controller" and the import block with symbols like corev1, errors,
unstructured, types, ctrl, log, and handlers remain unchanged other than
formatting).
components/operator/internal/handlers/sessions.go (1)

1-29: ⚠️ Potential issue | 🟡 Minor

Fix gofmt formatting to pass CI.

The pipeline is failing due to unformatted Go files. Run gofmt -w . from the repository root.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/operator/internal/handlers/sessions.go` around lines 1 - 29, The
file has formatting issues causing CI failure; run the Go formatter on the
repository (e.g., execute `gofmt -w .` from the repo root) and commit the
updated formatting for files including this package (package handlers in
components/operator/internal/handlers/sessions.go) so imports and code lines
(e.g., the import block and any functions in sessions.go) are properly gofmt'd;
ensure you re-run `gofmt` or `go fmt` until `gofmt` reports no changes and push
the formatted files.
components/operator/internal/handlers/inactivity_test.go (1)

1-16: ⚠️ Potential issue | 🟡 Minor

Fix gofmt formatting to pass CI.

Same formatting issue as the main source file. Run gofmt -w . to fix.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/operator/internal/handlers/inactivity_test.go` around lines 1 -
16, The test file in package "handlers" has gofmt/style issues; run the Go
formatter (e.g., execute "gofmt -w ." at the repo root or run gofmt on the file)
to reformat inactivity_test.go so imports and spacing conform to gofmt, then
re-run CI; this will fix the formatting errors in the handlers package test
file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/operator/internal/handlers/inactivity.go`:
- Around line 1-5: Run gofmt to fix formatting in the handlers file that
contains the package comment and functions referenced (ShouldAutoStop,
TriggerInactivityStop). From the repo root run `gofmt -w .` (or `gofmt -w
components/operator/internal/handlers/inactivity.go`) and commit the changes so
the file formatting passes CI; ensure the package comment and any surrounding
declarations remain unchanged except for formatting.

---

Outside diff comments:
In `@components/operator/internal/controller/reconcile_phases.go`:
- Around line 1-17: The file in package controller (reconcile_phases.go) has
formatting issues; run gofmt -w . (or gofmt on this file) to reformat the
imports and code so it conforms to gofmt rules and passes CI, then re-stage the
changed file (ensure package name "controller" and the import block with symbols
like corev1, errors, unstructured, types, ctrl, log, and handlers remain
unchanged other than formatting).

In `@components/operator/internal/handlers/inactivity_test.go`:
- Around line 1-16: The test file in package "handlers" has gofmt/style issues;
run the Go formatter (e.g., execute "gofmt -w ." at the repo root or run gofmt
on the file) to reformat inactivity_test.go so imports and spacing conform to
gofmt, then re-run CI; this will fix the formatting errors in the handlers
package test file.

In `@components/operator/internal/handlers/sessions.go`:
- Around line 1-29: The file has formatting issues causing CI failure; run the
Go formatter on the repository (e.g., execute `gofmt -w .` from the repo root)
and commit the updated formatting for files including this package (package
handlers in components/operator/internal/handlers/sessions.go) so imports and
code lines (e.g., the import block and any functions in sessions.go) are
properly gofmt'd; ensure you re-run `gofmt` or `go fmt` until `gofmt` reports no
changes and push the formatted files.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ead46c31-4429-433b-8f2f-f2b571472ea4

📥 Commits

Reviewing files that changed from the base of the PR and between 0d78e86 and 9890bee.

📒 Files selected for processing (4)
  • components/operator/internal/controller/reconcile_phases.go
  • components/operator/internal/handlers/inactivity.go
  • components/operator/internal/handlers/inactivity_test.go
  • components/operator/internal/handlers/sessions.go

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Gkrumbach07 Gkrumbach07 merged commit 76558b4 into main Mar 24, 2026
26 checks passed
@Gkrumbach07 Gkrumbach07 deleted the fix/operator-remove-monitorpod-goroutines branch March 24, 2026 16:57
jeremyeder pushed a commit to jeremyeder/platform that referenced this pull request Mar 26, 2026
… leak (ambient-code#1004)

## Summary

- Remove legacy `monitorPod` goroutine-per-session pattern that caused
memory to grow linearly with session count and never release
- Move inactivity auto-stop check into the existing controller-runtime
`reconcileRunning` reconciler
- Net **-275 lines** — the 10-worker controller-runtime pool already
handled all pod status monitoring, phase transitions, and token refresh

## Context

On UAT with 214 running sessions, operator memory was 161Mi. After
stopping 145 sessions (down to 69), memory stayed at 162Mi. Root cause:
214 `monitorPod` goroutines (each polling every 5s) duplicated work
already done by the controller-runtime reconciler. Go's runtime retained
the heap high-water mark even after goroutines exited.

## Test plan

- [x] `go build ./...` passes
- [x] `go test ./internal/handlers/ -run
"TestShouldAutoStop|TestTriggerInactivityStop"` — all pass
- [ ] Deploy to UAT, verify memory drops proportionally when sessions
are stopped
- [ ] Verify inactivity auto-stop still triggers (check logs for `idle
beyond inactivity timeout`)
- [ ] Verify session lifecycle (create → run → stop) works end-to-end

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant