Fix: Add node affinity to prevent Multi-Attach PVC errors #344

jeremyeder · 2025-11-19T02:03:08Z

Fixes #298

Problem

AgenticSessions stuck in "Creating" phase due to temp-content and Job pods scheduled on different nodes with RWO PVCs.

Solution

Add node affinity to ensure all containers schedule on the same node as the PVC attachment.

Implementation

See docs/implementation-prompts/fix-multi-attach-node-affinity.md for complete cold-startable implementation instructions.

Testing Needed

Create new AgenticSession and verify no Multi-Attach errors
Verify all containers scheduled on same node
Test with cluster autoscaling enabled

Note: This is a draft PR with implementation guidance. Code changes still need to be implemented.

Provides cold-start implementation instructions for fixing AgenticSession stuck in Creating phase due to PVC Multi-Attach errors when init containers and main containers schedule on different nodes with RWO PVCs.

github-actions · 2025-11-19T02:04:49Z

Claude Code Review

Summary

This PR adds implementation guidance for fixing Multi-Attach PVC errors when AgenticSessions get stuck in "Creating" phase. The PR contains only a documentation file with implementation instructions - no actual code changes yet. The proposed solution uses PodAffinity to co-locate all containers on the same node as the PVC attachment.

Issues by Severity

🚫 Blocker Issues

1. INCORRECT AFFINITY TYPE - WILL NOT SOLVE THE PROBLEM

Location: docs/implementation-prompts/fix-multi-attach-node-affinity.md:31-51
Issue: The proposed solution uses PodAffinity instead of PodAntiAffinity or volume-based node constraints. PodAffinity schedules pods near OTHER pods with matching labels, not based on where volumes are attached.
Problem: This will NOT solve the Multi-Attach error. On the first pod creation, there are no existing pods with agentic-session=<name> label, so the affinity has no effect. The scheduler will place the pod arbitrarily, and the PVC will attach to that node. But this doesn't prevent future issues.
Correct Solution: Use volume topology awareness (already built into Kubernetes) or remove the temp-content initContainer entirely. The current code doesn't have a "temp-content" initContainer (see sessions.go:401-413 - only init-workspace exists).

2. DOCUMENTATION DESCRIBES NON-EXISTENT CODE

Location: docs/implementation-prompts/fix-multi-attach-node-affinity.md:7-10
Issue: Documentation claims there's a "temp-content" initContainer, but examining components/operator/internal/handlers/sessions.go:401-413 shows only one initContainer: init-workspace. No "temp-content" initContainer exists.
Impact: The root cause analysis is based on incorrect understanding of the current codebase. This means the proposed solution may not address the actual issue.

🔴 Critical Issues

3. MISSING ROOT CAUSE INVESTIGATION

Issue: The PR claims sessions get stuck in "Creating" phase due to Multi-Attach errors but doesn't provide evidence (logs, events, or error messages from issue Bug: [restarted timed-out sessions fail to start due to multi-attach PVC error] #298).
Impact: Without confirming the actual error, we're implementing a solution to a potentially misdiagnosed problem.
Recommendation: Include actual error messages/events showing Multi-Attach failures before implementing a fix.

4. AFFINITY IMPLEMENTATION VIOLATES KUBERNETES BEST PRACTICES

Location: docs/implementation-prompts/fix-multi-attach-node-affinity.md:32-33
Issue: Uses PreferredDuringSchedulingIgnoredDuringExecution (soft constraint) instead of RequiredDuringSchedulingIgnoredDuringExecution (hard constraint).
Impact: If this were the correct approach (which it isn't per issue Outcome: Reduce Refinement Time with agent System #1), a soft constraint wouldn't guarantee co-location. Pods could still be scheduled on different nodes if resources are tight, causing the exact same Multi-Attach error.
Correct Approach: If using affinity, it should be Required not Preferred.

5. REAL ISSUE: VOLUME TOPOLOGY NOT CONSIDERED

Actual Solution: Kubernetes has built-in volume topology awareness. If PVCs are using StorageClasses with volumeBindingMode: WaitForFirstConsumer, the PVC won't bind until a pod is scheduled, and the scheduler will consider volume topology.
Better Fix: Ensure StorageClass uses WaitForFirstConsumer binding mode, or verify that all containers in the Job pod (initContainers + main containers) are part of the SAME pod (which they already are in sessions.go:351-627).

🟡 Major Issues

6. INCORRECT IMPORT REFERENCE

Location: docs/implementation-prompts/fix-multi-attach-node-affinity.md:37
Issue: Code example references v1.LabelSelector but sessions.go imports use v1 for k8s.io/apimachinery/pkg/apis/meta/v1.
Impact: The code would need metav1.LabelSelector to be correct, or use the local v1 alias (which is already defined).
Fix: Change to &metav1.LabelSelector{ or verify v1 is the correct import alias.

7. DOCUMENTATION STRUCTURE DOESN'T MATCH REPO STANDARDS

Location: docs/implementation-prompts/ directory
Issue: The docs/implementation-prompts/ directory is not a standard location mentioned in CLAUDE.md or mkdocs.yml. Implementation guidance should either be:
- In the issue itself (Bug: [restarted timed-out sessions fail to start due to multi-attach PVC error] #298)
- In a proper design doc under docs/
- Or in code comments
Impact: Implementation prompts in this location may become stale and unmaintained.

8. TESTING SECTION LACKS VERIFICATION OF FIX

Location: docs/implementation-prompts/fix-multi-attach-node-affinity.md:57-62
Issue: Testing section doesn't verify the actual Multi-Attach error is resolved. It only checks if pods start and are on the same node.
Missing Tests:
- Create session in multi-node cluster with RWO PVCs
- Verify no Multi-Attach error events in pod/PVC events
- Verify initContainers and main containers all run successfully
- Test with cluster autoscaler to ensure affinity doesn't break scheduling

🔵 Minor Issues

9. ALTERNATIVE APPROACHES MISSING THE ACTUAL BEST PRACTICE

Location: docs/implementation-prompts/fix-multi-attach-node-affinity.md:64-69
Issue: Doesn't mention the actual Kubernetes-native solution: volume topology awareness via volumeBindingMode: WaitForFirstConsumer.
Recommendation: Add this as the preferred approach.

10. ROLLBACK INSTRUCTIONS INCOMPLETE

Location: docs/implementation-prompts/fix-multi-attach-node-affinity.md:72-79
Issue: Rollback instructions assume deployment process but don't mention how to fix already-stuck sessions.
Better Rollback:
- Include how to manually reschedule stuck sessions
- How to verify rollback succeeded

Positive Highlights

✅ Good Documentation Structure: The implementation prompt follows a clear problem/solution/testing format that's easy to follow.

✅ Rollback Considered: Including rollback instructions shows operational awareness.

✅ Multiple Approaches Evaluated: The document considers alternative approaches (RWX, single initContainer, node selector) which shows thoughtful analysis.

✅ Links to Issue: PR description properly references issue #298.

Recommendations

CRITICAL - DO NOT MERGE WITHOUT THESE CHANGES:

Verify the Actual Problem: Review issue Bug: [restarted timed-out sessions fail to start due to multi-attach PVC error] #298 and include actual error messages/logs showing Multi-Attach errors. Confirm the root cause before implementing a solution.
Fix Root Cause Analysis: Examine actual codebase to confirm what containers exist. Currently, sessions.go shows:
- 1 InitContainer: init-workspace (line 401-413)
- 2 Main Containers: ambient-content and ambient-code-runner (line 416-623)
All these are in the SAME pod, so they already share the same node. Multi-Attach errors shouldn't occur unless there's a different issue.
Reconsider Solution Approach:
- Option A (Recommended): Verify StorageClass uses volumeBindingMode: WaitForFirstConsumer
- Option B: If PVC topology is the issue, use NodeAffinity based on PVC node affinity, not PodAffinity
- Option C: If init containers are racing with pod restarts, ensure proper ordering/dependencies
If Keeping Affinity Approach (not recommended):
- Change PodAffinity to use volume-based node selection
- Change Preferred to Required
- Add actual implementation code, not just documentation
Testing Requirements:
- Add test case that reproduces Multi-Attach error before fix
- Add test case showing error is resolved after fix
- Include in e2e test suite (see e2e/ directory)
Documentation Location:
- Move implementation guidance to issue Bug: [restarted timed-out sessions fail to start due to multi-attach PVC error] #298 comments
- Or create proper design doc under docs/design/
- Remove docs/implementation-prompts/ if it's not a standard pattern

Overall Assessment: ❌ DO NOT MERGE - The proposed solution uses an incorrect Kubernetes API (PodAffinity instead of volume topology awareness) and is based on incorrect understanding of the codebase (references non-existent temp-content initContainer). The PR needs fundamental rework of both the problem analysis and proposed solution.

Next Steps:

Reproduce the Multi-Attach error in a test environment
Capture actual error messages and pod events
Verify what containers/pods are actually involved
Implement correct solution based on root cause
Add tests to verify fix works

docs: add implementation prompt for multi-attach node affinity fix

4580700

Provides cold-start implementation instructions for fixing AgenticSession stuck in Creating phase due to PVC Multi-Attach errors when init containers and main containers schedule on different nodes with RWO PVCs.

jeremyeder marked this pull request as ready for review November 19, 2025 02:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix: Add node affinity to prevent Multi-Attach PVC errors #344

Fix: Add node affinity to prevent Multi-Attach PVC errors #344

Uh oh!

jeremyeder commented Nov 19, 2025

Uh oh!

github-actions bot commented Nov 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Fix: Add node affinity to prevent Multi-Attach PVC errors #344

Are you sure you want to change the base?

Fix: Add node affinity to prevent Multi-Attach PVC errors #344

Uh oh!

Conversation

jeremyeder commented Nov 19, 2025

Problem

Solution

Implementation

Testing Needed

Uh oh!

github-actions bot commented Nov 19, 2025

Claude Code Review

Summary

Issues by Severity

🚫 Blocker Issues

🔴 Critical Issues

🟡 Major Issues

🔵 Minor Issues

Positive Highlights

Recommendations

CRITICAL - DO NOT MERGE WITHOUT THESE CHANGES:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant