Skip to content

Conversation

@jeremyeder
Copy link
Collaborator

Fixes #298

Problem

AgenticSessions stuck in "Creating" phase due to temp-content and Job pods scheduled on different nodes with RWO PVCs.

Solution

Add node affinity to ensure all containers schedule on the same node as the PVC attachment.

Implementation

See docs/implementation-prompts/fix-multi-attach-node-affinity.md for complete cold-startable implementation instructions.

Testing Needed

  • Create new AgenticSession and verify no Multi-Attach errors
  • Verify all containers scheduled on same node
  • Test with cluster autoscaling enabled

Note: This is a draft PR with implementation guidance. Code changes still need to be implemented.

Provides cold-start implementation instructions for fixing AgenticSession
stuck in Creating phase due to PVC Multi-Attach errors when init containers
and main containers schedule on different nodes with RWO PVCs.
@github-actions
Copy link
Contributor

Claude Code Review

Summary

This PR adds implementation guidance for fixing Multi-Attach PVC errors when AgenticSessions get stuck in "Creating" phase. The PR contains only a documentation file with implementation instructions - no actual code changes yet. The proposed solution uses PodAffinity to co-locate all containers on the same node as the PVC attachment.

Issues by Severity

🚫 Blocker Issues

1. INCORRECT AFFINITY TYPE - WILL NOT SOLVE THE PROBLEM

  • Location: docs/implementation-prompts/fix-multi-attach-node-affinity.md:31-51
  • Issue: The proposed solution uses PodAffinity instead of PodAntiAffinity or volume-based node constraints. PodAffinity schedules pods near OTHER pods with matching labels, not based on where volumes are attached.
  • Problem: This will NOT solve the Multi-Attach error. On the first pod creation, there are no existing pods with agentic-session=<name> label, so the affinity has no effect. The scheduler will place the pod arbitrarily, and the PVC will attach to that node. But this doesn't prevent future issues.
  • Correct Solution: Use volume topology awareness (already built into Kubernetes) or remove the temp-content initContainer entirely. The current code doesn't have a "temp-content" initContainer (see sessions.go:401-413 - only init-workspace exists).

2. DOCUMENTATION DESCRIBES NON-EXISTENT CODE

  • Location: docs/implementation-prompts/fix-multi-attach-node-affinity.md:7-10
  • Issue: Documentation claims there's a "temp-content" initContainer, but examining components/operator/internal/handlers/sessions.go:401-413 shows only one initContainer: init-workspace. No "temp-content" initContainer exists.
  • Impact: The root cause analysis is based on incorrect understanding of the current codebase. This means the proposed solution may not address the actual issue.

🔴 Critical Issues

3. MISSING ROOT CAUSE INVESTIGATION

  • Issue: The PR claims sessions get stuck in "Creating" phase due to Multi-Attach errors but doesn't provide evidence (logs, events, or error messages from issue Bug: [restarted timed-out sessions fail to start due to multi-attach PVC error] #298).
  • Impact: Without confirming the actual error, we're implementing a solution to a potentially misdiagnosed problem.
  • Recommendation: Include actual error messages/events showing Multi-Attach failures before implementing a fix.

4. AFFINITY IMPLEMENTATION VIOLATES KUBERNETES BEST PRACTICES

  • Location: docs/implementation-prompts/fix-multi-attach-node-affinity.md:32-33
  • Issue: Uses PreferredDuringSchedulingIgnoredDuringExecution (soft constraint) instead of RequiredDuringSchedulingIgnoredDuringExecution (hard constraint).
  • Impact: If this were the correct approach (which it isn't per issue Outcome: Reduce Refinement Time with agent System #1), a soft constraint wouldn't guarantee co-location. Pods could still be scheduled on different nodes if resources are tight, causing the exact same Multi-Attach error.
  • Correct Approach: If using affinity, it should be Required not Preferred.

5. REAL ISSUE: VOLUME TOPOLOGY NOT CONSIDERED

  • Actual Solution: Kubernetes has built-in volume topology awareness. If PVCs are using StorageClasses with volumeBindingMode: WaitForFirstConsumer, the PVC won't bind until a pod is scheduled, and the scheduler will consider volume topology.
  • Better Fix: Ensure StorageClass uses WaitForFirstConsumer binding mode, or verify that all containers in the Job pod (initContainers + main containers) are part of the SAME pod (which they already are in sessions.go:351-627).

🟡 Major Issues

6. INCORRECT IMPORT REFERENCE

  • Location: docs/implementation-prompts/fix-multi-attach-node-affinity.md:37
  • Issue: Code example references v1.LabelSelector but sessions.go imports use v1 for k8s.io/apimachinery/pkg/apis/meta/v1.
  • Impact: The code would need metav1.LabelSelector to be correct, or use the local v1 alias (which is already defined).
  • Fix: Change to &metav1.LabelSelector{ or verify v1 is the correct import alias.

7. DOCUMENTATION STRUCTURE DOESN'T MATCH REPO STANDARDS

  • Location: docs/implementation-prompts/ directory
  • Issue: The docs/implementation-prompts/ directory is not a standard location mentioned in CLAUDE.md or mkdocs.yml. Implementation guidance should either be:
  • Impact: Implementation prompts in this location may become stale and unmaintained.

8. TESTING SECTION LACKS VERIFICATION OF FIX

  • Location: docs/implementation-prompts/fix-multi-attach-node-affinity.md:57-62
  • Issue: Testing section doesn't verify the actual Multi-Attach error is resolved. It only checks if pods start and are on the same node.
  • Missing Tests:
    • Create session in multi-node cluster with RWO PVCs
    • Verify no Multi-Attach error events in pod/PVC events
    • Verify initContainers and main containers all run successfully
    • Test with cluster autoscaler to ensure affinity doesn't break scheduling

🔵 Minor Issues

9. ALTERNATIVE APPROACHES MISSING THE ACTUAL BEST PRACTICE

  • Location: docs/implementation-prompts/fix-multi-attach-node-affinity.md:64-69
  • Issue: Doesn't mention the actual Kubernetes-native solution: volume topology awareness via volumeBindingMode: WaitForFirstConsumer.
  • Recommendation: Add this as the preferred approach.

10. ROLLBACK INSTRUCTIONS INCOMPLETE

  • Location: docs/implementation-prompts/fix-multi-attach-node-affinity.md:72-79
  • Issue: Rollback instructions assume deployment process but don't mention how to fix already-stuck sessions.
  • Better Rollback:
    • Include how to manually reschedule stuck sessions
    • How to verify rollback succeeded

Positive Highlights

Good Documentation Structure: The implementation prompt follows a clear problem/solution/testing format that's easy to follow.

Rollback Considered: Including rollback instructions shows operational awareness.

Multiple Approaches Evaluated: The document considers alternative approaches (RWX, single initContainer, node selector) which shows thoughtful analysis.

Links to Issue: PR description properly references issue #298.

Recommendations

CRITICAL - DO NOT MERGE WITHOUT THESE CHANGES:

  1. Verify the Actual Problem: Review issue Bug: [restarted timed-out sessions fail to start due to multi-attach PVC error] #298 and include actual error messages/logs showing Multi-Attach errors. Confirm the root cause before implementing a solution.

  2. Fix Root Cause Analysis: Examine actual codebase to confirm what containers exist. Currently, sessions.go shows:

    • 1 InitContainer: init-workspace (line 401-413)
    • 2 Main Containers: ambient-content and ambient-code-runner (line 416-623)

    All these are in the SAME pod, so they already share the same node. Multi-Attach errors shouldn't occur unless there's a different issue.

  3. Reconsider Solution Approach:

    • Option A (Recommended): Verify StorageClass uses volumeBindingMode: WaitForFirstConsumer
    • Option B: If PVC topology is the issue, use NodeAffinity based on PVC node affinity, not PodAffinity
    • Option C: If init containers are racing with pod restarts, ensure proper ordering/dependencies
  4. If Keeping Affinity Approach (not recommended):

    • Change PodAffinity to use volume-based node selection
    • Change Preferred to Required
    • Add actual implementation code, not just documentation
  5. Testing Requirements:

    • Add test case that reproduces Multi-Attach error before fix
    • Add test case showing error is resolved after fix
    • Include in e2e test suite (see e2e/ directory)
  6. Documentation Location:


Overall Assessment: ❌ DO NOT MERGE - The proposed solution uses an incorrect Kubernetes API (PodAffinity instead of volume topology awareness) and is based on incorrect understanding of the codebase (references non-existent temp-content initContainer). The PR needs fundamental rework of both the problem analysis and proposed solution.

Next Steps:

  1. Reproduce the Multi-Attach error in a test environment
  2. Capture actual error messages and pod events
  3. Verify what containers/pods are actually involved
  4. Implement correct solution based on root cause
  5. Add tests to verify fix works

@jeremyeder jeremyeder marked this pull request as ready for review November 19, 2025 02:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: [restarted timed-out sessions fail to start due to multi-attach PVC error]

1 participant