Skip to content

Conversation

@sallyom
Copy link
Collaborator

@sallyom sallyom commented Nov 19, 2025

Root cause: Recent package auto-upgrades broke SDK initialization.

Fixes applied:

  1. npm @anthropic-ai/claude-code: 2.0.46 → 2.0.41 (CRITICAL FIX!)

    • Version 2.0.46 (released Nov 19) breaks SDK subprocess initialization
    • Reverts to last working version from Nov 14 upstream build
  2. anthropic[vertex]: >=0.68.0 → ==0.73.0

    • Pin to tested version for stability
    • 0.74.0+ untested with Vertex AI
  3. claude-agent-sdk: >=0.1.4 → ==0.1.6

    • Pin to tested version for stability
    • 0.1.7+ untested

Evidence:

  • Working upstream image (quay.io/ambient_code/vteam_claude_runner:latest) had npm CLI 2.0.41, anthropic 0.73.0, claude-agent-sdk 0.1.6
  • Recent builds auto-installed 2.0.46 → broke initialization
  • Python packages were already correct (red herring during investigation)

The npm pin is the critical fix. Python pins add build stability.

Root cause: Recent package auto-upgrades broke SDK initialization.

Fixes applied:
1. npm @anthropic-ai/claude-code: 2.0.46 → 2.0.41 (CRITICAL FIX!)
   - Version 2.0.46 (released Nov 19) breaks SDK subprocess initialization
   - Reverts to last working version from Nov 14 upstream build

2. anthropic[vertex]: >=0.68.0 → ==0.73.0
   - Pin to tested version for stability
   - 0.74.0+ untested with Vertex AI

3. claude-agent-sdk: >=0.1.4 → ==0.1.6
   - Pin to tested version for stability
   - 0.1.7+ untested

Evidence:
- Working upstream image (quay.io/ambient_code/vteam_claude_runner:latest)
  had npm CLI 2.0.41, anthropic 0.73.0, claude-agent-sdk 0.1.6
- Recent builds auto-installed 2.0.46 → broke initialization
- Python packages were already correct (red herring during investigation)

The npm pin is the critical fix. Python pins add build stability.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: sallyom <somalley@redhat.com>
@github-actions
Copy link
Contributor

Claude Code Review

Summary

This PR addresses a critical production issue by pinning runner dependencies to known-working versions after recent package auto-upgrades broke SDK initialization. The changes are minimal, focused, and well-justified with clear evidence in the PR description. The fix is appropriate and should be merged promptly to restore functionality.

Issues by Severity

Critical Issues

1. Missing Regression Test Strategy

  • Location: components/runners/claude-code-runner/tests/
  • Issue: No automated tests exist to detect SDK initialization failures before they reach production
  • Impact: Critical breakages like this one are only discovered in production/CI
  • Recommendation: Add integration tests that verify SDK subprocess initialization succeeds, Claude Code CLI version compatibility, and basic agent session execution completes

2. Dependency Pinning Strategy Not Documented

  • Location: CLAUDE.md:219-225, components/runners/claude-code-runner/
  • Issue: CLAUDE.md specifies Prefer uv over pip but doesnt provide guidance on when to use exact pins vs ranges
  • Context: Recent history shows repeated pin adjustments indicating trial-and-error approach
  • Recommendation: Add to CLAUDE.md under Configuration Standards Python: guidance on production containers using exact pins for stability

Major Issues

3. No Automated Dependency Update Process

  • Issue: The breakage occurred because anthropic-ai/claude-code auto-upgraded from 2.0.41 to 2.0.46 during builds
  • Root Cause: Dockerfile line 10 previously used unversioned npm install allowing latest version to be installed
  • Recommendation: Set up Dependabot/Renovate to create PRs for dependency updates. Consider extracting to package.json or use Renovate which supports Dockerfile parsing.

4. Version Selection Rationale Missing

  • Issue: PR states 0.74.0+ untested and 0.1.7+ untested but doesnt explain why these versions broke or what changed
  • Context: Makes future maintenance difficult - if 0.74.0 has a critical security fix, unclear if it can be safely adopted
  • Recommendation: Document specific breaking changes. Create GitHub issue tracking: Investigate anthropic 0.74.0+ and claude-agent-sdk 0.1.7+ compatibility

5. uv.lock Diff Too Large for Manual Review

  • Issue: 1489 lines of uv.lock changes mostly timestamps obscure meaningful dependency changes
  • Impact: Difficult to spot if unexpected transitive dependencies changed
  • Recommendation: Add CI check to validate only expected packages changed versions

Minor Issues

6. Inline Comments Use Different Style

  • Location: pyproject.toml:14-15
  • Issue: Inline comments after dependency declarations are inconsistent with Python conventions
  • Severity: Minor - current style is acceptable but comments above dependencies would be more scannable

7. No Rollback Plan Documented

  • Issue: If this pinning strategy fails, no documented process for emergency rollback
  • Recommendation: Document how to identify if version pins are causing issues and emergency rollback procedure

8. BOT_TOKEN Handling Could Be More Secure

  • Location: wrapper.py:78-83
  • Observation: Token appended to WebSocket URL as query parameter
  • Context: While not introduced by this PR, worth noting for security review
  • Note: Query parameters may appear in logs. Verify token redaction in logs per CLAUDE.md standards

Positive Highlights

Excellent Root Cause Analysis: PR description provides clear evidence trail

Minimal Blast Radius: Changes are scoped to exactly what is needed - 3 version pins across 3 files

Follows CLAUDE.md Standards: Uses uv as preferred package manager, Dockerfile follows established patterns, commit message has co-author attribution

Clear Commenting: Inline comments explain why versions are pinned, not just what is pinned

No Security Vulnerabilities Introduced: All pinned versions are recent and actively maintained

CI Validation in Progress: Build and E2E tests running

Recommendations

Immediate (Before Merge)

  1. Merge as-is - This is a critical fix that restores production functionality
  2. Add follow-up issue for test coverage
  3. Verify CI passes - Especially E2E tests and claude-runner build

Short-term (Next Sprint)

  1. Document dependency strategy in CLAUDE.md
  2. Set up npm dependency monitoring
  3. Create tracking issues for investigating newer versions compatibility and adding SDK initialization regression tests

Long-term (Next Quarter)

  1. Improve dependency testing: Integration tests that catch SDK breakage before production
  2. Automate version compatibility matrix: Test runner against multiple SDK versions in CI
  3. Consider lock file tooling: Automated lock file diff summaries in PR comments

Final Verdict

APPROVE with follow-up tasks

This PR correctly addresses an urgent production issue with appropriate evidence and minimal risk. The identified issues are primarily about preventing future occurrences rather than problems with the current fix.

Blocking concerns: None
Non-blocking improvements: 8 recommendations tracked above

Estimated merge safety: High - restores known-good configuration with clear rollback path

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants