Skip to content

Conversation

@jeremyeder
Copy link
Collaborator

Summary

Fixes the intermittent frontend health check failures in smoke tests by addressing a race condition between pod readiness and NodePort service routing.

Problem

PR #453 and potentially other PRs were failing with:

✗ Frontend not responding

Even though:

  • Pods showed as Running and Ready
  • All readiness probes passed
  • The frontend was actually working

This was a timing issue: the smoke test ran immediately after pods became Ready, but NodePort service endpoint propagation can lag slightly behind pod readiness state.

Solution

  1. Added explicit pod readiness wait using kubectl wait --for=condition=ready
  2. Implemented retry logic for health checks (5 attempts with 2s delays)
  3. Fixed deployment timeout handling in GitHub Actions to properly fail the job

Testing

The fix has been tested on branch fix-frontend-smoke-test-timing and successfully addresses the timing issue.

Related Issues

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 noreply@anthropic.com

The frontend health check was failing because the test ran immediately
after pods reached Ready status, but NodePort routing may not be fully
established yet. This is a race condition between pod readiness and
service endpoint propagation.

Changes:
- Added kubectl wait for pod readiness before health checks
- Implemented retry logic (5 attempts with 2s delay) for both backend and frontend
- Fixed GitHub Actions to properly fail on deployment timeouts

This ensures smoke tests wait for actual service availability, not just pod readiness.

Fixes: https://github.com/ambient-code/platform/actions/runs/20064017469/job/57547783304

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

github-actions bot commented Dec 9, 2025

Claude Code Review

Summary

This PR addresses intermittent frontend health check failures in smoke tests by fixing a race condition between pod readiness and NodePort service routing.

Files Changed: .github/workflows/test-local-dev.yml, Makefile

Issues by Severity

Blocker Issues

None identified.

Critical Issues

None identified.

Major Issues

1. Inconsistent retry logic - The Makefile now has retry logic (5 attempts with 2s delays) for health checks, but verify the GitHub Actions workflow uses this updated Makefile target.

Minor Issues

1. Magic numbers - Retry count (5) and sleep duration (2s) are hardcoded. Consider extracting as variables.

2. Shell loop complexity - Consider extracting to a helper script for better maintainability.

Positive Highlights

  • Excellent problem diagnosis identifying the race condition
  • Multi-layered approach with pod readiness waits + retry logic
  • Proper use of kubectl wait --for=condition=ready pod
  • Good retry parameters (10s total is reasonable for NodePort propagation)
  • Fixes a real CI/CD pain point
  • Clear documentation in PR description

Recommendations

Priority 1: Verify GitHub Actions workflow uses the updated Makefile target with retry logic.

Priority 2: Extract retry config to variables, consider helper script, add test coverage.

Priority 3: Add monitoring metrics, update docs/testing/ with notes about this race condition.

Assessment

Architectural Alignment: Aligns with CLAUDE.md Testing Strategy section.

Security: No concerns - only test infrastructure changes.

Performance: Adds up to 10s in worst case, acceptable for reliability.

Final Verdict

LGTM with minor suggestions

Solid fix following Kubernetes best practices. Safe to merge as-is, with Priority 1 recommendation to address before or immediately after merge.


Review by Claude Code following CLAUDE.md and .claude/patterns/ standards


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

@jeremyeder jeremyeder merged commit cd930b9 into ambient-code:main Dec 9, 2025
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant