fix: eliminate 502 errors during rolling deployments #560

Open
scotwells wants to merge 2 commits into main from fix/rolling-deploy-502s

@scotwells
Contributor

Summary

Clients occasionally see brief 502 errors when the API server is updated. This PR makes deployments seamless so clients never see errors during rollouts.

What changed:

  • The gateway now automatically retries failed requests on a healthy pod instead of returning the error to the client
  • Gateway health checks detect pods that are shutting down within seconds, and the gateway stops sending them traffic
  • The API server signals it's going away before actually stopping, giving the gateway time to react
  • A disruption budget prevents too many pods from going down at once during cluster maintenance
  • The rollout strategy now waits for a new pod to be fully ready before terminating the old one
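
For illustration, the last two items in the list above (the disruption budget and the rollout strategy) could be expressed roughly as follows. This is a minimal sketch, not the manifests from this PR: the resource name, labels, replica count, and image are hypothetical placeholders, and only the `milo-system` namespace and the `maxSurge`/`maxUnavailable` values appear in the PR itself.

```yaml
# Illustrative sketch only. Names, labels, replica count, and image are
# assumptions, not values taken from this PR.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server              # hypothetical name
  namespace: milo-system
spec:
  minAvailable: 1               # keep at least one pod up during voluntary disruptions
  selector:
    matchLabels:
      app: api-server           # hypothetical label
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server              # hypothetical name
  namespace: milo-system
spec:
  replicas: 2                   # assumed; the test plan calls for 2+ replicas
  selector:
    matchLabels:
      app: api-server
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%             # extra pods allowed above the desired count
      maxUnavailable: 0         # no old pod is removed until its replacement is Ready
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api-server
          image: example.invalid/api-server:placeholder   # placeholder image
```

With `maxUnavailable: 0`, the Deployment controller has to surge a new pod and wait for it to pass readiness before scaling down an old one, which is the behavior the last bullet describes. The PDB only guards against voluntary disruptions such as node drains; it does not affect the rolling update itself.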

Closes #559

Test plan

  • Deploy to staging with 2+ API server replicas
  • Trigger a rolling update (image tag change) while sending continuous API requests
  • Verify zero 502 errors are returned to clients during the rollout
  • Verify the BackendTrafficPolicy is accepted by Envoy Gateway (kubectl get btp -n milo-system)
  • Test a voluntary node drain to confirm the PDB keeps at least one pod available
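
As a point of reference for the BackendTrafficPolicy check above, a policy combining retries with active health checking might look roughly like the sketch below. It is an assumption-laden example, not the policy added by this PR: the policy name, the target HTTPRoute name, the `/readyz` path, and every threshold and timeout are hypothetical, and the field names follow the Envoy Gateway `v1alpha1` API as commonly documented, so they should be verified against the installed Envoy Gateway version.

```yaml
# Illustrative sketch only. Names, paths, and all numeric values are
# assumptions; verify field names against your Envoy Gateway version.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: api-server-resilience   # hypothetical name
  namespace: milo-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: api-server          # hypothetical route name
  retry:
    numRetries: 2               # assumed retry count
    retryOn:
      triggers:
        - connect-failure       # backend refused or dropped the connection
        - retriable-status-codes
      httpStatusCodes:
        - 502
        - 503
    perRetry:
      timeout: 2s               # assumed per-attempt timeout
  healthCheck:
    active:
      type: HTTP
      interval: 2s              # assumed probe interval
      timeout: 1s
      unhealthyThreshold: 2     # failures before a pod is taken out of rotation
      healthyThreshold: 1
      http:
        path: /readyz           # assumed readiness endpoint
        expectedStatuses:
          - 200
```

After applying a policy like this, the `kubectl get btp -n milo-system` step in the test plan should show the resource, and its status conditions indicate whether Envoy Gateway accepted it.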

🤖 Generated with Claude Code

@joggrbot
Contributor

joggrbot bot commented Apr 2, 2026

📝 Documentation Analysis

All docs are up to date! 🎉


✅ Latest commit analyzed: 46f25f2 | Powered by Joggr

scotwells and others added 2 commits April 2, 2026 17:46
Add gateway-level retry, health checking, and disruption protection so
clients never see errors when the API server is updated.

Closes #559

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scotwells force-pushed the fix/rolling-deploy-502s branch from ace18cf to 46f25f2 on April 2, 2026 21:46
scotwells marked this pull request as ready for review on April 2, 2026 21:47
  rollingUpdate:
    maxSurge: 25%
-   maxUnavailable: 25%
+   maxUnavailable: 0

hahaha

Development

Successfully merging this pull request may close these issues.

Reduce API errors during rolling deployments

3 participants