
feat(elbv2): real health probes with state machine#756

Merged
vieiralucas merged 2 commits into main from worktree-batch2-elbv2-health-probes on Apr 25, 2026

@vieiralucas vieiralucas commented Apr 25, 2026

Summary

  • A background tokio prober (one per fakecloud process) walks every target group with HealthCheckEnabled, probes each registered target on its configured protocol/port/path at the configured interval, and flips the target's health state according to the healthy/unhealthy threshold counts
  • HTTP/HTTPS probes honor Matcher.HttpCode (single code, range, list, or mixed). TCP/TLS probes succeed on connect. UDP/GENEVE targets are reported healthy without active probing
  • Newly registered targets start in the initial state (matching AWS) until the first probe completes
  • New env knob FAKECLOUD_ELBV2_DISABLE_HEALTH_PROBES=true opts out and keeps the historical synthetic healthy default, which is useful when targets are placeholder IPs
  • Removes "Health probes" from the elbv2.md "Not yet implemented" section and adds a dedicated "Health probes" section
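
The Matcher.HttpCode forms listed above (single, range, list, mixed) can be illustrated with a small parser. This is a minimal sketch, not the PR's code: only the name `matcher_matches` comes from the test plan, while its signature, the comma/hyphen parsing, and treating malformed entries as non-matching are assumptions.

```rust
/// Hypothetical sketch: returns true if `status` satisfies an ELBv2-style
/// HttpCode matcher. Accepts a single code ("200"), an inclusive range
/// ("200-299"), a comma-separated list ("200,202"), or a mix ("200,300-399").
/// Malformed entries never match (an assumption of this sketch).
pub fn matcher_matches(matcher: &str, status: u16) -> bool {
    matcher.split(',').map(str::trim).any(|part| match part.split_once('-') {
        // Inclusive range such as "200-299".
        Some((lo, hi)) => match (lo.trim().parse::<u16>(), hi.trim().parse::<u16>()) {
            (Ok(lo), Ok(hi)) => (lo..=hi).contains(&status),
            _ => false,
        },
        // Single code such as "200".
        None => matches!(part.parse::<u16>(), Ok(code) if code == status),
    })
}
```

For example, `matcher_matches("200,300-399", 302)` is true and `matcher_matches("200,300-399", 503)` is false, which is the shape of the single/range/list/mixed cases the unit tests cover.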

Test plan

  • Unit tests for matcher_matches (single, range, list, mixed)
  • E2E test spawns a tiny TCP server returning configurable HTTP codes, registers it as an IP target, and asserts that the prober marks it healthy, then unhealthy after the returned status code is changed
  • cargo clippy --workspace --all-targets -- -D warnings clean
  • cargo fmt --check clean
  • Existing 11 elbv2 unit tests still pass

Summary by cubic

Adds real ELBv2 health checks with a background prober that updates target health based on group settings. HTTP/HTTPS honor code matchers; TCP/TLS succeed on connect; UDP/GENEVE are treated healthy.

  • New Features

    • Probes target groups with HealthCheckEnabled on the configured protocol/port/path at HealthCheckIntervalSeconds.
    • HTTP/HTTPS honor Matcher.HttpCode; TCP/TLS succeed on connect; UDP/GENEVE are marked healthy. State flips via consecutive success/failure thresholds; DescribeTargetHealth shows live state.
    • Newly registered targets start in the initial state; set FAKECLOUD_ELBV2_DISABLE_HEALTH_PROBES=true to keep all targets healthy. Docs add a "Health probes" section.
  • Bug Fixes

    • Removed the 30s client-level timeout; per-probe timeout uses HealthCheckTimeoutSeconds and is authoritative.
    • Validate health-check port range (1–65535) and use safe port conversion so out-of-range ports fail fast instead of wrapping.

Written for commit d8cbe34. Summary will update on new commits.

A background tokio prober walks every target group with HealthCheckEnabled,
hits each registered target on its configured protocol/port/path at the
configured interval, and flips state.target.health.state per the
healthy/unhealthy threshold counts. HTTP/HTTPS probes match against
matcher.http_code (single, range, list, mixed). TCP/TLS probes succeed on
connect. UDP/GENEVE targets are reported healthy without active probing
(matches AWS NLB semantics).
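
The non-HTTP branches of that per-protocol dispatch can be sketched with the standard library. This is an illustration under stated assumptions, not the prober itself: the `Protocol` enum and `probe_without_http` function are hypothetical, blocking `std::net::TcpStream::connect_timeout` stands in for the async, `tokio::time::timeout`-bounded probe the PR describes, and `timeout` is assumed to be derived from `HealthCheckTimeoutSeconds`. HTTP/HTTPS is left to the request-plus-matcher path.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Target-group protocols named in the PR (hypothetical enum, not fakecloud's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Protocol {
    Http,
    Https,
    Tcp,
    Tls,
    Udp,
    Geneve,
}

/// Returns `Some(healthy)` for protocols decided without an HTTP request, and
/// `None` for HTTP/HTTPS, which would instead send a request and check the
/// response status against the matcher.
pub fn probe_without_http(protocol: Protocol, addr: SocketAddr, timeout: Duration) -> Option<bool> {
    match protocol {
        // TCP/TLS: a completed TCP connect within the timeout counts as success.
        // This sketch does not perform a TLS handshake ("succeed on connect").
        Protocol::Tcp | Protocol::Tls => Some(TcpStream::connect_timeout(&addr, timeout).is_ok()),
        // UDP/GENEVE: reported healthy without any active probing.
        Protocol::Udp | Protocol::Geneve => Some(true),
        // HTTP/HTTPS: handled elsewhere (request + status matcher).
        Protocol::Http | Protocol::Https => None,
    }
}
```

Keeping the timeout on each individual probe, rather than on a shared client, is what lets the per-target-group setting stay authoritative.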

Behavior:
- Newly registered targets start in "initial" until the first probe completes
- Consecutive successes >= healthy_threshold -> "healthy"
- Consecutive failures >= unhealthy_threshold -> "unhealthy" with
  Target.FailedHealthChecks reason
- DescribeTargetHealth returns the live state
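
The threshold rules above amount to a small per-target state machine. The sketch below is hypothetical (type and field names are illustrative, not fakecloud's) and assumes that a target stays `initial` until one of the thresholds is first reached; the PR's exact handling of the very first probe may differ.

```rust
/// Three of the target health states, as used by this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealthState {
    Initial,
    Healthy,
    Unhealthy,
}

/// Hypothetical per-target tracker of consecutive probe outcomes.
#[derive(Debug)]
pub struct TargetHealth {
    pub state: HealthState,
    consecutive_successes: u32,
    consecutive_failures: u32,
}

impl Default for TargetHealth {
    /// A newly registered target starts in `initial` with no probe history.
    fn default() -> Self {
        Self { state: HealthState::Initial, consecutive_successes: 0, consecutive_failures: 0 }
    }
}

impl TargetHealth {
    /// Record one probe outcome and apply the healthy/unhealthy thresholds.
    pub fn record(&mut self, success: bool, healthy_threshold: u32, unhealthy_threshold: u32) -> HealthState {
        if success {
            // A success breaks any failure streak.
            self.consecutive_successes = self.consecutive_successes.saturating_add(1);
            self.consecutive_failures = 0;
            if self.consecutive_successes >= healthy_threshold {
                self.state = HealthState::Healthy;
            }
        } else {
            // A failure breaks any success streak.
            self.consecutive_failures = self.consecutive_failures.saturating_add(1);
            self.consecutive_successes = 0;
            if self.consecutive_failures >= unhealthy_threshold {
                self.state = HealthState::Unhealthy;
            }
        }
        self.state
    }

    /// Reason code reported alongside an unhealthy state.
    pub fn reason(&self) -> Option<&'static str> {
        (self.state == HealthState::Unhealthy).then_some("Target.FailedHealthChecks")
    }
}
```

Because the counters reset on every opposite outcome, a single blip does not flip a target's state; it has to fail (or succeed) the configured number of times in a row.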

Set FAKECLOUD_ELBV2_DISABLE_HEALTH_PROBES=true to opt out and keep the
historical synthetic "healthy" default for placeholder IPs that do not
actually answer.
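
Reading the opt-out knob could look like the sketch below. Only the variable name and the `=true` value come from the PR; the helper names, trimming, and case-insensitive comparison are assumptions.

```rust
/// Environment variable named in the PR.
pub const DISABLE_PROBES_ENV: &str = "FAKECLOUD_ELBV2_DISABLE_HEALTH_PROBES";

/// Interpret the knob's raw value. Only "true" opts out; trimming and
/// case-insensitivity are assumptions of this sketch.
pub fn probes_disabled_from(value: Option<&str>) -> bool {
    value.is_some_and(|v| v.trim().eq_ignore_ascii_case("true"))
}

/// Read the knob from the current process environment.
pub fn probes_disabled() -> bool {
    probes_disabled_from(std::env::var(DISABLE_PROBES_ENV).ok().as_deref())
}

/// State a newly registered target reports: the synthetic "healthy" default
/// when probes are disabled, otherwise "initial" until the prober decides.
pub fn initial_state_name(disabled: bool) -> &'static str {
    if disabled { "healthy" } else { "initial" }
}
```

Splitting the parsing (`probes_disabled_from`) from the environment read keeps the logic testable without mutating process-wide environment variables.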

E2E: spawns a tiny TCP server returning configurable HTTP status codes,
registers it as a target, and asserts that the prober flips the target to
healthy on 200 and to unhealthy after consecutive 503s.
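
A test fixture of that kind might look like the following. This is a hypothetical sketch using std threads rather than the PR's actual E2E code (which is not shown here): `spawn_status_server` and `fetch_status` are illustrative names, the server answers every request with whatever status is currently stored in a shared `AtomicU16`, and `fetch_status` is a bare-bones HTTP GET standing in for the prober's HTTP check.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::sync::atomic::{AtomicU16, Ordering};
use std::sync::Arc;
use std::thread;

/// Serve every request with the status code currently stored in the returned
/// `AtomicU16` (initially 200). Flip it with `store` to simulate a failing target.
pub fn spawn_status_server() -> std::io::Result<(SocketAddr, Arc<AtomicU16>)> {
    // Port 0 asks the OS for a free ephemeral port on loopback.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let status = Arc::new(AtomicU16::new(200));
    let shared = Arc::clone(&status);
    thread::spawn(move || {
        for stream in listener.incoming() {
            let Ok(mut stream) = stream else { continue };
            // Read until the end of the request head (or EOF) so the socket
            // closes cleanly after the reply is written.
            let mut request = Vec::new();
            let mut buf = [0u8; 512];
            loop {
                match stream.read(&mut buf) {
                    Ok(0) | Err(_) => break,
                    Ok(n) => {
                        request.extend_from_slice(&buf[..n]);
                        if request.windows(4).any(|w| w == b"\r\n\r\n") {
                            break;
                        }
                    }
                }
            }
            let code = shared.load(Ordering::SeqCst);
            let reply = format!("HTTP/1.1 {code} Probe\r\nContent-Length: 0\r\nConnection: close\r\n\r\n");
            let _ = stream.write_all(reply.as_bytes());
            // `stream` is dropped here, closing the connection.
        }
    });
    Ok((addr, status))
}

/// Minimal HTTP/1.1 GET returning the status code from the response line.
pub fn fetch_status(addr: SocketAddr, path: &str) -> std::io::Result<u16> {
    let mut stream = TcpStream::connect(addr)?;
    // Build the whole request first so it is sent with a single write_all.
    let request = format!("GET {path} HTTP/1.1\r\nHost: {addr}\r\nConnection: close\r\n\r\n");
    stream.write_all(request.as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    // The status line looks like "HTTP/1.1 503 Probe".
    response
        .split_whitespace()
        .nth(1)
        .and_then(|code| code.parse().ok())
        .ok_or_else(|| std::io::Error::new(std::io::ErrorKind::InvalidData, "malformed status line"))
}
```

An E2E test would register `addr` as an IP target, wait for `healthy`, then `store(503, ...)` and wait for `unhealthy`.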

Removes "Health probes" from the elbv2.md "Not yet implemented" section
and adds a dedicated "Health probes" section.

codecov Bot commented Apr 25, 2026

Codecov Report

❌ Patch coverage is 22.59615% with 161 lines in your changes missing coverage. Please review.

File with missing lines                   Patch %   Missing lines
crates/fakecloud-elbv2/src/prober.rs      23.46%    150
crates/fakecloud-elbv2/src/state.rs        0.00%      7
crates/fakecloud-elbv2/src/service.rs     20.00%      4

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 9 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/fakecloud-elbv2/src/prober.rs">

<violation number="1" location="crates/fakecloud-elbv2/src/prober.rs:29">
P2: The hard-coded reqwest client timeout (30s) caps HTTP/HTTPS health checks and can override `HealthCheckTimeoutSeconds` for target groups.</violation>

<violation number="2" location="crates/fakecloud-elbv2/src/prober.rs:221">
P1: `job.port as u16` can wrap invalid `i32` ports, causing TCP/TLS probes to hit the wrong port instead of failing fast.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

- Drop the 30s reqwest client-level timeout. Per-probe `tokio::time::timeout`
  using `HealthCheckTimeoutSeconds` is now the only timeout, so that knob
  is authoritative and not capped at 30s.
- Validate health-check port range in `build_job` (1..=65535) and use
  `u16::try_from` at the TCP/TLS connect site so an out-of-range i32 port
  fails the probe fast instead of wrapping to a different valid port.
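
The port fix can be sketched as follows. The helper and error type are hypothetical (the PR's `build_job` is not reproduced here); the point shown is that `as u16` silently wraps (for example `65616 as u16 == 80`) while a range check plus `u16::try_from` rejects the value.

```rust
/// Hypothetical error for an unusable health-check port.
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidPort(pub i32);

/// Accept only 1..=65535 and convert losslessly; anything else fails fast
/// instead of wrapping to a different, valid-looking port.
pub fn validate_health_check_port(port: i32) -> Result<u16, InvalidPort> {
    if !(1..=65535).contains(&port) {
        return Err(InvalidPort(port));
    }
    // Cannot fail after the range check, but keeps the conversion checked.
    u16::try_from(port).map_err(|_| InvalidPort(port))
}
```

Port 0 is rejected by the explicit range check even though it fits in a `u16`, which is why both steps are used.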
@vieiralucas vieiralucas merged commit 4113f42 into main Apr 25, 2026
38 checks passed
@vieiralucas vieiralucas deleted the worktree-batch2-elbv2-health-probes branch April 25, 2026 19:38