
bug: Health check state lost and checker not working after upstream node changes #13282

@solosky


Current Behavior

When upstream nodes change (e.g., Kubernetes pod scaling, service discovery update, or DNS resolution change), the health checker has two critical issues:

  1. Health check status lost: previously detected unhealthy nodes are reset to healthy after the node change
  2. Health check not running: the health checker may stop probing entirely after the node change

Root Cause

APISIX uses a full destroy-and-rebuild strategy for health checkers when upstream nodes change. The core flow is:

  1. Node change → _nodes_ver increments → resource_version changes
  2. fetch_checker() detects version mismatch → adds to waiting_pool, returns nil (no checker during this period)
  3. Timer (1s interval) destroys old checker → calls delayed_clear() to clear all health status from shared dict
  4. Creates brand new checker → all nodes start as healthy
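
For illustration, here is a minimal, hypothetical sketch of the version-check path described above. It is not the actual healthcheck_manager.lua code; the pool shapes and local names are assumptions, only fetch_checker, waiting_pool, and resource_version come from the flow above.

```lua
-- Simplified sketch (not the real healthcheck_manager.lua) of how a version
-- mismatch leaves requests with no checker until the timer rebuilds one.
local working_pool = {}   -- resource_key -> { checker = ..., version = ... }
local waiting_pool = {}   -- resource_key -> resource_version pending rebuild

local function fetch_checker(resource_key, resource_version)
    local item = working_pool[resource_key]
    if item and item.version == resource_version then
        return item.checker
    end

    -- version mismatch: queue a rebuild and return nil, so requests in this
    -- window are balanced without any health information
    waiting_pool[resource_key] = resource_version
    return nil
end
```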

Impact

  • Traffic routed to unhealthy nodes during the window between checker rebuild and next active check cycle
  • Removed nodes remain in the health checker's target list, consuming resources and potentially affecting health check results
  • In high-frequency node change scenarios, the checker may never be successfully created due to version race conditions

Suggested Fix

Implement incremental target update instead of full rebuild:

  1. When nodes change but checks config remains the same, only add/remove targets on the existing checker
  2. Use target.hostname from get_target_list() when calling remove_target() to ensure the correct target is matched
  3. Only do full rebuild when checks configuration changes
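
As a rough sketch of what the incremental update could look like, assuming a lua-resty-healthcheck-style checker API (get_target_list(), add_target(), remove_target()) and APISIX node tables with host/port fields. The function name update_checker_targets matches the proposal above, but the target field names and call signatures here are assumptions, not the final patch:

```lua
-- Incrementally reconcile the checker's targets with the new node list,
-- keeping the health status of nodes that did not change.
local function update_checker_targets(checker, new_nodes)
    -- index desired nodes by "ip:port"
    local desired = {}
    for _, node in ipairs(new_nodes) do
        desired[node.host .. ":" .. node.port] = node
    end

    -- remove targets that are no longer present, using the hostname stored
    -- on the target so the correct entry is matched
    for _, target in ipairs(checker:get_target_list()) do
        local key = target.ip .. ":" .. target.port
        if not desired[key] then
            checker:remove_target(target.ip, target.port, target.hostname)
        else
            desired[key] = nil   -- already tracked; its health status is kept
        end
    end

    -- add only the genuinely new targets (starting as healthy)
    for _, node in pairs(desired) do
        checker:add_target(node.host, node.port, node.hostname or node.host, true)
    end
end
```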

Key changes in healthcheck_manager.lua:

  • Add update_checker_targets(): incrementally adds new targets and removes stale ones
  • Add checks_config_equal(): compares checks config to decide incremental vs full rebuild
  • Fix remove_target() hostname: use stored target.hostname instead of checks.active.host
  • Save checks config in working pool for later comparison
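
checks_config_equal() could be as simple as a deep compare of the old and new checks tables. The sketch below is illustrative only; the actual patch would likely reuse APISIX's existing table utilities instead of a hand-rolled helper:

```lua
-- Deep-compare two plain Lua tables (sufficient for the checks config).
local function deep_equal(a, b)
    if a == b then
        return true
    end
    if type(a) ~= "table" or type(b) ~= "table" then
        return false
    end
    for k, v in pairs(a) do
        if not deep_equal(v, b[k]) then
            return false
        end
    end
    for k in pairs(b) do
        if a[k] == nil then
            return false
        end
    end
    return true
end

-- Decide between incremental target update and full checker rebuild.
local function checks_config_equal(old_checks, new_checks)
    return deep_equal(old_checks, new_checks)
end
```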

Expected Behavior

  1. Existing nodes should retain their health status (healthy/unhealthy) when nodes are added/removed
  2. Removed nodes should be properly cleaned up from the health checker
  3. Health checking should not have gaps during node changes

Error Logs

No response

Steps to Reproduce

  1. Create a route with health check enabled and multiple upstream nodes
  2. Wait for one node to be detected as unhealthy
  3. Add a new node to the upstream (or trigger service discovery update)
  4. Observe that the previously unhealthy node resets to healthy
  5. Check health checker target list — removed nodes may still be present

Environment

  • APISIX version: 3.16.0
  • lua-resty-healthcheck-api7: 3.2.1-0
