Current Behavior
When upstream nodes change (e.g., Kubernetes pod scaling, service discovery update, or DNS resolution change), the health checker has two critical issues:
- Health check status lost: Previously detected unhealthy nodes reset to healthy after node changes
- Health check not running: the health checker can intermittently stop checking entirely after node changes
Root Cause
APISIX uses a full destroy-and-rebuild strategy for health checkers when upstream nodes change. The core flow is:
- Node change → `_nodes_ver` increments → `resource_version` changes
- `fetch_checker()` detects the version mismatch → adds the checker to `waiting_pool` and returns `nil` (no checker exists during this window)
- Timer (1s interval) destroys the old checker → calls `delayed_clear()` to clear all health status from the shared dict
- A brand-new checker is created → all nodes start as healthy
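The flow above can be sketched as follows (a simplified illustration, not the actual APISIX source; the pool and field names are taken from this report, the surrounding logic is assumed):

```lua
-- Sketch of the version-mismatch path: on any node change the checker
-- is handed off for rebuild and callers get nil until the timer fires.
local function fetch_checker(upstream)
    local version = upstream.resource_version  -- changes when _nodes_ver increments
    local cached  = working_pool[upstream.key]

    if cached and cached.version == version then
        return cached.checker                  -- versions match: reuse as-is
    end

    -- Version mismatch: queue a full destroy-and-rebuild and return nil.
    -- Until the 1s timer processes waiting_pool, requests have no checker.
    waiting_pool[upstream.key] = version
    return nil
end
```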
Impact
- Traffic routed to unhealthy nodes during the window between checker rebuild and next active check cycle
- Removed nodes remain in the health checker's target list, consuming resources and potentially affecting health check results
- In high-frequency node change scenarios, the checker may never be successfully created due to version race conditions
Suggested Fix
Implement incremental target update instead of full rebuild:
- When nodes change but the `checks` config remains the same, only add/remove targets on the existing checker
- Use `target.hostname` from `get_target_list()` when calling `remove_target()` to ensure the correct target is matched
- Only do a full rebuild when the `checks` configuration changes
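A minimal sketch of the incremental update, assuming the `lua-resty-healthcheck` target API (`get_target_list()`, `add_target()`, `remove_target()`); `update_checker_targets` is the proposed helper, not an existing function:

```lua
-- Diff the checker's current targets against the new node list:
-- keep overlapping targets (and their health status), remove stale
-- ones, and add only genuinely new ones.
local function update_checker_targets(checker, new_nodes)
    -- Index the desired nodes by "ip:port".
    local want = {}
    for _, node in ipairs(new_nodes) do
        want[node.host .. ":" .. node.port] = node
    end

    for _, target in ipairs(checker:get_target_list()) do
        local key = target.ip .. ":" .. target.port
        if want[key] then
            want[key] = nil  -- already tracked: health status is preserved
        else
            -- Use the stored target.hostname (not checks.active.host)
            -- so the correct target is matched and removed.
            checker:remove_target(target.ip, target.port, target.hostname)
        end
    end

    -- Whatever is left in `want` is new; add it (initially healthy).
    for _, node in pairs(want) do
        checker:add_target(node.host, node.port, node.hostname, true)
    end
end
```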
Key changes in healthcheck_manager.lua:
- Add `update_checker_targets()`: incrementally adds new targets and removes stale ones
- Add `checks_config_equal()`: compares `checks` configs to decide between incremental update and full rebuild
- Fix the `remove_target()` hostname: use the stored `target.hostname` instead of `checks.active.host`
- Save the `checks` config in the working pool for later comparison
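For reference, `checks_config_equal()` could be as simple as a recursive deep compare of the two `checks` tables (a sketch of the proposed helper, not existing code):

```lua
-- Deep-compare two plain Lua tables; a full checker rebuild is needed
-- only when this returns false for the old and new `checks` configs.
local function checks_config_equal(a, b)
    if a == b then
        return true
    end
    if type(a) ~= "table" or type(b) ~= "table" then
        return false
    end
    -- Every key in `a` must match in `b`...
    for k, v in pairs(a) do
        if not checks_config_equal(v, b[k]) then
            return false
        end
    end
    -- ...and `b` must not have extra keys.
    for k in pairs(b) do
        if a[k] == nil then
            return false
        end
    end
    return true
end
```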
Expected Behavior
- Existing nodes should retain their health status (healthy/unhealthy) when nodes are added/removed
- Removed nodes should be properly cleaned up from the health checker
- Health checking should not have gaps during node changes
Error Logs
No response
Steps to Reproduce
- Create a route with health check enabled and multiple upstream nodes
- Wait for one node to be detected as unhealthy
- Add a new node to the upstream (or trigger service discovery update)
- Observe that the previously unhealthy node resets to healthy
- Check health checker target list — removed nodes may still be present
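For step 1, a route with active health checking can be created through the Admin API, along these lines (addresses, ports, and the admin key are placeholders for illustration):

```shell
curl -s http://127.0.0.1:9180/apisix/admin/routes/1 \
  -H "X-API-KEY: ${ADMIN_KEY}" -X PUT -d '
{
  "uri": "/hello",
  "upstream": {
    "type": "roundrobin",
    "nodes": { "10.0.0.1:8080": 1, "10.0.0.2:8080": 1 },
    "checks": {
      "active": {
        "http_path": "/status",
        "healthy":   { "interval": 2, "successes": 1 },
        "unhealthy": { "interval": 1, "http_failures": 2 }
      }
    }
  }
}'
```

Stopping one of the two nodes then triggers step 2; patching the upstream with a third node triggers step 3.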
Environment
APISIX version: 3.16.0
lua-resty-healthcheck-api7: 3.2.1-0