
bug: ai-proxy-multi health checker creation always fails in timer context due to missing _dns_value #13101

@Baoyuantop

Description


The ai-proxy-multi plugin's health check mechanism has a structural bug: the construct_upstream function is called from both request context and timer context (healthcheck_manager.timer_create_checker), but only works correctly in request context.

In timer context, construct_upstream always returns nil because the _dns_value runtime field does not exist on instance configs read from etcd, causing health checker creation to permanently fail.

Current Behavior

  1. When a request hits pick_target, resolve_endpoint is called which sets instance._dns_value (a runtime-only field on the in-memory config object).
  2. fetch_checker is called, which returns nil (checker not yet created) and adds the resource to waiting_pool.
  3. The timer_create_checker timer fires, reads config from etcd via resource.fetch_latest_conf, extracts the instance config via jsonpath, and calls plugin.construct_upstream(instance_config).
  4. construct_upstream checks instance._dns_value; this field does not exist on configs read from etcd (it is only set in request context by resolve_endpoint).
  5. construct_upstream returns nil, so create_checker is never called.
  6. The resource is removed from waiting_pool (waiting_pool[resource_path] = nil at line 211), so it will never be retried.
  7. Subsequent calls to fetch_checker from request context see the resource is neither in working_pool nor waiting_pool, so it gets re-added to waiting_pool — but the same cycle repeats on the next timer tick.

Net effect: Health checkers are never successfully created through the timer path. Unhealthy instances are never filtered out by the load balancer.
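The failure mode in steps 3–5 can be reduced to a minimal sketch (simplified from the actual plugin code; field names follow the excerpts below):

```lua
-- Simplified construct_upstream: it only succeeds if the runtime-only
-- _dns_value field is present on the instance config.
local function construct_upstream(instance)
    local node = instance._dns_value
    if not node then
        return nil, "failed to resolve endpoint for instance: " .. instance.name
    end
    return { nodes = { node } }
end

-- Request context: resolve_endpoint has populated the runtime field,
-- so upstream construction succeeds.
local in_memory = {
    name = "openai-1",
    _dns_value = { host = "api.openai.com", port = 443 },
}
assert(construct_upstream(in_memory) ~= nil)

-- Timer context: the same instance as deserialized from etcd carries
-- only static fields, so construct_upstream always returns nil and
-- checker creation is skipped.
local from_etcd = { name = "openai-1" }
local upstream, err = construct_upstream(from_etcd)
assert(upstream == nil and err ~= nil)
```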

Expected Behavior

construct_upstream should be able to compute the upstream node info from the instance's static configuration (endpoint URL or provider defaults) without relying on _dns_value, so that health checkers can be created successfully in timer context.

Code References

construct_upstream requiring _dns_value (ai-proxy-multi.lua#L302-L306):

function _M.construct_upstream(instance)
    local upstream = {}
    local node = instance._dns_value
    if not node then
        return nil, "failed to resolve endpoint for instance: " .. instance.name
    end

_dns_value is only set in request context by resolve_endpoint (ai-proxy-multi.lua#L215):

instance_conf._dns_value = new_node

Timer calls construct_upstream with etcd config (no _dns_value) (healthcheck_manager.lua#L165-L179):

local res_conf = resource.fetch_latest_conf(resource_path)
-- ...
local upstream_constructor_config = jp.value(res_conf.value, json_path)
upstream = plugin.construct_upstream(upstream_constructor_config)  -- _dns_value missing

Resource permanently removed from waiting_pool after failure (healthcheck_manager.lua#L201-L211):

local checker = create_checker(upstream)  -- upstream is nil, so checker is nil
if not checker then
    goto continue                          -- skips add_working_pool
end
-- ...
::continue::
waiting_pool[resource_path] = nil          -- permanently removed

Suggested Fix Direction

Add a fallback in construct_upstream that computes the node from static config (endpoint URL or provider's default host/port) when _dns_value is not available. The existing resolve_endpoint function already contains this logic — it can be extracted into a pure function like calculate_dns_node(instance_conf) that returns {host, port, scheme} without modifying the input.

Important: Any fix must preserve the ai_driver.get_node() interface used by providers like vertex-ai that compute host dynamically (e.g., based on region).
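A rough sketch of what such a fallback could look like. The function name follows the suggestion above, but the field layout (`override.endpoint`) and the driver integration are assumptions for illustration, not the actual APISIX implementation:

```lua
-- Hypothetical fallback: compute {host, port, scheme} from static config
-- without mutating the input, so it works on etcd snapshots in timer context.
local function calculate_dns_node(instance_conf, ai_driver)
    -- 1. Prefer a provider driver that computes its node dynamically
    --    (e.g. vertex-ai deriving the host from a region), preserving
    --    the ai_driver.get_node() interface.
    if ai_driver and ai_driver.get_node then
        return ai_driver.get_node(instance_conf)
    end

    -- 2. Fall back to an explicitly configured endpoint URL
    --    (assumed to live at override.endpoint).
    local endpoint = instance_conf.override and instance_conf.override.endpoint
    if endpoint then
        local scheme, host, port = endpoint:match("^(https?)://([^:/]+):?(%d*)")
        if host then
            port = tonumber(port) or (scheme == "https" and 443 or 80)
            return { host = host, port = port, scheme = scheme }
        end
    end

    return nil, "no static endpoint information for instance: "
                .. (instance_conf.name or "unknown")
end

-- Example: a config as read from etcd, with no _dns_value set.
local conf = { name = "openai-1",
               override = { endpoint = "https://api.openai.com/v1" } }
local node = calculate_dns_node(conf)
assert(node and node.host == "api.openai.com" and node.port == 443)
```

construct_upstream could then use `instance._dns_value or calculate_dns_node(instance)`, keeping the fast path in request context while letting the timer path succeed from static config alone.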

Environment

  • APISIX version: master (current as of 2026-03-18)
  • Affects all deployment modes where ai-proxy-multi is used with health checks enabled

Context

This issue was identified during analysis of PR #12968, which attempts to fix this problem but has additional issues (removes get_node support, couples to resty.healthcheck SHM internals, includes unrelated changes). This issue is filed to track the core bug independently.

Metadata



    Labels

    bug (Something isn't working), plugin


    Projects

    Status

    📋 Backlog


