Fix: Preserve Gateway's independent upstream health check state when receiving config updates from Admin #6274

guanzhenxing · 2026-01-19T08:28:13Z

Problem Description

When the Admin side health check detects an upstream as unhealthy and publishes a configuration update with status=false, the Gateway side completely removes that upstream from both healthyUpstream and unhealthyUpstream maps. This causes the Gateway to lose track of the upstream, preventing its independent health check from recovering the upstream when it becomes healthy again.

Error Manifestation

divide upstream configuration error
CANNOT_FIND_HEALTHY_UPSTREAM_URL

Root Cause

ShenYu has two independent health check systems:

Admin side: Checks upstream health and publishes configuration updates
Gateway side: Independently runs health checks and maintains its own health state

The issue occurs when:

Admin's health check marks an upstream as unhealthy (status=false)
Admin publishes a configuration update
Gateway receives the update via DivideUpstreamDataHandler
UpstreamCacheManager.submit() processes status=false upstreams
Bug: Original code calls triggerRemoveOne() which removes the upstream from BOTH healthy and unhealthy maps
Result: Gateway loses all tracking of this upstream - even if it recovers, Gateway won't know
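
The sequence above can be reproduced with a small model. This is only a sketch under stated assumptions: `TrackingLossSketch`, its two maps, and `recoveryCandidates()` are hypothetical stand-ins, not ShenYu's actual classes. It assumes the recovery probe only visits URLs that are still present in the unhealthy map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the Gateway's health state; not ShenYu's real classes.
class TrackingLossSketch {
    final Map<String, List<String>> healthy = new ConcurrentHashMap<>();
    final Map<String, List<String>> unhealthy = new ConcurrentHashMap<>();

    // Mirrors the described old behaviour: drop the URL from both maps.
    void removeFromBoth(String selectorId, String url) {
        removeUrl(healthy, selectorId, url);
        removeUrl(unhealthy, selectorId, url);
    }

    // URLs a periodic recovery probe would visit for this selector (assumption).
    List<String> recoveryCandidates(String selectorId) {
        return new ArrayList<>(unhealthy.getOrDefault(selectorId, new ArrayList<>()));
    }

    private static void removeUrl(Map<String, List<String>> map, String selectorId, String url) {
        List<String> urls = map.get(selectorId);
        if (urls != null) {
            urls.remove(url);
        }
    }

    public static void main(String[] args) {
        TrackingLossSketch state = new TrackingLossSketch();
        state.unhealthy.computeIfAbsent("selector-1", k -> new ArrayList<>()).add("10.0.0.1:8080");
        state.removeFromBoth("selector-1", "10.0.0.1:8080");
        // Nothing is left for the probe to re-check, so recovery is never noticed.
        System.out.println("candidates=" + state.recoveryCandidates("selector-1"));
    }
}
```

Once the URL is gone from both maps, the model has no record that it ever existed, which is the tracking loss described in the last step.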

Solution

Design Principle

Gateway's health check state should be independent of Admin's configuration updates

Core Logic Change

Before:
status=false → triggerRemoveOne() → completely removed from both maps

After:
status=false AND healthCheckEnabled=true → preserve in unhealthy map → continue health checking
status=false AND healthCheckEnabled=false → remove (no monitoring needed)
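
The rule above can be written as a small decision function. This is a minimal sketch: the `Action` enum and `decide()` are illustrative names that do not appear in the PR, and the real `submit()` may consider more state than these two flags.

```java
// Hypothetical sketch of the before/after rule; names are illustrative only.
class OfflineDecisionSketch {
    enum Action { PROCESS_AS_VALID, KEEP_IN_UNHEALTHY, REMOVE }

    static Action decide(boolean status, boolean healthCheckEnabled) {
        if (status) {
            // status=true upstreams follow the separate "valid" path.
            return Action.PROCESS_AS_VALID;
        }
        // status=false: keep monitoring only when the Gateway health check is on.
        return healthCheckEnabled ? Action.KEEP_IN_UNHEALTHY : Action.REMOVE;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, true));
        System.out.println(decide(false, false));
    }
}
```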

Changes

UpstreamCacheManager.java

Refactored submit() method

Extracted logic into smaller, focused methods for better maintainability
Fixed ConcurrentModificationException by creating ArrayList copy before iteration
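
For readers unfamiliar with the copy-before-iteration pattern, here is a generic sketch. It is not the code from `submit()`; `removeOffline()` and the string URLs are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Generic illustration of iterating a snapshot while mutating the live list.
class SnapshotIterationSketch {
    // Removes every URL in `live` that also appears in `offline`.
    // The loop walks a copy, so calling live.remove() inside it is safe.
    static void removeOffline(List<String> live, List<String> offline) {
        for (String url : new ArrayList<>(live)) {
            if (offline.contains(url)) {
                live.remove(url);
            }
        }
    }

    public static void main(String[] args) {
        List<String> live = new ArrayList<>(Arrays.asList("a:80", "b:80", "c:80"));
        removeOffline(live, Arrays.asList("a:80", "b:80"));
        System.out.println(live);
    }
}
```

Iterating `live` directly with the same loop body would typically throw `ConcurrentModificationException`, because the `ArrayList` iterator detects the structural change on its next step.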

New method: processOfflineUpstreams()

Handles upstreams with status=false:

  // If upstream was previously in unhealthy map AND health check is enabled:
  //   → Keep it in unhealthy map for continued monitoring
  // If upstream was not previously unhealthy OR health check is disabled:
  //   → Remove it (no monitoring needed)

New method: processValidUpstreams()

Handles upstreams with status=true:

Checks if upstream was previously in unhealthyUpstream map
If yes, preserves the unhealthy state instead of forcing it to healthy
This allows Gateway's health check to recover it naturally
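
The preservation step for status=true upstreams might look roughly like the following. This is a sketch under assumptions: `PreserveUnhealthySketch` and `route()` are hypothetical, upstreams are reduced to URL strings, and the set of currently unhealthy URLs is assumed to come from something like getCurrentUnhealthyMap().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: route Admin's "valid" upstreams without overriding
// the Gateway's own unhealthy verdict.
class PreserveUnhealthySketch {
    final List<String> healthy = new ArrayList<>();
    final List<String> unhealthy = new ArrayList<>();

    static PreserveUnhealthySketch route(List<String> validFromAdmin, Set<String> currentlyUnhealthy) {
        PreserveUnhealthySketch result = new PreserveUnhealthySketch();
        for (String url : validFromAdmin) {
            if (currentlyUnhealthy.contains(url)) {
                // Gateway still considers it down; leave recovery to the health check.
                result.unhealthy.add(url);
            } else {
                result.healthy.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> unhealthyNow = new HashSet<>(Arrays.asList("10.0.0.2:8080"));
        PreserveUnhealthySketch r = route(Arrays.asList("10.0.0.1:8080", "10.0.0.2:8080"), unhealthyNow);
        System.out.println("healthy=" + r.healthy + " unhealthy=" + r.unhealthy);
    }
}
```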

New method: getCurrentUnhealthyMap()

Helper method to get current unhealthy upstreams for state preservation

UpstreamCheckTask.java

Made putToMap() and removeFromMap() public

These methods were private but are needed by UpstreamCacheManager
Making them public lets UpstreamCacheManager preserve unhealthy state across configuration updates

Testing

Added 9 comprehensive tests to verify the fix:

UpstreamCacheManagerTest (4 new tests)

testSubmitWithStatusFalsePreservesUnhealthyState: Verifies upstreams with status=false that were previously unhealthy remain in unhealthy map
testSubmitWithNewOfflineUpstreamAddedToUnhealthy: Verifies new upstreams with status=false are added to unhealthy map for monitoring
testSubmitPreservesUnhealthyForValidUpstream: Verifies valid upstreams (status=true) that were previously unhealthy remain in unhealthy map
testSubmitWithHealthCheckDisabledAndStatusFalse: Verifies upstreams with healthCheckEnabled=false are removed, not added to unhealthy map

UpstreamCheckTaskTest (5 new tests)

testPutToMap: Tests adding upstreams to healthy map
testPutToMapUnhealthy: Tests adding upstreams to unhealthy map
testRemoveFromMap: Tests removing upstreams from healthy map
testRemoveFromMapUnhealthy: Tests removing upstreams from unhealthy map
testMoveUpstreamBetweenMaps: Tests moving upstreams between healthy and unhealthy maps

Test Results

Tests run: 19, Failures: 0, Errors: 0, Skipped: 0
BUILD SUCCESS

Impact

Before Fix

Gateway loses unhealthy upstream tracking when Admin publishes updates
Recovered upstreams cannot be detected by Gateway
Results in CANNOT_FIND_HEALTHY_UPSTREAM_URL errors

After Fix

Gateway preserves its independent health check state
Unhealthy upstreams continue to be monitored even after Admin updates
Gateway can automatically recover upstreams when they become healthy
No manual intervention required

Commits

8a0f9e9 - Fix: Preserve unhealthy upstream state when receiving config updates from admin
78822b4 - Test: Add tests for upstream unhealthy state preservation

…from admin

When admin publishes configuration updates with upstreams marked as status=false, the gateway should preserve their unhealthy state and continue health checking instead of completely removing them. This allows the gateway's independent health check to recover upstreams when they become healthy.

Changes:
- UpstreamCacheManager: Refactored submit() method to preserve unhealthy state for both status=true and status=false upstreams
- Added processOfflineUpstreams() to handle status=false upstreams with health check enabled, keeping them in unhealthy map for monitoring
- Added processValidUpstreams() to check if valid upstreams were previously unhealthy and preserve that status
- UpstreamCheckTask: Made removeFromMap() public to support state preservation

Co-Authored-By: Claude <noreply@anthropic.com>

Add comprehensive tests to verify the fix for preserving unhealthy upstream state when receiving config updates from admin.

UpstreamCacheManagerTest:
- testSubmitWithStatusFalsePreservesUnhealthyState: Verify that upstreams with status=false that were previously unhealthy remain in unhealthy map
- testSubmitWithNewOfflineUpstreamAddedToUnhealthy: Verify new upstreams with status=false are added to unhealthy map for monitoring
- testSubmitPreservesUnhealthyForValidUpstream: Verify valid upstreams that were previously unhealthy remain in unhealthy map
- testSubmitWithHealthCheckDisabledAndStatusFalse: Verify upstreams with healthCheckEnabled=false are removed, not added to unhealthy map

UpstreamCheckTaskTest:
- testPutToMap: Test adding upstreams to healthy map
- testPutToMapUnhealthy: Test adding upstreams to unhealthy map
- testRemoveFromMap: Test removing upstreams from healthy map
- testRemoveFromMapUnhealthy: Test removing upstreams from unhealthy map
- testMoveUpstreamBetweenMaps: Test moving upstreams between maps

Co-Authored-By: Claude <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Copilot · 2026-01-20T05:07:49Z

...u-loadbalancer/src/test/java/org/apache/shenyu/loadbalancer/cache/UpstreamCheckTaskTest.java

+        assertTrue(!healthCheckTask.getUnhealthyUpstream().containsKey(selectorId)
+                || healthCheckTask.getUnhealthyUpstream().get(selectorId).isEmpty());


The assertion logic uses double negation which reduces code readability. Instead of checking !healthCheckTask.getUnhealthyUpstream().containsKey(selectorId) || healthCheckTask.getUnhealthyUpstream().get(selectorId).isEmpty(), consider using a more straightforward assertion that directly validates the absence of upstreams in the unhealthy map.
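
One way to act on this suggestion is a small helper that treats a missing key and an empty list the same way. This is a sketch, not the change that was committed: `hasNoEntries()` is a hypothetical name, and it assumes the getters return a `Map` from selector id to a `List`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: true when the key is missing or maps to an empty list.
class AbsenceAssertionSketch {
    static <K, V> boolean hasNoEntries(Map<K, List<V>> map, K key) {
        return map.getOrDefault(key, Collections.emptyList()).isEmpty();
    }

    public static void main(String[] args) {
        Map<String, List<String>> unhealthy = new HashMap<>();
        System.out.println(hasNoEntries(unhealthy, "selector-1"));
        unhealthy.put("selector-1", new ArrayList<>());
        System.out.println(hasNoEntries(unhealthy, "selector-1"));
        unhealthy.get("selector-1").add("10.0.0.1:8080");
        System.out.println(hasNoEntries(unhealthy, "selector-1"));
    }
}
```

Under that assumption, the test line could read `assertTrue(hasNoEntries(healthCheckTask.getUnhealthyUpstream(), selectorId));`, which states the intent without the negated `containsKey` check.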

Copilot · 2026-01-20T05:07:50Z

...u-loadbalancer/src/test/java/org/apache/shenyu/loadbalancer/cache/UpstreamCheckTaskTest.java

+        assertTrue(!healthCheckTask.getHealthyUpstream().containsKey(selectorId)
+                || healthCheckTask.getHealthyUpstream().get(selectorId).isEmpty());


The assertion logic uses double negation which reduces code readability. Instead of checking !healthCheckTask.getHealthyUpstream().containsKey(selectorId) || healthCheckTask.getHealthyUpstream().get(selectorId).isEmpty(), consider using a more straightforward assertion that directly validates the absence of upstreams in the healthy map.

Copilot · 2026-01-20T05:07:50Z

...u-loadbalancer/src/test/java/org/apache/shenyu/loadbalancer/cache/UpstreamCheckTaskTest.java

+        assertTrue(!healthCheckTask.getUnhealthyUpstream().containsKey(selectorId)
+                || healthCheckTask.getUnhealthyUpstream().get(selectorId).isEmpty());


The assertion logic uses double negation which reduces code readability. Instead of checking !healthCheckTask.getUnhealthyUpstream().containsKey(selectorId) || healthCheckTask.getUnhealthyUpstream().get(selectorId).isEmpty(), consider using a more straightforward assertion that directly validates the absence of upstreams in the unhealthy map.

guanzhenxing and others added 3 commits January 19, 2026 14:52

Merge branch 'master' into fix/fix_upstream_bug

053303e

Aias00 requested a review from Copilot January 19, 2026 10:09

Copilot AI reviewed Jan 19, 2026

Copilot started reviewing on behalf of Aias00 January 19, 2026 11:06

Aias00 requested a review from Copilot January 20, 2026 04:59

Copilot started reviewing on behalf of Aias00 January 20, 2026 04:59

Copilot AI reviewed Jan 20, 2026

Update shenyu-loadbalancer/src/test/java/org/apache/shenyu/loadbalanc…

2b3d002

…er/cache/UpstreamCheckTaskTest.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
